L-smooth & Lipschitz 连续性假设 • Jeefy's Blog

3 min read 785 words Updated Apr 25, 2026 Created Jun 17, 2026

根据神秘言论：L-Lipschitz smooth 和 Lipschitz 连续性是凸优化与非凸优化理论中最基础、最常用的正则性假设之一。

定义（集合上的局部 Lipschitz）：$f$ 在 $\mathcal{X}$ 上局部 Lipschitz，则：

$$$$

等价形式

$|f(x) - f(y)| \le L_0 \|x-y\|$
$\| \nabla f(x) \| \le L_0$
$f(x) \le f(y) + L_0 \| x - y\|$

定义：对于 $L > 0$，如果 $f: {\mathbb E} \to {\mathbb R}$ 是 L-smooth，则有：

$$\|\nabla f(x) - \nabla f(y) \| \le L \| x - y\| $$

其中 $\|\cdot\|_*$ 表示 $\mathbb E$ 的对偶空间的 L2 范数。

这里 L-smooth 可以看作其梯度上的 Lipschitz 连续性假设，并且是全局的

形式	表达式	适用场景
梯度 Lipschitz	$\\|\nabla f(x) - \nabla f(y)\\| \leq L \\|x-y\\|$	原始定义，最通用
Hessian 有界	$\\|\nabla^2 f(x)\\|_2 \leq L,\ \forall x$	二阶可微时等价，直观表示“最大曲率”
二次上界	$f(y) \leq f(x) + \langle \nabla f(x), y-x \rangle + \frac{L}{2}\\|y-x\\|^2$	优化证明中最常用，直接给出函数值的上界

$\Delta$ L-smooth $\implies$ 局部 Lipschitz 连续

$$\begin{aligned} \| \nabla f(x) \| &= \| \nabla f(x) - \nabla f(x_0) + \nabla f(x_0) \| \\ & \le \| \nabla f(x) - \nabla f(x_0) \| + \| \nabla f(x_0) \| \\ & \le L \| x - x_0 \| + \| \nabla f(x_0) \| \\ & \le L \delta + \| \nabla f(x_0) \| \end{aligned} $$

于是限制在足够小的邻域内，可以认为 $\| \nabla f(x_0) \| = L_0$，所以可以估计：
$\max L_x \le L \| x - x_0 \| + L_{x_0}$

则，$\| \nabla f(x) \|$ 有界，已经可以得到 $L_0$ 是存在的，Lipschitz 连续性已经成立了。继续推导原形式，于是有：

$$h(t) = f(x_0 + t(x - x_0)) \implies \frac {dh(t)}{dt} = \langle\nabla f(x_0 + t(x - x_0)), (x - x_0) \rangle $$

则：

$$ \begin{aligned} f(x) - f(x_0) &= \int_0^1 h(t) dt = h(\xi) = \langle \nabla f(x_\xi), x - x_0 \rangle \\ & \le \| \nabla f(x_\xi) \| \cdot \|x - x_0 \| \\ & \le L_0 \| x- x_0\| \end{aligned} \tag {L.1} $$

二次上界

内积是标量乘法在向量空间的自然延伸，所以采用内积代替：

$$f(y) = f(x) + f'(x)(y - x) + \frac {f'(x + \theta(y - x))} {2} (y - x)^2, \theta \in[0, 1] $$

也就是：

$$f(y) = f(x) + \langle \nabla f(x) , y - x \rangle + \frac 1 2 (y - x)^T \nabla^2 f(\xi) (y - x) $$

为什么是 $(y - x)^T \nabla^2 f(\xi) (y - x)$ 就要涉及到求导法则和 Hessian 矩阵的相关知识了：矩阵、多元求导

考虑到 Hessian 矩阵可以认为一定是实对称的，所以有：

考虑到 $Q$ 不改变范数，也就是 $\|v_i\| = \|u_i\|$，那么根据 $\|H\|_2$ 谱范数的定义：$\|H\|_2 = \sup \frac {\|Hx\|_2}{\|x\|_x} = \sqrt{\lambda_{max}(H^T H)}$

$$(y - x)^T \nabla^2 f(\xi) (y - x) \le \|\nabla^2 f(\xi) \| \cdot \|y - x\|^2 $$

利用梯度不等式求导后的结论，那么有：

$$f(y) \leq f(x) + \langle \nabla f(x), y-x \rangle + \frac{L}{2}\|y-x\|^2 $$

当然我们可以深入一下这个展开的部分的推导矩阵、多元求导#^388d24

然后。。。。。

我们继续深入一下放缩的方法。

对于 $\langle \nabla f({\bf x_0}), {\bf h} \rangle$，利用柯西不等式，则有 $|\langle \nabla f({\bf x_0}), {\bf h} \rangle| \le \| \nabla f({\bf x_0}) \| \| {\bf h} \|$

柯西不等式可以扩展到任意内积运算上，例如：
$\langle f, g \rangle = \int_a^b f(x) g(x) dx$ 有 $\left( \int_a^b f(x)g(x) dx \right)^2 \le \left( \int_a^b f(x)^2 dx \right) \left( \int_a^b g(x)^2 dx \right)$
$\langle A, B \rangle = \text{tr}(A^T B)$ 有 $|\text{tr}(A^T B)| \le \|A\|_F \|B\|_F$

对于 $\frac 1 2 {\bf h}^T \nabla^2 f({\bf x_0} + \theta {\bf h}) {\bf h}$，由于知道 $\nabla^2$ 是对称矩阵：

$$\nabla^2 f({\bf x}) = \begin{bmatrix} \frac {\partial^2 f}{ \partial x_1 \partial x_1} & \frac {\partial^2 f}{ \partial x_1 \partial x_2} & \cdots & \frac {\partial^2 f}{ \partial x_1 \partial x_n} \\ \frac {\partial^2 f}{ \partial x_2 \partial x_1} & \frac {\partial^2 f}{ \partial x_2 \partial x_2} & \cdots & \frac {\partial^2 f}{ \partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac {\partial^2 f}{ \partial x_n \partial x_1} & \frac {\partial^2 f}{ \partial x_n \partial x_2} & \cdots & \frac {\partial^2 f}{ \partial x_n \partial x_n} \end{bmatrix} $$

根据 $f$ 的连续性（可能会有特殊情况，不考虑了），有：

$$\frac {\partial^2 f}{ \partial x_i \partial x_j} = \frac {\partial^2 f}{ \partial x_j \partial x_i} $$

于是这就对称了。

我们知道，对于一个对称矩阵，必定能对角化，所以可以转化为 $Q \Lambda Q^T$ 的形式。就有了上文的证明。

至于为什么，这就问线性代数去吧

梯度与极值

在模型中的重要估计，考虑最小化展开的上界，考虑：

$$\nabla_y \left(f(x) + \langle \nabla f(x), y-x \rangle + \frac{L}{2}\|y-x\|^2 \right) = 0 \implies \nabla f(x) + L(y - x) $$

于是取 $y = x - \frac 1 L \nabla f(x)$ 时有最小值：

$$\begin{aligned} f(y) &\le f(x) + \nabla f(x)^T (-\frac 1 L \nabla f(x)) + \frac L 2 \| \frac 1 L \nabla f(x) \|^2 \\ &\le f(x) - \frac 1 {2L} \| \nabla f(x) \|^2 \end{aligned} $$

取 $x^*$ 表示最优解，那么有：

$$f(x^*) \le f(y) \le f(x) - \frac 1 {2L} \| \nabla f(x) \|^2 \implies f(x) - f(x^*) \ge \frac 1 {2L} \| \nabla f(x) \|^2 $$

也就是说，梯度越大，那么里最小值就越远。