概述

梯度下降方法(GradientDescent)是无约束优化问题中常用的优化方法，也是BP神经网络中比较核心的优化方法，我们用几个例子来测试下梯度下降方法的效果。

理论基础

算法思路

首先我们先回顾一下梯度下降方法的优化思路。其实现算法如下：
给定初始点$x \in domf$
重复进行
1.$\triangle x:= -\nabla f(x)$
2.直线搜索，经过精确或回溯直线搜索确定步长$t$.
3.修改,$x:=x+t\triangle x$
直到满足停止准则
停止准则通常采用$||\nabla f(x)||_2\leqslant\eta$，$\eta$为一个小正数。

步长选择

在实践的过程中步长的选择主要采用非精确的搜索方法，沿射线近似的优化$f$确定步长，其中简单有效的回溯方法描述如下：
给定 $f$在$x \in domf$处的下降方向$ \triangle x$,参数$\alpha \in (0,0.5)$，参数$\beta \in (0,1)$
如果$f(x+\triangle x)>f(x)+\alpha t\nabla f(x)^T\triangle x$，令$t:=\beta t$
直到满足停止条件，只要$t$足够小，则一定会停止搜索。

收敛性

可以证明的是，在精准直线搜索方法下，梯度下降方法最多经过：

$\frac{log((f(x_{(0)})-p^*)/\varepsilon)}{log(1/c)}$

次迭代一定可以得到

$f(x^{(k)})-p^* \leqslant \varepsilon$

其中：

$log((f(x_{(0)})-p^*)/\varepsilon)$

可以理解为初始次优性($f(x^{(0)})-p^*$)和最终词优性($\leqslant\varepsilon$)的比值。

代码实现

采用python实现，并进行可视化，代码如下：

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# aim function
def aimFun(x,y):
    return (x**2+3*y**2)/2

#differential of x
def pxAim(x,y):
    return x

#differential of y
def pyAim(x,y):
    return 3*y

#creat the image of aimFun
fig=plt.figure()
ax=Axes3D(fig)
X,Y=np.mgrid[-2:2:40j,-2:2:40j]
Z=aimFun(X,Y)
ax.plot_surface(X,Y,Z,rstride=1,cstride=1,cmap="rainbow")
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')

# backward_propagation
step=0.001
x=2
y=2
tag_x=[x]
tag_y=[y]
tag_z=[aimFun(x,y)]
new_x=x
new_y=y
Over=False
while Over==False:
    new_x-=step*pxAim(x,y)
    new_y-=step*pyAim(x,y)
    if aimFun(x,y)-aimFun(new_x,new_y)<7e-9:
        Over=True
    x=new_x
    y=new_y
    tag_x.append(x)
    tag_y.append(y)
    tag_z.append(aimFun(x,y))

#visualize the process of iteration
ax.plot(tag_x,tag_y,tag_z,'r.')
plt.title('resule:('+str(x)+","+str(y)+')')
plt.show()

此处所采用的步长不是采用回溯方法得到的，而是采用了固定步长的方法，此方法借鉴了Ng在深度学习中的步长选择方法，他的代码如下：

def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients, output of n_model_backward
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
                  parameters['W' + str(i)] = ... 
                  parameters['b' + str(i)] = ...
    """
    
    L = len(parameters) // 2 # number of layers in the neural networks

    # Update rule for each parameter
    for k in range(L):
        parameters["W" + str(k+1)] = parameters["W" + str(k+1)] - learning_rate * grads["dW" + str(k+1)]
        parameters["b" + str(k+1)] = parameters["b" + str(k+1)] - learning_rate * grads["db" + str(k+1)]
        
    return parameters

效果测试

test1

首先选择《Convex-Optimization》中的第一个例子作为测试对象，其目标函数如下：

$f(x)=\frac{1}{2}(x_1^2+\gamma x_2^2)$

取$\gamma=3$,固定步长为0.001，迭代效果如下：
$test1p1$
观察其迭代路径如下：
$test1p2$
效果尚可。

test2

第二个测试函数为非二次型的函数，如下：

$f(x_1,x_2)=e^{x_1+3x_2-0.1}+e^{x_1-3x_2-0.1}+e^{-x_1-0.1}$

初始点选择为(1,-1),步长为0.001，迭代效果如下：
$test1p1$
观察其迭代路径如下：
$test1p2$
相关函数及偏导函数代码如下：

# aim function
def aimFun(x,y):
    return np.exp(x+3*y-0.1)+np.exp(x-3*y-0.1)+np.exp(-x-0.1)

#differential of x
def pxAim(x,y):
    return np.exp(x+3*y-0.1)+np.exp(x-3*y-0.1)-np.exp(-x-0.1)

#differential of y
def pyAim(x,y):
    return 3*np.exp(x+3*y-0.1)-3*np.exp(x-3*y-0.1)

小结

由于测试函数是从《Convex-Optimization》中选择的，所以函数都满足凸函数，顾迭代结果接近最优解。但是需要注意的是非凸函数使用梯度下降时，应考虑陷入局部最小值的情况。
在进行测试的时候还出现了一点问题，就是在可视化第二个测试函数的时候，一开始函数图像不太正常，之后利用Matlab进行函数图像的绘制后发现python画出的图像并不正确，最后发现错误原因是原代码中的

1	X,Y=np.mgrid[-2:2:40j,-2:2:40j]

被修改为

1	X,Y=np.mgrid[-5:5,-5:5]

由于默认间隔过大所以无法正常显示图像，在修改间隔为如下：

1	X,Y=np.mgrid[-2:2:0.05,-2:2:0.05]

之后图像正常显示。