Learning Andrew Ng's Machine Learning with the Regret Method [2]: Gradient Descent for Linear Regression

All English passages below are taken from the notes that accompany the course.

Lesson 9: Cost Function II

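This lesson considers a cost function described by two parameters. The cost function is then bowl-shaped, and the bottom of the bowl is its minimum. If the bowl is drawn as a contour plot, the center of the contours is the minimum of the cost function. So the (θ0, θ1) pairs on contours nearer the center have a lower cost and fit the original data more accurately.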
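To make the bowl concrete, here is a minimal sketch that evaluates the cost at three (θ0, θ1) pairs. It assumes the squared-error cost from the earlier lessons, which is not restated in this section; the function name `cost` and the toy data are hypothetical.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Hypothetical toy data lying exactly on y = 1 + 2x, so the minimum is at (1, 2).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

at_minimum = cost(1.0, 2.0, xs, ys)   # center of the contours
near = cost(1.2, 2.1, xs, ys)         # a point on an inner contour
far = cost(4.0, -1.0, xs, ys)         # a point on an outer contour
print(at_minimum, near, far)
```

For these three points the cost grows as the pair moves onto contours farther from the center.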

Cost Function - Intuition II
A contour plot is a graph that contains many contour lines. A contour line of a two variable
function has a constant value at all points of the same line. An example of such a graph is the
one to the right below.

Taking any color and going along the 'circle', one would expect to get the same value of the
cost function. For example, the three green points found on the green line above have the
same value for J(θ0, θ1) and as a result, they are found along the same line. The circled x
displays the value of the cost function for the graph on the left when θ0 = 800 and θ1 = -0.15.

Taking another h(x) and plotting its contour plot, one gets the following graphs:

When θ0 = 360 and θ1 = 0, the value of J(θ0, θ1) in the contour plot gets closer to the center
thus reducing the cost function error. Now giving our hypothesis function a slightly positive
slope results in a better fit of the data.

The graph above minimizes the cost function as much as possible; consequently, the
resulting values of θ1 and θ0 are around 0.12 and 250 respectively. Plotting those values on our
graph to the right seems to put our point in the center of the innermost 'circle'.

Lesson 10: Gradient Descent

Gradient Descent

Gradient descent is used to find the parameters of the hypothesis function.

So we have our hypothesis function and we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.

Imagine that we graph our hypothesis function based on its fields θ0 and θ1 (actually we are
graphing the cost function as a function of the parameter estimates). We are not graphing x
and y itself, but the parameter range of our hypothesis function and the cost resulting from
selecting a particular set of parameters.
We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The
points on our graph will be the result of the cost function using our hypothesis with those
specific theta parameters. The graph below depicts such a setup.

We will know that we have succeeded when our cost function is at the very bottom of the pits
in our graph, i.e. when its value is the minimum. The red arrows show the minimum points in
the graph.
The way we do this is by taking the derivative (the tangential line to a function) of our cost
function. The slope of the tangent is the derivative at that point and it will give us a direction
to move towards. We make steps down the cost function in the direction with the steepest
descent. The size of each step is determined by the parameter α, which is called the learning
rate.
For example, the distance between each 'star' in the graph above represents a step
determined by our parameter α. A smaller α would result in a smaller step and a larger α
results in a larger step. The direction in which the step is taken is determined by the partial
derivative of J(θ0, θ1). Depending on where one starts on the graph, one could end up at
different points. The image above shows us two different starting points that end up in two
different places.
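The dependence on the starting point can be sketched with a one-parameter example. The cost below is a hypothetical non-convex function chosen only because it has two minima; it is not the cost from the course, and `descend` and `grad_J` are illustrative names.

```python
def descend(theta, grad, alpha=0.05, steps=200):
    """Run plain gradient descent on one parameter and return the final value."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

# Hypothetical non-convex cost J(theta) = theta^4 - 2*theta^2, with minima at -1 and +1.
def grad_J(theta):
    return 4 * theta ** 3 - 4 * theta

from_right = descend(0.5, grad_J)    # starts right of the hump at theta = 0
from_left = descend(-0.5, grad_J)    # starts left of the hump
print(from_right, from_left)
```

With the same α and the same number of steps, the two runs settle in different minima purely because of where they started.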
The gradient descent algorithm is:
repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

where
j=0,1 represents the feature index number.
At each iteration j, one should simultaneously update the parameters θ1, θ2, ..., θn. Updating a
specific parameter prior to calculating another one on the j(th) iteration would yield a
wrong implementation.

Note: the two parameters must be updated simultaneously.
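The difference between a simultaneous and a sequential update can be sketched as follows. The cost and all function names are hypothetical; the cost was picked only because its two partial derivatives depend on both parameters, which makes the mistake visible after a single step.

```python
def step_simultaneous(theta0, theta1, grad0, grad1, alpha):
    """One gradient descent step where both partials use the OLD parameters."""
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    return temp0, temp1

def step_sequential(theta0, theta1, grad0, grad1, alpha):
    """Incorrect variant: theta1's partial already sees the updated theta0."""
    theta0 = theta0 - alpha * grad0(theta0, theta1)
    theta1 = theta1 - alpha * grad1(theta0, theta1)
    return theta0, theta1

# Hypothetical cost J = (theta0 + theta1 - 3)^2, so the two partials are coupled.
def d_theta0(t0, t1):
    return 2 * (t0 + t1 - 3)

def d_theta1(t0, t1):
    return 2 * (t0 + t1 - 3)

correct = step_simultaneous(0.0, 0.0, d_theta0, d_theta1, alpha=0.1)
wrong = step_sequential(0.0, 0.0, d_theta0, d_theta1, alpha=0.1)
print(correct, wrong)
```

Storing the new values in temporaries before assigning them is what keeps the second partial derivative from seeing a half-updated parameter vector.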

Lesson 11: Summary of Gradient Descent Key Points

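With the problem simplified to a single parameter, the partial derivative becomes an ordinary derivative. The lesson shows mathematically how the parameter approaches the minimum from either side.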

Gradient Descent Intuition
In this video we explored the scenario where we used one parameter θ1 and plotted its
cost function to implement a gradient descent. Our formula for a single parameter was:

Repeat until convergence:

$$\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$$

Regardless of the slope's sign for $\frac{d}{d\theta_1} J(\theta_1)$, θ1 eventually converges to its minimum
value. The following graph shows that when the slope is negative, the value of θ1 increases and when it is positive, the value of θ1 decreases.
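A single update step from each side of the minimum shows the same behavior numerically. This is only a sketch: the cost J(θ1) = (θ1 − 3)² and the names `dJ` and `update` are hypothetical choices for illustration.

```python
def dJ(theta1):
    """Derivative of the hypothetical cost J(theta1) = (theta1 - 3)^2."""
    return 2 * (theta1 - 3)

def update(theta1, alpha=0.1):
    """One step of theta1 := theta1 - alpha * dJ/dtheta1."""
    return theta1 - alpha * dJ(theta1)

left_start, right_start = 0.0, 6.0
after_left = update(left_start)      # slope is negative here
after_right = update(right_start)    # slope is positive here
print(after_left, after_right)
```

Subtracting a negative slope pushes θ1 up, subtracting a positive slope pushes it down, so both starts move toward the minimum at 3.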

α controls the "stride" of each descent step.

On a side note, we should adjust our parameter α to ensure that the gradient descent
algorithm converges in a reasonable time. Failure to converge or too much time to obtain
the minimum value imply that our step size is wrong.
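The effect of a badly chosen α can be sketched by running the same number of steps with three different values. The cost (θ1 − 3)², the three α values, and the function name `run` are all hypothetical; which α counts as "too large" depends entirely on the cost function.

```python
def run(alpha, steps=50, theta1=0.0):
    """Gradient descent on the hypothetical cost J(theta1) = (theta1 - 3)^2."""
    for _ in range(steps):
        theta1 = theta1 - alpha * 2 * (theta1 - 3)
    return theta1

too_small = run(alpha=0.001)   # moves toward 3, but very slowly
reasonable = run(alpha=0.1)    # close to the minimum at 3 after 50 steps
too_large = run(alpha=1.1)     # each step overshoots and the error grows
print(too_small, reasonable, too_large)
```

For this particular cost, each step multiplies the distance to the minimum by (1 − 2α), which is why α = 1.1 diverges while α = 0.001 barely moves.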

How does gradient descent converge with a fixed step size α?
The intuition behind the convergence is that $\frac{d}{d\theta_1} J(\theta_1)$ approaches 0 as we approach the
bottom of our convex function. At the minimum, the derivative will always be 0 and thus
we get:

$$\theta_1 := \theta_1 - \alpha \cdot 0$$

Once θ1 is already at the minimum, its value no longer changes.

As θ1 approaches the minimum, the steps shrink automatically, because the derivative gradually tends to 0.
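Recording the size of each step makes this shrinking visible even though α never changes. As before, the cost (θ1 − 3)² and the variable names are hypothetical and serve only as a sketch.

```python
alpha = 0.1          # fixed learning rate, never adjusted during the run
theta1 = 0.0
step_sizes = []
for _ in range(10):
    step = alpha * 2 * (theta1 - 3)   # alpha times the derivative of (theta1 - 3)^2
    theta1 = theta1 - step
    step_sizes.append(abs(step))
print(step_sizes)
```

Each recorded step is smaller than the one before it, because the derivative it is multiplied by keeps shrinking.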

Lesson 12: Gradient Descent for Linear Regression

Combining gradient descent with the cost function gives the gradient descent algorithm for linear regression.

Gradient Descent For Linear Regression

When specifically applied to the case of linear regression, a new form of the gradient descent
equation can be derived. We can substitute our actual cost function and our actual hypothesis
function and modify the equation to :
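Assuming the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ and the squared-error cost $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$ from the earlier lessons (neither is restated in this section, so both are assumptions here, as is $m$ for the number of training examples), the substituted update takes the following form:

```latex
\begin{aligned}
&\text{repeat until convergence: } \{ \\
&\quad \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \\
&\quad \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)} \\
&\}
\end{aligned}
```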

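I take the x with subscript j that appears in these notes to cover two cases: for j = 1 it is the horizontal coordinate of each data point, and for j = 0 it is the constant 1.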
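That reading of x with subscript j can be sketched in code: if every data point is stored as the pair (1, x), the same gradient expression serves both j = 0 and j = 1. This is a minimal sketch, assuming the squared-error cost and a straight-line hypothesis; the function name, the default α and step count, and the toy data are hypothetical.

```python
def gradient_descent(xs, ys, alpha=0.1, steps=2000):
    """Batch gradient descent for h(x) = theta[0] * 1 + theta[1] * x."""
    m = len(xs)
    # Each row is (x_0, x_1): x_0 is the constant 1, x_1 is the data point's x value.
    rows = [(1.0, x) for x in xs]
    theta = [0.0, 0.0]
    for _ in range(steps):
        errors = [theta[0] * r[0] + theta[1] * r[1] - y for r, y in zip(rows, ys)]
        # Same formula for every j: (1/m) * sum(error * x_j).
        grads = [sum(e * r[j] for e, r in zip(errors, rows)) / m for j in range(2)]
        # Simultaneous update: both gradients were computed from the old theta.
        theta = [theta[j] - alpha * grads[j] for j in range(2)]
    return theta

# Hypothetical toy data lying exactly on y = 1 + 2x.
theta = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(theta)
```

On this toy data the result comes out close to θ0 = 1 and θ1 = 2, the line the points were generated from.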

The formula above can be derived.
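A sketch of that derivation, assuming the squared-error cost $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$ and the hypothesis $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1$ with $x_0 = 1$ and $x_1 = x$ (these definitions come from the earlier lessons and are assumptions here). Applying the chain rule to each squared term gives:

```latex
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
  &= \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
  &= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right)
     \frac{\partial}{\partial \theta_j} \left( \theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} - y^{(i)} \right) \\
  &= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
```

For j = 0 the factor $x_j^{(i)}$ is 1 and drops out; for j = 1 it is the data point's x value.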