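Abstract: This article is the original video transcript of Lesson 11, "Gradient Descent Intuition", from Chapter 2, "Linear Regression with One Variable", of Andrew Ng's (吴恩达) Machine Learning course. I wrote it down word by word while studying the video so that I could refer back to it later, and I am now sharing it with everyone. If you find any errors, corrections are welcome, and I thank you sincerely in advance. I also hope it helps with your own studies.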

In the previous video (article), we gave a mathematical definition of gradient descent. Let's delve deeper, and in this video (article), get better intuition about what the algorithm is doing, and why the steps of the gradient descent algorithm might make sense.

Here's the gradient descent algorithm that we saw last time: repeat until convergence, $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ (for $j = 0$ and $j = 1$). And just to remind you, this parameter, or this term, $\alpha$, is called the learning rate. It controls how big a step we take when updating my parameter $\theta_j$. And this second term here is the derivative term. What I want to do in this video is give you better intuition about what each of these two terms is doing and why, when put together, this entire update makes sense.

In order to convey these intuitions, what I want to do is use a slightly simpler example where we want to minimize a function of just one parameter. So say we have a cost function $J$ of just one parameter, $\theta_1$, like we did a few videos back, where $\theta_1$ is a real number, okay? Just so we can have 1D plots, which are a little bit simpler to look at. Let's try to understand what gradient descent would do on this function.

So, let's say here's my function $J(\theta_1)$, where $\theta_1$ is a real number. Right? Now let's say I have initialized gradient descent with $\theta_1$ at this location. So imagine that we start off at that point on my function. What gradient descent will do is update $\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$.

And just as an aside about this derivative term: you may be wondering why I changed the notation from the partial derivative symbols. If you don't know what the difference is between the partial derivative symbol $\partial$ and the $d$, don't worry about it. Technically, in mathematics we call this a partial derivative ($\frac{\partial}{\partial \theta_1}$) and this a derivative ($\frac{d}{d\theta_1}$), depending on the number of parameters in the function $J$, but that's a mathematical technicality. For the purpose of this lecture, think of the partial symbol $\partial$ and $\frac{d}{d\theta_1}$ as exactly the same thing, and don't worry about whether there is any difference. I'm gonna try to use the mathematically precise notation, but for our purposes these notations are really the same thing. So, let's see what this equation will do. We are going to compute this derivative, $\frac{d}{d\theta_1} J(\theta_1)$. I'm not sure if you've seen derivatives in calculus before, but what a derivative does is basically say: let's take the tangent at that point, that straight line, the red line, just touching this function, and let's look at the slope of this red line. That is what the derivative is. It says, what is the slope of the line that is just tangent to the function? And the slope of the line is, of course, just the height divided by the horizontal distance. Now, this line has a positive slope, so it has a positive derivative. And so $\theta_1$ gets updated as $\theta_1$ minus $\alpha$ times some positive number. $\alpha$, the learning rate, is always a positive number. So I'm gonna take $\theta_1$ and update it as $\theta_1$ minus something, so I end up moving $\theta_1$ to the left; I decrease $\theta_1$. And we can see this is the right thing to do, because I actually want to head in this direction to get closer to the minimum over there. So gradient descent so far seems to be doing the right thing.
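The "height divided by horizontal distance" picture of the derivative can be tried numerically. The following is a minimal sketch, not something shown in the lecture: the cost `J`, the helper `slope`, the starting point, and the value of `alpha` are all assumptions chosen for illustration.

```python
def J(theta1):
    """Hypothetical one-parameter cost with its minimum at theta1 = 3."""
    return (theta1 - 3.0) ** 2


def slope(f, x, h=1e-6):
    """Estimate the slope of the tangent line at x as rise over run."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


theta1 = 5.0                      # a point to the right of the minimum
d = slope(J, theta1)              # positive slope at this point
alpha = 0.1                       # learning rate, always positive
new_theta1 = theta1 - alpha * d   # theta1 minus a positive number

print(d, new_theta1)
```

Because the estimated slope is positive, the update subtracts a positive amount and `new_theta1` lands to the left of `theta1`, closer to the minimum of this particular `J`.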

Let's look at another example. So let's take my same function $J$, just trying to draw the same function $J(\theta_1)$. Now let's say I had instead initialized my parameter over there on the left. So $\theta_1$ is here, and I'm gonna mark that point on the curve. Now my derivative term, $\frac{d}{d\theta_1} J(\theta_1)$, when evaluated at this point, is gonna look at the slope of that line. So this derivative term is the slope of this line. But this line is slanting down, so this line has negative slope. Or alternatively, I say that this function has a negative derivative, which just means negative slope, at that point. So this is less than or equal to zero. So when I update $\theta_1$, $\theta_1$ is updated as $\theta_1$ minus $\alpha$ times a negative number. And so I have $\theta_1$ minus a negative number, which means I'm actually going to increase $\theta_1$, right? Because minus a negative number means I'm adding something to $\theta_1$. And so we start here and increase $\theta_1$, which again seems like the thing I want to do, to try to get closer to the minimum. So, this hopefully explains the intuition behind what the derivative term is doing. Let's next take a look at the learning rate $\alpha$, and try to figure out what that's doing.
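Both cases, starting on the right and starting on the left, can be checked with the same one-line update. This is only a sketch under assumptions: the quadratic cost behind `dJ`, the helper name `gradient_step`, and the starting values are hypothetical and do not come from the lecture.

```python
def dJ(theta1):
    """Derivative of the hypothetical cost J(theta1) = (theta1 - 3)**2."""
    return 2.0 * (theta1 - 3.0)


def gradient_step(theta1, alpha):
    """One update: theta1 := theta1 - alpha * dJ/dtheta1."""
    return theta1 - alpha * dJ(theta1)


alpha = 0.1
from_left = gradient_step(1.0, alpha)    # negative slope, so theta1 increases
from_right = gradient_step(5.0, alpha)   # positive slope, so theta1 decreases

print(from_left, from_right)
```

In both cases the sign of the derivative pushes $\theta_1$ toward the minimum, which for this assumed cost sits at 3.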

So, here's my gradient descent update rule. Right, there's this equation. And let's look at what can happen if $\alpha$ is either too small or too large. So this first example: what happens if $\alpha$ is too small? So, here's my function $J(\theta_1)$. Let's just start here. If $\alpha$ is too small, then what I'm going to do is multiply the update by some small number. So I end up taking, you know, a baby step like that. Okay, so that's one step. Then from this new point we're gonna take another step. But if $\alpha$ is too small, that's another baby step. And so if my learning rate is too small, I'm gonna end up taking these tiny, tiny baby steps to try to get to the minimum, and I'm gonna need a lot of steps to get there. So if $\alpha$ is too small, gradient descent can be slow, because it's gonna need a lot of steps before it gets anywhere close to the global minimum.
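To get a feel for how much slower a small learning rate can be, one can count the updates needed until the derivative is nearly zero. The sketch below assumes a hypothetical quadratic cost, a hypothetical helper `steps_to_converge`, and arbitrary values for the tolerance and the two learning rates; none of these come from the lecture.

```python
def dJ(theta1):
    """Derivative of the hypothetical cost J(theta1) = (theta1 - 3)**2."""
    return 2.0 * (theta1 - 3.0)


def steps_to_converge(alpha, theta1=5.0, tol=1e-3, max_steps=100_000):
    """Count updates until the slope is within tol of zero (or we give up)."""
    steps = 0
    while abs(dJ(theta1)) > tol and steps < max_steps:
        theta1 = theta1 - alpha * dJ(theta1)
        steps += 1
    return steps


slow = steps_to_converge(alpha=0.001)   # baby steps
fast = steps_to_converge(alpha=0.1)     # larger, but still stable, steps

print(slow, fast)
```

With these assumed settings the small learning rate needs far more iterations than the larger one to reach the same tolerance.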

Now how about if $\alpha$ is too large? So, here's my function $J(\theta_1)$. It turns out that if $\alpha$ is too large, then gradient descent can overshoot the minimum, and may fail to converge or even diverge. Let's say we start off with $\theta_1$ there. It's actually pretty close to the minimum. So the derivative points to the right, but if $\alpha$ is too big, I'm gonna take a huge step. Maybe I'm gonna take a huge step like that. Now my cost function has gotten worse, because it started off at this value, and now the value is higher. Now my derivative points to the left, so I should actually decrease $\theta_1$. But look, if my learning rate is too big, I may take a huge step going from here all the way out there. So I end up going there, right? And if my learning rate is too big, it can take another huge step on the next iteration, and kind of overshoot and overshoot and so on, until you notice I'm actually getting further and further away from the minimum. And so if $\alpha$ is too large, it can fail to converge or even diverge.
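The overshooting behaviour can be reproduced on a toy example. This is a sketch under assumptions: the quadratic cost is hypothetical, and `alpha = 1.1` is simply a value that happens to be too large for this particular function (the threshold would differ for other cost functions).

```python
def dJ(theta1):
    """Derivative of the hypothetical cost J(theta1) = (theta1 - 3)**2."""
    return 2.0 * (theta1 - 3.0)


alpha = 1.1          # assumed to be too large for this cost
theta1 = 3.5         # start fairly close to the minimum at 3
distances = [abs(theta1 - 3.0)]

for _ in range(5):
    theta1 = theta1 - alpha * dJ(theta1)      # each step jumps past the minimum
    distances.append(abs(theta1 - 3.0))       # distance to the minimum afterwards

print(distances)
```

Each update lands on the opposite side of the minimum and farther away than before, so the recorded distances keep growing instead of shrinking.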

Now, I have another question for you. So, this is a tricky one, and when I was first learning this stuff, it actually took me a long time to figure this out. What if your parameter $\theta_1$ is already at a local minimum? What do you think one step of gradient descent will do? So, let's suppose you initialize $\theta_1$ at a local minimum. So suppose this is your initial value of $\theta_1$ over here, and it's already at a local optimum, or local minimum. It turns out that at a local optimum your derivative will be equal to zero. The tangent line at that point is flat, so the slope of this line is equal to zero, and thus this derivative term is equal to zero. And so in your gradient descent update, $\theta_1$ gets updated as $\theta_1$ minus $\alpha$ times zero. What this means is that, if you're already at a local optimum, it leaves $\theta_1$ unchanged, because the update is just $\theta_1 := \theta_1$. So if your parameter is already at a local minimum, one step of gradient descent does absolutely nothing. It doesn't change the parameter, which is what you want, because it keeps your solution at the local minimum.
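A quick way to convince yourself of this is to run one update starting exactly at the minimum. The cost function and the value of `alpha` below are assumptions for illustration only.

```python
def dJ(theta1):
    """Derivative of the hypothetical cost J(theta1) = (theta1 - 3)**2."""
    return 2.0 * (theta1 - 3.0)


alpha = 0.1
theta1 = 3.0                               # already at the minimum of this cost
new_theta1 = theta1 - alpha * dJ(theta1)   # theta1 minus alpha times zero

print(dJ(theta1), new_theta1)
```

The derivative is zero at that point, so the update leaves the parameter exactly where it was, whatever positive value `alpha` takes.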

This also explains why gradient descent can converge to a local minimum even with the learning rate $\alpha$ fixed. Here's what I mean by that. Let's look at an example. Here is a cost function $J(\theta_1)$ that maybe I want to minimize. And let's say I initialize my gradient descent algorithm, you know, out there at the magenta point. If I take one step of gradient descent, maybe it'll take me to that point, because my derivative is pretty steep out there, right? Now I'm at this green point, and if I take another step of gradient descent, you notice that my derivative, meaning the slope, is less steep at the green point than at the magenta point out there, right? Because as I approach the minimum, my derivative gets closer and closer to zero. So when I take another step of gradient descent, my new derivative is smaller, and I will naturally take a somewhat smaller step from this green point than I did from the magenta point. Now I am at a new point, the red point, and I'm even closer to the global minimum, so the derivative here will be even smaller than it was at the green point. So, when I take another step of gradient descent, now my derivative term is even smaller, and so the magnitude of the update to $\theta_1$ is even smaller, so I take a small step like so. And as gradient descent runs, you will automatically take smaller and smaller steps, until eventually you are taking very small steps and you find you have converged to the local minimum. So, just to recap: in gradient descent, as we approach a local minimum, gradient descent will automatically take smaller steps. That's because, by definition, a local minimum is where this derivative is equal to zero. So, as we approach the local minimum, this derivative term will automatically get smaller, and so gradient descent will automatically take smaller steps.
So this is what gradient descent looks like, and so actually there is no need to decrease $\alpha$ over time.
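The shrinking step sizes can be observed directly by recording how far each update moves the parameter while `alpha` stays fixed. As before, this is only a sketch: the quadratic cost, the starting point, the number of iterations, and `alpha = 0.2` are assumptions, not values from the lecture.

```python
def dJ(theta1):
    """Derivative of the hypothetical cost J(theta1) = (theta1 - 3)**2."""
    return 2.0 * (theta1 - 3.0)


alpha = 0.2          # fixed learning rate, never decreased
theta1 = 5.0         # start away from the minimum at 3
step_sizes = []

for _ in range(30):
    update = alpha * dJ(theta1)        # shrinks because the slope shrinks
    step_sizes.append(abs(update))
    theta1 = theta1 - update

print(step_sizes[:4], theta1)
```

Even though `alpha` never changes, every step is smaller than the one before it, and the parameter settles near the minimum of this assumed cost.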

So, that's the gradient descent algorithm, and you can use it to try to minimize any cost function $J$, not just the cost function $J$ that we defined for linear regression.

In the next video (article), we're going to take the function $J$ and set it back to be exactly linear regression's cost function, the squared-error cost function that we came up with earlier. Taking gradient descent and the squared-error cost function and putting them together will give us our first learning algorithm: our linear regression algorithm.
