Optimizing an objective function with Gradient Descent
Nowadays we can learn about domains that used to be reserved for academic communities. From Artificial Intelligence to Quantum Physics, we can browse an enormous amount of information available on the Internet and benefit from it.
However, the availability of information has some drawbacks. We need to be wary of the huge number of unverified sources full of factual errors (a topic for a whole different discussion). What’s more, we can get used to getting answers easily by googling them. As a result, we often take those answers for granted and use them without really understanding them.
The process of discovering things on our own is an important part of learning. Let’s take part in such an experiment and calculate the derivatives behind the Gradient Descent algorithm for Linear Regression.
Linear Regression is a statistical method that can be used to model the relationship between variables [1, 2]. It’s described by a line equation:
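With f(x) as the predicted value, the line equation has the familiar form:

```latex
f(x) = \theta_0 + \theta_1 x
```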
We have two parameters, Θ₀ and Θ₁, and a variable x. Given data points, we can find the optimal parameters to fit the line to our data set.
OK, now for Gradient Descent [2, 3]. It is an iterative algorithm that is widely used in Machine Learning (in many different flavors). We can use it to automatically find the optimal parameters of our line.
To do this, we need to optimize an objective function defined by this formula:
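With m data points, the squared-error objective can be written as below; the 1/(2m) normalizing factor is a common convention (the 1/2 will conveniently cancel against the 2 produced by the power rule later on):

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{j=1}^{m} \left( f(x^j) - y^j \right)^2
```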
In this function, we iterate over each point (xʲ, yʲ) from our data set. We calculate the value of the function f for xʲ and the current theta parameters (Θ₀, Θ₁). We take the result and subtract yʲ. Finally, we square it and add it to the sum.
Our objective function is a composite function. We can think of it as having an “outer” function and an “inner” function. To calculate the derivative of a composite function we follow the chain rule:
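For functions g (outer) and h (inner), the chain rule states:

```latex
\big( g(h(x)) \big)' = g'(h(x)) \cdot h'(x)
```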
In our case, the “outer” part is about raising everything inside the brackets (“inner function”) to the second power. According to the rule we need to multiply the “outer function” derivative by the derivative of an “inner function”. It looks like this:
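Writing Θ for either parameter, applying the chain rule to the squared term gives:

```latex
\frac{\partial}{\partial \theta} \big( f(x) - y \big)^2
  = 2 \big( f(x) - y \big) \cdot \frac{\partial}{\partial \theta} \big( f(x) - y \big)
```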
The next step is calculating the derivative of a power function. Let’s recall the power rule for derivatives:
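For an exponent n:

```latex
\left( x^n \right)' = n \, x^{\,n-1}
```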
Our “outer function” is simply an expression raised to the second power. So we put 2 before the whole formula and leave the rest as is (2 − 1 = 1, and an expression raised to the first power is simply that expression).
We still need to calculate a derivative of an “inner function” (right side of the formula). Let’s move to the third step.
Step 3. The derivative of a constant
The last rule is the simplest one. It is used to determine a derivative of a constant:
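For any constant c:

```latex
\frac{d}{dx} \, c = 0
```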
Since a constant never changes, the derivative of a constant is equal to zero. For example, if f(x) = 4, then f’(x) = 0.
Having all three rules in mind let’s break the “inner function” down:
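Substituting the line equation, the “inner function” is:

```latex
f(x) - y = \theta_0 + \theta_1 x - y
```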
The tricky part of our Gradient Descent objective function is that x is not a variable. x and y are constants that come from data set points. As we look for optimal parameters of our line, Θ₀ and Θ₁ are variables. That’s why we calculate two derivatives, one with respect to Θ₀ and one with respect to Θ₁.
Let’s start by calculating the derivative with respect to Θ₀. It means that Θ₁ will be treated as a constant.
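Term by term:

```latex
\frac{\partial}{\partial \theta_0} \left( \theta_0 + \theta_1 x - y \right)
  = \frac{\partial \theta_0}{\partial \theta_0}
  + \frac{\partial (\theta_1 x)}{\partial \theta_0}
  - \frac{\partial y}{\partial \theta_0}
  = 1 + 0 - 0 = 1
```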
You can see that constant parts were set to zero. What happened to Θ₀? As it’s a variable raised to the first power (a¹=a), we applied the power rule. It resulted in Θ₀ raised to the power of zero. When we raise a number to the power of zero, it’s equal to 1 (a⁰=1). And that’s it! Our derivative with respect to Θ₀ is equal to 1.
Finally, we have the whole derivative with respect to Θ₀:
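Summing over the data points (and assuming the 1/(2m)-normalized objective, where the 2 from the power rule cancels the 1/2):

```latex
\frac{\partial J}{\partial \theta_0}
  = \frac{1}{m} \sum_{j=1}^{m} \left( f(x^j) - y^j \right)
```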
Now it’s time to calculate a derivative with respect to Θ₁. It means that we treat Θ₀ as a constant.
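Term by term again:

```latex
\frac{\partial}{\partial \theta_1} \left( \theta_0 + \theta_1 x - y \right)
  = 0 + x - 0 = x
```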
By analogy to the previous example, Θ₁ was treated as a variable raised to the first power, and applying the power rule reduced it to 1. However, Θ₁ is multiplied by x, so we end up with a derivative equal to x.
The final form of the derivative with respect to Θ₁ looks like this:
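Again summing over the data points (with the same 1/(2m) convention):

```latex
\frac{\partial J}{\partial \theta_1}
  = \frac{1}{m} \sum_{j=1}^{m} \left( f(x^j) - y^j \right) x^j
```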
Complete Gradient Descent recipe
We calculated the derivatives needed by the Gradient Descent algorithm! Let’s put them where they belong:
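With a learning rate α, each iteration of Gradient Descent updates both parameters simultaneously using these derivatives:

```latex
\theta_0 := \theta_0 - \alpha \cdot \frac{1}{m} \sum_{j=1}^{m} \left( f(x^j) - y^j \right)
\qquad
\theta_1 := \theta_1 - \alpha \cdot \frac{1}{m} \sum_{j=1}^{m} \left( f(x^j) - y^j \right) x^j
```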
By doing this exercise we gain a deeper understanding of where the formula comes from. We don’t take it as a magic incantation found in an old book; instead, we actively go through the process of analyzing it. We break the method down into smaller pieces, realize we can finish the calculations ourselves, and put it all together.
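To make the recipe concrete, here is a minimal sketch in plain Python using the two derivatives calculated above. The data set and hyperparameters below are just an illustration:

```python
def gradient_descent(xs, ys, lr=0.05, iterations=2000):
    """Fit f(x) = theta0 + theta1 * x to the points by gradient descent."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Prediction error f(x^j) - y^j at every data point
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        # Derivative with respect to theta0: mean of the errors
        grad0 = sum(errors) / m
        # Derivative with respect to theta1: mean of error times x
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1

# Points lying exactly on the line y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
theta0, theta1 = gradient_descent(xs, ys)
```

Run on points drawn from y = 2x + 1, the parameters converge close to Θ₀ = 1 and Θ₁ = 2.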
From time to time grab a pen and paper and solve a problem. You can find an equation or method you already successfully use and try to gain this deeper insight by decomposing it. It will give you a lot of satisfaction and spark your creativity.