Day12 ML Review - Gradient Descent (2)
Mathematical Explanation
from https://angeloyeo.github.io/2020/08/16/gradient_descent.html
Derivation of the formula for gradient descent
At its core, gradient descent is an optimization technique, widely used in machine learning, for finding the value of the independent variable that minimizes a function. It does this by iteratively adjusting that variable in the direction that decreases the function value. Think of it as a way to "descend" to the lowest point of a function, much like a hiker navigating mountainous terrain.
When the data is large, finding the solution with an iterative method such as gradient descent can be more computationally efficient than computing a closed-form solution directly.
Gradient descent uses the function's gradient (slope) to determine in which direction $x$ should be adjusted so that the function value decreases toward its minimum.
- If the slope is positive, the function value increases as $x$ increases.
- Conversely, if the slope is negative, the function value decreases as $x$ increases.
Also, a large slope magnitude indicates a steep incline, and it also signifies that $x$ is far from the $x$ coordinate corresponding to the minimum/maximum value.
If the function value increases as $x$ increases at a specific point (the slope is positive), we need to move $x$ in the negative direction. Conversely, if the function value decreases as $x$ increases at that point (the slope is negative), we move $x$ in the positive direction.
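To make this sign rule concrete, here is a minimal Python sketch on the toy function $f(x) = x^2$; the function, step size, and starting points are assumptions chosen purely for illustration.

```python
# Minimal sketch of the sign rule, assuming f(x) = x^2 (so f'(x) = 2x).
def grad(x):
    return 2 * x              # derivative of f(x) = x^2

alpha = 0.1                   # step size (assumed value)

x = 3.0                       # slope here is +6, so we step in the negative direction
print(x - alpha * grad(x))    # 2.4 -> moved left, toward the minimum at 0

x = -3.0                      # slope here is -6, so we step in the positive direction
print(x - alpha * grad(x))    # -2.4 -> moved right, toward the minimum at 0
```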
Reference (translated from Korean)
Gradient descent can be described as a method that uses the slope of the function (i.e., the gradient) to find out where the value of $x$ should be moved so that the function reaches its minimum.
A positive slope means that the function value increases as $x$ increases; conversely, a negative slope means that the function value decreases as $x$ increases.
Also, a large slope magnitude means the function is steep at that point, but on the other hand it also means that the current $x$ is far from the $x$ coordinate corresponding to the minimum/maximum value.
Direction Component of the Gradient
In gradient descent, the direction of the update is crucial to effectively minimizing a function. If the function's value increases with an increase in $x$ (i.e., the slope is positive), the optimal strategy is to move $x$ in the opposite direction to reduce the function's value. Conversely, if the function's value decreases as $x$ increases (i.e., the slope is negative), $x$ should be moved in the positive direction to further decrease the function value.
This concept is mathematically represented by the update rule

$$x_{i+1} = x_i - \alpha \nabla f(x_i)$$
where $x_i$ and $x_{i+1}$ represent the current and updated values of $x$, respectively, $\nabla f(x_i)$ denotes the gradient of the function at $x_i$, and $\alpha$ is the learning rate or step size.
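A minimal sketch of this update rule in Python, assuming the example function $f(x) = (x - 2)^2$ and an arbitrarily chosen learning rate and starting point:

```python
# Sketch of the update rule x_{i+1} = x_i - alpha * f'(x_i),
# assuming f(x) = (x - 2)^2 with derivative f'(x) = 2 * (x - 2).
def grad(x):
    return 2 * (x - 2)

alpha = 0.1        # learning rate (assumed value)
x = 10.0           # initial guess (assumed value)

for i in range(50):
    x = x - alpha * grad(x)   # move against the slope

print(x)  # converges toward the minimizer x = 2
```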
Step Size in Gradient Descent
The step size, $\alpha$, controls how much we adjust $x$ at each iteration of gradient descent. It's essential to set $\alpha$ appropriately to ensure efficient and effective convergence to the minimum. If $\alpha$ is too large, the algorithm might overshoot the minimum; if it's too small, convergence could be unnecessarily slow.
The magnitude of the gradient often informs the choice of step size. A larger gradient magnitude indicates that $x$ is far from the minimum, suggesting a potentially larger step. Conversely, a smaller gradient suggests that $x$ is closer to the minimum, and a smaller step might be preferable.
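The sketch below illustrates this trade-off on the toy function $f(x) = x^2$; the specific $\alpha$ values are assumptions chosen to show a reasonable step size, one that is too small, and one that is too large.

```python
# Comparing step sizes on f(x) = x^2, whose derivative is f'(x) = 2x.
def run(alpha, steps=20, x=5.0):
    for _ in range(steps):
        x = x - alpha * 2 * x
    return x

print(run(0.4))    # well-chosen alpha: converges quickly toward 0
print(run(0.01))   # too small: still far from 0 after the same number of steps
print(run(1.1))    # too large: overshoots and diverges (|x| grows each step)
```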
In practice, $\alpha$ can be kept constant or adjusted dynamically based on criteria such as the iteration count or changes in the function value. Advanced versions of gradient descent, such as Adam or RMSprop, incorporate mechanisms to adaptively adjust $\alpha$ based on past gradients.
Dynamically adjusting the step size in this way allows large moves when far from the minimum and finer adjustments as the target is approached, which improves the convergence behavior of gradient descent.
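As a rough illustration of this adaptive idea, here is a one-dimensional RMSprop-style sketch; the test function, decay rate, and learning rate are assumptions, and real implementations in deep-learning libraries operate on whole parameter vectors.

```python
import math

# Minimal sketch of an RMSprop-style adaptive step in one dimension.
# The function, decay rate rho, and learning rate alpha are assumed for illustration.
def grad(x):
    return 2 * (x - 2)                        # derivative of f(x) = (x - 2)^2

alpha, rho, eps = 0.1, 0.9, 1e-8
x, s = 10.0, 0.0

for _ in range(200):
    g = grad(x)
    s = rho * s + (1 - rho) * g * g           # running average of squared gradients
    x = x - alpha * g / (math.sqrt(s) + eps)  # step scaled by the inverse RMS of recent gradients

print(x)  # ends up close to the minimizer x = 2
```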