>[!Abstract] Machine learning is largely concerned with minimizing an objective function (also called a loss or cost function). A loss function measures how well a model fits a given dataset: the lower the cost, the better the fit. One optimization method commonly used to minimize a cost function is gradient descent.
>
>For a convex function, the gradient descent algorithm eventually finds the optimal point by repeatedly applying the update below until the next iterate is very close to the current one, i.e. until convergence:
>$x_{t+1} = x_t - \alpha_t\nabla f(x_t)$
>
>In other words, at each iteration it computes the gradient of the cost function, scales it by some constant $\alpha_t$ known as the *learning rate*, and takes a step in the direction of the negative gradient. A sketch of this update loop is given after this note.
>
>Many functions have several valleys, and as a result gradient descent can get stuck in a local minimum. Therefore, a variant called *stochastic gradient descent (SGD)* is typically used. This alternative introduces an element of randomness, estimating the gradient from a randomly chosen example (or small batch) at each step rather than from the whole dataset, which makes the descent less likely to get stuck in a local minimum; a second sketch below illustrates this.
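A minimal sketch of the gradient descent update $x_{t+1} = x_t - \alpha_t\nabla f(x_t)$ above, assuming for illustration a simple convex objective $f(x) = (x - 3)^2$ and a fixed learning rate (the function names and values here are placeholders, not from the source):

```python
import numpy as np

def gradient_descent(grad_f, x0, learning_rate=0.1, tol=1e-8, max_iters=1000):
    """Repeat x_{t+1} = x_t - alpha * grad f(x_t) until iterates stop moving."""
    x = x0
    for _ in range(max_iters):
        x_next = x - learning_rate * grad_f(x)
        # Convergence: the next iterate is very close to the current one.
        if np.abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# Example: f(x) = (x - 3)^2 is convex, has gradient 2 * (x - 3), and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(minimum)  # approximately 3.0
```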
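And a minimal sketch of stochastic gradient descent under similar assumptions, here applied to a toy least-squares fit: at each step the gradient is estimated from a single randomly chosen example, which supplies the randomness described in the note (the data and function names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: y is roughly 2x + 1 plus noise (illustrative only).
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + 1.0 + 0.1 * rng.normal(size=200)

def sgd(X, y, learning_rate=0.05, epochs=50):
    """Stochastic gradient descent on squared error, one random example per step."""
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        for i in rng.permutation(n):                  # visit examples in random order
            error = (w * X[i] + b) - y[i]             # residual for a single example
            w -= learning_rate * 2.0 * error * X[i]   # gradient of error^2 w.r.t. w
            b -= learning_rate * 2.0 * error          # gradient of error^2 w.r.t. b
    return w, b

w, b = sgd(X, y)
print(w, b)  # roughly 2.0 and 1.0
```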