If you’re like me, you’ve heard a lot about Gradient Descent. You’ve heard that it is a foundational algorithm for optimising functions, and that all self-respecting Machine Learning techniques employ it (or some variation of it). You’ve probably even seen graphs like these:

which show how a point iteratively moves across a function’s surface until an optimum (maximum or minimum) is reached. These are all great explanations of how the method works intuitively but, based on them alone, I never understood step by step how the method actually works.

In this post, I explore the algorithm itself (Cauchy’s variation) and perform an example “by hand”. I will focus on minimising a function, but the same intuition applies to maximisation.

## Steepest Descent

Suppose we have a multivariate function f(x_1, x_2,…,x_n) and we want to find a point \mathbf{x}^* = (x_1^*, …, x_n^*) where the value of f(x_1, x_2,…,x_n) is minimised. Assuming that a minimum exists, we can use Steepest Descent to algorithmically discover that minimum, starting at some arbitrary point \mathbf{x}.

But how do we know we’ve achieved a minimum? Here is where we make use of the Gradient vector, which is denoted by \nabla f(\mathbf{x}) and is defined as

(1) \nabla f(\mathbf{x}) = [\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, …,\frac{\partial f(\mathbf{x})}{\partial x_n}]

The components of the Gradient vector are the partial derivatives of the function with respect to each variable x_1,…,x_n. If all of these partial derivatives are 0 at some point \mathbf{x}

$$\frac{\partial f(\mathbf{x})}{\partial x_1} = \frac{\partial f(\mathbf{x})}{\partial x_2} = … = \frac{\partial f(\mathbf{x})}{\partial x_n} = 0$$

then the point \mathbf{x} is a stationary point. In this post I will assume we have a convex function, so the stationary point is a minimum – there are ways to test this in general, but they are outside the scope of the post.
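These partial derivatives can be built symbolically in base R with `D()`, which gives a quick way to check whether a given point is stationary. A minimal sketch, assuming a simple example function f(x_1, x_2) = x_1^2 + x_2^2 (not the one used later in this post):

```r
# Gradient of f(x1, x2) = x1^2 + x2^2 using R's symbolic derivative D()
f <- expression(x1^2 + x2^2)

# Evaluate both partial derivatives at the point (3, -2)
grad <- c(eval(D(f, "x1"), list(x1 = 3, x2 = -2)),
          eval(D(f, "x2"), list(x1 = 3, x2 = -2)))

grad   # (6, -4): not the zero vector, so (3, -2) is not a stationary point
```

At the origin both partial derivatives vanish, which is exactly the stationary-point condition above.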

Steepest Descent proceeds by the following equation

(2) \mathbf{x}_{n+1} = \mathbf{x}_n + \lambda^* \nabla f(\mathbf{x}_n)

What (2) tells us is that if we are at some point \mathbf{x}_n, we can determine the next point by moving along the gradient vector at point \mathbf{x}_n. How far we go along this vector is determined by a step-size parameter, denoted \lambda^*. Note that for minimisation the optimal \lambda^* turns out to be negative, so we in fact move against the gradient. (Why we move along the direction of the gradient vector has been covered in this post.)

We continue in this way until we reach a point whose gradient is the zero vector, or until some stopping criterion is reached (sometimes it takes too long to converge to a minimum, so a user can decide when a point is “good enough”).

Now all that’s left to do is to determine what value the step-size \lambda should take. In Machine Learning, this parameter is user-defined, so it can be set to a manual value (it tends to be a very small number like 0.01). In Cauchy’s approach, however, this parameter is determined by a univariate optimisation at every iteration. The univariate optimisation method I will use in this post is differentiating with respect to \lambda and equating to 0.
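What that per-iteration univariate optimisation looks like can also be sketched with base R’s `optimize()`, which finds the step-size numerically instead of by differentiating by hand. This assumes the example function and starting point from the worked example later in this post, and that the optimal \lambda lies in the interval (-10, 10):

```r
# The example function, a starting point, and the gradient there
f  <- function(x1, x2) x1^2 + x1*x2 + x2^2 - 3*x1 - 3*x2 + 3
x0 <- c(1, 0)
g0 <- c(-1, -2)                      # gradient of f at x0

# f restricted to the line x0 + lambda * g0 is a function of lambda alone
phi <- function(lambda) f(x0[1] + lambda * g0[1], x0[2] + lambda * g0[2])

# Univariate optimisation over an assumed bracketing interval
lambda_star <- optimize(phi, interval = c(-10, 10))$minimum
lambda_star                          # close to -5/14
```

Note the minimiser is negative, confirming that for minimisation we move against the gradient.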

For a two variable function, the general algorithm for using Steepest Descent is as follows:

- Choose an initial starting point

$$\mathbf{x_0} = (x_{1_0}, x_{2_0})$$

- Determine a search direction via the Gradient Vector

$$\nabla f(\mathbf{x}) = [\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}]$$

- Substitute starting point \mathbf{x_0} = (x_{1_0}, x_{2_0}) into the gradient vector \nabla f(\mathbf{x_0}) = \nabla f(x_{1_0}, x_{2_0})

- Perform a univariate optimisation to obtain the optimal step-size, \lambda^*

$$f(\lambda) = f[\mathbf{x_0} + \lambda \nabla f(\mathbf{x_0})]$$

Writing the components of the gradient at the starting point as \nabla f(\mathbf{x_0}) = (g_1, g_2), this becomes

$$f(\lambda) = f[(x_{1_0}, x_{2_0}) + \lambda(g_1, g_2)]$$

$$f(\lambda) = f[x_{1_0} + \lambda g_1, x_{2_0} + \lambda g_2]$$

Substitute

$$x_1 = x_{1_0} + \lambda g_1$$

$$x_2 = x_{2_0} + \lambda g_2$$

into the function of interest

- Set f'(\lambda) = 0 and solve for \lambda

- State new point

$$\mathbf{x_1} = \mathbf{x_0} + \lambda^* \nabla f(\mathbf{x_0})$$

- Repeat until convergence

The approach generalises to n variables in exactly the same way.
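The steps above can be sketched end to end in base R. This is a minimal illustration under some assumptions (a two-variable function given as an R expression in x1 and x2, a numerical line search via `optimize()` over the interval (-10, 10)), not the implementation given later in this post:

```r
# Steepest descent for a two-variable function given as an R expression,
# using D() for the gradient and optimize() for the step-size at each step
steepest_descent <- function(f_expr, start, max_iter = 100, tol = 1e-6) {
  x <- start
  for (i in seq_len(max_iter)) {
    env <- list(x1 = x[1], x2 = x[2])
    g <- c(eval(D(f_expr, "x1"), env),           # gradient vector at x
           eval(D(f_expr, "x2"), env))
    if (sqrt(sum(g^2)) < tol) break              # gradient ~ zero vector: stop
    phi <- function(l) eval(f_expr, list(x1 = x[1] + l * g[1],
                                         x2 = x[2] + l * g[2]))
    lambda <- optimize(phi, c(-10, 10))$minimum  # univariate optimisation
    x <- x + lambda * g                          # equation (2)
  }
  x
}

f <- expression(x1^2 + x1*x2 + x2^2 - 3*x1 - 3*x2 + 3)
steepest_descent(f, start = c(1, 0))             # approaches (1, 1)
```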

## Worked Example

Let’s use the steepest descent method to find the minimum of the two variable function

$$ f(x_1,x_2) = x_1^2 + x_1 x_2 + x_2^2 - 3x_1 - 3x_2 + 3$$

I will work through one iteration; the reader can continue to replicate the steps until the true minimum of (1,1) is found.

- Choose an initial starting point. Arbitrarily I select

$$ \mathbf{x}_0 = (1,0)$$

- Determine a search direction via the Gradient Vector

$$\nabla f(x_1,x_2) = [\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}]$$

$$\nabla f(x_1,x_2) = [2x_1 + x_2 - 3 , x_1 + 2x_2 - 3]$$

- Substitute starting point $$\mathbf{x_0} = (x_{1_0}, x_{2_0}) = (1,0) $$ into the gradient vector $$\nabla f(\mathbf{x_0}) = \nabla f(x_{1_0}, x_{2_0})$$

$$ \nabla f(\mathbf{x}_0) = \nabla f(1,0) = (-1,-2)$$

This tells us the direction to move in

- Perform a univariate optimisation to obtain the optimal step-size, \lambda^*

Find \lambda to minimise

f(\lambda) = f[\mathbf{x_0} + \lambda \nabla f(\mathbf{x_0})]

f(\lambda) = f[(1,0) + \lambda(-1,-2)]

f(\lambda) = f[1 - \lambda, -2\lambda] = 7\lambda^2 + 5\lambda + 1

- Set f'(\lambda) = 0 and solve for \lambda; call the solution \lambda^*

Setting f'(\lambda)= 0

f'(\lambda) = 14\lambda + 5 = 0

\lambda^* = -\frac{5}{14}

- State new point \mathbf{x_{n+1}} = \mathbf{x_{n}} + \lambda^* \nabla f(\mathbf{x_{n}})

\mathbf{x_1} = (1,0) + -\frac{5}{14}(-1,-2) = (\frac{19}{14},\frac{5}{7})

Check \nabla f(\frac{19}{14},\frac{5}{7})

$$ \nabla f(x_1 = \frac{19}{14}, x_2 = \frac{5}{7}) = (\frac{3}{7},-\frac{3}{14})$$

So we’re not at the minimum yet – the next step will be to repeat the process shown. Repeating the process will show a sequence of points converging to the true optimum (1,1).
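The iteration above is easy to check by machine. A few lines of base R transcribing the hand derivation (the gradient formula comes from the earlier step):

```r
# The example function and its gradient vector, as derived above
f    <- function(x1, x2) x1^2 + x1*x2 + x2^2 - 3*x1 - 3*x2 + 3
grad <- function(x1, x2) c(2*x1 + x2 - 3, x1 + 2*x2 - 3)

x0 <- c(1, 0)
g0 <- grad(x0[1], x0[2])        # (-1, -2)

lambda_star <- -5/14            # from solving f'(lambda) = 14*lambda + 5 = 0
x_next <- x0 + lambda_star * g0

x_next                          # (19/14, 5/7)
grad(x_next[1], x_next[2])      # (3/7, -3/14): not zero, so keep iterating
```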

## Implementation in R

One implementation of a two-variable Gradient Descent process can be seen in the R code below. I have excluded the univariate optimisation of the step-size and have instead fixed it as a user-defined parameter throughout the process.

Since the code below is a user-defined function, it has some obvious limitations:

- The function is slow
- It can be misused; if the obj argument is set to “max” for the given function, the while loop can go on forever!
- A tolerance must be set – the algorithm may take a very long time to converge to a point where the gradient vector is the zero vector.
- The input function argument must be entered as a character.

Nonetheless the results can be instructive to see how the algorithm works.

```r
# Gradient Descent summary for two variables - user-defined function
library(dplyr)  # for tibble(), if_else(), near() and bind_rows()

grad_descent_two_var <- function(input_function, obj, x_initial, y_initial,
                                 lambda = 0.1,
                                 tolerance = .Machine$double.eps^0.5) {

  # Helper functions: evaluate the partial derivatives at the point (x, y)
  apply_deriv_x <- function(x, y, func) {
    if (!is.expression(func)) stop("function must be entered as an expression")
    eval(D(func, "x"))
  }

  apply_deriv_y <- function(x, y, func) {
    if (!is.expression(func)) stop("function must be entered as an expression")
    eval(D(func, "y"))
  }

  # Create a dataframe with information about the initial point
  initial_df <- tibble(func = input_function,
                       obj = obj,
                       x1 = x_initial,
                       x2 = y_initial,
                       stepsize = if_else(obj == "max", lambda, -lambda),
                       grad_x1 = apply_deriv_x(x1, x2, parse(text = func)),
                       grad_x2 = apply_deriv_y(x1, x2, parse(text = func)),
                       x1new = x1 + (stepsize * grad_x1),
                       x2new = x2 + (stepsize * grad_x2))

  # Iterate until the gradient vector is (numerically) the zero vector
  while (!near(x = sum(abs(initial_df$grad_x1[nrow(initial_df)]),
                       abs(initial_df$grad_x2[nrow(initial_df)])),
               y = 0, tol = tolerance)) {

    newrow <- tibble(func = input_function,
                     obj = obj,
                     x1 = initial_df$x1new[nrow(initial_df)],
                     x2 = initial_df$x2new[nrow(initial_df)],
                     stepsize = if_else(obj == "max", lambda, -lambda),
                     grad_x1 = apply_deriv_x(x1, x2, parse(text = func)),
                     grad_x2 = apply_deriv_y(x1, x2, parse(text = func)),
                     x1new = x1 + (stepsize * grad_x1),
                     x2new = x2 + (stepsize * grad_x2))

    # bind_rows() rather than union(): union() silently drops duplicate rows,
    # which loses iterates (and can loop forever if an iterate repeats exactly)
    initial_df <- bind_rows(initial_df, newrow)
  }

  algo_summary <- tibble(num_iter = nrow(initial_df),
                         optim_x1 = initial_df$x1[nrow(initial_df)],
                         optim_x2 = initial_df$x2[nrow(initial_df)])

  list(iterates = initial_df, iteration_summary = algo_summary)
}

# Save the results (a list of two dataframes)
results <- grad_descent_two_var(
  input_function = "x^2 + (x*y) + y^2 - (3*x) - (3*y) + 3",
  obj = "min",
  x_initial = 1,
  y_initial = 0)

# Show how the values of (x1, x2) vary throughout the process
View(results$iterates)
View(results$iteration_summary)
```

Applying the R function above to the function $$f(x_1,x_2) = x_1^2 + x_1 x_2 + x_2^2 - 3x_1 - 3x_2 + 3$$ gives an output table which looks like this:

The final row in the above table (not displayed here) will show that the minimum is reached when (x_1, x_2) = (1,1).

## Reference

- Image was found at https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html
