If you’re like me, you’ve heard a lot about Gradient Descent. You’ve heard that it’s a foundational algorithm for optimising functions, one which every self-respecting Machine Learning technique employs (or some variation of it). You’ve probably even seen graphs like these:

which show how a function is iteratively traversed until an optimum (maximum or minimum) is reached. These are all great intuitive explanations of how the method works but, based on them alone, I never understood step by step what the method was actually doing.

In this post, I explore the algorithm itself (Cauchy’s variation) and perform an example “by hand”.  I will focus on minimising a function but the same intuition applies for maximisation.

## Steepest Descent

Suppose we have a multivariate function $f(x_1, x_2, \dots, x_n)$ and we want to find a point $\mathbf{x}^* = (x_1, \dots, x_n)$ at which its value is minimised. Assuming that a minimum exists, we can use Steepest Descent to discover it algorithmically, starting at some arbitrary point $\mathbf{x}$.

But how do we know we’ve achieved a minimum? Here is where we make use of the Gradient vector which is denoted by $\nabla f(\mathbf{x})$ and is defined as

(1) $\nabla f(\mathbf{x}) = \left[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \dots, \frac{\partial f(\mathbf{x})}{\partial x_n}\right]$

The components of the Gradient vector are the partial derivatives of the function with respect to each variable $x_1, \dots, x_n$. If all of these partial derivatives are 0 at some point $\mathbf{x}$

$$\frac{\partial f(\mathbf{x})}{\partial x_1} = \frac{\partial f(\mathbf{x})}{\partial x_2} = \dots = \frac{\partial f(\mathbf{x})}{\partial x_n} = 0$$

then the point $\mathbf{x}$ is a stationary point. In this post I will assume we have a convex function, so the stationary point is a minimum; there are ways to test this in general, but they are outside the scope of the post.
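To make the Gradient vector concrete, here is a small sketch using base R's symbolic `D()` function; the function $f(x_1, x_2) = x_1^2 + x_2^2$ and the evaluation points are purely illustrative.

```r
# Compute the gradient of an illustrative function f(x, y) = x^2 + y^2
# using base R's symbolic differentiation
f <- expression(x^2 + y^2)

df_dx <- D(f, "x")  # partial derivative with respect to x: 2 * x
df_dy <- D(f, "y")  # partial derivative with respect to y: 2 * y

# Evaluate the gradient vector at the point (x, y) = (3, 4)
grad <- c(eval(df_dx, list(x = 3, y = 4)),
          eval(df_dy, list(x = 3, y = 4)))
print(grad)  # 6 8 -- nonzero, so (3, 4) is not a stationary point

# At (0, 0) both partials vanish: a stationary point (here, the minimum)
print(c(eval(df_dx, list(x = 0, y = 0)),
        eval(df_dy, list(x = 0, y = 0))))  # 0 0
```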

Steepest Descent proceeds by the following equation

(2) $\mathbf{x}_{n+1} = \mathbf{x}_n + \lambda^* \nabla f(\mathbf{x}_n)$

What (2) tells us is that if we are at some point $\mathbf{x}_n$, we can determine the next point by moving along the gradient vector at $\mathbf{x}_n$. How far we go along this vector is determined by the step-size parameter $\lambda^*$. Note that when minimising, the optimal $\lambda^*$ will turn out to be negative, so the move is effectively downhill, against the gradient. (Why we move in the direction of the gradient vector has been covered in this post.)

We continue in this way until we reach a point whose gradient is the zero vector, or until some stopping criterion is met (sometimes convergence to a minimum takes too long, so a user can decide when a point is “good enough”).

Now all that’s left is to determine what value the step-size $\lambda$ should take. In Machine Learning this parameter is user defined, typically set to a very small value such as 0.01. In Cauchy’s approach, however, it is determined by a univariate optimisation at every iteration. The univariate optimisation method I will use in this post is differentiating with respect to $\lambda$ and setting the derivative equal to 0.
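The fixed-step Machine Learning variant mentioned above can be sketched in a few lines of R; the gradient function and the points are illustrative, and the step is taken against the gradient since we are minimising.

```r
# One fixed-step gradient descent update for f(x1, x2) = x1^2 + x2^2,
# whose gradient is (2 * x1, 2 * x2) -- an illustrative choice
grad_f <- function(x) c(2 * x[1], 2 * x[2])

lambda <- 0.01        # small user-defined step size, as in Machine Learning
x_n <- c(3, 4)        # current point

# Step against the gradient to move downhill
x_next <- x_n - lambda * grad_f(x_n)
print(x_next)  # 2.94 3.92 -- slightly closer to the minimum at (0, 0)
```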

For a two variable function, the general algorithm for using Steepest Descent is as follows:

1. Choose an initial starting point

$$\mathbf{x_0} = (x_{1_0}, x_{2_0})$$

1. Determine a search direction via the Gradient Vector

$$\nabla f(\mathbf{x}) = [\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}]$$

1. Substitute starting point $\mathbf{x_0} = (x_{1_0}, x_{2_0})$ into the gradient vector $\nabla f(\mathbf{x_0}) = \nabla f(x_{1_0}, x_{2_0})$
1. Perform a univariate optimisation to obtain the optimal step-size, $\lambda^*$

$$f(\lambda) = f[\mathbf{x_0} + \lambda \nabla f(\mathbf{x_0})]$$
$$f(\lambda) = f[(x_{1_0}, x_{2_0}) + \lambda(x_{1_1}, x_{2_1})]$$
$$f(\lambda) = f[x_{1_0} + \lambda x_{1_1}, x_{2_0} + \lambda x_{2_1}]$$

where $(x_{1_1}, x_{2_1})$ denotes the components of the gradient $\nabla f(\mathbf{x_0})$.

Substitute

$$x_1 = x_{1_0} + \lambda x_{1_1}$$

$$x_2 = x_{2_0} + \lambda x_{2_1}$$

in the function of interest

1. Set $f'(\lambda) = 0$ and solve for $\lambda$
1. State new point

$$\mathbf{x_1} = \mathbf{x_0} + \lambda^* \nabla f(\mathbf{x_0})$$

1. Repeat until convergence

The approach generalises to $n$ variables without loss of generality.
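The steps above can be sketched as a short R function. As an assumption on my part, I use R's built-in `optimize()` for the univariate step in place of differentiating and equating to zero; the search interval and the names are illustrative.

```r
# Steepest Descent with a univariate line search at each iteration
f <- function(x) x[1]^2 + x[1] * x[2] + x[2]^2 - 3 * x[1] - 3 * x[2] + 3
grad_f <- function(x) c(2 * x[1] + x[2] - 3, x[1] + 2 * x[2] - 3)

steepest_descent <- function(x0, max_iter = 100, tol = 1e-8) {
  x <- x0
  for (i in seq_len(max_iter)) {
    g <- grad_f(x)                     # steps 2-3: gradient at the current point
    if (sqrt(sum(g^2)) < tol) break    # step 7: stop when the gradient vanishes
    # Steps 4-5: univariate optimisation of f along the gradient direction
    lambda_star <- optimize(function(l) f(x + l * g),
                            interval = c(-10, 10))$minimum
    x <- x + lambda_star * g           # step 6: state the new point
  }
  x
}

steepest_descent(c(1, 0))  # approaches the true minimum at (1, 1)
```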

## Worked Example

Let’s use the steepest descent method to find the minimum of the two variable function
$$f(x_1,x_2) = x_1^2 + x_1 x_2 + x_2^2 - 3x_1 - 3x_2 + 3$$
I will work through one iteration, the reader can continue to replicate the steps until the true minimum of $(1,1)$ is found.

1. Choose an initial starting point.  Arbitrarily I select

$$\mathbf{x}_0 = (1,0)$$

1. Determine a search direction via the Gradient Vector

$$\nabla f(x_1,x_2) = [\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}]$$

$$\nabla f(x_1,x_2) = [2x_1 + x_2 - 3 , x_1 + 2x_2 - 3]$$

1. Substitute starting point $\mathbf{x_0} = (x_{1_0}, x_{2_0}) = (1,0)$ into the gradient vector: $\nabla f(\mathbf{x}_0) = \nabla f(x_{1_0}, x_{2_0})$

$$\nabla f(\mathbf{x}_0) = \nabla f(1,0) = (-1,-2)$$

This tells us the direction in which to move.

1. Perform a univariate optimisation to obtain the optimal step-size, $\lambda^*$

Find $\lambda$ to minimise

$f(\lambda) = f[\mathbf{x_0} + \lambda \nabla f(\mathbf{x_0})]$

$f(\lambda) = f[(1,0) + \lambda(-1,-2)]$

$f(\lambda) = f[1 - \lambda, -2\lambda] = 7\lambda^2 + 5\lambda + 1$

1. Set $f'(\lambda) = 0$ and solve for $\lambda^*$

Setting $f'(\lambda)= 0$

$f'(\lambda) = 14\lambda + 5 = 0$

$\lambda^* = -\frac{5}{14}$

1. State new point  $\mathbf{x_{n+1}} = \mathbf{x_{n}} + \lambda^* \nabla f(\mathbf{x_{n}})$

$\mathbf{x_1} = (1,0) - \frac{5}{14}(-1,-2) = (\frac{19}{14},\frac{5}{7})$

Check $\nabla f(\frac{19}{14},\frac{5}{7})$
$$\nabla f(x_1 = \frac{19}{14}, x_2 = \frac{5}{7}) = (\frac{3}{7},-\frac{3}{14})$$

So we’re not at the minimum yet – the next step will be to repeat the process shown. Repeating the process will show a sequence of points converging to the true optimum $(1,1)$.
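The hand computation above can be checked numerically in R (the helper names here are my own):

```r
# Verify the first Steepest Descent iteration for
# f(x1, x2) = x1^2 + x1*x2 + x2^2 - 3*x1 - 3*x2 + 3
f_grad <- function(x1, x2) c(2 * x1 + x2 - 3, x1 + 2 * x2 - 3)

x0 <- c(1, 0)                 # chosen starting point
g0 <- f_grad(x0[1], x0[2])    # gradient at x0: (-1, -2), as derived above
lambda_star <- -5 / 14        # from solving f'(lambda) = 0 by hand

x1 <- x0 + lambda_star * g0   # the new point
print(x1)                     # (19/14, 5/7)

print(f_grad(x1[1], x1[2]))   # (3/7, -3/14) -- not yet the zero vector
```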

## Implementation in R

One implementation of a two-variable Gradient Descent process is given in the R code below. I have excluded the univariate optimisation used to determine the step-size and have instead fixed it as a user-defined parameter throughout the process.

Since the code below is a user-defined function, it has some obvious limitations:

1. The function is slow.
2. It can be misused; if the `obj` argument is set to `"max"` for the given function, the while loop can run forever!
3. A tolerance must be set; the algorithm may take a very long time to converge to a point where the gradient vector is the zero vector.
4. The input function argument must be entered as a character string.

Nonetheless the results can be instructive to see how the algorithm works.

```r
# Gradient Descent Summary for two variables - user defined function

library(dplyr)  # provides tibble(), if_else(), near() and bind_rows()

grad_descent_two_var <- function(input_function, obj, x_initial, y_initial,
                                 lambda = 0.1,
                                 tolerance = .Machine$double.eps^0.5) {

  # Create functions to make partial derivatives
  apply_deriv_x <- function(x, y, func) {
    if (is.expression(func)) {
      f_x_y <- func
      f_x <- D(f_x_y, "x")
      x <- x  # force the arguments into the local environment
      y <- y
      return(eval(f_x))
    } else {
      return("function must be entered as an expression")
    }
  }

  apply_deriv_y <- function(x, y, func) {
    if (is.expression(func)) {
      f_x_y <- func
      f_y <- D(f_x_y, "y")
      x <- x
      y <- y
      return(eval(f_y))
    } else {
      return("function must be entered as an expression")
    }
  }

  # Create a dataframe with information about the initial point
  initial_df <- tibble(func = input_function,
                       obj = obj,
                       x1 = x_initial,
                       x2 = y_initial,
                       stepsize = if_else(obj == "max", lambda, -lambda),
                       grad_x1 = apply_deriv_x(x1, x2, parse(text = func)),
                       grad_x2 = apply_deriv_y(x1, x2, parse(text = func)),
                       x1new = x1 + (stepsize * grad_x1),
                       x2new = x2 + (stepsize * grad_x2))

  # Iterate using the gradient descent approach
  while (near(x = sum(abs(initial_df$grad_x1[nrow(initial_df)]),
                      abs(initial_df$grad_x2[nrow(initial_df)])),
              y = 0, tol = tolerance) == FALSE) {
    newrow <- tibble(func = input_function,
                     obj = obj,
                     x1 = initial_df$x1new[nrow(initial_df)],
                     x2 = initial_df$x2new[nrow(initial_df)],
                     stepsize = if_else(obj == "max", lambda, -lambda),
                     grad_x1 = apply_deriv_x(x1, x2, parse(text = func)),
                     grad_x2 = apply_deriv_y(x1, x2, parse(text = func)),
                     x1new = x1 + (stepsize * grad_x1),
                     x2new = x2 + (stepsize * grad_x2))
    # bind_rows() appends the new iterate; union() would silently drop
    # duplicate rows and does not guarantee row order
    initial_df <- bind_rows(initial_df, newrow)
  }

  algo_summary <- tibble(num_iter = nrow(initial_df),
                         optim_x1 = initial_df$x1[nrow(initial_df)],
                         optim_x2 = initial_df$x2[nrow(initial_df)])

  return(list(iterates = initial_df, iteration_summary = algo_summary))
}

# Save the results (a list of two dataframes)
results <- grad_descent_two_var(input_function = "x^2 + (x*y) + y^2 - (3*x) - (3*y) + 3",
                                obj = "min",
                                x_initial = 1,
                                y_initial = 0)

# Show how the values of (x1, x2) vary throughout the process
View(results$iterates)
View(results$iteration_summary)
```

Applying the R function above to the function $$f(x_1,x_2) = x_1^2 + x_1 x_2 + x_2^2 - 3x_1 - 3x_2 + 3$$ gives an output table which looks like this:

The final row in the above table (not displayed here) will show that the minimum is reached when $(x_1, x_2) = (1,1)$.