If you’re like me, you’ve heard a lot about Gradient Descent. You’ve heard that it is a foundational algorithm for optimising functions, one which every self-respecting Machine Learning technique employs (or some variation of it). You’ve probably even seen graphs like these:

which show how the current point is iteratively updated until an optimum (maximum or minimum) of a function is reached. These are all great intuitive explanations of how the method works but, based on them alone, I never understood, step by step, how the method actually worked.

In this post, I explore the algorithm itself (Cauchy’s variation) and perform an example “by hand”. I will focus on minimising a function but the same intuition applies to maximisation.

Steepest Descent

Suppose we have a multivariate function f(x_1, x_2,…,x_n) and we want to find a point \mathbf{x}^* = (x_1, …, x_n) at which the value of f(x_1, x_2,…,x_n) is minimised. Assuming that a minimum exists, we can use Steepest Descent to discover it algorithmically, starting at some arbitrary point \mathbf{x}.

But how do we know we’ve reached a minimum? This is where we make use of the Gradient vector, which is denoted by \nabla f(\mathbf{x}) and is defined as

(1) \nabla f(\mathbf{x}) = [\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, …,\frac{\partial f(\mathbf{x})}{\partial x_n}]

The components of the Gradient vector are the partial derivatives of the function with respect to each variable x_1,…,x_n. If all of these partial derivatives are 0 at some point \mathbf{x}

$$\frac{\partial f(\mathbf{x})}{\partial x_1} = \frac{\partial f(\mathbf{x})}{\partial x_2} = … = \frac{\partial f(\mathbf{x})}{\partial x_n} = 0$$

then the point \mathbf{x} is a stationary point. In this post I will assume we have a convex function, so the stationary point is a minimum – there are other ways to test this but they are outside the scope of the post.
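As a quick illustration, the gradient of the two-variable function used later in this post’s worked example can be computed symbolically in base R with `D()`, and evaluated at a candidate point to check stationarity (this is my own sketch, not part of the post’s implementation below):

```r
# Symbolic partial derivatives of f(x1, x2) = x1^2 + x1*x2 + x2^2 - 3*x1 - 3*x2 + 3
f <- expression(x1^2 + x1 * x2 + x2^2 - 3 * x1 - 3 * x2 + 3)
grad_x1 <- D(f, "x1")   # mathematically 2*x1 + x2 - 3
grad_x2 <- D(f, "x2")   # mathematically x1 + 2*x2 - 3

# Evaluate the gradient at the candidate point (1, 1)
x1 <- 1; x2 <- 1
c(eval(grad_x1), eval(grad_x2))  # both components are 0, so (1, 1) is stationary
```

Both partial derivatives vanish at (1, 1), which is exactly the stationarity condition above.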

Steepest Descent proceeds by the following equation

(2) \mathbf{x}_{n+1} = \mathbf{x}_n + \lambda^* \nabla f(\mathbf{x}_n)

What (2) tells us is that if we are at some point \mathbf{x}_n, we can determine the next point by moving along the gradient vector at \mathbf{x}_n. How far we go along this vector is determined by a step-size parameter denoted \lambda^*. Note that when minimising, the optimal \lambda^* turns out to be negative, so we in fact move against the gradient – the direction of steepest descent. (Why we move along the gradient vector has been covered in this post.)

We continue in this way until we reach a point whose gradient (which is a vector, not a scalar) is the zero vector, or until some stopping criterion is met (convergence to a minimum can be slow, so a user may decide when a point is “good enough”).

Now all that’s left to do is determine what value the step-size \lambda should take. In Machine Learning, this parameter is user-defined, so it can be set manually – it tends to be a small number like 0.01. In Cauchy’s approach, however, the parameter is determined by a univariate optimisation at every iteration. The univariate optimisation method I will use in this post is differentiation and equating to 0.
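The univariate step-size problem can also be solved numerically rather than by hand. As a minimal sketch (the toy function here is my own example, not from the post), base R’s `optimize()` minimises a one-variable function over an interval:

```r
# Numerical alternative to "differentiate and equate to 0":
# minimise a univariate function of lambda over a bracketing interval.
g <- function(lambda) (lambda - 2)^2 + 1   # toy quadratic with minimum at lambda = 2
lambda_star <- optimize(g, interval = c(-10, 10))$minimum
lambda_star   # approximately 2
```

Differentiating g by hand gives g'(\lambda) = 2(\lambda - 2) = 0, i.e. \lambda^* = 2, agreeing with the numerical result.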

For a two variable function, the general algorithm for using Steepest Descent is as follows:

  1. Choose an initial starting point

$$\mathbf{x_0} = (x_{1_0}, x_{2_0})$$

  2. Determine a search direction via the Gradient Vector

$$\nabla f(\mathbf{x}) = [\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}]$$

  3. Substitute the starting point \mathbf{x_0} = (x_{1_0}, x_{2_0}) into the gradient vector to obtain \nabla f(\mathbf{x_0}) = (g_1, g_2)
  4. Perform a univariate optimisation to obtain the optimal step-size, \lambda^*

$$f(\lambda) = f[\mathbf{x_0} + \lambda \nabla f(\mathbf{x_0})]$$
$$f(\lambda) = f[(x_{1_0}, x_{2_0}) + \lambda(g_1, g_2)]$$
$$f(\lambda) = f[x_{1_0} + \lambda g_1, x_{2_0} + \lambda g_2]$$

That is, substitute

$$x_1 = x_{1_0} + \lambda g_1$$

$$x_2 = x_{2_0} + \lambda g_2$$

into the function of interest

  5. Set f'(\lambda) = 0 and solve for \lambda; the solution is \lambda^*
  6. State the new point

$$\mathbf{x_1} = \mathbf{x_0} + \lambda^* \nabla f(\mathbf{x_0})$$

  7. Repeat until convergence

The approach generalises directly to n variables.
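The steps above can be sketched as a short R function. This is my own minimal implementation, not the one given later in this post: it uses `optimize()` for the step-size instead of hand differentiation, and the bracketing interval c(-10, 10) is an arbitrary choice:

```r
# Steepest descent with an exact line search at every iteration
steepest_descent <- function(f, grad, x0, tol = 1e-6, max_iter = 1000) {
  x <- x0
  for (i in seq_len(max_iter)) {
    g <- grad(x)
    if (sqrt(sum(g^2)) < tol) break              # stop at a (numerically) zero gradient
    phi <- function(lambda) f(x + lambda * g)    # univariate function of the step-size
    lambda_star <- optimize(phi, interval = c(-10, 10))$minimum
    x <- x + lambda_star * g                     # update rule, equation (2)
  }
  list(minimum = x, iterations = i)
}

# Applied to the worked-example function, starting from (1, 0)
f    <- function(x) x[1]^2 + x[1] * x[2] + x[2]^2 - 3 * x[1] - 3 * x[2] + 3
grad <- function(x) c(2 * x[1] + x[2] - 3, x[1] + 2 * x[2] - 3)
res  <- steepest_descent(f, grad, c(1, 0))
res$minimum   # approximately (1, 1)
```

The loop stops either when the gradient norm is below `tol` or after `max_iter` iterations, matching the “repeat until convergence” step.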

Worked Example

Let’s use the steepest descent method to find the minimum of the two variable function
$$ f(x_1,x_2) = x_1^2 + x_1 x_2 + x_2^2 - 3x_1 - 3x_2 + 3$$
I will work through one iteration; the reader can replicate the steps until the true minimum at (1,1) is found.

  1. Choose an initial starting point.  Arbitrarily I select

$$ \mathbf{x}_0 = (1,0)$$

  2. Determine a search direction via the Gradient Vector

$$\nabla f(x_1,x_2) = [\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}]$$

$$\nabla f(x_1,x_2) = [2x_1 + x_2 - 3 , x_1 + 2x_2 - 3]$$

  3. Substitute the starting point $$\mathbf{x_0} = (x_{1_0}, x_{2_0}) = (1,0) $$ into the gradient vector $$\nabla f(\mathbf{x_0}) = \nabla f(x_{1_0}, x_{2_0})$$

$$ \nabla f(\mathbf{x}_0) = \nabla f(1,0) = (-1,-2)$$

This tells us the direction to move in.

  4. Perform a univariate optimisation to obtain the optimal step-size, \lambda^*

Find \lambda to minimise

f(\lambda) = f[\mathbf{x_0} + \lambda \nabla f(\mathbf{x_0})]

f(\lambda) = f[(1,0) + \lambda(-1,-2)]

f(\lambda) = f[1 – \lambda, -2\lambda] = 7\lambda^2 + 5\lambda + 1

  5. Set f'(\lambda) = 0 and solve for \lambda to obtain \lambda^*

Setting f'(\lambda)= 0

f'(\lambda) = 14\lambda + 5 = 0

\lambda^* = -\frac{5}{14}

  6. State the new point  \mathbf{x_{n+1}} = \mathbf{x_{n}} + \lambda^* \nabla f(\mathbf{x_{n}})

\mathbf{x_1} = (1,0) - \frac{5}{14}(-1,-2) = (\frac{19}{14},\frac{5}{7})

Check \nabla f(\frac{19}{14},\frac{5}{7})
$$ \nabla f(x_1 = \frac{19}{14}, x_2 = \frac{5}{7}) = (\frac{3}{7},-\frac{3}{14})$$

So we’re not at the minimum yet – the next step is to repeat the process shown. Doing so produces a sequence of points converging to the true optimum (1,1).
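The first hand-computed iteration can be checked numerically. This is a sketch of my own, using `optimize()` in place of the hand differentiation, with the bracketing interval c(-1, 1) chosen for convenience:

```r
# Verify the first steepest descent iteration for the worked example
f    <- function(x) x[1]^2 + x[1] * x[2] + x[2]^2 - 3 * x[1] - 3 * x[2] + 3
grad <- function(x) c(2 * x[1] + x[2] - 3, x[1] + 2 * x[2] - 3)

x0  <- c(1, 0)
g0  <- grad(x0)                     # (-1, -2), as computed by hand
phi <- function(l) f(x0 + l * g0)   # equals 7*l^2 + 5*l + 1

lambda_star <- optimize(phi, interval = c(-1, 1))$minimum  # approximately -5/14
x1 <- x0 + lambda_star * g0         # approximately (19/14, 5/7)
```

The numerical results agree with the hand calculation: \lambda^* \approx -0.357 = -5/14 and \mathbf{x}_1 \approx (1.357, 0.714) = (19/14, 5/7).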

Implementation in R

One implementation of a two-variable Gradient Descent process is given in the R code below. I have excluded the univariate optimisation for the step-size and instead fixed it as a user-defined parameter throughout the process.

Since the code below is a user-defined function, there are some obvious limitations:

  1. The function is slow 
  2. It can be misused; if the obj argument is set to “max” for the given function, the while loop can go on forever!
  3. A tolerance must be set – the algorithm may take a very long time to converge to a point where the gradient vector is the zero vector.
  4. The input function argument must be entered as a character string.

Nonetheless the results can be instructive to see how the algorithm works. 

# Gradient Descent Summary for two variables - user defined function

library(tibble)  # tibble()
library(dplyr)   # if_else(), near(), bind_rows()

grad_descent_two_var <- function(input_function, obj, x_initial, y_initial, lambda = 0.1, tolerance = .Machine$double.eps^0.5) {
  
  # Helper: evaluate the partial derivative of func with respect to var at (x, y)
  apply_deriv <- function(x, y, func, var) {
    if (!is.expression(func)) stop("function must be entered as an expression")
    eval(D(func, var), list(x = x, y = y))
  }
  apply_deriv_x <- function(x, y, func) apply_deriv(x, y, func, "x")
  apply_deriv_y <- function(x, y, func) apply_deriv(x, y, func, "y")
  
  # Create a dataframe with information about the initial point
  initial_df <- tibble(func = input_function,
                       obj = obj,
                       x1 = x_initial,
                       x2 = y_initial,
                       stepsize = if_else(obj == "max", lambda, -lambda),
                       grad_x1 = apply_deriv_x(x1, x2, parse(text = func)),
                       grad_x2 = apply_deriv_y(x1, x2, parse(text = func)),
                       x1new = x1 + (stepsize * grad_x1),
                       x2new = x2 + (stepsize * grad_x2))
  
  # Iterate using the gradient descent approach until the gradient is (numerically) zero
  while (!near(x = sum(abs(initial_df$grad_x1[nrow(initial_df)]), abs(initial_df$grad_x2[nrow(initial_df)])), y = 0, tol = tolerance)) {
    
    newrow <- tibble(func = input_function,
                     obj = obj,
                     x1 = initial_df$x1new[nrow(initial_df)],
                     x2 = initial_df$x2new[nrow(initial_df)],
                     stepsize = if_else(obj == "max", lambda, -lambda),
                     grad_x1 = apply_deriv_x(x1, x2, parse(text = func)),
                     grad_x2 = apply_deriv_y(x1, x2, parse(text = func)),
                     x1new = x1 + (stepsize * grad_x1),
                     x2new = x2 + (stepsize * grad_x2))
    
    # bind_rows() appends the new iterate; union() could silently drop or reorder rows
    initial_df <- bind_rows(initial_df, newrow)
    
  }
  
  algo_summary <- tibble(num_iter = nrow(initial_df),
                         optim_x1 = initial_df$x1[nrow(initial_df)],
                         optim_x2 = initial_df$x2[nrow(initial_df)])

  return(list(iterates = initial_df, iteration_summary = algo_summary))
  
}

# Save the results (a list of two dataframes)
results <- grad_descent_two_var(input_function = "x^2 + (x*y) + y^2 - (3*x) - (3*y) + 3", obj = "min", x_initial = 1, y_initial = 0)

# Show how the values of (x1, x2) vary throughout the process
View(results$iterates)
View(results$iteration_summary)

Applying the R function above to the function $$f(x_1,x_2) = x_1^2 + x_1 x_2 + x_2^2 - 3x_1 - 3x_2 + 3$$ gives an output table which looks like this:

The final row in the above table (not displayed here) will show that the minimum is reached when (x_1, x_2) = (1,1).

Reference

  1. Image was found at https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html
