Introduction

A very interesting paper was brought to my attention which proposes an adjustment to the traditional Stochastic Gradient Descent approach called Oddball Stochastic Gradient Descent. I want to use this blogpost to describe how this adjustment works with some of my own commentary – much of the notation is not in the original work so I’m open to revising how it’s presented.  This post presumes a high level knowledge of Gradient Descent and its use in parameter updating for Machine Learning models – if you’d like a refresher, I’ve written about the methods in previous posts here and here.

The original paper is very accessible and easy to read so I’ve referenced it at the end. Please consult that for details and an experiment which shows the method in action and compares it with other stochastic gradient descent variations.

Stochastic Gradient Descent

Before speaking about the Oddball variation, let’s briefly review how Stochastic Gradient Descent (SGD) works. For large datasets, traditional gradient descent can be very slow. For a training dataset of size n, gradient descent performs n computations for every model parameter that must be optimised, at each training iteration.

With Stochastic Gradient Descent (SGD), in each training iteration, a sample of the training observations (strictly speaking, only 1 observation is sampled) are taken and the model parameters are optimised based on this sample. Although SGD is defined strictly to sample 1 observation from the training data per iteration. It is more common to select a small subset of data – called a mini-batch, from the training data per iteration. The number of data items in the mini-batch is called the batch size. A larger batch size means we will get a more accurate and stable estimate of the gradients from the loss function, but it will take longer.

Oddball Stochastic Gradient Descent (Oddball SGD)

SGD can be statistically biased so that certain elements of the training set are learned more rapidly than others. A founding assumption of stochastic gradient descent (SGD) is that the learning necessary for each element of the training set should be ‘uniform’.

If the assumption of uniform learning does not hold, then some proportion of steps taken might be unnecessary or even counter productive. Thus, it might be useful to let a given Machine Learning model’s state of knowledge drive the SGD process.

For a given iteration t, the Oddball SGD process works in the following way:

To have an illustrative example in mind, assume we have a feed forward neural network classifier with a multi-class output with p classes over a training dataset of size n

  1. Run a feed forward network over the training dataset of n observations. We obtain n prediction vectors for each observation, denoted \mathbf{\hat{y}}_i, i = 1,…,n. Each prediction vector \mathbf{\hat{y}}_i has p elements where p is the number of output nodes in the final layer of the neural network. The element \hat{y}_{ij}, i=1,…,n, j=1,…,p represents the probability output prediction for observation i from output node j.
  2. For observation i, compute the absolute prediction error loss. This is done by computing the absolute value difference between the one-hot encoded (OHE) label vector \mathbf{y}_i and the network prediction vector \mathbf{\hat{y}}_i, denote this as \mathbf{d}_i = |\mathbf{y}_i – \mathbf{\hat{y}}_i|, i = 1,…,n
  3. Compute the L1 norm for each vector \mathbf{d}_i, ||\mathbf{d}_i||_1, to obtain a single scalar prediction error for each training observation i = 1,…,n. Denote these values by e_i, i = 1,…,n and collect them in an n \times 1 vector \mathbf{r} = [e_1,…, e_n]. \mathbf{r} is called the novelty vector.
  4. Normalise the vector \mathbf{r} so that all elements in \mathbf{r} sums to 1. Call this resultant vector – the selection probability vector, denote this by \mathbf{r}_N.
  5. To update model parameters via Oddball SGD, sample a training observation in a biased way – biased sampling means weighted sampling, where the selection probabilities over the training dataset are described by the vector \mathbf{r}_N. This means that the new training observation that will be used to update the model parameters via gradient descent will most likely be an observation with high novelty. 

Pseudo-Example

Let’s consider how this method might work for the classic MNIST handwriting digit classification problem. This is a pseudo-example where I won’t be using an actual model – the objective here is to see how the oddball SGD process would work given any set of model predictions.

Consider a Neural Network aimed at classifying digits 0 – 9 from the MNIST dataset, suppose we have a 10-unit Softmax output layer with 256 training observations. Each output layer node predicts a probability of the i-th training observation belonging to digit/label that it corresponds to. E.g. A probability from the output node corresponding to ’0’ means that this is the probability of the observation being a ’0’. Each label for a training observation is represented as a sparse vector or a OHE vector with a 1 denoting the correct label and 0 otherwise.

At each training iteration of oddball SGD, we begin with a feed forward pass over the 256 element training set. For each training observation, we compute the absolute prediction error across the 10-way output later and sum the result (computing the L1 norm). Thus for every training observation, we have a single value for prediction error – in our case, the number of prediction error values will be 256. 

To see this in detail for a single training observation – suppose a training observation i has a label of ’9’, this is represented by the OHE vector [0,0,0,0,0,0,0,0,0,1]^T = \mathbf{y}_i. Each position in the OHE vector represents each digit from 0 – 9.

Suppose the prediction for observation i is given by the following output node probabilities

Output Node Label\mathbf{y}_i\mathbf{\hat{y}}_i
000.10
100.05
200.05
300.01
400.01
500.01
600.05
700.01
800.01
910.70

The absolute error across these 10 output nodes is given below

Output Node Label\mathbf{y}_i\mathbf{\hat{y}}_i\mathbf{d}_i = |\mathbf{y}_i – \mathbf{\hat{y}}_i|
000.10|0-0.10| = 0.10
100.05|0-0.05| = 0.05
200.05|0-0.05| = 0.05
300.01|0-0.01| = 0.01
400.01|0-0.01| = 0.01
500.01|0-0.01| = 0.01
600.05|0-0.05| = 0.05
700.01|0-0.01| = 0.01
800.01|0-0.01| = 0.01
910.70|1-0.70| = 0.30

The sum of these output prediction errors is e_i = 0.60.

We take all the prediction errors for each training observation and hold it in a n \times 1 vector \mathbf{r} = [e_1,…,e_n]. The vector \mathbf{r} represents the state of novelty of each training observation.

We normalise the novelty vector so that all elements sum to 1, denote this as \mathbf{r}_N. This normalised vector is called the selection probability vector. The selection probability vector is then used to assign selection probabilities to each training observation – so that selection probability is proportional to the novelty.

During each iterative step of oddball SGD, one example of the training set is randomly selected according to the selection probabilities/ weights described in \mathbf{r}_N. This is in contrast to the traditional SGD method where the entire training set is used in random order for each full-sweep iteration.

References

  1. Original Paper: https://arxiv.org/pdf/1509.05765.pdf
  2. StatQuest Stochastic Gradient Descent Explainer: https://www.youtube.com/watch?v=vMh0zPT0tLI

3 Comments

  1. Hi, Junaid.

    I come from your linkedIn post. Thanks for sharing these interesting ideas. I have some questions for Oddball SGD:
    1. Are the selection probability r_N got calculated only once at the beginning of training, or they are updated during each epoch?
    2. Outliers tend to have larger absolute prediction error loss, which results in larger selection probability. Will this affect the performance of the model? Or recognition and deletion of outliers are quite essencial at the preprocessing stage if we want to use Oddball SGD.

    Please correct me if I misunderstand the method. I am looking forward to your reply.

    Thanks,

    Fang Lyu

    1. Hi Fang,

      Many thanks for reading. I’ve tried to answer your questions are below:

      1) I believe these selection probabilities in r_N are updated in each epoch. This way as the model begins to predict well on a subset of the data it can have a new idea of novelty and begin correcting for those errors. If the selection probabilities were the same every epoch, we would eventually select points that we perform well on which is the opposite to this idea.

      2) Yes outliers will affect the model significantly and give large absolute prediction error. In this case the model can get stuck trying to overfit to outliers. In a case like this I believe you would need to preprocess and handle outliers – you can also change the error function to see if that makes an improvement. There is no theoretical reason to strictly stick to the absolute prediction error loss, from what I gathered of reading the paper.

      I encourage you to read the original paper, the author has tried multiple types of SGD variations and dropout variations to tune performance. It’s likely that Oddball SGD with a mix of these methods will be most effective.

      Hope that helps!

      1. Hi Junaid. Thank you so much for your answers. Your answers make sense to me and I am going to read the original paper. Look forward to more technical blogposts from you.

Leave a Reply