The Naive Bayes algorithm is a simple and elegant approach for tackling supervised learning problems in Machine Learning. This post is a brief introduction to the algorithm; I will focus on the theoretical aspects of the method and leave practical implementations in code to a subsequent post.

Bayes Theorem

Before beginning the discussion of the classification algorithm, let’s refresh our understanding of Bayes Theorem. Let Pr(Y|X) denote the conditional probability of observing the random variable Y given that the random variable X takes a particular value. More formally,

(1) Pr(Y|X) = \frac{Pr(X,Y)}{Pr(X)} \rightarrow Pr(X,Y) = Pr(Y|X)Pr(X)

Since the joint probability can equally be written as Pr(X,Y) = Pr(X|Y)Pr(Y), equating the two expressions and dividing by Pr(X) leads to Bayes Theorem

(2) Pr(Y|X) = \frac{Pr(X|Y)Pr(Y)}{Pr(X)}

Using the law of total probability we can rewrite Bayes Theorem as 

(3) Pr(Y|X) = \frac{Pr(X|Y)Pr(Y)}{\sum_{i=1}^{k}Pr(X|y_i)Pr(y_i)}

where y_1, …, y_k are the possible values taken by Y.
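
As a quick numerical illustration, the minimal Python sketch below (the probabilities are made up purely for this example) evaluates identity (3) for a class variable with two possible values.

```python
# A hypothetical binary class variable Y with values y1, y2
# (the numbers are made up purely for illustration).
prior = {"y1": 0.3, "y2": 0.7}       # Pr(Y = y_i)
likelihood = {"y1": 0.8, "y2": 0.1}  # Pr(X = x | Y = y_i) for a fixed x

# Law of total probability: Pr(X = x) = sum_i Pr(X = x | y_i) Pr(y_i)
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Bayes Theorem, identity (3)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}
print(posterior)  # the posteriors sum to 1
```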

Using Bayes Theorem for Classification

Given a set of explanatory variables or features \textbf{x}, we want to compute the probability that a given observation belongs to a class label Y = y. We represent this by Pr(y|\textbf{x}), which is known as the posterior probability of the target class. Algorithms that compute probabilities in this manner fall into a class of methods called generative classification models. We can represent the posterior probability using Bayes Theorem

(4) Pr(y|\textbf{x}) = \frac{Pr(\textbf{x}|y)Pr(y)}{Pr(\textbf{x})}

The term Pr(\textbf{x}|y) is known as the class conditional probability of the features \textbf{x} given a class label Y = y. If \textbf{x} belongs to class Y = y, then we should expect Pr(\textbf{x}|y) to be high. The term Pr(y) is known as the prior probability, which represents our prior beliefs about the distribution of the class labels Y, independent of the observed feature values. The term Pr(\textbf{x}) is known as the probability of evidence and acts as a normalising constant which ensures that the outcome of identity (4) lies between 0 and 1, Pr(y|\textbf{x}) \in [0,1].

Calculating Class Conditional Probabilities

Bayes theorem provides a convenient way to produce predicted class probabilities by combining prior beliefs with the likelihood of obtaining the observed feature values. During the training phase, the algorithm must learn the parameters Pr(y) and Pr(\textbf{x}|y).

  1. The prior probability Pr(y) can be estimated from training data by calculating the fraction of training observations belonging to each class Y=y
  2. The class conditional probability Pr(\textbf{x}|y) can be estimated from training data by calculating the fraction of training observations for a given class for every possible combination of feature values. More detail is given below.
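
A minimal sketch of point 1, estimating the prior from class counts, is given below (the labels used here are hypothetical); point 2 is illustrated in the worked example that follows.

```python
from collections import Counter

# Hypothetical class labels from a training set (for illustration only).
y_train = [0, 0, 1, 0, 1, 0, 0]

# Prior Pr(Y = y): fraction of training observations in each class.
counts = Counter(y_train)
prior = {y: count / len(y_train) for y, count in counts.items()}
print(prior)  # {0: 0.714..., 1: 0.285...}
```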

Suppose there are 2 features X_1 and X_2 which can take discrete values c_1,c_2,…,c_k. Let n^{0} denote the number of training observations belonging to class 0 (Y = 0); out of these observations, n^{0}_{ij} have X_1 = c_i and X_2 = c_j.

The class conditional probability is calculated as 

(5) Pr(X_1 = c_i, X_2 = c_j| Y = 0) = \frac{n^{0}_{ij}}{n^{0}}

As an illustration, let a set of training observations be 

| Y | X_1 | X_2 |
|---|-----|-----|
| 0 | 1 | 1 |
| 0 | 1 | 2 |
| 0 | 2 | 3 |
| 0 | 1 | 2 |
| 0 | 2 | 3 |

The class conditional probability for X_1 = 1, X_2 = 2 given Y = 0 is 

$$Pr(X_1 = 1, X_2 = 2 | Y = 0) = \frac{Pr(X_1 = 1, X_2 = 2, Y = 0)}{Pr(Y = 0)}$$

$$Pr(X_1 = 1, X_2 = 2| Y = 0) = \frac{Number\ of\ observations\ where\ X_1 = 1, X_2 = 2, Y = 0}{Number\ of\ observations\ where\ Y = 0} = \frac{2}{5}$$
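
A minimal Python sketch of this counting estimate, using the five training rows from the table above (the helper name is my own choice for illustration):

```python
# The five training rows from the table above, as (Y, X1, X2) tuples.
rows = [(0, 1, 1), (0, 1, 2), (0, 2, 3), (0, 1, 2), (0, 2, 3)]

def class_conditional(rows, y, x1, x2):
    """Estimate Pr(X1 = x1, X2 = x2 | Y = y) by counting, as in identity (5)."""
    in_class = [r for r in rows if r[0] == y]
    matching = [r for r in in_class if r[1] == x1 and r[2] == x2]
    return len(matching) / len(in_class)

print(class_conditional(rows, y=0, x1=1, x2=2))  # 2/5 = 0.4
```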

Naive Bayes Assumption

As the number of features in a dataset increases, there is a corresponding growth in the number of feature value combinations. This makes the class conditional probability calculations for large datasets computationally intensive. As a result, the Naive Bayes classifier makes a simplifying assumption which assists in obtaining reliable estimates of class conditional probabilities, even when the number of features is large. This simplifying assumption is called conditional independence.

Conditional independence allows the class conditional probability of all features \textbf{x} to be factored into a product of class conditional probabilities of each individual feature x_i

(6) Pr(\textbf{x}|y) = \prod_{i=1}^{d} Pr(x_i|y)
where every data observation \textbf{x} consists of d features \{x_1,…,x_d\}.

The assumption implies that the features in the dataset are influenced only by the target class and not by the other features. If the class label is known, then the features can be thought of as independent of each other.

Conditional Independence 

To understand the mathematical details of conditional independence we look at the algebraic definition more closely. Let \textbf{X}_1, \textbf{X}_2, \textbf{Y} denote 3 vectors of random variables. We say the variables in \textbf{X}_1 are conditionally independent of \textbf{X}_2 given \textbf{Y} if 

(7) Pr(\textbf{X}_1|\textbf{X}_2, \textbf{Y}) = Pr(\textbf{X}_1|\textbf{Y})

This implies that conditioned on \textbf{Y}, the distribution of \textbf{X}_1 is not influenced by the outcomes of \textbf{X}_2 and hence is conditionally independent of \textbf{X}_2. An alternate method of describing conditional independence is to consider the joint conditional probability 

$$Pr(\textbf{X}_1, \textbf{X}_2| \textbf{Y}) = \frac{Pr(\textbf{X}_1, \textbf{X}_2, \textbf{Y})}{Pr(\textbf{Y})}$$

$$Pr(\textbf{X}_1, \textbf{X}_2| \textbf{Y}) = \frac{Pr(\textbf{X}_1, \textbf{X}_2, \textbf{Y})}{Pr(\textbf{X}_2, \textbf{Y})} \times \frac{Pr(\textbf{X}_2, \textbf{Y})}{Pr(\textbf{Y})}$$

Note: 

i) Pr(\textbf{X}_1| \textbf{X}_2, \textbf{Y}) = \frac{Pr(\textbf{X}_1, \textbf{X}_2, \textbf{Y})}{Pr(\textbf{X}_2, \textbf{Y})}

ii) Pr(\textbf{X}_2 | \textbf{Y})= \frac{Pr(\textbf{X}_2 , \textbf{Y})}{Pr(\textbf{Y})}

$$Pr(\textbf{X}_1, \textbf{X}_2| \textbf{Y}) = Pr(\textbf{X}_1| \textbf{X}_2, \textbf{Y}) \times Pr(\textbf{X}_2| \textbf{Y})$$

Using the assumption of conditional independence (identity (7)) gives

(8) Pr(\textbf{X}_1, \textbf{X}_2| \textbf{Y}) = Pr(\textbf{X}_1|\textbf{Y}) \times Pr(\textbf{X}_2| \textbf{Y})

The identity (8) implies that the joint conditional probability of \textbf{X}_1 and \textbf{X}_2 given \textbf{Y} can be factored as the product of conditional probabilities of \textbf{X}_1 and \textbf{X}_2 (given Y) considered separately. Conditional Independence forms the basis of the Naive Bayes assumption. This assumption can be generalised to n variables.

Naive Bayes Classifier

The Naive Bayes classifier computes the posterior probability for a test observation \textbf{x} by using the identity 

(9) Pr(y|\textbf{x}) = \frac{Pr(y)\prod_{i=1}^{d} Pr(x_i|y)}{Pr(\textbf{x})}

Pr(\textbf{x}) is fixed for every y and only acts as a normalising constant to ensure that Pr(y|\textbf{x}) \in [0,1]; hence we can rewrite (9) as

(10) Pr(y|\textbf{x}) \propto Pr(y)\prod_{i=1}^{d}Pr(x_i|y)

The class label assigned to a test observation \textbf{x}, is the class Y=y that maximises the probability Pr(y)\prod_{i=1}^{d}Pr(x_i|y).
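
The following sketch shows identity (10) used as a prediction rule; the prior and conditional probability tables here are hypothetical placeholders, not values learned from the loan data below.

```python
# Hypothetical learned parameters for a two-feature problem (illustration only).
prior = {"yes": 0.3, "no": 0.7}
# cond[y][i][value] approximates Pr(X_i = value | Y = y)
cond = {
    "yes": [{"a": 0.2, "b": 0.8}, {"low": 0.5, "high": 0.5}],
    "no":  [{"a": 0.6, "b": 0.4}, {"low": 0.9, "high": 0.1}],
}

def predict(x):
    """Return the class maximising Pr(y) * prod_i Pr(x_i | y), as in (10)."""
    scores = {}
    for y, p in prior.items():
        for i, value in enumerate(x):
            p *= cond[y][i][value]
        scores[y] = p
    return max(scores, key=scores.get), scores

print(predict(["a", "high"]))  # ('no', ...) for these made-up numbers
```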

Worked Example – Predicting Loan Default

Consider the training data 

| Default Borrower (Y) | Home Owner (X_1) | Marital Status (X_2) | Annual Income (X_3, in thousands) |
|---|---|---|---|
| No | Yes | Single | 125 |
| No | No | Married | 100 |
| No | No | Single | 70 |
| No | Yes | Married | 120 |
| Yes | No | Divorced | 95 |
| No | No | Married | 60 |
| No | Yes | Divorced | 220 |
| Yes | No | Single | 85 |
| No | No | Married | 75 |
| Yes | No | Single | 90 |

Before we calculate probabilities for the above dataset, we mention how the Naive Bayes classifier deals with different types of feature variables. 

Dealing with Categorical Features

For a categorical feature X_i, the conditional probability Pr(X_i = c| y) is estimated by the ratio

(11) Pr(X_i = c|y) = \frac{n_c}{n}

where n is the number of observations belonging to class Y=y and n_c is the number of those observations that also have X_i = c. For instance, suppose we wish to compute the class conditional probability

$$Pr(X_2 = Single| Y = Yes) = \frac{Number\ of\ observations\ where\ X_2 = Single\ and\ Y = Yes}{Number\ of\ observations\ where\ Y = Yes} = \frac{2}{3}$$
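
A minimal sketch of this ratio estimate on the loan data from the table above (the tuple layout and helper name are my own choices for illustration):

```python
# The loan training data from the table above, as
# (home_owner, marital_status, income, default) tuples.
train = [
    ("Yes", "Single",   125, "No"),
    ("No",  "Married",  100, "No"),
    ("No",  "Single",    70, "No"),
    ("Yes", "Married",  120, "No"),
    ("No",  "Divorced",  95, "Yes"),
    ("No",  "Married",   60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No",  "Single",    85, "Yes"),
    ("No",  "Married",   75, "No"),
    ("No",  "Single",    90, "Yes"),
]

def categorical_conditional(feature_index, value, label):
    """Estimate Pr(X_i = value | Y = label) as n_c / n, identity (11)."""
    in_class = [r for r in train if r[3] == label]
    matching = [r for r in in_class if r[feature_index] == value]
    return len(matching) / len(in_class)

print(categorical_conditional(1, "Single", "Yes"))  # 2/3
```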

Dealing with Continuous Features

For continuous features we can either discretise the values and treat them as categorical features, or we can assume a certain probability distribution for the continuous variable and estimate the parameters of the distribution from the training data. It is usually better to represent the continuous variable by a probability distribution, since discretisation can be somewhat arbitrary and requires more user input; results can change based on the analyst’s choice of discrete banding. In practice, the Normal Distribution is commonly used to represent the conditional probability of continuous features.

The Normal Distribution is characterised by 2 parameters: the mean \mu and the variance \sigma^{2}. For each class value Y = y_j, the class conditional probability for a feature X_i is given by the probability density function (pdf) of the Normal Distribution

(12) Pr(X_i = x_i|Y = y_j) = \frac{1}{\sqrt{2\pi}\sigma_{ij}}\exp\left[-\frac{(x_i - \mu_{ij})^{2}}{2\sigma_{ij}^{2}}\right]

The mean \mu_{ij} can be estimated by the sample mean \bar{x}_{ij} of X_i over all training observations that belong to class y_j. The variance \sigma_{ij}^2 can be estimated by the sample variance s^2 over the same observations.

For instance, when looking at the annual income feature X_3 we calculate the sample mean, sample variance and sample standard deviation when the response variable is No (Y = No).

a) \bar{x} = \frac{125 + 100 + 70 + 120 + 60 + 220 + 75}{7} = 110

b) s^2 = \frac{(125-110)^2 + (100-110)^2 + … + (75-110)^2}{6} = 2975

c) s = \sqrt{2975} = 54.54

Given a test observation with income $120,000, we can compute the conditional probability 

$$Pr(X_3 = 120 | Y = No) = \frac{1}{\sqrt{2\pi} \times 54.54}exp[-\frac{(120 – 110)^{2}}{2 \times 2975}] = 0.0072$$
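
A short sketch that reproduces these three quantities and the Normal density value above (plain Python, standard library only):

```python
import math

# Annual incomes (in thousands) of the training observations with Y = No.
incomes_no = [125, 100, 70, 120, 60, 220, 75]

n = len(incomes_no)
mean = sum(incomes_no) / n                                # 110
var = sum((x - mean) ** 2 for x in incomes_no) / (n - 1)  # 2975 (sample variance)
sd = math.sqrt(var)                                       # ~54.54

def normal_pdf(x, mu, sigma):
    """Normal density used as the class conditional for a continuous feature, identity (12)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(normal_pdf(120, mean, sd))  # ~0.0072
```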

Discretising Continuous Features

Discretising numeric features means that numbers are put into categories known as bins. The most common way to discretise a feature is to explore the data for natural categories or cut points in the distribution of the data. For instance, if we had a feature for hours of the day, we could bin those times into quarters of the day (00:00 – 06:00, 06:00 – 12:00, and so on). This divides the numeric data into levels of a new nominal feature which can be used with Naive Bayes. If there are no obvious cut points, one option is to discretise the feature using the quantiles of the data.

Discretising a numeric feature always results in a reduction of information because the feature’s original granularity is reduced to a smaller number of categories. There is a balance to strike: too few bins can result in important trends being obscured, while too many bins can result in small counts in the Naive Bayes frequency table, which increases the algorithm’s sensitivity to noisy data.
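
As one possible approach, the sketch below bins the annual income feature into quartiles using quantile cut points; the bin labels Q1 to Q4 are arbitrary names chosen for illustration.

```python
import bisect
import statistics

# Annual incomes (in thousands) from the loan table above.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

# Three interior cut points give four quartile bins.
cuts = statistics.quantiles(incomes, n=4)

def to_bin(x):
    """Map a numeric value to a nominal level 'Q1'..'Q4'."""
    return f"Q{bisect.bisect_right(cuts, x) + 1}"

print(cuts)
print([to_bin(x) for x in incomes])
```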

Making Predictions

As an example, we are interested in predicting the class label of a test observation with feature vector

$$\textbf{x} = (X_1 = No, X_2 = Married, X_3 = 120)$$

In plain English, we have an individual who is not a homeowner, is married and has an annual income of $120,000. We want to use Naive Bayes to tell us whether this individual will be classified as a Default Borrower or not.

A class label will be attributed based on the probability

$$Pr(y|\textbf{x}) \propto Pr(y)\prod_{i=1}^{d}Pr(x_i|y)$$

The class which maximises the above probability is the class which will be attributed to the test observation \textbf{x}.

Computing the prediction class proceeds as follows:

1). Compute Prior Probabilities 

(a) Pr(Y = Yes) = \frac{3}{10} = 0.3
(b) Pr(Y = No) = \frac{7}{10} = 0.7

2). Compute independent class conditional probabilities 

(a) Pr(\textbf{x}| Y = No) = Pr(X_1 = No, X_2 = Married, X_3 = 120| Y = No)

$$Pr(X_1 = No, X_2 = Married, X_3 = 120| Y= No)= Pr(X_1 = No | Y = No) \times Pr(X_2 = Married | Y = No) \times Pr(X_3 = 120| Y = No)$$

$$Pr(X_1 = No | Y = No) \times Pr(X_2 = Married | Y = No) \times Pr(X_3 = 120| Y = No) = \frac{4}{7} \times \frac{4}{7} \times 0.0072 = 0.0024$$

(b) Pr(\textbf{x}| Y = Yes) = Pr(X_1 = No, X_2 = Married, X_3 = 120| Y = Yes)

$$Pr(X_1 = No, X_2 = Married, X_3 = 120| Y = Yes)= Pr(X_1 = No | Y = Yes) \times Pr(X_2 = Married | Y = Yes) \times Pr(X_3 = 120| Y = Yes)$$

$$Pr(X_1 = No | Y = Yes) \times Pr(X_2 = Married | Y = Yes) \times Pr(X_3 = 120| Y = Yes) = \frac{3}{3} \times \frac{0}{3} \times (1.2 \times 10^{-9}) = 0$$

(Here Pr(X_3 = 120|Y = Yes) \approx 1.2 \times 10^{-9} follows from the Normal pdf with sample mean 90 and sample standard deviation 5, estimated from the Y = Yes incomes.)

3). Estimate posterior probabilities 

(a) Pr(Y = No |\textbf{x}) = \frac{Pr(Y = No)Pr(\textbf{x}|Y=No)}{Pr(\textbf{x})} = \frac{0.7 \times 0.0024}{Pr(\textbf{x})} = 0.0016\alpha, where \alpha = \frac{1}{Pr(\textbf{x})}

(b) Pr(Y = Yes | \textbf{x}) =\frac{Pr(Y = Yes)Pr(\textbf{x}|Y=Yes)}{Pr(\textbf{x})} = \frac{0.3 \times 0}{Pr(\textbf{x})} = 0

Since Pr(Y = Yes|\textbf{x}) < Pr(Y = No| \textbf{x}), the observation is assigned the class label Y = No.

The above result implies that, given the training data in the table above, the probability of our test individual not being a Default Borrower (Default Borrower = No) is higher than the probability of them being one. As a result, the algorithm classifies the individual as not being a Default Borrower.

Note: We ignore the normalising constant Pr(\textbf{x}) = \frac{1}{\alpha}. It can be calculated from the training data using the law of total probability, but it does not affect the final decision because it scales both posteriors equally; its only role is to ensure they fall between 0 and 1.
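
To make the role of \alpha concrete, the sketch below normalises the unnormalised scores from steps 1 and 2; the evidence Pr(\textbf{x}) is recovered as their sum via the law of total probability.

```python
# Unnormalised scores Pr(y) * Pr(x | y) from steps 1 and 2 above.
score = {
    "No":  0.7 * 0.0024,
    "Yes": 0.3 * 0.0,
}

# The evidence Pr(x) is the sum of the unnormalised scores (law of total
# probability), so alpha = 1 / Pr(x) simply rescales them to sum to 1.
evidence = sum(score.values())
posterior = {y: s / evidence for y, s in score.items()}

print(posterior)                          # {'No': 1.0, 'Yes': 0.0}
print(max(posterior, key=posterior.get))  # 'No'
```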

Handling Zero Conditional Probabilities

It is possible that some combination of feature values and class labels is never observed in the training data, which results in a class conditional probability of zero (as seen above). To address this, we adjust the conditional probability estimates so that they never yield exactly zero and are hence more robust.

Estimates of conditional probabilities include

  1. Laplace estimate: Pr(X_i = c| y) = \frac{n_c + 1}{n + v} (a short sketch follows the definitions below)
  2. M-estimate: Pr(X_i = c| y) = \frac{n_c + mp}{n + m}

where 

  • n is the number of training observations of class y
  • n_c is the number of training observations with X_i = c and Y = y
  • v is the total number of feature values X_i can take
  • p is the initial estimate of Pr(X_i = c| y) which represents a prior belief
  • m is a hyperparameter indicating the confidence placed in the prior estimate p when the fraction of training observations is small.
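
A minimal sketch of the Laplace estimate, applied to the zero probability encountered above (assuming the marital status feature can take v = 3 values):

```python
def laplace_estimate(n_c, n, v):
    """Laplace-smoothed estimate of Pr(X_i = c | y): (n_c + 1) / (n + v)."""
    return (n_c + 1) / (n + v)

# From the loan data: Pr(X_2 = Married | Y = Yes) is 0/3 = 0 without smoothing;
# with v = 3 marital status values it becomes (0 + 1) / (3 + 3).
print(laplace_estimate(n_c=0, n=3, v=3))  # 0.1666...
```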

References

  1. Machine Learning with R. Brett Lantz.
  2. Introduction to Data Mining. P.-N. Tan, M. Steinbach, A. Karpatne, V. Kumar.
