Nearest neighbour algorithms classify unlabelled instances (data observations/ cases) by assigning them to a class which is the most similar found in the training data.

Nearest neighbours are well suited for classification tasks where the relationships among the features and response variable are complicated but items of similar class type tend to be fairly homogenous.

If the data is noisy and there is a lot of mixing between different classes then this implies that no clear distinction exists among groups; in cases like this the nearest neighbour algorithms may struggle to find class boundaries.

## The k-nn algorithm

One instantiation of the nearest neighbour approach is the k-nearest neighbours algorithm (k-nn).

The k-nn algorithm uses information about a given data observation’s k-nearest neighbours to classify unlabelled observations. $k$ is a variable term which is defined by the user and determines how many neighbouring observations can be used to contribute to the classification decision.

The algorithm requires a training dataset with labelled observations which have been classified into distinct groups.

For each unlabelled observation in the test dataset, the k-nn algorithm identifies $k$ records in the training data that are the “nearest” to the test observation in similarity.

The unlabelled test instance is assigned to the majority class of the $k$ nearest neighbours.

Consider a dataset of foods broken down by two features – sweetness and crunchiness. Each feature is rated on a scale of $1 – 10$

Each ingredient is labelled in 3 food types: fruits, vegetables or proteins.

The first few rows of a dataset may look like:

The k-nn algorithm treats the features as coordinates in a multi-dimensional feature space. Since we have 2 features – Sweetness($X_1$) and Crunchiness ($X_2$), we can take the feature space to be 2-dimensional.

A scatter plot of the data may look like:

From the graph above, there is a natural grouping of the data – similar types of food tend to be grouped closely together and cluster in different pockets of the feature space.

Suppose that we have a new ingredient that we want to classify using the nearest neighbour approach.

The question we want to ask is – which group label should tomato have?

## Measure Similarity

Locating a new data point’s nearest neighbours requires a distance function, or a measure of similarity between 2 instances or data points.

Traditionally, the k-nn algorithm uses the Euclidean distance which is defined by

(1) $dist(p,q) = \sqrt{(p_1 – q_1)^2 + (p_2 – q_2)^2 + … + (p_n – q_n)^2}$

where $p$ and $q$ are instances to be compared; each having $n$ features. Each $p_i$ and $q_i$ corresponds to the $i$-th feature value.

If we wanted to compute the distance between a tomato and a green bean in the dataset, we can use the formula

$p = tomato \rightarrow$ $sweetness = p_1 = 6$, $crunchiness = p_2 = 4$

$q = green\ bean \rightarrow$ $sweetness = q_1 = 3$, $crunchiness = q_2 = 7$

$dist(tomato, green\ bean) = \sqrt{(6 – 3)^2 + (4 – 7)^2} = \sqrt{9 + 9} = 4.2$

Similarly, we can compute the distance between a tomato and several of its closest neighbours.

If we assign the tomato to the single nearest neighbour – in this case $k=1$-nn classification, we would assign the tomato to the same label as orange because it has the smallest distance from the tomato.

If we use the k-nn algorithm with $k = 3$, we will perform a vote among the $k = 3$ nearest neighbours and take the majority class.  In this case the $k = 3$ nearest neighbours are orange, grape and nuts. Since the majority class in this set is fruit, we classify tomato as a fruit.

## Choosing an appropriate k

The choice for an appropriate value for $k$ is a classic balance between overfitting and under-fitting to the training data.

If we select $k$ to be the size of the entire training dataset $k = n$, then the most common class in the training data will always be chosen to label the unlabelled test data point.

Alternatively, if we set $k = 1$, then we allow potential outliers or noisy data points to influence classification of unlabelled data points. If an unlabelled data point happens to be near an outlier point – the likelihood of of it can be misclassified is higher. For instance, if a nut happens to be near a particularly sweet and crunchy banana, there is a chance that a nut may be classified as a fruit.

The best value of $k$ lies between these 2 extremes.

In practice, choosing $k$ depends on the complexity of the concept to be learned in the training data as well as the number of records in the training data.

One common practice recommended by textbooks is to initially set $k \approx \sqrt{N}$ where $N$ is the number of training examples.

Alternatively we can test various $k$ values on a variety of test datasets (cross-validation) and choose the one that delivers the best classification performance.

## Pre-processing data

Before we apply the k-nn algorithm to data we want to standardise our features.

This is because the distance formula is highly dependent on how features are measured.

If certain features have a much larger range of values than the other features, the distance measurements will be strongly dominated by features with larger ranges. It would be similar to having an axis on a coordinate axes whose value increments are much larger than the other axis.

The traditional method of feature rescaling is min-max normalisation.

Min-max normalisation scales a feature such that all values are bounded in the range $[0,1]$. The formula for normalisation of a feature is

(2) $X_{new} = \frac{X – min(X)}{max(X) – min(X)}$

$X_{new}$ can be interpreted as indicating where, on a scale of 0 percent to 100 percent, the original value falls along the range between the original minimum and maximum.

Alternatively we can use the z-score standardisation which is characterised by

(3) $X_{new} = \frac{X – \mu}{\sigma} = \frac{X – mean(X)}{stddev(X)}$

(3) rescales values in terms of how many standard deviations they fall above or below the mean value.

$X_{new}$ is also called the z-score and it falls in an unbounded range of negative and positive numbers.

One important practical point to note is that the same scaling method applied to features on the training data must also be applied to unlabelled examples.

## Similarity for Nominal Data

The Euclidean distance is not defined for nominal data.

To calculate the distance between nominal features, we have to convert them to a numeric format.

One strategy is to use dummy variables where 1 indicates one category and 0 indicates the other.

For instance, dummy variable coding for a gender variable is

Male = 1 if $X$= Male and Male = 0 if $X$ = Female

Extending this concept to an $n$-category nominal feature, we can dummy code by creating binary indicator variables (flags) for $(n-1)$ levels of the nominal feature.

Consider a 3 category temperature variable with categories hot, medium and cold. We could set this up like a sparse matrix by breaking the variable as $(n-1) = (3-1) =2$ features like:

Hot = 1 if $X$= Hot and Hot = 0 if $X$ = Not Hot

Medium = 1 if $X$= Medium and Medium = 0 if $X$ = Not Medium

If hot = 0 and medium = 0 then this indicates that the temperature is cold. We don’t need a third feature variable (hence another column in the data table) to symbolise the cold category.

Conveniently, dummy coding naturally places feature values on a 0-1 scale which is the same scale as the min-max normalised numeric data – hence no additional transformations is needed.

## K-nn is a lazy learner

Classification algorithms based on the nearest neighbour methods are considered lazy because no abstraction occurs (no function is learned via the estimation of model parameters).

All the k-nn algorithm does is compute distances between each training instance and recomputes this for every new unlabelled instance. These methods are also called instance-based learners or rote learners.

An instance-based learner does not build a model, no parameters are learned about the data via optimisation algorithms like Maximum Likelihood or Least Squares.

One one hand we do not generate theories or functions about the underlying data but alternatively the learner can still find natural patterns as opposed to trying to fit the data into a preconceived or biased functioned form.

## Wisconsin Breast Cancer – Worked Example

The following example is done in R. All that is needed to download the data and load the packages.

# Load packages
library(dplyr)
library(purrr)
library(caret)
library(class)
library(gmodels)
library(ggplot2)

# Load data - Wisconsin breast cancer data
# Data can be cloned from git at: https://github.com/stedy/Machine-Learning-with-R-datasets
# Dataset name is: "wisc_bc_data.csv"

wbcd <- read.csv(file = "~/developer/Machine-Learning-with-R-datasets/wisc_bc_data.csv", stringsAsFactors = FALSE)

# Drop the ID column

wbcd <- wbcd %>% select(-id)

# We want to take the diagnosis variable to be our target variable

table(wbcddiagnosis) # Add diagnosis as a factor wbcd.fct <- wbcd %>% mutate(diagnosis = factor(diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))) round(prop.table(table(wbcd.fctdiagnosis)) * 100, digits = 1)

# We want to normalize the data using min max normalization

normalize <- function(x) {
return( (x - min(x))/ (max(x) - min(x)) )
}

# Normalise features
wbcd.nrm.ft <- wbcd.fct %>%
select(-diagnosis) %>%
map(normalize)

# Attach target variable column
wbcd.nrm <- cbind.data.frame(wbcd.nrm.ft, diagnosis = wbcd.fct$diagnosis) # Visualise some example features # We take radius_mean and area_mean as an example ggplot(data = wbcd.nrm, aes(x = radius_mean, y = area_mean)) + geom_point(aes(shape = diagnosis, color = diagnosis)) # Split data into training and test sets. Split 80:20 set.seed(3456) trainIndex <- createDataPartition(wbcd.nrm$diagnosis, p = .8,
list = FALSE,
times = 1)

wbcd.train <- wbcd.nrm[ trainIndex,]
wbcd.test  <- wbcd.nrm[-trainIndex,]

wbcd_train <- wbcd.train %>% select(-diagnosis)
wbcd_test <- wbcd.test %>% select(-diagnosis)

# Extract labels
wbcd_train_labels <- wbcd.train$diagnosis wbcd_test_labels <- wbcd.test$diagnosis

# Train the model
## Fit the knn model with k = sqrt(n.train) = 21

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 21)

# Evaluate Model Performance

# Confusion Matrix
confusion.matrix <- CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)

test.error.rate <- (confusion.matrix$prop.tbl + confusion.matrix$prop.tbl) * 100 # False Positive + False Negative

# Improving Model Performance
# We could use a z transformation instead
# Or we can test alternative values of k

# Test alternative parameters of k. From k = 1 to k = 21
param <- seq(from = 1, to = sqrt(nrow(wbcd_train)), by = 1)
error <- list()
results <- data.frame()

for (i in seq_along(param)) {

mdl <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = param[[i]])

conf.mat <- CrossTable(x = wbcd_test_labels, y = mdl, prop.chisq = FALSE)

error[[i]] <- (conf.mat$prop.tbl + conf.mat$prop.tbl) * 100

results <- bind_rows(results, cbind.data.frame(k = param[[i]], test.error.rate = error[[i]]))
}

# Which k value gives the lowest test error rate

results %>%
slice(which.min(test.error.rate))

# We decide on k = 3 since it gives the smallest test error