Recently I’ve had occasion to use the bootstrap and have been reminded at what a remarkably powerful technique this is despite it’s simplicity. I thought I’d write a quick blogpost discussing the basics of the bootstrap as a reminder to myself if I’m ever in a statistical jam. I’ve modelled this blogpost on the chapter on bootstrapping found in the wonderful book by Maria L. Rizzo – Statistical Computing with R. It’s a must read for any applied statistician.

Bootstrap methods are a class of nonparametric Monte Carlo methods that estimate the distribution of a population by resampling. Resampling methods treat an observed sample as a finite population and random samples are resampled from it (with replacement) to estimate population characteristics and make inferences about the original sample’s population. This doesn’t seem like a sensible technique but works very well in practice.

The bootstrap generates random samples from the empirical probability distribution of the sample. Suppose that $\textbf{x} = (x_1,…,x_n)$ is an observed random sample from a distribution with an empirical cumulative distribution function (ecdf) $F(x)$. If $X^{*}$ is a resample from the original sample selected at random from $\textbf{x}$, then

$Pr(X^* = x_i) = \frac{1}{n}$, $i=1,…,n$

Resampling generates a random sample $X^*_1,…,X^*_n$ by sampling with replacement from the observed sample $\textbf{x}$. The random variables $X^*_i$ are independently identically distributed (i.i.d) and uniformly distributed on the set $\textbf{x} = \{x_1,…,x_n\}$. This means that when we resample from the original sample $\mathbf{x}$, each observation has an equal probability of being selected. The empirical distribution function $F_n(x)$ is an estimator for the cdf $F(x)$. $F_n(x)$ is the cdf of the random variable $X^*$.

In the bootstrap process, there are 2 approximations made:

1.  The empirical distribution function $F_n$ is an approximation to the cdf $F_X$

2.  The empirical distribution function $F^*_m$ of the bootstrap samples/ replicates is an approximation to the original sample empirical distribution function $F_n$.
$$F^{*}_{m} \rightarrow F_n \rightarrow F_{X}$$

Resampling from the sample $\textbf{x}$ is equivalent to generating random samples from the distribution $F_n(x)$

# The Bootstrap Procedure

To generate a bootstrap random sample by resampling the observed sample $\textbf{x}$, we first generate $n$ random integers $\{i_1,…,i_n\}$ uniformly distributed on $\{1,…,n\}$ and select the bootstrap sample $\textbf{x}^* = (x_{i1},…,x_{in})$. Suppose $\theta$ is the parameter of interest and $\hat{\theta}$ is an estimator of $\theta$. The bootstrap estimate of the distribution of $\hat{\theta}$ is obtained via the following procedure

1. For each bootstrap replicate, indexed $b=1,…,B$
1. Generate a sample $x^{*(b)} = x^*_1,…,x^*_n$ by sampling with replacement from the observed sample $\textbf{x} = (x_1,…,x_n)$.
2. Compute the $b$-th replicate $\hat{\theta^{(b)}}$ from the $b$-th bootstrap sample. The replicate is a statistic computed on the bootstrap sample, $\hat{\theta}^{(b)}(x^{*}_{1},…,x^{*}_{n}) = \hat{\theta}^{(b)}$
2. The bootstrap estimate of the cdf of the estimator $F_{\hat{\theta}}(.)$ is the empirical distribution of the replicates $\hat{\theta}^{(1)},…,\hat{\theta}^{(b)}$

Example 1: Suppose we have observed the sample

$$\textbf{x} = \{2,2,1,1,5,4,4,3,1,2\}$$

Resampling from $\textbf{x}$ we select 1, 2, 3, 4, 5 with probabilities 0.3, 0.3, 0.1, 0.2, 0.1 which gives the cdf $F^*_X$ of a randomly selected replicate. Incidentally, this is identical to the empirical distribution function $F_n(x)$

Note: If the empirical cdf $F_n$ is not close to the population cdf $F_X$ then the distribution of the bootstrap replicates will not be close to the population cdf $F_X$. The sample above is from a Poi(2) distribution (Poisson distribution with mean 2), so it can theoretically include a value of 0. Resampling from $\textbf{x}$ provides a good estimate for $F_n$ but not for $F_X$ because regardless of how many replicates are drawn, the bootstrap sample will never include a 0.

# Bootstrap Estimation of Standard Error

The bootstrap estimate of standard error of some estimator $\hat{\theta}$ is the sample standard deviation of the bootstrap replicates $\hat{\theta}^{(1)},…,\hat{\theta}^{(B)}$

(1) $\hat{se}(\hat{\theta}^*) = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}(\hat{\theta}^{(b)} – \hat{\bar{\theta}}^*)^2}$

where
$\hat{\bar{\theta}}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{(b)} \rightarrow$ Mean of the bootstrap replicates.

Example 2: We have data from LSAT (mean score on law school admission test score) and GPA (mean grade point average) scores for 15 law schools.

We want to estimate the correlation $(\hat{\theta})$ between LSAT and GPA scores, and compute the bootstrap estimate of the standard error of the sample correlation. The procedure is

1. For each bootstrap replicate, indexed $b=1,…,B$
1. Generate sample $x^{*(b)} = x^{*}_1,…, x^*_n$ by sampling with replacement from the observed sample $x_1,…,x_n$
2. Compute the $b$-th replicate $\hat{\theta}^{(b)}$ from the b-th bootstrap sample, where $\hat{\theta}$ is the sample correlation $R$ between (LSAT, GPA).
2. The bootstrap estimate of $se(R)$ is the sample standard deviation of the replicates $\hat{\theta}^{(1)},…,\hat{\theta}^{(B)} = R^{(1)},…,R^{(B)}$

An example in code is shown at the bottom of this post.

# Bootstrap Estimation of Bias

If $\hat{\theta}$ is an unbiased estimator of $\theta$, then $E(\hat{\theta}) = \theta$.

The bias of an estimator $\hat{\theta}$ for $\theta$ is
$$bias(\hat{\theta}) = E[\hat{\theta} – \theta] = E[\hat{\theta}] – \theta$$

The bootstrap estimation of bias uses the bootstrap replicates of $\hat{\theta}$ to estimate the sampling distribution of $\hat{\theta}$. For the finite population $\textbf{x} = (x_1,…,x_n)$, the parameter is $\hat{\theta}(x)$ and there are $B$ independent and identically distributed estimators $\hat{\theta}^{(b)}$.

The sample mean of the bootstrap replicates $\{\hat{\theta}^{(b)} \}$ is unbiased for its expected value $E[\hat{\theta}^*]$, so the bootstrap estimate of bias is

(2) $\hat{bias}(\hat{\theta}) = \hat{\bar{\theta}}^* – \hat{\theta}$

where
$\hat{\bar{\theta}}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{(b)}$ and $\hat{\theta} = \hat{\theta}(\textbf{x})$ is the estimate computed from the original observed sample.

# Bootstrap Confidence Intervals

Whenever we make point estimates, it is important to have confidence intervals around these point estimates to encapsulate the uncertainty we have in the sampling distribution. There are many approaches to obtaining approximate Confidence Intervals (CI) for a target parameter in a bootstrap. Examples of approaches:

1. Standard Normal Bootstrap CI
2. Basic Bootstrap Pivot CI
3. Bootstrap Percentile CI
4. Better Bootstrap CI (BCa)

## Standard Normal Bootstrap Confidence Interval

Suppose that $\hat{\theta}$ is an estimator of a parameter $\theta$ with standard error $se(\hat{\theta})$. If $\hat{\theta}$ is a sample mean and the sample size is large then the central limit theorem (CLT) implies that

(3) $z = \frac{\hat{\theta} – E(\hat{\theta})}{se(\hat{\theta})}$

is approximately standard normal.

Hence if $\hat{\theta}$ is unbiased for $\theta$, then an approximate $100(1 – \alpha)$% confidence interval for $\theta$ is

(4) $\hat{\theta} \pm z_{\alpha/2}se(\hat{\theta})$

where $z_{\alpha/2} = \Phi{}^{-1}(1 – \alpha/2)$

Although this CI is easy to compute, we have made some assumptions:

1. $\hat{\theta}$ is normally distributed
2. $\hat{\theta}$ is a sample mean
3. $n$ is large
4. $\hat{\theta}$ is unbiased for $\theta$

These assumptions don’t tend to hold in real world cases, so there are other methods of constructing confidence intervals.

## Basic Bootstrap Confidence Interval

The basic confidence interval transforms the distribution of the bootstrap replicates by subtracting the observed statistic.  The quantiles of the transformed sample $\hat{\theta}^* – \hat{\theta}$ are used to determine the confidence limits.

The $100(1 – \alpha)$% confidence limits for the basic bootstrap confidence intervals are

(5) $(L, U) = (2\hat{\theta} – \hat{\theta}^*_{1-\alpha/2},\ 2\hat{\theta} – \hat{\theta}^*_{\alpha/2})$

where $\hat{\theta}^*_\alpha$ is the $\alpha$ sample quantile of the distribution induced by the bootstrap replicates $\hat{\theta}^*$.

To derive (5), suppose that a random sample $\textbf{x} = (x_1,…,x_n)$ is observed and the statistic $\hat{\theta} = \hat{\theta}(x_1,…,x_n)$ is an estimator of a parameter $\theta$ of the sampled distribution. Assume we have $B$ replicates $\hat{\theta}^*$ from an ordinary bootstrap. For a symmetric $100(1 – \alpha)$% CI (L, U) for the parameter $\theta$ based on the bootstrap replicates, we require

$$Pr(L \geq \theta) = Pr(U \leq \theta) = \frac{\alpha}{2}$$

Then
$$\frac{\alpha}{2} = Pr(L \geq \theta) = Pr(L – \hat{\theta} \geq \theta – \hat{\theta})$$
$$Pr(-1(L – \hat{\theta}) \geq -1(\theta – \hat{\theta})) = Pr(\hat{\theta} – \theta \leq \hat{\theta} – L)$$

Similarly
$$1 – \frac{\alpha}{2} = 1 – Pr(L \geq \theta) = Pr(L < \theta) = Pr(L – \hat{\theta} < \theta – \hat{\theta})$$

The percentile can be estimated from the bootstrap replicates. We first find $\hat{\theta}^{*}_{1 – \alpha/2}$, the $1 – \alpha/2$ sample quantile of the bootstrap replicates $\hat{\theta}^*$.

Then
$$\hat{\theta}^{*}_{1 – \alpha/2} – \hat{\theta}$$

is approximately equal to the $1 – \alpha/2$ quantile of $\hat{\theta} – \theta$

Set
$$\hat{\theta} – L = \hat{\theta}^{*}_{1 – \alpha/2} – \hat{\theta}$$

Similarly solving $\alpha/2 = Pr(U \leq \theta)$ leads to the estimate

$$\hat{\theta} – U = \hat{\theta}^{*}_{\alpha/2} – \hat{\theta}$$

Hence the basic bootstrap confidence interval is

$$(\hat{\theta} – L = \hat{\theta}^{*}_{1 – \alpha/2} – \hat{\theta}, \hat{\theta} – U = \hat{\theta}^{*}_{\alpha/2} – \hat{\theta})$$

$$(2\hat{\theta} – \hat{\theta}^{*}_{1 – \alpha/2} = L, 2\hat{\theta} – \hat{\theta}^{*}_{\alpha/2} = U)$$

$$(L,U) = (2\hat{\theta} – \hat{\theta}^{*}_{1 – \alpha/2}, 2\hat{\theta} – \hat{\theta}^{*}_{\alpha/2})$$

## Percentile Bootstrap Confidence Interval

A bootstrap percentile interval uses the empirical distribution of the bootstrap replicates as the reference distribution. The quantiles of the empirical distribution are estimators of the quantiles for the sampling distribution of $\hat{\theta}$. These random quantiles may match the true distribution better when the sampling distribution of $\hat{\theta}$ is not normally distributed.

If $\hat{\theta}^{(1)},…,\hat{\theta}^{(B)}$ are bootstrap replicates of the statistic $\hat{\theta}$. Then we obtain the percentile CI from the empirical cdf of the bootstrap replicates, we compute the $\alpha/2$ quantile $\hat{\theta}_{\alpha/2}$ and the $1 – \alpha/2$ quantile $\hat{\theta}_{1 – \alpha/2}$

Obtaining Bootstrap Percentile Confidence Intervals

To obtain bootstrap confidence intervals, take the $B$ bootstrap estimates $\hat{\theta}^{(1)},…,\hat{\theta}^{(B)}$ and rank them from smallest to largest. Denote these ordered bootstrap estimates by
$$\hat{\theta}^{*}_{(1)},…, \hat{\theta}^{*}_{(B)}$$

where the number in brackets shows the order in terms of size. Thus, $\hat{\theta}^{*}_{(1)}$ is the smallest bootstrap estimate and $\hat{\theta}^{*}_{(B)}$ is the largest bootstrap estimate.

The spread in the bootstrap estimates tells us approximately how large the effect of chance error is in the original sample upon the variation in the estimate $\hat{\theta}$. This approximation improves as $n$ increases.

Suppose we want to set a 95% confidence interval on $\theta$, the true parameter value for the real population $F(x)$. Further suppose we take $B=1000$ bootstrap samples. The bootstrap method suggests that approximately 95% of the time, the true parameter value for $F^{*}_{m}$ falls between the 2.5th percentile of the bootstrap samples and the 97.5th percentile.

The 95th percentile confidence interval is
$$(L, U) = (\hat{\theta}^{*}_{(0.025)}, \hat{\theta}^{*}_{(0.975)})$$

## Better Bootstrap Confidence Intervals (BCa)

Better bootstrap confidence intervals are a modified version of percentile intervals that have better theoretical properties and better performance in practice.

For a $100(1 – \alpha)$% CI, the standard $\alpha/2$ and $1 – \alpha/2$ quantiles are adjusted by 2 factors.

1). A correction for bias (denoted by $\hat{z_0}$)

2). A correction for skewness (denoted by $\hat{a}$)

The bias correction factor is essentially measuring the media bias of the bootstrap replicates $\hat{\theta}^*$ for $\hat{\theta}$.The estimate of the bias correction is

(6) $\hat{z_0} = \Phi{}^{-1}(\frac{1}{B}\sum_{b=1}^{B}I(\hat{\theta}^{(b)}<\hat{\theta}))$

where $I(.)$ is an indicator function that returns a 1 if the argument condition is met and 0 otherwise. Note that $\hat{z_o}$ if $\hat{\theta}$ is the median bias of the replicates $\hat{\theta}^*$ for $\hat{\theta}$.

The skewness adjustment factor is estimated from jackknife replicates

(7) $\hat{a} = \frac{\sum_{i=1}^{n}(\bar{\theta}_{(.)} – \theta_{(i)})^3}{6[\sum_{i=1}^{n}(\bar{\theta}_{(.)} – \theta_{(i)})^2]^{3/2}}$
which measure skewness.

The skewness factor $\hat{a}$ estimates the rate of change of the standard error of $\hat{\theta}$ with respect to the target parameter $\theta$. The skewness factor aims to adjust the confidence limits to account for the possibility that the variance of the estimator may depend on the true value of the target parameter.

For a $100(1 – \alpha)$% BCa bootstrap interval compute

(8) $\alpha_1 = \Phi(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 – \hat{a}(\hat{z}_0 + z_{\alpha/2})})$
(9) $\alpha_2 = \Phi(\hat{z}_0 + \frac{\hat{z}_0 + z_{1 – \alpha/2}}{1 – \hat{a}(\hat{z}_0 + z_{1 – \alpha/2})})$

where $z_{\alpha} = \Phi{}^{-1}(\alpha)$ and $\hat{z}_0$ and $\hat{a}$ are defined by (6) and (7).

The BCa interval is $(\hat{\theta}^*_{\alpha_1}, \hat{\theta}^*_{\alpha_2})$. The upper and lower confidence limits of the BCa CI are the empirical $\alpha_1$ and $\alpha_2$ quantiles of the bootstrap replicates

Properties of BCa intervals

If $(\hat{\theta}^{*}_{\alpha_1}, \hat{\theta}^{*}_{\alpha_2})$ is a confidence interval for $\theta$, and $t(\theta)$ is a transformation of the parameter $\theta$, then the corresponding interval for $t(\theta)$ is $(t(\hat{\theta}^{*}_{\alpha_1}), t(\hat{\theta}^{*}_{\alpha_2}))$.

The error tends to 0 at rate $\frac{1}{\sqrt{n}}$ for sample size $n$, and it is second order accurate if the error tends to 0 at rate $\frac{1}{n}$

Workflow in Computing BCa Intervals

Suppose we have a sample of size $n$ denoted as $\mathbf{x} = (x_1,…,x_n)$ and we draw $B$ bootstrap samples. For each bootstrap sample $x^{*}_{j} = (x^{*}_{j1},…, x^{*}_{jn})$, $j = 1,…,B$, we compute bootstrap estimate of a statistic $\hat{\theta}$. The bootstrap estimates are shown in the table below:

We want a $100(1 – \alpha)$% confidence interval and we use the distribution composed of $\hat{\theta}^{(1)}, \hat{\theta}^{(2)} ,…, \hat{\theta}^{(B)}$ as the reference distribution. Since we want $(\hat{\theta}^*_{\alpha_1}, \hat{\theta}^*_{\alpha_2})$, we want to adjust the usual $\frac{\alpha}{2}$ and $1 – \frac{\alpha}{2}$ quantiles by the bias correction and skewness correction factors.

Recall that the BCa interval is $(\hat{\theta}^*_{\alpha_1}, \hat{\theta}^*_{\alpha_2})$ and
$\alpha_1 = \Phi(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 – \hat{a}(\hat{z}_0 + z_{\alpha/2})})$

$\alpha_2 = \Phi(\hat{z}_0 + \frac{\hat{z}_0 + z_{1 – \alpha/2}}{1 – \hat{a}(\hat{z}_0 + z_{1 – \alpha/2})})$

where $z_{\alpha} = \Phi{}^{-1}(\alpha)$ and
$\hat{z_0} = \Phi{}^{-1}(\frac{1}{B}\sum_{b=1}^{B}I(\hat{\theta}^{(b)}<\hat{\theta}))$

where $I(.)$ is an indicator function and
$\hat{a} = \frac{\sum_{i=1}^{n}(\bar{\theta}_{(.)} – \theta_{(i)})^3}{6[\sum_{i=1}^{n}(\bar{\theta}_{(.)} – \theta_{(i)})^2]^{3/2}}$

To compute these intervals we would compute the following:

1). Compute $z_{\alpha/2}$ and $z_{1-\frac{\alpha}{2}}$. Since $z_{\alpha} = \Phi^{-1}(\alpha)$ is obtained from the quantile of the standard normal distribution, we can look up these values from the standard normal distribution tables or use qnorm(x)/scipy.stats.norm.ppf(x) in R/Python. The confidence level $\alpha$ is user defined.

2). Compute $\hat{z}_{0}$

Recall that $\hat{\theta}$ is computed from the original sample $\mathbf{x} = (x_1,…,x_n)$ and is denoted as $\hat{\theta}(x_1,…,x_n) = \hat{\theta}$ so this is constant. We compare this value $\hat{\theta}$ with every bootstrap estimate we have $\hat{\theta}^{(b)}$, $b = 1,…, B$

Suppose we observe $p$ instances where $\hat{\theta}^{(b)} < \hat{\theta}$, if we let $p < B$ then our $\hat{z}_{0}$ becomes

$\hat{z}_{0} = \Phi^{-1}(\frac{1}{B}\sum_{b=1}^{B} I(\hat{\theta}^{(b)} < \hat{\theta})) = \Phi^{-1}(\frac{p}{B})$

We can find this value in the same way as we found $z_{\frac{\alpha}{2}}$, by looking at the quantiles of the standard normal distribution.

3). Compute $\hat{a}$

We want to compute jackknife estimates from the original observed sample. $\hat{\theta}_{(i)}$ for $i = 1,…,n$. Jackknife estimates are like leave one out cross validation and are computed based on the observed sample with the $i$-th observation left out.

$$\theta_{(1)} = \theta(x_2,…,x_n)$$

$$\theta_{(2)} = \theta(x_1, x_3,…,x_n)$$

$$…$$

$$\theta_{(i)} = \theta(x_1,…,x_{i-1}, x_{i+1},…,x_n)$$

$$…$$

$$\theta_{(n)} = \theta(x_1,…,x_{n-1})$$

Compute the mean of all jackknife estimates, $\bar{\theta}_{(.)}$

$$\bar{\theta}_{(.)} = \frac{1}{n} \sum_{i=1}^{n}\theta_{(i)}$$

Combine to compute ratio

$$\hat{a} = \frac{\sum_{i=1}^{n}(\bar{\theta}_{(.)} – \theta_{(i)})^3}{6[\sum_{i=1}^{n}(\bar{\theta}_{(.)} – \theta_{(i)})^2]^{3/2}}$$

4). Compute quantile limits $\alpha_1$ and $\alpha_2$

Now we have values for $\hat{z}_{0}, z_{\frac{\alpha}{2}}, \hat{a}$, we can use. the cdf of the standard normal distribution to obtain $\alpha_1$ and $\alpha_2$. In R/Python we use pnorm(x)/scipy.stats.norm.cdf(x)

Finally, order the bootstrap estimates $\hat{\theta}^{(1)},…,\hat{\theta}^{(B)}$ and select the $\alpha_1$ and $\alpha_2$ quantiles from the distribution of $\{\hat{\theta}^{(j)} \}_{j=1}^{B}$

## Example in R

library(bootstrap)
library(boot)

# Estimating Standard Error ----

## Correlation for sample data - theta
cor(law$LSAT, law$GPA)

## Correlation for population
cor(law82$LSAT, law82$GPA)

## Compute Bootstrap estimates and corresponding standard error
B <- 200 # Number of replicates
n <- nrow(law) # sample size
R <- numeric(B) # storage for replicates

### bootstrap estimate of standard error of R
for (b in 1:B) {
#randomly select the indices
i <- sample(1:n, size = n, replace = TRUE)
LSAT <- law$LSAT[i] GPA <- law$GPA[i]
R[b] <- cor(LSAT, GPA) # Bootstrap estimates}

### Print Bootstrap Standard Error
print(sd(R))

# Alternate method to compute standard error ----
r <- function(x, i) {
# We want correlation between columns 1 and 2
cor(x[i, 1], x[i, 2])}

## Use boot command that computes bootstrap estimate with 2000 replicates
obj <- boot(data = law, statistic = r, R = 2000)
print(obj)

# Standard Normal Bootstrap Confidence Interval ----
boot.obj <- boot(data = law, R = 2000, statistic = r)
std_normal_bootstrap_ci <- boot.ci(boot.obj, conf = 0.95, type = c("norm"))
print(std_normal_bootstrap_ci)

# Basic Bootstrap Pivot Confidence Interval ----
boot.obj <- boot(data = law, R = 2000, statistic = r)
basic_bootstrap_ci <- boot.ci(boot.obj, conf = 0.95, type = c("basic"))
print(basic_bootstrap_ci)

# Bootstrap Percentile Confidence Interval ----
boot.obj <- boot(data = law, R = 2000, statistic = r)
percentile_bootstrap_ci <- boot.ci(boot.obj, conf = 0.95, type = c("perc"))
print(percentile_bootstrap_ci)

# Better Bootstrap Percentile Confidence Interval ----
boot.obj <- boot(data = law, R = 2000, statistic = r)
bca_bootstrap_ci <- boot.ci(boot.obj, conf = 0.95, type = c("bca"))
print(bca_bootstrap_ci)