[Complete] Plotting and bootstrap demo

[Complete] Plotting and bootstrap demo#

2025-02-13


# imports we'll need
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

0. The normal distribution#

A probability distribution that we will frequently encounter in addition to the categorical distributions we saw in worksheet 1 is the normal distribution. The normal distribution has the following properties:

  • centered and symmetric about the mean \(\mu\) (or \(E[X]\))

  • spread by the standard deviation \(\sigma\)

    • the variance is \(\sigma^2\) (or \(V[X]\))

We write a random variable as \(X\) that follows a normal distribution as \(X \sim \mathcal{N}(\mu, \sigma^2)\).

Tip

Watch whether the software you are using takes the variance or the standard deviation as input. In numpy, it is the standard deviation, which is different than how we write the distribution above.

We can draw random samples from the normal distribution using numpy.random.Generator.normal():

# This creates a generator with a fixed seed, so we can generate the same random numbers
rng = np.random.default_rng(seed=42)

# generate 50000 samples from a normal distribution with E[X]=10 and standard deviation 1
n_samples = 50000
goose_samples = rng.normal(loc=10, scale=1, size=n_samples)
goose_samples.shape
sns.histplot(goose_samples);
../_images/f096efd8e453a143e9f66b276e826021f24102824bf418d750e32c96d5d35e32.png
sample = goose_samples[:10]
goose_mean = np.mean(sample)

# in reality, we don't have access to the "true" distribution, which is why it's commented out
#sns.histplot(goose_samples)
plt.axvline(x=goose_mean, color='red');
../_images/39703f0717db3983aacd33bb6ff175d1d9157ed5f557e5380db80572dfe23775.png

For bootstrap samples, we can use random.Generator.choice() with replacement to draw samples from our dataset.

bootstrap_means = []
n_bootstraps = 50000

for i in range(n_bootstraps):
    # 1. draw a bootstrap sample
    bootstrap_sample = rng.choice(sample, size=10, replace=True)
    bootstrap_sample
    # 2. compute our mean again, with the bootstrap sample
    bootstrap_mean = np.mean(bootstrap_sample)

    # 3. add our bootstrap_mean to the list
    bootstrap_means.append(bootstrap_mean)

plt.axvline(goose_mean, color="red")

# as our n_bootstraps increase, we're able to see the spread of the estimate
sns.histplot(bootstrap_means)
# fix the xaxis
plt.xlim(8, 12);
../_images/01ea958c0dbed41827187a4acda81f6e52059d2aa154e988827f03533cc2adcf.png