[Complete] Plotting and bootstrap demo#
2025-02-13
# imports we'll need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
0. The normal distribution#
A probability distribution that we will frequently encounter in addition to the categorical distributions we saw in worksheet 1 is the normal distribution. The normal distribution has the following properties:
centered and symmetric about the mean \(\mu\) (or \(E[X]\))
spread by the standard deviation \(\sigma\)
the variance is \(\sigma^2\) (or \(V[X]\))
We write a random variable as \(X\) that follows a normal distribution as \(X \sim \mathcal{N}(\mu, \sigma^2)\).
Tip
Watch whether the software you are using takes the variance or the standard deviation as input. In numpy
, it is the standard deviation, which is different than how we write the distribution above.
We can draw random samples from the normal distribution using numpy.random.Generator.normal()
:
# This creates a generator with a fixed seed, so we can generate the same random numbers
rng = np.random.default_rng(seed=42)
# generate 50000 samples from a normal distribution with E[X]=10 and standard deviation 1
n_samples = 50000
goose_samples = rng.normal(loc=10, scale=1, size=n_samples)
goose_samples.shape
sns.histplot(goose_samples);
data:image/s3,"s3://crabby-images/17efc/17efcb00931a9943d9f8e412ed9cd2d3f88da1e8" alt="../_images/f096efd8e453a143e9f66b276e826021f24102824bf418d750e32c96d5d35e32.png"
sample = goose_samples[:10]
goose_mean = np.mean(sample)
# in reality, we don't have access to the "true" distribution, which is why it's commented out
#sns.histplot(goose_samples)
plt.axvline(x=goose_mean, color='red');
data:image/s3,"s3://crabby-images/7e721/7e72123e98030f8e06691c67379a40c6efb9cb9f" alt="../_images/39703f0717db3983aacd33bb6ff175d1d9157ed5f557e5380db80572dfe23775.png"
For bootstrap samples, we can use random.Generator.choice() with replacement to draw samples from our dataset.
bootstrap_means = []
n_bootstraps = 50000
for i in range(n_bootstraps):
# 1. draw a bootstrap sample
bootstrap_sample = rng.choice(sample, size=10, replace=True)
bootstrap_sample
# 2. compute our mean again, with the bootstrap sample
bootstrap_mean = np.mean(bootstrap_sample)
# 3. add our bootstrap_mean to the list
bootstrap_means.append(bootstrap_mean)
plt.axvline(goose_mean, color="red")
# as our n_bootstraps increase, we're able to see the spread of the estimate
sns.histplot(bootstrap_means)
# fix the xaxis
plt.xlim(8, 12);
data:image/s3,"s3://crabby-images/a6209/a62099c44ed428239b65345149b1b27767f15bc8" alt="../_images/01ea958c0dbed41827187a4acda81f6e52059d2aa154e988827f03533cc2adcf.png"