import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
jury = pd.DataFrame({
    'Ethnicity': np.array(['Asian', 'Black', 'Latino', 'White', 'Other']),
    'Eligible': np.array([0.15, 0.18, 0.12, 0.54, 0.01]),
    'Panels': np.array([0.26, 0.08, 0.08, 0.54, 0.04])
})
jury
|   | Ethnicity | Eligible | Panels |
|---|---|---|---|
| 0 | Asian | 0.15 | 0.26 |
| 1 | Black | 0.18 | 0.08 |
| 2 | Latino | 0.12 | 0.08 |
| 3 | White | 0.54 | 0.54 |
| 4 | Other | 0.01 | 0.04 |
jury.plot.barh('Ethnicity');
`random.multinomial(n, pvals, size=None)`

The multinomial distribution is a multivariate generalization of the binomial distribution. Take an experiment with one of `p` possible outcomes. An example of such an experiment is throwing a die, where the outcome can be 1 through 6. Each sample drawn from the distribution represents `n` such experiments. Its values, `X_i = [X_0, X_1, ..., X_p]`, represent the number of times the outcome was `i`.

N.B. New code should use the `multinomial` method of a `default_rng()` instance instead (please see the Quick Start). Call `default_rng` to get a new instance of a `Generator`, then call its methods to obtain samples from different distributions.
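As a quick sketch of the recommended `Generator` interface (this die-rolling example is hypothetical, not from the notebook above): draw one multinomial sample of 60 fair die rolls and get the count for each of the six faces.

```python
import numpy as np
from numpy.random import default_rng

# Hypothetical example: roll a fair six-sided die 60 times and count
# how often each face appears. The seed is chosen only so the sketch
# is reproducible.
rng = default_rng(42)
counts = rng.multinomial(60, [1/6] * 6)
print(counts)        # six non-negative counts
print(counts.sum())  # always 60: every roll lands on exactly one face
```

The counts always sum to `n` because each trial produces exactly one of the `p` outcomes.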
A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times. The binomial is a type of distribution that has two possible outcomes (the prefix "bi" means two, or twice). For example, a coin toss has only two possible outcomes, heads or tails, and taking a test could have two possible outcomes, pass or fail.
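A minimal sketch of the coin-toss case (a hypothetical illustration, not part of the jury analysis): a single binomial draw counts the successes in `n` repeated two-outcome trials.

```python
import numpy as np
from numpy.random import default_rng

# Hypothetical sketch: count heads in 100 fair coin tosses.
rng = default_rng(0)
heads = rng.binomial(n=100, p=0.5)                 # one experiment of 100 tosses
samples = rng.binomial(n=100, p=0.5, size=10_000)  # repeat the experiment
print(heads)           # an integer between 0 and 100
print(samples.mean())  # close to n * p = 50
```

With many repetitions the sample mean settles near `n * p`, which is what makes the simulation approach in the rest of this section work.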
# Legacy version
from numpy import random
vals = random.standard_normal(10)
more_vals = random.standard_normal(10)
more_vals
array([ 1.04178783, 1.30550513, 1.95578289, -0.46404912, -0.5442649 , -0.17819663, -0.32144174, -0.95697455, 1.20457654, -0.52214162])
# New version
from numpy.random import default_rng
rng = default_rng()
vals = rng.standard_normal(10)
more_vals = rng.standard_normal(10)
more_vals
array([-0.30723259, -1.98679925, 0.62638645, 1.2351097 , -1.12786288, 0.1655601 , -1.94840903, 0.61373912, 1.76144768, 0.25620496])
model = np.array([0.15, 0.18, 0.12, 0.54, 0.01])
model
array([0.15, 0.18, 0.12, 0.54, 0.01])
def sample_proportions(sample_size, probabilities):
    # Draw one multinomial sample and convert the counts to proportions
    return np.random.multinomial(sample_size, probabilities) / sample_size
simulated = sample_proportions(1423, model)
simulated
array([0.15811665, 0.1658468 , 0.11314125, 0.55024596, 0.01264933])
Assign new columns to a DataFrame.

Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

Parameters: `**kwargs` : dict of {str: callable or Series}
    The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change the input DataFrame (though pandas doesn't check it). If the values are not callable (e.g. a Series, scalar, or array), they are simply assigned.

Returns: DataFrame
    A new DataFrame with the new columns in addition to all the existing columns.
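The callable form mentioned above can be sketched on a small hypothetical DataFrame (the numbers below are just the first two jury rows, reused for illustration): the lambda receives the DataFrame and its result becomes the new column.

```python
import pandas as pd

# Hypothetical illustration of assign with a callable: the function is
# computed on the DataFrame and assigned to the new column.
df = pd.DataFrame({'Eligible': [0.15, 0.18], 'Panels': [0.26, 0.08]})
df2 = df.assign(Difference=lambda d: d['Panels'] - d['Eligible'])
print(df2['Difference'].tolist())  # approximately [0.11, -0.10]
```

Note that `assign` returns a new DataFrame; `df` itself is unchanged.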
jury_with_simulated = jury.assign(Simulated = simulated)
jury_with_simulated
|   | Ethnicity | Eligible | Panels | Simulated |
|---|---|---|---|---|
| 0 | Asian | 0.15 | 0.26 | 0.158117 |
| 1 | Black | 0.18 | 0.08 | 0.165847 |
| 2 | Latino | 0.12 | 0.08 | 0.113141 |
| 3 | White | 0.54 | 0.54 | 0.550246 |
| 4 | Other | 0.01 | 0.04 | 0.012649 |
jury_with_simulated.plot.barh('Ethnicity');
diffs = jury['Panels'] - jury['Eligible']
jury_with_difference = jury.assign(Difference = diffs)
jury_with_difference
|   | Ethnicity | Eligible | Panels | Difference |
|---|---|---|---|---|
| 0 | Asian | 0.15 | 0.26 | 0.11 |
| 1 | Black | 0.18 | 0.08 | -0.10 |
| 2 | Latino | 0.12 | 0.08 | -0.04 |
| 3 | White | 0.54 | 0.54 | 0.00 |
| 4 | Other | 0.01 | 0.04 | 0.03 |
In probability theory, the total variation distance (Wikipedia) is a distance measure for probability distributions. It is an example of a statistical distance metric, and is sometimes called the statistical distance, statistical difference, or variational distance.
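A tiny worked example of the idea (toy distributions, not the jury data): the TVD between two distributions is half the sum of the absolute differences of their probabilities.

```python
import numpy as np

# Toy example: two distributions over the same two outcomes.
dist1 = np.array([0.5, 0.5])
dist2 = np.array([0.9, 0.1])

# Half the sum of absolute differences: (0.4 + 0.4) / 2
tvd_value = np.abs(dist1 - dist2).sum() / 2
print(tvd_value)  # ≈ 0.4
```

Halving the sum means the TVD lies between 0 (identical distributions) and 1 (no overlap at all), which makes values like the 0.14 computed below easy to interpret.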
The Monte Carlo sampling methods give us the desired stationary distribution starting from a base Markov chain. Assuming one step of the Markov chain takes unit time, we can compare their computational efficiency in terms of the number of operations needed per sample, and the number of samples needed to get close enough to the desired stationary distribution.
Markov Chain - Python (GeeksforGeeks)

Markov chains, named after Andrey Markov, are stochastic models that depict a sequence of possible events in which the prediction or probability for the next state depends solely on the previous state, not the states before it. In simple words, the probability that the (n+1)th step will be x depends only on the nth step, not the complete sequence of steps that came before n. This property is known as the Markov property, or memorylessness.
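The Markov property can be sketched with a hypothetical two-state weather chain (the states and transition probabilities below are made up for illustration): each step looks only at the current state's row of the transition matrix.

```python
import numpy as np
from numpy.random import default_rng

# Hypothetical two-state chain. Row i gives the probabilities of the
# next state given the current state i, so each step depends only on
# the present state (the Markov property).
states = ['sunny', 'rainy']
P = np.array([[0.8, 0.2],    # from sunny: stay sunny / turn rainy
              [0.4, 0.6]])   # from rainy: turn sunny / stay rainy

rng = default_rng(1)
state = 0                    # start sunny
path = [states[state]]
for _ in range(10):
    state = rng.choice(2, p=P[state])  # next state from current row only
    path.append(states[state])
print(path)  # a length-11 sequence of 'sunny'/'rainy'
```

Note the loop never consults anything but the current `state`; the earlier steps in `path` play no role in the next draw.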
def tvd(dist1, dist2):
    # Total variation distance: half the sum of absolute differences
    return sum(abs(dist1 - dist2)) / 2
obsvd_tvd = tvd(jury['Panels'], jury['Eligible'])
obsvd_tvd
0.14
tvd(sample_proportions(1423, model), jury['Eligible'])
0.022030920590302157
def simulated_tvd():
    return tvd(sample_proportions(1423, model), model)

tvds = np.array([])

num_simulations = 10000
for i in np.arange(num_simulations):
    new_tvd = simulated_tvd()
    tvds = np.append(tvds, new_tvd)
title = 'Simulated TVDs (if model is true)'
bins = np.arange(0, .05, .005)
# Don't assign the plot to `tvd` here, or the tvd function above is shadowed
pd.DataFrame({title: tvds}).hist(bins=bins, ec='white');
print('Observed TVD: ' + str(obsvd_tvd))
Observed TVD: 0.14
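One common follow-up (a hedged sketch, not shown in the notebook above; it re-simulates rather than reusing the `tvds` array) is to estimate how often chance alone produces a TVD at least as large as the observed 0.14.

```python
import numpy as np

# Re-simulate TVDs under the eligibility model and compare each one to
# the observed TVD of 0.14. The seed is arbitrary, for reproducibility.
rng = np.random.default_rng(7)
model = np.array([0.15, 0.18, 0.12, 0.54, 0.01])

def one_tvd():
    props = rng.multinomial(1423, model) / 1423
    return np.abs(props - model).sum() / 2

sims = np.array([one_tvd() for _ in range(10_000)])
p_value = np.count_nonzero(sims >= 0.14) / 10_000
print(p_value)  # essentially 0: chance TVDs stay far below 0.14
```

Since the simulated TVDs in the histogram all fall below about 0.05, essentially none of the 10,000 simulations reach 0.14, which is why the observed panels look inconsistent with the eligibility model.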