Not enough data
used when we have distinct groups
Random number generator
Sampling from already-sampled data (resampling)
Correlation does not imply causation
Correlation = -1
Correlation = 1
We calculate correlation with the:
How strong the relationship is between the variables.
How one variable changes with respect to another.
No Skew: (Mean and median are EQUAL)
Long tail on right side: (Mean will be more than the median)
Long tail on left side: (Mean will be less than the median)
Represents the amount and direction of the skew.
Represents how tall and sharp the central peak is.
This can be seen on the graph as the height (y value) of the curve at the given x value
F(x)
x
Each column in our data is a
An estimate of the CDF from the sample is called the:
This is an estimate of the distribution's (PDF)
Best with:
Best with:
Like histograms, KDEs visualize the distribution, but they smooth the data with a kernel instead of binning it
In order to see how data is distributed inside each quartile we use
Version of box plot for visualizing dispersion
Visually represented by the:
Finding the probability of getting a value of x or less
Divide data into equal groups containing equal percentage of total.
Max-Min =
The distance between the 3rd and 1st quartiles:
Examples of misuse
Applications
Also known as the Relative Standard Deviation (RSD)
compare dispersion of two different datasets
The square root of the variance =
X-bar
Most common value
Middle value
μ=average
spread/dispersion of data
data consists of a single attribute/characteristic
If
Less sensitive to outliers than the coefficient of variation
Integral (area under the curve) of the PDF:
use
Categorical data
standardized form of a Gaussian distribution in which μ = 0 and σ = 1.
Both can be used with either type of variable.
Both used to represent the distribution of data.
The PDF is for:

Statistical Foundation


Descriptive


describe the sample

Univariate statistics

Central Tendency

Mean

sample mean


Median


If n is even


Mode

Unimodal = 1 mode

bimodal = 2 modes

multimodal = many modes

Outliers

Tip


Tip: The i-th percentile is the value below which i% of the observations fall, so the 99th percentile is the value in X where 99% of the x's are less than it.
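As a quick check of this definition, NumPy's percentile function can be used (the data below is made up for illustration):

```python
import numpy as np

x = np.arange(1, 101)  # hypothetical data: the integers 1 through 100

# The 99th percentile: 99% of the observations are less than this value
p99 = np.percentile(x, 99)

# The 50th percentile is the median
median = np.percentile(x, 50)
```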

Measure of spread from the (MEAN)


Mean-based measures of dispersion

Variance



The variance describes how far apart observations are spread out from the average value (the mean). The population variance is denoted as σ² (pronounced sigma-squared), and the sample variance is written as s². It is calculated as the average squared distance from the mean. Note that the distances must be squared so that distances below the mean don't cancel out those above the mean.
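A minimal sketch of these formulas with NumPy; the data values are invented for illustration:

```python
import numpy as np

data = [5, 9, 3, 4, 7]  # hypothetical sample values

# Population variance: average squared distance from the mean (divide by n)
sigma2 = np.var(data)

# Sample variance: divide by n - 1 instead (Bessel's correction), hence ddof=1
s2 = np.var(data, ddof=1)

# The standard deviation is the square root of the variance
s = np.std(data, ddof=1)
```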

Standard Deviation



We can use the standard deviation to see how far from the mean data points are on average. A small standard deviation means that values are close to the mean, while a large standard deviation means that values are dispersed more widely. This is tied to how we would imagine the distribution curve: the smaller the standard deviation (e.g., σ = 0.5), the thinner and taller the peak of the curve; the larger the standard deviation (e.g., σ = 2), the wider the curve. The standard deviation is simply the square root of the variance. By taking the square root, we get a statistic back in units we can make sense of again ($ for our income example). Note that the population standard deviation is represented as σ, and the sample standard deviation is denoted as s.

Coefficient of Variation (CV)



CV is the Ratio of the
standard deviation to the mean

Comparing Two Datasets
with different units problem

Compare volatility
or risk with the amount of
return expected from investments.

Comparing between parameters using relative units such as Celsius and Fahrenheit.
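A sketch of the CV calculation with NumPy; the function name and the return figures are invented for illustration:

```python
import numpy as np

def cv(data):
    """Coefficient of variation: sample standard deviation over the mean."""
    data = np.asarray(data, dtype=float)
    return np.std(data, ddof=1) / np.mean(data)

# Two hypothetical return series measured on different scales
returns_pct = [10, 20, 30]        # percent
returns_bps = [1000, 2000, 3000]  # basis points
```

Because the units cancel in the ratio, rescaling the data does not change its CV, which is what makes dispersion comparable across datasets with different units.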


Median-based measures of dispersion

Interquartile range (IQR)


Quantiles
(25%, 50%, 75%, and 100%)

Quartile coefficient of dispersion



 It is calculated by dividing the semi-quartile range (half the IQR) by the midhinge (midpoint between the first and third quartiles):
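That definition translates directly into a small NumPy function (a sketch; the data passed in below is hypothetical):

```python
import numpy as np

def quartile_coefficient_of_dispersion(data):
    # Semi-quartile range (half the IQR) divided by the midhinge
    q1, q3 = np.percentile(data, [25, 75])
    return ((q3 - q1) / 2) / ((q1 + q3) / 2)

qcd = quartile_coefficient_of_dispersion([1, 2, 3, 4, 5, 6, 7])
```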


Range


Gives us the dispersion of the entire dataset. Downfall: it doesn't tell us about the dispersion at the center, and it is useless if the data contains outliers.

Visualizing Distribution

Summarizing data & Visualizing Skew

5-number summary


Box plot (or box and whisker plot)



Top whisker = max; top of the box = Q3 (the 75th percentile)

Tukey box plot


The lower bound of the whiskers will be Q1 – 1.5 * IQR and the upper bound will be Q3 + 1.5 * IQR, which is called the Tukey box plot:
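A sketch of those whisker bounds in NumPy; the sample data (with 40 as a deliberate outlier) is invented:

```python
import numpy as np

def tukey_fences(data):
    """Whisker bounds of a Tukey box plot; points beyond them plot as outliers."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [2, 3, 3, 4, 5, 5, 6, 40]  # 40 is a hypothetical outlier
lower, upper = tukey_fences(data)
outliers = [value for value in data if value < lower or value > upper]
```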

Histograms



Important note: In practice, we need to play around with the number of bins to find the best value. However, we have to be careful, as this can misrepresent the shape of the distribution.
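One way to see this bin sensitivity without plotting is np.histogram; the seed, sample size, and bin counts below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility
data = rng.normal(loc=0, scale=1, size=1000)

# The same data binned two different ways; too few bins hides the shape,
# too many makes it look noisy
counts_coarse, _ = np.histogram(data, bins=5)
counts_fine, _ = np.histogram(data, bins=50)
```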

Kernel density estimates (KDEs)



KDEs can be used for discrete variables, but it is easy to confuse people that way.

Continuous variables 
(Heights or time)


Probability density function (PDF)


The PDF tells us how probability is distributed over the values (higher values = higher likelihood).

Discrete variables
(counts of people or die rolls)


Rolling a 6-sided die can only result in 6 possible outcomes (1, 2, 3, 4, 5, or 6). This is an example of a discrete variable because it isn't possible to roll a 2.2 or 3.4, etc. On the other hand, a continuous variable such as height can take any value or fraction of a value between two discrete numbers. A person's height can be 5'6" or 5'6.3", etc.

Both the KDE and Histogram
estimate the distribution.


Cumulative distribution function (CDF)


Empirical cumulative 
distribution function (ECDF)



import numpy as np
import matplotlib.pyplot as plt

# df is assumed to be a pandas DataFrame holding the column of interest
x = np.sort(df['column_name'])
y = np.arange(1, len(x) + 1) / len(x)
# by default plt.plot generates lines; to show just the data points,
# we pass '.' to marker and 'none' to linestyle
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('ECDF')
# to keep data off the plot edges, set the plot margins
plt.margins(0.02)


x-axis = the sorted data being measured


import numpy as np
x = np.sort(df['column_name'])

y-axis = evenly spaced data points with a maximum of 1


y = np.arange(1,len(x)+1) / len(x)

Cumulative Probability

Random Variable


Each column in our data is a random variable, because every time we observe it, we get a value according to the underlying distribution—it's not static.

Visualize Skew & Kurtosis

Kurtosis


Important note: There is also another statistic called kurtosis, which compares the density of the center of the distribution with the density at the tails. Both skewness and kurtosis can be calculated with the SciPy package.
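A sketch of how that SciPy calculation might look; the seed and sample sizes are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
symmetric = rng.normal(size=10_000)          # roughly no skew
right_skewed = rng.exponential(size=10_000)  # long tail on the right

# Positive skewness => right (positive) skew; near 0 => roughly symmetric
skew_sym = stats.skew(symmetric)
skew_right = stats.skew(right_skewed)

# stats.kurtosis() reports excess kurtosis by default (normal => about 0)
kurt_sym = stats.kurtosis(symmetric)
```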

Skew

Left (negative) skewed distribution


Right (positive) skewed distribution


No Skew


Common distributions


While there are many probability distributions, each with specific use cases, there are some that we will come across often.

The Gaussian, or normal, looks like a bell curve and is parameterized by its mean (μ) and standard deviation (σ). The standard normal (Z) has a mean of 0 and a standard deviation of 1. Many things in nature happen to follow the normal distribution, such as heights. Note that testing whether a distribution is normal is not trivial; check the Further reading section for more information.

The Poisson distribution is a discrete distribution that is often used to model arrivals. The time between arrivals can be modeled with the exponential distribution. Both are defined by their mean, lambda (λ).

The uniform distribution places equal likelihood on each value within its bounds. We often use this for random number generation.

When we generate a random number to simulate a single success/failure outcome, it is called a Bernoulli trial. This is parameterized by the probability of success (p). When we run the same experiment multiple times (n), the total number of successes is then a binomial random variable. Both the Bernoulli and binomial distributions are discrete.

We can visualize both discrete and continuous distributions; however, discrete distributions give us a probability mass function (PMF) instead of a PDF:
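These distributions can all be sampled with NumPy's random generator, which is a handy way to get a feel for them; the parameters below are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
N = 10_000

normal = rng.normal(loc=0, scale=1, size=N)        # Gaussian with mean mu, std dev sigma
poisson = rng.poisson(lam=3, size=N)               # discrete counts of arrivals, mean lambda
exponential = rng.exponential(scale=1/3, size=N)   # time between arrivals, mean 1/lambda
uniform = rng.uniform(low=0, high=1, size=N)       # equal likelihood on [0, 1)
bernoulli = rng.binomial(n=1, p=0.5, size=N)       # a Bernoulli trial is binomial with n=1
binomial = rng.binomial(n=10, p=0.5, size=N)       # successes in 10 Bernoulli trials
```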

Gaussian/normal


Poisson distribution


Exponential distribution


Uniform distribution


Bernoulli trial


Standard Normal (Z)

Binomial PMF - many Bernoulli trials


Quantifying relationships between variables

Covariance



E[X] is the expected value of X, or the expectation of X. It is calculated by summing all the possible values of X multiplied by their probabilities; it is the long-run average of X. The sign of the covariance tells us whether the variables are positively or negatively correlated.
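NumPy's cov function returns the full covariance matrix; the off-diagonal entry is the covariance between the two variables (toy data below):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])  # y rises with x, so cov(x, y) should be positive

# np.cov returns [[var(x), cov(x, y)], [cov(x, y), var(y)]]
cov_xy = np.cov(x, y)[0, 1]
```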


Correlation

Pearson correlation coefficient



To find the correlation, we calculate the Pearson correlation coefficient, symbolized by ρ (the Greek letter rho), by dividing the covariance by the product of the standard deviations of the variables. This normalizes the covariance and results in a statistic bounded between -1 and 1, making it easy to describe both the direction of the correlation (sign) and the strength of it (magnitude).
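np.corrcoef computes ρ directly; the manual division below just confirms it equals the covariance over the product of the standard deviations (toy data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # perfect positive linear relationship

rho = np.corrcoef(x, y)[0, 1]

# Same thing by hand: covariance / (std(x) * std(y)); use matching ddof throughout
manual = np.cov(x, y, ddof=0)[0, 1] / (np.std(x) * np.std(y))
```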


Perfect positive (linear) correlation
[as x increases y increases]

Scatter plot example of correlation
between x and y variables.


It is possible that there is another variable Z that causes both X and Y.

Perfect negative correlation
[as x increases y decreases]

Standardize data between two distributions

Scaling data


In order to compare variables from different distributions we have to scale the data. There are, of course, additional ways to scale our data, and the one we end up choosing will be dependent on our data and what we are trying to do with it. By keeping the measures of central tendency and measures of dispersion in mind, you will be able to identify how the scaling of data is being done in any other methods you come across.

min-max scaling



Take each data point and subtract the minimum of the dataset from it, then divide by the range. This normalizes the data (scales it to the range [0, 1]).
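In NumPy this is one line (the data values are invented for illustration):

```python
import numpy as np

data = np.array([3.0, 5.0, 7.0, 9.0])  # hypothetical values

# Subtract the minimum, then divide by the range: the result lies in [0, 1]
scaled = (data - data.min()) / (data.max() - data.min())
```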

Z-score



To standardize using the Z-score we would subtract the mean from each observation and then divide by the standard deviation to standardize the data. The resulting distribution is normalized with a mean of 0 and a standard deviation (and variance) of 1.
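A sketch of Z-score standardization with NumPy (toy data):

```python
import numpy as np

data = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # hypothetical values

# Subtract the mean, divide by the standard deviation
z = (data - data.mean()) / data.std()
```

The result has mean 0 and standard deviation 1 regardless of the original units.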


Sampling


Should always be randomly sampled.

Resampling

simple random sample

stratified random sample

randomly pick preserving the
population of groups in the data

bootstrap sample
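A sketch of simple random versus bootstrap sampling with NumPy (the population, data, and seed are arbitrary); a stratified sample could be drawn similarly by sampling within each group:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
population = np.arange(100)

# Simple random sample: drawn WITHOUT replacement
simple = rng.choice(population, size=10, replace=False)

# Bootstrap sample: same size as the original data, drawn WITH replacement
data = np.array([1, 2, 3, 4, 5])
boot = rng.choice(data, size=len(data), replace=True)
```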

Resources For Bootstrap Information

Unitless measure of Dispersion


Website References

Statistics how to