Statistics
Exploring Data
Display
Categorical data
Pie chart
bar chart
dotplot
Quantitative data
stem plot
back to back
splitting stem
trimming
histogram
frequency
relative frequency
ogive
cumulative frequency
Time plot
variable on vertical axis
time on the horizontal axis
Describing graphical displays
mode, center, spread, clusters, gaps, outliers
shape
symmetric
skewed
uniform
bell shaped (inverted bell)
Five number summary: box plot (for skewed distribution)
Q1: 25%
Q3: 75%
median
range, IQR
Maximum
Minimum
outliers: below Q1 - 1.5 IQR or above Q3 + 1.5 IQR
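The five-number summary and the 1.5 IQR rule above can be sketched as follows. This is a minimal illustration: the data values are made up, and the "inclusive" quartile method is an assumption, since textbooks and software use slightly different quartile conventions.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max); quartile method is an assumed convention."""
    xs = sorted(data)
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return xs[0], q1, med, q3, xs[-1]

def outliers_1_5_iqr(data):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # made-up illustrative values
print(five_number_summary(sample))
print(outliers_1_5_iqr(sample))
```

With these values the single large observation falls above the upper fence and is flagged.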
Mean and standard deviation (for symmetric distribution, free of outliers)
variance
standard deviation
always positive or 0
spread, outliers & skewness
Changing unit of measure
linear transformation of x: y = a + bx
mean: a + b(x-bar)
median: a + bM
standard deviation: |b|s
IQR: |b|R
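A quick numerical check of these rules, written with y = a + b*x. The values of x, a, and b are made up (they mimic a Celsius to Fahrenheit conversion). Note that spread measures scale by |b| and ignore a.

```python
import statistics

x = [10.0, 12.0, 15.0, 19.0, 24.0]  # made-up measurements
a, b = 32.0, 1.8                    # hypothetical shift and scale
y = [a + b * xi for xi in x]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
median_x, median_y = statistics.median(x), statistics.median(y)
sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)

# Center moves with both a and b; spread only scales with |b|.
print(mean_y, a + b * mean_x)
print(median_y, a + b * median_x)
print(sd_y, abs(b) * sd_x)
```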
Comparing distributions
categorical data: side by side bar graph
quantitative values
back to back stemplots
side by side boxplots
Describing location in a distribution
measures of relative standing
Chebyshev's inequality (holds for any distribution, even a skewed one): at least 100(1 - 1/k^2)% of observations lie within k standard deviations of the mean, k > 1
z-score
percentile (less than or equal to)
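The three measures of relative standing can be sketched in a few lines. The function names are illustrative, and the z-score here uses the sample standard deviation, which is an assumption about the convention intended.

```python
import statistics

def z_scores(data):
    """Standardize each value: (x - mean) / sd, using the sample sd."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return [(x - m) / s for x in data]

def chebyshev_min_percent(k):
    """Minimum percent of observations within k sd of the mean (any distribution)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 100 * (1 - 1 / k**2)

def percentile_rank(data, x):
    """Percent of observations less than or equal to x."""
    return 100 * sum(1 for v in data if v <= x) / len(data)

print(z_scores([2, 4, 6]))
print(chebyshev_min_percent(2))
print(percentile_rank([1, 2, 3, 4], 2))
```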
assessing normality
graphical display-bell shaped
proportion of observations, empirical rule
normal probability plot, linear/straight line
normal distribution
symmetric, unimodal and bell-shaped
empirical rule, N(mu, sigma)
68% fall within 1 sd of the mean
95% fall within 2 sd of the mean
99.7% fall within 3 sd of the mean
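The empirical rule percentages can be checked against the standard normal curve. This sketch uses `statistics.NormalDist` from the Python standard library. The helper name `within` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def within(k):
    """Proportion of a normal distribution within k sd of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * within(k), 1))
```

The results are close to 68%, 95%, and 99.7%. Because a linear transformation does not change the shape, the same proportions hold for any N(mu, sigma).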
probability density function
standard normal distribution
no shape change for linear transformation
N(0,1)
density curve
area under the curve
total is 1
proportion
median: equal areas point
mean: balance point
Examining relationships
data
categorical or quantitative
explanatory variable x, response variable y
scatterplots
direction, form, strength, overall pattern, association (+, -), linear, outliers
different colors or symbols for categories
correlation
r measures the direction and strength of a linear relationship
r lies between -1 and 1; r = +1 or -1 means a perfect straight-line relation
as r moves away from 0 toward +1 or -1, the relation gets stronger
regression line
predict y
line passes through (x-bar, y-bar)
r^2: proportion of the variation in y explained by the least-squares regression line of y on x
extrapolation: predicting outside the range of observed x values, often unreliable
lurking variable: neither x nor y, but influences the interpretation of the relationship between x and y
assessing model quality
coefficient of determination
residual plot
residuals against x (or against predicted y)
assess how well the regression line fits the data
mean of residuals is always 0
no obvious pattern
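The correlation, least-squares line, r^2, and residual facts above can be checked on a small data set. This is a sketch with made-up values; `least_squares` is an illustrative helper written from the standard formulas, not a library routine.

```python
import math

def least_squares(xs, ys):
    """Return (r, slope, intercept) for the line y-hat = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = my - slope * mx  # forces the line through (x-bar, y-bar)
    return r, slope, intercept

xs = [1, 2, 3, 4, 5]  # made-up explanatory values
ys = [2, 4, 5, 4, 5]  # made-up responses
r, slope, intercept = least_squares(xs, ys)
r_squared = r ** 2
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(r, slope, intercept, r_squared)
print(sum(residuals))  # zero up to rounding error
```

Plotting `residuals` against `xs` gives the residual plot. A good linear fit shows no obvious pattern.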
More about relationships between two variables
transforming the variable
linear growth
y = a + bx
increases by a fixed amount
exponential growth
y = ab^x
ln y = ln a + x ln b
increases by a fixed percent
power law model
y = ax^p
take the logarithm of both sides: ln y = ln a + p ln x
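The log transformations above turn both models into straight-line fits. The sketch below generates exact data from assumed parameters (a = 3, b = 2 for the exponential model; a = 5, p = 1.5 for the power model) and recovers them. `fit_line` is an illustrative least-squares helper.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Exponential model y = a * b**x: regress ln y on x.
xs = [0, 1, 2, 3, 4]
ys = [3 * 2 ** x for x in xs]
c0, c1 = fit_line(xs, [math.log(y) for y in ys])
a_exp, b_exp = math.exp(c0), math.exp(c1)

# Power model y = a * x**p: regress ln y on ln x (x must be positive).
xp = [1, 2, 3, 4, 5]
yp = [5 * x ** 1.5 for x in xp]
d0, d1 = fit_line([math.log(x) for x in xp], [math.log(y) for y in yp])
a_pow, p_pow = math.exp(d0), d1

print(a_exp, b_exp)
print(a_pow, p_pow)
```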
establishing causation
causation: x -> y, usually established by experiment
common response: z -> x and z -> y
confounding: z -> y, with the effect of x on y unclear (x ?-> y)
criteria for causation
consistent association
the alleged cause is plausible
larger values of x are associated with stronger responses y
strong association
the alleged cause precedes the effect in time
relations in categorical data
two way table
marginal distribution
row sums
column sums
conditional distribution
entry/column total
entry/row total
Simpson's paradox: an association that holds for all groups can reverse when the groups are combined
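Marginal and conditional distributions can be computed directly from a two-way table. The counts and labels below are hypothetical and serve only to show the arithmetic.

```python
# Hypothetical two-way table: rows are groups, columns are responses.
table = {
    "group A": {"yes": 30, "no": 20},
    "group B": {"yes": 10, "no": 40},
}

row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {}
for cells in table.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count
grand_total = sum(row_totals.values())

# Marginal distributions: row sums and column sums over the grand total.
marginal_rows = {row: t / grand_total for row, t in row_totals.items()}
marginal_cols = {col: t / grand_total for col, t in col_totals.items()}

# Conditional distribution of the response within each row: entry / row total.
cond_given_row = {
    row: {col: n / row_totals[row] for col, n in cells.items()}
    for row, cells in table.items()
}

print(marginal_rows, marginal_cols)
print(cond_given_row)
```

Dividing each entry by its column total instead gives the conditional distribution of group within each response.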
Binomial & Geometric distribution
Bernoulli distribution
random phenomena
two outcomes of interest
x = 1 success, x = 0 failure
continuity correction
binomial distribution
conditions
B(n,p)
formulas
mean and standard deviation
normal approximation
representations
condition
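A small sketch of the binomial formulas and the normal approximation with a continuity correction. The values n = 20 and p = 0.5 are made up. The rule of thumb that np and n(1 - p) should both be at least 10 is a common textbook condition and is assumed here, since the notes do not state one.

```python
import math
from statistics import NormalDist

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.5                     # made-up parameters
mean = n * p                       # np
sd = math.sqrt(n * p * (1 - p))    # sqrt(np(1 - p))

# Exact P(X <= 8) versus the normal approximation evaluated at 8.5
# (the extra 0.5 is the continuity correction).
exact = sum(binom_pmf(k, n, p) for k in range(0, 9))
approx = NormalDist(mean, sd).cdf(8.5)

print(mean, sd)
print(exact, approx)
```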
geometric distribution
conditions
P(X = n) = (1 - p)^(n-1) p
mean and standard deviation
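The geometric formulas can be checked numerically. Here X counts the trial on which the first success occurs, with mean 1/p and standard deviation sqrt(1 - p)/p. The value p = 0.25 is made up.

```python
import math

def geom_pmf(n, p):
    """P(X = n): first success on trial n, i.e. n - 1 failures then a success."""
    return (1 - p) ** (n - 1) * p

p = 0.25                      # made-up success probability
mean = 1 / p
sd = math.sqrt(1 - p) / p

# Truncated sums over many trials approximate the full distribution.
total = sum(geom_pmf(n, p) for n in range(1, 200))
approx_mean = sum(n * geom_pmf(n, p) for n in range(1, 200))

print(mean, sd)
print(total, approx_mean)
```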
Sampling distribution
parameters
statistic
sampling variability
sample mean
distribution of values taken by the statistic in all possible samples of the same size from the same population
sample proportions
bias and variability
unbiased if the mean of the sampling distribution equals the parameter (e.g. mean of x-bar = mu)
variability described by spread
determined by sampling design
larger samples give smaller spread
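For a tiny population the sampling distribution of the sample mean can be listed exhaustively instead of simulated. The population below is hypothetical, and samples are drawn without replacement. The sketch illustrates both unbiasedness and the smaller spread of larger samples.

```python
import statistics
from itertools import combinations

population = [2, 4, 6, 8, 10]      # hypothetical population
mu = statistics.mean(population)   # the parameter

def sample_means(size):
    """Sample mean of every possible sample of the given size."""
    return [statistics.mean(s) for s in combinations(population, size)]

means_2 = sample_means(2)
means_4 = sample_means(4)

# Unbiased: the sampling distribution is centered at the parameter.
print(mu, statistics.mean(means_2), statistics.mean(means_4))
# Larger samples give a smaller spread.
print(statistics.pstdev(means_2), statistics.pstdev(means_4))
```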
Estimating with confidence
confidence interval: statistic ± margin of error
confidence level C
when population sd is unknown
t distribution,one sample
when population sd is known
conditions
SRS
normality
independence
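A minimal sketch of the known-sd case: a z interval for a population mean, statistic ± z* sigma/sqrt(n). The function name and the input numbers are illustrative, and the conditions listed above are assumed to hold.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, level=0.95):
    """Confidence interval for the mean when the population sd is known."""
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)  # critical value
    margin = z_star * sigma / math.sqrt(n)              # margin of error
    return xbar - margin, xbar + margin

low, high = z_interval(xbar=50, sigma=10, n=25, level=0.95)  # made-up inputs
print(low, high)
```

When the population sd is unknown, the critical value comes from a t distribution with n - 1 degrees of freedom and the sample sd replaces sigma. The Python standard library has no t distribution, so that case is not shown.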
Testing a claim
significance test
hypothesis
null
alternative
conditions
SRS
normality
independence
test statistics
p value
statistical significance
significance level
z test for the population mean when sd is known
confidence intervals and two-sided tests
a confidence interval cannot be used in place of a significance test when the test is one-sided
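The z test for a mean with known sd can be sketched as below. The function name, the labels for the alternative hypothesis, and the numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for H0: mu = mu0 with known population sd."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))  # test statistic
    std = NormalDist()
    if alternative == "greater":
        p_value = 1 - std.cdf(z)
    elif alternative == "less":
        p_value = std.cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * (1 - std.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, p_value

z, p_value = z_test(xbar=52, mu0=50, sigma=10, n=25)  # made-up inputs
alpha = 0.05                                          # significance level
print(z, p_value, p_value < alpha)
```

A two-sided test at level alpha rejects H0 exactly when mu0 lies outside the level 1 - alpha confidence interval, which is the link between confidence intervals and two-sided tests.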