Statistics

Exploring Data

Display

Categorical data

Pie chart

bar chart

dotplot

Quantitative data

stem plot

back to back

splitting stem

trimming

histogram

frequency

relative frequency

ogive

cumulative frequency

Time plot

variable on vertical axis

time on the horizontal axis

Describing graphical displays

mode, center, spread, clusters, gaps, outliers

shape

symmetric

skewed

uniform

bell shaped (inverted bell)

Five number summary: box plot (for skewed distribution)

Q1: 25%

Q3: 75%

median

range, IQR

Maximum

Minimum

outliers: below Q1-1.5IQR or above Q3+1.5IQR
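The five-number summary and the 1.5 IQR rule can be sketched as follows. This is a minimal illustration, assuming the median-of-halves convention for quartiles (software packages use several slightly different conventions); the function names and the data are made up for the example.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving out the middle value when the number of observations is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # observations below the median position
    upper = data[half + n % 2:]    # observations above the median position
    return data[0], median(lower), median(data), median(upper), data[-1]

def outliers(values):
    """Return observations outside the fences Q1 - 1.5 IQR and Q3 + 1.5 IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

scores = [2, 4, 5, 7, 8, 9, 10, 12, 30]   # made-up data
print(five_number_summary(scores))
print(outliers(scores))
```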

Mean and standard deviation (for symmetric distribution, free of outliers)

variance

standard deviation

always positive or 0

spread, outliers & skewness

Changing unit of measure

linear transformation of x: y=a+bx

mean: a+b(x-bar)

median: a+bM

standard deviation: |b|s

IQR: |b|R
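A quick numerical check of these rules, assuming the transformation is written y = a + b*x (a shifts, b rescales). The helper name `transform` and the temperature values are made up for illustration.

```python
from statistics import mean, median, stdev

def transform(values, a, b):
    """Apply the linear transformation y = a + b*x to every observation."""
    return [a + b * x for x in values]

celsius = [10.0, 12.0, 15.0, 21.0, 22.0]         # made-up temperatures
fahrenheit = transform(celsius, a=32.0, b=1.8)   # unit change: F = 32 + 1.8 C

# Measures of center pick up both the shift a and the scale b.
print(mean(fahrenheit), 32.0 + 1.8 * mean(celsius))
print(median(fahrenheit), 32.0 + 1.8 * median(celsius))

# Measures of spread ignore the shift a and scale by |b|.
print(stdev(fahrenheit), 1.8 * stdev(celsius))

# With a negative multiplier the spread is still multiplied by |b|.
flipped = transform(celsius, a=0.0, b=-2.0)
print(stdev(flipped), 2.0 * stdev(celsius))
```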

Comparing distributions

categorical data: side by side bar graph

quantitative values

back to back stemplots

side by side boxplots

Describing location in a distribution

measures of relative standing

Chebyshev's inequality (holds for any distribution, including skewed ones): at least 100(1-1/k^2)% of observations lie within k standard deviations of the mean

z-score

percentile (less than or equal to)
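The three measures of relative standing can be sketched in a few lines. The function names are illustrative, and the percentile uses the "less than or equal to" definition from these notes.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

def chebyshev_bound(k):
    """Minimum percent of observations within k standard deviations of the mean.

    Valid for any distribution; only informative for k > 1.
    """
    return 100 * (1 - 1 / k**2)

def percentile_rank(x, values):
    """Percent of observations less than or equal to x."""
    return 100 * sum(v <= x for v in values) / len(values)

print(z_score(85, mu=70, sigma=10))
print(chebyshev_bound(2), chebyshev_bound(3))
print(percentile_rank(7, list(range(1, 11))))
```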

assessing normality

graphical display-bell shaped

proportion of observation, empirical rule

normal probability plot, linear/straight line

normal distribution

symmetric, unimodal and bell-shaped

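empirical rule, N(mu, sigma)

68% fall within 1 standard deviation of the mean

95% fall within 2 standard deviations of the mean

99.7% fall within 3 standard deviations of the mean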
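The 68-95-99.7 figures are rounded areas under the normal curve within 1, 2 and 3 standard deviations of the mean. A small check using `statistics.NormalDist` from the Python standard library; the helper name `proportion_within` is made up for this sketch.

```python
from statistics import NormalDist

def proportion_within(k, mu=0.0, sigma=1.0):
    """Area under N(mu, sigma) between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(proportion_within(k), 4))

# The result does not depend on mu or sigma, only on k.
print(round(proportion_within(2, mu=100.0, sigma=15.0), 4))
```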

probability density function

standard normal distribution

no shape change for linear transformation

N(0,1)

density curve

area under the curve

total is 1

proportion

mean: balance point

median: equal-areas point

Examining relationships

data

categorical or quantitative

explanatory variable x, response variable y

scatterplots

direction, form, strength, overall pattern, association (+, -), linear, outliers

different colors or symbols for categories

correlation

r: measures the direction and strength of a linear relationship

r lies between -1 and 1; r=+1 or r=-1 means a perfect straight-line relation

as r moves away from 0 toward +1 or -1, the relation gets stronger
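One common way to compute r is as the average product of standardized x and y values, dividing by n - 1. A minimal sketch with made-up data; the function name `correlation` is chosen for the example.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

xs = [1, 2, 3, 4, 5]
print(correlation(xs, [2, 4, 6, 8, 10]))   # points on a rising line
print(correlation(xs, [10, 8, 6, 4, 2]))   # points on a falling line
print(correlation(xs, [2, 5, 3, 6, 4]))    # weaker positive association
```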

regression line

predict y

line passes through (x-bar, y-bar)

r^2: percent of variation in y that can be explained by the least-squares regression line relating y and x

extrapolation: predicting outside the range of observed x values; not reliable

lurking variable: neither x nor y, but influences the interpretation of the relationship between x and y

assessing model quality

coefficient of determination

residual plot

residuals against x (or against predicted y)

assess how well the regression line fits the data

mean of residuals is always 0

no obvious pattern
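A least-squares fit with its residuals and r^2 can be sketched as below, using made-up data and the line written as y-hat = intercept + slope * x. The function name `least_squares` is illustrative.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar   # forces the line through the point of means
    return intercept, slope

xs = [1, 2, 3, 4, 5]                    # made-up data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = least_squares(xs, ys)

predicted = [intercept + slope * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]   # observed minus predicted

# Coefficient of determination: share of the variation in y explained by the line.
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((y - mean(ys)) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(intercept, slope)
print(sum(residuals))    # zero up to rounding error
print(r_squared)
```

Plotting `residuals` against `xs` gives the residual plot; a good linear fit shows no obvious pattern.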

More about relationships between two variables

transforming the variable

linear growth

y=a+bx

increases by a fixed amount

exponential growth

y=ab^x, so lny=lna+xlnb

increases by a fixed percent

power law model

y=ax^p

take logarithm of both sides,lny=lna+plnx
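Both transformations reduce to fitting a straight line after taking logarithms. The sketch below assumes all values being logged are positive; the function names are made up, and the data are generated from known models so the fit can be checked.

```python
from math import exp, log
from statistics import mean

def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def fit_exponential(xs, ys):
    """Fit y = a * b**x by regressing ln y on x."""
    ln_a, ln_b = fit_line(xs, [log(y) for y in ys])
    return exp(ln_a), exp(ln_b)

def fit_power(xs, ys):
    """Fit y = a * x**p by regressing ln y on ln x."""
    ln_a, p = fit_line([log(x) for x in xs], [log(y) for y in ys])
    return exp(ln_a), p

xs = [1, 2, 3, 4, 5]
a_exp, b_exp = fit_exponential(xs, [3 * 2 ** x for x in xs])   # data from y = 3 * 2^x
a_pow, p_pow = fit_power(xs, [5 * x ** 2 for x in xs])         # data from y = 5 * x^2
print(a_exp, b_exp)
print(a_pow, p_pow)
```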

establishing causation

causation usually from experiment x~y

common response z~x,y

confounding effect z~y,x?y

criteria for causation

consistent association

the alleged cause is plausible

larger values of x are associated with stronger responses in y

strong association

the alleged cause precedes the effect in time

relations in categorical data

two way table

marginal distribution

row sums

column sums

conditional distribution

entry/column total

entry/row total
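Marginal and conditional distributions from a two-way table, sketched with made-up counts. The row and column labels are placeholders.

```python
# Made-up counts: rows are groups, columns are answers.
table = {
    "group A": {"yes": 30, "no": 20},
    "group B": {"yes": 10, "no": 40},
}

columns = list(next(iter(table.values())))
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {col: sum(table[row][col] for row in table) for col in columns}
grand_total = sum(row_totals.values())

# Marginal distributions: row sums and column sums over the grand total.
row_marginal = {row: total / grand_total for row, total in row_totals.items()}
col_marginal = {col: total / grand_total for col, total in col_totals.items()}

# Conditional distributions: each entry over its row total or its column total.
given_row = {row: {col: table[row][col] / row_totals[row] for col in columns}
             for row in table}
given_col = {col: {row: table[row][col] / col_totals[col] for row in table}
             for col in columns}

print(row_marginal, col_marginal)
print(given_row["group A"])
print(given_col["yes"])
```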

Simpson's paradox: an association that holds for all groups can reverse direction when the groups are combined
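A small numerical illustration of Simpson's paradox, with counts invented for the purpose: treatment A has the higher success rate in each group, yet treatment B has the higher rate once the groups are pooled, because A was used mostly on the harder cases.

```python
# Made-up (successes, trials) counts for two treatments in two groups.
results = {
    "mild cases":   {"A": (8, 10),   "B": (70, 100)},
    "severe cases": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    """Success rate as a proportion."""
    return successes / trials

def combined_rate(treatment):
    """Success rate for one treatment after pooling all groups."""
    successes = sum(results[group][treatment][0] for group in results)
    trials = sum(results[group][treatment][1] for group in results)
    return rate(successes, trials)

for group, by_treatment in results.items():
    print(group, rate(*by_treatment["A"]), rate(*by_treatment["B"]))
print("combined", combined_rate("A"), combined_rate("B"))
```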

Binomial & Geometric distribution

Bernoulli distribution

random phenomena

two outcomes of interest

x=1 success, x=0 failure

continuity correction

binomial distribution

conditions

B(n,p)

formulas

mean and standard deviation

normal approximation

representations

condition
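The binomial formulas and the normal approximation can be sketched as follows. The notes do not state the condition for the approximation; a common textbook rule of thumb, assumed here, is np >= 10 and n(1-p) >= 10. The approximation below uses a continuity correction of 0.5, and all function names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_cdf(k, n, p):
    """P(X <= k), summing the exact probabilities."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

def binomial_mean_sd(n, p):
    """Mean np and standard deviation sqrt(np(1-p))."""
    return n * p, sqrt(n * p * (1 - p))

def normal_approx_cdf(k, n, p):
    """Approximate P(X <= k) with N(np, sqrt(np(1-p))) and a continuity correction."""
    mu, sigma = binomial_mean_sd(n, p)
    return NormalDist(mu, sigma).cdf(k + 0.5)

n, p = 100, 0.5
print(binomial_mean_sd(n, p))
print(binomial_cdf(55, n, p), normal_approx_cdf(55, n, p))
```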

geometric distribution

conditions

P(X=n)=(1-p)^(n-1)p

mean and standard deviation
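For the geometric distribution, counting the trial on which the first success occurs, the probability, mean 1/p and standard deviation sqrt(1-p)/p can be checked numerically. The function names are made up for this sketch.

```python
from math import sqrt

def geometric_pmf(n, p):
    """P(X = n): n - 1 failures followed by the first success on trial n."""
    return (1 - p) ** (n - 1) * p

def geometric_mean_sd(p):
    """Mean 1/p and standard deviation sqrt(1 - p)/p."""
    return 1 / p, sqrt(1 - p) / p

p = 0.25
print(geometric_pmf(3, p))
print(geometric_mean_sd(p))

# The probabilities over n = 1, 2, 3, ... add up to 1 (truncated sum shown here).
print(sum(geometric_pmf(n, p) for n in range(1, 200)))
```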

Sampling distribution

parameters

statistic

sampling variability

sample mean

distribution of values taken by the statistic in all possible samples of the same size from the same population

sample proportions

bias and variability

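unbiased if the mean of the sampling distribution equals the true value of the parameter (e.g., mean of x-bar = mu)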

variability described by spread

determined by sampling design

larger samples give smaller spread
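For a tiny population, the sampling distribution of the sample mean can be listed exactly by enumerating every possible sample. The sketch below uses a made-up population of five values and samples drawn without replacement; it illustrates both unbiasedness and the smaller spread of larger samples.

```python
from itertools import combinations
from statistics import mean, pstdev

population = [2, 4, 6, 8, 10]   # made-up population

def sampling_distribution_of_mean(pop, n):
    """Sample means of all possible samples of size n, drawn without replacement."""
    return [mean(sample) for sample in combinations(pop, n)]

means_2 = sampling_distribution_of_mean(population, 2)
means_3 = sampling_distribution_of_mean(population, 3)

# Unbiased: the sampling distribution is centered at the population mean.
print(mean(population), mean(means_2), mean(means_3))

# Variability: larger samples give a smaller spread.
print(pstdev(means_2), pstdev(means_3))
```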

Estimating with confidence

confidence interval: statistic +/- margin of error

confidence level C

when population sd is unknown

t distribution,one sample

when population sd is known

conditions

SRS

normality

independence
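A sketch of the interval for a population mean when the population sd is known: x-bar +/- z* times sd/sqrt(n). It assumes the conditions above hold. The function name and the sample are made up. The unknown-sd case would swap z* for a t critical value with n - 1 degrees of freedom, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_confidence_interval(sample, sigma, level=0.95):
    """Interval for the population mean, given a known population sd sigma."""
    z_star = NormalDist().inv_cdf((1 + level) / 2)   # critical value for level C
    x_bar = mean(sample)
    margin = z_star * sigma / sqrt(len(sample))       # margin of error
    return x_bar - margin, x_bar + margin

sample = [48, 52, 50, 49, 51, 53, 47, 50, 50]   # made-up data, n = 9
low, high = z_confidence_interval(sample, sigma=3.0, level=0.95)
print(round(low, 2), round(high, 2))
```

Raising the confidence level widens the interval; a larger sample narrows it.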

Testing a claim

significance test

hypothesis

null

alternative

conditions

SRS

normality

independence

test statistics

p value

statistical significance

significance level

z test for the population mean when sd is known
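The z test can be sketched as computing the test statistic z = (x-bar - mu_0) / (sd / sqrt(n)) and then the p value from the standard normal distribution. The function name, the `alternative` labels and the numbers are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu_0, sigma, n, alternative="two-sided"):
    """Return (z, p value) for H0: mu = mu_0 with known population sd sigma."""
    z = (sample_mean - mu_0) / (sigma / sqrt(n))
    standard_normal = NormalDist()
    if alternative == "greater":        # Ha: mu > mu_0
        p_value = 1 - standard_normal.cdf(z)
    elif alternative == "less":         # Ha: mu < mu_0
        p_value = standard_normal.cdf(z)
    else:                               # Ha: mu != mu_0
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
    return z, p_value

z, p_two_sided = z_test(sample_mean=52, mu_0=50, sigma=6, n=36)
_, p_greater = z_test(sample_mean=52, mu_0=50, sigma=6, n=36, alternative="greater")
print(z, round(p_two_sided, 4), round(p_greater, 4))
```

The result is statistically significant at level alpha when the p value is at most alpha.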

confidence intervals and two sided test

a confidence interval cannot be used in place of a significance test when the test is one sided