by Chen Yushi, 14 years ago
lack of realism limits the ability to apply the conclusions of an experiment to the settings of greater interest
double-blind experiment
Do something to individuals in order to observe the response
Matched Pairs Design
matched subjects are more similar than unmatched subjects, so the comparison is more effective
compares two treatments, with the subjects matched in pairs
an example of block design
Block Design
characteristics
allows separate conclusions to be drawn about each block
chosen based on the likelihood
can have any size
another form of control, which controls the effects of some outside variables by bringing those variables into the experiment to form the blocks
formed based on the most important unavoidable sources of variability among the experimental units
random assignment of units to treatments is carried out separately within a block
expected to systematically affect the response to the treatments
a group of experimental units or subjects similar in some way
Randomizations
statistically significant
would rarely occur by chance
observed effect is very large
completely randomized design
all the experimental units are allocated at random among all treatments
randomized comparative experiment
ensure that influences other than the treatments operate equally on all groups
divide experimental units into groups by SRS
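A completely randomized design can be sketched in a few lines. This is only an illustration: the function name `completely_randomized_design`, the unit labels 1 to 20, and the two treatment names are assumptions, not part of the notes.

```python
import random

def completely_randomized_design(units, treatments, seed=None):
    """Allocate every unit at random among the treatments, in near-equal groups."""
    rng = random.Random(seed)      # seed is only for reproducibility
    shuffled = list(units)
    rng.shuffle(shuffled)          # a random order gives a random allocation
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        # deal the shuffled units out to the treatments in turn
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical example: 20 labelled units, one treatment group and one control group
groups = completely_randomized_design(range(1, 21), ["treatment", "control"], seed=1)
print(groups)
```

Because chance alone decides the groups, influences other than the treatments should operate roughly equally on both groups.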
Replication
increase the sensitivity of the experiment to differences between treatments
reduce the role of variation
natural variability among the experimental units
Control
placebo
a dummy treatment
overall effort to minimize variability in the way experimental units are obtained and treated
reduce the problems posed by confounding and lurking variables
compares the responses in the treatment group and the control group
another group does not receive any treatment (control group)
a group receives the treatment
Sampling Bias
response bias
the ordering of the questions may influence the response
respondent desires to please the interviewer
respondent may fail to understand the question
not give truthful responses to a question
wording bias
wording of the question influences the response in a systematic way
non-response bias
the possible biases of those who choose not to respond
persons who feel most strongly about an issue are most likely to respond
self-selected samples
Undercoverage
some part of population being sampled is somehow excluded
all the individuals in the chosen clusters are selected to be in the sample
randomly select some of the clusters
divide population into groups as clusters
Stratified Random Sampling
subgroups of the population (strata) appear in approximately the same proportion in the sample as they do in the population
systematic sample
the rest are chosen according to some well-defined pattern
first member of the sample is chosen according to some random procedure
Simple Random Sampling
4 steps to choose SRS
4 identify sample
3 stopping rule
2 table
1 label
every possible sample is equally likely to be chosen
A sample of a given size
random sample
each member of the population is equally likely to be included
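The four steps for choosing an SRS (label, table, stopping rule, identify sample) can be mimicked in software. In this sketch a random number generator stands in for the table of random digits; the function name `simple_random_sample` and the population of 100 labels are assumptions made for illustration.

```python
import random

def simple_random_sample(labels, n, seed=None):
    """Draw n distinct labels so that every sample of size n is equally likely."""
    rng = random.Random(seed)           # plays the role of the random digit table
    return rng.sample(list(labels), n)  # stops after n distinct labels are chosen

# Step 1: label the population (hypothetical labels 1..100)
labels = list(range(1, 101))
# Steps 2-4: draw random labels, stop at 10, and identify the sample
sample = simple_random_sample(labels, 10, seed=42)
print(sorted(sample))
```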
Convenience Sampling
undercoverage bias
individuals who are easiest to reach
Voluntary response sampling
voluntary response bias
people who feel most strongly about an issue are most likely to respond
people who choose themselves by responding to a general appeal
Population and Sample
sample
sampling
may miss out certain characteristics of the population
advantages
less time needed
sampling involves studying a part in order to gain information about the whole
part of population that we actually examine in order to gather information
population
census
disadvantages
time consuming
expensive
advantage
able to find all characteristics of the population accurately
census attempts to contact every individual in the entire population
entire group of individuals
experiment
allows us to pin down the effects of specific variables of interest
observe the responses
deliberately impose some treatment on individuals
Observational study
cheaper
explanatory variable is confounded with lurking variable
observational studies of the effect of one variable on another often fail
do not attempt to influence the responses
observe individuals and measure variables of interest
Explaining Causation
Explaining Association
Simpson's Paradox
Power Law Model
Dependent variable = log(y)
Independent variable = log(x)
log(y) = b0 + b1log(x)
ŷ = 10^(b0 + b1log(x))
Exponential Growth Model
Dependent variable = log(y)
log(y) = b0 + b1x
ŷ = 10^(b0 + b1x)
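Both models are fitted by running an ordinary least-squares line on transformed data and then undoing the base-10 logarithm. The sketch below assumes made-up data that follow each model exactly (y = 3·2^x and y = 2·x^3); the helper `fit_line` and the prediction functions are illustrative names.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Exponential growth: regress log(y) on x, then predict with 10^(b0 + b1*x)
xs = [0, 1, 2, 3, 4]
ys = [3 * 2 ** x for x in xs]                      # hypothetical data
b0, b1 = fit_line(xs, [math.log10(y) for y in ys])

def predict_exponential(x):
    return 10 ** (b0 + b1 * x)

# Power law: regress log(y) on log(x), then predict with 10^(c0 + c1*log(x))
us = [1, 2, 3, 4, 5]
vs = [2 * u ** 3 for u in us]                      # hypothetical data
c0, c1 = fit_line([math.log10(u) for u in us], [math.log10(v) for v in vs])

def predict_power(x):
    return 10 ** (c0 + c1 * math.log10(x))

print(predict_exponential(5), predict_power(6))
```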
Correlations Based on Averaged Data
Correlation: Measuring Linear Association
Lurking Variable
not among the explanatory or response variable in a study
Outliers and Influential Observations in Regression
Coefficient of Determination
r^2
fraction of the variation in the values of y that is explained by the least-squares regression of y on x
Residuals & Residuals Plot
Std of residual
residual plot
should show no obvious pattern for a linear relationship
how well the regression line fits the data
residual
observed y - predicted y
Correlation
Least Squares Regression Line
y = a + bx
predict the value of y
A straight line that describes how a response variable y changes as an explanatory variable x changes
Does not measure curved relationships between variables
Measures the direction and strength of a linear relationship
r near to 1 or -1
strong linear relationship
r near to 0
weak relationship
Between -1 and 1
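The quantities above (the line y = a + bx, the correlation r, r^2, and the residuals) can be computed directly from their definitions. This is a minimal sketch on a small made-up data set; the variable names are assumptions.

```python
from math import sqrt

xs = [1, 2, 3, 4, 5]          # explanatory variable (hypothetical data)
ys = [2, 4, 5, 4, 5]          # response variable (hypothetical data)

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

b = sxy / sxx                 # slope of the least-squares line
a = mean_y - b * mean_x       # intercept, so the line is y = a + bx
r = sxy / sqrt(sxx * syy)     # correlation, always between -1 and 1

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"y = {a:.2f} + {b:.2f}x, r = {r:.3f}, r^2 = {r * r:.3f}")
```

The residuals of a least-squares line always sum to zero, which is a quick check on the arithmetic.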
Categorical Variables in Scatterplots
To display the different categories
different symbol
different plot color
Positive & Negative Association
Negatively associated
above-average values of one tend to accompany below-average values of the other and vice versa
Positively associated
Above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together
Interpreting a Scatterplot
Describe the pattern by the direction, form and strength of the relationship
Outlier: an individual value that falls outside the overall pattern of the relationship
Look for overall pattern and for striking deviations from that pattern
Plot the explanatory variable on the horizontal axis, and the response variable on the vertical axis
The relationship between 2 quantitative variables measured on the same individuals
The explanatory variable does not necessarily cause the change in the response variable
Identify a response and an explanatory variable by specifying values of one variable in order to see how it affects another variable
An explanatory variable helps explain or influences changes in a response variable
A response variable measures an outcome of a study
Assessing Normality
A Normal probability plot
Systematic deviations from a straight line indicate a non-Normal distribution
Outliers appear as points that are far away from the overall pattern of the plot
If the points on a Normal probability plot lie close to a straight line, the plot indicates the data are Normal
How well the data fit the empirical rule
Box plots
Stem plots
Histograms
Standard Normal Calculation
Standard normal table
Proportion of observations that lie in some range of values
Standard Normal Distribution
For any Normal distribution we can perform a linear transformation on the variable to obtain a standard Normal distribution
A Normal distribution with mean 0 and standard deviation 1
Empirical Rule
Almost all (99.7%) of the values lie within 3 standard deviations of the mean
μ ± 3σ.
About 95% of the values lie within 2 standard deviations of the mean
μ ± 2σ.
About 68% of the values lie within 1 standard deviation of the mean
μ ± σ.
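The 68-95-99.7 figures can be checked against the Normal cumulative distribution. A minimal sketch using Python's `statistics.NormalDist`; the helper name `proportion_within` is an assumption.

```python
from statistics import NormalDist

def proportion_within(k, mu=0.0, sigma=1.0):
    """Proportion of a Normal(mu, sigma) distribution within k standard deviations of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {proportion_within(k):.4f}")
```

The result does not depend on μ or σ, which is why the rule holds for every Normal distribution.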
Normal Distribution
Importance
Many statistical inference procedures based on Normal distribution work well for other roughly symmetrical distributions
Good approximations to results of many chance outcomes
Often a good descriptor for some distribution of real data
Density function
These density curves are symmetric, single-peaked and bell-shaped
Position of the Score (how your score compares to other people's scores)
Mean & Median of Density Curve
For different types of curves
Skewed left: Mean is on the left side of the median
Skewed right: Mean is on the right side of the median
Symmetric density curve: Mean = Median
Mean: "balance point"
Median: "equal-areas point"
Density Curves
Mean and standard deviation
The area underneath is exactly 1
Always on or above the horizontal axis
Idealized description
A mathematical model for the distribution
Chebyshev's Inequality
In any distribution, the percentage of observations falling within k standard deviations of the mean is at least (1 - 1/k^2) × 100%, for any k > 1
The pth percentile of a distribution is defined as the value with p percent of the observations less than or equal to it
Z-Score
Measures how many standard deviations an observation is away from the mean
Comparing Distribution
Change Unit of Measures
Measuring Spread
Percentile
Box Plot
Interquartile range (IQR)
The difference between the third and first quartiles (Q3 - Q1)
Q3, the 75th percentile
Q1, the 25th percentile
Median, M, the 50th percentile
Range:
difference between the largest and smallest observations
Measuring the Center
Resistant Measure
The median is not affected by outlier values. We describe the median as a resistant measure.
Median
The median is the middle number of a set of data arranged in numerical ascending/descending order
No. of items is even
E.g: {3,5,9,10,15,15}, Median = (9+10)/2 = 9.5
No. of items is odd
E.g: {3,5,9,15,15}, Median = 9
Mean
The mean of a set of values is the arithmetic average of those values
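The two median examples, and the idea of a resistant measure, can be checked with Python's `statistics` module. The extra data set with the value 1000 is a made-up outlier added only for illustration.

```python
from statistics import mean, median

even_set = [3, 5, 9, 10, 15, 15]     # even number of items
odd_set = [3, 5, 9, 15, 15]          # odd number of items
with_outlier = [3, 5, 9, 15, 1000]   # hypothetical: one value replaced by an outlier

print(median(even_set))              # average of the two middle values, 9 and 10
print(median(odd_set))               # the single middle value
print(mean(odd_set), mean(with_outlier))      # the mean is pulled toward the outlier
print(median(odd_set), median(with_outlier))  # the median does not move
```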
Shape of Graphical Displays
Bell-shaped
Uniform
Skewed
Skewed right
Skewed left
Symmetric
Describing a Graphical Display
Outliers
Extreme values in the distribution
Due to natural variation in the data
Due to errors
Requires further analysis
Gaps
Holes in which no values fall
Clusters
The natural subgroups into which the values fall
Spread
The scope of the values from the smallest to the largest
Center
The value that separates the values roughly in half
Mode
The mode is one of the major "peaks" in a distribution
bimodal
distribution with 2 modes
Unimodal
distribution with exactly one mode
Graphs for Quantitative Data
Time Plots
Cumulative Frequency Plots
Stemplots
Histogram
Graphs for Categorical Data
Bar Chart
Pie Chart
Dotplots
Independence
expected count = (row total*column total)/n
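The expected-count formula applies cell by cell to a two-way table. A minimal sketch; the function name `expected_counts` and the 2×2 table of observed counts are assumptions for illustration.

```python
def expected_counts(table):
    """Expected count for each cell: (row total * column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

observed = [[30, 10],
            [20, 40]]        # hypothetical observed counts
expected = expected_counts(observed)
print(expected)
```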
P-value and significance test
Independence
SRS
The Two-Sample t procedure
Degree of Freedom
Choose the smaller of n1 - 1 and n2 - 1
computed by software
two-sample t statistic
level C confidence interval
Two-Sample z Statistic
Normality condition
n(1-Po)≥10
nPo≥10
z = (p̂ - Po)/√(Po(1-Po)/n)
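The z statistic and the two Normality conditions can be sketched as follows. The function name `one_proportion_z`, the two-sided P-value, and the example counts (60 successes in 100 trials against Po = 0.5) are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(successes, n, p0):
    """z statistic and two-sided P-value for Ho: p = p0."""
    # Normality condition: n*p0 >= 10 and n*(1 - p0) >= 10
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("Normality condition not met")
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = one_proportion_z(60, 100, 0.5)
print(round(z, 3), round(p_value, 4))
```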
Significance test
Type I and Type II error
If we fail to reject Ho when Ho is false, we have committed a type II error
Power of a test
Increase the power of a test
Decrease std through improving the measurement process and restricting attention to a subpopulation
Increase the sample size
consider a particular alternative that is farther away from µo
increase alpha
The power of a test against any alternative is 1 minus the probability of a type II error for that alternative
1-ß
The probability that a fixed level alpha significance test will reject Ho when a particular alternative value of the parameter is true is called the power of the test against that alternative
If we reject Ho when Ho is actually true, we have committed a type I error
The significance level of any fixed level test is the probability of type I error
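The relationship power = 1 - ß can be made concrete for a one-sided z test of Ho: µ = µo against Ha: µ > µo with known standard deviation. This is a sketch under those assumptions; the function name `power_one_sided_z` and the numbers used are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(mu0, mu_alt, sigma, n, alpha=0.05):
    """Power of the test of Ho: mu = mu0 against Ha: mu > mu0 when mu = mu_alt."""
    se = sigma / sqrt(n)
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se  # reject Ho when x-bar > cutoff
    beta = NormalDist(mu_alt, se).cdf(cutoff)            # P(type II error) at mu_alt
    return 1 - beta

base = power_one_sided_z(0, 0.5, 1, 25)
print(round(base, 3))
print(power_one_sided_z(0, 0.5, 1, 50) > base)              # larger sample size
print(power_one_sided_z(0, 0.5, 1, 25, alpha=0.10) > base)  # larger alpha
print(power_one_sided_z(0, 0.8, 1, 25) > base)              # alternative farther from mu0
print(power_one_sided_z(0, 0.5, 0.8, 25) > base)            # smaller standard deviation
```

Each comparison corresponds to one of the listed ways of increasing the power of a test.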
Use and Abuse of Tests
Beware of multiple analyses
Statistical inference is not valid for all sets of data
Statistical Significance and Practical Importance
A statistically significant effect need not be practically important
very small effects can be highly significant
choosing a Level of significance
increasingly strong evidence as the P-value decreases
no sharp border between statistically significant and statistically insignificant
What are the consequences of rejecting Ho
How plausible is Ho
statistical significance is valued because it points to an effect that is unlikely to occur simply by chance
Confidence intervals and Two-Sided Test
cannot use a confidence interval in place of a significance test for one-sided tests
in two-sided hypothesis test, a significance test and a confidence interval will yield the same conclusion
The link between two-sided significance tests and confidence interval is called duality
a two-sided significance test rejects a hypothesis exactly when the value µo falls outside a level 1- alpha confidence interval for µ
Test from confidence interval
if we are 95% confident that the true µ lies in the interval, we are also confident that the values of µ that fall outside our interval are incompatible with the data
intimate connection between confidence and significance
Procedure
Step 4: Interpretation
Conclusion, connection and context
Interpret the P-value or make a decision about Ho using statistical significance
Step 3: Calculations
Find the P-value
by graphing calculator (GC)
value used
µ≠µo, 2p
µ<µo, p
µ>µo, p
Calculate the test statistic
Step 2: Conditions: Choose the appropriate inference procedure
Sample follows a Normal distribution
Samples independent of each other
Sample from an SRS
t-Test for population mean
population std unknown
t = (x̄-µo)/(s/√(n))
z-Test for population mean
population std known
z = (x̄-µo)/(std/√(n))
Step 1: Hypothesis
State hypotheses
Identify the population of interest and the parameter
Significance tests are performed to determine whether the observed value of a sample statistic differs significantly from the hypothesized value of a population parameter
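The calculation step for the z-test (population standard deviation known) can be sketched as below, with the P-value chosen according to the alternative hypothesis. The function name `z_test` and the sample figures are assumptions; the conditions in Step 2 still have to be checked separately.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """Test statistic and P-value for Ho: mu = mu0 with known sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    upper_tail = 1 - NormalDist().cdf(z)       # P(Z >= z)
    if alternative == "greater":               # Ha: mu > mu0
        p_value = upper_tail
    elif alternative == "less":                # Ha: mu < mu0
        p_value = 1 - upper_tail
    elif alternative == "two-sided":           # Ha: mu != mu0, double the smaller tail
        p_value = 2 * min(upper_tail, 1 - upper_tail)
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, p_value

# Hypothetical sample: x-bar = 52 from n = 100 observations, sigma = 10, Ho: mu = 50
z_stat, p_greater = z_test(52, 50, 10, 100, "greater")
_, p_two_sided = z_test(52, 50, 10, 100, "two-sided")
print(z_stat, round(p_greater, 4), round(p_two_sided, 4))
```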
The probability of a result at least as far out as the result we actually got
alternative hypothesis
the claim about the population that we are trying to find evidence for
Ha : μ1 ≠ μ2
null hypothesis
the statement being tested in a significance test
H0 : μ1 = μ2
How rare is rare
An outcome that would rarely happen if a claim were true is good evidence that the claim is not true
using a statistical test to assess the evidence provided by data about some claim concerning a population
using a confidence interval to estimate the population parameter
Choosing the sample size
n=(z*/m)^2 p(1-p)
m = z*√(p(1-p)/n)
Conditions for Inference about a proportion
Normality
The data are taken from an SRS
level C confidence interval for a population proportion: p̂ ± z*√(p̂(1-p̂)/n)
replace the standard deviation by the SE of p̂
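The interval for a proportion and the sample-size formula n = (z*/m)^2 p(1-p) can be sketched together. The function names, the counts used, and the guessed value p = 0.5 in the sample-size calculation are assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_star(confidence):
    """Critical value z* for a level C confidence interval."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

def proportion_interval(successes, n, confidence=0.95):
    """Level C confidence interval: p-hat +/- z* * sqrt(p-hat(1 - p-hat)/n)."""
    p_hat = successes / n
    margin = z_star(confidence) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def sample_size_for_margin(m, confidence=0.95, p_guess=0.5):
    """Smallest n with margin of error at most m, using n = (z*/m)^2 p(1-p)."""
    return ceil((z_star(confidence) / m) ** 2 * p_guess * (1 - p_guess))

low, high = proportion_interval(60, 100)   # hypothetical: 60 successes in 100
needed = sample_size_for_margin(0.03)      # margin of 3 points at 95% confidence
print(round(low, 3), round(high, 3), needed)
```

Rounding up with `ceil` keeps the margin of error at or below the target.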
large samples
Sample size less than 15, data close to normal
more important than the assumption that the population distribution is Normal
not robust against outliers
robust against non-Normality of the population
The parameter µ in a paired t procedure is
the mean difference between before-and-after measurements for all individuals in the population
the mean difference in response to the two treatments for individuals in the population
the mean difference in the responses to the two treatments within matched pairs of subjects in the entire population
compare the responses to the two treatments on the same subjects, applying a one-sample t procedure
before-and-after measurement
matched pairs design
The interval is approximately correct for large n in other cases
The interval is exactly correct when the population distribution is Normal
t* is the critical value for the t(n-1)
a level C confidence interval for µ is: x ± t*s/√n
unknown mean µ
Density curves
density curve
as k increases, the density curve approaches the N(0,1) curve ever more closely
The spread of the t distribution is a bit greater than that of the standard Normal distribution
similar in shape to the standard Normal curve
degrees of freedom
write the t distribution with k degrees of freedom as t(k)
because we are using the sample standard deviation s in our calculation
df = n-1
Not normal, it is a t distribution
Standard error of sample mean
*s stands for the sample std
Conditions:
Individual observations are independent
Population has a Normal distribution
Sample taken from an SRS
Std of population unknown
Sample size
*E stands for the margin of error
smaller margin of error
the sample size increases
the population standard deviation decreases
The confidence level C decreases
Step 4: Interpret results in the context of the problem
Step 3: calculate the confidence interval
Step 2: Name the inference procedure and check conditions
Step 1: State the parameter of interest
Independence: population size is at least 10 times as large as the sample size
Normality: n is at least 30
sample taken from SRS
Confidence Interval and Confidence Level
Margin of error
z*std/√n
Standard error
std/√n
the estimated standard deviation of that percentage
the standard error of a reported proportion or percentage p measures its accuracy
The range of values to the left and right of the point estimate in which the parameter likely lies
for a parameter
gives the probability that the interval will capture the true parameter value in repeated samples
calculated from the sample data: statistic ± margin of error
Confidence interval
center
sample statistics
range of values
generated using a set of sample data
a range of plausible values that is likely to contain the unknown population parameter
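A confidence interval of the form statistic ± margin of error, with margin z*std/√n, can be sketched for a mean with known population standard deviation. The function name `z_interval` and the sample figures are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Level C confidence interval for mu: x-bar +/- z* * sigma / sqrt(n)."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_star * sigma / sqrt(n)        # margin of error
    return xbar - margin, xbar + margin, margin

# Hypothetical sample: x-bar = 50 from n = 100 observations, sigma = 10
low, high, margin = z_interval(50, 10, 100)
print(round(low, 2), round(high, 2), round(margin, 2))

# The margin shrinks with a larger sample or a lower confidence level
print(z_interval(50, 10, 400)[2] < margin)
print(z_interval(50, 10, 100, confidence=0.90)[2] < margin)
```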
Introduction
point estimate
a single value that has been calculated to estimate the unknown parameter
Estimation
Process of determining the value of a population parameter from information provided by a sample statistic
Statistical inference
Provides methods for drawing conclusions about a population from sample data
the mean and standard deviation of the sampling distribution is given by µ and std/√n
for large sample size n, the sampling distribution of x̄ is approximately Normal for any population with finite std
standard deviation of sampling distribution = std/√n
mean of sampling distribution = µ
a Normal approximation
n(1-p)≥10
np≥10
Only used when the population is at least 10 times as large as the sample
The standard deviation of the sampling distribution of p̂ is given by √(p(1-p)/n)
The mean of the sampling distribution of p̂ is given by p
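A small simulation can illustrate these facts about the sampling distribution of p̂. This is a sketch only: the population proportion p = 0.3, the sample size n = 100, the 2000 repetitions, and the function name `simulate_p_hat` are all assumptions.

```python
import random
from math import sqrt
from statistics import mean, pstdev

def simulate_p_hat(p, n, reps, seed=0):
    """Sample proportions from `reps` independent samples of size n."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

p, n = 0.3, 100                      # np = 30 and n(1-p) = 70 are both at least 10
p_hats = simulate_p_hat(p, n, reps=2000)

simulated_mean = mean(p_hats)        # should be close to p
simulated_sd = pstdev(p_hats)        # should be close to sqrt(p(1-p)/n)
theoretical_sd = sqrt(p * (1 - p) / n)
print(round(simulated_mean, 3), round(simulated_sd, 4), round(theoretical_sd, 4))
```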
Bias and Variability
Variability
larger samples give smaller spread
determined by the sampling design and size of the sample
described by the spread of sampling distribution
Unbiased statistic for estimating a parameter
The statistic is called an unbiased estimate of the parameter
The mean of its sampling distribution is equal to the true value of parameter being estimated
Sampling Distribution
the distribution of values taken by the statistic in all possible samples of the same size from the same population
Sampling Variability
The value of a statistic varies in repeated random sampling
Different samples will give different values of sample mean and proportion
Parameter & Statistic
Statistics come from samples while parameters come from populations
A statistic is a number that can be computed from the sample data without making use of any unknown parameters
A parameter is a number that describes the population
Calculating Geometric Probabilities
Geometric Mean and Standard Deviation
Variance = std^2 = (1-p)/p^2
Mean = 1/p
P(X=n) = (1-p)^(n-1)p
Geometric Distribution
The variable of interest, X, is the number of trials required to obtain the first success
The probability of a success, p is the same for each observation
The observations are all independent
Each observation falls into one of just two categories, "success" or "failure"
For a geometric random variable, X counts the number of trials until an event of interest happens
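The geometric formulas above translate directly into code. A minimal sketch; the function names and the example values of p are assumptions.

```python
def geometric_pmf(n, p):
    """P(X = n) = (1 - p)^(n - 1) * p: first success on trial n."""
    return (1 - p) ** (n - 1) * p

def geometric_mean(p):
    return 1 / p

def geometric_variance(p):
    return (1 - p) / p ** 2

# Hypothetical example: a fair coin, X = number of tosses until the first head
print(geometric_pmf(3, 0.5))       # two tails, then a head
print(geometric_mean(0.5), geometric_variance(0.5))
# The probabilities over all n add up to 1 (checked here up to n = 200)
print(sum(geometric_pmf(n, 0.5) for n in range(1, 201)))
```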
Normal Approximation to Binomial Distribution
The accuracy of the Normal approximation improves as the sample size n increases
N(np, √(np(1-p)))
When n is large, the distribution of X is approximately Normal
The formula for binomial probabilities becomes awkward as the no. of trials n increases
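The approximation can be compared with the exact binomial probability. This sketch assumes n = 100, p = 0.5 and the event X ≤ 55, and applies N(np, √(np(1-p))) as stated, without any further correction.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p):
    """P(X <= k) using the N(np, sqrt(np(1-p))) approximation."""
    return NormalDist(n * p, sqrt(n * p * (1 - p))).cdf(k)

n, p, k = 100, 0.5, 55               # hypothetical values
exact = binomial_cdf(k, n, p)
approx = normal_approx_cdf(k, n, p)
print(round(exact, 4), round(approx, 4))
```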
Binomial Mean and Standard Deviation
Binomial Probability
Cumulative distribution function
cdf of X calculates the sum of probabilities for 0, 1, 2......, up to the value X
Probability distribution function
pdf assigns a probability to each value of X
Binomial Formula
Binomial distribution is very important in statistics when we wish to make inferences about the proportion p of "successes" in a population
X~B(n,p)
X is binomially distributed with parameters n and p
X is the number of successes in n trials
Binomial Distribution
Conditions
probability of success for each trial is the same
trials are independent
a fixed number n of trials
only 2 outcomes in each trial
The experiment is repeated a number of times independently
Two outcomes
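For X~B(n,p), the pdf and cdf described above can be computed from the binomial formula. A minimal sketch; the function names and the example of 4 tosses of a fair coin are assumptions.

```python
from math import comb, sqrt

def binomial_pdf(k, n, p):
    """P(X = k) = nCk * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_cdf(k, n, p):
    """P(X <= k): sum of the probabilities for 0, 1, ..., k."""
    return sum(binomial_pdf(i, n, p) for i in range(k + 1))

n, p = 4, 0.5                        # hypothetical: 4 tosses of a fair coin
print(binomial_pdf(2, n, p))         # exactly 2 heads
print(binomial_cdf(2, n, p))         # at most 2 heads
print(n * p, sqrt(n * p * (1 - p)))  # binomial mean np and standard deviation sqrt(np(1-p))
```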
Bernoulli distribution
Two possible outcomes: "success" or "failure"
p+q=1
q=P(failure)=1-p
p=P(success)
Condition
p(xi) = P(X=xi)
standard deviation
variances
mean
continuous random variable
random variable that assumes values associated with one or more intervals on the number line
discrete random variable
random variable with a countable number of outcomes
Probability distribution
a list of the possible values of the DRV together with their respective probabilities
A numerical value assigned to an outcome of a random phenomenon.
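For a discrete random variable, the mean, variance and standard deviation follow from the probability distribution: the mean is the sum of each value times its probability, and the variance is the probability-weighted average of squared deviations from the mean. A minimal sketch; the function name and the example distribution are assumptions.

```python
from math import sqrt

def mean_var_sd(values, probs):
    """Mean, variance and standard deviation of a discrete random variable."""
    if abs(sum(probs) - 1) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    mu = sum(x * p for x, p in zip(values, probs))
    var = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
    return mu, var, sqrt(var)

# Hypothetical distribution: number of heads in 2 tosses of a fair coin
mu, var, sd = mean_var_sd([0, 1, 2], [0.25, 0.5, 0.25])
print(mu, var, round(sd, 4))
```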
Probability Tree
Conditional probability
A given B
Important Probability Results
independent events
whether one event happens or not does not change the probability that the other event occurs
P(A and B) = P(A) * P(B)
two events A and B are disjoint
mutually exclusive
complement rule
Probability("not" an event) = 1 - Probability(event)
P(S) = 1
P(A)= 0, event A will never occur
P(A) =1, event A will certainly occur
0≤P(A)≤1
Probability of an Event with Equally Likely Outcomes
P(A) = n(A)/n(S)
probability model
a mathematical description of a random phenomenon consisting of two parts
a way of assigning probabilities to events
a sample space S
event
any outcome or a set of outcomes of a random phenomenon
Sample space
S
the set of all possible outcomes
The Idea of Probability
random
probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very large series of repetitions
a regular distribution of outcomes in a large no. of repetitions
individual outcomes are uncertain
chance behavior is unpredictable in the short run but has a regular and predictable pattern in the long run
Circular permutation
we have (n-1)! ways to arrange n distinct objects in a circle
each object has two "neighbors" when arranged in a circle
Combination
nCr
there are nCr ways of selecting r objects from n distinct objects
order is not important
Permutation
if there are n objects in total with p identical objects, we have n!/p! ways of arranging all the objects in a row
if some of the objects are identical, the no. of permutations will be less
if we choose r objects from n distinct objects, we have nPr ways of arranging the r objects
if there are n distinct objects, we have n! ways of arranging all the objects in a row
The order of the objects is important
a permutation is an arrangement of objects taken from a set
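The counting rules above (n!, nPr, n!/p!, nCr and (n-1)! for a circle) can be checked with Python's `math` module. The values n = 5, r = 2 and the word "APPLE" are assumptions chosen for illustration.

```python
from math import comb, factorial, perm

n, r = 5, 2

row_arrangements = factorial(n)            # n! ways to arrange n distinct objects in a row
ordered_choices = perm(n, r)               # nPr: order is important
unordered_choices = comb(n, r)             # nCr: order is not important
circle_arrangements = factorial(n - 1)     # (n-1)! ways around a circle

# n objects with p identical: n!/p!, e.g. the letters of "APPLE" (two identical P's)
apple_arrangements = factorial(5) // factorial(2)

print(row_arrangements, ordered_choices, unordered_choices,
      circle_arrangements, apple_arrangements)
```

Each selection of r objects can be ordered in r! ways, so nPr = nCr × r!.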
Addition and Multiplication Principle