A variable whose value is a random numerical outcome.
E(a) = a
E(aX + b) = aE[X] + b
E(X + Y) = E[X] + E[Y]
Var(a) = 0
Var(aX + b) = a^2 Var(X)
Var(X + Y) = Var(X) + Var(Y) (if X and Y are independent)
Var(X - Y) = Var(X) + Var(Y) (if X and Y are independent)
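The rules above can be checked numerically. A minimal sketch (the values 10, 2, a = 3, b = 5 are illustrative, not from the notes):

```python
import random

# Simulate a random variable X, then verify E(aX + b) = aE[X] + b and
# Var(aX + b) = a^2 Var(X) for the simulated sample.
random.seed(0)
xs = [random.gauss(10, 2) for _ in range(100_000)]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

a, b = 3, 5
ys = [a * x + b for x in xs]

assert abs(mean(ys) - (a * mean(xs) + b)) < 1e-6  # E(aX+b) = aE[X] + b
assert abs(var(ys) - a ** 2 * var(xs)) < 1e-6     # Var(aX+b) = a^2 Var(X)
```

The identities hold exactly for sample means and variances, so the asserts pass up to floating-point error.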
Takes all values in an interval of numbers
For all continuous probability distributions, P(any individual outcome) = 0
∫_-∞^x f(t)dt
Total area under graph is 1
f(x)≥0
P(a ≤ X ≤ b) = ∫_a^b f(x)dx
values that might be observed are restricted to being within a pre-defined list of possible values
µ_x=∑_(i=1)^k (x_i p_i )
σ_x^2=∑_(i=1)^k (x_i-µ_x )^2 p_i
Var(x)=E[(x-µ_x)^2]
All probabilities must add up to 1.
0≤P_k≤1
Probability Histogram
Probability distributions of real-valued random variables
Random - individual outcomes are uncertain but there is a regular distribution of outcomes in a large number of repetitions.
Conditional probability is the probability of some event A, given the occurrence of some other event B.
Conditional probability is written P(A|B), and is read "the probability of A, given B".
P(A|B) = P(A ∩ B) / P(B)
If A and B are mutually exclusive, then P(A | B) = 0.
If A and B are independent, then P(A | B) = P(A).
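A quick sketch of the conditional-probability formula using hypothetical counts (the numbers 200, 80, 20 are made up for illustration):

```python
# Out of 200 equally likely outcomes, suppose B occurs in 80
# and A-and-B together in 20.
n_total = 200
n_B = 80
n_A_and_B = 20

P_B = n_B / n_total
P_A_and_B = n_A_and_B / n_total

# P(A|B) = P(A n B) / P(B)
P_A_given_B = P_A_and_B / P_B
assert abs(P_A_given_B - 0.25) < 1e-12  # 20 of the 80 B-outcomes are also in A
```

Note the counts-only shortcut: P(A|B) = n(A ∩ B)/n(B) gives the same 0.25.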
Diagrammatic representation of possible outcomes of series of events.
A probability tree to calculate the chances of flipping a coin and coming up heads three times in a row would have three levels. The first reflects the chances of throwing either heads or tails; the second level reflects the chances of throwing heads or tails after throwing heads the first time, and the chances of throwing heads or tails after throwing tails the first time; the third level shows the chances of throwing heads or tails after all the possible outcomes of the first two throws. The probabilities along a series of branches can be multiplied to give the overall probability of a possible event occurring.
Probabilities add up to 1.
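The three-level tree described above can be enumerated directly; multiplying along the H-H-H branch gives (1/2)³ = 1/8:

```python
from itertools import product

# All outcomes of three fair coin flips; each path through the tree
# has probability (1/2) * (1/2) * (1/2).
outcomes = list(product("HT", repeat=3))
p_path = (1 / 2) ** 3

assert len(outcomes) == 8
p_three_heads = sum(p_path for o in outcomes if o == ("H", "H", "H"))
assert abs(p_three_heads - 0.125) < 1e-12
# the probabilities across all leaves add up to 1
assert abs(sum(p_path for _ in outcomes) - 1.0) < 1e-12
```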
Two events A and B are independent if the chance of one event happening or not happening does not change the probability that the other event occurs.
If A and B are independent, then P(A and B) = P(A)*P(B).
P(A)=(n(A))/(n(S))
where n(A) represents the number of outcomes in event A and n(S) represents number of outcomes in space S.
Sample space of a random phenomenon - set of all possible outcomes
Event - any outcome or set of outcomes
Probability model - mathematical description of random phenomenon.
When objects are arranged in a circle, rotations of the same arrangement are not counted as distinct, since each object keeps the same neighbours.
We have (n-1)! ways to arrange n distinct objects in a circle.
The unordered selection of objects from a set.
If there are n distinct objects, then we can select r objects in nCr = n!/(r!(n-r)!) ways.
The order of objects is important.
If there are n distinct objects, we have n! ways of arranging all the objects in a row.
Distinct
r objects from n distinct objects
nPr = n!/(n-r)!
Identical objects
p identical objects and q identical objects,from a total of n objects
arrange the n objects in a row in:
n!/p! or n!/(p!q!) ways
If you can do task 1 in m number of ways, and task 2 in n number of ways,
both tasks can be done in m*n number of ways.
the number of ways of selecting a single object is m+n.
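The counting rules above are available directly in Python's standard library (`math.perm` and `math.comb`, Python 3.8+); the values 5, 6, 2, 3 below are illustrative:

```python
import math

n, r = 5, 3
# nPr = n!/(n-r)!  (order matters)
assert math.perm(n, r) == math.factorial(n) // math.factorial(n - r)
# nCr = nPr / r!   (order ignored)
assert math.comb(n, r) == math.perm(n, r) // math.factorial(r)
# (n-1)! circular arrangements of n distinct objects
assert math.factorial(n - 1) == 24
# p identical and q identical objects among a total of n: n!/(p! q!)
n2, p, q = 6, 2, 3
assert math.factorial(n2) // (math.factorial(p) * math.factorial(q)) == 60
```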
Lack of realism
Cannot duplicate exact conditions that we want to study
Limits our ability to apply conclusions to the settings of greater interest.
Statistical analysis cannot tell us how far the results will generalise to other settings.
Double-blind experiment
Neither subjects nor those who measure the response know which treatment a subject received.
Controls the placebo effect.
Matched pairs design
Example of block design.
Compare two treatments and the subjects are matched in pairs.
Block design
Block - group of experimental subjects that are known to be similar in some way that is expected to systematically affect the response to the treatments.
Blocks are a form of control.
Blocks are chosen to reduce variability based on the likelihood that the blocking variable is related to the response.
Blocks should be formed based on the most important unavoidable sources of variability among the experimental units.
Blocking allows us to draw separate conclusions about each block.
Treatment groups are essentially similar and there are no systematic differences between them.
Even with control, there is still natural variability.
Replication reduces the role of chance variation and increases the sensitivity of the experiment to differences between the treatments.
Effort to minimise variability in the way experimental units are obtained and treated.
Helps reduce problems from confounding and lurking variables.
One group receives the treatment while the other group does not.
Compare responses between 2 groups.
Placebo
See if there is any placebo effect which could have affected the results.
Deliberately impose some treatment on
individuals in order to observe their responses.
Individuals - experimental units or subjects (humans).
Treatment - experimental condition applied.
Factors - explanatory variables.
Undercoverage - Some groups in the population are left out.
Non-response - Individuals do not respond or cooperate.
Response bias - lying
Wording of questions - confusing & misleading questions.
Probability sample - sample chosen by chance.
Stratified random sampling - the population is first divided into strata, then an SRS is taken from each stratum.
Cluster sample - divide population into clusters, then randomly select some clusters.
Multi-stage sampling design.
Population - Entire group we want information about.
Sample - Part of the population that we examine.
Sampling - studying a part in order to gain information about the whole.
Census - attempts to contact every individual.
Voluntary response sample - people who choose themselves.
Convenience sampling - choosing individuals who are easiest to reach.
Simple Random Sample (SRS)
Consists of n individuals chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.
1. Label. Assign a numerical label to every individual in the population.
2. Table. Use the random number table to select labels at random.
3. Stopping rule. Indicate when you should stop sampling.
4. Identify sample. Use the labels to identify subjects selected to be in the sample.
Two-way table organizes data about 2 categorical variables.
Row & column totals - marginal distributions or marginal frequencies.
To find the conditional distribution of the row variable for one specific value of the column variable, look only at that one column in the table. Express each entry in the column as a percent of the column total.
Causation - change in x causes the direct change in y.
Common response - the observed association between x and y can be explained by lurking variable z.
Confounding effect - the effects of two variables on the response cannot be distinguished from each other.
To explain causation, we need to conduct carefully-designed experiments.
Simpson's paradox (or the Yule-Simpson effect) is a statistical paradox wherein the successes of groups seem reversed when the groups are combined.
After transformation, we are able to apply the least-squares regression line.
y=ax^p
ln y=ln a+ p ln x
Plot ln y vs ln x to obtain straight line with gradient p.
Exponential growth increases by a fixed percent of the previous total in each equal time period.
y=ab^x
ln y = ln a + x ln b
intercept c = ln a
slope m = ln b
Plot ln y against x to obtain a straight line with gradient ln b.
A variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among these variables.
Outlier - observation that lies outside the overall pattern of the other observations.
Influential - if removing it would markedly change the result of the calculation.
Residual = observed y - predicted y. Sum of residuals = 0.
Residual plot should show no obvious pattern.
r^2=1-(∑ (y-y.hat)^2 )/(∑(y-y.bar)^2 )
Measure error size: compare Standard deviation of residuals to actual data points.
Regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
You can use a regression line to predict the value of y for any value of x.
y.hat = a+bx
Correlation measures the direction and strength of the linear relationship between two quantitative variables.
r = (1/(n-1)) ∑((x_i-x.bar)/s_x )((y_i-y.bar)/s_y )
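Correlation and the least-squares line can be computed from scratch with these formulas; a minimal sketch on made-up, nearly linear data:

```python
# Toy data (illustrative, not from the notes)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_x = (sum((x - x_bar) ** 2 for x in xs) / (n - 1)) ** 0.5
s_y = (sum((y - y_bar) ** 2 for y in ys) / (n - 1)) ** 0.5

# r = (1/(n-1)) * sum of standardized-x times standardized-y
r = sum((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys)) / (n - 1)

# least-squares line y.hat = a + b x, with b = r * s_y / s_x
b = r * s_y / s_x
a = y_bar - b * x_bar

assert r > 0.99                 # nearly perfectly linear toy data
assert abs(b - 1.97) < 1e-9     # slope for this data set
```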
Scatterplot shows relationship between 2 quantitative variables
Explanatory on x-axis
Response on y-axis
Positive - above-average values of one tend to accompany above-average values of the other, and vice versa.
Negative - above-average values of one tend to accompany below-average values of the other.
1. Look for overall pattern and for striking deviations from that pattern.
2. Describe the pattern by the direction, form and strength of the relationship
3. Look for outliers.
Response (dependent) - measures outcome of study
Explanatory (independent) - explains or influences changes in response variable
Probability density function given as:
f(x) = (1/(σ√(2π))) e^(-(x-µ)²/(2σ²))
Close to straight line - Normal
Systematic deviations - non-Normal
For any Normal distribution we can perform a linear transformation to obtain standard Normal distribution
If the variable X has Normal distribution N(µ, σ), then the standardised variable z has Normal distribution N(0,1):
z=(x-µ)/σ
The area under the standard Normal curve can be found from a standard Normal table, or the GC.
68% fall within 1 standard deviation of the mean
95% fall within 2 standard deviations of the mean
99.7% fall within 3 standard deviations of the mean
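The 68-95-99.7 rule and standardisation can be checked with the standard library's `statistics.NormalDist` (Python 3.8+); the N(100, 15) example is illustrative:

```python
from statistics import NormalDist

Z = NormalDist(0, 1)  # standard Normal

def within(k):
    # proportion of observations within k standard deviations of the mean
    return Z.cdf(k) - Z.cdf(-k)

assert abs(within(1) - 0.6827) < 0.001
assert abs(within(2) - 0.9545) < 0.001
assert abs(within(3) - 0.9973) < 0.001

# Standardising: if X ~ N(100, 15), then x = 130 gives z = (x - mu)/sigma = 2
mu, sigma, x = 100, 15, 130
z = (x - mu) / sigma
assert z == 2.0
```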
Mathematical model for the distribution
A curve that is always on or above the x-axis
Area underneath it is always exactly 1.
% of observations falling within k standard deviations of the mean is at least (100)(1-1/k^2)
Median is the equal-areas point
Mean is the balance point
In a symmetric density curve the mean and median are the same
pth percentile of a distribution is the value with p% of the observation less than or equal to it.
z=(x-x.bar)/s
Side-by-side graphs
Back-to-back stemplots
Narrative comparisons
Transformed variables
mean: a + b(x.bar)
median: a + bM
standard deviation: |b|s
IQR: |b|R
Linear transformations
Median is not affected by outlier values
Symmetric
Skewed (spreads far and thinly)
Uniform
Bell-shaped
Range
Percentile
Interquartile Range (IQR)
Box plots (five number summary)
Variance
Standard deviation, s
Mean
Median
Area represents the size of data
Relative frequencies on vertical axis
Given X~B(n,p) such that n is large (>50) and np<5 (normally p<0.1), the binomial distribution can be approximated using the Poisson distribution with mean λ=np.
It is more accurate when n gets larger and p gets smaller
If X and Y are independent Poisson random variables with X~Po(λ) and Y~Po(µ), then:
X+Y ~ Po(λ+µ)
If X~Po(λ), then E(X) =λ and Var(X) =λ
λ = average number of occurrences
1) The events occur singly and randomly.
2) The events occur uniformly.
3) The events occur independently.
4) The probability of two or more occurrences within a sufficiently small interval is negligible.
The slope b and intercept a of the least-squares regression line are statistics.
t= (b√(∑〖(x-x.bar)〗^2 ))/s
Ho: β = 0
This Ho says that there is no true linear relationship between x and y.
This Ho also says that there is no correlation between x and y.
The testing correlation makes sense only if the observations are a random sample.
degrees of freedom = n -2
Residual = observed y - predicted y
Standard error: s = √(∑(y-y.hat)²/(n-2))
confidence interval = b ± t* s/√(∑(x-x.bar)^2 )
Repeated responses y are independent of each other.
Scatterplot: overall pattern is roughly linear. Residual plot has a random pattern.
The standard deviation σ of y (σ is unknown) is the same for all values of x.
For any fixed value of x, the response y varies according to a Normal distribution.
2-way table from a single SRS
each individual is classified according to both categorical variables.
Ho: There is no association between two categorical variables.
Ha: There is an association between two categorical variables.
Select an SRS from each of c populations.
Each individual is classified in a sample according to a categorical response variable with r possible values.
There are c different sets of proportions to be compared, one for each population.
No more than 20% of the expected counts are less than 5, and all individual expected counts are at least 1.
All counts in a 2 x 2 table should be at least 5.
Expected count = (row total x column total) / n
null hypothesis is that the distribution of the response variable is the same in all c populations.
alternative hypothesis is that these c distributions are not all the same.
Implications
If Ho is true, the chi-square statistic has approximately a chi-square distribution with (r-1)(c-1) degrees of freedom.
We can compare the 2 proportions using the z test.
The Chi-square statistic is the square of the z statistic, and the P-value for Chi-square is the same as the two-sided P-value for z.
Used to compare two proportions because it gives the choice of a one-sided test and is related to a confidence interval for p1-p2.
The total area under a chi-square curve is equal to 1.
Each chi-square curve (except when degrees of freedom = 1) begins at 0 on the horizontal axis, increases to a peak, and approaches the horizontal axis asymptotically from above.
Each chi-square curve is skewed to the right. As the number of degrees of freedom increases, the curve becomes more symmetric and looks more like a Normal curve.
x^2=∑(O-E)^2/E
df = k -1
P-value = P(X^2>x^2)
Ho: The actual population proportions are equal to the hypothesised proportions.
Ha: At least one of the actual population proportions differs from the hypothesised proportions.
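A goodness-of-fit sketch of the chi-square statistic and df = k - 1, using hypothetical counts (a die rolled 60 times, Ho: all six faces equally likely):

```python
# Observed counts are made up for illustration; expected count is 10 per face.
observed = [8, 12, 9, 11, 13, 7]
expected = [10] * 6

# x^2 = sum (O - E)^2 / E
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # df = k - 1

# (4 + 4 + 1 + 1 + 9 + 9) / 10 = 2.8
assert abs(chi_sq - 2.8) < 1e-12
assert df == 5
```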
More robust than the one-sample t methods, particularly when the distributions are not symmetric.
Choose equal sample sizes if possible.
n1 and n2 must both be at least 5.
If n1+n2 > 30, the two-sample t procedure can be used even for skewed distributions.
z=(p.hat_1-p.hat_2)/√(p.hat_c (1-p.hat_c )(1/n_1 +1/n_2 ))
(p.hat_1-p.hat_2 )±z* √((p.hat_1 (1-p.hat_1))/n_1 +(p.hat_2 (1-p.hat_2))/n_2 )
(x.bar_1-x.bar_2 )±t* √((s_1^2)/n_1 +(s_2^2)/n_2 )
z = ((x.bar_1 - x.bar_2) - (μ_1-μ_2))/√((σ_1^2)/n_1 + (σ_2^2)/n_2)
SRS: We have two SRSs, from two distinct populations.
Independence: The samples are independent. That is, one sample has no influence on the other.
When sampling without replacement, each population must be at least 10 times as large as the corresponding sample size.
Normality: Both populations are Normally distributed.
t=(x.bar - µ0)/(s/√n)
µ>µ0 P(T>t)
µ<µ0 P(T<t)
µ≠µ0 2P(T≥|t|)
z = (p.hat - p0)/√(p0(1-p0)/n)
p>p0 P(Z>z)
p<p0 P(Z<z)
p≠p0 2P(Z≥|z|)
Normality condition: np and n(1-p) ≥ 10
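A one-proportion z test worked end to end, with made-up numbers (Ho: p = 0.5, Ha: p > 0.5, 560 successes in n = 1000 trials):

```python
from math import sqrt
from statistics import NormalDist

p0, n, successes = 0.5, 1000, 560
p_hat = successes / n

# Normality condition: n*p0 and n*(1-p0) are both >= 10 here
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 1 - NormalDist().cdf(z)  # P(Z > z) for the one-sided alternative

assert abs(z - 3.7947) < 1e-3
assert p_value < 0.05  # statistically significant at alpha = 0.05
```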
fail to reject Ho when Ho is false.
Power
The probability that a fixed level α test will reject Ho when a particular alternative value of the parameter is true is called the power of the tests against that alternative.
Increasing power
Increase alpha.
Consider a particular alternative further away from the null value
Increase the sample size; decreases standard error
Decrease σ
Probability
1. Calculate the critical value at which the test rejects Ho.
2. Use the critical value obtained and standardise using a curve based on alternative hypothesis to find the probability.
reject Ho when Ho is actually true.
Significance
The significance level α of any fixed level test is the probability of a Type I error.
α is the probability that the test will reject the null hypothesis when it is in fact true.
Badly designed surveys or experiments often produce invalid results.
Faulty data collection, outliers in the data, and testing a hypothesis on the same data can invalidate a test.
Beware of multiple analyses; many tests run at once will probably produce some significant results by chance alone, even if all the null hypotheses are true.
There is a tendency to infer there is no effect whenever a P-value fails to attain the usual 5% standard.
Lack of significance does not imply that H0 is true.
In some areas of research, small effects that are detectable only with large sample sizes can be of great practical significance.
A statistically significant effect need not be practically important.
Use confidence intervals to estimate the actual value for parameters as confidence intervals estimate the size of an effect rather than simply asking if it is too large to reasonably occur by chance alone.
There is no sharp border between "statistically significant" and "statistically insignificant" so giving the P-value allows each of us to decide individually if the evidence is sufficiently strong.
A level α 2-sided significance test rejects a hypothesis exactly when the value µ0 falls outside a level 1-α confidence interval for µ
The link between 2-sided significance tests and confidence intervals is called duality
For a two-sided hypothesis test for mean, a significance test (level α) and a confidence interval (level C = 1-α) will yield the same conclusion.
z=(x.bar - µ0)/(s/√n)
µ>µ0 P(Z>z)
µ<µ0 P(Z<z)
µ≠µ0 2P(Z≥|z|)
Interpretation
These P-values are exact if the population distribution is Normal and are approximately correct for large n in other cases.
Failing to find evidence against H0 means only that the data are consistent with H0, not that we have clear evidence that H0 is true.
1. Hypotheses: Identify the population of interest and the parameter you want to draw conclusions about.
2. Conditions: Choose the appropriate inference procedure. Verify the conditions for using it.
3. Calculations: Calculate test statistic and the P-value.
4. Interpretation: Interpret your results in context of the problem.
The test is based on a statistic that compares the value of the parameter as stated in the null hypothesis with an estimate of the parameter from the sample data.
Values of the estimate far from the parameter value in the direction specified by the alternative hypothesis give evidence against H0.
Standardise the estimate:
Test statistic = (estimate - hypothesised value) / standard deviation of estimate
SRS from the population of interest
Normality: np0 ≥ 10 and n(1-p0) ≥ 10
Independent observations.
We can have one-sided or two-sided alternative hypotheses.
alternative hypothesis
The alternative hypothesis states that the effect is present in the population.
null hypothesis
The null hypothesis is the statement that this effect is not present in the population.
Rule of thumb: alpha = 0.05 unless otherwise stated
A result with a small P-value (less than alpha) is called statistically significant.
Small P-value
Small P-values are evidence against H0 because they say that the observed trait is unlikely to occur just by chance.
Large P-value
Large P-values fail to give evidence against H0.
An outcome that would rarely happen if a claim was true is good evidence that the claim is false.
The results of a test are expressed in terms of a probability that measures how well the data and the hypothesis agree.
p.hat±z*√((p.hat(1-p.hat))/n)
Since the margin of error involves the sample proportion of successes, we need to guess this value when choosing the sample size n for a desired margin of error m.
The guess is called p*
Use a guess p* based on pilot study or past experiences with similar studies.
Use p* = 0.5 as the guess. Margin of error is largest when p-hat = 0.5.
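The sample size needed for a desired margin of error m follows from solving z*√(p*(1-p*)/n) ≤ m for n; a sketch with illustrative values (95% confidence, m = 0.03, conservative guess p* = 0.5):

```python
from math import ceil, sqrt

z_star = 1.96   # critical value for 95% confidence
m = 0.03        # desired margin of error
p_star = 0.5    # conservative guess: margin of error is largest at 0.5

# n >= (z*/m)^2 * p*(1-p*)
n = ceil(z_star ** 2 * p_star * (1 - p_star) / m ** 2)

assert n == 1068
# the resulting margin of error is indeed at most m
assert z_star * sqrt(p_star * (1 - p_star) / n) <= m
```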
SRS: The data are an SRS from the population of interest
Normality: For a confidence interval, n is so large that both np-hat and n(1-p-hat) are 10 or more
Independence: Individual observations are independent.
When sampling without replacement, the population is at least 10 times as large as the sample.
x.bar±t*s/√n
df=n-1
Paired t procedures
Matched pairs design or before-and-after measurements on the same subjects
t-distributions
Substitute the sample standard deviation s for the unknown σ; the standard error of x.bar is s/√n.
The resultant distribution is not Normal. It is a t distribution. There is a different t distribution for each sample size n. We specify a particular t distribution by giving its degrees of freedom (df).
The density curves of the t distributions are similar in shape to the standard Normal curve. The spread of the t distributions is a bit greater than that of the standard Normal distribution.
As the degrees of freedom increases, the t(k) density curve approaches the N(0,1) curve ever more closely.
This interval is exactly correct when the population distribution is Normal and approximately correct for large n in other cases.
Using the t procedures
Except in the case of small samples, the assumption that the data are an SRS from the population of interest is more important than the assumption that the population distribution is Normal.
Sample size less than 15. Use t procedures if the data are close to Normal.
Samples size at least 15. The t procedures can be used except in the presence of outliers or strong skewness.
Large samples. The t procedures can be used even for clearly skewed distribution when the sample is large (central limit theorem).
Robustness of t procedures
Procedures that are not strongly affected by lack of Normality are called robust.
t-procedures are not robust against outliers
But they are quite robust against non-Normality of the population, when there are no outliers, even if the distribution is asymmetric.
Larger samples improve accuracy of critical values from the t distributions when the population is not Normal. This is because of the central limit theorem.
SRS: Data are SRS of size n from population of interest or come from a randomised experiment
Normality: Observations from the population have a Normal distribution. It is enough that the distribution be symmetric and single-peaked.
Independence: Individual observations are independent.
The population size should be at least 10 times the sample size.
x.bar±z*σ/√n
Reducing Margin of error
The confidence level C decreases (z* gets smaller)
The population standard deviation decreases
The sample size increases
Procedure for inference with Confidence Intervals
1. State the parameter of interest.
2. Name the inference procedure and check conditions.
3. Calculate the confidence interval.
4. Interpret results in the context of the problem.
1. the sample must be an SRS from the population of interest.
2. The sampling distribution of the sample mean x-bar is at least approximately Normal.
If the population distribution is not Normal, the central limit theorem tells us that x-bar is approximately Normal if n is large.
3. Individual observations are independent.
4.The population size is at least 10 times as large as the sample size.
Range of plausible values that are likely to contain the unknown population parameter.
Generated using a set of sample data.
confidence level C, which gives the probability that the interval will capture the true parameter value in repeated samples.
For large sample size n>30, the sampling distribution of x-bar is approximately Normal for any population with a finite standard deviation.
The mean is given by µ and standard deviation by σ/√n
The sample size n needed depends on the population. More observations are required if the shape is skewed.
Bias: how far the mean of the sampling distribution is from the true value of the parameter being estimated.
Variability: spread of its sampling distribution. Larger samples give a smaller spread.
Mean of x bar: µ
Standard deviation of x bar: σ/√n
1. The formula for standard deviation of x-bar is only used when the population is at least 10 times as large as the sample.
2. The facts above about the mean and standard deviation of x-bar are true no matter what the population distribution looks like.
3. The shape of the distribution of x-bar depends on the shape of the population distribution. In particular, if the population distribution is Normal, then the sampling distribution of x-bar is also Normal.
We often take SRS of size n and use the p-hat to estimate the unknown parameter p.
Mean of sampling distribution is given by p
Standard deviation of sampling distribution is given by;
√((p(1-p))/n)
The formula for standard deviation of p-hat is only used when the population is at least 10 times as large as the sample.
Normal approximation is used when np and n(1-p)≥10
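The facts about the sampling distribution of p-hat can be seen by simulation; a sketch with illustrative values (true p = 0.3, samples of size n = 100):

```python
import random

random.seed(1)
p, n, reps = 0.3, 100, 2000

# Draw many SRSs of size n and record p-hat for each
p_hats = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]

mean_p_hat = sum(p_hats) / reps
sd_p_hat = (sum((x - mean_p_hat) ** 2 for x in p_hats) / reps) ** 0.5

# mean of the sampling distribution ~ p, sd ~ sqrt(p(1-p)/n) ~ 0.0458
assert abs(mean_p_hat - p) < 0.01
assert abs(sd_p_hat - (p * (1 - p) / n) ** 0.5) < 0.01
```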
Distribution of values taken by the statistic in all possible samples of the same size from the same population.
When describing a histogram
Center: center of distribution is very close to the true value of p
Shape: overall shape is roughly symmetric and approximately Normal.
Spread: values of p-hat range from 0.2 to 0.55.
Since the distribution is approximately Normal, we can use the standard deviation to describe its spread.
A parameter is a number that describes the population;
µ, p, σ
A statistic is a number that can be computed from the sample data without the use of any unknown parameters; x-bar, p-hat, s.
mean = µ=1/p
variance = σ^2=(1-p)/p^2
Probability that it takes more than n trials to see the first success is
P(X>n)=(1-p)^n
the probability that the first success occurs on the nth trial;
P(X=n)=(1-p)^(n-1)p
p=probability of success
Calculator function;
tistat.geomPdf(p,n)
tistat.geomCdf(p,n)
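The geometric formulas above (written here in Python rather than the TI calculator syntax), with p = 0.25 as an illustrative value:

```python
p = 0.25

def geom_pmf(n, p):
    # P(X = n): first success occurs on the nth trial
    return (1 - p) ** (n - 1) * p

def geom_tail(n, p):
    # P(X > n): more than n trials needed to see the first success
    return (1 - p) ** n

assert abs(geom_pmf(1, p) - 0.25) < 1e-12
assert abs(geom_tail(3, p) - 0.75 ** 3) < 1e-12
# the probabilities sum to 1 (up to a vanishing tail)
assert abs(sum(geom_pmf(k, p) for k in range(1, 200)) - 1.0) < 1e-9
# mean = 1/p: on average 4 trials to the first success
assert abs(1 / p - 4.0) < 1e-12
```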
1) Each observation has only 2 outcomes, "success" and "failure"
2) The n observations are all independent
3) The probability of success, p, is the same for each observation.
4) The variable of interest, X, is the number of trials required to obtain the first success
The number of trials in a geometric setting is not fixed but is the variable of interest
Normal Approximation to Binomial Distributions
When n is large, the distribution of X is approximately Normal
Can be used when np≥10 and n(1-p)≥10
most accurate when p close to 0.5
least accurate when p is near 0 or 1
Binomial Mean and standard Deviation
µ=np
σ=√(np(1-p))
Binomial Probability
If X has the binomial distribution with n observations and probability p of success on each observation, the possible values of X are 0,1,2...n
P(X=k)=(n¦k) * (p)^k * (1-p)^(n-k)
(n¦k) is known as the binomial coefficient
This counts the number of ways in which k successes can be distributed among n observations.
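The binomial probability formula can be written directly with `math.comb` (Python 3.8+); n = 10, p = 0.2 below are illustrative values:

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) = (n choose k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.2
assert abs(binom_pmf(2, n, p) - 0.30199) < 1e-4
# the probabilities over k = 0..n sum to 1
assert abs(sum(binom_pmf(k, n, p) for k in range(n + 1)) - 1.0) < 1e-12

# binomial mean and standard deviation
mu = n * p
sigma = (n * p * (1 - p)) ** 0.5
assert mu == 2.0
assert abs(sigma - 1.2649) < 1e-3
```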
Cumulative distribution function (cdf)
Given a random variable X, the cdf of X sums the probabilities for 0, 1, 2, ..., X.
It gives the probability of obtaining at most X successes in n trials.
Calculator function;
tistat.binomCdf(n,p,X)
Large number of red and white balls
25% are red
If balls are picked randomly, what is the least number of balls to be picked so that the probability of getting at least 1 red ball is greater than 0.95?
X = no. of red balls among the n picked
P(X≥1) = 1 - P(X=0)
=1-(0.75)^n
1-(0.75)^n>0.95
(0.75)^n<0.05
n>10.4133
n=11
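The worked example above can be verified by solving 0.75^n < 0.05 for n with logarithms:

```python
from math import ceil, log

# 0.75^n < 0.05  =>  n > ln(0.05)/ln(0.75) ~ 10.41, so n = 11
n = ceil(log(0.05) / log(0.75))
assert n == 11

# check: 11 picks clears the 0.95 threshold, 10 does not
assert 1 - 0.75 ** 11 > 0.95
assert 1 - 0.75 ** 10 <= 0.95
```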
Binomial Distribution.
p=0.06 are out of shape
SRS of 20 bears
What is the probability that more than 3 bears are out of shape?
P(x>3)= 1-P(x=0)-P(x=1)-P(x=2)-P(x=3)
= 0.028966
tistat.binomCdf(20,0.06,4,20)
= 0.028966
Probability distribution function (pdf)
Given a discrete random variable X, the pdf assigns a probability to each value of X
Calculator function;
tistat.binomPdf(n,p,X)
Example
Binomial Distribution.
n=10000 balls
p=0.2 are white balls
SRS of 10 balls
What is probability there are exactly 2 white balls
P(x=2)= (10¦2) * (0.2)^2 * (0.8)^8
= 0.30199
tistat.binomPdf(10,0.2,2)
= 0.30199
Binomial Setting
1) Each observation has only 2 outcomes, "success" and "failure"
2) There is a fixed number of observations, n
3) The n observations are all independent
4) The probability of success, p, is the same for each observation.
It is important to recognise the situations in which binomial distributions can and cannot be used.
Sampling Distribution of a count
Choose an SRS (simple random sample) of size n from a population with proportion p of successes.
When the population is much larger than the sample, the count X of successes in the sample has approximately the binomial distribution with parameters n and p.