Stats

Statistics II & III

Chapter 7: Random Variables

A variable whose value is a random numerical outcome.

Properties for expectations and variances

E(a) = a

E(aX + b) = aE[X] + b

E(X + Y) = E[X] + E[Y]

Var(a) = 0

Var(aX + b) = a^2 Var(X)

Var(X + Y) = Var(X) + Var(Y), if X and Y are independent

Var(X - Y) = Var(X) + Var(Y), if X and Y are independent
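The properties above can be checked exactly on a small example. This is only a sketch: the two distributions and the constants a and b are made up for illustration, and the helper names are hypothetical.

```python
from fractions import Fraction

# Hypothetical discrete distributions: value -> probability (each sums to 1).
dist_x = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
dist_y = {1: Fraction(1, 3), 4: Fraction(2, 3)}

def mean(d):
    """E(X) = sum of x_i * p_i."""
    return sum(x * p for x, p in d.items())

def var(d):
    """Var(X) = sum of (x_i - mean)^2 * p_i."""
    m = mean(d)
    return sum((x - m) ** 2 * p for x, p in d.items())

def linear(d, a, b):
    """Distribution of aX + b (a must be non-zero so values stay distinct)."""
    return {a * x + b: p for x, p in d.items()}

def combine(d1, d2, sign=1):
    """Distribution of X + sign*Y, assuming X and Y are independent."""
    out = {}
    for x, p in d1.items():
        for y, q in d2.items():
            out[x + sign * y] = out.get(x + sign * y, 0) + p * q
    return out

a, b = 3, 5
shifted = linear(dist_x, a, b)
total = combine(dist_x, dist_y, sign=1)
difference = combine(dist_x, dist_y, sign=-1)

print(mean(shifted), a * mean(dist_x) + b)
print(var(shifted), a ** 2 * var(dist_x))
print(var(total), var(difference), var(dist_x) + var(dist_y))
```

Exact fractions are used so that the two sides of each identity can be compared without rounding error.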

Continuous random variable

Takes all values in an interval of numbers

For all continuous probability distributions, P(any individual outcome) = 0

Cumulative distribution function

F(x) = P(X ≤ x) = ∫_-∞^x f(t)dt

Probability distribution of X is described by a probability density function f(x)

Total area under graph is 1

f(x)≥0

P(a ≤ X ≤ b) = ∫_a^b f(x)dx

Discrete random variable

Values that might be observed are restricted to a pre-defined list of possible values.

Equations

µ_x=∑_(i=1)^k (x_i p_i )

σ_x^2=∑_(i=1)^k (x_i-µ_x )^2 p_i

Var(x)=E[(x-µ_x)^2]

All probabilities must add up to 1.

0≤P_k≤1

Probability Histogram

Probability distributions of real-valued random variables

Chapter 6: Permutation, combination and probability

Random - individual outcomes are uncertain but there is a regular distribution of outcomes in a large number of repetitions.

Conditional probability

Conditional probability is the probability of some event A, given the occurrence of some other event B.

Conditional probability is written P(A|B), and is read "the probability of A, given B".

P(A|B)=P(A ∩ B)/P(B)

If A and B are mutually exclusive, then P(A | B) = 0.

If A and B are independent, then P(A | B) = P(A).
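A short sketch of these three facts, using one roll of a fair die. The events A, B, C and E are chosen only for illustration.

```python
from fractions import Fraction

# Sample space for one roll of a fair die (all outcomes equally likely).
sample_space = set(range(1, 7))

A = {2, 4, 6}   # even number
B = {4, 5, 6}   # greater than 3
C = {1}         # mutually exclusive with B
E = {1, 2}      # independent of A

def prob(event):
    """P(event) = n(event) / n(S) for equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b), assuming P(b) > 0."""
    return prob(a & b) / prob(b)

print(cond_prob(A, B))           # P(A | B)
print(cond_prob(C, B))           # mutually exclusive events
print(cond_prob(A, E), prob(A))  # independent events
```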

Probability Tree

Diagrammatic representation of the possible outcomes of a series of events.

A probability tree to calculate the chances of flipping a coin and coming up heads three times in a row would have three levels. The first reflects the chances of throwing either heads or tails; the second level reflects the chances of throwing heads or tails after throwing heads the first time, and the chances of throwing heads or tails after throwing tails the first time: the third level shows the chances of throwing heads or tails after all the possible outcomes of the first two throws. The series of probabilities can be multiplied to give the overall probability of a possible event occurring.

Probabilities add up to 1.
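The three-level coin tree described above can be enumerated directly. In this minimal sketch each path through the tree is one sequence of three flips, and a fair coin is assumed.

```python
from fractions import Fraction
from itertools import product

# Branch probabilities at every level of the tree (fair coin assumed).
branch = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

# Each path is one sequence of three flips; multiply along the branches.
paths = {}
for path in product("HT", repeat=3):
    p = Fraction(1)
    for outcome in path:
        p *= branch[outcome]
    paths["".join(path)] = p

print(paths["HHH"])         # probability of three heads in a row
print(sum(paths.values()))  # all end-of-tree probabilities together
```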

Independent events

Two events A and B are independent if the chance of one event happening or not happening does not change the probability that the other event occurs.

If A and B are independent, then P(A and B) = P(A)*P(B).

Probability of an event with equally likely outcomes

P(A)=(n(A))/(n(S))

where n(A) represents the number of outcomes in event A and n(S) represents number of outcomes in space S.

Probability models

Sample space of a random phenomenon - set of all possible outcomes

Event - any outcome or set of outcomes

Probability model - mathematical description of random phenomenon.

Permutation and combination

Circular permutation

When objects are arranged in a circle, rotating the whole arrangement leaves each object with the same neighbours, so rotations count as the same arrangement.

We have (n-1)! ways to arrange n distinct objects in a circle.

Combination

The unordered selection of objects from a set.

If there are n distinct objects, then we can select r objects in ^nC_r = n!/(r!(n-r)!) ways.

Permutation

The order of objects is important.

If there are n distinct objects, we have n! ways of arranging all the objects in a row.

Distinct

r objects from n distinct objects

^nP_r=n!/(n-r)!

Identical objects

p identical objects and q identical objects, from a total of n objects

arrange the n objects in a row in:

n!/p! ways (one group of p identical objects) or n!/(p!q!) ways (two groups)

Multiplication principle

If you can do task 1 in m number of ways, and task 2 in n number of ways,

both tasks can be done in m*n number of ways.

Addition principle

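If an object can be selected from one group in m ways and from a second, non-overlapping group in n ways, the number of ways of selecting a single object is m+n.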
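The counting formulas in this section can be evaluated with Python's standard math module. The values n = 5 and r = 3 and the word LEVEL are illustrative choices only.

```python
import math

n, r = 5, 3

row_arrangements = math.factorial(n)       # n! ways to arrange n distinct objects in a row
permutations = math.perm(n, r)             # n!/(n-r)! ordered selections of r objects
combinations = math.comb(n, r)             # n!/(r!(n-r)!) unordered selections of r objects
circular = math.factorial(n - 1)           # (n-1)! arrangements in a circle

# Identical objects: the letters of LEVEL (5 letters, L twice, E twice).
level_arrangements = math.factorial(5) // (math.factorial(2) * math.factorial(2))

# Multiplication principle: 3 ways for task 1 and 4 ways for task 2.
both_tasks = 3 * 4

print(row_arrangements, permutations, combinations, circular, level_arrangements, both_tasks)
```

Each ordered selection of r objects corresponds to r! orderings of one unordered selection, which is why permutations equals combinations multiplied by r!.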

Chapter 5: Producing data

Experiment

Lack of realism

Cannot duplicate exact conditions that we want to study

Limits our ability to apply conclusions to the settings of greater interest.

Statistical analysis cannot tell us how far the results will generalise to other settings.

Double-blind experiment

Neither subjects nor those who measure the response know which treatment a subject received.

Controls the placebo effect.

Designs

Matched pairs design

Example of block design.

Compare two treatments and the subjects are matched in pairs.

Block design

Block - group of experimental subjects that are known to be similar in some way that is expected to systematically affect the response to the treatments.

Blocks are a form of control.

Blocks are chosen to reduce variability based on the likelihood that the blocking variable is related to the response.

Blocks should be formed based on the most important unavoidable sources of variability among the experimental units.

Blocking allows us to draw separate conclusions about each block.

Randomisation

Treatment groups are essentially similar and there are no systematic differences between them.

Replication

Even with control, there is still natural variability.

Replication reduces the role of chance variation and increases the sensitivity of the experiment to differences between the treatments.

Control

Effort to minimise variability in the way experimental units are obtained and treated.

Helps reduce problems from confounding and lurking variables.

One group receives the treatment while the other group does not.

Compare responses between 2 groups.

Placebo

See if there are any placebo effects which could have affected the results.

Definition

Deliberately impose some treatment on individuals in order to observe their responses.

Individuals - experimental units or subjects (humans).

Treatment - experimental condition applied.

Factors - explanatory variables.

Observational study

Cautions

Undercoverage - Some groups in the population are left out.

Non-response - Individuals do not respond or cooperate.

Response bias - respondents give inaccurate answers (for example, by lying).

Wording of questions - confusing & misleading questions.

Other sampling methods

Probability sample - sample chosen by chance.

Stratified random sampling - population is first divided into strata, then an SRS is taken from each stratum.

Cluster sample - divide population into clusters, then randomly select some clusters.

Multi-stage sampling design.

Designing samples

Population - Entire group we want information about.

Sample - Part of the population that we examine.

Sampling - studying a part in order to gain information about the whole.

Census - attempts to contact every individual.

Voluntary response sample - people who choose themselves.

Convenience sampling - choosing individuals who are easiest to reach.

Simple Random Sample (SRS)

Consists of n individuals chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.

1. Label. Assign a numerical label to every individual in the population.

2. Table. Use the random number table to select labels at random.

3. Stopping rule. Indicate when you should stop sampling.

4. Identify sample. Use the labels to identify subjects selected to be in the sample.
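The four steps above can be sketched in code. Here a seeded pseudo-random generator stands in for the random number table, which is a substitution rather than the table method itself. The population of 30 named students is hypothetical.

```python
import random

# Hypothetical population of 30 individuals.
population = [f"student_{i:02d}" for i in range(1, 31)]

# 1. Label: assign a numerical label to every individual.
labels = list(range(len(population)))

# 2. "Table": a seeded generator replaces the random number table here.
rng = random.Random(42)

# 3. Stopping rule: stop once 5 distinct labels have been chosen.
#    random.sample draws without replacement, so repeats cannot occur.
chosen_labels = rng.sample(labels, k=5)

# 4. Identify sample: map the labels back to individuals.
sample = [population[label] for label in chosen_labels]
print(sample)
```

Every set of 5 labels is equally likely to be chosen, which is the defining property of an SRS.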

Chapter 4: Relationships between two variables

Relationship between categorical variables

Two-way table organizes data about 2 categorical variables.

Row & column totals - marginal distributions or marginal frequencies.

To find the conditional distribution of the row variable for one specific value of the column variable, look only at that one column in the table. Find each entry in the column as a percent of the column total.

Explaining association

Causation - a change in x directly causes a change in y.

Common response - the observed association between x and y can be explained by lurking variable z.

Confounding effect - variable effects cannot be distinguished from each other.

To explain causation, we need to conduct carefully-designed experiments.

Simpson's paradox

Simpson's paradox (or the Yule-Simpson effect) is a statistical paradox wherein the successes of groups seem reversed when the groups are combined.

Transforming to achieve linearity

Once the transformed data are linear, we are able to apply the least squares regression line.

Power law model

y=ax^p

ln y=ln a+ p ln x

Plot ln y vs ln x to obtain straight line with gradient p.

Exponential growth model

Exponential growth increases by a fixed percent of the previous total in each equal time period.

y=ab^x

ln y = ln(ab^x) = ln a + x ln b

intercept c = ln a

gradient m = ln b

Plot ln y against x to obtain a straight line with gradient ln b.
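Taking logs of y = ab^x gives ln y = ln a + x ln b, so a least-squares line fitted to (x, ln y) has intercept ln a and gradient ln b. The sketch below generates exact data from made-up values a = 2 and b = 1.5 and recovers them. The helper least_squares is hypothetical, not a library call.

```python
import math

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Illustrative data generated from y = 2 * 1.5^x (no noise).
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * 1.5 ** x for x in xs]

# Fit a straight line to (x, ln y), then undo the transformation.
intercept, slope = least_squares(xs, [math.log(y) for y in ys])
a_est = math.exp(intercept)
b_est = math.exp(slope)
print(a_est, b_est)
```

For the power law model the same helper would be applied to (ln x, ln y), in which case the gradient estimates p.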

Chapter 3: Examining Relationship

Lurking variable

A variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among these variables.

Outliers and influential observations

Outlier - observation that lies outside the overall pattern of the other observations.

Influential - if removing it would markedly change the result of the calculation.

Residuals & Residual plot

Residual = observed y - predicted y. Sum of residuals = 0.

Residual plot should show no obvious pattern.

Coefficient of determination

r^2=1-(∑ (y-y.hat)^2 )/(∑(y-y.bar)^2 )

Standard deviation

Measure error size: compare Standard deviation of residuals to actual data points.

Least Squares Regression Line

Regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.

You can use a regression line to predict the value of y for any value of x.

y.hat = a+bx

Correlation

Correlation measures the direction and strength of the linear relationship between two quantitative variables.

r = 1/(n-1) ∑((x_i-x.bar)/s_x )((y_i-y.bar)/s_y )
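A minimal sketch tying together correlation, the least-squares line, residuals and r^2 on a small made-up data set. It assumes the standard relations b = r s_y/s_x and a = y.bar - b x.bar for the slope and intercept, which these notes do not state explicitly.

```python
from statistics import mean, stdev

# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n-1)

# Correlation: sum of products of standardised values, divided by n-1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)

# Least-squares line y.hat = a + bx.
b = r * s_y / s_x
a = y_bar - b * x_bar

# Residual = observed y - predicted y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Coefficient of determination from the residuals.
r_squared = 1 - sum(e ** 2 for e in residuals) / sum((y - y_bar) ** 2 for y in ys)

print(r, a, b, r_squared, sum(residuals))
```

For a least-squares line the coefficient of determination equals the square of the correlation, and the residuals sum to zero up to rounding.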

Scatterplot

Scatterplot shows relationship between 2 quantitative variables

Explanatory on x-axis

Response on y-axis

Associations

Positive - above-average values of one tend to accompany above-average values of the other, and vice versa.

Negative - above-average values of one tend to accompany below-average values of the other.

Interpreting a scatterplot

1. Look for overall pattern and for striking deviations from that pattern.

2. Describe the pattern by the direction, form and strength of the relationship

3. Look for outliers.

Response and explanatory variables

Response (dependent) - measures outcome of study

Explanatory (independent) - explains or influences changes in response variable

Chapter 2: Describing location in a distribution

Normal distribution

Probability density function given as:

f(x)=1/(σ√(2π)) e^(-(x-µ)^2/(2σ^2))

Assessing Normality

Normal probability plot close to a straight line - Normal

Systematic deviations from a straight line - non-Normal

Standard Normal distribution

For any Normal distribution we can perform a linear transformation to obtain standard Normal distribution

If the variable x has Normal distribution N(µ, σ), then the standardised variable z has Normal distribution N(0,1);

z=(x-µ)/σ

The area under the standard Normal curve can be found from a standard Normal table, or the graphic calculator (GC).
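As an alternative to a table or calculator, the standardisation can be sketched with Python's statistics.NormalDist. The values µ = 100, σ = 15 and x = 130 are made up for illustration.

```python
from statistics import NormalDist

mu, sigma = 100, 15   # hypothetical Normal distribution N(100, 15)
x = 130

# Standardise: z = (x - mu) / sigma.
z = (x - mu) / sigma

# Area to the left of z under the standard Normal curve N(0, 1).
area_left = NormalDist().cdf(z)

# The same area computed directly on the original scale.
area_left_original = NormalDist(mu, sigma).cdf(x)

print(z, area_left, area_left_original)
```

The two areas agree because the linear transformation changes the scale but not the proportion of the distribution lying below the value.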

Empirical rule

68% fall within 1 standard deviation of the mean

95% fall within 2 standard deviations of the mean

99.7% fall within 3 standard deviations of the mean
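The three percentages can be reproduced as areas under the standard Normal curve. This sketch uses statistics.NormalDist from the Python standard library.

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(k, round(100 * area, 1))
```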

Density curves

Mathematical model for the distribution

A curve that is always on or above the x axis

Area underneath it is always exactly 1.

Mean µ and standard deviation σ

Median is the equal-areas point

Mean is the balance point

In a symmetric density curve the mean and median are the same

Chebyshev's inequality

For any distribution, the % of observations falling within k standard deviations of the mean is at least (100)(1-1/k^2), for k > 1

Percentile

pth percentile of a distribution is the value with p% of the observations less than or equal to it.

Z-score

z=(x-x.bar)/s

Chapter 1: Exploring data

Description of graphical display

Comparing distributions

Side-by-side graphs

Back-to-back stemplots

Narrative comparisons

Changing units of measure

Transformed variables

mean a+b(x.bar)

median a+bM

standard deviation |b|s

IQR |b|R

Linear transformations

Resistant measure

Median is not affected by outlier values

Shape

Symmetric

Skewed (spreads far and thinly)

Uniform

Bell-shaped

Outliers

Gaps

Clusters

Spread

Range

Percentile

Interquartile Range (IQR)

Box plots (five number summary)

Variance

Standard deviation, s

Center

Mean

Median

Mode

Quantitative data

Cumulative frequency plot (Ogive)

Stemplots

Histogram

Area of each bar represents the frequency of the data in that class

Relative frequencies on vertical axis

Numerical data

Categorical data

Qualitative categories

Pie charts, dotplots, bar charts

Poisson Distribution

Approximating the Binomial distribution with the Poisson distribution

Given X~B(n,p) such that n is large (>50) and np<5 (normally p<0.1), the binomial distribution can be approximated using the Poisson distribution with mean λ=np

It is more accurate when n gets larger and p gets smaller
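A quick numerical comparison of the two distributions, as a sketch. The values n = 100 and p = 0.02 are illustrative and satisfy n > 50 and np < 5. The Poisson probability formula P(X=k) = e^(-λ) λ^k / k! is the standard one and is not stated in these notes.

```python
from math import comb, exp, factorial

n, p = 100, 0.02
lam = n * p  # Poisson mean, lambda = np

def binom_pmf(k):
    """Exact binomial probability P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k):
    """Poisson probability P(X = k) for X ~ Po(lam)."""
    return exp(-lam) * lam ** k / factorial(k)

for k in range(6):
    print(k, round(binom_pmf(k), 5), round(poisson_pmf(k), 5))

# Largest disagreement over the first few values of k.
max_gap = max(abs(binom_pmf(k) - poisson_pmf(k)) for k in range(11))
print(max_gap)
```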

Additive property of the Poisson distribution

If X and Y are independent Poisson random variables with X~Po(λ) and Y~Po(µ), then

X+Y~Po(λ+µ)

Mean & variance

If X~Po(λ), then E(X) =λ and Var(X) =λ

λ=average number of occurrences in the given interval

1) The events occur singly and randomly.

2) The events occur uniformly.

3) The events occur independently.

4) The probability of two or more events occurring within a sufficiently small interval is negligible.

Chapter 15: Inference for regression

The slope b and intercept a of the least-squares regression line are statistics.

Significance tests for regression slope

t-statistic

t = b√(∑(x-x.bar)^2)/s

Ho: β = 0

This Ho says that there is no true linear relationship between x and y.

This Ho also says that there is no correlation between x and y.

Testing correlation makes sense only if the observations are a random sample.

degrees of freedom = n -2

Residual = observed y - predicted y

Standard error about the line: s = √(∑(y-y.hat)^2/(n-2))

confidence interval = b ± t* s/√(∑(x-x.bar)^2 )
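The quantities above can be computed step by step. This sketch uses a small made-up data set and stops at the t statistic and its degrees of freedom, because the Python standard library has no t distribution. The P-value and t* would come from a t table or a calculator.

```python
import math

# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)  # sum of (x - x.bar)^2

# Least-squares slope b and intercept a.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

# Standard error about the line: s = sqrt(sum of squared residuals / (n - 2)).
s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))

# Standard error of the slope and the t statistic for Ho: beta = 0.
se_b = s / math.sqrt(sxx)
t = b / se_b
df = n - 2

print(b, s, se_b, t, df)
```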

Repeated responses y are independent of each other.

Scatterplot: overall pattern is roughly linear. Residual plot has a random pattern.

The standard deviation σ of y (σ is unknown) is the same for all values of x.

For any fixed value of x, the response y varies according to a Normal distribution.

Chapter 14: Chi-square procedures

Chi-square test of association & independence

2-way table from a single SRS

Each individual is classified according to both categorical variables.

Ho: There is no association between two categorical variables.

Ha: There is an association between two categorical variables.

Chi-square test for homogeneity of populations

Select an SRS from each of c populations.

Each individual is classified in a sample according to a categorical response variable with r possible values.

There are c different sets of proportions to be compared, one for each population.

No more than 20% of the expected counts are less than 5, and all individual expected counts are at least 1.

All counts in a 2 x 2 table should be at least 5.

Expected count = (row total x column total) / n

The null hypothesis is that the distribution of the response variable is the same in all c populations.

alternative hypothesis is that these c distributions are not all the same.

Implications

If Ho is true, the Chi-square statistic has approximately a chi-square distribution with (r-1)(c-1) degrees of freedom.

Chi-square test and the z test

We can compare the 2 proportions using the z test.

The Chi-square statistic is the square of the z statistic, and the P-value for Chi-square is the same as the two-sided P-value for z.

Uses

The z test is used to compare two proportions because it gives the choice of a one-sided test and is related to a confidence interval for p1-p2.

Chi-square test for goodness of fit

Chi-square distributions

The total area under a chi-square curve is equal to 1.

Each chi-square curve (for more than 2 degrees of freedom) begins at 0 on the horizontal axis, increases to a peak, and then approaches the horizontal axis asymptotically from above.

Each chi-square curve is skewed to the right. As the number of degrees of freedom increases, the curve becomes more symmetric and looks more like a Normal curve.

Calculations

x^2=∑(O-E)^2/E

df = k -1

P-value = P(X^2>x^2)
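A sketch of the goodness-of-fit calculation for a die rolled 60 times. The observed counts are made up, and the hypothesised proportions are 1/6 for each face. The P-value step is left to a chi-square table or calculator, since the Python standard library has no chi-square distribution.

```python
# Hypothetical observed counts for faces 1 to 6 (60 rolls in total).
observed = [8, 12, 9, 11, 10, 10]
n = sum(observed)

# Hypothesised proportions under Ho: the die is fair.
proportions = [1 / 6] * 6
expected = [n * p for p in proportions]

# Chi-square statistic: sum of (O - E)^2 / E over all k categories.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: number of categories minus 1.
df = len(observed) - 1

print(chi_sq, df)
```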

Hypotheses

Ho: The actual population proportions are equal to the hypothesised proportions.

Ha: At least one of the actual population proportions differs from the hypothesised proportions.

Chapter 13: Comparing two population parameters

Robustness

Two-sample t procedures are more robust than the one-sample t methods, particularly when the distributions are not symmetric.

Choose equal sample sizes if possible.

n1 and n2 must both be at least 5.

If n1+n2 > 30, the two-sample t procedure can be used even for skewed distributions.

Two-sample tests

Two-proportion z test

z=(p.hat_1-p.hat_2)/√(p.hat_c (1-p.hat_c )(1/n_1 +1/n_2 ))

Two-proportion z interval

(p.hat_1-p.hat_2 )±z* √((p.hat_1 (1-p.hat_1))/n_1 +(p.hat_2 (1-p.hat_2))/n_2 )
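A sketch of the two-proportion test and interval on made-up counts. It assumes p.hat_c denotes the pooled (combined) sample proportion, (x1 + x2)/(n1 + n2), and uses a two-sided alternative with a 95% confidence level.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical data: successes and sample sizes in two independent samples.
x1, n1 = 45, 100
x2, n2 = 30, 100

p1, p2 = x1 / n1, x2 / n2
p_c = (x1 + x2) / (n1 + n2)  # pooled proportion (assumed meaning of p.hat_c)

# Two-proportion z statistic (uses the pooled proportion).
z = (p1 - p2) / sqrt(p_c * (1 - p_c) * (1 / n1 + 1 / n2))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided P-value

# Two-proportion z interval (uses the separate sample proportions).
z_star = NormalDist().inv_cdf(0.975)  # 95% confidence
margin = z_star * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
interval = (p1 - p2 - margin, p1 - p2 + margin)

print(z, p_value, interval)
```

The test pools the two samples because Ho says the proportions are equal, while the interval does not assume equality and so keeps them separate.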

Two-sample t procedure

(x.bar_1-x.bar_2 )±t* √((s_1^2)/n_1 +(s_2^2)/n_2 )

Two-sample z statistic

z = ((x.bar_1-x.bar_2)-(μ_1-μ_2))/√((σ_1^2)/n_1 +(σ_2^2)/n_2 )

SRS: We have two SRSs, from two distinct populations.

Independence: The samples are independent. That is, one sample has no influence on the other.

When sampling without replacement, each population must be at least 10 times as large as the corresponding sample size.

Normality: Both populations are Normally distributed.

Chapter 12: Tests about a population mean

One-sample t test

t=(x.bar - µ0)/(s/√n)

µ>µ0 P(T>t)

µ<µ0 P(T<t)

µ≠µ0 2P(T≥|t|)

One-proportion z test

z = (p.hat - p0)/√((p0(1-p0))/n)

Alternate hypotheses

p>p0 P(Z>z)

p<p0 P(Z<z)

p≠p0 2P(Z≥|z|)

Normality condition: np0 and n(1-p0) ≥ 10
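A sketch of the one-proportion z test on made-up numbers (60 successes in 100 trials, p0 = 0.5). The standard error uses p0(1-p0)/n, and the Normality condition is checked with p0.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical data.
p0 = 0.5
x, n = 60, 100
p_hat = x / n

# Normality condition, checked with the hypothesised proportion p0.
condition_ok = n * p0 >= 10 and n * (1 - p0) >= 10

# Test statistic.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# P-values for the three possible alternatives.
std_normal = NormalDist()
p_greater = 1 - std_normal.cdf(z)                 # Ha: p > p0
p_less = std_normal.cdf(z)                        # Ha: p < p0
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))    # Ha: p != p0

print(condition_ok, z, p_greater, p_less, p_two_sided)
```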

Chapter 11: Testing a claim

Using inference to make decisions

Type II error

fail to reject Ho when Ho is false.

Power

The probability that a fixed level α test will reject Ho when a particular alternative value of the parameter is true is called the power of the test against that alternative.

Increasing power

Increase alpha.

Consider a particular alternative farther away from the hypothesised value µ0

Increase the sample size; decreases standard error

Decrease σ

Probability

1. Find the critical value at which the test begins to reject Ho.

2. Use the critical value obtained and standardise using a curve based on alternative hypothesis to find the probability.
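The two steps can be sketched for a one-sided z test of Ho: µ = µ0 against Ha: µ > µ0 with known σ. All numbers below are illustrative, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(mu0, mu_alt, sigma, n, alpha=0.05):
    """Power of the level-alpha z test of Ho: mu = mu0 vs Ha: mu > mu0 at mu_alt."""
    se = sigma / sqrt(n)
    # Step 1: smallest sample mean that leads to rejecting Ho.
    critical_mean = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Step 2: probability of exceeding it when the alternative is true.
    return 1 - NormalDist(mu_alt, se).cdf(critical_mean)

base = power_one_sided_z(mu0=0, mu_alt=0.5, sigma=1, n=25)
larger_n = power_one_sided_z(mu0=0, mu_alt=0.5, sigma=1, n=50)
larger_alpha = power_one_sided_z(mu0=0, mu_alt=0.5, sigma=1, n=25, alpha=0.10)
farther_alt = power_one_sided_z(mu0=0, mu_alt=0.8, sigma=1, n=25)
smaller_sigma = power_one_sided_z(mu0=0, mu_alt=0.5, sigma=0.8, n=25)

print(base, larger_n, larger_alpha, farther_alt, smaller_sigma)
```

Each of the four variations corresponds to one of the ways of increasing power listed in this section, and each returns a larger value than the base case.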

Type I error

reject Ho when Ho is actually true.

Significance

The significance level α of any fixed level test is the probability of a Type I error.

α is the probability that the test will reject the null hypothesis when it is in fact true.

Importance of Significance

Statistical inference is not valid for all sets of data

Badly designed surveys or experiments often produce invalid results.

Faulty data collection, outliers in the data, and testing a hypothesis on the same data that first suggested it can invalidate a test.

Beware of multiple analyses; many tests run at once will probably produce some significant results by chance alone, even if all the null hypotheses are true.

Don't ignore lack of significance

There is a tendency to infer there is no effect whenever a P-value fails to attain the usual 5% standard.

Lack of significance does not imply that H0 is true.

In some areas of research, small effects that are detectable only with large sample sizes can be of great practical significance.

Statistical significance and practical importance

A statistically significant effect need not be practically important.

Use confidence intervals to estimate the actual value for parameters as confidence intervals estimate the size of an effect rather than simply asking if it is too large to reasonably occur by chance alone.

Choosing a level of significance

There is no sharp border between "statistically significant" and "statistically insignificant" so giving the P-value allows each of us to decide individually if the evidence is sufficiently strong.

Carrying out significance tests

Confidence intervals and two-sided tests

A level α 2-sided significance test rejects a hypothesis exactly when the value µ0 falls outside a level 1-α confidence interval for µ

The link between 2-sided significance tests and confidence intervals is called duality

For a two-sided hypothesis test for mean, a significance test (level α) and a confidence interval (level C = 1-α) will yield the same conclusion.

z-test for population mean

z=(x.bar - µ0)/(σ/√n)

µ>µ0 P(Z>z)

µ<µ0 P(Z<z)

µ≠µ0 2P(Z≥|z|)

Interpretation

These P-values are exact if the population distribution is Normal and are approximately correct for large n in other cases.

Failing to find evidence against H0 means only that the data are consistent with H0, not that we have clear evidence that H0 is true.

General procedure

1. Hypotheses: Identify the population of interest and the parameter you want to draw conclusions about.

2. Conditions: Choose the appropriate inference procedure. Verify the conditions for using it.

3. Calculations: Calculate test statistic and the P-value.

4. Interpretation: Interpret your results in context of the problem.

The Basics

Test statistic

The test is based on a statistic that compares the value of the parameter as stated in the null hypothesis with an estimate of the parameter from the sample data.

Values of the estimate far from the parameter value in the direction specified by the alternative hypothesis give evidence against H0.

Standardise the estimate:

Test statistic = (estimate - hypothesised value) / standard deviation of estimate

Conditions for significance tests

SRS from the population of interest

Normality (for proportions): np0 ≥ 10 and n(1-p0) ≥ 10

Independent observations.

Hypotheses

We can have one-sided or two-sided alternative hypotheses.

alternative hypothesis

The alternative hypothesis states that the effect we suspect is present in the population.

null hypothesis

The null hypothesis is the statement that the effect we suspect is not present in the population.

P-value

Rule of thumb: alpha = 0.05 unless otherwise stated

A result with a small P-value (less than alpha) is called statistically significant.

Small P-value

Small P-values are evidence against H0 because they say that the observed result is unlikely to occur just by chance.

Large P-value

Large P-values fail to give evidence against H0.

Basic:

An outcome that would rarely happen if a claim was true is good evidence that the claim is false.

The results of a test are expressed in terms of a probability that measures how well the data and the hypothesis agree.

Chapter 10: Estimating with confidence

Estimating a population proportion

p.hat±z*√((p.hat(1-p.hat))/n)

Choosing the sample size

The margin of error involves the sample proportion of successes, so we need to guess this value when choosing n.

The guess is called p*

Use a guess p* based on pilot study or past experiences with similar studies.

Use p* = 0.5 as the guess. Margin of error is largest when p-hat = 0.5.
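A sketch of the interval and the sample-size calculation. The sample-size rule used here, n ≥ (z*/m)^2 p*(1-p*), comes from requiring the margin of error z*√(p*(1-p*)/n) to be at most m. It is a standard rearrangement rather than a formula quoted in these notes, and the data are made up.

```python
from math import ceil, sqrt
from statistics import NormalDist

z_star = NormalDist().inv_cdf(0.975)  # critical value for 95% confidence

# Confidence interval for p from hypothetical data: 120 successes out of 400.
x, n = 120, 400
p_hat = x / n
margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
interval = (p_hat - margin, p_hat + margin)

def sample_size(m, p_guess=0.5, z=z_star):
    """Smallest n whose margin of error is at most m, using the guess p*."""
    return ceil((z / m) ** 2 * p_guess * (1 - p_guess))

n_conservative = sample_size(0.03)             # p* = 0.5
n_with_pilot = sample_size(0.03, p_guess=0.3)  # p* from a (hypothetical) pilot study

print(interval, n_conservative, n_with_pilot)
```

Using p* = 0.5 gives the largest required n, so it is the safe choice when no pilot estimate is available.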

SRS: The data are an SRS from the population of interest

Normality: For a confidence interval, n is so large that both np-hat and n(1-p-hat) are 10 or more

Independence: Individual observations are independent.

When sampling without replacement, the population is at least 10 times as large as the sample.

Confidence interval for a population mean

Unknown σ

x.bar±t*s/√n

df=n-1

Paired t procedures

Matched pairs design or before-and-after measurements on the same subjects

t-distributions

Replace the unknown population standard deviation σ with the sample standard deviation s, giving the standard error s/√n.

The resultant distribution is not Normal. It is a t distribution. There is a different t distribution for each sample size n. We specify a particular t distribution by giving its degrees of freedom (df).

The density curves of the t distributions are similar in shape to the standard Normal curve. The spread of the t distributions is a bit greater than that of the standard Normal distribution.

As the degrees of freedom increases, the t(k) density curve approaches the N(0,1) curve ever more closely.

This interval is exactly correct when the population distribution is Normal and approximately correct for large n in other cases.

Using the t procedures

Except in the case of small samples, the assumption that the data are an SRS from the population of interest is more important than the assumption that the population distribution is Normal.

Sample size less than 15. Use t procedures if the data are close to Normal.

Sample size at least 15. The t procedures can be used except in the presence of outliers or strong skewness.

Large samples. The t procedures can be used even for clearly skewed distribution when the sample is large (central limit theorem).

Robustness of t procedures

Procedures that are not strongly affected by lack of Normality are called robust.

t-procedures are not robust against outliers

But they are quite robust against non-Normality of the population, when there are no outliers, even if the distribution is asymmetric.

Larger samples improve accuracy of critical values from the t distributions when the population is not Normal. This is because of the central limit theorem.

SRS: Data are SRS of size n from population of interest or come from a randomised experiment

Normality: Observations from the population have a Normal distribution. It is enough that the distribution be symmetric and single-peaked.

Independence: Individual observations are independent.

The population size should be at least 10 times the sample size.

Known σ

x.bar±z*σ/√n

Reducing Margin of error

The confidence level C decreases (z* gets smaller)

The population standard deviation decreases

The sample size increases

Procedure for inference with Confidence Intervals

1. State the parameter of interest.

2. Name the inference procedure and check conditions.

3. Calculate the confidence interval.

4. Interpret results in the context of the problem.

1. The sample must be an SRS from the population of interest.

2. The sampling distribution of the sample mean x-bar is at least approximately Normal.

If the population distribution is not Normal, the central limit theorem tells us that the sampling distribution of x-bar is approximately Normal if n is large.

3. Individual observations are independent.

4. The population size is at least 10 times as large as the sample size.

Confidence intervals

Range of plausible values that is likely to contain the unknown population parameter.

Generated using a set of sample data.

Confidence level

confidence level C, which gives the probability that the interval will capture the true parameter value in repeated samples.

Chapter 9: Sampling Distributions

Central Limit Theorem

For large sample size n>30, the sampling distribution of x-bar is approximately Normal for any population with a finite standard deviation.

The mean is given by µ and standard deviation by σ/√n

The sample size n needed depends on the population. More observations are required if the shape is skewed.
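A small simulation sketch of the central limit theorem. The population is assumed to be exponential with mean 1 and standard deviation 1, a strongly right-skewed choice made only for illustration. The sample size, number of samples and seed are also arbitrary.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 40                   # size of each sample
num_samples = 2000       # number of repeated samples

# Draw many samples from a skewed population (mean 1, sd 1) and record x-bar.
sample_means = []
for _ in range(num_samples):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(mean(sample))

centre = mean(sample_means)    # should be close to mu = 1
spread = stdev(sample_means)   # should be close to sigma / sqrt(n)
theoretical_sd = 1 / sqrt(n)

print(centre, spread, theoretical_sd)
```

A histogram of sample_means would look roughly Normal even though the population itself is far from Normal.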

Sampling

Bias and variability

Bias: how far the mean of the sampling distribution is from the true value of the parameter being estimated.

Variability: spread of its sampling distribution. Larger samples give a smaller spread.

Sample mean

Mean of x bar: µ

Standard deviation of x bar: σ/√n

1. The formula for standard deviation of x-bar is only used when the population is at least 10 times as large as the sample.

2. The facts above about the mean and standard deviation of x-bar are true no matter what the population distribution looks like.

3. The shape of the distribution of x-bar depends on the shape of the population distribution. In particular, if the population distribution is Normal, then the sampling distribution of x-bar is also Normal.

Sample proportion

We often take SRS of size n and use the p-hat to estimate the unknown parameter p.

Mean of sampling distribution is given by p

Standard deviation of sampling distribution is given by;

√((p(1-p))/n)

The formula for standard deviation of p-hat is only used when the population is at least 10 times as large as the sample.

Normal approximation is used when np and n(1-p)≥10

Sampling distribution

Distribution of values taken by the statistic in all possible samples of the same size from the same population.

9.5, 9.6, 9.7 (pages 571-573)

When describing a histogram

Center: center of distribution is very close to the true value of p

Shape: overall shape is roughly symmetric and approximately Normal.

Spread: values of p-hat range from 0.2 to 0.55.

Since the shape is approximately Normal, we can use the standard deviation to describe the spread.

Parameter and statistic

A parameter is a number that describes the population;

µ, p, σ

A statistic is a number that can be computed from the sample data without the use of any unknown parameters; x-bar, p-hat, s.

Chapter 8: The Binomial and Geometric distributions

Geometric Distributions

Geometric Mean and standard Deviation

mean = µ=1/p

variance = σ^2=(1-p)/p^2

Probability that it takes more than n trials to see the first success is

P(X>n)=(1-p)^n

Calculating Geometric Probabilities

the probability that the first success occurs on the nth trial;

P(X=n)=(1-p)^(n-1)p

p=probability of success
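The geometric formulas can be checked against each other in a few lines. The value p = 0.2 is illustrative, and the function names are hypothetical.

```python
def geom_pmf(n, p):
    """P(X = n): first success occurs on the nth trial."""
    return (1 - p) ** (n - 1) * p

def geom_tail(n, p):
    """P(X > n): it takes more than n trials to see the first success."""
    return (1 - p) ** n

p = 0.2

first_on_third = geom_pmf(3, p)
more_than_three = geom_tail(3, p)

# Cross-check: P(X > 3) = 1 - [P(X=1) + P(X=2) + P(X=3)].
complement = 1 - sum(geom_pmf(k, p) for k in range(1, 4))

# Mean approximated by summing n * P(X = n) over many terms; compare with 1/p.
approx_mean = sum(k * geom_pmf(k, p) for k in range(1, 1001))

print(first_on_third, more_than_three, complement, approx_mean)
```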

Calculator function;

tistat.geomPdf(p,n)

tistat.geomCdf(p,n)

Conditions

1) Each observation has only 2 outcomes, "success" and "failure"

2) The observations are all independent

3) The probability of success, p, is the same for each observation.

4) The variable of interest, X, is the number of trials required to obtain the first success

The number of trials in a geometric setting is not fixed but is the variable of interest

Binomial Distributions

Binomial Formula

Normal Approximation to Binomial Distributions

When n is large, the distribution of X is approximately Normal

Can be used when np≥10 and n(1-p)≥10

most accurate when p close to 0.5

least accurate when p is near 0 or 1

Binomial Mean and standard Deviation

µ=np

σ=√(np(1-p))

Binomial Probability

If X has the binomial distribution with n observations and probability p of success on each observation, the possible values of X are 0,1,2...n

P(X=k)=(n choose k) * (p)^k * (1-p)^(n-k)

(n choose k) = n!/(k!(n-k)!) is known as the binomial coefficient

This counts the number of ways in which k successes can be distributed among n observations.

Cumulative distribution function (cdf)

Given a random variable X, the cdf of X calculates the sum of probabilities for 0,1,2... up to X.

It calculates the probability of obtaining at most X successes in n trials.

Calculator function;

tistat.binomCdf(n,p,X)

Large number of red and white balls

25% are red

If balls are picked randomly, what is the least number of balls to be picked so that the probability of getting at least 1 red ball is greater than 0.95?

x = no. of red balls

P(x≥1)=1-p(x=0)

=1-(0.75)^n

1-(0.75)^n>0.95

(0.75)^n<0.05

n>10.4133

n=11
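The same answer can be reached by direct search or by taking logarithms of (0.75)^n < 0.05. This is a sketch of both routes.

```python
import math

p_red = 0.25
target = 0.95

# Direct search: increase n until P(at least 1 red) = 1 - 0.75^n exceeds 0.95.
n = 1
while 1 - (1 - p_red) ** n <= target:
    n += 1

# Logarithm route: n > ln(0.05) / ln(0.75). The inequality flips because
# ln(0.75) is negative. Rounding up works here since the bound is not a whole number.
bound = math.log(1 - target) / math.log(1 - p_red)
n_from_logs = math.ceil(bound)

print(n, bound, n_from_logs)
```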

Binomial Distribution.

p=0.06 are out of shape

SRS of 20 bears

What is the probability that more than 3 bears are out of shape?

P(x>3)= 1-P(x=0)-P(x=1)-P(x=2)-P(x=3)

= 0.028966

tistat.binomcdf(20,0.06,4,20)

= 0.028966

Probability distribution function (pdf)

Given a discrete random variable X, the pdf assigns a probability to each value of X

Calculator function;

tistat.binomPdf(n,p,X)

Example

Binomial Distribution.

n=10000 balls

p=0.2 are white balls

SRS of 10 balls

What is the probability that there are exactly 2 white balls?

P(x=2)= (10 choose 2) * (0.2)^2 * (0.8)^8

= 0.30199

tistat.binompdf(10,0.2,2)

= 0.30199
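Both worked examples (the white balls and the out-of-shape bears) can be reproduced without a calculator. In this sketch binom_pmf and binom_cdf are hypothetical helper names that play the roles of the calculator's binomPdf and binomCdf functions.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k): sum of the probabilities for 0, 1, ..., k."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Exactly 2 white balls in an SRS of 10 when p = 0.2.
p_two_white = binom_pmf(2, 10, 0.2)

# More than 3 out-of-shape bears in an SRS of 20 when p = 0.06.
p_more_than_3 = 1 - binom_cdf(3, 20, 0.06)

print(round(p_two_white, 5), round(p_more_than_3, 6))
```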

Binomial Setting

1) Each observation has only 2 outcomes, "success" and "failure"

2) There is a fixed number of observations, n

3) The n observations are all independent

4) The probability of success, p, is the same for each observation.

It is important to recognise the situations in which binomial distributions can and cannot be used.

Sampling Distribution of a count

Choose an SRS (simple random sample) of size n from a population with proportion p of successes.

When the population is much larger than the sample, the count X of successes in the sample has approximately the binomial distribution with parameters n and p.

This text delves into the essentials of interpreting and managing data, focusing on both categorical and quantitative data. It begins by exploring the graphical representation of categorical data through various charts such as pie charts, dotplots, and bar charts.
