AP Statistics

Advanced Placement Statistics covering the College Board CED Units 1-9: exploring data, sampling and experimentation, probability, sampling distributions, and statistical inference for proportions, means, chi-square, and slopes.

Subject: Mathematics · Level: Upper secondary school (ages 16–19) · 401 cards

Contents

Statistics is the science of collecting, organizing, analyzing, and interpreting data to make inferences about populations from samples.
Categorical (qualitative) variables place individuals into groups or categories (e.g., eye color, blood type). Quantitative variables take numerical values for which arithmetic operations make sense (e.g., height, age).
Discrete quantitative variables take countable values (often integers — number of siblings). Continuous quantitative variables can take any value in an interval (height, time).
A frequency table lists counts of each category. A relative frequency table reports proportions or percentages of the total. Marginal distributions are the totals from rows or columns of a two-way table.
Bar charts display frequencies for categorical variables with separated bars. Pie charts show parts of a whole as slices. Bar heights/slice areas represent proportions of each category.
Dotplots show each data value as a dot above its number-line position; good for small datasets. Stemplots split numbers into stem and leaf; preserve raw data. Histograms group data into bins; better for large datasets.
When describing a distribution use the acronym SOCS — Shape, Outliers, Center, Spread. Always describe in context of the variable measured.
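Distribution shape: symmetric (mirror image around the center), right-skewed (long tail to the right; the mean is typically greater than the median), left-skewed (long tail to the left; the mean is typically less than the median), uniform (roughly equal frequencies across values), bimodal (two peaks).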
Measures of center: mean x̄ = Σxᵢ/n (sensitive to outliers), median (middle value of ordered data; resistant to outliers), mode (most frequent value).
Measures of spread: range (max − min), IQR = Q3 − Q1 (resistant), variance s² = Σ(xᵢ − x̄)²/(n−1), standard deviation s = √s² (sensitive to outliers).
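The measures of center and spread above can be checked with Python's standard statistics module. This is a minimal sketch; the data set is hypothetical and was chosen so that one high value pulls the mean above the median.

```python
import statistics

# Hypothetical quiz scores with one unusually high value (21).
scores = [4, 5, 5, 6, 7, 8, 21]

mean = statistics.mean(scores)          # sum of values divided by n
median = statistics.median(scores)      # middle value of the ordered data
mode = statistics.mode(scores)          # most frequent value
data_range = max(scores) - min(scores)  # max minus min
variance = statistics.variance(scores)  # sample variance, divides by n - 1
std_dev = statistics.stdev(scores)      # square root of the sample variance

print(mean, median, mode, data_range)   # 8 6 5 17
print(round(variance, 2), round(std_dev, 2))
```

The value 21 raises the mean to 8 while the median stays at 6, which illustrates why the median is called resistant.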
The five-number summary: minimum, Q1, median, Q3, maximum. It is the basis for the boxplot and gives a quick snapshot of center, spread, and shape.
The 1.5×IQR rule: an observation is an outlier if it is below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. Used to identify outliers in boxplots.
Boxplots display the five-number summary as a box from Q1 to Q3 with a median line; whiskers extend to the smallest/largest non-outlier values; outliers shown as separate points.
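The five-number summary and the 1.5×IQR rule can be sketched as follows. Quartile conventions differ between textbooks and software; this sketch assumes the median-of-halves convention (the median is left out of both halves when n is odd), and the function names and data are hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]            # values below the median position
    upper = xs[half + n % 2:]    # values above the median position
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

def iqr_outliers(data):
    """Return the values flagged as outliers by the 1.5*IQR rule."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

scores = [4, 5, 5, 6, 7, 8, 21]   # hypothetical data
summary = five_number_summary(scores)
outliers = iqr_outliers(scores)

print(summary)    # (4, 5, 6, 8, 21)
print(outliers)   # [21]
```

Here IQR = 8 − 5 = 3, so the fences are 0.5 and 12.5, and a boxplot of these data would draw the upper whisker to 8 and plot 21 as a separate point.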
Percentile rank: the p-th percentile is the value below which p% of observations fall. The median is the 50th percentile; Q1 is the 25th and Q3 the 75th.
A z-score (standardized value) measures how many standard deviations a value is from the mean: z = (x − μ)/σ for population, z = (x − x̄)/s for sample. Z-scores are unitless.
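A short sketch of the z-score calculation, covering both the population and the sample version. The numbers are hypothetical and the helper name z_score is an assumption for illustration.

```python
import statistics

def z_score(x, center, spread):
    """Number of standard deviations that x lies from the center."""
    return (x - center) / spread

# Population version, with assumed parameters mu = 500 and sigma = 100.
z_pop = z_score(650, 500, 100)

# Sample version: x-bar and s are computed from hypothetical data.
heights_cm = [160, 165, 170, 175, 180]
x_bar = statistics.mean(heights_cm)
s = statistics.stdev(heights_cm)
z_sample = z_score(180, x_bar, s)

print(z_pop)                # 1.5
print(round(z_sample, 3))   # 1.265
```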
Linear transformation effects: adding a constant a shifts center and quartiles by a but does not change spread. Multiplying by b multiplies center, quartiles, and spread (range, IQR, s) by |b|; variance is multiplied by b².
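The transformation rules can be verified numerically. The sketch below converts hypothetical Celsius temperatures to Fahrenheit (multiply by b = 1.8, then add a = 32) and compares the summary statistics before and after.

```python
import statistics

temps_c = [10, 15, 20, 25, 30]             # hypothetical data
temps_f = [1.8 * c + 32 for c in temps_c]  # linear transformation y = a + b*x

mean_c, mean_f = statistics.mean(temps_c), statistics.mean(temps_f)
sd_c, sd_f = statistics.stdev(temps_c), statistics.stdev(temps_f)
var_c, var_f = statistics.variance(temps_c), statistics.variance(temps_f)

# Center: multiplied by b and shifted by a.
print(mean_c, round(mean_f, 6))
# Spread: multiplied by |b| only; adding a has no effect.
print(round(sd_f / sd_c, 6))
# Variance: multiplied by b squared.
print(round(var_f / var_c, 6))
```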
The normal distribution is a continuous, symmetric, bell-shaped distribution defined by mean μ and standard deviation σ. Notation: N(μ, σ). The standard normal distribution N(0, 1) has μ = 0 and σ = 1.
The 68-95-99.7 rule (empirical rule): for any normal distribution, approximately 68% of observations fall within 1σ of μ, 95% within 2σ, and 99.7% within 3σ.
Calculator commands: normalcdf(lower, upper, μ, σ) returns the proportion of normal data between lower and upper. invNorm(area, μ, σ) returns the x-value with the given area to its left.
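For readers without a graphing calculator, the same computations can be sketched in Python with statistics.NormalDist from the standard library. The function names normalcdf and inv_norm below are stand-ins written to mirror the calculator commands; they are not built-in Python functions. The example also checks the 68-95-99.7 rule.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Proportion of a N(mu, sigma) distribution between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area, mu=0.0, sigma=1.0):
    """x-value with the given area to its left under N(mu, sigma)."""
    return NormalDist(mu, sigma).inv_cdf(area)

# Empirical rule check on the standard normal distribution.
within_1 = normalcdf(-1, 1)
within_2 = normalcdf(-2, 2)
within_3 = normalcdf(-3, 3)
print(round(within_1, 4), round(within_2, 4), round(within_3, 4))

# Hypothetical example: scores distributed N(500, 100).
between = normalcdf(600, 700, 500, 100)   # proportion between 600 and 700
cutoff = inv_norm(0.90, 500, 100)         # 90th percentile score
print(round(between, 4), round(cutoff, 1))
```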
Normal probability plot (Q-Q plot): plots data against expected normal quantiles. If points fall on a roughly straight line, the data are approximately normal; systematic curvature indicates skewness or heavy tails.
A parameter is a numerical summary of a population (μ, σ, p, ρ). A statistic is a numerical summary of a sample (x̄, s, p̂, r). Statistics estimate parameters.
A scatterplot shows the relationship between two quantitative variables. Place the explanatory (independent, x) variable on the horizontal axis and the response (dependent, y) variable on the vertical axis.
Describe a scatterplot using DUFS — Direction (positive/negative), Unusual features (outliers, clusters), Form (linear/curved), Strength (weak/moderate/strong). Always in context.
The correlation coefficient r measures the strength and direction of the linear relationship between two quantitative variables. Range: −1 ≤ r ≤ 1. Sign of r matches slope of LSRL.
Properties of r: unitless, unaffected by linear changes of scale or units, sensitive to outliers, measures only linear association, the same regardless of which variable is x or y.
Correlation does not imply causation. A strong r only indicates association; lurking variables, common cause, or confounding can produce correlation without causal link.
The least-squares regression line (LSRL) minimizes the sum of squared vertical residuals: ŷ = a + bx (or ŷ = b₀ + b₁x). It always passes through (x̄, ȳ).
Formulas for the LSRL: slope b = r(s_y/s_x); intercept a = ȳ − b·x̄, where r is correlation, s_x and s_y are standard deviations, and x̄, ȳ are sample means.
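The LSRL formulas can be applied directly. This sketch computes r as the average product of z-scores (dividing by n − 1), then the slope and intercept from the formulas above. The data and function names are hypothetical.

```python
import statistics

def correlation(xs, ys):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def lsrl(xs, ys):
    """Return (a, b) for y-hat = a + b*x, using b = r*(s_y/s_x) and a = y-bar - b*x-bar."""
    r = correlation(xs, ys)
    b = r * statistics.stdev(ys) / statistics.stdev(xs)
    a = statistics.mean(ys) - b * statistics.mean(xs)
    return a, b

hours = [1, 2, 3, 4, 5]   # hypothetical explanatory values
score = [2, 4, 5, 4, 5]   # hypothetical response values

r = correlation(hours, score)
a, b = lsrl(hours, score)
print(round(r, 4), round(a, 4), round(b, 4))   # 0.7746 2.2 0.6
```

Plugging x̄ = 3 into ŷ = 2.2 + 0.6x gives 4, which equals ȳ, consistent with the line passing through (x̄, ȳ).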
Interpreting slope b in context: a one-unit increase in x is associated with a predicted change of b units in y. Interpret intercept a only when x = 0 is meaningful in context.
A residual is the difference between an observed and predicted value: residual = y − ŷ. Positive residuals lie above the LSRL; negative residuals lie below.
A residual plot shows residuals on the y-axis vs explanatory variable on the x-axis. A random scatter (no pattern) suggests a linear model is appropriate; a curved/U-shaped pattern suggests nonlinearity.
The coefficient of determination r² gives the proportion of variation in y explained by the LSRL. Interpretation: about r² × 100% of the variation in [y context] is explained by the linear relationship with [x context].
Standard deviation of residuals s = √(Σresidual²/(n−2)) measures the typical size of a residual — how far observed y values typically deviate from the LSRL.
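Residuals, r², and the standard deviation of the residuals can all be computed from one fitted line. This sketch uses hypothetical data and finds the slope as Σ(x − x̄)(y − ȳ)/Σ(x − x̄)², which is algebraically the same as r(s_y/s_x).

```python
import math

hours = [1, 2, 3, 4, 5]   # hypothetical explanatory values
score = [2, 4, 5, 4, 5]   # hypothetical response values
n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(score) / n

# Least-squares slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, score))
s_xx = sum((x - x_bar) ** 2 for x in hours)
b = s_xy / s_xx
a = y_bar - b * x_bar

# residual = observed y minus predicted y
predicted = [a + b * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(score, predicted)]

sse = sum(e ** 2 for e in residuals)         # sum of squared residuals
sst = sum((y - y_bar) ** 2 for y in score)   # total variation in y
r_squared = 1 - sse / sst                    # proportion of variation explained
s_resid = math.sqrt(sse / (n - 2))           # typical size of a residual

print([round(e, 2) for e in residuals])      # [-0.8, 0.6, 1.0, -0.6, -0.2]
print(round(r_squared, 4), round(s_resid, 4))
```

The residuals sum to zero (up to rounding), as they do for any least-squares line with an intercept.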
An outlier in regression is a point with an unusually large residual. An influential point substantially changes the LSRL (slope, intercept, r) if removed; it usually has an extreme x-value (high leverage).
Extrapolation is predicting y for x-values outside the range of the observed data. It is unreliable because the linear pattern may not continue beyond that range. Interpolation (predicting within the range) is generally more trustworthy.
If a scatterplot or residual plot shows curvature, transform the data — try log(y), √y, or 1/y — to linearize. After transformation, fit LSRL to the transformed data and back-transform predictions.
Power model y = ax^b becomes linear after taking log of both variables: log y = log a + b·log x. Exponential model y = ab^x becomes linear after log of y only: log y = log a + x·log b.
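Both linearizations can be demonstrated on data generated from known models, so the recovered constants can be checked. This is a sketch under that assumption: the data are synthetic and noise-free, and fit_line is a hypothetical helper that fits a least-squares line.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Exponential model y = 3 * 2^x: take the log of y only.
xs_exp = [0, 1, 2, 3, 4]
ys_exp = [3 * 2 ** x for x in xs_exp]
log_a, log_b = fit_line(xs_exp, [math.log10(y) for y in ys_exp])
exp_a, exp_b = 10 ** log_a, 10 ** log_b      # back-transform to a and b
exp_prediction = exp_a * exp_b ** 5          # predicted y at x = 5

# Power model y = 2 * x^3: take the log of both x and y.
xs_pow = [1, 2, 3, 4, 5]
ys_pow = [2 * x ** 3 for x in xs_pow]
log_a2, pow_b = fit_line([math.log10(x) for x in xs_pow],
                         [math.log10(y) for y in ys_pow])
pow_a = 10 ** log_a2                         # the slope is the exponent b

print(round(exp_a, 4), round(exp_b, 4), round(exp_prediction, 2))
print(round(pow_a, 4), round(pow_b, 4))
```

With real data the fit would not be exact, and a residual plot of the transformed data is the usual check on whether the transformation worked.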
A two-way table shows the joint distribution of two categorical variables. Conditional distributions condition on one variable to examine association; the variables are associated if conditional distributions differ.
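A small sketch of marginal and conditional distributions from a two-way table. The counts are hypothetical and were chosen so that the conditional distributions clearly differ.

```python
# Hypothetical counts: rows are grade levels, columns are how students get to school.
table = {
    "grade 9": {"bus": 30, "car": 20},
    "grade 12": {"bus": 10, "car": 40},
}
columns = ["bus", "car"]

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of transport: column totals divided by the grand total.
marginal = {c: sum(row[c] for row in table.values()) / grand_total
            for c in columns}

# Conditional distribution of transport given grade: each row divided by its row total.
conditional = {
    grade: {c: count / sum(row.values()) for c, count in row.items()}
    for grade, row in table.items()
}

print(marginal)
for grade, dist in conditional.items():
    print(grade, dist)
```

In this made-up table, 60% of grade 9 students take the bus compared with 20% of grade 12 students, so the conditional distributions differ and the two variables are associated.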
Simpson's paradox: a trend that appears in several groups can reverse when the groups are combined. Always check for lurking variables when combining data from subgroups.
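A numeric illustration of the reversal may help. The counts below are invented for illustration only: treatment A has the higher success rate in each subgroup, yet the lower rate once the subgroups are combined, because A was used mostly on hard cases.

```python
# Hypothetical (successes, trials) counts, invented for illustration only.
results = {
    "easy cases": {"A": (9, 10), "B": (80, 100)},
    "hard cases": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    return successes / trials

# Within each subgroup, treatment A has the higher success rate.
for group, arms in results.items():
    print(group, {arm: round(rate(*counts), 3) for arm, counts in arms.items()})

# Combine the subgroups by adding successes and trials for each treatment.
combined = {}
for arms in results.values():
    for arm, (successes, trials) in arms.items():
        s, t = combined.get(arm, (0, 0))
        combined[arm] = (s + successes, t + trials)

# Overall, the direction reverses: B now has the higher success rate.
overall = {arm: rate(*counts) for arm, counts in combined.items()}
print({arm: round(p, 3) for arm, p in overall.items()})
```

Case difficulty acts as the lurking variable here: it is related both to which treatment was given and to the chance of success.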
Population = the entire group of interest. Sample = the subset actually examined. Census = data collected from every member of the population (rarely feasible).
A simple random sample (SRS) of size n is a sample chosen so that every group of n individuals has an equal chance of being selected. Implemented using a random number table or random digit generator.
Stratified random sampling: divide the population into homogeneous strata (e.g., by grade or sex), then take an SRS from each stratum. Reduces variability when strata differ on the variable of interest.
Cluster sampling: divide the population into clusters (often by geography), randomly select entire clusters, and use everyone in them. Saves cost when individuals are widely dispersed.
Systematic random sampling: pick a random starting point, then select every k-th individual from a list. Approximates an SRS if the list is not ordered by the variable of interest.
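The four random sampling methods above can be sketched with Python's random module. The population, the grouping into grades and clusters, and the fixed seed are all assumptions made so the example is small and reproducible.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 40 students, 10 in each of grades 9 to 12.
population = [{"id": i, "grade": 9 + i // 10} for i in range(40)]

# Simple random sample: every group of 8 students is equally likely.
srs = rng.sample(population, 8)

# Stratified random sample: an SRS of 2 students from each grade.
strata = {}
for student in population:
    strata.setdefault(student["grade"], []).append(student)
stratified = [s for group in strata.values() for s in rng.sample(group, 2)]

# Cluster sample: treat each block of 5 consecutive ids as a cluster
# (for example, a homeroom), pick 2 clusters at random, keep everyone in them.
clusters = [population[i:i + 5] for i in range(0, 40, 5)]
cluster_sample = [s for cluster in rng.sample(clusters, 2) for s in cluster]

# Systematic random sample: random start among the first k, then every k-th student.
k = 5
start = rng.randrange(k)
systematic = population[start::k]

print([s["id"] for s in srs])
print([s["id"] for s in stratified])
print([s["id"] for s in cluster_sample])
print([s["id"] for s in systematic])
```

Note that the stratified sample is guaranteed to contain 2 students per grade, while the SRS of the same size may over- or under-represent a grade by chance.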
Convenience sampling: choose individuals easy to reach. Voluntary response sampling: people self-select to participate. Both are non-random and produce biased samples.
Sources of bias: undercoverage (some groups are left out of the sampling frame), nonresponse (selected individuals cannot be reached or refuse to respond), response bias (respondents lie or misremember), wording effects (leading or confusing questions). Bias is systematic error, not random error.
Observational study: observe individuals without imposing a treatment; it can establish association only. Experiment: deliberately impose treatments and measure the response; with random assignment, it can support cause-and-effect conclusions.
Experiment vocabulary: experimental units (called subjects when they are people), factors (explanatory variables), levels (specific values of a factor), treatments (combinations of factor levels), response variable (the outcome measured).
Three principles of experimental design: (1) Comparison — use a control or comparison group; (2) Random assignment — balance unknown variables; (3) Replication — enough subjects per treatment to reduce variability.
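Random assignment in a completely randomized design can be sketched as shuffling the units and dealing them out to the treatments in turn. The subject labels, treatment names, seed, and the function randomly_assign are hypothetical.

```python
import random

def randomly_assign(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)          # copy so the caller's list is unchanged
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

subjects = [f"S{i:02d}" for i in range(1, 21)]   # 20 hypothetical subjects
groups = randomly_assign(subjects, ["drug", "placebo"], seed=1)

for treatment, members in groups.items():
    print(treatment, len(members), sorted(members))
```

Dealing the shuffled units out in turn keeps the group sizes equal (or within one of each other), which gives each treatment the same amount of replication.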
A confounding variable is associated with both the explanatory and response variables, so its effect cannot be separated from the explanatory variable's. Random assignment helps prevent confounding by balancing such variables across treatment groups on average.
