AP Statistics
Advanced Placement Statistics covering the College Board CED Units 1-9: exploring data, sampling and experimentation, probability, sampling distributions, and statistical inference for proportions, means, chi-square, and slopes.
Subject: Mathematics · Level: Upper secondary (ages 16–19) · 401 cards
Contents
- Statistics is the science of collecting, organizing, analyzing, and interpreting data to make inferences about populations from samples.
- Categorical (qualitative) variables place individuals into groups or categories (e.g., eye color, blood type). Quantitative variables take numerical values for which arithmetic operations make sense (e.g., height, age).
- Discrete quantitative variables take countable values (often integers — number of siblings). Continuous quantitative variables can take any value in an interval (height, time).
- A frequency table lists counts of each category. A relative frequency table reports proportions or percentages of the total. Marginal distributions are the totals from rows or columns of a two-way table.
- Bar charts display frequencies for categorical variables with separated bars. Pie charts show parts of a whole as slices. Bar heights show the count or proportion in each category; slice areas show each category's proportion of the whole.
- Dotplots show each data value as a dot above its number-line position; good for small datasets. Stemplots split numbers into stem and leaf; preserve raw data. Histograms group data into bins; better for large datasets.
- When describing a distribution use the acronym SOCS — Shape, Outliers, Center, Spread. Always describe in context of the variable measured.
- Distribution shape: symmetric (mirror image around center), right-skewed (long tail to the right; typically mean > median), left-skewed (long tail to the left; typically mean < median), uniform, bimodal (two peaks).
- Measures of center: mean x̄ = Σxᵢ/n (sensitive to outliers), median (middle value of ordered data; resistant to outliers), mode (most frequent value).
- Measures of spread: range (max − min), IQR = Q3 − Q1 (resistant), variance s² = Σ(xᵢ − x̄)²/(n−1), standard deviation s = √s² (sensitive to outliers).
- The five-number summary: minimum, Q1, median, Q3, maximum. It is the basis for the boxplot and gives a quick snapshot of center, spread, and shape.
- The 1.5×IQR rule: an observation is an outlier if it is below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. Used to identify outliers in boxplots.
- Boxplots display the five-number summary as a box from Q1 to Q3 with a median line; whiskers extend to the smallest/largest non-outlier values; outliers shown as separate points.
- Percentile rank: the p-th percentile is the value below which p% of observations fall. The median is the 50th percentile; Q1 is the 25th and Q3 the 75th.
- A z-score (standardized value) measures how many standard deviations a value is from the mean: z = (x − μ)/σ for population, z = (x − x̄)/s for sample. Z-scores are unitless.
- Linear transformation effects: adding a constant a shifts center and quartiles by a but does not change spread. Multiplying by b multiplies measures of center (mean, median) by b and measures of spread (range, IQR, s) by |b|; for b > 0 the quartiles are also multiplied by b. Variance is multiplied by b².
- The normal distribution is a continuous, symmetric, bell-shaped distribution defined by mean μ and standard deviation σ. Notation: N(μ, σ). The standard normal distribution N(0, 1) has μ = 0 and σ = 1.
- The 68-95-99.7 rule (empirical rule): for any normal distribution, approximately 68% of observations fall within 1σ of μ, 95% within 2σ, and 99.7% within 3σ.
- Calculator commands: normalcdf(lower, upper, μ, σ) returns the proportion of normal data between lower and upper. invNorm(area, μ, σ) returns the x-value with the given area to its left.
- Normal probability plot (Q-Q plot): plots data against expected normal quantiles. If points fall on a roughly straight line, the data are approximately normal; systematic curvature indicates skewness or heavy tails.
- A parameter is a numerical summary of a population (μ, σ, p, ρ). A statistic is a numerical summary of a sample (x̄, s, p̂, r). Statistics estimate parameters.
- A scatterplot shows the relationship between two quantitative variables. Place the explanatory (independent, x) variable on the horizontal axis and the response (dependent, y) variable on the vertical axis.
- Describe a scatterplot using DUFS — Direction (positive/negative), Unusual features (outliers, clusters), Form (linear/curved), Strength (weak/moderate/strong). Always in context.
- The correlation coefficient r measures the strength and direction of the linear relationship between two quantitative variables. Range: −1 ≤ r ≤ 1. Sign of r matches slope of LSRL.
- Properties of r: unitless, unaffected by linear changes of scale or units, sensitive to outliers, measures only linear association, the same regardless of which variable is x or y.
- Correlation does not imply causation. A strong r only indicates association; lurking variables, common cause, or confounding can produce correlation without causal link.
- The least-squares regression line (LSRL) minimizes the sum of squared vertical residuals: ŷ = a + bx (or ŷ = b₀ + b₁x). It always passes through (x̄, ȳ).
- Formulas for the LSRL: slope b = r(s_y/s_x); intercept a = ȳ − b·x̄, where r is correlation, s_x and s_y are standard deviations, and x̄, ȳ are sample means.
- Interpreting slope b in context: a one-unit increase in x is associated with a predicted change of b units in y. Interpret intercept a only when x = 0 is meaningful in context.
- A residual is the difference between an observed and predicted value: residual = y − ŷ. Positive residuals lie above the LSRL; negative residuals lie below.
- A residual plot shows residuals on the y-axis vs explanatory variable on the x-axis. A random scatter (no pattern) suggests a linear model is appropriate; a curved/U-shaped pattern suggests nonlinearity.
- The coefficient of determination r² gives the proportion of variation in y explained by the LSRL. Interpretation: r² × 100% of the variation in [y context] is explained by the linear relationship with [x context].
- Standard deviation of residuals s = √(Σresidual²/(n−2)) measures the typical size of a residual — how far observed y values typically deviate from the LSRL.
- An outlier in regression is a point with an unusually large residual. An influential point substantially changes the LSRL (slope, intercept, r) if removed; usually has extreme x-value (high leverage).
- Extrapolation is predicting y for x-values outside the range of the observed data. Unreliable because the linear pattern may not continue. Interpolation (within range) is generally trustworthy.
- If a scatterplot or residual plot shows curvature, transform the data — try log(y), √y, or 1/y — to linearize. After transformation, fit LSRL to the transformed data and back-transform predictions.
- Power model y = ax^b becomes linear after taking log of both variables: log y = log a + b·log x. Exponential model y = ab^x becomes linear after log of y only: log y = log a + x·log b.
- A two-way table shows the joint distribution of two categorical variables. Conditional distributions condition on one variable to examine association; the variables are associated if conditional distributions differ.
- Simpson's paradox: a trend that appears in several groups can reverse when the groups are combined. Always check for lurking variables when combining data from subgroups.
- Population = the entire group of interest. Sample = the subset actually examined. Census = data collected from every member of the population (rarely feasible).
- A simple random sample (SRS) of size n is a sample chosen so that every group of n individuals has an equal chance of being selected. Implemented using a random number table or random digit generator.
- Stratified random sampling: divide the population into homogeneous strata (e.g., by grade or sex), then take an SRS from each stratum. Reduces variability when strata differ on the variable of interest.
- Cluster sampling: divide the population into clusters (often by geography), randomly select entire clusters, and use everyone in them. Saves cost when individuals are widely dispersed.
- Systematic random sampling: pick a random starting point, then select every k-th individual from a list. Approximates an SRS if the list is not ordered by the variable of interest.
- Convenience sampling: choose individuals easy to reach. Voluntary response sampling: people self-select to participate. Both are non-random and tend to produce biased samples.
- Sources of bias: undercoverage (some groups left out), nonresponse (selected individuals don't respond), response bias (lie or misremember), wording effects (leading questions). Bias is systematic, not random error.
- Observational study: observe individuals without imposing a treatment; can establish association only. Experiment: deliberately impose treatments to measure response; with random assignment, can establish causation.
- Experiment vocabulary: experimental units (called subjects when they are people), factors (explanatory variables), levels (specific values of a factor), treatments (combinations of factor levels), response variable.
- Three principles of experimental design: (1) Comparison — use a control or comparison group; (2) Random assignment — balance unknown variables; (3) Replication — enough subjects per treatment to reduce variability.
- A confounding variable is associated with both the explanatory and response variables, so its effect cannot be separated from the explanatory variable's. Random assignment helps eliminate confounding.
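
Several formulas from the list above (the five-number summary, the 1.5×IQR rule, z-scores, normalcdf/invNorm, and the LSRL slope, intercept, and residual standard deviation) can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation. The function names are hypothetical, and the quartiles use the median-of-halves convention, which is an assumption here; calculators and software may use slightly different quartile rules. The correlation is computed as the sum of z-score products divided by n − 1, a standard formula that the list itself does not state.

```python
import math
import statistics
from statistics import NormalDist


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    excluding the overall median when n is odd.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + n % 2:]
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])


def iqr_outliers(data):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]


def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd


def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Proportion of a N(mu, sigma) distribution between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)


def inv_norm(area, mu=0.0, sigma=1.0):
    """x-value of N(mu, sigma) with the given area to its left."""
    return NormalDist(mu, sigma).inv_cdf(area)


def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (z_score(x, x_bar, s_x) * z_score(y, y_bar, s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


def lsrl(xs, ys):
    """Return (a, b, r) for y-hat = a + b*x, with b = r*s_y/s_x and a = y-bar - b*x-bar."""
    r = correlation(xs, ys)
    b = r * statistics.stdev(ys) / statistics.stdev(xs)
    a = statistics.mean(ys) - b * statistics.mean(xs)
    return a, b, r


def residual_sd(xs, ys, a, b):
    """Standard deviation of residuals: sqrt(sum(residual^2) / (n - 2))."""
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return math.sqrt(sum(e * e for e in residuals) / (len(xs) - 2))


# Made-up data, for illustration only.
values = [2, 4, 5, 7, 8, 9, 30]
print(five_number_summary(values))
print(iqr_outliers(values))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, r = lsrl(xs, ys)
print(round(a, 3), round(b, 3), round(r, 3))
print(round(residual_sd(xs, ys, a, b), 3))
print(round(normalcdf(-1, 1), 4))
```

For the made-up values, the five-number summary is (2, 4, 7, 9, 30), so IQR = 5, the fences are −3.5 and 16.5, and 30 is flagged as the only outlier. For the small (x, y) set, the fitted line has slope 0.6 and intercept 2.2, and it passes through (x̄, ȳ) = (3, 4), matching the LSRL property stated in the list.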