Correlation Is Not Causation (And Other Traps)

P-values, confidence intervals, A/B tests — the statistical concepts you need to navigate a data-driven world.

Apr 22, 2026 · 7 min listen · 5 chapters

What you'll learn

Correlation vs. causation with real examples
What p-values and confidence intervals actually mean
Why most published research findings are false
How tech companies use A/B testing at scale

Correlation and causation are not the same thing

Correlation vs. causation

Correlation means two variables move together. Causation means a change in one variable produces a change in the other.

A correlation coefficient can be positive, negative, or near zero. But even a strong correlation does not prove a cause.

Common traps

Hidden confounder: a third variable affects both outcomes
Reverse causation: the effect is actually driving the cause
Coincidence: large datasets produce spurious matches
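
The coincidence trap can be sketched with a small simulation. This is a minimal illustration, not a real dataset: the "target" series, the 1,000 candidate series, the series length of 10, and the seed are all assumptions chosen for the example. Every series is pure noise, so any strong correlation found is spurious by construction.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical target: 10 yearly values of pure noise.
target = [rng.gauss(0, 1) for _ in range(10)]

# 1,000 unrelated candidate series, also pure noise.
candidates = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(1000)]

# Search for the candidate that "predicts" the target best.
best_r = max(abs(pearson(c, target)) for c in candidates)
print(f"strongest |r| among 1,000 unrelated series: {best_r:.2f}")
```

Searching many short series almost always turns up at least one that tracks the target closely, even though nothing connects them.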

Real example

Ice cream sales and drowning deaths both rise in summer. Heat increases both beach trips and ice cream purchases. The shared cause is temperature, not ice cream.
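
The ice cream example can be imitated with simulated data. This is a sketch under stated assumptions: the temperature range, the slopes, and the noise levels are invented for illustration, and removing a straight-line temperature trend is just one simple way to adjust for a confounder.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(xs, ys)]


rng = random.Random(1)

# Invented data: temperature drives both outcomes; neither drives the other.
temperature = [rng.uniform(10, 35) for _ in range(5000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1) for t in temperature]

r_raw = pearson(ice_cream, drownings)
r_adjusted = pearson(
    residuals(ice_cream, temperature), residuals(drownings, temperature)
)
print(f"raw correlation:           {r_raw:.2f}")
print(f"after removing temperature: {r_adjusted:.2f}")
```

The raw correlation is strong, but once the temperature trend is removed from both series the remaining correlation is close to zero.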

Diagram (scatter chart): Summer pattern example

P-values and confidence intervals are about evidence, not certainty

What a p-value means

The p-value is the probability of data at least as extreme as what you observed, assuming the null hypothesis is true.

If p = 0.03, that does not mean:

there is a 3% chance the null is true
there is a 97% chance the result is true

It means a result at least this extreme would show up only about 3% of the time in a no-effect world.
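
To make the definition concrete, here is a small worked example. The scenario (60 heads in 100 coin flips, with a fair coin as the null hypothesis) is an assumption picked for illustration. "At least as extreme" is taken to mean "at least as far from the expected count", which is one common two-sided convention.

```python
from math import comb


def two_sided_binomial_p(heads, flips, p_null=0.5):
    """Probability, under the null, of a count at least as far from the
    expected count as the observed one (exact binomial sum)."""
    expected = flips * p_null
    observed_gap = abs(heads - expected)
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_gap
    )


p_value = two_sided_binomial_p(heads=60, flips=100)
print(f"p-value for 60 heads in 100 flips of a fair coin: {p_value:.3f}")
```

The function never asks how likely the coin is to be fair. It only asks how often a fair coin would produce a count this lopsided.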

What a confidence interval means

A 95% confidence interval is a range produced by a method that captures the true parameter in 95% of repeated samples.

A narrow interval usually means more precision. A wide interval means more uncertainty.
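
The "repeated samples" idea can be checked by simulation. This sketch assumes a made-up population (normal, mean 10, standard deviation 3), samples of 50, and a normal-approximation interval, so the coverage it reports should land near 95% but not exactly on it.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(2)
TRUE_MEAN = 10.0  # known only because this is a simulation
Z_95 = NormalDist().inv_cdf(0.975)  # about 1.96


def interval_covers_truth(sample_size=50):
    """Draw one sample, build a 95% interval, report whether it holds TRUE_MEAN."""
    sample = [rng.gauss(TRUE_MEAN, 3.0) for _ in range(sample_size)]
    centre = mean(sample)
    half_width = Z_95 * stdev(sample) / sample_size**0.5
    return centre - half_width <= TRUE_MEAN <= centre + half_width


trials = 2000
coverage = sum(interval_covers_truth() for _ in range(trials)) / trials
print(f"share of intervals that captured the true mean: {coverage:.3f}")
```

Each individual interval either contains the true mean or misses it. The 95% describes how often the procedure succeeds across many samples.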

p = P(\text{data at least as extreme as observed} \mid H_0)

Why sample size changes everything

A tiny effect can become statistically significant in a huge sample. That is why you should always ask about effect size, uncertainty, and practical meaning.
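
A quick calculation shows the effect of sample size. The numbers are assumptions for illustration: an observed difference of 0.01 standard deviations between two equal-sized groups, tested with a two-sided z-test and unit variance in each group.

```python
from statistics import NormalDist


def two_sided_p_for_mean_difference(effect_in_sd, n_per_group):
    """Two-sided z-test p-value when the observed difference in group means
    equals effect_in_sd and each group has unit variance."""
    standard_error = (2 / n_per_group) ** 0.5
    z = effect_in_sd / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_small_sample = two_sided_p_for_mean_difference(0.01, 100)
p_huge_sample = two_sided_p_for_mean_difference(0.01, 1_000_000)
print(f"n = 100 per group:       p = {p_small_sample:.3f}")
print(f"n = 1,000,000 per group: p = {p_huge_sample:.2e}")
```

The effect is identical in both cases. Only the sample size changes, and with it the p-value.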

Why many published findings do not survive contact with reality

Why false positives happen

If you test 20 independent null hypotheses at a 5% threshold, you expect about 1 false positive on average.
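
The arithmetic behind that expectation, plus the chance of seeing at least one false positive, can be written out directly. The Bonferroni threshold at the end is an addition for illustration: it is one common correction, not something the lesson prescribes.

```python
ALPHA = 0.05  # significance threshold per test
TESTS = 20    # independent tests where the null is true

# Average number of false positives across the 20 tests.
expected_false_positives = ALPHA * TESTS

# Probability that at least one of the 20 tests comes out "significant".
chance_of_at_least_one = 1 - (1 - ALPHA) ** TESTS

# Bonferroni correction (illustrative): split the threshold across the tests.
bonferroni_threshold = ALPHA / TESTS
chance_with_bonferroni = 1 - (1 - bonferroni_threshold) ** TESTS

print(f"expected false positives:           {expected_false_positives:.1f}")
print(f"chance of at least one:             {chance_of_at_least_one:.2f}")
print(f"same chance at threshold {bonferroni_threshold}: {chance_with_bonferroni:.3f}")
```

So "about 1 on average" also means that roughly two times out of three, a batch of 20 null tests contains at least one apparent discovery.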

Problems get worse when researchers:

test many outcomes
try many subgroups
stop collecting data when the result looks good
publish only successful findings
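
The "stop when it looks good" problem can be simulated. In this sketch every experiment has no real effect, so any significant result is a false positive. The setup is an assumption for illustration: 1,000 observations per experiment, a peek every 100 observations, and a z-test with known standard deviation 1.

```python
import random

rng = random.Random(3)
Z_CRITICAL = 1.96   # two-sided 5% threshold
MAX_N = 1000        # planned sample size
LOOK_EVERY = 100    # peek after every 100 observations
EXPERIMENTS = 1000


def run_null_experiment():
    """Return (significant at any peek, significant at the final look)
    for data generated with no true effect."""
    total = 0.0
    flagged_at_any_look = False
    z = 0.0
    for n in range(1, MAX_N + 1):
        total += rng.gauss(0, 1)
        if n % LOOK_EVERY == 0:
            z = total / n**0.5  # z-score of the running mean (sd = 1)
            if abs(z) > Z_CRITICAL:
                flagged_at_any_look = True
    return flagged_at_any_look, abs(z) > Z_CRITICAL


results = [run_null_experiment() for _ in range(EXPERIMENTS)]
peeking_rate = sum(any_look for any_look, _ in results) / EXPERIMENTS
fixed_rate = sum(final for _, final in results) / EXPERIMENTS
print(f"false positive rate, stopping at first significant peek: {peeking_rate:.2f}")
print(f"false positive rate, single look at the planned size:    {fixed_rate:.2f}")
```

Checking once at the planned sample size keeps the false positive rate near 5%. Stopping at the first significant peek pushes it well above that.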

Key paper

John Ioannidis published "Why Most Published Research Findings Are False" in 2005 in PLOS Medicine.

Diagram (bar chart): Expected false positives at p < 0.05

A/B testing turns guesswork into measurement

What A/B testing does

A/B testing compares two versions under random assignment.

It helps answer a causal question: did version B cause a change in behavior?

Why randomization matters

Random assignment makes the groups similar on average, so differences are more likely to come from the treatment itself rather than hidden confounders.
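
A small simulation shows what "similar on average" buys you. The hidden trait ("engagement"), the opt-in rule, and the group sizes are all hypothetical. The comparison is between assigning users by coin flip and letting them choose version B themselves.

```python
import random
from statistics import mean

rng = random.Random(4)

# Hypothetical hidden trait that the experimenter never observes.
engagement = [rng.gauss(0, 1) for _ in range(10_000)]


def gap_in_hidden_trait(in_group_b):
    """Difference in average engagement between group B and group A."""
    group_a = [e for e, b in zip(engagement, in_group_b) if not b]
    group_b = [e for e, b in zip(engagement, in_group_b) if b]
    return mean(group_b) - mean(group_a)


# Random assignment: a coin flip decides who sees version B.
coin_flip = [rng.random() < 0.5 for _ in engagement]

# Self-selection: more engaged users are more likely to opt in to version B.
opted_in = [e + rng.gauss(0, 1) > 0 for e in engagement]

gap_randomized = gap_in_hidden_trait(coin_flip)
gap_self_selected = gap_in_hidden_trait(opted_in)
print(f"engagement gap with random assignment: {gap_randomized:+.2f}")
print(f"engagement gap with self-selection:    {gap_self_selected:+.2f}")
```

With a coin flip the two groups start out nearly identical on the hidden trait. With self-selection, group B is more engaged before the test even begins, so any later difference in behavior is confounded.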

Example

If a product team tests a new checkout button on 1,000,000 visitors and sees conversion rise from 4.8% to 5.0%, the absolute lift is 0.2 percentage points, or roughly a 4% relative lift. That sounds tiny, but 0.2 percentage points applied to 1,000,000 visitors is about 2,000 extra purchases.
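
The checkout example can be run through a standard two-proportion z-test. The lesson does not say how the 1,000,000 visitors were split, so this sketch assumes an even split of 500,000 per version, which gives 24,000 conversions for A (4.8%) and 25,000 for B (5.0%). The function name is made up for the example.

```python
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates, plus a 95%
    confidence interval for the absolute lift (B minus A)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    lift = rate_b - rate_a

    # Test statistic uses the pooled rate (the null says both rates are equal).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = lift / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Interval uses the unpooled standard error.
    se_unpooled = (rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b) ** 0.5
    margin = NormalDist().inv_cdf(0.975) * se_unpooled
    return z, p_value, (lift - margin, lift + margin)


# Assumed even split: 500,000 visitors per version.
z, p_value, ci = two_proportion_z_test(24_000, 500_000, 25_000, 500_000)
print(f"z = {z:.2f}, p = {p_value:.1e}")
print(f"95% CI for the lift: {ci[0] * 100:.2f} to {ci[1] * 100:.2f} percentage points")
```

Under these assumptions the lift is clearly distinguishable from zero, and the interval shows how large or small it could plausibly be.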

Diagram (line chart): Conversion rate in an A/B test

A practical checklist for reading data claims

Read claims in this order

What question was actually tested?
Was the design observational or experimental?
What is the effect size?
What is the confidence interval?
Was there replication?
Were many analyses tried?

Fast rule of thumb

Strong claims need strong designs. A flashy p-value is not enough.

Diagram (illustration): A classroom whiteboard showing correlation, p-values, confidence intervals, and A/B testing with arrows and simple charts

The bottom line

Correlation is a clue, not a conclusion.

P-values measure surprise under a null model.

Confidence intervals show plausible effect sizes.

A/B tests, when randomized and well run, can support causal claims.

The best analysts do not worship a single number. They ask what the data can and cannot prove.

Transcript

Welcome to Slate. Today we're looking at Correlation Is Not Causation (And Other Traps). We'll cover correlation versus causation with real examples, what p-values and confidence intervals actually mean, why most published research findings are false, and how tech companies use A/B testing at scale. Let's get into it.

A scatterplot can tell a tempting story. The points line up, and your brain wants a cause. But correlation only says two variables travel together. It does not say one drives the other. Ice cream sales and drowning deaths both rise in summer. The hidden variable is heat. More people swim, and more people buy ice cream. That is a confounder. Think of correlation like two umbrellas opening in the same storm. They are linked by weather, not by one umbrella causing the other to open. The diagram on screen shows three paths: a direct cause, a shared cause, and coincidence. That last one matters more than people think. In the 1990s, the finance researcher David Leinweber joked that U.S. stock-market returns could be predicted by butter production in Bangladesh. The joke works because with enough data, random matches appear. Real-world analysis needs a causal story, not just a pattern. When you ask whether a drug works, whether an ad increases sales, or whether a policy changes behavior, the question is not “Do these numbers move together?” It is “What would have happened otherwise?” That counterfactual is the heart of causation, and it is why experiments matter.

A p-value is easy to misuse because it sounds like a probability of truth. It is not. A p-value is the probability of seeing data at least this extreme if the null hypothesis were true. The null is usually a no-effect world. So a p-value of 0.03 does not mean there is a 97 percent chance the result is real. It means the data would be fairly unusual under the no-effect assumption. That is a very different claim. Confidence intervals tell a different part of the story. A 95 percent confidence interval comes from a method that, over many repeated samples, captures the true value 95 percent of the time. For one study, the interval either contains the truth or it does not. The phrase is about the procedure, not about a 95 percent chance for this single interval. Here is the useful intuition: the p-value asks how surprising the data are under a baseline. The confidence interval asks which effect sizes are still plausible. In medicine, that matters more than a yes-or-no threshold. A drug that lowers blood pressure by 1 point and a drug that lowers it by 20 points can both be statistically significant if the sample is large enough. Significance is not the same as importance.

A result can be statistically significant and still be wrong. That sounds harsh, but the math is unforgiving. If a field uses a 5 percent significance threshold, then even when every tested idea is false, about 1 in 20 tests will look significant by chance alone. Now add selective publishing. Positive results get written up. Negative results disappear. The literature starts to look more certain than it is. John Ioannidis argued in 2005 that many published findings are false because of low power, flexible analyses, multiple comparisons, and bias toward exciting results. Low power means studies are too small to detect real effects reliably. Flexible analysis means researchers may try several ways to slice the data until something crosses the line. That is like throwing darts at a wall and only reporting the one that lands on the bullseye. In 2015, a large project led by Brian Nosek and colleagues reported that only about 36 percent of 100 replication attempts in psychology produced a statistically significant result. A failed replication does not mean the first paper was worthless. It means science needs repeated checks. The best defense is pre-registration, larger samples, and transparency about all analyses, not just the winning one.

Tech companies use A/B tests because opinions are cheap and behavior is expensive to guess. In a typical experiment, users are randomly assigned to version A or version B. Randomization matters because it balances hidden factors on average. If the groups are large enough, differences in conversion rate can credibly be attributed to the change being tested rather than to chance or confounders. Here is the visual logic: the company does not ask whether a button color feels better. It asks whether the new button increases clicks, sign-ups, or purchases. At scale, even small lifts matter. If a site has 10 million visitors a month and a signup rate of 5 percent, a lift of 0.2 percentage points means about 20,000 extra signups a month. That is why companies run thousands of experiments. But scale also creates traps. If you peek early, test many metrics, or rerun until you get a win, you inflate false positives. Good experiment teams predefine the primary metric, decide the sample size before launch, and use statistical correction when they run many tests at once. The result is not magic. It is disciplined uncertainty.

When you read a chart, a headline, or a paper, start with the question behind the numbers. Is this correlation, or is there a comparison that supports causation? Was there random assignment? If not, what confounders could explain the pattern? Then look at the uncertainty. A p-value can tell you whether the result is surprising under a null model, but it cannot tell you whether the effect matters in practice. The confidence interval tells you the plausible range. If the interval includes both a tiny and a large effect, the evidence is still fuzzy. Next, ask whether the result was replicated. One study is a clue. Several independent studies are stronger evidence. Finally, ask how many ways the data were sliced before the final answer appeared. The more flexible the analysis, the more cautious you should be. Think of statistical evidence like a courtroom case. One witness can be mistaken. Several witnesses who agree, under different conditions, are much more convincing. Data science works the same way. Good judgment comes from combining design, uncertainty, replication, and domain knowledge.
