In probability and statistics, **Simpson’s paradox**, or the **Yule–Simpson effect**, is a paradox in which a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. This result is often encountered in social-science and medical-science statistics,^{[1]} and is particularly confounding when frequency data are unduly given causal interpretations.^{[2]} Simpson’s paradox disappears when causal relations are brought into consideration. Many statisticians believe that the general public should be informed of counter-intuitive results in statistics such as Simpson’s paradox.^{[3]}^{[4]}

Edward H. Simpson first described this phenomenon in a technical paper in 1951,^{[5]} but the statisticians Karl Pearson et al., in 1899,^{[6]} and Udny Yule, in 1903, had mentioned similar effects earlier.^{[7]} The name *Simpson’s paradox* was introduced by Colin R. Blyth in 1972.^{[8]} Since Edward Simpson did not actually discover this statistical paradox (an instance of Stigler’s law of eponymy), some writers have instead used the impersonal names *reversal paradox* and *amalgamation paradox* to refer to what is now called *Simpson’s paradox* and the *Yule–Simpson effect*.

## Examples

### Kidney stone treatment

This is a real-life example from a medical study^{[10]} comparing the success rates of two treatments for kidney stones.^{[11]}

The table below shows the success rates and numbers of treatments for small and large kidney stones, where Treatment A includes all open surgical procedures and Treatment B is percutaneous nephrolithotomy. The numbers in parentheses give the number of successes over the total size of the group. (For example, 93% equals 81 divided by 87.)

Stone size | Treatment A | Treatment B |
---|---|---|
Small stones | Group 1: 93% (81/87) | Group 2: 87% (234/270) |
Large stones | Group 3: 73% (192/263) | Group 4: 69% (55/80) |
Both | 78% (273/350) | 83% (289/350) |

The paradoxical conclusion is that Treatment A is more effective when used on small stones, and also when used on large stones, yet Treatment B appears more effective when both sizes are considered together. In this example, the “lurking” variable (or **confounding variable**), stone size, was not known to be important until its effects were included.
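The reversal can be checked directly from the counts in the table. The following is a minimal sketch in Python, using exact fractions to avoid rounding; the names `data`, `rate`, and `pooled_rate` are illustrative and do not come from the study.

```python
from fractions import Fraction

# (successes, total) for each treatment and stone size, taken from the table.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    """Success rate of a single group as an exact fraction."""
    return Fraction(successes, total)

def pooled_rate(groups):
    """Success rate after adding up successes and totals across stone sizes."""
    successes = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return Fraction(successes, total)

# Within each stone size, compare the two treatments.
for size in ("small", "large"):
    a = rate(*data["A"][size])
    b = rate(*data["B"][size])
    print(f"{size} stones: A = {float(a):.3f}, B = {float(b):.3f}, A better: {a > b}")

# Pooled over both stone sizes, the comparison flips.
pooled = {treatment: pooled_rate(groups) for treatment, groups in data.items()}
print(f"both sizes: A = {float(pooled['A']):.3f}, B = {float(pooled['B']):.3f}, "
      f"A better: {pooled['A'] > pooled['B']}")
```

Treatment A wins both within-size comparisons, while the pooled comparison favors Treatment B.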

Which treatment is considered better is determined by an inequality between two ratios (successes/total). The reversal of the inequality between the ratios, which creates Simpson’s paradox, happens because two effects occur together:

- The sizes of the groups, which are combined when the lurking variable is ignored, are very different. Doctors tend to give the severe cases (large stones) the better treatment (A), and the milder cases (small stones) the inferior treatment (B). Therefore, the totals are dominated by groups 3 and 2, and not by the two much smaller groups 1 and 4.
- The lurking variable has a large effect on the ratios, i.e. the success rate is more strongly influenced by the severity of the case than by the choice of treatment. Therefore, the group of patients with large stones using treatment A (group 3) does worse than the group with small stones, even if the latter used the inferior treatment B (group 2).

Based on these effects, the paradoxical result can be rephrased more intuitively as follows: Treatment A, when applied to a patient population consisting mainly of patients with large stones, is less successful than Treatment B applied to a patient population consisting mainly of patients with small stones.
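One way to make the two effects explicit is to write each pooled success rate as a weighted average of the per-group rates, with weights equal to each group's share of the patients receiving that treatment. This decomposition is only a restatement of the counts in the table:

$$
\frac{273}{350} = \frac{87}{350}\cdot\frac{81}{87} + \frac{263}{350}\cdot\frac{192}{263},
\qquad
\frac{289}{350} = \frac{270}{350}\cdot\frac{234}{270} + \frac{80}{350}\cdot\frac{55}{80}.
$$

For Treatment A, about three quarters of the weight (263/350) falls on the large-stone rate of 73%. For Treatment B, about three quarters of the weight (270/350) falls on the small-stone rate of 87%. The pooled rates therefore mostly compare group 3 with group 2 rather than like with like.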

### Berkeley gender bias case

One of the best-known real-life examples of Simpson’s paradox occurred when the University of California, Berkeley was sued for bias against women who had applied for admission to graduate schools there. The admission figures for the fall of 1973 showed that men applying were more likely than women to be admitted, and the difference was so large that it was unlikely to be due to chance.^{[12]}^{[13]}

Gender | Applicants | Admitted |
---|---|---|
Men | 8442 | 44% |
Women | 4321 | 35% |

But when examining the individual departments, it appeared that no department was significantly biased against women. In fact, most departments had a “small but statistically significant bias in favor of women.”^{[13]} The data from the six largest departments are listed below.

Department | Men: applicants | Men: admitted | Women: applicants | Women: admitted |
---|---|---|---|---|
A | 825 | 62% | 108 | 82% |
B | 560 | 63% | 25 | 68% |
C | 325 | 37% | 593 | 34% |
D | 417 | 33% | 375 | 35% |
E | 191 | 28% | 393 | 24% |
F | 373 | 6% | 341 | 7% |

The research paper by Bickel et al.^{[13]} concluded that women tended to apply to competitive departments with low rates of admission even among qualified applicants (such as the English department), whereas men tended to apply to less competitive departments with high rates of admission among qualified applicants (such as engineering and chemistry). The conditions under which admissions frequency data from specific departments constitute a proper defense against charges of discrimination are formulated in the book *Causality* by Pearl.^{[2]}
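The effect of where applicants applied can be illustrated with the six-department table alone. The sketch below is a rough check rather than a reproduction of the Bickel et al. analysis. It rebuilds approximate admission counts from the rounded percentages, so the pooled figures are approximate, and because it covers only these six departments they will not match the campus-wide 44% and 35%. The variable names are illustrative.

```python
# Six largest departments, from the table:
# (department, men applicants, men admit rate, women applicants, women admit rate)
departments = [
    ("A", 825, 0.62, 108, 0.82),
    ("B", 560, 0.63, 25, 0.68),
    ("C", 325, 0.37, 593, 0.34),
    ("D", 417, 0.33, 375, 0.35),
    ("E", 191, 0.28, 393, 0.24),
    ("F", 373, 0.06, 341, 0.07),
]

def pooled_rate(pairs):
    """Pooled admission rate from (applicants, rate) pairs.

    Admitted counts are approximated as applicants * rate, because the
    table reports rounded percentages rather than raw counts.
    """
    admitted = sum(n * r for n, r in pairs)
    total = sum(n for n, _ in pairs)
    return admitted / total

men_pooled = pooled_rate([(m, mr) for _, m, mr, _, _ in departments])
women_pooled = pooled_rate([(w, wr) for _, _, _, w, wr in departments])

# Departments in which women were admitted at a higher rate than men.
women_higher = [d for d, _, mr, _, wr in departments if wr > mr]

# Share of each gender's applications sent to A and B, the two departments
# with the highest admission rates in the table.
easy = {"A", "B"}
men_total = sum(m for _, m, _, _, _ in departments)
women_total = sum(w for _, _, _, w, _ in departments)
men_easy_share = sum(m for d, m, _, _, _ in departments if d in easy) / men_total
women_easy_share = sum(w for d, _, _, w, _ in departments if d in easy) / women_total

print(f"pooled admission rate: men {men_pooled:.1%}, women {women_pooled:.1%}")
print(f"departments where women had the higher rate: {women_higher}")
print(f"share applying to A or B: men {men_easy_share:.1%}, women {women_easy_share:.1%}")
```

In this subset, women have the higher admission rate in four of the six departments, yet the pooled rate is higher for men, because about half of the men's applications went to departments A and B, compared with a small fraction of the women's.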

### Low birth weight paradox

The low birth weight paradox is an apparently paradoxical observation relating to the birth weights and mortality of children born to tobacco-smoking mothers. Conventionally, babies weighing less than a certain amount (which varies between countries) are classified as having low birth weight. In a given population, babies with low birth weights have a significantly higher infant mortality rate than others. Yet normal birth weight infants of smokers have about the same mortality rate as normal birth weight infants of non-smokers, and low birth weight infants of smokers have a much lower mortality rate than low birth weight infants of non-smokers, while infants of smokers overall have a much higher mortality rate than infants of non-smokers. The reversal arises because many more infants of smokers have low birth weight, and low birth weight babies have a much higher mortality rate than normal birth weight babies.^{[14]}

### Batting averages

A common example of Simpson’s paradox involves the batting averages of players in professional baseball. It is possible for one player to hit for a higher batting average than another player during a given year, and to do so again during the next year, but to have a lower batting average when the two years are combined. This phenomenon can occur when there are large differences in the number of at-bats between the years. (The same situation applies when batting averages are calculated separately for the first and second halves of a baseball season and the data are then combined into the season’s batting average.)

A real-life example is provided by Ken Ross^{[15]} and involves the batting averages of two baseball players, Derek Jeter and David Justice, during the 1995 and 1996 seasons:^{[16]}

Player | 1995 | 1996 | Combined |
---|---|---|---|
Derek Jeter | 12/48 (.250) | 183/582 (.314) | 195/630 (**.310**) |
David Justice | 104/411 (**.253**) | 45/140 (**.321**) | 149/551 (.270) |

In both 1995 and 1996, Justice had a higher batting average (in bold type) than Jeter did. However, when the two baseball seasons are combined, Jeter shows a higher batting average than Justice. According to Ross, this phenomenon would be observed about once per year among the possible pairs of interesting baseball players. In this particular case, Simpson’s paradox can still be observed if the year 1997 is also taken into account:

Player | 1995 | 1996 | 1997 | Combined |
---|---|---|---|---|
Derek Jeter | 12/48 (.250) | 183/582 (.314) | 190/654 (.291) | 385/1284 (**.300**) |
David Justice | 104/411 (**.253**) | 45/140 (**.321**) | 163/495 (**.329**) | 312/1046 (.298) |
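The yearly and combined comparisons in the two tables above can be verified from the hits and at-bats alone. The following is a minimal sketch in Python; `hits_at_bats` and `average` are illustrative names, and exact fractions are used so that no comparison depends on rounding.

```python
from fractions import Fraction

# (hits, at-bats) per player and season, from the tables above.
hits_at_bats = {
    "Jeter": {1995: (12, 48), 1996: (183, 582), 1997: (190, 654)},
    "Justice": {1995: (104, 411), 1996: (45, 140), 1997: (163, 495)},
}

def average(player, years):
    """Batting average over the given seasons: total hits / total at-bats."""
    hits = sum(hits_at_bats[player][y][0] for y in years)
    at_bats = sum(hits_at_bats[player][y][1] for y in years)
    return Fraction(hits, at_bats)

# Season by season, Justice has the higher average.
for year in (1995, 1996, 1997):
    jeter, justice = average("Jeter", [year]), average("Justice", [year])
    print(f"{year}: Jeter {float(jeter):.3f}, Justice {float(justice):.3f}")

# Combined over two or three seasons, Jeter has the higher average.
for years in ([1995, 1996], [1995, 1996, 1997]):
    jeter, justice = average("Jeter", years), average("Justice", years)
    print(f"{years}: Jeter {float(jeter):.3f}, Justice {float(justice):.3f}")
```

The combined average is the ratio of summed hits to summed at-bats, not the mean of the yearly averages, which is why seasons with many at-bats dominate it.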

The Jeter and Justice example of Simpson’s paradox was referred to in the “Conspiracy Theory” episode of the television series *Numb3rs*, though a chart shown omitted some of the data, and listed the 1996 averages as 1995.^{[citation needed]}

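Weighting removes the reversal. If each player’s average in a given season is rescaled to the same number of at-bats (here the larger of the two players’ totals for that season: 411 in 1995 and 582 in 1996), the combined figures compare like with like, and Justice’s combined average is then higher than Jeter’s: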

Player | 1995 (scaled to 411 at-bats) | 1996 (scaled to 582 at-bats) | Combined |
---|---|---|---|
Derek Jeter | 12/48 → 102.75/411 (.250) | 183/582 (.314) | 285.75/993 (.288) |
David Justice | 104/411 (.253) | 45/140 → 187/582 (.321) | 291/993 (.293) |
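The rescaling described above amounts to applying each player's seasonal average to a common set of at-bat counts, which resembles direct standardization. The sketch below is a minimal illustration in Python; the names `records`, `reference`, and `standardized_average` are assumptions made for this example.

```python
from fractions import Fraction

# (hits, at-bats) per player and season.
records = {
    "Jeter": {1995: (12, 48), 1996: (183, 582)},
    "Justice": {1995: (104, 411), 1996: (45, 140)},
}

# Common at-bat count per season: the larger of the two players' totals.
reference = {
    year: max(records[player][year][1] for player in records)
    for year in (1995, 1996)
}

def standardized_average(player, reference):
    """Combined average after rescaling each season to the reference at-bats."""
    scaled_hits = sum(
        Fraction(hits, at_bats) * reference[year]
        for year, (hits, at_bats) in records[player].items()
    )
    return scaled_hits / sum(reference.values())

for player in records:
    print(player, round(float(standardized_average(player, reference)), 3))
```

Because both players are now weighted by the same at-bat counts, the combined figure can no longer reverse an ordering that holds in every season.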