Combining IQ Scores in Atkins Cases

When the U.S. Supreme Court decided in Atkins v. Virginia in 2002 that people who are mentally retarded (now called intellectually disabled) cannot be executed, no matter how heinous the crime, it opened a can of worms: deciding who actually qualifies for that category. The line between that condition and the next level up, borderline intellectual functioning, is a matter of convention, not really science, so there is a range of disagreement.

A case now being briefed before the court, Hamm v. Smith, deals with the question of how to assess the IQ of someone who has been tested multiple times. The court briefly touched on that issue in 2014 in Hall v. Florida. The year before, Joel Schneider of Temple University had proposed a method in a chapter of an edited book. The Hall opinion cited that chapter but brushed it off with the comment that his method is “a complicated endeavor.” Really? It’s not all that complicated. I ran the numbers myself on the data in the Smith case. It wasn’t simple, but it was simpler than computing my 2024 income tax return.

As a preliminary matter, the makers of IQ tests regularly publish a “standard error of measurement” (SEM). That number represents, in a statistical way, the scatter one could expect from giving a test multiple times to the same person or to multiple people with identical true IQs. It doesn’t account for a host of other possible errors, such as incorrect administration of the test, poor testing conditions, transient mental or physical problems of an examinee having a bad day, or, the big one in criminal cases, malingering.

So, putting those aside, here is how we do the math on the Smith case with the Schneider method.

Smith was given five IQ tests over the course of his life: two by schools in his youth, one by an expert hired by trial counsel before his trial, and two by experts hired by opposing sides during the habeas corpus litigation many years after the trial. Here is a table of the info:

Tester          Year  Test     Score  SEM
Stapleton Sch.  1979  WISC-R    75    3.19
Baldwin Sch.    1982  WISC-R    74    3.19
Chudy           1998  WAIS-R    72    2.53
Fabian          2014  SB-5      78    2.30
King            2017  WAIS-IV   74    2.16

Notice that the SEMs get smaller as time goes on. The test makers improve their products. The old rule of thumb that the confidence interval is plus or minus five is just wrong if one uses the WAIS-IV, or now the WAIS-V. Even if ±2×SEM is the appropriate width (which is debatable), that would round to four now, not five (2 × 2.16 = 4.32).

The Schneider method also requires that we make a matrix of the correlation coefficients between the various tests, a measure of how well the tests track each other. That would look like this:

Tester          WISC-R   WISC-R   WAIS-R   SB-5     WAIS-IV
Stapleton Sch.  1.00     1.00     0.88     Unknown  Unknown
Baldwin Sch.    1.00     1.00     0.88     Unknown  Unknown
Chudy           0.88     0.88     1.00     Unknown  Unknown
Fabian          Unknown  Unknown  Unknown  1.00     0.90
King            Unknown  Unknown  Unknown  0.90     1.00

Bummer. We can’t do the full composite because the essential inputs linking the first three tests to the last two are unavailable. We can, however, compute two composites, one for the first three tests and one for the last two. The correlation of a test with itself is necessarily 1, by the way.

Some of the steps below require the sum of the elements of the matrix, so let’s calculate that first. Old: 8.52; new: 3.80.
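For readers who want to check the arithmetic, the two matrix sums can be verified with a few lines of Python (a sketch; the variable names are mine, not Schneider’s):

```python
# Correlation matrices from the table above; the "Unknown" entries are
# what force us to treat the old and new batteries separately.
old_matrix = [
    [1.00, 1.00, 0.88],   # Stapleton WISC-R
    [1.00, 1.00, 0.88],   # Baldwin WISC-R
    [0.88, 0.88, 1.00],   # Chudy WAIS-R
]
new_matrix = [
    [1.00, 0.90],         # Fabian SB-5
    [0.90, 1.00],         # King WAIS-IV
]

old_sum = sum(map(sum, old_matrix))  # 8.52
new_sum = sum(map(sum, new_matrix))  # 3.80

print(round(old_sum, 2), round(new_sum, 2))
```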

Now on to the steps of Schneider’s method, as described in the blog excerpt of his book chapter.

Computing a Composite Score

1. Add up all the scores: old 221; new 152

2. Subtract the number of tests times 100: old -79; new -48

3. Divide by the square root of the sum of all the elements in the correlation matrix: old -27.06; new -24.62

4. Complete the computation of the composite score by adding 100: old 72.94; new 75.38
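Steps 1 through 4 can be sketched in a few lines of Python (the function name is my own, not Schneider’s):

```python
import math

def composite_score(scores, matrix_sum):
    """Schneider composite: sum the deviations from the mean of 100,
    rescale by the square root of the correlation-matrix sum, add 100 back."""
    deviation = sum(scores) - 100 * len(scores)      # steps 1 and 2
    return 100 + deviation / math.sqrt(matrix_sum)   # steps 3 and 4

old = composite_score([75, 74, 72], 8.52)  # about 72.94
new = composite_score([78, 74], 3.80)      # about 75.38
print(round(old, 2), round(new, 2))
```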

Confidence Intervals of Composite Scores

1. Calculate the composite reliability.

1.a. Subtract the number of tests from the sum of the correlation matrix: old 5.52; new 1.80

1.b. Add in all the test reliability coefficients.

The SEMs above were calculated from the reliability coefficients, but we can easily reverse the formula and get the reliability coefficients from the SEMs. It’s 1 − SEM²/SD². The standard deviation (SD) for IQ tests is 15. That gains us a significant digit as well, given that the publishers give us three significant digits for the SEMs but round off reliability to just two.

Result: old 8.39; new 3.75

1.c. Divide by the original sum of the correlation matrix: old 0.985; new 0.987. This is the composite reliability.
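Steps 1.a through 1.c look like this in Python, deriving the reliabilities from the SEMs as just described (a sketch; exact SEM-derived arithmetic lands within a couple of thousandths of the rounded 0.985 and 0.987 figures above, the gap coming from rounding of intermediate values):

```python
def reliability_from_sem(sem, sd=15.0):
    # Invert SEM = SD * sqrt(1 - reliability).
    return 1 - (sem / sd) ** 2

def composite_reliability(sems, matrix_sum):
    k = len(sems)  # number of tests
    rel_sum = sum(reliability_from_sem(s) for s in sems)
    return (matrix_sum - k + rel_sum) / matrix_sum  # steps 1.a-1.c

r_old = composite_reliability([3.19, 3.19, 2.53], 8.52)  # roughly 0.985-0.986
r_new = composite_reliability([2.30, 2.16], 3.80)        # roughly 0.987-0.988
```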

2. Calculate the standard error of the estimate (SEE) by subtracting the square of the composite reliability from the composite reliability itself and taking the square root. Then multiply by the standard deviation, 15: old 1.84; new 1.71

Note that the formula for computing the SEE from reliability is not the same as the formula for computing the SEM.

3. Calculate the 95% margin of error by multiplying the standard error of the estimate by 1.96: old 3.60; new 3.35.

Note that the choice of a 95% interval is controversial. See Justice Alito’s dissent in Hall. Schneider gives the Excel function for calculating a different interval.
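Steps 2 and 3 in Python, feeding in the unrounded ratios behind the reliability figures above (8.39/8.52 and 3.75/3.80); the function name is mine:

```python
import math

def see(reliability, sd=15.0):
    # Standard error of the estimate: SD * sqrt(r - r^2).
    return sd * math.sqrt(reliability - reliability ** 2)

see_old = see(8.39 / 8.52)   # about 1.84
see_new = see(3.75 / 3.80)   # about 1.71

moe_old = 1.96 * see_old     # 95% margin of error, about 3.60
moe_new = 1.96 * see_new     # about 3.35
```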

4. Calculate the estimated true score by subtracting 100 from the composite score, multiplying by the composite reliability, and adding 100 back.

Result: old 73.3; new 75.7

5. Calculate the upper and lower bounds of the 95% confidence interval by starting with the estimated true score and then adding and subtracting the margin of error: old 69.7 to 77.0; new 72.4 to 79.1

If we used a 90% confidence interval instead, the results would be: old 70.3 to 76.4; new 72.9 to 78.5

If we used a one-SEE interval, which is about 68%, the results would be: old 71.5 to 75.2; new 74.0 to 77.4.
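The interval computations can be checked end to end with a short script (a sketch; the helper function is mine, and the inputs are the rounded intermediate sums from the text):

```python
import math

def interval(composite, rel, z, sd=15.0):
    """True-score confidence interval: regress the composite toward the
    mean of 100 by the reliability, then add and subtract z SEEs."""
    see = sd * math.sqrt(rel - rel ** 2)
    true_score = 100 + (composite - 100) * rel
    return true_score - z * see, true_score + z * see

old_composite, old_rel = 100 - 79 / math.sqrt(8.52), 8.39 / 8.52
new_composite, new_rel = 100 - 48 / math.sqrt(3.80), 3.75 / 3.80

for z in (1.96, 1.645, 1.0):   # 95%, 90%, and roughly 68%
    lo_o, hi_o = interval(old_composite, old_rel, z)
    lo_n, hi_n = interval(new_composite, new_rel, z)
    print(f"z={z}: old {lo_o:.1f} to {hi_o:.1f}, new {lo_n:.1f} to {hi_n:.1f}")
```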

But maybe we don’t need confidence intervals at all. Why not just calculate the probability that the murderer’s true IQ is 70 or less and decide whether that number amounts to an “unacceptable risk,” as Hall called it, that we could be executing a person who actually is intellectually disabled? David Kaye of Penn State says we should actually calculate the probability that the true IQ is less than 70.5. The theory, I gather, is that IQ scores are traditionally rounded to integers, and the criterion is “70 or less,” so any number that rounds to 70 or less counts.

On that theory, the probabilities are: old 6.07%; new 0.12%.
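That probability is just the normal curve evaluated at 70.5, centered on the estimated true score with the SEE as the spread. Python can compute it with math.erf, no statistics package needed (a sketch using the rounded true scores and SEEs from the steps above):

```python
import math

def normal_cdf(x, mean, sd):
    # Normal CDF via the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2.
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

# Estimated true scores and SEEs from the steps above.
p_old = normal_cdf(70.5, 73.35, 1.84)  # roughly 6.1%
p_new = normal_cdf(70.5, 75.70, 1.71)  # roughly 0.12%
print(f"{p_old:.2%} {p_new:.2%}")
```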

Even though we can’t combine these two composites mathematically, can we reach a conclusion, just considering them together qualitatively, that there is no “unacceptable risk” in this case? I think so. The newer scores, produced with better instruments and likely under better testing conditions, deserve more weight, in my unexpert opinion.

For more, see our amicus curiae brief in this case, to be filed Monday. The brief also does the math with the Bayesian alternative.

Update (9/10/25): Our brief in the case was filed August 11. It is available on the Supreme Court’s site, here, and on CJLF’s own site, here.