Flawed study boasts 30% reduction in recidivism: San Francisco’s Make-it-Right program

The “Make-it-Right” (MIR) program is a restorative justice conferencing and diversion program that was implemented in San Francisco for high-risk teenagers facing medium-severity felony offenses (e.g., burglary, assault, motor vehicle theft). The National Bureau of Economic Research (NBER) recently published a working paper that boasts a 30% reduction in four-year recidivism rates for youths in the program compared with a control group. The researchers claim that the study is especially strong because it is a randomized controlled trial (RCT), generally considered the strongest type of individual research study. However, upon closer review of the study, there is reason to be skeptical of the results.

To summarize, it looks like the randomization was severely compromised, rendering the “key strength” of the design effectively invalid. This does happen in research sometimes, and there are ways to try to deal with it. However, I am disappointed that the authors neither acknowledged this problem nor took any steps to mitigate it. Below, I review the study, explain how the randomization went wrong and how this affects the results, and outline some steps the authors should have taken (but didn’t).

As stated above, the MIR program had two key elements: restorative justice conferencing and diversion from felony prosecution. Restorative justice programming can take many forms, but it typically involves a conversation between the victim and the offender in which they discuss the harm that was done. The offender is required to accept culpability, and the victim is able to explain how the crime affected their life and well-being. At the end of the conference, both parties agree on a plan for restoring the harm caused to the victim. While the offender obviously cannot undo previous criminal behavior, “restoring harm” typically involves something like paying restitution to the victim or agreeing to participate in a certain type of community service. Sometimes restorative justice programming is used in lieu of criminal prosecution, as in the MIR program. In other words, offenders who successfully completed the MIR program were not subject to felony prosecution and were diverted from the criminal justice system.

Most of the existing research on restorative justice focuses on victim outcomes (e.g., victim satisfaction, post-traumatic stress symptoms), and that body of work generally suggests the approach may be effective on those measures. However, the impact on offender outcomes (e.g., recidivism) is less conclusive. A research review from 2013 concluded that there was a lack of high-quality evidence on the impacts of restorative justice interventions on recidivism, particularly when considering long-term impacts.

According to the authors of the NBER study, their evaluation contributes to this literature in several ways, the big one being that they used random assignment. As such, they claim that “there are no observed or unobserved confounders to the intervention in our setting since assignment to treatment and control groups was done at random.” However, let me explain why this is not really true.

Now, I’m not denying that randomization, when successful, is a major strength of a research study. When done well, it produces groups that are statistically equivalent, on average, on all observed and unobserved factors, with the only systematic difference being that one group received the intervention and the other did not. But the randomization has to be done well in order for this to be the case. So how do you know whether the authors’ randomization process was actually successful?
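To make that concrete, here is a minimal simulation, entirely invented for illustration (the covariates and their distributions are my own assumptions, not data from the paper), showing that a random 99/44 split tends to balance characteristics across groups whether or not anyone ever measured them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pool of 143 eligible youths with one "observed" and one
# "unobserved" baseline characteristic (both invented for illustration).
n = 143
prior_arrests = rng.poisson(2.0, size=n)    # observed covariate
impulsivity = rng.normal(0.0, 1.0, size=n)  # unobserved covariate

# Randomly assign 99 to treatment and 44 to control, mirroring the MIR split.
assignment = np.array([1] * 99 + [0] * 44)
rng.shuffle(assignment)

# Under random assignment, group means differ only by chance, so they should
# be similar here and equal in expectation across repeated randomizations.
for name, x in [("prior_arrests", prior_arrests), ("impulsivity", impulsivity)]:
    print(name, round(x[assignment == 1].mean(), 2), round(x[assignment == 0].mean(), 2))
```

Of course, this only holds if the random assignment carries all the way through to the outcome data, which is exactly where this study runs into trouble.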

My biggest criticism of this study is that it is touted as an RCT, but a deeper dive into the paper shows that the randomization was actually compromised. In other words, the authors set out to conduct an RCT but fell short. Worse, they failed to acknowledge this, and they did not take steps to make their (now quasi-experimental) study stronger; they simply treated it as an RCT anyway without addressing some serious limitations. For that reason, I would not trust these results as they are currently reported.

First, let’s look at how they did the random assignment. They identified 143 people who were eligible for the program, and then randomly assigned 99 of them to the MIR group and 44 of them to the control group. At first, the groups did appear comparable, but what happened next is troubling. Specifically, of the 99 people assigned to the MIR group, only 80.8% actually enrolled in the program. This means the treatment group immediately lost 19 people, dropping it from 99 to 80. Then, of these 80 people, only 53 actually completed the program. The researchers are not entirely forthcoming about these numbers, though. They still maintain that their “final sample” is 143 (99 in the treatment group and 44 in the control group), but here’s the kicker: they did not have outcomes for all of these people, so the final sample is actually 97 (53 in the treatment group and 44 in the control group). To claim that the final sample size is 143 is a huge oversight. Not only is it misleading, it is completely wrong.
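To spell out the arithmetic, here is the sample flow reconstructed from the figures reported in the paper (the counts are back-calculated from the reported percentages, so treat them as approximate):

```python
# Sample flow, reconstructed from the paper's reported percentages.
randomized_treatment = 99
randomized_control = 44
randomized_total = randomized_treatment + randomized_control  # 143

enrolled = round(randomized_treatment * 0.808)  # ~80 enrolled in MIR
completed = round(enrolled * 0.667)             # ~53 completed the program

# Outcomes exist only for completers plus the (intact) control group.
analytic_total = completed + randomized_control

print(randomized_total)  # 143 -- the "final sample" the authors report
print(analytic_total)    # 97  -- the sample that was actually analyzed
```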

To understand why this is the case, I like to think about the sample more dynamically. In an RCT, you start with the “randomized sample.” This is the total number of people who were randomly assigned to groups at the beginning of a study. If the randomization method is done well, this will generate groups that are statistically similar to each other on all observed and unobserved factors. Researchers will often display sample characteristics (e.g., demographic breakdowns) side-by-side for treatment and comparison groups to show that they appear similar on certain factors — in other words, the researchers usually try to show that the groups have “baseline equivalence” in terms of prior criminal activity, age, gender, etc.
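For readers who have never seen one, a baseline-equivalence check often looks something like the sketch below. The data frame, column names, and the choice of a simple Welch t-test are my own assumptions for illustration, not anything taken from the paper:

```python
import pandas as pd
from scipy import stats

def baseline_equivalence(df: pd.DataFrame, covariates: list[str], group_col: str = "treated") -> pd.DataFrame:
    """Compare treatment and control means on each baseline covariate."""
    rows = []
    for cov in covariates:
        treated = df.loc[df[group_col] == 1, cov].dropna()
        control = df.loc[df[group_col] == 0, cov].dropna()
        _, p_value = stats.ttest_ind(treated, control, equal_var=False)  # Welch t-test
        rows.append({
            "covariate": cov,
            "treated_mean": treated.mean(),
            "control_mean": control.mean(),
            "p_value": p_value,
        })
    return pd.DataFrame(rows)

# Hypothetical usage: baseline_equivalence(df, ["age", "prior_arrests", "male"])
```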

However, as we have seen above, it is rare that all of the randomized individuals will actually complete a study. More commonly, there will be at least some dropout (in research terms, “attrition”). For people who drop out, there are no outcomes to examine. Thus, when it comes to measuring outcomes (the part we care about), the sample is usually smaller than it was originally. This smaller sample is referred to as the “analytic sample,” i.e., the sample that is actually being analyzed. The analytic sample can be thought of as the “final” sample.

If attrition levels are low (say, less than 20%, and even that is a liberal threshold), then we don’t need to worry as much about the randomization being compromised. But if attrition levels are high, there is reason to worry, as it can drastically change the sample to the point where the groups are no longer comparable. Think about it this way: the people who drop out of a program are often systematically different from those who complete it, and in this study all of the dropout occurred in the treatment arm. So who exactly dropped out of the program, and what impact did this have on the final sample? Are the groups still statistically similar to each other, even though so many people have dropped out at this point?
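Here is what those attrition numbers look like in this study, using the counts reconstructed earlier; note that both the overall rate and the gap between arms are far beyond the 20% rule of thumb:

```python
# Overall and differential attrition, using the reconstructed counts.
randomized = {"treatment": 99, "control": 44}
analyzed = {"treatment": 53, "control": 44}

overall = 1 - sum(analyzed.values()) / sum(randomized.values())
by_arm = {arm: 1 - analyzed[arm] / randomized[arm] for arm in randomized}
differential = by_arm["treatment"] - by_arm["control"]

print(f"overall attrition:       {overall:.1%}")              # ~32%
print(f"treatment-arm attrition: {by_arm['treatment']:.1%}")  # ~46%
print(f"differential attrition:  {differential:.1%}")         # ~46%
```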

Well, we don’t know whether the groups are still comparable unless the authors demonstrate baseline equivalence for the analytic sample. Unfortunately, in the current study the authors assess baseline equivalence only for the randomized sample, which, as we have seen, looks very different from the sample that was actually analyzed. It is disappointing that the authors fail to make this distinction and incorrectly refer to their final sample as N=143 when it is in fact N=97. The authors were not very forthcoming about this at all; I actually had to calculate the final sample size manually because it was not provided.

As someone reading this study, there are a few things to consider. On its face, the randomization element is a strength, and it appears to have successfully generated equivalent treatment and control groups, at first anyway. But as I stated above, approximately half of the treatment group dropped out, such that none of their outcomes could be included in the final analysis. Not only does this dramatically decrease the sample size, it also represents a large amount of attrition. So the major question is this: if the groups were comparable at the outset, were they still comparable after half of the treatment group dropped out? Well, we don’t know, because the authors do not acknowledge or investigate this problem.

This is where it is important to read between the lines. The authors do not directly state that their sample size decreased or that people dropped out of the study, so the flaw is not necessarily apparent at first glance. For example, they do mention that “80.8% of those assigned to MIR enrolled in the program” (read: 19.2% dropped out immediately). Then, they mention that “among those enrolling in MIR, 66.7% completed the program” (read: the remaining 33.3% did not complete the program, and there are no outcomes on them). Reading between the lines reveals that only 53 of the original 99 people, roughly 54%, actually completed the program.
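The chained arithmetic, for anyone who wants to verify it (just the two reported percentages multiplied together):

```python
enrolled_rate = 0.808    # share of the 99 assigned to MIR who enrolled
completion_rate = 0.667  # share of enrollees who completed the program
print(f"{enrolled_rate * completion_rate:.1%} of those assigned to MIR completed it")  # ~53.9%
```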

So, even if the groups were comparable when initially randomized, the high level of attrition means that the sample composition may have changed dramatically. And depending on how much the sample composition shifts, it can render the RCT completely invalid. When an RCT has high attrition, it essentially counteracts the benefits achieved from randomization and is effectively no better than a quasi-experiment. Further, a compromised RCT is of even lower methodological quality than a quasi-experiment if the authors fail to assess and acknowledge the impact of attrition.

When attrition occurs in an RCT (which it often does), it is on the researchers to prove that the study has not been totally compromised. In cases where attrition is severe, the authors need to demonstrate baseline equivalence again, this time for the analytic sample; this would show that the groups are still equivalent despite the attrition. Even if the authors are unable to show this, it is not the end of the world: in that case, they should at least attempt to control for observed differences between groups in their statistical analysis, as sketched below. Unfortunately, in the NBER study the authors do neither, and so the attrition remains a serious limitation.
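As a rough sketch of what that kind of adjustment could look like on the analytic sample (every variable name here is hypothetical and the data are synthetic, generated only so the example runs; this is not the authors’ model):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the 97-person analytic sample; all values are
# invented purely for illustration.
rng = np.random.default_rng(1)
analytic_df = pd.DataFrame({
    "treated": [1] * 53 + [0] * 44,
    "age": rng.integers(13, 18, size=97),
    "prior_arrests": rng.poisson(1.5, size=97),
    "recidivated": rng.integers(0, 2, size=97),
})

# Regression adjustment: estimate the treatment coefficient while
# controlling for observed baseline covariates.
model = smf.logit("recidivated ~ treated + age + prior_arrests", data=analytic_df).fit()
print(model.summary())
```

Adjustment like this only accounts for observed differences, of course; it cannot restore balance on unobserved factors once the randomization has broken down.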

To be clear, I am not so much disappointed that the researchers’ randomization was compromised, because this is not uncommon. What disappoints me is that they neither acknowledged the problem nor made any attempt to mitigate it. Further, it is incredibly misleading to claim that the sample size was 143 when recidivism outcomes were only examined for 97 of those people. Overall, there are some very concerning oversights in the current working paper that I hope will be addressed prior to its actual publication.