Is it practical to measure the effectiveness of educational programs with standardized tests?

Mike Preiner
5 min read · Jul 11, 2022


In principle, standardized tests offer one of the best ways to measure the impact of educational programs and interventions, where our basic question is: how much did students in a program learn compared to their peers who are not in the program? In this post we’ll look at a few practical considerations for using two commonly used standardized tests in Washington state: the MAP (Measures of Academic Progress) and the SBA (Smarter Balanced Assessment).

We previously looked at how these two tests matched up with our internal weekly skill assessments. In this post we’ll extend that analysis with an eye towards using standardized tests for measuring the effectiveness of educational programs.

Given the question posed above, the first thing we want to understand is the noise level in our assessments. Conceptually, we can think of this as the spread in results we would see if we repeatedly tested a single student. In our case, with annual (or semi-annual) tests, we have a limited set of data points, so it becomes natural to compare year-on-year results. For example, imagine if assessments in 3rd grade had absolutely no relation to results from 2nd grade. The data would clearly not be useful for measuring the impact of an educational program! So how well do our current standardized tests predict the results of subsequent tests? There has already been some good research on this: for example, you can see this nice longitudinal study that looks at the predictive power of the SBA. In our case, we’ll start by showing results from both the MAP and SBA tests for the 2021–22 academic year at two different schools. We plot the Spring 2022 results vs. the Fall 2021 results below.

Figure 1. Spring 2022 assessment results versus Fall 2021 results for individual students at two different Seattle Public Schools. We show results for both the MAP and SBA, along with the results of a simple linear model.

From the graphs we can see a few things. The first is that in both tests there is a straightforward relationship between spring and fall results: students who score higher in the fall tend to score higher the following spring. We can also see that the MAP assessment seems to have less “noise” than the SBA. There are a few ways to quantify this difference. One is via the correlation coefficient. However, to get a more useful metric, we can use the following procedure (a short code sketch follows the list):

  1. Make a simple linear model that predicts spring results based on fall results: the linear fits in Figure 1.
  2. Measure how much the actual spring results differ from the predictions. In other words, we’ll measure the residual of our simple model.
  3. Since the SBA and MAP results are both reported on arbitrary scale scores, comparing them is difficult. To make the results easier to interpret and allow a direct comparison, we’ll convert the residuals to grade levels (via the procedure described here) and calculate the median absolute deviation, which can be thought of as the “typical” error or noise.
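
To make this concrete, here is a minimal sketch of the procedure, assuming a hypothetical pandas DataFrame with one row per student and “fall” and “spring” columns holding a single test’s scale scores, plus a hypothetical points-per-grade-level conversion factor (the actual grade-level conversion we use is the one described in the linked post):

```python
import numpy as np
import pandas as pd

def typical_noise_in_grade_levels(scores: pd.DataFrame,
                                  points_per_grade_level: float) -> float:
    """Estimate the 'typical' test noise, in grade levels."""
    # 1. Simple linear model: predict spring scores from fall scores.
    slope, intercept = np.polyfit(scores["fall"], scores["spring"], deg=1)
    predicted_spring = slope * scores["fall"] + intercept

    # 2. Residuals: how far each student's actual spring score lands
    #    from the model's prediction.
    residuals = scores["spring"] - predicted_spring

    # 3. Convert the residuals to grade levels and take the median
    #    absolute deviation as the "typical" error.
    residuals_in_grades = residuals / points_per_grade_level
    return float(np.median(np.abs(residuals_in_grades)))
```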

The results are shown below. Notably, the MAP results have a median absolute deviation of about 0.3 grade levels, while the SBA’s is 0.6 grade levels. Thus we can say that the SBA results are roughly twice as noisy as the MAP results.*

*Detailed side note: This year’s fall SBA used a slightly different testing procedure than the spring SBA, which could account for some of the differences. However, in both the spring and the fall the MAP correlated better with our own internal measurements of student learning than the SBA, which makes us suspect there may be real differences in testing noise between the two assessments. This is something we’ll look at as we get more data.

This brings us to our final question: what are the practical implications of the different noise levels in our standardized tests? To illustrate what this would look like in action, we simulated experiments using a regression discontinuity design, which lets us isolate the program’s causal contribution to any gains in student learning. In our case, we simulated choosing which students to enroll in an educational intervention based on a threshold value in their fall assessments.

Specifically, our simulations are run in the following manner:

  1. We use the student score distributions from Figure 1 to build a simulated class of sixty 2nd grade students.
  2. We choose a threshold fall score (in this case, 170), and assume students scoring below the threshold are enrolled in an intervention.
  3. We assume the students in the intervention gain an additional 0.5 to 1.5 grade levels of academic growth, on average. We also assume there is a spread in this growth: some students will benefit more than average and some less. (A simplified version of this simulation is sketched in code below.)
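
As a concrete illustration, here is a simplified sketch of one simulated class. The threshold of 170 and the class size of sixty come from the setup above; the specific distribution parameters (means, spreads, and the points-per-grade-level conversion) are illustrative assumptions rather than the actual values fitted to the Figure 1 data:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N_STUDENTS = 60
THRESHOLD = 170            # fall score below which students are enrolled
POINTS_PER_GRADE = 12.0    # assumed scale-score points per grade level
TEST_NOISE_GRADES = 0.3    # "typical" test noise, in grade levels (MAP-like)

# Fall scores for the simulated class (illustrative mean and spread).
fall = rng.normal(loc=172, scale=10, size=N_STUDENTS)
enrolled = fall < THRESHOLD

# Baseline year-over-year growth plus test noise, in grade levels.
baseline_growth = rng.normal(loc=1.0, scale=0.3, size=N_STUDENTS)
test_noise = rng.normal(loc=0.0, scale=TEST_NOISE_GRADES, size=N_STUDENTS)

# Intervention effect: ~1 extra grade level on average for enrolled students,
# with a spread so some students benefit more than others.
program_effect = rng.normal(loc=1.0, scale=0.25, size=N_STUDENTS) * enrolled

# Simulated spring scores.
spring = fall + POINTS_PER_GRADE * (baseline_growth + program_effect + test_noise)
```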

The results are shown below.

Figure 2. Simulated results of math interventions for a 2nd grade class of sixty students, where about half are enrolled in the intervention, assuming students are taking the MAP. We model our student score distribution (along with associated noise) from the data shown in Figure 1.
Figure 3. Simulated results of math interventions for a 3rd grade class of sixty students, where about half are enrolled in the intervention, assuming students are taking the SBA. We model our student score distribution (along with associated noise) from the data shown in Figure 1.

As the program effectiveness increases, it becomes possible to clearly see the impact even without any statistical analysis. But that raises a question: how realistic are the impact numbers? To put them in perspective, during the 2021–22 school year, students enrolled in our program showed an average of ~0.6 grade levels more growth than they had shown in the past (more on this in a future post). Unfortunately, we weren’t using an experimental design that let us separate our program’s impact from other effects, such as a hypothetical “post-COVID bounce”. However, these results make us suspect that it should be possible for a high-quality intervention to increase student learning by 0.5 to 1.0 grade levels over the course of a year. Finally, it is worth noting that while the results are broadly similar with SBA data, the higher noise levels in the SBA mean that the program impact isn’t as clearly visible.
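
While the effect can be seen by eye in the plots above, a regression discontinuity fit makes it quantitative. Here is a minimal sketch of how the program’s impact could be estimated from data like our simulation, reusing the `fall`, `spring`, `enrolled`, `THRESHOLD`, and `POINTS_PER_GRADE` variables from the earlier sketch and a standard sharp-discontinuity regression (intercept, running variable, treatment indicator, and an interaction term):

```python
import numpy as np

# Center the running variable (fall score) at the enrollment threshold.
centered_fall = fall - THRESHOLD

# Design matrix: the coefficient on `enrolled` is the jump at the threshold,
# i.e. the estimated program effect in scale-score points.
X = np.column_stack([
    np.ones_like(fall),            # intercept
    centered_fall,                 # linear trend in fall score
    enrolled.astype(float),        # discontinuity at the threshold
    centered_fall * enrolled,      # allow a different slope below threshold
])
coef, *_ = np.linalg.lstsq(X, spring, rcond=None)

effect_in_grade_levels = coef[2] / POINTS_PER_GRADE
print(f"Estimated program impact: {effect_in_grade_levels:.2f} grade levels")
```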

In summary, it appears that, with an appropriate experimental design, it should be possible to clearly measure the causal impact of an effective intervention program within a single school year using standardized testing data.

We’ll be incorporating the lessons from this post into our work for next year, so stay tuned!

