How well do weekly math assessments match up to standardized tests?

Mike Preiner
Jan 7, 2022

There is a lot of debate about standardized testing in schools these days, but one thing seems clear: standardized tests are one of the best yardsticks we have today for measuring many of the inequalities in our education system.

That being said, there are many issues with how standardized tests get used today. For those interested in improving how students learn, one of the key frustrations is that standardized assessments are typically given only once or twice a year. This makes it very difficult to confidently iterate on programs more than once a year. In a previous post, we outlined what an ideal assessment system could look like, and included a case study from our math intervention program. When we wrote the case study, we were missing one key piece of data: an explicit comparison between the system we were using (based on IXL’s diagnostic module) and standardized testing data.

This comparison is important because we are interested in closing education gaps, which today are measured via standardized tests. To quantitatively understand our success (or lack thereof), we want to measure our progress in the same units. The good news: standardized testing resumed in Washington this fall, and we now have the data for our comparison!

The Measurements

As part of our standard Math Agency program, we perform an initial diagnostic assessment using IXL, and then have students update their assessments on a weekly basis. We’ll be comparing the overall results (in terms of student grade level) to the two main standardized tests used at our pilot schools: the Measures of Academic Progress (MAP) and the Smarter Balanced Assessment (SBA). We’ll use IXL data from the same week the standardized tests were taken. I won’t delve into the meaning of the MAP RIT score or SBA scale score here; the post mentioned above outlines some of the challenges with interpretability.

Let’s start with the basics: we’ll plot the standardized test scores for each student against our diagnostics results. Our schools assessed 2nd and 3rd graders with the MAP, and 4th and 5th graders via the SBA.

Comparison of MAP scores and IXL Diagnostics for 2nd and 3rd grade students (n=24) at two of our participating elementary schools.
Comparison of SBA scores and IXL Diagnostics for 4th and 5th grade students (n=30) at two of our participating elementary schools. The two data points with asterisks (*) are discussed below.
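If you’re curious how a comparison like this can be put together, here’s a minimal sketch using pandas and matplotlib. This isn’t our actual analysis code, and the file and column names are hypothetical placeholders:

```python
# Sketch: scatter a weekly diagnostic against a standardized test score.
# File name and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

# Each row: one student, their IXL diagnostic grade level, and their MAP RIT score.
df = pd.read_csv("map_vs_ixl.csv")

fig, ax = plt.subplots()
ax.scatter(df["ixl_grade_level"], df["map_rit_score"])
ax.set_xlabel("IXL diagnostic (grade level)")
ax.set_ylabel("MAP RIT score")
ax.set_title("MAP vs. IXL diagnostic, grades 2-3")
plt.show()
```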

From the graphs above, we can see that there is a pretty clear correlation between the IXL data and both the MAP and SBA results. What isn’t clear directly from the plots is 1) exactly how strong the correlation is, and 2) how much of a correlation we should expect in the first place.
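Answering the first question just takes a correlation coefficient. Here’s a quick sketch of what that calculation looks like with scipy; the score lists below are toy placeholders for illustration, not our student data:

```python
# Sketch: quantify the strength of the relationship with Pearson's r.
from scipy.stats import pearsonr

ixl_grade_level = [1.8, 2.4, 2.9, 3.1, 3.6, 4.0]  # IXL diagnostic, in grade levels (toy values)
map_rit_score = [172, 181, 188, 190, 197, 203]     # matching MAP RIT scores (toy values)

r, p_value = pearsonr(ixl_grade_level, map_rit_score)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```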

To answer the second question, we’ll reference some nice data from NWEA (the makers of the MAP assessment) that measures the correlation between MAP and SBA results from a very large sample of students. They find a correlation coefficient of 0.88 between the two tests; a scatterplot of the results is shown below.

Comparison of MAP and SBA scores for 39,582 students in grades 3–8. Reproduced from “Linking the Smarter Balanced Assessments to NWEA MAP Assessments”, NWEA. The full text can be accessed here.

How does our data compare? The table below shows the three measurements. We can see that, based on our initial data, the weekly IXL assessments roughly agree with the MAP and SBA scores about as much as the MAP and SBA scores agree with each other. I won’t go into the gory details of the uncertainty estimates of the different correlations, but clearly we are working with pretty small sample sizes for the IXL comparisons :)

Pearson’s correlation coefficient (r) between the three different assessments. The coefficients for the IXL comparisons were calculated using the Thorndike formula to account for restriction of range: the fact that our data covers a smaller span of values than the larger MAP-SBA dataset.
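For anyone who wants to apply the same correction, here’s a small sketch of the Thorndike Case 2 formula, the standard adjustment for direct range restriction. The example numbers are illustrative only, not the values from our data:

```python
import math

def thorndike_case2(r_restricted: float, sd_unrestricted: float, sd_restricted: float) -> float:
    """Correct a correlation for direct range restriction (Thorndike Case 2).

    r_restricted: correlation observed in the restricted sample.
    sd_unrestricted / sd_restricted: standard deviations of the restricting
    variable in the reference population vs. the restricted sample.
    """
    k = sd_unrestricted / sd_restricted
    return (r_restricted * k) / math.sqrt(1 - r_restricted**2 + (r_restricted**2) * k**2)

# Toy example: an observed r of 0.70 in a sample whose score spread is about
# two-thirds of the reference population's gets corrected upward.
print(round(thorndike_case2(0.70, sd_unrestricted=1.0, sd_restricted=0.67), 2))
```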

Finally, we were interested in understanding some of the outliers in our comparisons. We took two of the largest outliers of the SBA-IXL comparison from one of our schools (marked with asterisks on the IXL-SBA plot above) and interviewed the coaches and teachers of the two students. There was a clear consensus that the IXL data better represented the current skills of each student. We shouldn’t assume that will always be the case, but it does increase our confidence in our high-frequency assessments.

Wrap Up

Broadly speaking, there is good agreement between the IXL Diagnostics and both the MAP and SBA results. Of course, we don’t expect the results to match up perfectly (i.e. a correlation of 1.0), but the relatively high correlations suggest that growth measured via the IXL Diagnostics should also be observed in MAP and SBA results. More concretely: if we are closing education gaps measured via IXL (as we are here), it looks like we will also be closing gaps as measured by standardized tests.


Mike Preiner

PhD in Applied Physics from Stanford. Data scientist and entrepreneur. Working to close education gaps in public schools.