Building a model of school academic performance in Washington: Part 1

Mike Preiner
May 29, 2020

In this post I’m going to walk through the beginning of a “school performance” model. Broadly speaking, we’ll be looking at how well we can predict student outcomes based purely on demographics. Why would we want to do that? There are two main reasons:

  • After we factor in all of the demographic factors, we can see which schools have “unusual” results. The schools with outcomes that cannot be explained by simple demographics may have particularly good (or bad) teaching practices or interesting environmental factors.
  • Predictive models can often help in sorting through many correlated variables to figure out which ones are the direct drivers of outcomes. Knowing which demographic factors are causally connected to outcomes would clearly be helpful in creating support programs. We’ll cover this in more detail as we go.

Before we start with the modeling, let’s take a look at the data we have available. In this post, we’ll be predicting 2018–19 school-level, 5th grade math SBA scores. I’ve plotted a correlation heatmap of the various demographic factors below. Each square on the chart shows the correlation coefficient of the corresponding row and column. Positive correlations are colored red and negative correlations are blue.

A plot of the correlations between school-level demographic variables and the percent of students meeting the math SBA standard across the 1084 schools in WA that reported scores for 5th graders in 2018–19.
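As a concrete sketch, here’s how such a correlation matrix can be computed with pandas. The data here is synthetic, and the column names and generating relationships are hypothetical stand-ins for the real dataset, which isn’t reproduced in this post:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1084  # number of WA schools reporting 5th-grade scores in 2018-19

# Hypothetical school-level data: one row per school.
df = pd.DataFrame({"Low-Income Fraction": rng.uniform(0.05, 0.95, n)})
df["ELL Fraction"] = np.clip(
    0.4 * df["Low-Income Fraction"] + rng.normal(0, 0.08, n), 0, 1)
df["Math SBA Pass Rate"] = np.clip(
    0.85 - 0.6 * df["Low-Income Fraction"] + rng.normal(0, 0.06, n), 0, 1)

# Pearson correlation matrix: the rows/columns of the heatmap.
corr = df.corr()
print(corr.round(2))
```

Passing `corr` to a diverging red/blue colormap (e.g. seaborn’s `heatmap` with `cmap="RdBu_r"`) reproduces the style of chart shown above.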

There are two key things to notice in this data. The first is that almost all demographic variables are correlated with math scores; the only exception is Military Parent Fraction. The second is that almost all of the demographic variables are correlated with each other. Neither of these is particularly surprising; both match most of the existing research in this area.

At this point there are two common paths in deciding which variables to include in our model. We could 1) use all of the variables in a reasonably robust machine learning model, or 2) do a careful variable-by-variable analysis to decide what to keep as a model input. For completeness, we’ll do both!

Let’s start with Path #1.

We’ll do a quick analysis of all of the variables using a standard machine learning algorithm: gradient-boosted regression. One of the very nice things about this methodology is that it can give a quantitative evaluation of how important each variable is in the resulting predictive model. Of course, this raises the question of what exactly we mean by variable importance. In short, it is a combination of 1) how much more accurate the variable makes the model when it is used, and 2) how often the variable is used in the model. As an example, let’s pretend that schools with high numbers of military parents did better on the SBA exam than other schools, but only a couple of schools had high numbers; the rest had no military parents. In this case, “Military Parent Fraction” would be an important variable when used, but wouldn’t get used very often, so it wouldn’t have an extremely high importance.
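Here is a minimal sketch of this kind of importance analysis using scikit-learn’s gradient boosting on synthetic data. The variable names and effect sizes are hypothetical, chosen so that one variable drives scores while a correlated second variable mostly just tags along:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 1084

# Synthetic demographics: low income drives scores; ELL is correlated
# with low income; military fraction is pure noise.
low_income = rng.uniform(0, 1, n)
ell = np.clip(0.4 * low_income + rng.normal(0, 0.1, n), 0, 1)
military = rng.uniform(0, 0.1, n)
scores = 0.85 - 0.6 * low_income + rng.normal(0, 0.05, n)

X = np.column_stack([low_income, ell, military])
model = GradientBoostingRegressor(random_state=0).fit(X, scores)

# Impurity-based importances combine (1) accuracy gained when a variable
# is used and (2) how often it is used; they are normalized to sum to 1.
for name, imp in zip(["Low-Income", "ELL", "Military"],
                     model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

On data generated this way, the driving variable dominates the importance scores even though ELL is correlated with both it and the outcome, which is exactly the behavior we’re counting on below.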

Given that definition, what does a variable importance analysis of our Washington schools show? I’ve made a chart below, and there is one key takeaway: despite all the correlations with other variables, Low-Income Fraction is more than 10 times more important than any other variable in predicting SBA scores. This is because it varies widely from school to school, has a very large effect on test scores, and after it is taken into account not much else matters.

Now let’s go down Path #2 and do a sanity check using a very simple model.

We’ll model math SBA scores with a very simple polynomial (quadratic) model that depends on only one variable: Low-Income Fraction. The model is shown below, along with the data used to fit it.

The percent of students passing the math portion of the SBA as a function of school Low-Income Fraction. Dot size is proportional to the number of students in each school. The black line shows the results of a quadratic fit.
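A quadratic fit like this is a one-liner with NumPy. The coefficients and data below are synthetic stand-ins, not the post’s actual fit; since dot size in the chart above reflects enrollment, the real fit may also pass per-school student counts via `np.polyfit`’s `w` weighting argument:

```python
import numpy as np

rng = np.random.default_rng(2)
low_income = rng.uniform(0, 1, 300)
# Synthetic pass rates with slight downward curvature.
pass_rate = (0.85 - 0.5 * low_income - 0.1 * low_income**2
             + rng.normal(0, 0.04, 300))

# Degree-2 polynomial fit, then a callable model for predictions.
coeffs = np.polyfit(low_income, pass_rate, deg=2)
predict = np.poly1d(coeffs)
print(predict(0.5))
```

The resulting `predict` plays the role of the black line in the figure: a smooth expected pass rate as a function of Low-Income Fraction.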

Next, we’ll look at how the other demographic variables correlate with SBA scores before and after using our simple model. The general idea with this approach is that some variables may be correlated with SBA scores mainly via their correlation with Low-Income Fraction (i.e. they are not direct causes themselves). If so, their correlation with SBA scores will decrease after our model accounts for Low-Income Fraction. The results for each demographic variable are shown below.

Chart showing the correlation of each demographic variable with math SBA scores. “Raw Data” results shown in blue are equivalent to those shown in the heatmap above, while the orange bars show correlations after adjusting for Low-Income Fraction.
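The before/after comparison amounts to correlating each variable with the *residuals* of the quadratic model. Here is a sketch on synthetic data; the two example variables and their generating relationships are hypothetical, built so that one has a direct effect on scores and the other is only correlated through Low-Income Fraction:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
low_income = rng.uniform(0, 1, n)
# ELL: correlated with low income AND has its own effect on scores.
ell = np.clip(0.4 * low_income + rng.normal(0, 0.1, n), 0, 1)
# "Asian Fraction" stand-in: correlated with low income, no direct effect.
asian = np.clip(0.3 - 0.2 * low_income + rng.normal(0, 0.1, n), 0, 1)
scores = 0.85 - 0.6 * low_income - 0.3 * ell + rng.normal(0, 0.05, n)

# Fit the quadratic low-income model, then take residuals.
fit = np.poly1d(np.polyfit(low_income, scores, 2))
resid = scores - fit(low_income)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print("ELL   raw/adjusted:", round(corr(ell, scores), 2), round(corr(ell, resid), 2))
print("Asian raw/adjusted:", round(corr(asian, scores), 2), round(corr(asian, resid), 2))
```

The variable with a direct effect keeps a sizeable adjusted correlation, while the purely indirect one collapses toward zero: the blue-vs-orange pattern in the chart above.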

Our simple model results broadly agree with our more sophisticated variable importance analysis, even though this method doesn’t take into account how often a variable is used.

Interestingly, after accounting for Low-Income Fraction, there are only four demographic variables that are particularly correlated (more than +/- 0.1) with SBA scores. Two of them are race related: Asian and American Indian/Alaska Native Fractions. The other two are special needs categories: Students with Disabilities and English Language Learners Fractions. The fact that Student Disability and ELL status remain important variables seems reasonable: it sure feels like they ought to still influence SBA scores even after controlling for income status.

To wrap up: this post outlines a first step toward building a school performance model that accounts for student demographics. We’ll look at how well our models actually perform in the next post.



Mike Preiner

PhD in Applied Physics from Stanford. Data scientist and entrepreneur. Working to close education gaps in public schools.