Building a model of school academic performance in Washington: Part 2

Mike Preiner
4 min read · Jun 4, 2020


In our last post, we spent some time quantifying the importance of various demographic effects in predicting 5th grade school outcomes.

In this post, we’ll quantify the extent to which we can predict school outcomes from purely demographic factors. Or to put it another way: to what extent do teachers, principals, and other “management” matter?

Throughout this post, we’ll compare two pretty different models:

The mean model: the simplest possible model, just the average SBA score across all schools. It doesn’t require any input variables, which makes it easy to use. It also provides a useful benchmark: if a more complicated model doesn’t beat the mean model by much, we may be wasting our time with our fancy algorithms!

A gradient boosted regression (GBR) model: a more sophisticated, decision-tree-based model that can handle a wide variety of linear, non-linear, and categorical variables as inputs. It is one of the workhorses of machine learning, and a good starting point for seeing how much additional data sources can improve your predictions.
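To make that concrete, here’s a minimal sketch of how the two models might be set up with scikit-learn. The file name and feature columns are placeholders, not the real predictor set.

```python
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

# Hypothetical school-level data; the column names are illustrative.
schools = pd.read_csv("wa_5th_grade_sba.csv")
X = schools[["low_income_frac", "ell_frac", "sped_frac"]]
y = schools["sba_l1"]

# The mean model: always predicts the average score, ignoring X.
mean_model = DummyRegressor(strategy="mean")

# The GBR model: an ensemble of shallow decision trees.
gbr_model = GradientBoostingRegressor(n_estimators=200, max_depth=3)

# Out-of-sample predictions via cross-validation keep the comparison fair.
mean_preds = cross_val_predict(mean_model, X, y, cv=5)
gbr_preds = cross_val_predict(gbr_model, X, y, cv=5)
```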

Since we covered the predictor variables we’ll be using for the GBR model in the last post, let’s jump in and see how the models compare!

Let’s start by simply comparing model predictions to the actual 5th grade outcomes for each school in Washington. We’ll be using our standard “SBA L1” metric, which combines both the math and English portions and is normalized so that 1.0 corresponds to 100% of students passing both. The results are shown below. As expected, the mean model only has one predicted value for all schools in Washington :)

Actual SBA L1 values vs. predictions for both the mean model and the gradient boosted regression model for all 1084 schools with 5th grade SBA results in 2018–19.
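As an aside, the combined metric itself is simple to compute. A hedged sketch: I’m assuming here that “SBA L1” is the average of the math and English pass fractions, which is consistent with 1.0 corresponding to 100% of students passing both; see the earlier posts for the exact definition.

```python
def sba_l1(math_pass_frac, ela_pass_frac):
    # Assumed definition: mean of the two pass fractions, so that
    # 1.0 means 100% of students passed both sections.
    return (math_pass_frac + ela_pass_frac) / 2
```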

It turns out that it is usually useful to look at the residuals of the models: the differences between what we observe and what we predict. I’ll use the term residuals interchangeably with model error, although there are some minor differences. I’ve plotted the residuals of the models against school low-income fraction below. Not surprisingly, since the mean model doesn’t account for low-income fraction, we can see that its residuals depend strongly on low-income fraction. The GBR residuals don’t depend on low-income fraction, since the model explicitly factors income into its predictions.

Plot of model residuals vs. low-income fraction for school-level 5th grade SBA results in Washington.
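A plot along these lines could be produced as follows, reusing the predictions from the earlier sketch (matplotlib details kept to the basics):

```python
import matplotlib.pyplot as plt

# Residuals: observed minus predicted, one value per school.
schools["mean_resid"] = y - mean_preds
schools["gbr_resid"] = y - gbr_preds

fig, ax = plt.subplots()
ax.scatter(schools["low_income_frac"], schools["mean_resid"],
           alpha=0.3, label="Mean model")
ax.scatter(schools["low_income_frac"], schools["gbr_resid"],
           alpha=0.3, label="GBR model")
ax.axhline(0, color="gray", linewidth=1)
ax.set_xlabel("Low-income fraction")
ax.set_ylabel("Residual (actual - predicted SBA L1)")
ax.legend()
plt.show()
```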

Below, we compare the model residuals directly. We can see that the GBR model is quite a bit more accurate than the mean model; almost all of the GBR residuals are within +/- 0.2, while the mean model residuals extend to +/- 0.4. It looks like the fancy model is providing some extra predictive power.

Histogram of model residuals, binned by the number of schools.
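The histogram comparison is just as short; a sketch, again reusing the residual columns from above:

```python
import numpy as np

fig, ax = plt.subplots()
bins = np.linspace(-0.4, 0.4, 41)
ax.hist(schools["mean_resid"], bins=bins, alpha=0.5, label="Mean model")
ax.hist(schools["gbr_resid"], bins=bins, alpha=0.5, label="GBR model")
ax.set_xlabel("Residual (SBA L1)")
ax.set_ylabel("Number of schools")
ax.legend()
plt.show()
```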

Finally, let’s quantify our model accuracy:

Mean Model

  • Median absolute error: 0.12
  • R²: 0.0

GBR Model

  • Median absolute error: 0.05
  • R²: 0.82

The median absolute error (MAE) is just what it sounds like: the 50th percentile of the absolute model error. We can see that the GBR model is about 2.5x more accurate than the mean model, with an MAE of 0.05. That means we can predict the typical school’s SBA scores to within 5%, which seems pretty good given that we are only using demographic factors.

The other metric we’ll use is R², which describes the portion of SBA variance explained by the model. Since the mean model is constant, it doesn’t explain any variability, and thus has an R² of exactly 0.0. The GBR model (with all of its predictor variables) naturally does better, with a fairly high R² of 0.82. To put that in perspective: if demographics explain ~80% of the variation in test scores, the remaining ~20% is what’s left for everything else, and school management is likely the biggest piece of that remainder.
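Both metrics fall out of a few lines of numpy and scikit-learn; a sketch, reusing the predictions from the earlier snippet:

```python
import numpy as np
from sklearn.metrics import r2_score

for name, preds in [("Mean model", mean_preds), ("GBR model", gbr_preds)]:
    mae = np.median(np.abs(y - preds))  # median absolute error
    r2 = r2_score(y, preds)             # fraction of variance explained
    print(f"{name}: median absolute error = {mae:.2f}, R² = {r2:.2f}")
```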

All that being said, it should be noted that our plot of school residuals shows that there are some real outlier schools. Some schools are significantly outperforming (or underperforming) their demographics: for example, there are 4 schools that score over 20% higher on the SBA than we’d expect. A couple of questions I’m very interested in answering are “what are those schools doing differently?” and “what happens if we try some of their practices at other schools?”
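Pulling out those outliers is a one-line filter once the residuals are in hand; a sketch (the school_name column is a placeholder):

```python
# Schools scoring more than 0.2 above their demographic prediction.
outperformers = schools[schools["gbr_resid"] > 0.2]
print(outperformers[["school_name", "sba_l1", "gbr_resid"]])
```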

What does it all mean?

Our first pass at predicting SBA scores at the school level highlights a few things:

  • Demographics are important. They can explain more than 80% of the variation in 5th grade test scores between schools in Washington.
  • There are some schools where management seems to really matter. Based on what I’ve learned from talking to principals, I don’t think it is just noise. We’ll dig more into this in the future.
  • This type of modeling is what I’d expect the beginning of a school-performance scorecard to look like. Even if management (i.e. school quality) accounts for only ~20% of the variation in student outcomes today, this seems like a natural area of focus because it is one of the most controllable sources of variation. And of course, the first step to managing something is to measure it :)
