# Building a model of school academic performance in Washington: Part 1

In this post I’m going to walk through the beginning of a “school performance” model. Broadly speaking, we’ll be looking at how well we can *predict* student outcomes based purely on demographics. Why would we want to do that? There are two main reasons:

- After we factor in all of the demographic factors, we can see which schools have “unusual” results. The schools with outcomes that cannot be explained by simple demographics may have particularly good (or bad) teaching practices or interesting environmental factors.
- Predictive models can often help in sorting through many correlated variables to figure out which ones are the direct drivers of outcomes. Knowing which demographic factors are *causally* connected to outcomes would clearly be helpful in creating support programs. We’ll cover this in more detail as we go.

Before we start with the modeling, let’s take a look at the data we have available. In this post, we’ll be predicting 2018–19 school-level, 5th grade math SBA scores. I’ve plotted a correlation heatmap of the various demographic factors below. Each square on the chart shows the correlation coefficient of the corresponding row and column. Positive correlations are colored red and negative correlations are blue.
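For readers who want to build this kind of correlation matrix themselves, here is a minimal sketch. It runs on synthetic stand-in data, not the real Washington numbers, and the variable names (`low_income_fraction`, `ell_fraction`, `math_score`) and the relationships between them are assumptions made up for illustration.

```python
import numpy as np

# Synthetic stand-in data: NOT the real Washington school data.
rng = np.random.default_rng(0)
n_schools = 500

low_income = rng.uniform(0.0, 1.0, n_schools)
# Hypothetical second variable, constructed to be correlated with low_income.
ell = np.clip(0.3 * low_income + rng.normal(0.0, 0.05, n_schools), 0.0, 1.0)
# Hypothetical score that falls as low_income rises.
math_score = 80.0 - 40.0 * low_income + rng.normal(0.0, 5.0, n_schools)

names = ["low_income_fraction", "ell_fraction", "math_score"]
data = np.column_stack([low_income, ell, math_score])

# rowvar=False treats each column as one variable, so corr[i, j] is the
# Pearson correlation between variable i and variable j.
corr = np.corrcoef(data, rowvar=False)

for name, row in zip(names, corr):
    print(f"{name:>20}", np.round(row, 2))
```

Each row of the printed matrix corresponds to one row of squares in a heatmap; a plotting library can then color the cells by sign and magnitude.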

There are two key things to notice in this data. The first is that **almost all demographic variables are correlated with math scores.** The only exception is Military Parent Fraction. The second is that **almost all of the demographic variables are correlated with each other.** Neither of these is particularly surprising; both match most of the existing research in this area.

At this point there are two common paths in deciding which variables to include in our model. We could 1) use all of the variables in a reasonably robust machine learning model, or 2) do a careful variable-by-variable analysis to decide what to keep as a model input. For completeness, we’ll do both!

**Let’s start with Path #1.**

We’ll do a quick analysis of all of the variables using a standard machine learning algorithm: gradient-boosted regression. One of the very nice things about this methodology is that it can give a quantitative evaluation of how important each variable is in the resulting predictive model. Of course, this raises the question of what exactly we mean by *variable importance*. In short, it is a combination of 1) how much more accurate the variable makes the model when it is used, and 2) how often the variable is used in the model. As an example, let’s pretend that schools with high numbers of military parents did better on the SBA exam than other schools, but only a couple of schools had high numbers; the rest had no military parents. In this case, “Military Parent Fraction” would be an important variable when used, but wouldn’t get used very often, so it wouldn’t have an extremely high importance.
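The pretend military-parent scenario above can be sketched in code. This is only an illustration under stated assumptions: the data is synthetic, the effect sizes are invented, and I’m using scikit-learn’s `GradientBoostingRegressor`, whose `feature_importances_` attribute is an impurity-based measure normalized to sum to 1. That may not be the exact importance measure behind the chart in this post.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data: NOT the real Washington school data.
rng = np.random.default_rng(0)
n_schools = 1000

low_income = rng.uniform(0.0, 1.0, n_schools)
# Only about 2% of schools have a high military-parent fraction; the rest have none.
military = np.where(rng.random(n_schools) < 0.02, 0.5, 0.0)
# A pure-noise column, unrelated to the score by construction.
noise_feature = rng.normal(0.0, 1.0, n_schools)

# Hypothetical score: strong dependence on low_income, plus a sizable boost
# for the handful of military-heavy schools.
score = (
    80.0
    - 40.0 * low_income
    + 20.0 * military
    + rng.normal(0.0, 3.0, n_schools)
)

feature_names = ["low_income_fraction", "military_parent_fraction", "noise"]
X = np.column_stack([low_income, military, noise_feature])

model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
model.fit(X, score)

# Impurity-based importances: each split contributes its error reduction,
# so a variable scores high if it helps a lot AND is used for many schools.
importances = dict(zip(feature_names, model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name:>26}: {value:.3f}")
```

In this toy setup the military variable shifts scores by a full 10 points where it applies, but because it applies to so few schools, it accounts for little of the overall error reduction compared with the low-income variable.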

Given that definition, what does a variable importance analysis of our Washington schools show? I’ve made a chart below, and there is one key takeaway: **despite all the correlations with other variables, Low-Income Fraction is more than 10 times more important than any other variable in predicting SBA scores.** This is because it varies widely from school to school, has a very large effect on test scores, and after it is taken into account not much else matters.

**Now let’s go down Path #2 and do a sanity check using a very simple model.**

We’ll model math SBA scores with a very simple polynomial (quadratic) model that only depends on one variable: Low-Income Fraction. The model is shown below, along with the actual data that built it.
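A one-variable quadratic fit like this takes only a few lines. The sketch below uses `numpy.polyfit` on synthetic data; the curve shape and noise level are assumptions for illustration and are not the fitted values from the real model.

```python
import numpy as np

# Synthetic stand-in data: NOT the real Washington school data.
rng = np.random.default_rng(0)
n_schools = 400
low_income = rng.uniform(0.0, 1.0, n_schools)
# Hypothetical curved relationship plus noise.
score = (
    75.0
    - 30.0 * low_income
    - 15.0 * low_income**2
    + rng.normal(0.0, 4.0, n_schools)
)

# Least-squares fit of score ~ a*x^2 + b*x + c.
# polyfit returns coefficients with the highest degree first: [a, b, c].
coeffs = np.polyfit(low_income, score, deg=2)
predicted = np.polyval(coeffs, low_income)

# R^2: the share of score variance explained by the fitted curve.
ss_res = np.sum((score - predicted) ** 2)
ss_tot = np.sum((score - score.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("coefficients [a, b, c]:", np.round(coeffs, 2))
print("R^2:", round(float(r_squared), 3))
```

The difference `score - predicted` is each school’s residual: how far it lands above or below what its Low-Income Fraction alone would predict.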

Next, we’ll look at how the other demographic variables correlate with SBA scores before and after using our simple model. *The general idea with this approach is that some variables may be correlated with SBA scores mainly via their correlation with Low-Income Fraction (i.e. they are not direct causes themselves).* If so, their correlation with SBA scores will decrease after our model accounts for Low-Income Fraction. The results for each demographic variable are shown below.
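Here is a rough sketch of that before-and-after comparison, assuming “after” means correlating each variable with the residuals of the quadratic model. The data is synthetic, and the two variables (`proxy` and `direct`) are hypothetical: one is built to matter only through its link to Low-Income Fraction, the other to have its own effect on scores.

```python
import numpy as np

# Synthetic stand-in data: NOT the real Washington school data.
rng = np.random.default_rng(0)
n_schools = 2000
low_income = rng.uniform(0.0, 1.0, n_schools)

# Hypothetical "proxy" variable: tracks low_income, no effect of its own.
proxy = low_income + rng.normal(0.0, 0.2, n_schools)
# Hypothetical "direct" variable: independent of low_income, own effect on score.
direct = rng.uniform(0.0, 0.3, n_schools)

score = (
    80.0
    - 40.0 * low_income
    - 30.0 * direct
    + rng.normal(0.0, 3.0, n_schools)
)

# Simple model: quadratic in low_income only.
coeffs = np.polyfit(low_income, score, deg=2)
residual = score - np.polyval(coeffs, low_income)


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


results = {}
for name, values in [("proxy", proxy), ("direct", direct)]:
    before = pearson(values, score)     # correlation with raw scores
    after = pearson(values, residual)   # correlation after removing the model's prediction
    results[name] = (before, after)
    print(f"{name:>6}: before={before:+.2f}  after={after:+.2f}")
```

In this toy setup the proxy’s correlation collapses toward zero once Low-Income Fraction is accounted for, while the direct variable’s correlation does not shrink.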

Our simple model results broadly agree with our more sophisticated variable importance analysis, even though this method doesn’t take into account how often a variable is used.

Interestingly, *after* accounting for Low-Income Fraction, there are only 4 demographic variables that are still noticeably correlated with SBA scores (a correlation coefficient above 0.1 in magnitude). Two of them are race-related: Asian and American Indian/Alaska Native Fractions. The other two are special needs categories: Students with Disabilities and English Language Learners Fractions. The fact that Student Disability and ELL status become important variables seems reasonable: it sure feels like *they ought* to still influence SBA scores after controlling for income status.

To wrap it up, this post outlines a first step towards building a school performance model that accounts for student demographics. We’ll look at how well our models actually perform in our next post.