Building a school-based graduation model for Washington: Part 1
After walking through the academic performance model for Washington state, I thought it would be interesting to go through a similar process for graduation rates. Just like our academic model, outliers of a purely demographic model are probably signs of particularly good or bad management. We’ve also seen that graduation rates can be driven by different factors than standardized test scores, so it will be interesting to see what variables will work best in a predictive model. Finally, it will be useful to see how well purely demographic factors can predict graduation rates. Let’s dive in!
As usual, let’s start with the data. We’ll be using data from several different datasets, merged together at the school-level:
- 2017–18 OSPI graduation data and 12th grade OSPI enrollment data
- 2016–17 OSPI assessment data for 11th graders (the same students graduating in 2017–18)
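For readers who want to try a similar merge, here is a minimal sketch using pandas. Everything in it is an assumption made for illustration: the column names (`school_id`, `grad_rate`, `n_students_12`, `math_sba`, `ela_sba`) and the handful of rows are made up and are not the real OSPI schema or data.

```python
import pandas as pd

# Tiny made-up tables standing in for the OSPI files. All column names
# and values are hypothetical placeholders, not real OSPI data.
graduation = pd.DataFrame({
    "school_id": [101, 102, 103],
    "grad_rate": [0.92, 0.71, 0.85],
})
enrollment = pd.DataFrame({
    "school_id": [101, 102, 103],
    "n_students_12": [310, 45, 180],
})
assessment = pd.DataFrame({
    "school_id": [101, 103, 104],
    "math_sba": [0.48, 0.39, 0.55],
    "ela_sba": [0.74, 0.66, 0.80],
})

# Inner-join graduation and enrollment: both are needed for every school.
# validate= raises an error if a school appears more than once in either table.
schools = graduation.merge(
    enrollment, on="school_id", how="inner", validate="one_to_one"
)

# Left-join the assessment scores so that schools without 11th grade scores
# are kept; their score columns become NaN instead of the row being dropped.
schools = schools.merge(
    assessment, on="school_id", how="left", validate="one_to_one"
)

print(schools)
```

Using a left join for the assessment data is a design choice in this sketch: it keeps schools that have graduation data but no test scores, which matters if the scores are only used as optional extra variables.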
Let’s start by looking at how the different demographic variables correlate with one another, using the correlation heatmap shown below.
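The numbers behind a heatmap like this are just a pairwise correlation matrix. Here is a rough sketch of how one could be computed with pandas, assuming a school-level table with hypothetical column names and randomly generated stand-in values (so the correlations it prints say nothing about real Washington schools).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: column names are hypothetical and the values are
# random, with a negative low-income effect deliberately built in.
low_income = rng.uniform(0.05, 0.95, n)
schools = pd.DataFrame({
    "low_income_frac": low_income,
    "homeless_frac": np.clip(0.1 * low_income + rng.normal(0, 0.02, n), 0, 1),
    "n_students_12": rng.integers(10, 500, n),
})
schools["grad_rate"] = np.clip(
    0.95 - 0.3 * low_income + rng.normal(0, 0.05, n), 0, 1
)

# Pairwise Pearson correlations between all numeric columns.
corr = schools.corr()

# Correlation of each variable with graduation rate, most negative first.
grad_corr = corr["grad_rate"].drop("grad_rate").sort_values()
print(grad_corr)
```

The `corr` matrix is what a plotting function such as seaborn's `heatmap` would take as input. Categorical columns like school type would need to be excluded or encoded first.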
There are a couple of caveats on this plot. The first is that I’ve included three variables (math and English SBA scores from the graduating class, along with the number of students in the grade) that technically aren’t demographic. However, the relationship between SBA scores and graduation rates is one that we’ll want to explore in more depth at some point as we start to investigate specific ways to improve graduation rates.
The second caveat is that an important variable (school type) doesn’t show up here because it is a categorical variable and thus doesn’t have a natural place in a correlation plot. However, to illustrate its importance I’ve plotted graduation rate vs. the low income fraction and colored the schools by school type.
Finally, I want to note the extremely high correlation of Mobile Fraction with graduation rate. This isn’t surprising, since highly mobile students are defined as those attending school fewer than 150 days during the year, so anyone who drops out will also likely be highly mobile.
Ok, now let’s look at how our variables get used in a model!
We’ll start by examining the variable importance plot (leaving out the SBA scores) we get from a gradient-boosted regression model. We see that school type is by far the most important variable.
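As an illustration of where a variable importance ranking comes from, here is a minimal sketch using scikit-learn's `GradientBoostingRegressor` as a stand-in (the post doesn't say which library or settings were actually used). The data is synthetic, the column names and the "Alternative" school type label are hypothetical, and a large school type effect is deliberately built in, so the output only demonstrates the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data. A large school-type effect is built in on purpose,
# so the resulting ranking is illustrative and not evidence of anything.
school_type = rng.choice(["Standard", "Alternative"], size=n, p=[0.8, 0.2])
low_income = rng.uniform(0.05, 0.95, n)
homeless = rng.uniform(0.0, 0.15, n)
grad_rate = (
    0.95
    - 0.45 * (school_type == "Alternative")
    - 0.10 * low_income
    - 0.30 * homeless
    + rng.normal(0, 0.03, n)
)
schools = pd.DataFrame({
    "school_type": school_type,
    "low_income_frac": low_income,
    "homeless_frac": homeless,
    "grad_rate": grad_rate,
})

# One-hot encode the categorical school type so the model can use it.
X = pd.get_dummies(
    schools[["school_type", "low_income_frac", "homeless_frac"]],
    columns=["school_type"],
    drop_first=True,
).astype(float)
y = schools["grad_rate"]

model = GradientBoostingRegressor(random_state=0)
model.fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importance = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importance)
```

With more than two school types, the one-hot encoding spreads the school type importance across several columns, so those would need to be summed to get a single number for the variable.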
Let’s remove the effect of school type by just looking at standard public schools: the results are shown below. A couple of things stick out when we do this:
- The homeless fraction is the most important variable, just beating out the low-income fraction.
- The number of students enrolled in 12th grade at a school is a surprisingly good predictor: small schools seem to have much higher drop-out rates.
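One simple way to sanity-check the school size observation outside of the model is to restrict the table to standard schools and compare average graduation rates across class size bins. The sketch below does this on synthetic data, assuming hypothetical column names, arbitrarily chosen bin edges, and a made-up penalty for very small schools, so it shows only the approach.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Synthetic stand-in data: column names, school type labels, and the built-in
# penalty for very small senior classes are all assumptions for illustration.
school_type = rng.choice(["Standard", "Alternative"], size=n, p=[0.8, 0.2])
n_students = rng.integers(5, 600, n)
grad_rate = np.clip(
    0.90 - 0.15 * (n_students < 50) + rng.normal(0, 0.04, n), 0, 1
)
schools = pd.DataFrame({
    "school_type": school_type,
    "n_students_12": n_students,
    "grad_rate": grad_rate,
})

# Remove the effect of school type by keeping only standard schools.
standard = schools[schools["school_type"] == "Standard"]

# Bin schools by 12th grade enrollment; the bin edges are arbitrary choices.
size_bins = pd.cut(
    standard["n_students_12"],
    bins=[0, 50, 150, 300, np.inf],
    labels=["<50", "50-149", "150-299", "300+"],
    right=False,
)

# Mean graduation rate and number of schools in each size bin.
by_size = standard.groupby(size_bins, observed=True)["grad_rate"].agg(
    ["mean", "count"]
)
print(by_size)
```

The `count` column is worth keeping alongside the mean, since the smallest bin often contains few schools and its average can be noisy.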
What does it all mean?
The biggest takeaway here is that predicting graduation rates is more complicated than predicting 5th grade SBA scores: there are many more contributing factors than just low-income status, which isn’t that surprising in retrospect.
We’ll take a look at how well we can actually predict school graduation rates next week.