Does it matter who your tutor is?

Jun 10, 2021

If you’ve seen our previous posts, you know that we’re testing out a tutoring program at Lowell Elementary to see if we can quantify some of the factors that impact student learning.

In this post we’ll look deeper into our data to understand what drives the impact of individual tutors. One major barrier to making any educational intervention repeatably successful is that results usually depend on who is actually implementing the program. In our case, if we want to have confidence that we can increase student learning, we need to understand how our impacts vary by tutor.

Let’s start with a diagram of our program’s “theory of action”, shown below.

A four-stage theory of action model for our program. Relationship and mentoring (and fun!) drive attendance, particularly in a remote school environment. Attendance and relationship are the main drivers of practice, which in turn promotes math skill development.

We have year-end data for 2020–21 on each of these four components, and in this post we’ll dig into the first three. But first…

A Super Important Caveat

It is worth emphasizing that 3 of our 4 tutors have pretty small numbers of students (10–13 each). This means that we always need to be concerned about whether any differences we see are simply due to random differences in the students the tutors work with instead of differences from the tutors themselves. Instead of boring you with p-values and confidence intervals, I’ll focus on some common-sense sanity checks and relevant anecdotal data to justify our confidence in any conclusions, but we can never completely rule out this issue. With that in mind, let’s look at our tutors!

Tutor Background

For the rest of this post, we’re going to show a lot of data broken out by tutor. To interpret the results, it’s helpful to have a quick understanding of the different tutor backgrounds:

Tutor A is a professional math interventionist who works full time at Lowell. This year they typically ran “small groups” of 4–8 students at a time, used IXL for math practice, and worked with around 30 students.
Tutors B, C, and D were new tutors with no formal previous experience. They ran either 1:1 or 1:2 sessions with students, used a combination of Khan Academy and IXL, and each worked with 10–13 students. Tutor B started in October 2020 and Tutors C and D started in February 2021.

Relationships (and Fun!)

This is the most weakly instrumented part of our program. To get an understanding of how students felt about the program, we asked them two simple questions near the end of the year:

How do you feel about your practice sessions with your tutor/coach? (1 = hate it, 5 = love it)
Do you feel like you are learning *more math* because of your time with your tutor/coach? (1 = not at all, 5 = heck yeah)

The results are shown below. The only dramatic difference between tutors was in the response rate. No results are shown for Tutor D because they didn’t have any responses. Unfortunately, based on these results, it doesn’t look like there is a lot we can learn from our survey data, other than that Tutors C and D had less student engagement, i.e., a lower response rate. In the future (based on some of our findings described below), we’ll explore more detailed questions, such as “how motivated is your tutor to help you succeed?”.

Survey results by tutor. Tutor D had no student responses.

Attendance

There were clear differences in attendance between tutors. For example, attendance for Tutors C and D was about half that of Tutor B. It seems likely that this wasn’t simply due to different student groups. For example, we know that Tutors C and D weren’t as proactive as Tutors A and B about managing attendance by working with teachers and parents (despite being actively coached about it). During interviews, some teachers mentioned that students had stopped going to sessions with C/D, and they were surprised that C/D never reached out to them for support.

Tutor A had “in between” attendance, which matches expectations given that they did active outreach to teachers and parents but had too many students to do it completely effectively.

Our attendance data also gives us a nice chance to test out the question: “to what extent are differences between students driving the different results for tutors?” It turns out that Tutor A and Tutor C shared four students. Because those four students are the same for both tutors, comparing their attendance with each tutor gives us greater confidence that any differences come from the tutors themselves. The results are shown below. The data clearly supports the conclusion that Tutor A was more effective at driving attendance, and that the differences shown above are not simply from the students themselves.

Attendance for Tutor A and Tutor C for the 4 students that were shared between them.
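For readers who want to try the same shared-student check on their own data, here is a minimal sketch of the idea. Everything in it is an assumption for illustration: the record layout, the student IDs, the function name paired_attendance_gaps, and especially the numbers, which are made-up placeholders and not our actual attendance data. The point is only to show the comparison: restrict to students seen by both tutors, then look at each student's attendance rate with one tutor minus their rate with the other.

```python
# Hypothetical attendance records:
# (student, tutor, sessions_attended, sessions_scheduled).
# Placeholder values for illustration only, NOT the program's real data.
records = [
    ("s1", "A", 8, 10), ("s1", "C", 4, 10),
    ("s2", "A", 6, 10), ("s2", "C", 3, 10),
    ("s3", "A", 9, 10), ("s3", "C", 5, 10),
    ("s4", "A", 7, 10), ("s4", "C", 7, 10),
    ("s5", "A", 5, 10),  # only worked with Tutor A, so excluded below
]


def paired_attendance_gaps(records, tutor_x, tutor_y):
    """Return {student: rate_x - rate_y} for students seen by both tutors."""
    rates = {}
    for student, tutor, attended, scheduled in records:
        if scheduled > 0:  # skip rows with no scheduled sessions
            rates[(student, tutor)] = attended / scheduled

    students_x = {s for (s, t) in rates if t == tutor_x}
    students_y = {s for (s, t) in rates if t == tutor_y}
    shared = students_x & students_y  # only students both tutors worked with

    return {s: rates[(s, tutor_x)] - rates[(s, tutor_y)] for s in sorted(shared)}


gaps = paired_attendance_gaps(records, "A", "C")
for student, gap in gaps.items():
    print(f"{student}: attendance rate with A minus rate with C = {gap:+.2f}")
```

If most of the per-student gaps point the same way, the difference is harder to explain by "the tutors just had different students". With only four shared students this is still a sanity check rather than proof.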

Finally, it is worth noting that the large differences in attendance were not predicted by the survey data. This could be due to sample bias (the students who didn’t like the sessions didn’t answer the survey because they didn’t show up), or it could be that survey results just aren’t strongly related to actual attendance.

Practice

When we look at student practice, we see even more extreme differences between tutors. This is important, because we expect practice time to be more directly linked to learning than attendance. To start, we’ll look at average practice per student (below, in blue). We see that Tutors A, C, and D have much lower per-student practice levels than Tutor B. However, it is important to remember that Tutor A has a much larger caseload: 31 students! To get a better idea of the total tutor impact, we also show the total practice time (summed over all students) in orange. As with attendance, Tutors C and D are significantly less impactful than A and B. In particular, Tutor D has by far the least practice; remember, this is the tutor who didn’t get any of their students to take the survey. It is reasonable to suspect that those are linked.

It is interesting to see that Tutors A and B have similar amounts of total practice, despite having very different numbers of students. This suggests that there may be clear impact tradeoffs. For example: do we want to help a few students a lot, or a bunch of students a little bit?
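The two views of practice described above (average per student versus total across all students) are easy to compute side by side. Here is a rough sketch, assuming a simple log of (tutor, student, minutes) rows. The log format, the function name practice_summary, and all of the numbers are hypothetical placeholders, not our real practice data.

```python
from collections import defaultdict

# Hypothetical practice log: (tutor, student, minutes_practiced).
# Placeholder values for illustration only, NOT the program's real data.
practice_log = [
    ("A", "s1", 30), ("A", "s2", 30), ("A", "s3", 30), ("A", "s4", 30),
    ("B", "s5", 40), ("B", "s5", 20), ("B", "s6", 60),
]


def practice_summary(log):
    """Return per-tutor student count, total minutes, and mean minutes per student."""
    minutes = defaultdict(lambda: defaultdict(float))
    for tutor, student, mins in log:
        minutes[tutor][student] += mins  # add up repeat sessions per student

    summary = {}
    for tutor, per_student in minutes.items():
        total = sum(per_student.values())
        summary[tutor] = {
            "students": len(per_student),
            "total": total,
            "mean_per_student": total / len(per_student),
        }
    return summary


summary = practice_summary(practice_log)
for tutor, stats in sorted(summary.items()):
    print(tutor, stats)
```

In this made-up example both tutors end up with the same total, but one spreads it across twice as many students, which is exactly the kind of tradeoff the two metrics are meant to expose. One caveat with this sketch: a student with no logged practice never appears in the log, so they would need an explicit zero-minute row to be counted in the per-student average.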

What does it all mean?

Based on all the evidence, it seems pretty clear that the large variation in student practice (and to some extent, attendance) was driven at least in part by differences in tutors. I’ve already mentioned some anecdotal evidence, but we had a lot of other indicators of why Tutors C/D weren’t as effective as A/B. For example, they didn’t use many of the student motivators that A/B were using (such as goal setting and competition), which is the topic of a future post.

The key point here is that we have a system that can quickly identify and quantify tutor effectiveness: within a few weeks of their start date we had a clear signal that practice from Tutors C and D was lower than expected. This means that in the future we can create better ways to screen tutors and quickly react when we don’t see the results we’re expecting.

In other words, we’ve learned a lot about how to measure tutor effectiveness. And going forward we’ll be able to use this information to have an even bigger impact on students.