Are there good ways to measure learning *within* a school year?

Mike Preiner
5 min read · Mar 4, 2021

In a set of previous posts I described an experiment we’re running to try to close educational gaps at a local Seattle school. In this post I want to dig into a really important piece of that experiment: how will we know if it worked? In other words, how will we measure our actual impact on how much the students are learning?

This raises the general (and often fraught) question of assessing students. I won't wade into the debate about standardized tests; instead, I'll focus on what attributes we'd want in a system for measuring the impact of a program like ours. It turns out there are three basic factors to consider: accuracy, granularity, and ease of use.

I’ll flesh out these three dimensions below, and then show how the specific assessment system we’re using at Lowell Elementary (the Diagnostics module from IXL) stacks up on the key metrics.

The takeaway for those who aren’t interested in the details: IXL has a pretty good diagnostic system that seems to do well on all three factors.

Accuracy/Repeatability

Fundamentally, of course, the assessment has to actually work. There are two key pieces of “actually working”:

Repeatability: if we run the same assessment on the same student in the same conditions, do we get the same result? Of course, nothing is ever perfectly repeatable, and in future posts we’ll spend some time discussing the amount of noise in our assessments. The repeatability/noise level will set the smallest impact size we can reliably measure.
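To make the link between noise and measurable impact concrete, here is a back-of-the-envelope sketch. The noise figure (0.2 grade levels per assessment) and class size are placeholder assumptions, not measured values; the formula is the standard one for a confidence interval on a mean, assuming independent, normally distributed measurement errors.

```python
import math

def min_detectable_effect(noise_sd, n_students, z=1.96):
    """Smallest true change (in grade levels) that can be distinguished
    from measurement noise at ~95% confidence, assuming independent,
    normally distributed errors averaged over n_students."""
    return z * noise_sd / math.sqrt(n_students)

# Hypothetical numbers: 0.2 grade levels of per-assessment noise,
# averaged across a class of 25 students.
print(round(min_detectable_effect(0.2, 25), 3))  # → 0.078
```

In other words, with those assumed numbers, a class-wide gain would need to exceed roughly 0.08 grade levels before we could attribute it to the program rather than to noise.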

Accuracy: even if a system is repeatable, it isn’t necessarily accurate. For our purposes, we’ll equate accuracy with the ability to predict whether students can successfully meet specific, well-defined math standards.

Granularity

To illustrate the importance of granularity, let’s consider two different assessments:

Assessment A tells us that a student is in the Xth percentile in math for their grade. This could be useful for deciding whether the student is behind or not, but that is about all we can do with the results.

Assessment B, in addition to telling us the student is in the Xth percentile overall, also tells us that the student is in the Yth percentile for fractions, Zth percentile for geometry, etc. This data can be used to create a specific action plan to address particular skills, and thus is much more useful than Assessment A.

Ease of Use

As with most things in life, when all else is equal, the easier the better. There are three key types of “easy” that we are interested in:

Easy to interpret: assessment results are often surprisingly difficult to understand. For example, scores are often reported in arbitrary scale scores and/or levels. As an example, this report shows just how convoluted the reporting is for the Smarter Balanced Assessment (SBA), one of the assessments that I’ve often used for analyzing WA state data. It is not a good sign if the assessment requires specialized training just to understand the results!

Easy to administer: teachers are busy, and some assessments can take a lot of effort to run. For example, one of the screeners we used this winter at Lowell took 20–60 minutes of teacher time per student.

Easy to take: in addition to the teachers, assessments can be a big burden on the students, especially when set up in a “high stakes” environment. We’d like students to spend most of their time learning, not taking assessments.

Case Study: IXL Diagnostics

Now that we have a blueprint for evaluating assessment systems, let’s take a look at what we are using at Lowell Elementary: IXL Diagnostics.

I’ll run through our blueprint in reverse order, along with a rough grade for each section.

Ease of Use: A. The IXL results are quite easy to interpret: all the assessment data is given in simple grade levels (more on this below). It is also fairly easy to administer, assuming the school is already using IXL. No additional tools are required; the teacher simply needs to have the student log in and run through the adaptive diagnostic tests. Finally (and perhaps most importantly), it is easy to take and easy to keep up to date. An initial assessment takes 20–40 minutes and can be done independently by the student, with very little teacher time. After that, results can be updated with an additional 5–10 minutes of assessment time from the student.

The low barrier for updates creates the possibility of high frequency (such as weekly) assessment data, which could be extremely helpful when trying to quickly evaluate the effectiveness of different programs. At Lowell we are now assessing our students on a weekly basis.

Granularity: A. The results from IXL are broken down in enough detail to be useful for topic-specific planning. In addition to an overall “grade level”, the results are also broken down into 6 topic areas, shown below for one of our students.

Diagnostic results for a 4th grader in our tutoring program at Lowell Elementary. At the time of the assessment, the student was ~1 grade level behind overall, but there is significant variability by topic area. For example, the student was at a 5th grade level on Algebraic Thinking, but only a 1st grade level on Geometry. These results qualitatively matched my 1:1 observations.

Repeatability/Accuracy: TBD. This area is more challenging to assess, since 1) we only started gathering data in late January and 2) there will not be any statewide standardized tests to compare against for quite a while, since Seattle Public Schools has cancelled all assessments for the year.

That being said, there are several promising signs. The clearest is the correlation between my observations and the diagnostic results. For example, it was clear that Geometry was challenging for “Student A”; he didn’t know the difference between a rectangle and a square. On the other hand, he was a whiz at multiplication. More quantitatively, we can also see clear relationships between the content we are teaching and the student assessments.

In early February we started practicing fractions in Student A’s 1:1 sessions, and in mid-February his class started fractions. The chart below shows a clear increase in skill that matches when we started practicing: a good sign! He has gone up over one full grade level in this topic since we started. The data also gives us a feel for the magnitude of the noise in the measurement; on any given day the measurement could vary by 0.1–0.3 grade levels. This highlights the importance of a regular series of measurements over time, which lets us average out the noise.

“Fractions” score for Student A over time. In early February we started practicing fractions during our tutoring sessions, and in mid-February he started studying fractions in class.

**Update**: since I first published this post, we have had a chance to do a comparison with standardized tests. You can see the results here.

What does it all mean?

To sum it all up, we’ve ended up implementing a pretty robust system to gauge student learning throughout the rest of the year. This means we are in a good position to quantitatively measure the impact our current tutoring program is (or isn’t) having on students! In future posts we’ll go deeper into what we are learning from that data.

Mike Preiner

PhD in Applied Physics from Stanford. Data scientist and entrepreneur. Working to close education gaps in public schools.