Bursting the Bubble Sheet: NC DPI’s Disingenuous Claims on K-12 Testing Data Part 1

Preface: Over a decade ago Rhett Carlson, a high school science teacher, reverse-engineered EVAAS – a K-12 test data translation tool created by NC-based analytics company SAS and used by the NC Department of Public Instruction. The program is modeled after one used in agriculture to increase plant growth and cull dairy herds, and in North Carolina it is used to assess public school student performance, gauge staff effectiveness, and create school report card grades. After Rhett completed his analysis, he met with NC Department of Public Instruction official Tom Tomberlin, who reportedly confirmed Rhett’s methodology and findings. Fellow science teacher Jason Wolfe was also in that meeting.

About 2 years ago, Rhett & Jason generously shared their time and expertise with me. This series of explainers is long overdue. During the delay, NC data company SAS produced “SAS EVAAS Statistical Models and Business Rules” in 2023 for the NC Department of Public Instruction. Frankly, Rhett & Jason’s reverse-engineered description of how EVAAS works, shared over lemonades at a Starbucks, explained it better than SAS did in that 50-page report. I hope to extend the same clarity on these data-reporting dynamics through this “Bursting the Bubble Sheet” series of posts. Thank you, Rhett & Jason.

What are scale scores?

A scale score is a translation of the number of questions a student answered correctly on a test (the raw score). If a student answered 35 questions correctly on a 40-question test, the scale score may be the same as the raw score (35) or could be translated into any other number (e.g. 135, 486, 345…anything goes).
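To make that concrete, here’s a minimal Python sketch of what such a translation can look like. The linear mapping, the 50-question length, and the 527-575 range (borrowed from the Math 3 example discussed below) are assumptions for illustration only; DPI’s actual raw-to-scale conversion tables are set separately for each test form.

```python
# A minimal, hypothetical sketch of a raw-to-scale translation.
# The linear mapping and the 527-575 range are assumptions for illustration;
# DPI publishes the actual conversion tables separately for each test form.

def raw_to_scale(raw_score, raw_max=50, scale_min=527, scale_max=575):
    """Map a raw score (questions answered correctly) onto a scale score range."""
    return round(scale_min + (raw_score / raw_max) * (scale_max - scale_min))

# A student who answers 35 of 50 questions correctly lands at roughly:
print(raw_to_scale(35))  # ~561 on the assumed 527-575 scale
```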

How are scale scores used in DPI testing reports?

(Author’s note: This section has been modified from the original version to remove a segment that contained cheeky speculation and focus only on evidence-based observations.)

Scale scores in DPI testing reports function essentially the same as raw scores: they indicate the number of questions a student answered correctly, just with a different number attached. There is essentially a 1:1 ratio of raw score performance to scale score assignment.

Here’s an example of the scale scores parents see in score reports for their child’s performance. This one is for Math 3. Notice the scale score range is 527-575 for a test with 50 operational questions.

Here’s another example from a Grade 4 Reading EOG score report with a scale range of 517-568 for a test with 48 operational questions.

In addition to substantiating the near 1:1 correlation between raw score performance and scale score representation, these summaries are also instructive in showing how students receive the oversimplified Level 1, 2, 3, 4, or 5 label.

Notice the narrow 3-question range required for a student to reach a Level 5 designation on this Grade 4 Reading EOG, and contrast it with the 12-question range needed to reach a Level 5 label on the high school Math 3 report.

Notice how across score reports for all grades and subjects, only Levels 4 & 5 ‘earn’ the “On track for Career-and-College Readiness” label. 
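For readers who think in code, here’s a small sketch of how that labeling step works in principle: a fine-grained scale score gets collapsed into one of five Levels by comparing it against cut scores, and only the top two Levels get the readiness label. The cut scores and the “Not on track” wording below are hypothetical placeholders, not DPI’s published values.

```python
# Hypothetical sketch of collapsing a scale score into a Level label.
# These cut scores are made-up placeholders, not DPI's actual cuts.
ASSUMED_CUTS = [(552, 5), (545, 4), (538, 3), (531, 2)]  # (minimum scale score, Level)

def scale_to_level(scale_score):
    for cut, level in ASSUMED_CUTS:
        if scale_score >= cut:
            return level
    return 1  # anything below the lowest cut is Level 1

def report_label(scale_score):
    # Only Levels 4 and 5 receive the readiness label on score reports.
    level = scale_to_level(scale_score)
    ready = "On track for Career-and-College Readiness" if level >= 4 else "Not on track"
    return f"Level {level}: {ready}"

print(report_label(553))  # Level 5: On track for Career-and-College Readiness
print(report_label(544))  # Level 3: Not on track
```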

In a future post in this series I’ll follow up on the consistent way the data translation is designed to “award” this label to only 25-30% (the top quartile) of students. Keep in touch.

Why does this matter?

Part of why scale score reporting matters, particularly when there’s a near 1:1 ratio between scale score range and the number of operational questions, is that it unnecessarily complicates data in a way that makes it less understandable for families, educators, and the general public.

Whereas SAS doubles down on the aura of complexity with its EVAAS explainer, the Bursting the Bubble Sheet series on this blog aims to demystify the data with user-friendly explanations and primary-source evidence that empower follow-up questions about how student testing data is collected and translated.

Here’s an example from a recent State Board of Education meeting presentation with a SAS graph showing performance on 3rd grade reading End-of-Grade tests:

Credit: SAS / NC Department Of Public Instruction slide 20

At face value, it appears that third grade reading levels are significantly declining, and saw a major drop during the pandemic. But notice the y-axis with a range from 537-540.

Now that you understand scale scores, you can see this indicates basically a 4-question range on a 40-question grade 3 reading test. It’s an incomplete y-axis.

The incomplete y-axis takes advantage of most folks’ misunderstanding of SAS/DPI scale scores and fosters this faulty notion that students are experiencing a crisis of reading skills by zooming in to only 10% of possible scale scores. 

Instead, what this graph actually shows is that over a ten-year period, NC 3rd graders on average answered about 1.5 fewer questions correctly on their 40-question Grade 3 Reading EOG. This assumes the test hasn’t changed in that 10-year period. Though the y-axis doesn’t label a scale score low enough to reach the 2021 low point, it appears to be around 536, indicating that after two school years touched by the pandemic, students answered around 3 fewer questions correctly out of 40 than they did in 2019.

In 2022 they answered around 2 more questions correctly than they did in 2021. In other words, they answered about 1 fewer question correctly on average than their pre-pandemic peers did in 2019.
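Here’s the back-of-the-envelope arithmetic behind those estimates, as a quick sketch. It leans on the near 1:1 scale-point-to-question correspondence described above, and the yearly averages are my approximations read off the SAS graph, so treat every input as a rough estimate.

```python
# Rough conversion from scale-score points to "questions answered correctly,"
# assuming ~1 scale point per question on the 40-question Grade 3 Reading EOG.
points_per_question = 1.0          # assumed from the near 1:1 correspondence

avg_2019 = 539.0                   # approximated from the SAS graph
avg_2021 = 536.0                   # approximate pandemic-era low point
avg_2022 = 538.0                   # approximated from the SAS graph

print((avg_2019 - avg_2021) / points_per_question)  # ~3 fewer questions correct in 2021
print((avg_2022 - avg_2021) / points_per_question)  # ~2 questions recovered by 2022
print((avg_2019 - avg_2022) / points_per_question)  # ~1 fewer than pre-pandemic peers
```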

For the record, my son was in 3rd grade in 2021, lest anyone think I’m somehow out of touch with how that school year affected third graders. Folks may range from hair-on-fire to no-big-deal over a 3-question performance difference on a 40-question reading test. However, the graph as displayed above tips the scale on the face value of the data by literally not giving folks the whole picture, resulting in a visually exaggerated trend line.

I obtained the full scale score range for that Grade 3 Reading test from this NC DPI report and approximated the average performance data points to the nearest tenth based on the SAS graph presented. Using that data I recreated the graph with a proper y-axis showing the full scale score range. Compare the data in the graph I assembled with proper context below…

…with the selective SAS-created graph presented to the State Board of Education:
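If you’d like to reproduce the comparison yourself, here’s a minimal matplotlib sketch of the same idea: the same data plotted twice, once inside the presentation’s narrow window and once against the full scale-score range. The yearly averages and the full range below are stand-in approximations, not the exact figures from the DPI report.

```python
# Sketch: the same (approximate) data with a truncated vs. full y-axis.
import matplotlib.pyplot as plt

years = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2021, 2022]  # 2020 omitted here for simplicity
avgs  = [539.5, 539.4, 539.3, 539.2, 539.1, 539.0, 539.0, 536.0, 538.0]  # approximations

fig, (zoomed, full) = plt.subplots(1, 2, figsize=(10, 4))

zoomed.plot(years, avgs, marker="o")
zoomed.set_ylim(536, 540)            # the presentation's narrow window
zoomed.set_title("Truncated y-axis (as presented)")

full.plot(years, avgs, marker="o")
full.set_ylim(517, 557)              # assumed full scale-score range (~40 points)
full.set_title("Full scale-score range")

for ax in (zoomed, full):
    ax.set_xlabel("Year")
    ax.set_ylabel("Average scale score")

plt.tight_layout()
plt.show()
```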

More than numbers

In December 2020, I wrote this in a commentary for Cardinal & Pine as NC was gearing up to resume standardized testing:

“Listen carefully. They want to see how much kids don’t know.  This is key to understanding the current push to issue these tests and the overall motivation in nurturing a standardized testing culture. It fosters the false narrative that public schools aren’t good enough and snipes at students along the way. 

It’s not a coincidence that schools of “choice” such as private schools are exempt from these tests. The powers pushing to privatize education aren’t interested in gathering data that could knock those schools off their undeserved pedestals.”

That was the moment when it fully clicked for me that the purpose of standardized tests was more of a glass-half-empty approach, showing kids’ deficits instead of their successes. This revelation upset me as a parent and as a teacher. The manner in which this testing data is translated and reported is as problematic as the poor quality of the tests themselves.

Stay tuned for continued posts in this Bursting the Bubble Sheet series that will unpack additional testing data components and put them in a more complete context. Hopefully a better understanding of these concepts will offer insight into the score reports for the kids you care about, and into how those reports are used to portray NC public school, student, and staff performance.

5 thoughts on “Bursting the Bubble Sheet: NC DPI’s Disingenuous Claims on K-12 Testing Data Part 1”


  1. Our testing coordinator has informed us we have to administer the NC Check-Ins because “they” will use the students’ scores, or the questions they do well on, to create their EOG. I felt this must have been a misunderstanding of some sort, as it seems outlandish to me. Now I wonder if the EOG will purposely not include questions the student answered well and will instead set the student up with only questions that were challenging, in order to ensure the scholar will not perform as well.


    1. And somehow, students will get different levels of questions, yet still be scored on the same scale. Can a kid miss 5 questions on the “hard” test and get the same score as a kid who missed 1 question on the “easy” test? Is it better for kids to do poorly on the Check-Ins so they get the easier EOG? There are SOOOO many questions and issues with this setup!

