What is Educational and Psychological Measurement Like?

Educational and psychological measurement is like lots of things. In introductory textbooks, it’s compared to physical measurement – rulers for measuring length or floor scales for measuring weight. Another popular analogy is shooting at a target. Picture Robin Hood splitting the Sheriff of Nottingham’s arrow, like it’s no big deal, to claim the bullseye – that’s accurate measurement.

Throwing away the thermometer

Sometimes testing, as the embodiment of educational and psychological measurement, is compared to instruments used in medical settings. Here’s a case where the analogy is used in defense of college admission testing (Roorda, 2019).

It’s inappropriate to blame admissions testing for inequities in society. We don’t fire the doctor or throw away the thermometer when an illness has been diagnosed. Test scores as well as high school grades expose issues that need to be fixed.

This analogy is simple and relatable, and mostly OK, but it suggests that tests are as precise as thermometers, and that test constructs like math proficiency can be observed and quantified via test questions as readily as body temperature can via controlled physical changes. They can’t. Testing is less reliable, and in some cases it may be closer to mercury rising in astrology than in a vacuum tube. I say we throw away the thermometer analogy, or at least put an asterisk on it.

Blood oximeters

Sticking with medical testing, educational and psychological tests are like blood oximeters, instruments used to measure oxygen saturation in the blood. Oxygen saturation is an indicator of respiratory health, like math achievement is an indicator of college readiness. Neither is a perfect measure of the target construct, but they’re both useful.

Oximeters come in a variety of shapes and sizes, employing different technologies that vary in cost and complexity, much like standardized tests. And, as with standardized tests, reliability and accuracy depend on the instrument. The simplest instrument – called the pulse ox – was widely used during the coronavirus pandemic, even though it is known to produce biased results for people of color (Moran-Thomas, 2020; Sjoding et al., 2020), and this was despite the availability of less biased but more complicated alternatives (Moran-Thomas, 2021).

The traditional, ultra-standardized, multiple-choice test is a lot like the pulse ox, developed – for convenience and efficiency – based on a majority group of test takers without fully considering the unique needs of underserved and minoritized students. Our research and industry standards have improved over time, especially since the 1990s, and this has led to less biased tests with comparable predictive validity across groups. So, the pulse ox might be an outdated comparison. But we still prefer simpler testing methods over more expensive and contextualized ones, and we’re still considering what it means to test with equity in mind.

Just do it

Let’s move from tests themselves to the testing industry, which gets us into testing policy.

Koljatic et al. (2021) compare the testing industry to the sporting apparel industry. Focusing on Nike in the 1990s, they argue that we, like Nike, need to accept more responsibility with respect to the social impacts of our products. I countered (Albano, 2021) that, unlike the apparel industry, we make products for clients according to their specifications. In that respect, our tests are doing what they’re supposed to do – inform fair comparisons among test takers. Really, what needs to change is education policy on test use. The problem for industry, if we extend Koljatic et al.’s reasoning, is that it isn’t doing enough to influence policy. Simply put, industry would need to say no when clients ask for tests that don’t promote equity.

Saying no wouldn’t solve our problems, absent other policy or tools to fill the void, but I think it’s the only conclusion considering what Koljatic and critics are really asking for. How can we make selection tests less like tests used for selection, and more like tests not used for selection, while still having systems that require selection? By not testing, I guess.

In my 2021 article, I tweaked the Nike analogy a bit.

The company recently released a new shoe that can be put on and taken off hands-free, extending their lineup of more accessible footwear (Newcomb, 2021). This innovation is regarded as a major step forward, so to speak, in inclusive and individualized design (Patrick & Hollenbeck, 2021). However, concerns have been raised about accessibility in terms of high cost and limited availability (Weaver, 2021). We can compare to admission testing in a variety of ways, but this example highlights at the very least the need for a more comprehensive consideration of accessibility.

Admission tests, like other large-scale assessments, have historically been inaccessible to students, by design, until the moment of administration. Integration with K12 assessment systems would provide significantly more access and richer data for admission decisions (e.g., Kurlaender et al., 2020), and testing innovations promise measurement that is more individualized and engaging (The Gordon Commission on the Future of Assessment in Education, 2013). Yet, despite these advances, our products will still be largely inaccessible outside controlled conditions, like inclusively designed shoes that can only be rented or worn on certain occasions and under supervision. Our vision should be to distribute full ownership of the product itself.

More on social responsibility

The idea of social responsibility is intriguing. Can the measurement industry be more involved in promoting positive outcomes? Controversies from two other US industries can shed some light here.

Testing resembles the pharmaceutical industry, where standardized tests are like drugs. In both cases, the product can take years and great expense to develop. Both target practical issues faced by lots of people – for example, ulcerative colitis or pandemic learning loss. Both are designed in laboratory settings. And the countless – sometimes absurd – side effects make us question whether the potential benefits are worth the costs and risks. Drug makers have been found partly responsible for the opioid epidemic because they misrepresented risk and overpromised on results (Haffajee & Mello, 2017). Critics would say we do the same with testing.

We can also learn about social responsibility, and the lack thereof, from social media companies. It looks like Facebook, now Meta, hid what they knew about the harms of Instagram for young people (Gayle, 2021). TikTok, owned by the company ByteDance, is considered a threat to US national security because of how it collects and manages user data (Treisman, 2022). Obviously, nobody is consuming standardized tests like they do algorithmically curated photo and video content. Few people love standardized tests, whereas everyone loves cats chasing lasers. But Meta and ByteDance, like College Board and Smarter Balanced, are making products that have positive and negative impacts depending on their use. And it’s not out of bounds to expect that companies study the negative impacts, share what they know, and contribute to more positive consequences.

As with drugs and social media, I don’t think standardized testing is going away. I recommend that the testing industry relinquish some secrecy and security and move toward more transparency and free public access to test content, data, and results (Albano, 2021).

References

Albano, A. D. (2021). Commentary: Social responsibility in college admissions requires a reimagining of standardized testing. Educational Measurement: Issues and Practice, 40, 49-52.

Gayle, D. (2021). Facebook aware of Instagram’s harmful effect on teenage girls, leak reveals. The Guardian. Retrieved from https://www.theguardian.com/technology/2021/sep/14/facebook-aware-instagram-harmful-effect-teenage-girls-leak-reveals.

Haffajee, R. L., & Mello, M. M. (2017). Drug companies’ liability for the opioid epidemic. The New England Journal of Medicine, 377(24), 2301–2305.

Koljatic, M., Silva, M., & Sireci, S. G. (2021). College admission tests and social responsibility. Educational Measurement: Issues and Practice, 40(4), 22-27.

Moran-Thomas, A. (2020). How a popular medical device encodes racial bias. Boston Review. Retrieved from http://bostonreview.net/science-nature-race/amy-moran-thomas-how-popular-medical-device-encodes-racial-bias.

Moran-Thomas, A. (2021). Oximeters used to be designed for equity. What happened? Wired. Retrieved from https://www.wired.com/story/pulse-oximeters-equity/.

Randall, J., Slomp, D., Poe, M., & Oliveri, M. E. (2022). Disrupting white supremacy in assessment: Toward a justice-oriented, antiracist validity framework. Educational Assessment, 27(2), 170-178.

Roorda, M. (2019). Comment on X. Retrieved from https://x.com/MartenRoorda/status/1204465574111105024.

Sjoding, M. W., Dickson, R. P., Iwashyna, T. J., Gay, S. E., & Valley, T. S. (2020). Racial bias in pulse oximetry measurement. New England Journal of Medicine, 383, 2477-2478.

Treisman, R. (2022). The FBI alleges TikTok poses national security concerns. NPR. Retrieved from https://www.npr.org/2022/11/17/1137155540/fbi-tiktok-national-security-concerns-china.