Can Educational and Psychological Testing be Equitable?

As you might expect, the answer to this question is: sometimes. Whether testing can be equitable depends on what we count as a test and how we define equity.

In the past few years, we’ve seen a big swell of interest in equity, social justice, and antiracism in educational measurement. Two articles that I reference and share often are Sireci (2020), which encourages us to unstandardize our tests as much as possible (Sireci calls it understandardization), and Randall (2021), which shows how traditional construct development (and thus test development) is too narrow and White-centric to support equitable outcomes. I think the discussion is taking us in the right direction, but we’re also going in circles on some key points, including whether and how educational tests can be equitable.

If the measurement literature is a river – not a fantastic analogy, but let’s try it – then ambiguous terms are like eddies, swirling water that defies the current and slows our understanding such that we can end up writing past each other. Equity is arguably the most popular term lately for describing our goals for educational improvement – we see it everywhere, from mission statements to conference themes – yet it is often left up to interpretation. Articles in a recent special issue of Applied Measurement in Education focusing on equity in assessment (2023, volume 36, issue 3) use the term throughout, but never simply define it. The Standards for Educational and Psychological Testing (AERA, APA, NCME, 2014) describes features of testing (e.g., effects, access, treatment of participants) as equitable or inequitable, but again without a clear definition.

Equity just means parity or equality in outcomes across groups. It’s not complicated. Maybe authors take for granted that their readers have this fundamental understanding, or maybe they’re keeping the literature waterways open and a little swirly to promote discussion? Either way, we have a definition. If equity is equality of outcomes across groups, then equitable testing is simply testing that shows equal outcomes, and making tests more equitable means designing them to produce results that don’t differ for groups of test takers.

Side note – the Standards (2014, p. 54) interpret fairness in a way that does not require “equality of testing outcomes for relevant test-taker subgroups.” That’s equity; they just don’t identify it as such.

Extra side note – there’s lots of writing on culturally responsive and sustaining assessment (e.g., Shultz & Englert, 2023). I see this as overlapping with but not the same as equitable testing.

The second term to nail down is testing. Most of us probably think of testing as standardized and large-scale, designed for lots of people. And most of our standardized large-scale tests are used to compare test takers either to one another (e.g., rank ordering when selecting for admission or a scholarship) or to some reference point on our score scale (e.g., performance standards of “meets expectations” or “gets a driver’s license”). Testing also includes smaller-scale and less formal or less standardized measures used in classrooms, clinics, or employment settings.

Putting the terms together, equal outcomes in testing really only make sense for certain kinds of tests. The purpose or intended use determines whether a test can be designed intentionally for equity. Standardized large-scale tests intended to compare results across groups can’t also be designed to reduce differences between groups because the two purposes conflict. Whatever the context, even outside of education and psychology, an instrument can’t indicate and influence results at the same time. However, if we aren’t constrained by comparison, we can design tests however we like, including with content and methods focused on elevating specific groups of test takers.

Proponents of antiracist and socially just educational measurement might argue that testing has traditionally benefitted White/majority groups of test takers – that we’ve only pretended that testing was a fair indicator in the past, when in actuality it was always influencing results. By this argument, since both purposes coexisted before, though one of them covertly and perhaps unintentionally, they should also coexist now, especially in situations where comparative testing leads to adverse impact (e.g., in college admission testing or licensure testing). This kind of argument applies to other restorative policies like affirmative action, but it doesn’t really apply to comparative testing, if only because the purpose of a comparative test might be to evaluate the results of something like affirmative action. Social justice isn’t served by tests that mask social injustice.

Now that I’ve typed this all out – putting on my snorkel and goggles, if you will – I see that the conflict really comes from treating equal outcomes as our objective and as our main criterion for valid measurement. The measurement water gets turbulent when we consider equity alongside validity. I’ll have to come back to this later.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Randall, J. (2021). “Color‐neutral” is not a thing: Redefining construct definition and representation through a justice‐oriented critical antiracist lens. Educational Measurement: Issues and Practice, 40(4), 82-90.

Shultz, P. K., & Englert, K. (2023). The promise of assessments that advance social justice: An indigenous example. Applied Measurement in Education, 36(3), 255-268.

Sireci, S. G. (2020). Standardization and UNDERSTANDardization in educational assessment. Educational Measurement: Issues and Practice, 39(3), 100-105.