Do Standardized Tests Benefit from Inequality?

I’m reading Wayne Au’s (2023) Unequal by Design: High-Stakes Testing and the Standardization of Inequality, a short (140 pages) overview of how our capitalist education system in the US perpetuates inequities, using testing to turn students into commodities. It’s dramatic at times – Au sets the stage with testing as a monster to be slain (e.g., p. xii) – but I’ve been looking for a good summary of the Marxist, critical-theory, anti-testing perspective, and this seems to fit.

While slayers of testing often oversimplify and misconstrue their enemy, I was surprised to see this basic distortion of test design (p. 78, emphasis in original).

At the root of this is the fact that all standardized tests are designed to produce what’s called a “bell curve” – what test makers think of as a “normal distribution” of test scores (and intelligence) across the human population. In a bell curve most students get average test scores (the “norm”), with smaller numbers of students getting lower or higher scores. If you look at this graphically, you would see it as a bell shape, where the students getting average scores make up the majority – or the hump – of the curve (Weber, 2015, 2016). Standardized tests are considered “good” or valid if they produce this kind of bell curve, and the data from all of them, even standard’s based exams [sic], are “scaled” to this shape (Tan & Michel, 2011). Indeed, this issue is the reason I titled this book, Unequal By Design, because at their core – baked into the very assumptions at the heart of their construction – standardized tests are designed to produce inequality.

I think I know what Au is getting at here. Tests designed for norm referencing (e.g., selection, ranking, prediction) are optimized when there is variability in scores – if everyone does well or everyone does poorly on the test, scores are bunched up, and it’s harder to make comparisons. So, technically, norm referencing does benefit from inequality.

But this ignores the simple fact that state accountability testing, which is the main focus of the book, isn’t designed solely for normative comparisons. In fact, the primary use of state testing is comparison to performance standards. Norms can also be applied, but, since they aren’t designed for comparison among test takers, state tests aren’t tied to inequality in results. Au distinguishes between norm and criterion referencing earlier in the book (p. 10) but not here, when it really matters.

Au gives three references here, none of which support the claim that test scores must be bell shaped to be valid. Tan and Michel (2011) is an explainer from ETS that says nothing about transforming to a curve. It’s an overview of scaling and equating, which are used to put scores from different test forms onto a common scale for reporting purposes. The Weber references (2015, 2016) are two blog posts that also don’t prove that scores have to be normally distributed, with variation, to be valid. The posts do show lots of example score distributions that are roughly bell shaped, but this is to demonstrate how performance standards can be moved around to produce different pass rates even though the shapes of score distributions don’t change.

Weber (2015) makes the same mistake that Au did, uncritically referencing Tan and Michel (2011) as evidence that test developers intentionally craft bell curves.

After grading, items are converted from raw scores to scale scores; here’s a neat little policy brief from ETS on how and why that happens. Between the item construction, the item selection, and the scaling, the tests are all but guaranteed to yield bell-shaped distributions.

It’s true that tests designed for norm referencing will gravitate toward content, methods, and procedures that increase variability in scores, because higher variability improves precision in score comparisons. But this doesn’t guarantee a certain shape – uniform and skewed distributions could also work well – and, more importantly, the Tan and Michel reference doesn’t support this point at all.

I assume Au and Weber don’t have any good references here because there aren’t any. State tests aren’t “normalized” as Weber (2016) claims. Rescaling and equating aren’t normalizing. If they produce normal distributions, it’s probably because what state tests are measuring is actually normally distributed. Regardless, the best source of evidence for state testing having inequality “baked into the very assumptions at the heart of their construction” (Au, see above) would be the publicly available technical documentation on how the tests are actually constructed, and the book doesn’t go there.

References

Au, W. (2023). Unequal by design: High-stakes testing and the standardization of inequality. New York, NY: Routledge.

Tan, X., & Michel, R. (2011). Why do standardized testing programs report scaled scores? Why not just report the raw or percent-correct scores? ETS R&D Connections, 16. https://www.ets.org/Media/Research/pdf/RD_Connections16.pdf

Weber, M. (2015, September 25). Common core testing: Who’s the real “liar”? Jersey Jazzman. https://jerseyjazzman.blogspot.com/2015/09/common-core-testing-whos-real-liar.html

Weber, M. (2016, April 27). The PARCC silly season. Jersey Jazzman. https://jerseyjazzman.blogspot.com/2016/04/the-parcc-silly-season.html

Can Educational and Psychological Testing be Equitable?

As you might expect, the answer to this question is, sometimes. Equitable testing depends on what we consider tests, and how we define equity.

In the past few years, we’ve seen a big swell of interest in equity, social justice, and antiracism in educational measurement. Two articles that I reference and share often are Sireci (2020), which encourages us to unstandardize our tests as much as possible (Sireci calls it understandardization), and Randall (2021), which shows how traditional construct development (and thus test development) is too narrow and White-centric to support equitable outcomes. I think the discussion is taking us in the right direction, but we’re also going in circles on some key points, including how educational tests can be equitable or not.

If the measurement literature is a river – not a fantastic analogy, but let’s try it – then ambiguous terms are like eddies, swirling water that defies the current and slows our understanding, so that we end up writing past each other. Equity is arguably the most popular term lately for describing our goals for educational improvement – we see it everywhere, from mission statements to conference themes – yet it is often left up to interpretation. Articles in a recent special issue of Applied Measurement in Education focusing on equity in assessment (2023, volume 36, issue 3) use the term throughout, but never simply define it. The Standards for Educational and Psychological Testing (AERA, APA, NCME, 2014) describes features of testing (e.g., effects, access, treatment of participants) as equitable or inequitable, but again without a clear definition.

Equity just means parity or equality in outcomes across groups. It’s not complicated. Maybe authors take for granted that their readers have this fundamental understanding, or maybe they’re keeping the literature waterways open and a little swirly to promote discussion? Either way, we have a definition. If equity is equality of outcomes across groups, then equitable testing is simply testing that shows equal outcomes, and making tests more equitable means designing them to produce results that don’t differ for groups of test takers.

Side note – the Standards (2014, p. 54) interpret fairness in a way that does not require “equality of testing outcomes for relevant test-taker subgroups.” That’s equity, they just don’t identify it as such.

Extra side note – there’s lots of writing on culturally responsive and sustaining assessment (e.g., Shultz & Englert, 2023). I see this as overlapping with but not the same as equitable testing.

The second term to nail down is testing. Most of us probably think of testing as standardized and large-scale, designed for lots of people. And most of our standardized large-scale tests are used to compare test takers either to one another (e.g., rank ordering when selecting for admission or a scholarship) or to some reference point on our score scale (e.g., performance standards of “meets expectations” or “gets a driver’s license”). Testing also includes smaller-scale and less formal or less standardized measures used in classrooms, clinics, or employment settings.

Putting the terms together, equal outcomes in testing really only make sense for certain kinds of tests. The purpose or intended use determines whether a test can be designed intentionally for equity. Standardized large-scale tests intended to compare results across groups can’t also be designed to reduce differences between groups because the two purposes conflict. Whatever the context, even outside of education and psychology, an instrument can’t indicate and influence results at the same time. However, if we aren’t constrained by comparison, we can design tests however we like, including with content and methods focused on elevating specific groups of test takers.

Proponents of antiracist and socially just educational measurement might argue that testing has traditionally benefitted White/majority groups of test takers – that we’ve only pretended that testing was a fair indicator in the past, when in actuality it was always influencing results. Since both designs or purposes coexisted before, though one of them covertly and perhaps unintentionally, the argument goes, they should also coexist now, especially in situations where comparative testing leads to adverse impact (e.g., as in college admission testing or licensure testing). This kind of argument applies to other restorative policies like affirmative action, but it doesn’t really apply to comparative testing, if only because the purpose of a comparative test might be to evaluate the results of something like affirmative action. Social justice isn’t served by tests that mask social injustice.

Now that I’ve typed this all out – putting on my snorkel and goggles, if you will – I see that the conflict really comes from having equal outcomes as our objective, our main criterion for valid measurement. The measurement water gets turbulent when we consider equity alongside validity. I’ll have to come back to this later.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Randall, J. (2021). “Color‐neutral” is not a thing: Redefining construct definition and representation through a justice‐oriented critical antiracist lens. Educational Measurement: Issues and Practice, 40(4), 82-90.

Shultz, P. K., & Englert, K. (2023). The promise of assessments that advance social justice: An indigenous example. Applied Measurement in Education, 36(3), 255-268.

Sireci, S. G. (2020). Standardization and UNDERSTANDardization in educational assessment. Educational Measurement: Issues and Practice, 39(3), 100-105.

What is Educational and Psychological Measurement Like?

Educational and psychological measurement is like lots of things. In introductory textbooks, it’s compared to physical measurement – rulers for measuring length or floor scales for measuring weight. Another popular analogy is shooting at a target. Picture Robin Hood splitting the Sheriff of Nottingham’s arrow, like it’s no big deal, to claim the bullseye – that’s accurate measurement.

Throwing away the thermometer

Sometimes testing, as the embodiment of educational and psychological measurement, is compared to instruments used in medical settings. Here’s a case where the analogy is used in defense of college admission testing (Roorda, 2019).

It’s inappropriate to blame admissions testing for inequities in society. We don’t fire the doctor or throw away the thermometer when an illness has been diagnosed. Test scores as well as high school grades expose issues that need to be fixed.

This analogy is simple and relatable, and mostly OK, but it suggests that tests are as precise as thermometers, that constructs like math proficiency can be observed and quantified via test questions as well as body temperature can via thermal expansion in a sealed glass tube. They can’t. Testing is less reliable, and in some cases it may be closer to Mercury rising in astrology than to mercury rising in a thermometer. I say we throw away the thermometer analogy, or at least put an asterisk on it.

Blood oximeters

Sticking with medical testing, educational and psychological tests are like blood oximeters, instruments used to measure oxygen saturation in the blood. Oxygen saturation is an indicator of respiratory health, like math achievement is an indicator of college readiness. Neither is a perfect measure of the target construct, but they’re both useful.

Oximeters come in a variety of shapes and sizes, employing different technologies that vary in cost and complexity, much like standardized tests. And, as with standardized tests, reliability and accuracy depend on the instrument. The simplest instrument – called the pulse ox – was widely used during the coronavirus pandemic, even though it is known to produce biased results for people of color (Moran-Thomas, 2020; Sjoding et al., 2020), and this was despite the availability of less biased but more complicated alternatives (Moran-Thomas, 2021).

The traditional, ultra-standardized, multiple-choice test is a lot like the pulse ox, developed – for convenience and efficiency – based on a majority group of test takers without fully considering the unique needs of underserved and minoritized students. Our research and industry standards have improved over time, especially since the 1990s, and this has led to less biased tests with comparable predictive validity across groups. So, the pulse ox might be an outdated comparison. But we still prefer simpler testing methods over more expensive and contextualized ones, and we’re still considering what it means to test with equity in mind.

Just do it

Let’s move from tests themselves to the testing industry, which gets us into testing policy.

Koljatic et al. (2021) compare the testing industry to the sporting apparel industry. Focusing on Nike in the 1990s, they argue that we, like Nike, need to accept more responsibility with respect to the social impacts of our products. I countered (Albano, 2021) that, unlike the apparel industry, we make products for clients according to their specifications. In that respect, our tests are doing what they’re supposed to do – inform fair comparisons among test takers. Really, what needs to change is education policy on test use. The problem for industry, if we extend Koljatic et al.’s reasoning, is that it isn’t doing enough to influence policy. Simply put, industry would need to say no when clients ask for tests that don’t promote equity.

Saying no wouldn’t solve our problems, absent other policy or tools to fill the void, but I think it’s the only conclusion considering what Koljatic and critics are really asking for. How can we make selection tests less like tests used for selection, and more like tests not used for selection, while still having systems that require selection? By not testing, I guess.

In my 2021 article, I tweaked the Nike analogy a bit.

The company recently released a new shoe that can be put on and taken off hands-free, extending their lineup of more accessible footwear (Newcomb, 2021). This innovation is regarded as a major step forward, so to speak, in inclusive and individualized design (Patrick & Hollenbeck, 2021). However, concerns have been raised about accessibility in terms of high cost and limited availability (Weaver, 2021). We can compare to admission testing in a variety of ways, but this example highlights at the very least the need for a more comprehensive consideration of accessibility.

Admission tests, like other large-scale assessments, have historically been inaccessible to students, by design, until the moment of administration. Integration with K12 assessment systems would provide significantly more access and richer data for admission decisions (e.g., Kurlaender et al., 2020), and testing innovations promise measurement that is more individualized and engaging (The Gordon Commission on the Future of Assessment in Education, 2013). Yet, despite these advances, our products will still be largely inaccessible outside controlled conditions, like inclusively designed shoes that can only be rented or worn on certain occasions and under supervision. Our vision should be to distribute full ownership of the product itself.

More on social responsibility

The idea of social responsibility is intriguing. Can the measurement industry be more involved in promoting positive outcomes? Controversies from two other US industries can shed some light here.

Testing resembles the pharmaceutical industry, where standardized tests are like drugs. In both cases, the product can take years to develop and at great expense. Both target practical issues faced by lots of people – for example, ulcerative colitis or pandemic learning loss. Both are designed in laboratory settings. And the countless – sometimes absurd – side effects make us question whether the potential benefits are worth the costs and risks. Drug makers have been found partly responsible for the opioid epidemic because they misrepresented risk and overpromised on results (Haffajee & Mello, 2017). Critics would say we do the same with testing.

We can also learn about social responsibility, and the lack thereof, from social media companies. It looks like Facebook, now Meta, hid what they knew about the harms of Instagram for young people (Gayle, 2021). TikTok, owned by the company ByteDance, is considered a threat to US national security because of how it collects and manages user data (Treisman, 2022). Obviously, nobody is consuming standardized tests like they do algorithmically curated photo and video content. Few people love standardized tests, whereas everyone loves cats chasing lasers. But Meta and ByteDance, like College Board and Smarter Balanced, are making products that have positive and negative impacts depending on their use. And it’s not out of bounds to expect that companies study the negative impacts, share what they know, and contribute to more positive consequences.

As with drugs and social media, I don’t think standardized testing is going away. I recommend that the testing industry relinquish some secrecy and security and move toward more transparency and free public access to test content, data, and results (Albano, 2021).

References

Albano, A. D. (2021). Commentary: Social responsibility in college admissions requires a reimagining of standardized testing. Educational Measurement: Issues and Practice, 40, 49-52.

Gayle, D. (2021). Facebook aware of Instagram’s harmful effect on teenage girls, leak reveals. The Guardian. Retrieved from https://www.theguardian.com/technology/2021/sep/14/facebook-aware-instagram-harmful-effect-teenage-girls-leak-reveals.

Haffajee, R. L., & Mello, M. M. (2017). Drug companies’ liability for the opioid epidemic. The New England Journal of Medicine, 377(24), 2301–2305.

Koljatic, M., Silva, M., & Sireci, S. G. (2021). College admission tests and social responsibility. Educational Measurement: Issues and Practice, 40(4), 22-27.

Moran-Thomas, A. (2020). How a popular medical device encodes racial bias. Boston Review. Retrieved from http://bostonreview.net/science-nature-race/amy-moran-thomas-how-popular-medical-device-encodes-racial-bias

Moran-Thomas, A. (2021). Oximeters used to be designed for equity. What happened? Wired. Retrieved from https://www.wired.com/story/pulse-oximeters-equity/.

Randall, J., Slomp, D., Poe, M. & Oliveri, M. E. (2022). Disrupting white supremacy in assessment: Toward a justice-oriented, antiracist validity framework. Educational Assessment, 27(2), 170-178.

Roorda, M. (2019). Comment on X. Retrieved from https://x.com/MartenRoorda/status/1204465574111105024.

Sjoding, M. W., Dickson, R. P., Iwashyna, T. J., Gay, S. E., & Valley, T. S. (2020). Racial bias in pulse oximetry measurement. New England Journal of Medicine, 383, 2477-2478.

Treisman, R. (2022). The FBI alleges TikTok poses national security concerns. NPR. Retrieved from https://www.npr.org/2022/11/17/1137155540/fbi-tiktok-national-security-concerns-china.

Article on Intersectional DIF in Applied Measurement in Education

Brian French, Thao Thu Vo, and I recently (February 2024) published an open-access paper in Applied Measurement in Education on Traditional vs Intersectional DIF Analysis: Considerations and a Comparison Using State Testing Data.

https://doi.org/10.1080/08957347.2024.2311935

The paper extends research by Russell and colleagues (e.g., 2021) on intersectional differential item functioning (DIF).

Here’s our abstract.

Recent research has demonstrated an intersectional approach to the study of differential item functioning (DIF). This approach expands DIF to account for the interactions between what have traditionally been treated as separate grouping variables. In this paper, we compare traditional and intersectional DIF analyses using data from a state testing program (nearly 20,000 students in grade 11, math, science, English language arts). We extend previous research on intersectional DIF by employing field test data (embedded within operational forms) and by comparing methods that were adjusted for an increase in Type I error (Mantel-Haenszel and logistic regression). Intersectional analysis flagged more items for DIF compared with traditional methods, even when controlling for the increased number of statistical tests. We discuss implications for state testing programs and consider how intersectionality can be applied in future DIF research.

We refer to intersectional DIF as DIF with interaction effects, partly to highlight the methodology – which builds on traditional DIF as an analysis of main effects – and to distinguish it as one piece of a larger intersectional perspective on the item response process. We don’t get into the ecology of item responding (Zumbo et al., 2015), but that’s the idea – traditional DIF just scratches the surface.
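To make the distinction concrete, here’s a toy sketch of logistic regression DIF with main effects only versus with an interaction between two grouping variables. The data and variable names are made up (this is not the code or data from our paper), and the DIF is built entirely into the intersection, so the main-effect models see little while the interaction model picks it up.

# Toy sketch of traditional (main effects) vs intersectional
# (interaction) logistic regression DIF for a single item.
# Simulated data; variable names are hypothetical.
set.seed(7)
n <- 4000
total <- rnorm(n)                                 # matching variable
gender <- factor(sample(c("F", "M"), n, replace = TRUE))
race <- factor(sample(c("A", "B"), n, replace = TRUE))

# DIF exists only as an interaction: it cancels out in the margins
# of each grouping variable taken separately
dif <- 0.5 * ifelse((gender == "F") == (race == "B"), 1, -1)
y <- rbinom(n, 1, plogis(-0.2 + 1.2 * total + dif))

# Traditional DIF: one grouping variable at a time, main effects only
m0 <- glm(y ~ total, family = binomial)
m_gender <- glm(y ~ total + gender, family = binomial)
m_race <- glm(y ~ total + race, family = binomial)

# Intersectional DIF: interaction between the grouping variables
m_int <- glm(y ~ total + gender * race, family = binomial)

# Likelihood ratio tests; here only the interaction model shows DIF
anova(m0, m_gender, test = "LRT")
anova(m0, m_race, test = "LRT")
anova(m0, m_int, test = "LRT")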

A few things keep DIF analysis on the surface.

  1. More complex analysis would require larger sample sizes for field/pilot testing. We’d have to plan and budget for it.
  2. Better analysis would also require a theory of test bias that developers may not be in a position to articulate. This brings in the debate on consequential validity evidence – who is responsible for investigating test bias, and how extensive does analysis need to be?
  3. Building on 2, only test developers have ready access to the data needed for DIF analysis. Other researchers and the public, who might have good input, aren’t involved. I touch on this idea in a previous post.

References

Albano, A. D., French, B. F., & Vo, T. T. (2024). Traditional vs intersectional DIF analysis: Considerations and a comparison using state testing data. Applied Measurement in Education, 37(1), 57-70. https://doi.org/10.1080/08957347.2024.2311935

Russell, M., & Kaplan, L. (2021). An intersectional approach to differential item functioning: Reflecting configurations of inequality. Practical Assessment, Research & Evaluation, 26(21), 1-17.

Zumbo, B. D., Liu, Y., Wu, A. D., Shear, B. R., Olvera Astivia, O. L., & Ark, T. K. (2015). A methodology for Zumbo’s third generation DIF analyses and the ecology of item responding. Language Assessment Quarterly, 12(1), 136-151. https://doi.org/10.1080/15434303.2014.972559

Review of Cizek’s Validity Book

I recently reviewed G. J. Cizek’s book Validity – An Integrated Approach to Test Score Meaning and Use (published by Routledge, 2020) for the journal Applied Measurement in Education. Here’s a link to my review.

Here’s an overview, from the first paragraph in the review.

Can measurement inferences be meaningful but not useful? Are we better off evaluating test score interpretations separate from their applications? Does validity theory itself need to be revamped? These are the kinds of big philosophical questions Cizek tackles, though with limited philosophizing, in his book Validity – An Integrated Approach to Test Score Meaning and Use. The premise of the book, that validity does need revamping, won’t come as a surprise to readers familiar with his earlier writing on the topic. The main ideas are the same, as are some of his testing examples and metaphors. However, the book does give Cizek space to elaborate on his comprehensive framework for defensible testing, and the target audience of “rigorous scholars and practitioners… who have no wish to be philosophers of science” may appreciate the book’s focus on pragmatic recommendations over “metaphysical contemplations.”

And here’s my synopsis of the book by chapter.

After an intriguing preface (current validation efforts are described as anemic and lacking in alacrity), the book starts with an introduction to some foundational testing concepts (Chapter 1), and then reviews areas of consensus in validation (e.g., content, response process, convergent evidence; Chapter 2), before highlighting the essential point of disagreement (i.e., how we handle test uses and consequences; Chapter 3). Cizek’s main argument, reiterated throughout the book, is that considerations around score inference should nearly always be detached from considerations around test use, and that combining the two (common in the US since the 1990s) has been counterproductive. He presents a framework that separates a) validation of the intended meaning of scores via the usual sources of evidence, minus uses and consequences (Chapter 4), from b) justifying the intended uses of scores, following theory and methods from program evaluation (Chapter 5). The book ends with recommendations for determining how much evidence is enough for successful validation and justification (Chapter 6), and, finally, a summary with comments on future directions (Chapter 7).

Throughout the book, Cizek critiques the writings of Messick, a distinguished validity theorist, and he acknowledges in the book’s preface that doing so felt like tugging on Superman’s cape. I’m not sure where that puts me, someone who has only ever written about validity as it relates to other issues like item bias. I guess I’m either spitting into the wind or pulling the mask off the Old Lone Ranger.

Though I agree with Cizek on some key issues – including that validity theory is becoming impractically complex – my review of the book ended up being mostly critical. Maybe half of my 1800 or so words went to summarizing two limitations that I see in the book. First, it oversimplifies and sometimes misrepresents the alternative and more mainstream perspective that uses and consequences should be part of validity. Quotations and summaries of the opposing views could have been much tighter (I highlight a few in my review). Second, the book leaves us wanting more on the question of how to integrate information – if we evaluate testing in two stages, based on meaning in scores and justification of uses, how do we combine results to determine if a test is defensible? The two stages are discussed separately, but the crucial integration step isn’t clearly explained or demonstrated.

I do like how the book lays out program evaluation as a framework for evaluating (some would say validating) uses and consequences. Again, it’s unclear how we integrate conclusions from this step with our other validation efforts in establishing score meaning. But program evaluation is a nice fit to the general problem of justifying test use. It offers us established procedures and best practices for study design, data collection, and analyzing and interpreting results.

I also appreciate that Cizek is questioning the ever-creeping scope of validity. Uses and consequences can be relevant to validation, and shouldn’t be ignored, but they can also be so complex and open-ended as to make validation unmanageable. Social responsibility and social justice – which have received a lot of attention in the measurement literature in the past three years and so aren’t addressed in their latest form in the book – are a pertinent example. To what extent should antiracism be a component of test design? To what extent should adverse impact in test results invalidate testing? And who’s to say? I still have some reading to do (Applied Measurement in Education has a new special issue on social justice topics), but it seems like proponents would now argue, in the most extreme case, that any group difference justifies pausing or reconsidering testing. Proposals like this need more study and discussion (similar to what we had on social responsibility in admission testing) before they’re applied generally or added to our professional standards.

Calculating Implicit Association Test Scores

I wrote a couple years ago about the limitations of implicit association tests (IAT) for measuring racial bias. Their reliability (test-retest) and validity (correlations with measures of overt bias) are surprisingly low, considering the popularity of the tests.

At the time, I couldn’t find an explanation of how IAT scores are calculated (I didn’t look very hard). Here are a few references.

Some of the original scoring methods come from Greenwald, McGhee, and Schwartz (1998), and updated methods are given in Greenwald, Nosek, and Banaji (2003). All of the methods are based on response latencies measured in milliseconds. Röhner and Thoss (2019) summarize how the methods work and demonstrate with R code.
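For a rough sense of the scoring, here’s a minimal sketch of the core of the improved D score: the difference in mean latencies between the incompatible and compatible blocks, divided by the standard deviation of latencies across both blocks. It leaves out the trimming rules, error penalties, and block-by-block averaging described in the references above; the function and data are made up for illustration.

# Minimal sketch of an IAT D score, omitting trimming, error
# penalties, and block-by-block averaging
iat_d <- function(latency, block) {
  m_incompatible <- mean(latency[block == "incompatible"])
  m_compatible <- mean(latency[block == "compatible"])
  (m_incompatible - m_compatible) / sd(latency)
}

# Made-up latencies in milliseconds, slower in the incompatible block
set.seed(1)
latency <- c(rnorm(40, mean = 800, sd = 150),
  rnorm(40, mean = 950, sd = 150))
block <- rep(c("compatible", "incompatible"), each = 40)
iat_d(latency, block)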

References

Greenwald, A., McGhee, D., & Schwartz, J. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464-1480.

Greenwald, A., Nosek, B., & Banaji, M. (2003). Understanding and using the Implicit Association Test: An improved scoring algorithm. Journal of Personality and Social Psychology, 85, 197-216.

Röhner, J., & Thoss, P. J. (2019). A tutorial on how to compute traditional IAT effects with R. The Quantitative Methods for Psychology, 15(2), 134-147. https://doi.org/10.20982/tqmp.15.2.p134

Differential Item Functioning in the Smarter Balanced Test

In class last fall, we reviewed the Smarter Balanced (SB) technical report for examples of how validity evidence is collected and documented, including through differential item functioning (DIF) analysis.

I teach and research DIF, but I don’t often inspect operational results from a large-scale standardized test. Results for race/ethnicity showed a few unexpected trends. Here’s a link to the DIF section of the 2018/2019 technical report.

https://technicalreports.smarterbalanced.org/2018-19_summative-report/_book/test-fairness.html#differential-item-functioning-dif

The report gives an overview of the Mantel-Haenszel method, and then shows, for ELA/literacy and math, numbers of items from the test bank per grade and demographic variable that fall under each DIF category.

  • The NA category is for items that didn’t have enough valid responses, for a given comparison (e.g., female vs male), to estimate DIF. Groups with smaller sample sizes had more items with NA.
  • A, B, C are the usual Mantel-Haenszel levels of DIF, where A is negligible, B is moderate, and C is large. Testing programs, including SB, focus on items at level C and mostly leave A and B alone (a simplified sketch of the delta-to-category conversion follows this list).
  • The +/- indicates the direction of the DIF, where negative is for items that favor the reference group (e.g., male) or disadvantage the focal group (e.g., female), and positive is for items that do the opposite, favor the focal group or disadvantage the reference group.
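Here’s a minimal sketch (my own simplification, not the Smarter Balanced procedure) of how a Mantel-Haenszel common odds ratio is typically converted to the ETS delta scale and sorted into the A/B/C categories. The full ETS rules also incorporate significance tests, which are omitted here.

# Convert an MH common odds ratio to the ETS delta scale and assign
# a simplified A/B/C category based on absolute delta alone
mh_to_delta <- function(alpha_mh) {
  -2.35 * log(alpha_mh)
}
ets_category <- function(delta) {
  ad <- abs(delta)
  ifelse(ad < 1, "A", ifelse(ad < 1.5, "B", "C"))
}

# Example: an odds ratio of 2 favoring the reference group gives
# delta of about -1.63, category C-
d <- mh_to_delta(2)
paste0(ets_category(d), ifelse(d < 0, "-", "+"))
## [1] "C-"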

The SB report suggests that DIF analysis was conducted at the field test stage, before items became operational. But the results tables say “DIF items in the current summative pool,” which makes it sound like they include operational items. I’m not sure how this worked.

ELA

Here’s a bar chart that summarizes level C DIF by grade for ELA in a subset of demographic comparisons. The bluish bars going up are percentages of items with C+ DIF (favoring the focal group) and the reddish bars going down are for C- (favoring the reference group). The groups being compared are labeled on the right side.

Smarter Balanced 2018/2019 DIF results, percentages of items with level C DIF for ELA/literacy

I’m using percentages instead of counts of items because the number of items differs by grade (under 1,000 in early grades, over 2,000 in grade 11), and the number of items with data for DIF analysis varies by demographic group (some groups had more NA than others). Counts would be more difficult to compare. These percentages exclude items in the NA category.

For ELA, we tend to see more items favoring female (vs male) and asian (vs white) students. There doesn’t seem to be a trend for black vs white students, but in the hispanic vs white comparison more items favor white students, with almost none favoring hispanic students. In some comparisons, we also see a slight increase across the later grades, followed by a decrease at grade 11.

Math

Here’s the same chart but for math items. Note the change in y-axis (now maxing at 4 percent instead of 2 for ELA) to accommodate the increase in DIF favoring asian students (vs white). Other differences from ELA include slightly more items favoring male students (vs female), and more balance in results for black and white students, and hispanic and white students.

DIF in grades 6, 7, and 11 reaches 3 to 4% of items favoring asian students. Converting these back to counts, the total numbers of items with data for DIF analysis are 1,114, 948, and 966 in grades 6, 7, and 11, respectively, and the numbers of C+ DIF favoring asian students are 35, 30, and 38.

Conclusions

These DIF results are surprising, especially for the math test, but I’d want some more information before drawing conclusions.

First, what was the study design supporting the DIF analysis? The technical report doesn’t describe how and when data were collected. Within a given grade and demographic group, do these results accumulate data from different years and different geographic locations? If so, how were forms constructed and administered? Were field test items embedded within the operational adaptive test? And how were results then linked?

Clarifying the study design and scaling would help us understand why so many items had insufficient sample sizes for estimating DIF, and why the numbers of items in the NA category differed by grade and demographic group. Field test items are usually randomly assigned to test takers, which would help ensure numbers of respondents are balanced across items.

Finally, the report leaves out some key details on how the Mantel-Haenszel DIF analysis was conducted. We have the main equations, but we don’t have information about what anchor/control variable was used (e.g., total score vs scale score), whether item purification was used, and how significance testing factored into determining the DIF categories.

Linking vs Mapping vs Predicting

I recently came across a few articles that discuss scale linking in the health sciences, where researchers measure things like psychological distress, well-being, and fatigue, and need to convert patient results from one instrument to another. The literature refers to the process as mapping (Wailoo et al., 2017), but the goals seem to be the same as with other forms of scaling, linking, and equating in education and psychology.

Fayers and Hays (2014) talk about how mapping with health scales is typically accomplished using regression models, which can produce biased results because of regression to the mean. They recommend linking methods. Thompson, Lapin, and Katzan (2017) demonstrate linking with linear and equipercentile functions.

On a related note, someone also shared Bottai et al. (2022), which derives a linear prediction function, based on the concordance correlation from Lin (1989), that ends up being linear equating.
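As a quick illustration of the regression-to-the-mean issue that Fayers and Hays describe, here’s a sketch with simulated data (not any real health scale) comparing regression prediction with linear (mean-sigma) equating for converting scores from instrument X to instrument Y.

# Regression prediction vs linear equating for mapping simulated
# scores from instrument X to instrument Y
set.seed(123)
n <- 500
theta <- rnorm(n)                         # underlying construct
x <- 50 + 10 * theta + rnorm(n, sd = 4)   # instrument X scores
y <- 20 + 5 * theta + rnorm(n, sd = 2)    # instrument Y scores

# Regression mapping: predictions shrink toward mean(y)
y_reg <- predict(lm(y ~ x), data.frame(x = x))

# Linear equating: match means and standard deviations directly
y_lin <- mean(y) + (sd(y) / sd(x)) * (x - mean(x))

# Equated scores keep the spread of y; regression predictions are
# compressed, which is the regression to the mean problem
c(observed = sd(y), regression = sd(y_reg), equating = sd(y_lin))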

References

Bottai, M., Kim, T., Lieberman, B., Luta, G., & Peña, E. (2022). On optimal correlation-based prediction. The American Statistician, 76(4), 313-321. https://doi.org/10.1080/00031305.2022.2051604

Fayers, P. M., & Hays, R. D. (2014). Should linking replace regression when mapping from profile-based measures to preference-based measures? Value in Health, 17(2), 261-265. http://dx.doi.org/10.1016/j.jval.2013.12.002

Lin, L. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45, 255–268.

Thompson, N. R., Lapin, B. R., & Katzan, I. L. (2017). Mapping PROMIS global health items to EuroQol (EQ-5D) utility scores using linear and equipercentile equating. Pharmacoeconomics, 35, 1167-1176. http://dx.doi.org/10.1007/s40273-017-0541-1

Wailoo, A. J., Hernandez-Alava, M., Manca, A., Mejia, A., Ray, J., Crawford, B., Botteman, M., & Busschbach, J. (2017). Mapping to estimate health-state utility from non–preference-based outcome measures: An ISPOR good practices for outcomes research task force report. Value in Health, 20(1), 18-27. http://dx.doi.org/10.1016/j.jval.2016.11.006

More issues in the difR package for differential item functioning analysis in R

I wrote last time about the difR package (Magis, Beland, Tuerlinckx, & De Boeck, 2010) and how it doesn’t account for missing data in Mantel-Haenszel DIF analysis. I’ve noticed two more issues as I’ve continued testing the package (version 5.1).

  1. The problem with Mantel-Haenszel also appears in the code for the standardization method, accessed via difR:::difStd, which calls difR:::stdPDIF. Look there and you’ll see base:::length used to obtain counts (e.g., number of correct/incorrect for focal and reference groups at a given score level). Missing data will throw off these counts. So, difR standardization and MH are only recommended with complete data.
  2. In the likelihood ratio method, code for pseudo $R^2$ (used as a measure of DIF effect size) can lead to errors for some models. The code also seems to assume no missing data. More on these issues below.

DIF with the likelihood ratio method is performed using the difR:::difLogistic function, which ultimately calls difR:::Logistik to do the modeling (via glm) and calculate the $R^2$. The functions for calculating $R^2$ are embedded within the difR:::Logistik function.

R2 <- function(m, n) {
  1 - (exp(-m$null.deviance / 2 + m$deviance / 2))^(2 / n)
}
R2max <- function(m, n) {
  1 - (exp(-m$null.deviance / 2))^(2 / n)
}
R2DIF <- function(m, n) {
  R2(m, n) / R2max(m, n)
}

These functions capture $R^2$ as defined by Nagelkerke (1991), which is a modification to Cox and Snell (1989). When these are run via difR:::Logistik, the sample size n argument is set to the number of rows in the data set, which ignores missing data on a particular item. So, n will be inflated for items with missing data, and $R^2$ will be reduced (assuming a constant deviance).

In addition to the missing data issue, because of the way they’re written, these functions stretch the precision limits of R. In the R2max function specifically, the model deviance is first converted to a log-likelihood, and then a likelihood, before raising to 2/n. The problem is, large deviances correspond to very small likelihoods. A deviance of 300 gives us a likelihood of 7.175096e-66, which R can manage. But a deviance of 1500 gives us a likelihood of 0, which produces $R^2 = 1$.
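A quick check in the console shows the underflow.

# Converting deviances to likelihoods, as in the difR functions above
exp(-300 / 2)
## [1] 7.175096e-66
exp(-1500 / 2)
## [1] 0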

The workaround is simple – avoid calculating likelihoods by rearranging terms. Here’s how I’ve written them in the epmr package.

r2_cox <- function(object, n = length(object$y)) {
  1 - exp((object\$deviance - object\$null.deviance) / n)
}
r2_nag <- function(object, n = length(object$y)) {
  r2_cox(object, n) / (1 - exp(-object$null.deviance / n))
}

And here are two examples that compare results from difR with epmr and DescTools. The first example shows how roughly 10% missing data reduces $R^2$ by as much as 0.02 when using difR. Data come from the verbal data set, included in difR.

# Load example data from the difR package
# See ?difR:::verbal for details
data("verbal", package = "difR")

# Insert missing data on first half of items
set.seed(42)
np <- nrow(verbal)
ni <- 24
na_index <- matrix(
  sample(c(TRUE, FALSE), size = np * ni / 2,
    prob = c(.1, .9), replace = TRUE),
  nrow = np, ncol = ni / 2)
verbal[, 1:(ni / 2)][na_index] <- NA

# Get R2 from difR
# verbal[, 26] is the grouping variable gender
verb_total <- rowSums(verbal[, 1:ni], na.rm = TRUE)
verb_difr <- difR:::Logistik(verbal[, 1:ni],
  match = verb_total, member = verbal[, 26],
  type = "udif")

# Fit the uniform DIF models by hand
# To test for DIF, we would compare these with base
# models, not fit here
verb_glm <- vector("list", ni)
for (i in 1:ni) {
  verbal_sub <- data.frame(y = verbal[, i],
    t = verb_total, g = verbal[, 26])
  verb_glm[[i]] <- glm(y ~ t + g, family = "binomial",
    data = verbal_sub)
}

# Get R2 from epmr and DescTools packages
verb_epmr <- sapply(verb_glm, epmr:::r2_nag)
verb_desc <- sapply(verb_glm, DescTools:::PseudoR2,
  which = "Nag")

# Compare
# epmr and DescTools match for all items
# difR matches for the last 12 items, but R2 on the
# first 12 are depressed because of missing data
verb_tab <- data.frame(item = 1:24,
  pct_na = apply(verbal[, 1:ni], 2, epmr:::summiss) / np,
  difR = verb_difr$R2M0, epmr = verb_epmr,
  DescTools = verb_desc)

This table shows results for items 9 through 16, the last four items with missing data and the first four with complete data.

item pct_na difR epmr DescTools
9 0.089 0.197 0.203 0.203
10 0.085 0.308 0.318 0.318
11 0.139 0.408 0.429 0.429
12 0.136 0.278 0.293 0.293
13 0.000 0.405 0.405 0.405
14 0.000 0.532 0.532 0.532
15 0.000 0.370 0.370 0.370
16 0.000 0.401 0.401 0.401
Some results from first example

The second example shows a situation where $R^2$ in the difR package comes to 1. Data are from the 2009 administration of PISA, included in epmr.

# Prep data from epmr::PISA09
# Vector of item names
rsitems <- c("r414q02s", "r414q11s", "r414q06s",
  "r414q09s", "r452q03s", "r452q04s", "r452q06s",
  "r452q07s", "r458q01s", "r458q07s", "r458q04s")

# Subset to USA and Canada
pisa <- subset(epmr::PISA09, cnt %in% c("USA", "CAN"))

# Get R2 from difR
pisa_total <- rowSums(pisa[, rsitems],
  na.rm = TRUE)
pisa_difr <- difR:::Logistik(pisa[, rsitems],
  match = pisa_total, member = pisa$cnt,
  type = "udif")

# Fit the uniform DIF models by hand
pisa_glm <- vector("list", length(rsitems))
for (i in seq_along(rsitems)) {
  pisa_sub <- data.frame(y = pisa[, rsitems[i]],
    t = pisa_total, g = pisa$cnt)
  pisa_glm[[i]] <- glm(y ~ t + g, family = "binomial",
    data = pisa_sub)
}

# Get R2 from epmr and DescTools packages
pisa_epmr <- sapply(pisa_glm, epmr:::r2_nag)
pisa_desc <- sapply(pisa_glm, DescTools:::PseudoR2,
  which = "Nag")

# Compare
pisa_tab <- data.frame(item = seq_along(rsitems),
  difR = pisa_difr$R2M0, epmr = pisa_epmr,
  DescTools = pisa_desc)

Here are the resulting $R^2$ for each package, across all items.

item difR epmr DescTools
1 1 0.399 0.399
2 1 0.268 0.268
3 1 0.514 0.514
4 1 0.396 0.396
5 1 0.372 0.372
6 1 0.396 0.396
7 1 0.524 0.524
8 1 0.465 0.465
9 1 0.366 0.366
10 1 0.410 0.410
11 1 0.350 0.350
Results from second example

References

Cox, D. R. & Snell, E. J. (1989). The analysis of binary data. London: Chapman and Hall.

Magis, D., Beland, S, Tuerlinckx, F, & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847-862.

Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78, 691-692.

Issues in the difR Package Mantel-Haenszel Analysis

I’ve been using the difR package (Magis, Beland, Tuerlinckx, & De Boeck, 2010) to run differential item functioning (DIF) analysis in R. Here’s the package on CRAN.

https://cran.r-project.org/package=difR

I couldn’t get my own code to match the Mantel-Haenszel (MH) results from the difR package and it looks like it’s because there’s an issue in how the difR:::difMH function handles missing data. My code is on GitHub.

https://github.com/talbano/epmr/blob/master/R/difstudy.R

The MH DIF method is based on counts for correct vs incorrect responses in focal vs reference groups of test takers across levels of the construct (usually total scores). The code for difR:::difMH uses the length of a vector that is subset with logical indices to get the counts of test takers in each group. But missing data here will return NA in the logical comparisons, and NA isn’t omitted from length.

I’m pasting below the code from difR:::mantelHaenszel, which is called by difR:::difMH to run the MH analysis. The lines that compute the counts (Aj, Bj, Cj, Dj, nrj, nfj, m1j, m0j, and Tj) all use length. This works fine with complete data, but as soon as someone has NA for an item score, captured in data[, item], they’ll figure into the count regardless of the comparisons being examined.

function (data, member, match = "score", correct = TRUE, exact = FALSE, 
    anchor = 1:ncol(data)) 
{
    res <- resAlpha <- varLambda <- RES <- NULL
    for (item in 1:ncol(data)) {
        data2 <- data[, anchor]
        if (sum(anchor == item) == 0) 
            data2 <- cbind(data2, data[, item])
        if (!is.matrix(data2)) 
            data2 <- cbind(data2)
        if (match[1] == "score") 
            xj <- rowSums(data2, na.rm = TRUE)
        else xj <- match
        scores <- sort(unique(xj))
        prov <- NULL
        ind <- 1:nrow(data)
        for (j in 1:length(scores)) {
            Aj <- length(ind[xj == scores[j] & member == 0 & 
                data[, item] == 1])
            Bj <- length(ind[xj == scores[j] & member == 0 & 
                data[, item] == 0])
            Cj <- length(ind[xj == scores[j] & member == 1 & 
                data[, item] == 1])
            Dj <- length(ind[xj == scores[j] & member == 1 & 
                data[, item] == 0])
            nrj <- length(ind[xj == scores[j] & member == 0])
            nfj <- length(ind[xj == scores[j] & member == 1])
            m1j <- length(ind[xj == scores[j] & data[, item] == 
                1])
            m0j <- length(ind[xj == scores[j] & data[, item] == 
                0])
            Tj <- length(ind[xj == scores[j]])
            if (exact) {
                if (Tj > 1) 
                  prov <- c(prov, c(Aj, Bj, Cj, Dj))
            }
            else {
                if (Tj > 1) 
                  prov <- rbind(prov, c(Aj, nrj * m1j/Tj, (((nrj * 
                    nfj)/Tj) * (m1j/Tj) * (m0j/(Tj - 1))), scores[j], 
                    Bj, Cj, Dj, Tj))
            }
        }
        if (exact) {
            tab <- array(prov, c(2, 2, length(prov)/4))
            pr <- mantelhaen.test(tab, exact = TRUE)
            RES <- rbind(RES, c(item, pr$statistic, pr$p.value))
        }
        else {
            if (correct) 
                res[item] <- (abs(sum(prov[, 1] - prov[, 2])) - 
                  0.5)^2/sum(prov[, 3])
            else res[item] <- (abs(sum(prov[, 1] - prov[, 2])))^2/sum(prov[, 
                3])
            resAlpha[item] <- sum(prov[, 1] * prov[, 7]/prov[, 
                8])/sum(prov[, 5] * prov[, 6]/prov[, 8])
            varLambda[item] <- sum((prov[, 1] * prov[, 7] + resAlpha[item] * 
                prov[, 5] * prov[, 6]) * (prov[, 1] + prov[, 
                7] + resAlpha[item] * (prov[, 5] + prov[, 6]))/prov[, 
                8]^2)/(2 * (sum(prov[, 1] * prov[, 7]/prov[, 
                8]))^2)
        }
    }
    if (match[1] != "score") 
        mess <- "matching variable"
    else mess <- "score"
    if (exact) 
        return(list(resMH = RES[, 2], Pval = RES[, 3], match = mess))
    else return(list(resMH = res, resAlpha = resAlpha, varLambda = varLambda, 
        match = mess))
}

Here’s a very simplified example of the issue. The vector 1:4 is in place of the ind object in the mantelHaenszel function (created as 1:nrow(data) inside the item loop). The vector c(1, 1, NA, 0) is in place of data[, item] (e.g., in the calculation of Aj). One person has a score of 0 on this item and two have scores of 1, but length returns a count of 2 for item score 0 and 3 for item score 1, because the NA produced by the logical comparison is kept in the subset and counted.

length((1:4)[c(1, 1, NA, 0) == 0])
## [1] 2
length((1:4)[c(1, 1, NA, 0) == 1])
## [1] 3

With missing data, the MH counts from difR:::mantelHaenszel will all be padded by the number of people with NA for their item score. It could be that the authors are accounting for this somewhere else in the code, but I couldn’t find it.
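For what it’s worth, a simple fix would be to count with sum over the logical comparison and drop the NAs, rather than subsetting and taking the length. A sketch, reusing the toy vectors above:

# NA-safe counts: sum the logical comparison with na.rm = TRUE
sum(c(1, 1, NA, 0) == 0, na.rm = TRUE)
## [1] 1
sum(c(1, 1, NA, 0) == 1, na.rm = TRUE)
## [1] 2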

Here’s what happens to the MH results with some made up testing data. For 200 people taking a test with five items, I’m giving a boost on two items to 20 of the reference group test takers (to generate DIF), and then inserting NA for 20 people on one of those items. MH stats are consistent across packages for the first DIF item (item 4) but not the second (item 5).

# Number of items and people
ni <- 5
np <- 200

# Create focal and reference groups
groups <- rep(c("foc", "ref"), each = np / 2)

# Generate scores
set.seed(220821)
item_scores <- matrix(sample(0:1, size = ni * np,
  replace = T), nrow = np, ncol = ni)

# Give 20 people from the reference group a boost on
# items 4 and 5
boost_ref_index <- sample((1:np)[groups == "ref"], 20)
item_scores[boost_ref_index, 4:5] <- 1

# Fix 20 scores on item 5 to be NA
item_scores[sample(1:np, 20), 5] <- NA

# Find total scores on the first three items,
# treated as anchor
total_scores <- rowSums(item_scores[, 1:3])

# Comparing MH stats, chi square matches for item 4
# with no NA but differs for item 5
epmr:::difstudy(item_scores, groups = groups,
  focal = "foc", scores = total_scores, anchor_items = 1:3,
  dif_items = 4:5, complete = FALSE)
## 
## Differential Item Functioning Study
## 
##   item  rn  fn r1 f1 r0 f0   mh  delta delta_abs chisq chisq_p ets_level
## 1    4 100 100 61 52 39 48 1.50 -0.946     0.946  1.58  0.2083         a
## 2    5  88  92 55 40 33 52 2.06 -1.701     1.701  4.84  0.0278         c
difR:::difMH(data.frame(item_scores), group = groups,
  focal.name = "foc", anchor = 1:3, match = total_scores)
## 
## Detection of Differential Item Functioning using Mantel-Haenszel method 
## with continuity correction and without item purification
## 
## Results based on asymptotic inference 
##  
## Matching variable: specified matching variable 
##  
## Anchor items (provided by the user): 
##    
##  X1
##  X2
##  X3
## 
##  
## No p-value adjustment for multiple comparisons 
##  
## Mantel-Haenszel Chi-square statistic: 
##  
##    Stat.  P-value  
## X4 1.5834 0.2083   
## X5 4.8568 0.0275  *
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  
## 
## Detection threshold: 3.8415 (significance level: 0.05)
## 
## Items detected as DIF items: 
##    
##  X5
## 
##  
## Effect size (ETS Delta scale): 
##  
## Effect size code: 
##  'A': negligible effect 
##  'B': moderate effect 
##  'C': large effect 
##  
##    alphaMH deltaMH  
## X4  1.4955 -0.9457 A
## X5  1.8176 -1.4041 B
## 
## Effect size codes: 0 'A' 1.0 'B' 1.5 'C' 
##  (for absolute values of 'deltaMH') 
##  
## Output was not captured!

One more note, when reporting MH results, the difR package only uses the absolute delta values to assign the ETS effect size categories (A, B, C). You can see this in the difR:::print.MH function (not shown here). Usually, the MH approach also incorporates the p-value for the chi-square test (Zwick, 2012).
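As a sketch of what the fuller rule might look like (my simplification of the flagging rules Zwick reviews, not difR code), a classification function could take the delta estimate, its standard error, and the MH chi-square p-value.

# Simplified ETS-style flagging that uses significance along with the
# absolute delta (delta: MH delta; se: its standard error; p: MH
# chi-square p-value)
ets_flag <- function(delta, se, p, alpha = 0.05) {
  ad <- abs(delta)
  if (ad < 1 || p >= alpha) return("A")
  # C requires delta of at least 1.5 in absolute value and
  # significantly greater than 1
  if (ad >= 1.5 && (ad - 1) / se > qnorm(1 - alpha)) return("C")
  "B"
}

# Example with made-up values
ets_flag(delta = -1.70, se = 0.25, p = 0.03)
## [1] "C"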

References

Magis, D., Beland, S, Tuerlinckx, F, & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847–862.

Zwick, R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement. Princeton, NJ: Educational Testing Service. https://files.eric.ed.gov/fulltext/EJ1109842.pdf