Article on Intersectional DIF in Applied Measurement in Education

Brian French, Thao Thu Vo, and I recently (February 2024) published an open-access paper in Applied Measurement in Education on Traditional vs Intersectional DIF Analysis: Considerations and a Comparison Using State Testing Data.

https://doi.org/10.1080/08957347.2024.2311935

The paper extends research by Russell and colleagues (e.g., 2021) on intersectional differential item functioning (DIF).

Here’s our abstract.

Recent research has demonstrated an intersectional approach to the study of differential item functioning (DIF). This approach expands DIF to account for the interactions between what have traditionally been treated as separate grouping variables. In this paper, we compare traditional and intersectional DIF analyses using data from a state testing program (nearly 20,000 students in grade 11, math, science, English language arts). We extend previous research on intersectional DIF by employing field test data (embedded within operational forms) and by comparing methods that were adjusted for an increase in Type I error (Mantel-Haenszel and logistic regression). Intersectional analysis flagged more items for DIF compared with traditional methods, even when controlling for the increased number of statistical tests. We discuss implications for state testing programs and consider how intersectionality can be applied in future DIF research.

We refer to intersectional DIF as DIF with interaction effects, partly to highlight the methodology – which builds on traditional DIF as an analysis of main effects – and to distinguish it as one piece of a larger intersectional perspective on the item response process. We don’t get into the ecology of item responding (Zumbo et al., 2015), but that’s the idea – traditional DIF just scratches the surface.
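
To make the main-effects versus interaction-effects distinction concrete, here is a minimal sketch of how it might look in a logistic regression DIF framework. This is not our analysis code; the file and variable names (item_responses.csv, resp, score, gender, race) are hypothetical, and the paper’s analyses also used Mantel-Haenszel and adjusted for the increase in Type I error.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Hypothetical data: one row per examinee, with a scored response to the
# studied item (resp, 0/1), a matching/total score (score), and two
# grouping variables (gender, race). All names here are placeholders.
df = pd.read_csv("item_responses.csv")

# Traditional DIF: grouping variables enter only as separate main effects,
# conditioning on the matching score.
main_effects = smf.logit("resp ~ score + C(gender) + C(race)", data=df).fit(disp=0)

# DIF with interaction effects: add the gender-by-race interaction so DIF
# specific to intersecting groups can be detected.
interaction = smf.logit("resp ~ score + C(gender) * C(race)", data=df).fit(disp=0)

# Likelihood ratio test for the added interaction terms.
lr_stat = 2 * (interaction.llf - main_effects.llf)
df_diff = interaction.df_model - main_effects.df_model
print("LR =", round(lr_stat, 2), "p =", round(chi2.sf(lr_stat, df_diff), 4))
```

In practice this gets repeated item by item, with some correction for the number of statistical tests, which is part of what the paper compares.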

A few things keep DIF analysis on the surface.

  1. More complex analysis would require larger sample sizes for field/pilot testing. We’d have to plan and budget for it.
  2. Better analysis would also require a theory of test bias that developers may not be in a position to articulate. This brings in the debate on consequential validity evidence – who is responsible for investigating test bias, and how extensive does analysis need to be?
  3. Building on 2, only test developers have ready access to the data needed for DIF analysis. Other researchers and the public, who might have good input, aren’t involved. I touch on this idea in a previous post.

References

Albano, T., French, B. F., & Vo, T. T. (2024). Traditional vs intersectional DIF analysis: Considerations and a comparison using state testing data. Applied Measurement in Education, 37(1), 57-70. https://doi.org/10.1080/08957347.2024.2311935

Russell, M., & Kaplan, L. (2021). An intersectional approach to differential item functioning: Reflecting configurations of inequality. Practical Assessment, Research & Evaluation, 26(21), 1-17.

Zumbo, B. D., Liu, Y., Wu, A. D., Shear, B. R., Olvera Astivia, O. L., & Ark, T. K. (2015). A methodology for Zumbo’s third generation DIF analyses and the ecology of item responding. Language Assessment Quarterly, 12(1), 136-151. https://doi.org/10.1080/15434303.2014.972559

Review of Cizek’s Validity Book

I recently reviewed G. J. Cizek’s book Validity – An Integrated Approach to Test Score Meaning and Use (published by Routledge, 2020) for the journal Applied Measurement in Education. Here’s a link to my review.

Here’s an overview, from the first paragraph in the review.

Can measurement inferences be meaningful but not useful? Are we better off evaluating test score interpretations separate from their applications? Does validity theory itself need to be revamped? These are the kinds of big philosophical questions Cizek tackles, though with limited philosophizing, in his book Validity – An Integrated Approach to Test Score Meaning and Use. The premise of the book, that validity does need revamping, won’t come as a surprise to readers familiar with his earlier writing on the topic. The main ideas are the same, as are some of his testing examples and metaphors. However, the book does give Cizek space to elaborate on his comprehensive framework for defensible testing, and the target audience of “rigorous scholars and practitioners… who have no wish to be philosophers of science” may appreciate the book’s focus on pragmatic recommendations over “metaphysical contemplations.”

And here’s my synopsis of the book by chapter.

After an intriguing preface (current validation efforts are described as anemic and lacking in alacrity), the book starts with an introduction to some foundational testing concepts (Chapter 1), and then reviews areas of consensus in validation (e.g., content, response process, convergent evidence; Chapter 2), before highlighting the essential point of disagreement (i.e., how we handle test uses and consequences; Chapter 3). Cizek’s main argument, reiterated throughout the book, is that considerations around score inference should nearly always be detached from considerations around test use, and that combining the two (common in the US since the 1990s) has been counterproductive. He presents a framework that separates a) validation of the intended meaning of scores via the usual sources of evidence, minus uses and consequences (Chapter 4), from b) justifying the intended uses of scores, following theory and methods from program evaluation (Chapter 5). The book ends with recommendations for determining how much evidence is enough for successful validation and justification (Chapter 6), and, finally, a summary with comments on future directions (Chapter 7).

Throughout the book, Cizek critiques the writings of Messick, a distinguished validity theorist, and he acknowledges in the book’s preface that doing so felt like tugging on Superman’s cape. I’m not sure where that puts me, someone who has only ever written about validity as it relates to other issues like item bias. I guess I’m either spitting into the wind or pulling the mask off the Old Lone Ranger.

Though I agree with Cizek on some key issues – including that validity theory is becoming impractically complex – my review of the book ended up being mostly critical. Maybe half of my 1800 or so words went to summarizing two limitations that I see in the book. First, it oversimplifies and sometimes misrepresents the alternative and more mainstream perspective that uses and consequences should be part of validity. Quotations and summaries of the opposing views could have been much tighter (I highlight a few in my review). Second, the book leaves us wanting more on the question of how to integrate information – if we evaluate testing in two stages, based on meaning in scores and justification of uses, how do we combine results to determine if a test is defensible? The two stages are discussed separately, but the crucial integration step isn’t clearly explained or demonstrated.

I do like how the book lays out program evaluation as a framework for evaluating (some would say validating) uses and consequences. Again, it’s unclear how we integrate conclusions from this step with our other validation efforts in establishing score meaning. But program evaluation is a nice fit for the general problem of justifying test use. It offers us established procedures and best practices for study design, data collection, and the analysis and interpretation of results.

I also appreciate that Cizek is questioning the ever-creeping scope of validity. Uses and consequences can be relevant to validation, and shouldn’t be ignored, but they can also be so complex and open-ended as to make validation unmanageable. Social responsibility and social justice – which have received a lot of attention in the measurement literature in the past three years and so aren’t addressed in their latest form in the book – are a pertinent example. To what extent should antiracism be a component of test design? To what extent should adverse impact in test results invalidate testing? And who’s to say? I still have some reading to do (Applied Measurement in Education has a new special issue on social justice topics), but it seems like proponents would now argue, in the most extreme case, that any group difference justifies pausing or reconsidering testing. Proposals like this need more study and discussion (similar to what we had on social responsibility in admission testing) before they’re applied generally or added to our professional standards.

Commentary Article on College Admission Testing in EMIP

The journal Educational Measurement: Issues and Practice (EMIP) is publishing commentaries on a focus article on College Admission Tests and Social Responsibility (Koljatic, Silva, & Sireci, in press, https://doi.org/10.1111/emip.12425). The authors critique how the standardized testing industry has disengaged from efforts to reduce educational inequities.

Here’s the abstract of my commentary article (also in press, https://doi.org/10.1111/emip.12451), where I argue that Social Responsibility in College Admissions Requires a Reimagining of Standardized Testing.

As college admissions becomes more competitive in the United States and globally, with more applicants competing for limited seats, many programs are transitioning away from standardized testing as an application requirement, in part due to the concern that testing can perpetuate inequities among an increasingly diverse student population. In this article, I argue that we can only address this concern by reimagining standardized testing from the ground up. Following a summary of the recent debate around testing at the University of California (UC), I discuss how my perspective aligns with that of Koljatic et al. (in press), who encourage the testing industry to accept more social responsibility. Building on themes from the focus article and other recent publications, I then propose that, to contribute to educational equity, we must work toward testing that is more transparent and openly accessible than ever before.

Article in Frontiers in Education

My colleagues and I recently published an open-access article in Frontiers in Education titled Contextual Interference Effects in Early Assessment: Evaluating the Psychometric Benefits of Item Interleaving. We looked at how interleaving items by task, as opposed to blocking them, affects the psychometric properties of a test.

Here’s the abstract and link to the full text.

https://www.frontiersin.org/articles/10.3389/feduc.2020.00133/full

Research has shown that the context of practice tasks can have a significant impact on learning, with long-term retention and transfer improving when tasks of different types are mixed by interleaving (abcabcabc) compared with grouping together in blocks (aaabbbccc). This study examines the influence of context via interleaving from a psychometric perspective, using educational assessments designed for early childhood. An alphabet knowledge measure consisting of four types of tasks (finding, orienting, selecting, and naming letters) was administered in two forms, one with items blocked by task, and the other with items interleaved and rotating from one task to the next by item. The interleaving of tasks, and thereby the varying of item context, had a negligible impact on mean performance, but led to stronger internal consistency reliability as well as improved item discrimination. Implications for test design and student engagement in educational measurement are discussed.

The plots below show item difficulty (on the left) and discrimination (right) for 20 items. Plotting characters represent the task for each item, abbreviated as F, O, S, and N (letter finding, orienting, selecting, and naming, respectively), with results from the blocked administration on the x-axis and interleaving on the y-axis.

Our sample sizes (50 for blocked and 55 for interleaving) didn’t support item-level comparisons, but the overall trends are still interesting. Item difficulties don’t appear to change consistently, but discriminations do seem to increase overall with interleaving.
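
If you want to run a similar blocked-versus-interleaved comparison on your own data, here is a rough sketch of the classical item statistics behind these plots. The data below are random placeholders, not our study data, and the function name is made up for illustration.

```python
import numpy as np

def item_stats(scores):
    """Classical difficulty (proportion correct) and discrimination
    (corrected item-total correlation) for a persons-by-items 0/1 matrix."""
    scores = np.asarray(scores, dtype=float)
    difficulty = scores.mean(axis=0)
    total = scores.sum(axis=1)
    discrimination = np.array([
        np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
        for j in range(scores.shape[1])
    ])
    return difficulty, discrimination

# Hypothetical random data standing in for the real responses:
# 50 students on the blocked form, 55 on the interleaved form, 20 items each.
rng = np.random.default_rng(1)
blocked = rng.integers(0, 2, size=(50, 20))
interleaved = rng.integers(0, 2, size=(55, 20))

diff_b, disc_b = item_stats(blocked)
diff_i, disc_i = item_stats(interleaved)

# Plotting diff_b vs diff_i and disc_b vs disc_i, with task labels as the
# plotting characters, gives the kind of comparison described above.
print(np.round(disc_i - disc_b, 2))
```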


Article in Frontiers in Computer Science

A colleague and I recently published an open-access article in Frontiers in Computer Science, titled Development and Evaluation of the Nebraska Assessment of Computing Knowledge. Abstract and link to full text are below.

One way to increase the quality of computing education research is to increase the quality of the measurement tools that are available to researchers, especially measures of students’ knowledge and skills. This paper represents a step toward increasing the number of available thoroughly-evaluated tests that can be used in computing education research by evaluating the psychometric properties of a multiple-choice test designed to differentiate undergraduate students in terms of their mastery of foundational computing concepts. Classical test theory and item response theory analyses are reported and indicate that the test is a reliable, psychometrically-sound instrument suitable for research with undergraduate students. Limitations and the importance of using standardized measures of learning in education research are discussed.

https://www.frontiersin.org/articles/10.3389/fcomp.2020.00011/full
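
For readers less familiar with the classical test theory side of an evaluation like this, the reliability piece typically starts with something like coefficient alpha, which is straightforward to compute from a scored response matrix. Here is a minimal sketch with made-up data, not the actual test or the analysis code from the paper.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a persons-by-items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 0/1 scored responses standing in for the real data. Because
# these are independent random draws, alpha will be near zero here; real
# item responses would be positively correlated.
rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(200, 30))
print(round(cronbach_alpha(scores), 2))
```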