Notes from NCME 2025

The National Council on Measurement in Education (NCME) held its annual conference in April, so I’ve had a couple months to ruminate on what I learned there. Here are some notes from my notes.

  • I talked to someone about how we really don’t need the term validity. It almost always comes with a type or source of evidence attached, in which case validity can be swapped out for a more descriptive term: cultural validity becomes culturally responsive, and content validity evidence becomes content alignment. Validity mostly just captures the general idea of effectiveness. Newton and Shaw (2013) suggested “testing quality” as a replacement.
  • A session on The Usefulness of Kane’s “Validity Argument” in Modern Validity Theory got me thinking that we should acknowledge our philosophical positions when debating validity. Where we land on validity depends entirely on where we stand philosophically, and the main positions (positivism, postmodernism, critical theories) have already been thoroughly hashed out. How should consequences and uses inform validity? That depends on how distinct we consider truth and facts to be from ethics and values.
  • Lots of sessions referenced Randall (2021), who is critical of how validation traditionally aims to minimize construct-irrelevant variance. The suggestion is to frame things positively instead, so as to maximize construct-relevant variance.
  • People say educational assessment when they’re really talking about educational testing. I’ve also made this mistake. Assessment does sound better, but we need precision here, especially when dealing with culturally responsive/sustaining assessment, which is much more feasible than culturally responsive/sustaining testing. Beware of misleading claims and false advertising.
  • Extending the previous point – I caught the tail end of the session Implications of Culturally Responsive Assessment for Large-Scale Assessment Practices. Someone asked how we reconcile culturally responsive assessment with large-scale testing. One audience member suggested we keep them separate, so as to preserve the value of each. Someone replied that separation will result in the two being pitted against each other, in which case large-scale will win. Both are correct, but I’d go with separation if forced to choose.
  • There was a cool session on Evaluating the Psychometric Impacts of Cultural Representation in Item Contexts. Surprisingly, or maybe not, the reported impacts were consistently minimal, but that could be because the cultural adaptations tended to be pretty superficial (e.g., a geometry problem featuring an Aztec temple instead of a tree).
  • I also attended a few sessions and presented on differential item functioning (Albano, French, & Vo, 2024), one of my favorite psychometric topics. As is often the case in psychometric modeling, and in applied statistics generally, we have some fancy and comprehensive models to choose from (e.g., moderated nonlinear factor analysis), but the data and conditions often require that we simplify (e.g., Mantel-Haenszel; see the sketch after this list).
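
To make that last point concrete, here is a minimal sketch of the Mantel-Haenszel procedure for a single dichotomous item. The function name and interface are my own illustration, not anything presented at the conference or in the paper. The method stratifies examinees on total score, builds a two-by-two group-by-response table within each stratum, pools the tables into a common odds ratio, and rescales it to the ETS delta metric.

```python
import numpy as np

def mantel_haenszel_dif(item, total, group):
    """Mantel-Haenszel DIF for one dichotomous item.

    item  : 0/1 responses to the studied item
    total : matching variable, typically total test score
    group : 0 = reference group, 1 = focal group
    """
    item, total, group = map(np.asarray, (item, total, group))
    num = den = 0.0
    for k in np.unique(total):  # one 2x2 table per score stratum
        m = total == k
        a = np.sum(m & (group == 0) & (item == 1))  # reference, correct
        b = np.sum(m & (group == 0) & (item == 0))  # reference, incorrect
        c = np.sum(m & (group == 1) & (item == 1))  # focal, correct
        d = np.sum(m & (group == 1) & (item == 0))  # focal, incorrect
        t = a + b + c + d
        if t == 0:
            continue
        num += a * d / t
        den += b * c / t
    alpha = num / den              # MH common odds ratio across strata
    delta = -2.35 * np.log(alpha)  # ETS delta scale (MH D-DIF)
    return alpha, delta
```

The appeal is the simplicity: it runs on counts alone, with no latent variable model to estimate, which is why it holds up under the sample sizes and sparse subgroups that trip up the fancier approaches.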

References

Albano, T., French, B. F., & Vo, T. T. (2024). Traditional vs intersectional DIF analysis: Considerations and a comparison using state testing data. Applied Measurement in Education, 37(1), 57-70.

Newton, P. E., & Shaw, S. D. (2013). Standards for talking and thinking about validity. Psychological Methods, 18(3), 301-319.

Randall, J. (2021). “Color-neutral” is not a thing: Redefining construct definition and representation through a justice-oriented critical antiracist lens. Educational Measurement: Issues and Practice, 40(4), 82-90.