Linking vs Mapping vs Predicting

I recently came across a few articles that discuss scale linking in the health sciences, where researchers measure constructs like psychological distress, well-being, and fatigue, and need to convert patient results from one instrument to another. The literature refers to the process as mapping (Wailoo et al., 2017), but the goals seem to be the same as with other forms of scaling, linking, and equating in education and psychology.

Fayers and Hays (2014) talk about how mapping with health scales is typically accomplished using regression models, which can produce biased results because of regression to the mean. They recommend linking methods. Thompson, Lapin, and Katzan (2017) demonstrate linking with linear and equipercentile functions.

On a related note, someone also shared Bottai et al. (2022), who derive a linear prediction function, based on the concordance correlation from Lin (1989), that ends up being equivalent to linear equating.

References

Bottai, M., Kim, T., Lieberman, B., Luta, G., & Peña, E. (2022). On optimal correlation-based prediction. The American Statistician, 76(4), 313–321. https://doi.org/10.1080/00031305.2022.2051604

Fayers, P. M., & Hays, R. D. (2014). Should linking replace regression when mapping from profile-based measures to preference-based measures? Value in Health, 17(2), 261-265. http://dx.doi.org/10.1016/j.jval.2013.12.002

Lin, L. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45, 255–268.

Thompson, N. R., Lapin, B. R., & Katzan, I. L. (2017). Mapping PROMIS global health items to EuroQol (EQ-5D) utility scores using linear and equipercentile equating. Pharmacoeconomics, 35, 1167-1176. http://dx.doi.org/10.1007/s40273-017-0541-1

Wailoo, A. J., Hernandez-Alava, M., Manca, A., Mejia, A., Ray, J., Crawford, B., Botteman, M., & Busschbach, J. (2017). Mapping to estimate health-state utility from non–preference-based outcome measures: An ISPOR good practices for outcomes research task force report. Value in Health, 20(1), 18–27. http://dx.doi.org/10.1016/j.jval.2016.11.006

More issues in the difR package for differential item functioning analysis in R

I wrote last time about the difR package (Magis, Beland, Tuerlinckx, & De Boeck, 2010) and how it doesn’t account for missing data in Mantel-Haenszel DIF analysis. I’ve noticed two more issues as I’ve continued testing the package (version 5.1).

  1. The problem with Mantel-Haenszel also appears in the code for the standardization method, accessed via difR:::difStd, which calls difR:::stdPDIF. Look there and you’ll see base:::length used to obtain counts (e.g., number of correct/incorrect for focal and reference groups at a given score level). Missing data will throw off these counts. So, difR standardization and MH are only recommended with complete data.
  2. In the likelihood ratio method, code for pseudo $R^2$ (used as a measure of DIF effect size) can lead to errors for some models. The code also seems to assume no missing data. More on these issues below.

DIF with the likelihood ratio method is performed using the difR:::difLogistic function, which ultimately calls difR:::Logistik to do the modeling (via glm) and calculate the $R^2$. The functions for calculating $R^2$ are embedded within the difR:::Logistik function.

R2 <- function(m, n) {
  1 - (exp(-m$null.deviance / 2 + m$deviance / 2))^(2 / n)
}
R2max <- function(m, n) {
  1 - (exp(-m$null.deviance / 2))^(2 / n)
}
R2DIF <- function(m, n) {
  R2(m, n) / R2max(m, n)
}

These functions capture $R^2$ as defined by Nagelkerke (1991), which is a modification to Cox and Snell (1989). When these are run via difR:::Logistik, the sample size n argument is set to the number of rows in the data set, which ignores missing data on a particular item. So, n will be inflated for items with missing data, and $R^2$ will be reduced (assuming a constant deviance).

In addition to the missing data issue, because of the way they’re written, these functions stretch the precision limits of R. In the R2max function specifically, the model deviance is first converted to a log-likelihood, and then a likelihood, before raising to 2/n. The problem is that large deviances correspond to vanishingly small likelihoods. A deviance of 500 gives us a likelihood of exp(-250), around 2.7e-109, which R can still represent. But a deviance of 1500 gives us exp(-750), which underflows to 0 and produces $R^2 = 1$.
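You can see the underflow directly in the R console:

```r
# deviance = -2 * log-likelihood, so the likelihood is exp(-deviance / 2)
exp(-500 / 2)   # tiny, but still representable in double precision
exp(-1500 / 2)  # underflows to exactly 0, which sends the R2 ratio to 1
```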

The workaround is simple: avoid calculating likelihoods by rearranging terms. Here’s how I’ve written these functions in the epmr package.

r2_cox <- function(object, n = length(object$y)) {
  1 - exp((object$deviance - object$null.deviance) / n)
}
r2_nag <- function(object, n = length(object$y)) {
  r2_cox(object, n) / (1 - exp(-object$null.deviance / n))
}
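As a quick sanity check on the rearrangement, the deviance-based form matches the likelihood-based form whenever the likelihoods don’t underflow. Here’s a small demo using the built-in mtcars data.

```r
# Fit a small logistic regression and compare the two forms of
# Cox and Snell R2: likelihood-based vs deviance-based
m <- glm(vs ~ mpg, family = "binomial", data = mtcars)
n <- length(m$y)
direct <- 1 - exp(-m$null.deviance / 2 + m$deviance / 2)^(2 / n)
rearranged <- 1 - exp((m$deviance - m$null.deviance) / n)
all.equal(direct, rearranged)  # TRUE
```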

And here are two examples that compare results from difR with epmr and DescTools. The first example shows how roughly 10% missing data reduces $R^2$ by as much as 0.02 when using difR. Data come from the verbal data set, included in difR.

# Load example data from the difR package
# See ?difR:::verbal for details
data("verbal", package = "difR")

# Insert missing data on first half of items
set.seed(42)
np <- nrow(verbal)
ni <- 24
na_index <- matrix(
  sample(c(TRUE, FALSE), size = np * ni / 2,
    prob = c(.1, .9), replace = TRUE),
  nrow = np, ncol = ni / 2)
verbal[, 1:(ni / 2)][na_index] <- NA

# Get R2 from difR
# verbal[, 26] is the grouping variable gender
verb_total <- rowSums(verbal[, 1:ni], na.rm = TRUE)
verb_difr <- difR:::Logistik(verbal[, 1:ni],
  match = verb_total, member = verbal[, 26],
  type = "udif")

# Fit the uniform DIF models by hand
# To test for DIF, we would compare these with base
# models, not fit here
verb_glm <- vector("list", ni)
for (i in 1:ni) {
  verbal_sub <- data.frame(y = verbal[, i],
    t = verb_total, g = verbal[, 26])
  verb_glm[[i]] <- glm(y ~ t + g, family = "binomial",
    data = verbal_sub)
}

# Get R2 from epmr and DescTools packages
verb_epmr <- sapply(verb_glm, epmr:::r2_nag)
verb_desc <- sapply(verb_glm, DescTools:::PseudoR2,
  which = "Nag")

# Compare
# epmr and DescTools match for all items
# difR matches for the last 12 items, but R2 on the
# first 12 are depressed because of missing data
verb_tab <- data.frame(item = 1:24,
  pct_na = apply(verbal[, 1:ni], 2, epmr:::summiss) / np,
  difR = verb_difr$R2M0, epmr = verb_epmr,
  DescTools = verb_desc)

This table shows results for items 9 through 16, the last four items with missing data and the first four with complete data.

item pct_na difR epmr DescTools
9 0.089 0.197 0.203 0.203
10 0.085 0.308 0.318 0.318
11 0.139 0.408 0.429 0.429
12 0.136 0.278 0.293 0.293
13 0.000 0.405 0.405 0.405
14 0.000 0.532 0.532 0.532
15 0.000 0.370 0.370 0.370
16 0.000 0.401 0.401 0.401
Some results from first example

The second example shows a situation where $R^2$ in the difR package comes to 1. Data are from the 2009 administration of PISA, included in epmr.

# Prep data from epmr::PISA09
# Vector of item names
rsitems <- c("r414q02s", "r414q11s", "r414q06s",
  "r414q09s", "r452q03s", "r452q04s", "r452q06s",
  "r452q07s", "r458q01s", "r458q07s", "r458q04s")

# Subset to USA and Canada
pisa <- subset(PISA09, cnt %in% c("USA", "CAN"))

# Get R2 from difR
pisa_total <- rowSums(pisa[, rsitems],
  na.rm = TRUE)
pisa_difr <- difR:::Logistik(pisa[, rsitems],
  match = pisa_total, member = pisa$cnt,
  type = "udif")

# Fit the uniform DIF models by hand
pisa_glm <- vector("list", length(rsitems))
for (i in seq_along(rsitems)) {
  pisa_sub <- data.frame(y = pisa[, rsitems[i]],
    t = pisa_total, g = pisa$cnt)
  pisa_glm[[i]] <- glm(y ~ t + g, family = "binomial",
    data = pisa_sub)
}

# Get R2 from epmr and DescTools packages
pisa_epmr <- sapply(pisa_glm, epmr:::r2_nag)
pisa_desc <- sapply(pisa_glm, DescTools:::PseudoR2,
  which = "Nag")

# Compare
pisa_tab <- data.frame(item = seq_along(rsitems),
  difR = pisa_difr$R2M0, epmr = pisa_epmr,
  DescTools = pisa_desc)

Here are the resulting $R^2$ for each package, across all items.

item difR epmr DescTools
1 1 0.399 0.399
2 1 0.268 0.268
3 1 0.514 0.514
4 1 0.396 0.396
5 1 0.372 0.372
6 1 0.396 0.396
7 1 0.524 0.524
8 1 0.465 0.465
9 1 0.366 0.366
10 1 0.410 0.410
11 1 0.350 0.350
Results from second example

References

Cox, D. R. & Snell, E. J. (1989). The analysis of binary data. London: Chapman and Hall.

Magis, D., Beland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847–862.

Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78, 691-692.

Issues in the difR Package Mantel-Haenszel Analysis

I’ve been using the difR package (Magis, Beland, Tuerlinckx, & De Boeck, 2010) to run differential item functioning (DIF) analysis in R. Here’s the package on CRAN.

https://cran.r-project.org/package=difR

I couldn’t get my own code to match the Mantel-Haenszel (MH) results from the difR package and it looks like it’s because there’s an issue in how the difR:::difMH function handles missing data. My code is on GitHub.

https://github.com/talbano/epmr/blob/master/R/difstudy.R

The MH DIF method is based on counts of correct vs incorrect responses in focal vs reference groups of test takers across levels of the construct (usually total scores). The code for difR:::difMH uses the length of a vector subset with logical indices to get the counts of test takers in each group. But missing data will produce NA in the logical comparisons, subsetting with an NA index keeps a placeholder NA element, and length counts it.

I’m pasting below the code from difR:::mantelHaenszel, which is called by difR:::difMH to run the MH analysis. The lines that compute the counts (Aj, Bj, Cj, Dj, nrj, nfj, m1j, m0j, and Tj) all use length. This works fine with complete data, but as soon as someone has NA for an item score, captured in data[, item], they’ll figure into the counts regardless of the comparisons being examined.

function (data, member, match = "score", correct = TRUE, exact = FALSE, 
    anchor = 1:ncol(data)) 
{
    res <- resAlpha <- varLambda <- RES <- NULL
    for (item in 1:ncol(data)) {
        data2 <- data[, anchor]
        if (sum(anchor == item) == 0) 
            data2 <- cbind(data2, data[, item])
        if (!is.matrix(data2)) 
            data2 <- cbind(data2)
        if (match[1] == "score") 
            xj <- rowSums(data2, na.rm = TRUE)
        else xj <- match
        scores <- sort(unique(xj))
        prov <- NULL
        ind <- 1:nrow(data)
        for (j in 1:length(scores)) {
            Aj <- length(ind[xj == scores[j] & member == 0 & 
                data[, item] == 1])
            Bj <- length(ind[xj == scores[j] & member == 0 & 
                data[, item] == 0])
            Cj <- length(ind[xj == scores[j] & member == 1 & 
                data[, item] == 1])
            Dj <- length(ind[xj == scores[j] & member == 1 & 
                data[, item] == 0])
            nrj <- length(ind[xj == scores[j] & member == 0])
            nfj <- length(ind[xj == scores[j] & member == 1])
            m1j <- length(ind[xj == scores[j] & data[, item] == 
                1])
            m0j <- length(ind[xj == scores[j] & data[, item] == 
                0])
            Tj <- length(ind[xj == scores[j]])
            if (exact) {
                if (Tj > 1) 
                  prov <- c(prov, c(Aj, Bj, Cj, Dj))
            }
            else {
                if (Tj > 1) 
                  prov <- rbind(prov, c(Aj, nrj * m1j/Tj, (((nrj * 
                    nfj)/Tj) * (m1j/Tj) * (m0j/(Tj - 1))), scores[j], 
                    Bj, Cj, Dj, Tj))
            }
        }
        if (exact) {
            tab <- array(prov, c(2, 2, length(prov)/4))
            pr <- mantelhaen.test(tab, exact = TRUE)
            RES <- rbind(RES, c(item, pr$statistic, pr$p.value))
        }
        else {
            if (correct) 
                res[item] <- (abs(sum(prov[, 1] - prov[, 2])) - 
                  0.5)^2/sum(prov[, 3])
            else res[item] <- (abs(sum(prov[, 1] - prov[, 2])))^2/sum(prov[, 
                3])
            resAlpha[item] <- sum(prov[, 1] * prov[, 7]/prov[, 
                8])/sum(prov[, 5] * prov[, 6]/prov[, 8])
            varLambda[item] <- sum((prov[, 1] * prov[, 7] + resAlpha[item] * 
                prov[, 5] * prov[, 6]) * (prov[, 1] + prov[, 
                7] + resAlpha[item] * (prov[, 5] + prov[, 6]))/prov[, 
                8]^2)/(2 * (sum(prov[, 1] * prov[, 7]/prov[, 
                8]))^2)
        }
    }
    if (match[1] != "score") 
        mess <- "matching variable"
    else mess <- "score"
    if (exact) 
        return(list(resMH = RES[, 2], Pval = RES[, 3], match = mess))
    else return(list(resMH = res, resAlpha = resAlpha, varLambda = varLambda, 
        match = mess))
}

Here’s a very simplified example of the issue. The vector 1:4 stands in for the ind object in the mantelHaenszel function, and the vector c(1, 1, NA, 0) stands in for data[, item]. One person has a score of 0 on this item and two have scores of 1, but length returns a count of 2 for item score 0 and 3 for item score 1, because the NA element produced by the comparison is not dropped.

length((1:4)[c(1, 1, NA, 0) == 0])
## [1] 2
length((1:4)[c(1, 1, NA, 0) == 1])
## [1] 3

With missing data, the MH counts from difR:::mantelHaenszel will all be padded by the number of people with NA for their item score. It could be that the authors are accounting for this somewhere else in the code, but I couldn’t find it.
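One fix is to count over the logical comparisons with sum and na.rm = TRUE, or to wrap the comparisons in which, since which drops NA. A sketch, using the same toy vectors as above (not difR’s actual code):

```r
x <- c(1, 1, NA, 0)          # stand-in for data[, item]
sum(x == 0, na.rm = TRUE)    # 1, the correct count
length(which(x == 1))        # 2, since which() drops the NA comparison
```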

Here’s what happens to the MH results with some made-up testing data. For 200 people taking a five-item test, I give a boost on two items to 20 of the reference group test takers (to generate DIF), and then insert NA for 20 people on one of those items. MH stats are consistent across packages for the first DIF item (item 4) but not the second (item 5).

# Number of items and people
ni <- 5
np <- 200

# Create focal and reference groups
groups <- rep(c("foc", "ref"), each = np / 2)

# Generate scores
set.seed(220821)
item_scores <- matrix(sample(0:1, size = ni * np,
  replace = T), nrow = np, ncol = ni)

# Give 20 people from the reference group a boost on
# items 4 and 5
boost_ref_index <- sample((1:np)[groups == "ref"], 20)
item_scores[boost_ref_index, 4:5] <- 1

# Fix 20 scores on item 5 to be NA
item_scores[sample(1:np, 20), 5] <- NA

# Find total scores on the first three items,
# treated as anchor
total_scores <- rowSums(item_scores[, 1:3])

# Comparing MH stats, chi square matches for item 4
# with no NA but differs for item 5
epmr:::difstudy(item_scores, groups = groups,
  focal = "foc", scores = total_scores, anchor_items = 1:3,
  dif_items = 4:5, complete = FALSE)
## 
## Differential Item Functioning Study
## 
##   item  rn  fn r1 f1 r0 f0   mh  delta delta_abs chisq chisq_p ets_level
## 1    4 100 100 61 52 39 48 1.50 -0.946     0.946  1.58  0.2083         a
## 2    5  88  92 55 40 33 52 2.06 -1.701     1.701  4.84  0.0278         c
difR:::difMH(data.frame(item_scores), group = groups,
  focal.name = "foc", anchor = 1:3, match = total_scores)
## 
## Detection of Differential Item Functioning using Mantel-Haenszel method 
## with continuity correction and without item purification
## 
## Results based on asymptotic inference 
##  
## Matching variable: specified matching variable 
##  
## Anchor items (provided by the user): 
##    
##  X1
##  X2
##  X3
## 
##  
## No p-value adjustment for multiple comparisons 
##  
## Mantel-Haenszel Chi-square statistic: 
##  
##    Stat.  P-value  
## X4 1.5834 0.2083   
## X5 4.8568 0.0275  *
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  
## 
## Detection threshold: 3.8415 (significance level: 0.05)
## 
## Items detected as DIF items: 
##    
##  X5
## 
##  
## Effect size (ETS Delta scale): 
##  
## Effect size code: 
##  'A': negligible effect 
##  'B': moderate effect 
##  'C': large effect 
##  
##    alphaMH deltaMH  
## X4  1.4955 -0.9457 A
## X5  1.8176 -1.4041 B
## 
## Effect size codes: 0 'A' 1.0 'B' 1.5 'C' 
##  (for absolute values of 'deltaMH') 
##  
## Output was not captured!

One more note: when reporting MH results, the difR package uses only the absolute delta values to assign the ETS effect size levels (A, B, C). You can see this in the difR:::print.MH function (not shown here). Usually, the MH approach also incorporates the p-value for the chi-square (Zwick, 2012).
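For comparison, here is a simplified sketch of a flagging rule that incorporates the chi-square p-value. This is my paraphrase of the logic summarized in Zwick (2012), not difR code, and the full ETS rule also tests whether |delta| is significantly greater than 1 before assigning C.

```r
# Assign ETS levels using both the delta effect size and the
# chi-square p-value; nonsignificant items stay at "a"
ets_level <- function(delta, chisq_p, alpha = 0.05) {
  ifelse(chisq_p >= alpha | abs(delta) < 1, "a",
    ifelse(abs(delta) >= 1.5, "c", "b"))
}

# Deltas and p-values from the epmr output above: item 4 "a", item 5 "c"
ets_level(delta = c(-0.946, -1.701), chisq_p = c(0.2083, 0.0278))
```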

References

Magis, D., Beland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847–862.

Zwick, R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement. Princeton, NJ: Educational Testing Service. https://files.eric.ed.gov/fulltext/EJ1109842.pdf

Some Equations and R Code for Examining Intersectionality in Differential Item Functioning Analysis

A couple of papers came out last year that consider intersectionality in differential item functioning (DIF) analysis. Russell and Kaplan (2021) introduced the idea, and demonstrated it with data from a state testing program. Then, Russell, Szendey, and Kaplan (2021) replicated the first study with more data. This is a neat application of DIF, and I’m surprised it hasn’t been explored until now. I’m sure we’ll see a flurry of papers on it in the next few years.

Side note: the second Russell study, published in Educational Assessment, doesn’t seem justified as a separate publication. It uses the same DIF method as the first paper, appears to draw on the same data source, and reports similar findings. It also doesn’t address any of the limitations of the original study (e.g., a single DIF method, no accounting for the increase in Type I error, no access to item content, no distinction between pilot and operational items). The second study really just has more data.

Why is the intersectional approach neat? Because it can give us a more accurate understanding of potential item bias, to the extent that it captures a more realistic representation of the test taker experience.

The intersectional approach to DIF is a simple extension of the traditional approach, one that accounts for interactions among grouping variables. We can think of the traditional approach as focusing on main effects for distinct variables like gender (female compared with male) and race (Black compared with White). The intersectional approach simply interacts the grouping variables to examine the effects of membership in intersecting groups (e.g., Black female compared with White male).

Interaction DIF models

I like to organize DIF problems using explanatory item response theory (Rasch) models. In the base model, which assumes no DIF, the log-odds $\eta_{ij}$ of correct response on item $i$ for person $j$ can be expressed as a linear function of overall mean performance $\gamma_0$ plus mean performance on the item $\beta_{i}$ and the person $\theta_j$:

$$\eta_{ij} = \gamma_0 + \beta_i + \theta_j,$$

with $\beta$ estimated as a fixed effect and $\theta_j \sim \mathcal{N}(0, \, \sigma^{2})$. The sum $\gamma_0 + \beta_i$ captures item easiness, with higher values indicating easier items.

Before we formulate DIF, we estimate a shift in mean performance by group:

$$\eta_{ij} = \gamma_0 + \gamma_{1}group_j + \beta_i + \theta_j.$$

In a simple dichotomous comparison, we can use indicator coding in $group$, where the reference group is coded as 0 and the focal group as 1. Then, $\gamma_0$ estimates the mean performance for the reference group and $\gamma_1$ is the impact or disparity for the focal group expressed as a difference from $\gamma_0$. To estimate DIF, we interact group with item:

$$\eta_{ij} = \gamma_0 + \gamma_{1}group_j + \beta_{0i} + \beta_{1i}group_j + \theta_j.$$

Now, $\beta_{0i}$ is the item difficulty estimate for the reference group and $\beta_{1i}$ is the DIF effect, expressed as a difference in performance on item $i$ for the focal group, controlling for $\theta$.

The previous equation captures the traditional DIF approach. Separate models would be estimated, for example, with gender in one model and then race/ethnicity in another. The interaction effect DIF approach consolidates terms into a single model with multiple grouping variables. Here, we replace $group$ with $f_j$ for female and $b_j$ for Black:

$$\eta_{ij} = \gamma_0 + \gamma_{1}f_j + \gamma_{2}b_j + \gamma_{3}f_{j}b_j + \beta_{0i} + \beta_{1i}f_j + \beta_{2i}b_j + \beta_{3i}f_{j}b_j + \theta_j.$$

With multiple grouping variables, again using indicator coding, $\gamma_0$ estimates the mean performance for the reference group White male and $\gamma_{1}$, $\gamma_{2}$, and $\gamma_3$ are the deviations in mean performance for White women, Black men, and Black women, respectively, from the reference group. The $\beta$ terms are interpreted similarly but in reference to performance on item $i$, with $\beta_1$, $\beta_2$, and $\beta_3$ as DIF effects.
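The indicator coding is easy to inspect with model.matrix. Each row below is one of the four gender-by-race cells, with the reference group (White male) coded 0 on $f$, $b$, and their product:

```r
# Build the four cells implied by two dichotomous indicators and
# print the corresponding design matrix columns
cells <- expand.grid(f = 0:1, b = 0:1)
model.matrix(~ f * b, data = cells)
```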

R code

Here’s what the above models look like when translated to lme4 (Bates et al., 2015) notation in R.

# lme4 code for running interaction effect DIF via explanatory Rasch
# modeling, via generalized linear mixed model
# family specifies the binomial/logit link function
# data_long would contain scores in a long/tall/stacked format
# with one row per person per item response
# item, person, f, and b are then separate columns in data_long

# Base model
glmer(score ~ 1 + item + (1 | person),
  family = "binomial", data = data_long)

# Gender DIF with main effects
glmer(score ~ 1 + f + item + f:item + (1 | person),
  family = "binomial", data = data_long)

# Race/ethnicity DIF with main effects
glmer(score ~ 1 + b + item + b:item + (1 | person),
  family = "binomial", data = data_long)

# Gender and race/ethnicity DIF with interaction effects
glmer(score ~ 1 + f + b + item + f:b + f:item + b:item + f:b:item + (1 | person),
  family = "binomial", data = data_long)

# Shortcut for writing out the same formula as the previous model
# This notation will automatically create all main effects and
# 2x and 3x interactions
glmer(score ~ 1 + f * b * item + (1 | person),
  family = "binomial", data = data_long)

In my experience, modeling fixed effects for items like this is challenging in lme4 (slow, with convergence issues). Random effects for items would simplify things, but we would have to adopt a different theoretical perspective, where we’re less interested in specific items and more interested in DIF effects, and the intersectional experience, overall.

Here’s what the code looks like with random effects for items and persons. In place of DIF effects, this will produce variances for each DIF term, which tell us how variable the DIF effects are across items by group.

# Gender and race/ethnicity DIF with interaction effects
# Random effects for items and persons
glmer(score ~ 1 + f + b + f:b + (1 + f + b + f:b | item) + (1 | person),
  family = "binomial", data = data_long)

# Alternatively
glmer(score ~ 1 + f * b + (1 + f * b | item) + (1 | person),
  family = "binomial", data = data_long)

While lme4 provides a flexible framework for explanatory Rasch modeling (Doran et al., 2007), DIF analysis gets complicated when we consider anchoring, which I’ve ignored in the equations and code above. In practice, our IRT model would ideally include a subset of items where we are confident that DIF is negligible. These items anchor our scale and provide a reference point for comparing performance on the potentially problematic items.

The mirt R package (Chalmers, 2012) has a lot of nice features for conducting DIF analysis via IRT. Here’s how we get at main effects and interaction effects DIF using mirt:::multipleGroup and mirt:::DIF. The former runs the model and the latter reruns it, testing the significance of the multi group extension by item.

# mirt code for interaction effect DIF

# Estimate the multi group Rasch model
# Here, data_wide is a data frame containing scored item responses in
# columns, one per item
# group_var is a vector of main effect or interacting group values,
# one per person (e.g., "fh" and "mw" for female-hispanic and male-white)
# anchor_items is a vector of item names, matching columns in data_wide,
# for the items that are not expected to vary by group, these will
# anchor the scale prior to DIF analysis
# See the mirt help files for more info
mirt_mg_out <- multipleGroup(data_wide, model = 1, itemtype = "Rasch",
  group = group_var,
  invariance = c(anchor_items, "free_means", "free_variances"))

# Run likelihood ratio DIF analysis
# For each item, the original model is fit with and without the
# grouping variable specified as an interaction with item
# Output will then specify whether inclusion of the grouping variable
# improved model fit per item
# items2test identifies the columns for DIF analysis
# Apparently, items2test has to be a numeric index, I can't get a vector
# of item names to work, so these would be the non-anchor columns in
# data_wide
mirt_dif_out <- DIF(mirt_mg_out, "d", items2test = dif_items)

One downside to the current setup of mirt:::multipleGroup and mirt:::DIF is there isn’t an easy way to iterate through separate focal groups. The code above will test the effects of the grouping variable all at once. So, we’d have to run this separately for each dichotomous comparison (e.g., subsetting the data to Hispanic female vs White male, then Black female vs White male, etc) if we want tests by focal group.

Of course, interaction effects DIF can also be analyzed outside of IRT (e.g., with the Mantel-Haenszel method). It simply involves more comparisons per item than the main effect approach, where we consider each grouping variable separately. For example, gender (two levels: female, male) and race (three levels: Black, Hispanic, White) give us 3 comparisons per item with main effects, but 5 comparisons per item with interaction effects.
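Counting those comparisons in R (hypothetical grouping variables, with White male as the single reference group under the interaction approach):

```r
gender <- c("female", "male")
race <- c("Black", "Hispanic", "White")

# Main effects: each variable compared against its own reference level
(length(gender) - 1) + (length(race) - 1)  # 3

# Interaction: every gender-by-race cell against the one reference cell
nrow(expand.grid(gender, race)) - 1        # 5
```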

After writing up all this example code, I’m realizing it would be much more useful if I demonstrated it with output. I’ll try to round up some data and share results in a future post.

References

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.

Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48, 1–29.

Doran, H., Bates, D., Bliese, P., & Dowling, M. (2007). Estimating the multilevel Rasch model: With the lme4 package. Journal of Statistical Software, 20(2), 1–18.

Russell, M., & Kaplan, L. (2021). An intersectional approach to differential item functioning: Reflecting configurations of inequality. Practical Assessment, Research & Evaluation, 26(21), 1–17.

Russell, M., Szendey, O., & Kaplan, L. (2021). An intersectional approach to DIF: Do initial findings hold across tests? Educational Assessment, 26, 284–298.

Community Engagement in Assessment Development

In a commentary article from 2021 on social responsibility in admission testing (Albano, 2021), I recommended that we start crowd-sourcing the test development process.

By crowd-sourced development, I mean that the public as a community will support the review of content so as to organically and dynamically improve test quality. Not only does this promise to be more transparent and efficient than review by selected groups, but, with the right training, it also empowers the public to contribute directly to assessing fairness, sensitivity, and accessibility. Furthermore, a more diverse population, potentially the entire target population, will have access to the test, which will facilitate the rapid development of content that is more representative of and engaging for historically marginalized and underrepresented groups. This community involvement need not replace or diminish expert review. It can supplement it.

The idea of crowd-sourcing item writing and review has been on my mind for a decade or so. I pursued it while at the University of Nebraska, creating a web app (https://proola.org, now defunct) intended to support educators in sharing and getting feedback on their classroom assessment items. We piloted the app with college instructors from around the US to build a few thousand openly-licensed questions (Miller & Albano, 2017). But I couldn’t keep the momentum going after that and the project fizzled out.

Also while at Nebraska, I worked with Check for Learning (C4L, also now defunct I believe), a website managed by the Nebraska Department of Education that let K12 teachers from across the state share formative assessment items with one another. The arrangement was that a teacher would contribute a certain number of items to the bank before they could administer questions from C4L in their classroom. If I remember right, the site was maintained for a few years but ultimately shut down because of a lack of interest.

In these two examples, we can think of the item writing process as being spread out horizontally. Instead of the usual limited and controlled sample, access is given to a wider “crowd” of content experts. In the case of C4L, the entire population of teachers could contribute to the shared item bank.

Extending this idea, we can think of community engagement as distributing assessment development vertically to other populations, where we expand both on a) what we consider to be appropriate content, and b) who we consider to be experts in it.

In addition to working with students and educators, engaging the community could involve surveying family members or interviewing community leaders to better understand student backgrounds and experiences. We might review outlines/frameworks together, and get feedback on different contexts, modes, and methods of assessment. We could discuss options for assessment delivery and technology, and how to best communicate regarding assessment preparation, practice at home, and finally interpreting results.

I am hearing more discussion lately about increasing community engagement in assessment development. The aim is to decolonize and create culturally relevant/sustaining content, while also enhancing transparency and buy-in at a more local level. This comes alongside, or maybe in the wake of, a broader push to revise our curricula and instruction to be more oriented toward equity and social justice.

I’m still getting into the literature, but these ideas seem to have taken shape in the context of educational assessment, and then testing and measurement more specifically, in the 1990s. Here’s my current reading list from that timeframe.

  • Ladson-Billings and Tate (1995) introduce critical race theory in education as a framework and method for understanding educational inequities. In parallel, Ladson-Billings (1995) outlines culturally responsive pedagogy.
  • Moss (1996) argues for a multi-method approach to validation, where we leverage the contrast between traditional “naturalist” methods and contextualized “interpretive” ones, with the goal of “expanding the dialogue among measurement professionals to include voices from research traditions different from ours and from the communities we study and serve” (p 20).
  • Lee (1998), referencing Ladson-Billings, applies culturally responsive pedagogy to improve the design of performance assessments “that draw on culturally based funds of knowledge from both the communities and families of the students” and that “address some community-based, authentic need” (p 273).
  • Gipps (1999) highlights the importance of social and cultural considerations in assessment, referencing Moss among others, within a comprehensive review of the history of testing and its epistemological strengths and limitations.
  • Finally, Shepard (2000), referencing Gipps among others, provides a social-constructivist framework for assessment in support of teaching and learning, one that builds on cognitive, constructivist, and sociocultural theories.

References

Albano, A. D. (2021). Commentary: Social responsibility in college admissions requires a reimagining of standardized testing. Educational Measurement: Issues and Practice, 40, 49-52.

Gipps, C. (1999). Socio-cultural aspects of assessment. Review of Research in Education, 24, 355–392.

Ladson-Billings, G. (1995). Toward a theory of culturally relevant pedagogy. American Educational Research Journal, 32, 465-491.

Ladson-Billings, G., & Tate, W. F. (1995). Toward a critical race theory of education. Teachers College Record, 97, 47-68.

Lee, C. D. (1998). Culturally responsive pedagogy and performance-based assessment. The Journal of Negro Education, 67, 268-279.

Miller, A. & Albano, A. D. (2017, October). Content Camp: Ohio State’s collaborative, open test bank pilot. Paper presented at OpenEd17: The 14th Annual Open Education Conference, Anaheim, CA.

Moss, P. A. (1996). Enlarging the dialogue in educational measurement: Voices from interpretative research traditions. Educational Researcher, 25, 20-28.

Shepard, L. A. (2000). The role of assessment in a learning culture. Educational Researcher, 29, 4-14.

EMIP Commentaries on College Admission Tests and Social Responsibility by Koljatic, Silva, and Sireci

I’m sharing here my notes on a series of commentaries in press with the journal Educational Measurement: Issues and Practice (EMIP). The commentaries examine the topic of social responsibility (SR) in college admission testing, in response to the following focus article, where the authors challenge the testing industry to be more engaged in improving equity in education.

Koljatic, M., Silva, M., & Sireci, S. (in press). College admission tests and social responsibility. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12425

I enjoyed reading the commentaries. They are thoughtful and well-written, represent a variety of perspectives on SR, and raise some valid concerns. For the most part, there is agreement that we can do better as a field, though there is disagreement on the specifics.

There are 14 articles, including mine. I’m going to list them alphabetically by last name of first author, and give a short summary of the main points. Full references are at the end.

1. Ackerman, The Future of College Admissions Tests

  • Ackerman defends the testing industry, saying we haven’t ignored SR so much as we’ve attended to what is becoming an outdated version of SR, one that valued merit over high socioeconomic status. We haven’t been complacent, just slow to change course as SR has evolved. This reframing serves to distribute the responsibility, but the main point from the focus article still stands: standardized testing is lagging and we need to pick up our feet.
  • Ackerman recommends considering tests of competence, perhaps something with criterion referencing, resembling Advanced Placement, though we still have to deal with differential access to the target test content.

2. Albano, Social Responsibility in College Admissions Requires a Reimagining of Standardized Testing

  • My article summarizes the debate around SR in admissions in the University of California (UC) over the past few years, with references to some key policy documents.
  • I critique the Nike analogy, pointing out how the testing industry is more similar to a manufacturer, building shoes according to specifications, than it is to a distributor. Nike could just as easily represent an admissions program. This highlights how SR in college admissions will require cooperation from multiple stakeholders.
  • The suggestions from the focus article for how we address SR just scratch the surface. Our goal should be to build standardized assessment systems that are as openly accessible and transparent as possible, optimally having all test content and item-level data available online.

3. Briggs, Comment on College Admissions Tests and Social Responsibility

  • Briggs briefly scrutinizes the Nike analogy, and then contrasts the technical, standard definition of fairness or lack of bias with the public interpretation of fairness as lack of differential impact, acknowledging that we’ve worked as a field to address the former but not so much the latter.
  • He summarizes research, including his own, indicating that although coaching may have a small effect in terms of score changes, admission officers may still act on small differences. This suggests inequitable test preparation shouldn’t be ignored.
  • Briggs also recommends we consider how college admissions will fare going forward with optional or no testing. Recent studies show that diversity may increase slightly as a result. It remains to be seen how other admission variables will be interpreted and potentially manipulated in the absence of a standardized quantitative measure.

4. Camara, Negative Consequences of Testing and Admission Practices: Should Blame Be Attributed to Testing Organizations?

  • Camara highlights how disparate impact in admissions goes beyond testing into the admission process itself. Other applicant variables (eg, personal statements, GPA, letters of recommendation) also have limitations.
  • He also says the focus article fails to acknowledge how industry has already been responsive to SR concerns. Changes have been made as requested, but they are slow to implement, and sometimes they aren’t even utilized (eg, non-cognitive assessments, essay sections).

5. Franklin et al, Design Tests with a Learning Purpose

  • Franklin et al propose, in under two pages, that we design admission tests to serve two purposes at once: 1) teaching, in addition to 2) measuring, which they refer to as the original purpose. Teaching through testing is accomplished via formative feedback that can guide test takers to remediation.
  • As an example, they reference a free and open-source testing system for college placement (https://daacs.net) that provides students with diagnostic information and learning resources.
  • This sort of idea came up in our conversations around admissions at the UC. As a substitute for the SAT, we considered the Smarter Balanced assessments (used for end-of-year K12 testing in California), which, in theory, could provide diagnostic information linked to content standards.
  • Measurement experts might say that when a test serves multiple purposes it risks serving none of them optimally. This assumes that there are limited resources for test development or that the multiple purposes involve competing interests and trade-offs, which may or may not actually be the case.

6. Geisinger, Social Responsibility, Fairness, and College Admissions Tests

  • Geisinger gives some historical context to the discussion of fairness and clarifies from the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) that the users of tests are ultimately responsible for their use.
  • He contrasts validity with the similar but more comprehensive utility theory from industrial/organizational psychology. Utility theory accounts for all of the costs and impacts of test use, and in this way it seems to overlap with what we call consequential validity.
  • Geisinger also recommends we expand DIF analysis to include external criterion measures. This idea also came up in our review of the SAT and alternatives in the UC.
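To make Geisinger’s recommendation concrete, here’s a minimal sketch (my own, with hypothetical data, not from the commentary) of a Mantel-Haenszel DIF analysis in Python, where examinees are stratified on an external criterion, such as bands of first-year college GPA, rather than on total test score.

```python
# Mantel-Haenszel common odds ratio for DIF, stratifying examinees on an
# external criterion (eg, college GPA bands) instead of total test score.
# Each stratum is a 2x2 table of counts for one item:
#   (ref_correct, ref_incorrect, focal_correct, focal_incorrect)

def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across strata. Values near 1
    suggest no DIF; values far from 1 suggest the item functions
    differently for the two groups at the same criterion level."""
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n  # reference correct x focal incorrect
        den += b * c / n  # reference incorrect x focal correct
    return num / den

# Hypothetical counts for one item, in two GPA-band strata.
strata = [
    (40, 10, 30, 20),  # high-GPA band
    (20, 30, 10, 40),  # low-GPA band
]
print(round(mh_odds_ratio(strata), 3))  # 2.667 here, well above 1
```

The only change from a standard MH DIF analysis is the stratifying variable; everything else (including follow-up significance tests, which I omit) carries over.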

7. Irribarra et al, Large-Scale Assessment and Legitimacy Beyond the Corporate Responsibility Model

  • Irribarra et al argue that admission testing is not a product or service but a public policy intervention, in which case, it’s reasonable to expect testing to have a positive impact. They don’t really justify this position or consider the alternatives.
  • The authors outline three strategies for increasing the legitimacy of admission testing as a policy intervention, including 1) increased transparency (in reporting), 2) adding value (eg, formative score interpretations), and 3) community participation (eg, having teachers as item writers and ambassadors to the community). These strategies align with the recommendations in other articles, including mine.

8. Klugman et al, The Questions We Should Be Asking About Socially Responsible College Admission Testing

  • This commentary provided lots of concrete ideas to discuss. I’ll probably need a separate post to elaborate.
  • In parsing the Nike analogy, Klugman et al note, as do other commentaries, that testing companies have less influence over test use than a distributor like Nike may have over its manufacturers. As a result, the testing industry may have less leverage for change. The authors also point out that the actual impacts of Nike accepting SR are unclear. We shouldn’t assume that there has been sustained improvement in manufacturing, as there is evidence that problems persist, and it could be that “Nike leadership stomps out scandals as they pop up” (p 1).
  • Klugman et al cite a third flaw in the Nike analogy, and I would push back on this one. They say that, whereas consumers pressured for change with Nike, the consumers of tests (the colleges and universities who use them) “are not demanding testing agencies dramatically reenvision their products and how they are used” (p 2). While I agree that higher education is in the best position to ask for a better testing product, I disagree that they’ve neglected to do so. Concerns have been raised over the years and the testing industry has responded. Camara and Briggs both note this in their commentaries, and Camara lists out a few examples, as do commentaries from ACT and College Board (below).
  • That last point might boil down to what the authors meant by “dramatically reenvision” in the quote above. It’s unclear what a dramatic reenvisioning would entail. Maybe the authors would accept that changes have been made, but that they haven’t been dramatic enough.
  • Next, Klugman et al argue that corporate SR for testing companies is “ill-defined and undesirable” (p 2). The gist is that SR would be complicated in practice because reducing score gaps would conflict with existing intended uses of test scores. I was hoping for more discussion here but they move on quickly to a list of recommendations for improving testing and the admissions process itself. Some of these recommendations appear in different forms in other commentaries (focus on content-related validity and criterion referencing, reduce the costs of testing, consider how admissions changes when we don’t use tests), and there was one I didn’t see elsewhere (be careful of biases coded into historical practices and datasets that are used to build new tools and predictive models).

9. Koretz, Response to Koljatic et al: Neither a Persuasive Critique of Admissions Testing Nor Practical Suggestions for Improvement

  • As the title suggests, Koretz is mostly critical of the focus article in his commentary. He reviews its limitations and concludes that it’s largely unproductive. He says the article missteps with the Nike analogy, and that it doesn’t: clarify the purposes and target constructs of admission testing, acknowledge the research showing a lack of bias, give evidence of how testing causes inequities, or provide clear or useful suggestions for improving the situation.
  • Koretz also questions the generally negative tone of the focus article, a tone evident in key phrases that feel unnecessarily cynical (that’s my interpretation of his point), as well as a lack of support for some of its primary claims (insufficient or unclear references).

10. Lyons et al, Evolution of Equity Perspectives on Higher Education Admissions Testing: A Call for Increased Critical Consciousness

  • Lyons et al summarize how perspectives on admission testing have progressed over time from a) emphasizing aptitude over student background to b) emphasizing achievement over aptitude, and now to c) an awareness of opportunity gaps and d) recognition of more diverse knowledge and skills.
  • The authors argue that systematic group differences in test scores are justification for removing or limiting tests as gatekeepers to admission. They don’t address the broader issue of the admission process itself being a gatekeeper to admission.
  • They end (p 3) with suggestions for expanding selection variables to include “passion and commitment, adaptability, short-term, and long-term goals, ability to build connections and a sense of belonging, cultural competence, ability to navigate adversity, and propensity for leadership and collective responsibility.” They also concede that “Academic achievement, as measured by standardized tests, may be useful in playing a limited, compensatory role, but always in partnership with divergent measures that value and represent multiple ways of knowing, doing, and being.”
  • The authors don’t acknowledge that testing companies are already exploring ways to measure these other variables (discussed, eg, in the Mattern commentary), and admissions programs already try to account for them on their own (eg, via personal statements and letters of recommendation). It’s unclear if the authors are suggesting we need new standardized measures of these variables.

11. Mattern et al, Reviving the Messenger: A Response to Koljatic et al

  • The authors, all from ACT, respond to focus article suggestions that the testing industry 1) review construct irrelevance and account for opportunity to learn, 2) explore new ways of testing to reduce score gaps, and 3) increase transparency and accountability generally.
  • They discuss how the testing industry is already addressing 1) by, eg, aligning tests to K12 curricula, asking college instructors via survey what they expect in new students, and documenting opportunity to learn while acknowledging that it has impacts beyond testing.
  • They interpret 2) as a call from the focus article to redesign admission tests themselves so that they produce “predetermined outcomes,” which Mattern et al reject as “unscientific” (p 2). I don’t know that the focus article meant to say that the tests should be modified to hide group differences, but I can see how their recommendations were open to interpretation. Rather than change the tests, Mattern et al recommend considering less traditional variables like social and emotional learning.
  • Finally, the authors respond to 3) with examples of their commitment to transparency, accountability, and equity. The list is not short, and ACT’s level of engagement seems pretty reasonable, more than they’re given credit for in the other commentaries.

12. Randall, From Construct to Consequences: Extending the Notion of Social Responsibility

  • Randall advocates for an anti-racist approach to standardized testing, in line with her EMIP article from earlier this year (Randall, 2021), wherein we reconsider how our current construct definitions and measurement methods sustain white supremacy.
  • Randall questions the familiar comparison of standardized testing to a doctor or thermometer, pointing out that decision-making in health care isn’t without flaws or racist outcomes, and concluding that the admission testing industry has “failed to… see itself as anything other than some kind of neutral ruler/diagnostic tool,” and that “the possibility that the test is wrong” is something that “many in the admission testing industry are resistant to even considering” (p 1).
  • I appreciate Randall’s critique of this analogy. I hadn’t scrutinized it in this way before, and can see how it oversimplifies the issue, granting to tests an objectivity and essential quality that they don’t deserve. That said, Randall seems to oversimplify the issue in the opposite direction, without accounting for the ways in which industry does now acknowledge and attempt to address the limitations of testing.
  • Randall recommends that, instead of college readiness, we label the target construct of admission testing as “the knowledge, values, and ways of understanding of the white dominant class” (p 2). I don’t know the critical theory literature behind recommendations like this well, and I’m curious how it squares with research showing that achievement gaps are largely explained by school poverty. It would be helpful to see examples of test content, in something like the released SAT questions, that uniquely privilege a student’s whiteness apart from their socioeconomic background.

13. Walker, Achieving Educational Equity Requires a Communal Effort

  • Walker summarizes points of agreement with the focus article, eg, standard practices are only a starting point for navigating SR in testing, and testing companies can be more engaged in promoting fair test use, including by collaborating with advocacy groups. Walker highlights the state of Hawaii as an example, as they implemented standards and assessments that better align with their Hawaiian language immersion schools.
  • He also critiques and extends the arguments made in the focus article, saying that our traditional practice in test development and psychometrics “represents a mainstream viewpoint that generally fails to account for the many social and cultural aspects of learning and expression” (p 1). Referring to the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014), he says, “the Standards can only advocate for a superficially inclusive approach to pursuing an exclusive agenda. Thus, any test based on those standards will be woefully inadequate with respect to furthering equity” (p 2).
  • Walker, referring to a report from the UC, argues that admission tests already map onto college readiness, as evidenced in part by correlations between test scores and college grades. Critics would note here that test scores capitalize on the predictiveness of socioeconomic status, and, in the UC at least, they do so more than high school grades do (Geiser, 2020). Test scores measure more socioeconomic readiness than we might realize.
  • Walker concludes that equity will require much more than SR in testing. He says, “Any attempt to reform tests independently of the educational system would simply result in tests that no longer reflected what was happening in schools and that had lost relevance” (p 2). In addition to testing, we need to reevaluate SR in the education system itself. He shares a lot of good examples and references here (eg, on classroom equity and universal design).
  • Finally, Walker refers to democratic testing (Shohamy, 2001), a term I hadn’t heard of. He says, “testing should be a democratic process, conducted in collaboration and cooperation with those tested” (p 2). Further, “everyone involved in testing must assume responsibility for tests and their uses, instead of leaving all the responsibility in the hands of a powerful few” (p 2). This point resonates well with my recommendations for less secrecy and security in testing, and more access, partnership, and transparency.
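Circling back to Geiser’s omitted-variable point raised against Walker’s predictive-validity argument: it can be illustrated with a quick simulation. This is my own sketch with made-up data, showing an extreme case in which test scores and college grades relate only through a shared SES factor; the raw correlation looks substantial, but the partial correlation controlling for SES is near zero.

```python
import math
import random

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

random.seed(1)
n = 2000
ses = [random.gauss(0, 1) for _ in range(n)]   # the omitted variable
test = [s + random.gauss(0, 1) for s in ses]   # test score loads on SES
gpa = [s + random.gauss(0, 1) for s in ses]    # college GPA loads on SES

r_raw = corr(test, gpa)                  # inflated by the shared SES factor
r_partial = partial_corr(test, gpa, ses) # near zero once SES is controlled
print(round(r_raw, 2), round(r_partial, 2))
```

The simulation is deliberately extreme (the true test–GPA relationship is entirely SES-driven); Geiser’s claim is the softer version, that part of the test’s predictive power rides on SES.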

14. Way, An Evidence-Based Response to College Admission Tests and Social Responsibility

  • The authors, both from College Board, highlight how the company is already working to address inequities through fee waivers, free test prep via Khan Academy, the Landscape tool, etc. By omitting this information, the focus article misrepresents industry.
  • Regarding the focus article’s claim that industry isn’t sufficiently committed to transparency and accountability, the authors reply, “There is no clear explanation provided as to what they are referring to and the claim is simply not based on facts.”
  • The authors recommend that the National Council on Measurement in Education form a task force to move this work forward.

Summary

Here are a few themes I see in the focus article and commentaries.

  1. The focus article and some of the commentaries don’t really acknowledge what has already been done in admission testing with respect to SR. Perhaps this was omitted in the interest of space, but, ideally, a call for action would start with a review of existing efforts (some of which are listed above) and then present areas for improvement.
  2. The Nike analogy has some flaws, as can be expected with any analogy. It still seems instructive though, especially when we stretch it a bit and consider reversing the roles.
  3. As for next steps, there’s some consensus that we need increased transparency and more input, from diverse stakeholders, in the test development process.
  4. Improving SR in admission testing and beyond, so as to reduce educational inequities, will be complicated, and has implications for our education system in general. Though not directly addressed in the articles, the more diverging viewpoints (testing is pretty good vs inherently unjust) probably arise from a lack of consensus on broader issues like meritocracy, the feasibility of objective measurement, and the role of educational standards.

I’m curious to see how Koljatic, Silva, and Sireci bring the discussion together in a response, which I believe is forthcoming in EMIP.

References for Commentaries

Ackerman, P. L. (in press). The future of college admissions tests. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12456

Albano, A. D. (in press). Social responsibility in college admissions requires a reimagining of standardized testing. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12451

Briggs, D. C. (in press). Comment on college admissions tests and social responsibility. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12455

Camara, W. J. (in press). Negative consequences of testing and admission practices: Should blame be attributed to testing organizations? Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12448

Franklin, D. W., Bryer, J., Andrade, H. L., & Liu, A. M. (in press). Design tests with a learning purpose. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12457

Geisinger, K. F. (in press). Social responsibility, fairness, and college admissions tests. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12450

Irribarra, D. T., & Santelices, M. V. (in press). Large-scale assessment and legitimacy beyond the corporate responsibility model. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12460

Klugman, E. M., An, L., Himmelsbach, Z., Litschwartz, S. L., & Nicola, T. P. (in press). The questions we should be asking about socially responsible college admission testing. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12449

Koretz, D. (in press). Response to Koljatic et al.: Neither a persuasive critique of admissions testing nor practical suggestions for improvement. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12454

Lyons, S., Hinds, F., & Poggio, J. (in press). Evolution of equity perspectives on higher education admissions testing: A call for increased critical consciousness. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12458

Mattern, K., Cruce, T., Henderson, D., Gridiron, T., Casillas, A., & Taylor, M. (in press). Reviving the messenger: A response to Koljatic et al. (2021). Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12459

Randall, J. (in press). From construct to consequences: Extending the notion of social responsibility. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12452

Walker, M. E. (in press). Achieving educational equity requires a communal effort. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12465

Way, W. D., & Shaw, E. J. (in press). An evidence-based response to college admission tests and social responsibility. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12467

Other References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Lanham, MD: American Educational Research Association.

Geiser, S. (2020). SAT/ACT Scores, High School GPA, and the Problem of Omitted Variable Bias: Why the UC Taskforce’s Findings are Spurious. https://cshe.berkeley.edu/publications/satact-scores-high-school-gpa-and-problem-omitted-variable-bias-why-uc-taskforce’s

Randall, J. (2021). “Color-neutral” is not a thing: Redefining construct definition and representation through a justice-oriented critical antiracist lens. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12429

Shohamy, E. (2001). Democratic assessment as an alternative. Language Testing, 18(4), 373–391.

Commentary Article on College Admission Testing in EMIP

The journal Educational Measurement: Issues and Practice (EMIP) is publishing commentaries on a focus article on College Admission Tests and Social Responsibility (Koljatic, Silva, & Sireci, in press, https://doi.org/10.1111/emip.12425). The authors critique how the standardized testing industry has disengaged from efforts to reduce educational inequities.

Here’s the abstract to my commentary article (also in press, https://doi.org/10.1111/emip.12451), where I argue that Social Responsibility in College Admissions Requires a Reimagining of Standardized Testing.

As college admissions becomes more competitive in the United States and globally, with more applicants competing for limited seats, many programs are transitioning away from standardized testing as an application requirement, in part due to the concern that testing can perpetuate inequities among an increasingly diverse student population. In this article, I argue that we can only address this concern by reimagining standardized testing from the ground up. Following a summary of the recent debate around testing at the University of California (UC), I discuss how my perspective aligns with that of Koljatic et al. (in press), who encourage the testing industry to accept more social responsibility. Building on themes from the focus article and other recent publications, I then propose that, to contribute to educational equity, we must work toward testing that is more transparent and openly accessible than ever before.

Some Comments on Renewable or Non-disposable Assessment

If an assignment goes into the recycle bin, but there’s no one there to hear it, does it still make a sound?

I heard about renewable or non-disposable assessment a few years ago at the Open Education Conference, and I’ve seen it mentioned a few times since then in blog posts and papers, most recently a paper in Psychology Learning and Teaching by Seraphin et al.

It looks like David Wiley may have coined the terms disposable and renewable assignments. He wrote about them in a blog post on open pedagogy in 2013, and in another post in 2016.

The premise is that educational assessment often has limited utility outside the classroom experience, because it’s designed primarily to inform instruction and/or grading. Whether it’s an essay on the merits of school uniforms or an observational study of ladybugs, once the assignment is completed, we dispose of student work and move on.

In the 2013 post, Wiley says disposable assignments “add no value to the world.” And in the 2016 post he elaborates.

Try to imagine dedicating large swaths of your day to work you knew would never be seen, would never matter, and would literally end up in the garbage can. Maybe you don’t have to imagine – maybe some part of your work day is actually like that. If so, you may know the despair of looking forward and seeing only piles of work that don’t matter. And that’s how students frequently feel.

In contrast, non-disposable assessment (NDA) requires that students contribute to something beyond their individual coursework. The essays could be featured in the school newsletter, or the ladybug study could be part of a local citizen science project. Because NDA have broader utility and the potential for impact outside the classroom experience, we can expect students to be more engaged with them than with disposable assessments.

This all sounds fine, but I would clarify a few points. Note that I’m using assignment and assessment interchangeably, and I prefer the latter.

  • We can contrive them in younger grades, but NDA really only become feasible as students develop expertise, which is probably why NDA are discussed almost exclusively in the context of higher education, from what I’ve seen.
  • These concepts mostly aren’t new. The complete opposite of NDA might be busy-work, a term we’re all familiar with and try to avoid as instructors. NDA concepts overlap with anti-busy-work ideas from K12, including authentic assessment and performance assessment, which favor tasks that derive meaning from realistic problems and context. The key difference with NDA is that it results in something of value outside the assessment process itself.
  • Often, disposable assessments are disposable for a reason. They’re designed to give students immediate practice in something they’ve likely never encountered before. Students may not be comfortable sharing their novice work via Instagram or Wikipedia entries. NDA add exposure, and thus external pressures, that change the learning experience. NDA can also add constraints or extra requirements in format and style that detract from learning.

I like the idea of NDA. Really, any assessment should be designed to create as much value as possible, both within and outside the classroom experience. Educational technology and social media give students more opportunities than ever before to create and share content. Let’s use these tools to help students disseminate their work and contribute to the base of knowledge and resources, whenever such extended applications make sense.

That said, not every assessment can or should be NDA, and being so-called disposable doesn’t mean an assignment doesn’t matter. Wiley’s portrayal quoted above is kind of dramatic. At the very least, an assignment builds knowledge, skills, and abilities that inform next steps in the student’s own development. Often those next steps culminate in a larger project or portfolio of work. But, even if an assessment doesn’t have a tangible outcome, let’s not discount the value of intrinsic motivation in the completion of work that has no audience or recipient.

Is the Academic Achievement Gap a Racist Idea?

In this post I’m going to examine two of the main points from a 2016 article where Ibram Kendi argues that “the academic achievement gap between white and black students is a racist idea.” Similar arguments are made in this 2021 article from the National Education Association, which addresses “the racist beginnings of standardized testing.”

I agree that score gaps, our methods for measuring them, and our continuous discussion of them, can perpetuate educational inequities. Fixating on gaps can be counterproductive. However, I disagree somewhat with the claim from Kendi and others that the tests themselves are the main problem because, they argue, the tests 1) have origins in intelligence testing and 2) assess the wrong kinds of stuff.

Before I dig into these two points, a few preliminaries.

  • I recognize that the articles I’ve linked above are opinion pieces, intended to push the discussion forward while advocating for change, and that their formats may not allow for a comprehensive treatment of these points. My response has more to do with these points needing elaboration and context, and less to do with them being totally incorrect or unfounded.
  • NPR On Point did a series in 2019 on the achievement gap, with one of the interviews featuring Ibram Kendi and Prudence Carter, and both acknowledge the potential benefits of standardized testing. I recognize that Kendi’s 2016 article may not fully capture his perspective on gaps or testing.
  • The term achievement gap can hide the fact that differential academic performance by student group results from differential access and opportunity, the effects of which compound over time. I’ll use achievement here to be consistent with previous work.

Intelligence vs achievement

In his 2016 article, Kendi doesn’t make a clear distinction between intelligence and achievement. He transitions from the former to the latter while summarizing the history of standardized testing, but he refers to the achievement gap throughout, with the implication being that differences in intelligence are the same as, or close enough to, differences in achievement, such that they can be treated interchangeably.

Intelligence and achievement are two moderately correlated constructs, at least to the extent we can measure them accurately. They overlap, but they aren’t the same. Achievement can be improved through teaching and learning, whereas intelligence is thought to be more stable over time (though the Flynn effect raises questions here). Achievement is usually linked to concrete content that is the focus of instruction (eg, fractions, reading comprehension), whereas intelligence is more related to abstract aptitudes (eg, memory, pattern recognition).

An achievement gap is then an average difference in achievement for two or more groups of students, typically measured via standardized tests, with groups defined based on student demographics like race or gender.
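To make the definition concrete, a gap is often summarized as a standardized mean difference: the difference in group means divided by the pooled standard deviation. A minimal sketch with entirely hypothetical scale scores:

```python
import statistics

def standardized_gap(scores_a, scores_b):
    """Standardized mean difference (Cohen's d) between two groups,
    using the pooled standard deviation."""
    mean_a, mean_b = statistics.mean(scores_a), statistics.mean(scores_b)
    var_a, var_b = statistics.variance(scores_a), statistics.variance(scores_b)
    n_a, n_b = len(scores_a), len(scores_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean_a - mean_b) / pooled_sd

# Hypothetical scale scores for two small groups of students
group_a = [520, 540, 560, 580, 600]
group_b = [500, 515, 530, 545, 560]
print(round(standardized_gap(group_a, group_b), 2))  # → 1.07
```

A gap expressed this way is just a descriptive statistic; the interpretation questions discussed below are where the real debate lies.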

Data show that groups differ in variables related both to achievement and intelligence, but how and whether we can or need to interpret these group differences is up for debate. We set instructional and education policy goals based on achievement results. It’s not clear what we do with group differences in intelligence, which leads many to question the utility of analyzing intelligence by race, especially while attributing heritability (this Slate article by William Saletan summarizes the issue well).

Why is a distinction between constructs important? Because the limitations of intelligence testing don’t necessarily carry over into achievement. Both areas of testing involve standardization, but they differ in essential ways, including in design, content, administration, scoring, and use. Intelligence tests need not connect to a specific education system, whereas most achievement tests do (eg, see California content standards, the foundation of its annual end-of-year achievement tests, currently SBAC).

Both of the articles I linked at the start highlight some of the eugenic and racist origins of intelligence testing. Following the history into the 1960s and then 1990s, Kendi notes that genetic explanations for racial differences in intelligence have been disproven, but he still presents achievement testing and the achievement gap as a continuation of the original racist idea.

While intelligence as a construct is roughly 100 years old, standardized testing has actually been around for hundreds if not thousands of years (eg, Chinese civil service exams, from wikipedia). This isn’t to say achievement tests haven’t been used in racist ways in the US or elsewhere, but the methods themselves aren’t necessarily irredeemable simply because they resemble those used in intelligence testing.

Charles Murray, co-author on the controversial 1994 book The Bell Curve (mentioned by Kendi), also seems to conflate intelligence with achievement. Murray claims that persistent achievement gaps confirm his prediction that intelligence differences will remain relatively stable (see his comments at AEI.org). However, studies show that racial achievement gaps are to a large extent explained by other background variables and can be reduced through targeted intervention (summarized in this New York Magazine article, which is where I saw the Murray comments above; see also this article by Linda Darling-Hammond and this one by Prudence Carter). This research tells us achievement is malleable and should be treated separately from intelligence.

Kinds vs levels of achievement

Kendi and others argue that the contents of standardized tests don’t represent the kinds of achievement that are relevant to all students. The implication here is that differences in levels of achievement (ie, gaps) arise from biased test content, and can be explained by an absence of the kinds of achievement that are valued by or aligned with the experiences of underrepresented students. Kendi says:

Gathering knowledge of abstract items, from words to equations, that have no relation to our everyday lives has long been the amusement of the leisured elite. Relegating the non-elite to the basement of intellect because they do not know as many abstractions has been the conceit of the elite.

What if we measured literacy by how knowledgeable individuals are about their own environment: how much individuals knew all those complex equations and verbal and nonverbal vocabularies of their everyday life?

This sounds like culturally responsive pedagogy (here’s the wikipedia entry), where instruction, instructional materials, and even test content seek to represent and engage students of diverse cultures and backgrounds. We should aim to teach with our entire student population in mind, especially underrepresented groups, rather than via one-size-fits-all approaches that default to tradition or the majority. But we’re still figuring out how this applies to standards-based systems. And, though culturally responsive pedagogy may be optimal, we don’t know that achievement gaps hinge on it.

While I have seen examples of standardized achievement tests that rely on outdated or irrelevant content, I haven’t seen evidence showing that gaps would reduce significantly if we measured different kinds of achievement. Kendi doesn’t reference any evidence to support this claim.

Continuing on this theme, Kendi targets standardized tests themselves as perpetuating a racial hierarchy. He says:

The testing movement does not value multiculturalism. The testing movement does not value the antiracist equality of difference. The testing movement values the racist hierarchy of difference, and its bastard 100-year-old child: the academic achievement gap.

This might be true to some extent, but if our tests are constructed to assess generally the content that is taught in schools, an achievement gap should result more from inequitable access to quality instruction in that content, or the appropriateness of that content, than from testing itself. In this case, other variables like high school grade point average and graduation rate will also reflect achievement gaps to some extent. So, it may be that the concern is more related to standardized education not valuing multiculturalism than to standardized testing.

Whatever the reasons, I agree that multiculturalism hasn’t been a priority in the testing movement over the past century. This has bothered me since I started psychometric work over ten years ago. Standardization pushes us to materials devoid of context that is meaningful at the individual or subgroup levels. Fortunately, I am seeing more discussion of this issue in the educational and psychological measurement literature (eg, this article by Stephen Sireci) and am excited for the potential.

Final thoughts

Although my comments here have been critical of the anti-testing and anti-gap arguments, I agree with the general concern around how we discuss and interpret achievement gaps. I wouldn’t say that standardized testing is solely to blame, but I do question the utility in spending so much time measuring and reporting on achievement differences by student groups, especially when we know that these differences mostly reflect access and opportunity gaps. The pandemic has only heightened these concerns.

Returning to the question in the title of this post, is the academic achievement gap a racist idea, I would say, yes, sometimes. Gaps can be misinterpreted in racist ways as being heritable and immutable. To the extent that documenting achievement gaps contributes to inequities, I would agree that the process itself can become a racist one.

That said, research indicates that we can document and address achievement gaps in productive ways, in which case valid measurement is essential. As you might guess, I would aim for better testing instead of zero testing, including measures that are less standardized and more individualized and culturally responsive. The challenge here will be convincing test developers and users that we can move away from norm-referenced score comparisons without losing valuable information.

I didn’t really get into achievement gap research here, outside of a narrow critique of standardized testing. If you’re looking for more, I recommend the articles by Linda Darling-Hammond and Prudence Carter linked above, as well as the NPR On Point series. There’s also this 2006 article by Gloria Ladson-Billings based on her presidential address to the American Educational Research Association. Amy Stuart Wells continues the discussion in her 2019 presidential address, on Youtube.

Limitations of Implicit Association Testing for Racial Bias

Apparently, implicit association testing (IAT) has been overhyped. Much like grit and power posing, two higher profile letdowns in pop psychology, implicit bias seems to have attracted more attention than is justified by research. Twitter pointed me to a couple articles from 2017 that clarify the limitations of IAT for racial bias.

https://www.vox.com/identities/2017/3/7/14637626/implicit-association-test-racism

https://www.thecut.com/2017/01/psychologys-racism-measuring-tool-isnt-up-to-the-job.html

The Vox article covers these main points.

  • The IAT might work to assess bias in the aggregate, for a group of people or across repeated testing for the same person.
  • It can’t actually predict individual racial bias.
  • The limitations of the IAT don’t mean that racism isn’t real, just that implicit forms of it are hard to measure.
  • As a result, focusing on implicit bias may not help in fighting racism.

The second article from New York Magazine, The Cut, gives some helpful references and outlines a few measurement concepts.

There’s an entire field of psychology, psychometrics, dedicated to the creation and validation of psychological instruments, and instruments are judged based on whether they exceed certain broadly agreed-upon statistical benchmarks. The most important benchmarks pertain to a test’s reliability — that is, the extent to which the test has a reasonably low amount of measurement error (every test has some) — and to its validity, or the extent to which it is measuring what it claims to be measuring. A good psychological instrument needs both.

Reliability for the IAT appears to land below 0.50, based on test-retest correlations. Interpretations of reliability depend on context and there aren’t clear standards, but in my experience 0.60 is usually considered too low to be useful. Here, 0.50 would indicate that 50% of the observed variance in scores can be attributed to consistent and meaningful measurement, whereas the other 50% is unpredictable measurement error.

I haven’t seen reporting on the actual scores that determine whether someone has or does not have implicit bias. Psychometrically, there should be a scale, and it should incorporate decision points or cutoffs beyond which a person is reported to have a strong, weak, or negligible bias.

Until I find some info on scaling, let’s assume that the final IAT result is a z-score centered at 0 (no bias) with a standard deviation of 1 (capturing the average variability). A reliability of 0.50, the best case scenario, gives us a standard error of measurement (SEM) of 0.71, meaning we can expect scores to fluctuate by 0.71 points on average due to random noise alone.
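The 0.71 follows from the classical test theory formula SEM = SD × √(1 − reliability). A quick sketch using the assumed z-score scale (SD = 1):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement under classical test theory:
    SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

print(round(sem(1.0, 0.50), 2))  # → 0.71
```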


Without knowing the score scale and how it’s implemented, we don’t know the ultimate impact of an SEM of 0.71, but we can say that score changes across much of the scale are uninterpretable. A score of +1, or one standard deviation above the mean, still contains 0 within its 95% confidence interval. A 95% confidence interval for a score of 0, in this case no bias, ranges from roughly -1.41 to +1.41.
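These intervals come from score ± roughly 2 SEM, the usual rough multiplier for 95% coverage. A sketch, keeping the z-score scale assumed earlier:

```python
import math

def ci_95(score, sd=1.0, reliability=0.50):
    """Approximate 95% confidence interval: score +/- 2 * SEM."""
    sem = sd * math.sqrt(1 - reliability)
    return (score - 2 * sem, score + 2 * sem)

lower, upper = ci_95(1.0)   # one SD above the mean
print(round(lower, 2), round(upper, 2))   # interval still includes 0
lower0, upper0 = ci_95(0.0)  # a score of exactly 0 (no bias)
print(round(lower0, 2), round(upper0, 2))  # roughly -1.41 to +1.41
```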

The authors of the test acknowledge that results can’t be interpreted reliably at the individual level, but their use in practice suggests otherwise. I took the online test a few times (at https://implicit.harvard.edu/) and the score report at the end includes phrasing like, “your responses suggest a strong automatic preference…” This is followed by a disclaimer.

These IAT results are provided for educational purposes only. The results may fluctuate and should not be used to make important decisions. The results are influenced by variables related to the test (e.g., the words or images used to represent categories) and the person (e.g., being tired, what you were thinking about before the IAT).

The disclaimer is on track, but a more honest and transparent message would include a simple index of unreliability, like we see in reports for state achievement test scores.

Really though, if score interpretation at the individual level isn’t recommended, why are individuals provided with a score report?

Correlations between implicit bias scores and other variables, like explicit bias or discriminatory behavior, are also weaker than I’d expect given the amount of publicity the test has received. The original authors of the test reported an average validity coefficient (from meta-analysis) of 0.236 (Greenwald, Poehlman, Uhlmann, & Banaji, 2009; Greenwald, Banaji, & Nosek, 2015), whereas critics of the test reported a more conservative 0.148 (Oswald, Mitchell, Blanton, Jaccard, & Tetlock, 2013). At best, the IAT predicts 6% of the variability in measures of explicit racial bias; at worst, 2%.
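The variance-explained figures are just the squared correlations (r²):

```python
# Squaring a validity coefficient gives the proportion of variance
# in the criterion explained by IAT scores.
for r in (0.236, 0.148):
    print(f"r = {r}: r^2 = {r**2:.3f}")
# 0.236**2 ≈ 0.056 (about 6%); 0.148**2 ≈ 0.022 (about 2%)
```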

The implication here is that implicit bias gets more coverage than it currently deserves. We don’t actually have a reliable way of measuring it, and even in aggregate form scores are only weakly correlated, if at all, with more overt measures of bias, discrimination, and stereotyping. Validity evidence is lacking.

This isn’t to say we shouldn’t investigate or talk about implicit racial bias. Instead, we should recognize that IAT may not produce the clean, actionable results that we’re expecting, and our time and resources may be better spent elsewhere if we want our trainings and education to have an impact.

References

Greenwald, A. G., Banaji, M. R., & Nosek, B. A. (2015). Statistically small effects of the Implicit Association Test can have societally large effects. Journal of Personality and Social Psychology, 108(4), 553–561.

Greenwald, A. G., Poehlman, T. A., Uhlmann, E. L., & Banaji, M. R. (2009). Understanding and using the Implicit Association Test: III. Meta-analysis of predictive validity. Journal of Personality and Social Psychology, 97, 17–41.

Oswald, F. L., Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2013). Predicting ethnic and racial discrimination: A meta-analysis of IAT criterion studies. Journal of Personality and Social Psychology, 105, 171–192.