Omnibus tests of differential item functioning

Differential item functioning (DIF) analysis with intersectional groups or interaction effects requires many more statistical comparisons than traditional groupings do. In large-scale, high-stakes testing programs, we usually look at

  1. race/ethnicity (with around five levels),
  2. sex/gender (two levels),
  3. special education status (two levels),
  4. advantage status (two or three levels), and
  5. language status (two levels).

A grouping variable with $k = 5$ levels produces $k - 1$ DIF comparisons per item (each group level compared with the reference level), so the traditional approach might involve $4 + 1 + 1 + 1 + 1 = 8$ total comparisons per item (nine if advantage status has three levels). An intersectional approach, in contrast, would involve as many as $5 \times 2 \times 2 \times 2 \times 2 - 1 = 79$ (all intersectional levels compared with a single reference intersectional level).
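
A quick check of those counts in R:

# Comparisons per item, traditional vs fully crossed intersectional
n_levels <- c(5, 2, 2, 2, 2)
sum(n_levels - 1)   # traditional: 8
prod(n_levels) - 1  # intersectional: 79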

The increase in comparisons inflates the Type I error rate (items flagged for DIF by chance). We could adjust for false positives, but a simpler workaround is to use generalized, omnibus tests in place of repeated pairwise comparisons. Penfield (2001) demonstrated a generalized Mantel-Haenszel method, and Magis et al. (2011) a generalized logistic regression method. Neither study involved interactions between groups, but the methods still apply. In both, the generalized methods tended to be more powerful, and a practical benefit is that they're simpler to implement, with only one test run per item. Finch (2016) compared several such methods.

See this post from 2022 for some background on DIF with intersectional or interaction effects.

Generalized methods also make it easier to reconsider our reference level. The standard approach in DIF is to refer all comparisons to the historically advantaged test takers: White, English-speaking, male, and so on. My coauthors and I (Albano et al., 2024) gave some recommendations on this point.

The convention of using the historically advantaged test takers as the reference group should also be reconsidered. Because DIF is usually structured as a relative comparison, results will not change except in their signs if reference and focal groups are switched. However, moving away from the conventional reference groups will help to decenter Male and White in discussions of test performance. Alternatives include centering more diverse groups, using models that evaluate DIF effects within groups relative to their own means (we have not seen this method used before) or using models that compare effects against an aggregate of all groups (e.g., Austin & French, 2020).
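
One way to implement the last of these, comparing each group against the aggregate, is deviation coding in a regression-based DIF model. A minimal sketch:

# Deviation (sum-to-zero) coding estimates each group's effect
# relative to the unweighted mean across groups, rather than
# relative to a single reference group
grp <- factor(paste0("g", 1:9))
contrasts(grp) <- contr.sum(nlevels(grp))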

Omnibus testing comes from ANOVA, where we analyze a set of effects together as a whole before investigating specific comparisons. In the context of DIF, we might not even bother with the post hoc comparisons, since an omnibus DIF flag would disqualify an item regardless of the source(s) of variation.

Here’s a simple demonstration with 9 groups, 400 people per group, 20 items, one item having DIF of 0.6 logits for three of the groups, and group abilities normally distributed with means ranging from -1 to 1 logits.

# Load tidyverse
library("tidyverse")

# We also need the epmr package from github
# devtools::install_github("talbano/epmr")

# Generating parameters
set.seed(260120)
ni <- 20
ng <- 9
np <- 400

# 3PL item parameters (a, b, and fixed pseudo-guessing c)
ip <- data.frame(a = exp(rnorm(ni, 0, 0.1225)), b = rnorm(ni), c = .2) |> 
  list() |> rep(ng)

# Induce DIF of 0.6 logits on item 1 for groups 2, 6, and 9
ip[[2]]$b[1] <- ip[[6]]$b[1] <- ip[[9]]$b[1] <- ip[[2]]$b[1] + .6

# Theta with range of means by group
theta <- lapply(seq(-1, 1, length = ng), function(m) rnorm(np, m, 1))

# Generate scores by group
scores <- lapply(1:ng, function(g) epmr::irtsim(ip[[g]], theta[[g]]))
scores <- data.frame(rep(paste0("g", 1:ng), each = np),
  do.call("rbind", scores)) |> setNames(c("group", paste0("i", 1:ni)))

DIF is estimated for the first two items with dichotomous Mantel-Haenszel (MH) and logistic regression (LR), and then with generalized methods, using functions from the epmr package.

# Dichotomous DIF, MH and LR
# Eight comparisons per item with g5 as reference group
focal <- paste0("g", 1:ng)[-5]
out_mh <- out_lr <- vector("list", length = ng - 1) |> setNames(focal)
for (g in focal) {
  index <- scores$group %in% c("g5", g)
  temp_scores <- scores[index, -1]
  temp_group <- scores$group[index]
  out_mh[[g]] <- epmr::difstudy(temp_scores, temp_group, ref = "g5",
    method = "mh", dif_items = 1:2, anchor_items = 3:ni)$uniform
  out_lr[[g]] <- epmr::difstudy(temp_scores, temp_group, ref = "g5",
    method = "lr", dif_items = 1:2, anchor_items = 3:ni)$uniform
}

# GMH and GLR DIF
out_gmh <- epmr::difstudy(scores[, -1], scores$group, ref = "g5",
  method = "mh", dif_items = 1:2, anchor_items = 3:ni)$uniform
out_glr <- epmr::difstudy(scores[, -1], scores$group, ref = "g5",
  method = "lr", dif_items = 1:2, anchor_items = 3:ni)$uniform

The generalized methods flagged item 1 (GMH p < .001, GLR p < .001) and item 2 (GMH p = .004, GLR p = .004) for DIF. The next table shows the results for all of the dichotomous comparisons. Even at an unadjusted significance level of .05, MH and LR don't flag any DIF on items 1 or 2 because the thresholds for practical significance (delta for MH, and the change in R-squared, r2d, for LR) aren't met.

item group  delta  mh_p ets_lev  lr_p   r2d zum_lev
i1   g1    -0.864 0.034       a 0.023 0.008       a
i1   g2    -1.276 0.001       b 0.001 0.019       a
i1   g3    -0.321 0.445       a 0.508 0.001       a
i1   g4    -0.167 0.714       a 0.665 0.000       a
i1   g6    -1.110 0.004       b 0.001 0.016       a
i1   g7     0.535 0.216       a 0.138 0.004       a
i1   g8     0.582 0.203       a 0.156 0.003       a
i1   g9    -0.068 0.946       a 0.975 0.000       a
i2   g1    -0.245 0.594       a 0.598 0.000       a
i2   g2    -0.268 0.523       a 0.525 0.001       a
i2   g3     0.290 0.492       a 0.567 0.001       a
i2   g4     0.246 0.553       a 0.380 0.001       a
i2   g6     1.007 0.010       b 0.006 0.013       a
i2   g7     0.810 0.044       a 0.066 0.006       a
i2   g8     0.883 0.038       a 0.041 0.007       a
i2   g9     0.935 0.036       a 0.036 0.007       a
Dichotomous MH and LR DIF results by item and focal group
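
For reference, here's a rough sketch of the ETS A/B/C classification behind the ets_lev column. Note the simplification: the full rule for level c requires |delta| to be significantly greater than 1, and the overall significance test stands in for that here.

# Simplified ETS levels for MH DIF: a = negligible, b = moderate,
# c = large (the test of |delta| against 1 is approximated)
ets_level <- function(delta, p, alpha = .05) {
  ifelse(abs(delta) < 1 | p > alpha, "a",
    ifelse(abs(delta) > 1.5, "c", "b"))
}
ets_level(delta = c(-0.864, -1.276), p = c(.034, .001))
# [1] "a" "b"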

We can also conduct omnibus DIF tests under an item response theory framework. I mentioned this in my earlier post on DIF with interaction effects. Using the lme4 package, a simple and direct approach is to estimate random effects for item, person, and group, along with an interaction between group and each item to be tested for DIF. Here's some code.

# Stack the data for lme4
scores_long <- scores |> as_tibble() |> mutate(person = row_number(), .before = 1) |> 
  pivot_longer(cols = i1:i20, names_to = "item", values_to = "score") |> 
  mutate(i1 = ifelse(item == "i1", 1, 0), i2 = ifelse(item == "i2", 1, 0))

# Fit the models
# r2, with DIF estimated for items 1 and 2, is singular
r0 <- lme4::glmer(score ~ 0 + (1 | item) + (1 | person) + (1 | group),
  family = binomial, data = scores_long,
  control = lme4::glmerControl(optimizer = "bobyqa"))
r1 <- lme4::glmer(score ~ 0 + (1 | item) + (1 | person) + (1 + i1 | group),
  family = binomial, data = scores_long,
  control = lme4::glmerControl(optimizer = "bobyqa"))
r2 <- lme4::glmer(score ~ 0 + (1 | item) + (1 | person) + (1 + i1 + i2 | group),
  family = binomial, data = scores_long,
  control = lme4::glmerControl(optimizer = "bobyqa"))

# Compare r0 and r1 to test for DIF
anova(r0, r1)
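
# Compare r1 and r2 to test for DIF on item 2 (r2 was singular,
# as noted above, so interpret this comparison with caution)
anova(r1, r2)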

Here’s a subset of the ANOVA table output comparing fit for models with and without the DIF interaction term for item 1. After the table is the lme4 output for model r1. Person, item, and group have logit standard deviations of 0.71, 0.57, and 0.64 respectively, and performance on item 1 varies over groups with standard deviation 0.22.

   npar   AIC  logLik -2*log(L) Chisq Pr(>Chisq)
r0    3 86588  -43291     86582    NA         NA
r1    5 86577  -43284     86567    14     0.0007
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
Family: binomial ( logit )
Formula: score ~ 0 + (1 | item) + (1 | person) + (1 + i1 | group)
Data: scores_long

     AIC      BIC    logLik -2*log(L) df.resid
86577.40 86623.32 -43283.70  86567.40    71995

Random effects:
 Groups Name        Std.Dev. Corr
 person (Intercept) 0.7070
 item   (Intercept) 0.5737
 group  (Intercept) 0.6444
        i1          0.2165   0.33
Number of obs: 72000, groups: person, 3600; item, 20; group, 9
No fixed effect coefficients
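
To trace the omnibus flag for item 1 back to particular groups, one quick post hoc check (not shown in the output above) is the predicted group-level random effects.

# The i1 column gives each group's predicted deviation on item 1,
# beyond its overall group intercept
lme4::ranef(r1)$group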

Finally, we can conduct omnibus DIF analysis using the mirt R package. The first step is to estimate a multi-group Rasch model. The mirt::DIF function then takes the multi-group output and compares models with the item difficulty parameter (d) constrained equal versus free across groups, for each DIF item tested. The table below shows the output, where AIC and chi-square (X2) indicate DIF for item 1, and no fit indices indicate DIF for item 2.

mirt_mg <- mirt::multipleGroup(scores[, -1], model = 1,
  itemtype = "Rasch", group = scores[, 1],
  invariance = c(paste0("i", 3:20), "free_means", "free_variances"))
mirt_dif <- mirt::DIF(mirt_mg, "d", items2test = c("i1", "i2"))

      AIC SABIC    HQ   BIC    X2 df    p
i1 -17.52  6.57  0.13 31.99 33.52  8 0.00
i2   7.79 31.88 25.43 57.30  8.21  8 0.41
Model comparison output from mirt::DIF

References

Albano, T., French, B. F., & Vo, T. T. (2024). Traditional vs intersectional DIF analysis: Considerations and a comparison using state testing data. Applied Measurement in Education, 37(1), 57-70.

Austin, B. W., & French, B. F. (2020). Adjusting group intercept and slope bias in predictive equations. Methodology, 16(3), 241–257.

Finch, W. H. (2016). Detection of differential item functioning for more than two groups: A Monte Carlo comparison of methods. Applied Measurement in Education, 29(1), 30-45.

Magis, D., Raîche, G., Béland, S., & Gérard, P. (2011). A generalized logistic regression procedure to detect differential item functioning among multiple groups. International Journal of Testing, 11(4), 365-386.

Penfield, R. D. (2001). Assessing differential item functioning among multiple groups: A comparison of three Mantel-Haenszel procedures. Applied Measurement in Education, 14(3), 235-259.

When to Use Cronbach’s Coefficient Alpha? An Overview and Visualization with R Code

This post follows up on a previous one where I gave a brief overview of so-called coefficient alpha and recommended against its overuse and traditional attribution to Cronbach. Here, I’m going to cover when to use alpha, also known as tau-equivalent reliability $\rho_T$, and when not to use it, with some demonstrations and plotting in R.

We’re referring to alpha now as tau-equivalent reliability because it’s a more descriptive label that conveys the assumptions supporting its use, again following conventions from Cho (2016).

As I said last time, these concepts aren’t new. They’ve been debated in the literature since the 1940s, with the following conclusions.

  1. $\rho_T$ underestimates the actual reliability when the assumptions of tau-equivalence aren't met, which is likely often the case.
  2. $\rho_T$ is not an index of unidimensionality; multidimensional tests can still produce strong reliability estimates.
  3. $\rho_T$ is sensitive to test length; long tests can produce strong reliability estimates even when items are weakly related to one another.

For each of these points I’ll give a summary and demonstration in R.

Assuming tau equivalence

The main assumption in tau-equivalence is that, in the population, all the items in our test have the same relationship with the underlying construct, labeled tau or $\tau$. This assumption can be expressed in terms of factor loadings or inter-item covariances: under tau-equivalence, the loadings are equal, or, equivalently, the covariances are the same across all pairs of items.

The difference between the tau-equivalent model and the more stringent parallel model is that the latter additionally constrains error variances to be equal, whereas these are free to vary under tau-equivalence. The congeneric model is the least restrictive in that it allows both the factor loadings (or inter-item covariances) and the uniquenesses (error variances) to vary across items.
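
Tau-equivalence can be tested directly, for example, with a likelihood ratio test comparing constrained and unconstrained factor models. Here's a minimal sketch using the lavaan package (which isn't used elsewhere in this post) on simulated congeneric data:

# Compare a congeneric model (loadings free) against a
# tau-equivalent model (loadings constrained equal)
library("lavaan")

# Simulate four congeneric items with unequal loadings
dat <- psych::sim.congeneric(loads = c(.3, .5, .7, .9),
  N = 200, short = FALSE)$observed |> as.data.frame() |>
  setNames(paste0("i", 1:4))

# Loadings free to vary
fit_cong <- cfa("f =~ i1 + i2 + i3 + i4", data = dat)
# Loadings constrained equal via the shared label "l"
fit_tau <- cfa("f =~ l*i1 + l*i2 + l*i3 + l*i4", data = dat)

# Likelihood ratio test of the equality constraints
anova(fit_tau, fit_cong)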

Tau-equivalence is a strong assumption, one that isn’t typically evaluated in practice. Here’s what can happen when it is violated. I’m simulating a test with 20 items that correlate with a single underlying construct to different degrees. At one extreme, the true loadings range from 0.05 to 0.95. At the other extreme, loadings are all 0.50. The mean of the loadings is always 0.50.

This scatterplot shows the loadings per condition as they increase from varying at the bottom, as permitted with the congeneric model, to similar at the top, as required by the tau-equivalent model. Tau-equivalent or coefficient alpha reliability should be most accurate in the top condition, and least accurate in the bottom one.

# Load tidyverse package
# Note the epmr and psych packages are also required
# psych is on CRAN, epmr is on GitHub at talbano/epmr
library("tidyverse")

# Build list of factor loadings for 20 item test
ni <- 20
lm <- lapply(1:10, function(x)
  seq(0 + x * .05, 1 - x * .05, length = ni))

# Visualize the levels of factor loadings
tibble(condition = factor(rep(1:length(lm), each = ni)),
  loading = unlist(lm)) %>%
  ggplot(aes(loading, condition)) + geom_point()
Factor loadings across ten range conditions

For each of the ten loading conditions, the simulation involved generating 1,000 data sets, each with 200 test takers, and estimating congeneric and tau-equivalent reliability for each. The table below shows the means of the reliability estimates, labeled $\rho_T$ for tau-equivalent and $\rho_C$ for congeneric, per condition, labeled lm.

# Set seed, reps, and output container
set.seed(201210)
reps <- 1000
sim_out <- tibble(lm = numeric(), rep = numeric(),
  omega = numeric(), alpha = numeric())

# Simulate via two loops, j through levels of
# factor loadings, i through reps
for (j in seq_along(lm)) {
  for (i in 1:reps) {
    # Congeneric data are simulated using the psych package
    temp <- psych::sim.congeneric(loads = lm[[j]],
      N = 200, short = F)
    # Alpha and omega are estimated using the epmr package
    sim_out <- bind_rows(sim_out, tibble(lm = j, rep = i,
      omega = epmr::coef_omega(temp$r, sigma = T),
      alpha = epmr::coef_alpha(temp$observed)$alpha))
  }
}

lm $\rho_T$ $\rho_C$ diff
1 0.8662 0.8807 -0.0145
2 0.8663 0.8784 -0.0121
3 0.8665 0.8757 -0.0093
4 0.8668 0.8735 -0.0067
5 0.8673 0.8720 -0.0047
6 0.8673 0.8706 -0.0032
7 0.8680 0.8701 -0.0020
8 0.8688 0.8699 -0.0011
9 0.8686 0.8692 -0.0006
10 0.8681 0.8685 -0.0004
Mean reliabilities by condition

The last column in this table shows the difference between $\rho_T$ and $\rho_C$. Alpha or $\rho_T$ always underestimates omega or $\rho_C$, and the discrepancy is largest in condition lm 1, where the tau-equivalent assumption of equal loadings is most clearly violated. Here, $\rho_T$ underestimates reliability by 0.0145 on average. As we progress toward equal factor loadings in lm 10, $\rho_T$ approximates $\rho_C$.

Dimensionality

Tau-equivalent reliability is often misinterpreted as an index of unidimensionality. But $\rho_T$ doesn’t tell us directly how unidimensional our test is. Instead, like parallel and congeneric reliabilities, $\rho_T$ assumes our test measures a single construct or factor. If our items load on multiple distinct dimensions, $\rho_T$ will probably decrease but may still be strong.

Here’s a simple demonstration where I’ll estimate $\rho_T$ for tests simulated to have different amounts of multidimensionality, from completely unidimensional (correlation matrix is all 1s) to completely multidimensional across three factors (correlation matrix with three clusters of 1s). There are nine items.

The next table shows the generating correlation matrix for one of the 11 conditions examined. The three clusters of items (1 through 3, 4 through 6, and 7 through 9) always had perfect correlations, regardless of condition. The remaining off-cluster correlations were fixed within a condition, at values running from 1.0 down to 0.0 in steps of 0.1 across conditions. Here, they're fixed at 0.2, a condition with strong multidimensionality from the three clusters and a mild effect from the general factor.

i1 i2 i3 i4 i5 i6 i7 i8 i9
i1 1.0 1.0 1.0 0.2 0.2 0.2 0.2 0.2 0.2
i2 1.0 1.0 1.0 0.2 0.2 0.2 0.2 0.2 0.2
i3 1.0 1.0 1.0 0.2 0.2 0.2 0.2 0.2 0.2
i4 0.2 0.2 0.2 1.0 1.0 1.0 0.2 0.2 0.2
i5 0.2 0.2 0.2 1.0 1.0 1.0 0.2 0.2 0.2
i6 0.2 0.2 0.2 1.0 1.0 1.0 0.2 0.2 0.2
i7 0.2 0.2 0.2 0.2 0.2 0.2 1.0 1.0 1.0
i8 0.2 0.2 0.2 0.2 0.2 0.2 1.0 1.0 1.0
i9 0.2 0.2 0.2 0.2 0.2 0.2 1.0 1.0 1.0
Correlation matrix showing some multidimensionality

The simulation again involved generating 1,000 tests, each with 200 test takers, for each condition.

# This will print out the correlation matrix for the
# condition shown in the table above
psych::sim.general(nvar = 9, nfact = 3, g = .2, r = .8)

# Set seed, reps, and output container
set.seed(201211)
reps <- 1000
dim_out <- tibble(dm = numeric(), rep = numeric(),
  alpha = numeric())

# Simulate via two loops, j through levels of
# dimensionality, i through reps
for (j in seq(0, 1, .1)) {
  for (i in 1:reps) {
    # Data are simulated using the psych package
    temp <- psych::sim.general(nvar = 9, nfact = 3,
      g = 1 - j, r = j, n = 200)
    # Estimate alpha with the epmr package
    dim_out <- bind_rows(dim_out, tibble(dm = j, rep = i,
      alpha = epmr::coef_alpha(temp)$alpha))
  }
}

Results below show that mean $\rho_T$ starts out at 1.00 in the unidimensional condition dm1, and decreases to 0.75 in the most multidimensional condition dm11, where the off-cluster correlations were all 0.

The example correlation matrix above corresponds to dm9, showing that a relatively weak general dimension, with prominent group dimensions, still produces mean $\rho_T$ of 0.86.

 dm1  dm2  dm3  dm4  dm5  dm6  dm7  dm8  dm9 dm10 dm11
1.00 0.99 0.98 0.97 0.96 0.94 0.92 0.89 0.86 0.81 0.75
Mean alphas for 11 conditions of multidimensionality

Test length

The last demonstration shows how $\rho_T$ strengthens with test length, despite weak factor loadings or weak relationships among items. I'm simulating tests containing 10 to 200 items. For each test length condition, I generate 100 tests using a congeneric model with all loadings fixed at 0.20.

# Set seed, reps, and output container
set.seed(201212)
reps <- 100
tim_out <- tibble(tm = numeric(), rep = numeric(),
  alpha = numeric())

# Simulate via two loops, j through levels of
# test length, i through reps
for (j in 10:200) {
  for (i in 1:reps) {
    # Congeneric data are simulated using the psych package
    temp <- psych::sim.congeneric(loads = rep(.2, j),
      N = 200, short = F)
    tim_out <- bind_rows(tim_out, tibble(tm = j, rep = i,
      alpha = epmr::coef_alpha(temp$observed)$alpha))
  }
}

The plot below shows $\rho_T$ on the y-axis for each test length condition on x. The black line captures mean alpha and the ribbon captures the standard deviation over replications for a given condition.

# Summarize with mean and sd of alpha
tim_out %>% group_by(tm) %>%
  summarize(m = mean(alpha), se = sd(alpha)) %>%
  ggplot(aes(tm, m)) + geom_ribbon(aes(ymin = m - se, 
    ymax = m + se), fill = "lightblue") +
  geom_line() + xlab("test length") + ylab("alpha")
Alpha as a function of test length when factor loadings are fixed at 0.20

Mean $\rho_T$ starts out low at 0.30 for test length 10 items, but surpasses the 0.70 threshold once we hit 56 items. With test length 100 items, we have $\rho_T$ above 0.80, despite having the same weak factor loadings.
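
These results track what the Spearman-Brown formula predicts. With equal loadings of 0.20, the implied inter-item correlation is $\bar{r} = 0.20^2 = 0.04$, and the reliability of a $k$-item composite is approximately $k\bar{r} / (1 + (k - 1)\bar{r})$. A quick check:

# Predicted reliability via Spearman-Brown, with the inter-item
# correlation implied by equal loadings of 0.20
r <- .2^2
sb_rho <- function(k, r) k * r / (1 + (k - 1) * r)
round(sb_rho(c(10, 56, 100), r), 2)
# [1] 0.29 0.70 0.81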

When to use tau-equivalent reliability?

These simple demonstrations highlight some of the main limitations of tau-equivalent or alpha reliability. To recap:

  1. As the assumption of tau-equivalence will rarely be met in practice, $\rho_T$ will tend to underestimate the actual reliability for our test, though the discrepancy may be small as shown in the first simulation.
  2. $\rho_T$ decreases somewhat with departures from unidimensionality, but stays relatively strong even with clear multidimensionality.
  3. Test length compensates surprisingly well for low factor loadings and inter-item relationships, producing respectable $\rho_T$ after 50 or so items.

The main benefit of $\rho_T$ is that it’s simpler to calculate than $\rho_C$. Tau-equivalence is thus recommended when circumstances like small sample size make it difficult to fit a congeneric model. We just have to interpret tau-equivalent results with caution, and then plan ahead for a more comprehensive evaluation of reliability.

References

Cho, E. (2016). Making reliability reliable: A systematic approach to reliability coefficients. Organizational Research Methods, 19, 651-682. https://doi.org/10.1177/1094428116656239

Visualizing Conditional Standard Error in the GRE

Below is some R code for visualizing measurement error across the GRE score scale, plotted against percentiles. Data come from an ETS report at https://www.ets.org/s/gre/pdf/gre_guide.pdf.

The plot shows conditional standard error of measurement (SEM) for GRE verbal scores. SEM is the expected average variability in scores attributable to random error in the measurement process. For details, see my last post.

Here, the SEM is conditional on GRE score, with more error evident at lower verbal scores, and less at higher scores where measurement is more precise. As with other forms of standard error, the SEM can be used to build confidence intervals around an estimate. The plot has ribbons for 68% and 95% confidence intervals, based on +/- 1 and 2 SEM.
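
For example, a verbal score of 150 has a conditional SEM of 2.3 (from the values pulled into the code below), so an approximate 95% interval is:

# Approximate 95% interval for a verbal score of 150,
# which has a conditional SEM of 2.3
150 + c(-2, 2) * 2.3
# [1] 145.4 154.6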

# Load ggplot2 package
library("ggplot2")

# Put percentiles into data frame, pasting from ETS
# report Table 1B
pct <- data.frame(gre = 170:130,
matrix(c(99, 96, 99, 95, 98, 93, 98, 90, 97, 89,
  96, 86, 94, 84, 93, 82, 90, 79, 88, 76, 86, 73,
  83, 70, 80, 67, 76, 64, 73, 60, 68, 56, 64, 53,
  60, 49, 54, 45, 51, 41, 46, 37, 41, 34, 37, 30,
  33, 26, 29, 23, 26, 19, 22, 16, 19, 13, 16, 11,
  14, 9, 11, 7, 9, 6, 8, 4, 6, 3, 4, 2, 3, 2, 2,
  1, 2, 1, 1, 1, 1, 1, 1, 1),
  nrow = 41, byrow = T))

# Add variable names
colnames(pct)[2:3] <- c("pct_verbal", "pct_quant")

# Subset and add conditional SEM from Table 5E
sem <- data.frame(pct[c(41, seq(36, 1, by = -5)), ],
  sem_verbal = c(3.9, 3.5, 2.9, 2.5, 2.3, 2.1, 2.1,
    2.0, 1.4),
  sem_quant = c(3.5, 2.9, 2.4, 2.2, 2.1, 2.0, 2.1,
    2.1, 1.0),
  row.names = NULL)

# Plot percentiles on x and GRE on y with
# error ribbons
ggplot(sem, aes(pct_verbal, gre)) +
  geom_ribbon(aes(ymin = gre - sem_verbal * 2,
    ymax = gre + sem_verbal * 2),
    fill = "blue", alpha = .2) +
  geom_ribbon(aes(ymin = gre - sem_verbal,
    ymax = gre + sem_verbal),
    fill = "red", alpha = .2) +
  geom_line()

Demo Code from Recent Paper in APM

A colleague and I recently published a paper in Applied Psychological Measurement titled Linking With External Covariates: Examining Accuracy by Anchor Type, Test Length, Ability Difference, and Sample Size. A pre-print copy is available here.

As the title suggests, we looked at some psychometric situations wherein the process of linking measurement scales could benefit from external information. Here’s the abstract.

Research has recently demonstrated the use of multiple anchor tests and external covariates to supplement or substitute for common anchor items when linking and equating with nonequivalent groups. This study examines the conditions under which external covariates improve linking and equating accuracy, with internal and external anchor tests of varying lengths and groups of differing abilities. Pseudo forms of a state science test were equated within a resampling study where sample size ranged from 1,000 to 10,000 examinees and anchor tests ranged in length from eight to 20 items, with reading and math scores included as covariates. Frequency estimation linking with an anchor test and external covariate was found to produce the most accurate results under the majority of conditions studied. Practical applications of linking with anchor tests and covariates are discussed.

The study is somewhat novel in its use of resampling at both the person and item levels. The result is a different sample of test takers taking a different sample of items at each study replication. I created an Rmarkdown file (saved as txt) that demonstrates the process for a reduced set of conditions.
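
In spirit, each replication does something like the following (a toy sketch with hypothetical names, not code from the paper):

# Draw a random sample of people (rows) and a random sample of
# items (columns) at each replication
resample_both <- function(scores, n_people, n_items) {
  scores[sample(nrow(scores), n_people),
    sample(ncol(scores), n_items), drop = FALSE]
}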

multi-anchor-demo.txt
multi-anchor-demo.html