When to Use Cronbach’s Coefficient Alpha? An Overview and Visualization with R Code

This post follows up on a previous one where I gave a brief overview of so-called coefficient alpha and recommended against its overuse and traditional attribution to Cronbach. Here, I’m going to cover when to use alpha, also known as tau-equivalent reliability $\rho_T$, and when not to use it, with some demonstrations and plotting in R.

We’re referring to alpha now as tau-equivalent reliability because it’s a more descriptive label that conveys the assumptions supporting its use, again following conventions from Cho (2016).

As I said last time, these concepts aren’t new. They’ve been debated in the literature since the 1940s, with the following conclusions.

  1. $\rho_T$ underestimates the actual reliability when the assumptions of tau-equivalence aren’t met, which is often the case in practice.
  2. $\rho_T$ is not an index of unidimensionality, where multidimensional tests can still produce strong reliability estimates.
  3. $\rho_T$ is sensitive to test length, where long tests can produce strong reliability estimates even when items are weakly related to one another.

For each of these points I’ll give a summary and demonstration in R.

Assuming tau equivalence

The main assumption of tau-equivalence is that, in the population, all the items in our test have the same relationship with the underlying construct, whose true score we label tau or $\tau$. This assumption can be expressed in terms of factor loadings, which are constrained to be equal across items, or in terms of inter-item covariances, which are constrained to be equal across all pairs of items.

The difference between the tau-equivalent model and the more stringent parallel model is that the latter additionally constrains error variances to be equal, whereas these are free to vary under tau-equivalence. The congeneric model is the least restrictive in that it allows both factor loadings (or inter-item covariances) and uniquenesses (error variances) to vary across items.
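
To make the contrasts concrete, here’s a minimal sketch of the three models in lavaan syntax, assuming a hypothetical four-item test with items i1 through i4 (lavaan isn’t used elsewhere in this post, so treat this as illustrative).

# Minimal sketch of the three models in lavaan syntax,
# assuming hypothetical items i1 through i4
library("lavaan")

# Congeneric: loadings and error variances free to vary
congeneric <- "f =~ i1 + i2 + i3 + i4"

# Tau-equivalent: loadings constrained equal via the
# shared label a
tau_equivalent <- "f =~ a * i1 + a * i2 + a * i3 + a * i4"

# Parallel: equal loadings and equal error variances
parallel <- "
  f =~ a * i1 + a * i2 + a * i3 + a * i4
  i1 ~~ e * i1
  i2 ~~ e * i2
  i3 ~~ e * i3
  i4 ~~ e * i4
"

# Each model could then be fit with, for example,
# lavaan::cfa(tau_equivalent, data = mydata), where mydata
# is a data frame containing scores on i1 through i4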

Tau-equivalence is a strong assumption, one that isn’t typically evaluated in practice. Here’s what can happen when it is violated. I’m simulating a test with 20 items that correlate with a single underlying construct to different degrees. At one extreme, the true loadings range from 0.05 to 0.95. At the other extreme, loadings are all 0.50. The mean of the loadings is always 0.50.

This scatterplot shows the loadings per condition, moving from widely varying at the bottom, as permitted by the congeneric model, to identical at the top, as required by the tau-equivalent model. Tau-equivalent or coefficient alpha reliability should be most accurate in the top condition, and least accurate in the bottom one.

# Load tidyverse package
# Note the epmr and psych packages are also required
# psych is on CRAN, epmr is on GitHub at talbano/epmr
library("tidyverse")

# Build list of factor loadings for 20 item test
ni <- 20
lm <- lapply(1:10, function(x)
  seq(0 + x * .05, 1 - x * .05, length = ni))

# Visualize the levels of factor loadings
tibble(condition = factor(rep(1:length(lm), each = ni)),
  loading = unlist(lm)) %>%
  ggplot(aes(loading, condition)) + geom_point()
Factor loadings across ten range conditions

For each of the ten loading conditions, the simulation involved generating 1,000 data sets, each with 200 test takers, and estimating congeneric and tau-equivalent reliability for each. The table below shows the means of the reliability estimates, labeled $\rho_T$ for tau-equivalent and $\rho_C$ for congeneric, per condition, labeled lm.

# Set seed, reps, and output container
set.seed(201210)
reps <- 1000
sim_out <- tibble(lm = numeric(), rep = numeric(),
  omega = numeric(), alpha = numeric())

# Simulate via two loops, j through levels of
# factor loadings, i through reps
for (j in seq_along(lm)) {
  for (i in 1:reps) {
    # Congeneric data are simulated using the psych package
    temp <- psych::sim.congeneric(loads = lm[[j]],
      N = 200, short = FALSE)
    # Alpha and omega are estimated using the epmr package
    sim_out <- bind_rows(sim_out, tibble(lm = j, rep = i,
      omega = epmr::coef_omega(temp$r, sigma = TRUE),
      alpha = epmr::coef_alpha(temp$observed)$alpha))
  }
}
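
Here’s one way to summarize sim_out into the table that follows, with diff taken as $\rho_T$ minus $\rho_C$.

# Summarize mean reliability estimates by condition
sim_out %>% group_by(lm) %>%
  summarize(rho_t = mean(alpha), rho_c = mean(omega),
    diff = mean(alpha - omega))
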
lm $\rho_T$ $\rho_C$ diff
1 0.8662 0.8807 -0.0145
2 0.8663 0.8784 -0.0121
3 0.8665 0.8757 -0.0093
4 0.8668 0.8735 -0.0067
5 0.8673 0.8720 -0.0047
6 0.8673 0.8706 -0.0032
7 0.8680 0.8701 -0.0020
8 0.8688 0.8699 -0.0011
9 0.8686 0.8692 -0.0006
10 0.8681 0.8685 -0.0004
Mean reliabilities by condition

The last column in this table shows the difference between $\rho_T$ and $\rho_C$. In every condition, alpha or $\rho_T$ underestimates omega or $\rho_C$ on average, and the discrepancy is largest in condition lm 1, where the tau-equivalent assumption of equal loadings is most clearly violated. Here, $\rho_T$ underestimates reliability on average by 0.0145. As we progress toward equal factor loadings in lm 10, $\rho_T$ approximates $\rho_C$.

Dimensionality

Tau-equivalent reliability is often misinterpreted as an index of unidimensionality. But $\rho_T$ doesn’t tell us directly how unidimensional our test is. Instead, like parallel and congeneric reliabilities, $\rho_T$ assumes our test measures a single construct or factor. If our items load on multiple distinct dimensions, $\rho_T$ will probably decrease but may still be strong.

Here’s a simple demonstration where I’ll estimate $\rho_T$ for tests simulated to have different amounts of multidimensionality, from completely unidimensional (correlation matrix is all 1s) to completely multidimensional across three factors (correlation matrix with three clusters of 1s). There are nine items.

The next table shows the generating correlation matrix for one of the 11 conditions examined. The three clusters of items (1 through 3, 4 through 6, and 7 through 9) always had perfect within-cluster correlations, regardless of condition. The remaining off-cluster correlations were fixed within a condition at 0.0, 0.1, …, 1.0. Here, they’re fixed at 0.2. This condition shows strong multidimensionality, via the three item clusters, along with a mild general factor, captured by the 0.2.

i1 i2 i3 i4 i5 i6 i7 i8 i9
i1 1.0 1.0 1.0 0.2 0.2 0.2 0.2 0.2 0.2
i2 1.0 1.0 1.0 0.2 0.2 0.2 0.2 0.2 0.2
i3 1.0 1.0 1.0 0.2 0.2 0.2 0.2 0.2 0.2
i4 0.2 0.2 0.2 1.0 1.0 1.0 0.2 0.2 0.2
i5 0.2 0.2 0.2 1.0 1.0 1.0 0.2 0.2 0.2
i6 0.2 0.2 0.2 1.0 1.0 1.0 0.2 0.2 0.2
i7 0.2 0.2 0.2 0.2 0.2 0.2 1.0 1.0 1.0
i8 0.2 0.2 0.2 0.2 0.2 0.2 1.0 1.0 1.0
i9 0.2 0.2 0.2 0.2 0.2 0.2 1.0 1.0 1.0
Correlation matrix showing some multidimensionality

The simulation again involved generating 1,000 tests, each with 200 test takers, for each condition.

# This will print out the correlation matrix for the
# condition shown in the table above
psych::sim.general(nvar = 9, nfact = 3, g = .2, r = .8)

# Set seed, reps, and output container
set.seed(201211)
reps <- 1000
dim_out <- tibble(dm = numeric(), rep = numeric(),
  alpha = numeric())

# Simulate via two loops, j through levels of
# dimensionality, i through reps
for (j in seq(0, 1, .1)) {
  for (i in 1:reps) {
    # Data are simulated using the psych package
    temp <- psych::sim.general(nvar = 9, nfact = 3,
      g = 1 - j, r = j, n = 200)
    # Estimate alpha with the epmr package
    dim_out <- bind_rows(dim_out, tibble(dm = j, rep = i,
      alpha = epmr::coef_alpha(temp)$alpha))
  }
}
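
As before, here’s one way to summarize dim_out into the mean alphas reported below.

# Summarize mean alpha by dimensionality condition
dim_out %>% group_by(dm) %>%
  summarize(m = mean(alpha))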

Results below show that mean $\rho_T$ starts out at 1.00 in the unidimensional condition dm1, and decreases to 0.75 in the most multidimensional condition dm11, where the off-cluster correlations were all 0.

The example correlation matrix above corresponds to dm9, showing that a relatively weak general dimension, with prominent group dimensions, still produces mean $\rho_T$ of 0.86.

dm1  dm2  dm3  dm4  dm5  dm6  dm7  dm8  dm9  dm10 dm11
1.00 0.99 0.98 0.97 0.96 0.94 0.92 0.89 0.86 0.81 0.75
Mean alphas for 11 conditions of multidimensionality
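
The dm9 value also checks out analytically. Applying the alpha formula $\rho_T = \frac{k}{k - 1}\left(1 - \frac{\sum_i \sigma^2_i}{\sigma^2_X}\right)$ to the population matrix above, the total variance is $9 + 18(1.0) + 54(0.2) = 37.8$, giving $\rho_T = \frac{9}{8}\left(1 - \frac{9}{37.8}\right) \approx 0.86$.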

Test Length

The last demonstration shows how $\rho_T$ increases with test length, even when factor loadings and the relationships among items are weak. I’m simulating tests containing 10 to 200 items. For each test length condition, I generate 100 tests using a congeneric model with all loadings fixed to 0.20.

# Set seed, reps, and output container
set.seed(201212)
reps <- 100
tim_out <- tibble(tm = numeric(), rep = numeric(),
  alpha = numeric())

# Simulate via two loops, j through levels of
# test length, i through reps
for (j in 10:200) {
  for (i in 1:reps) {
    # Congeneric data are simulated using the psych package
    temp <- psych::sim.congeneric(loads = rep(.2, j),
      N = 200, short = F)
    tim_out <- bind_rows(tim_out, tibble(tm = j, rep = i,
      alpha = epmr::coef_alpha(temp$observed)$alpha))
  }
}

The plot below shows $\rho_T$ on the y-axis for each test length condition on the x-axis. The black line shows mean alpha and the ribbon spans plus or minus one standard deviation over replications within a condition.

# Summarize with mean and sd of alpha
tim_out %>% group_by(tm) %>%
  summarize(m = mean(alpha), se = sd(alpha)) %>%
  ggplot(aes(tm, m)) + geom_ribbon(aes(ymin = m - se, 
    ymax = m + se), fill = "lightblue") +
  geom_line() + xlab("test length") + ylab("alpha")
Alpha as a function of test length when factor loadings are fixed at 0.20

Mean $\rho_T$ starts out low at 0.30 for test length 10 items, but surpasses the 0.70 threshold once we hit 56 items. With test length 100 items, we have $\rho_T$ above 0.80, despite having the same weak factor loadings.
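
These means match what we’d expect analytically. With equal loadings of 0.20 on standardized items, the inter-item correlation is $0.20^2 = 0.04$, and the standardized alpha formula gives

$\rho_T = \frac{k \bar{r}}{1 + (k - 1) \bar{r}} = \frac{0.04k}{0.96 + 0.04k} = \frac{k}{k + 24}$,

which works out to about 0.29 at $k = 10$ items, 0.70 at $k = 56$, and 0.81 at $k = 100$.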

When to use tau-equivalent reliability?

These simple demonstrations highlight some of the main limitations of tau-equivalent or alpha reliability. To recap:

  1. As the assumption of tau-equivalence will rarely be met in practice, $\rho_T$ will tend to underestimate the actual reliability for our test, though the discrepancy may be small as shown in the first simulation.
  2. $\rho_T$ decreases somewhat with departures from unidimensionality, but stays relatively strong even with clear multidimensionality.
  3. Test length compensates surprisingly well for low factor loadings and inter-item relationships, producing respectable $\rho_T$ after 50 or so items.

The main benefit of $\rho_T$ is that it’s simpler to calculate than $\rho_C$, as shown in the sketch below. Tau-equivalent reliability is thus recommended when circumstances like small sample size make it difficult to fit a congeneric model. We just have to interpret tau-equivalent results with caution, and then plan ahead for a more comprehensive evaluation of reliability.
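
For reference, here’s a minimal base R sketch of alpha computed directly from the standard formula; this is an illustration, not the epmr implementation.

# A minimal sketch of alpha computed from the standard
# formula; x is a numeric matrix with test takers in rows
# and items in columns (not the epmr implementation)
alpha_by_hand <- function(x) {
  k <- ncol(x)
  covs <- cov(x)
  k / (k - 1) * (1 - sum(diag(covs)) / sum(covs))
}

# For example, with congeneric data like those simulated
# above
# alpha_by_hand(psych::sim.congeneric(N = 200,
#   short = FALSE)$observed)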

References

Cho, E. (2016). Making reliability reliable: A systematic approach to reliability coefficients. Organizational Research Methods, 19, 651-682. https://doi.org/10.1177/1094428116656239

Visualizing Conditional Standard Error in the GRE

Below is some R code for visualizing measurement error across the GRE score scale, plotted against percentiles. Data come from an ETS report at https://www.ets.org/s/gre/pdf/gre_guide.pdf.

The plot shows conditional standard error of measurement (SEM) for GRE verbal scores. SEM is the expected average variability in scores attributable to random error in the measurement process. For details, see my last post.

Here, the SEM is conditional on GRE score, with more error evident at lower verbal scores, and less at higher scores where measurement is more precise. As with other forms of standard error, the SEM can be used to build confidence intervals around an estimate. For example, a verbal score of 150 with a conditional SEM of 2.3 has a 95% interval of roughly 150 ± 4.6. The plot has ribbons for 68% and 95% confidence intervals, based on +/- 1 and 2 SEM.

# Load ggplot2 package
library("ggplot2")

# Put percentiles into data frame, pasting from ETS
# report Table 1B
pct <- data.frame(gre = 170:130,
  matrix(c(99, 96, 99, 95, 98, 93, 98, 90, 97, 89,
    96, 86, 94, 84, 93, 82, 90, 79, 88, 76, 86, 73,
    83, 70, 80, 67, 76, 64, 73, 60, 68, 56, 64, 53,
    60, 49, 54, 45, 51, 41, 46, 37, 41, 34, 37, 30,
    33, 26, 29, 23, 26, 19, 22, 16, 19, 13, 16, 11,
    14, 9, 11, 7, 9, 6, 8, 4, 6, 3, 4, 2, 3, 2, 2,
    1, 2, 1, 1, 1, 1, 1, 1, 1),
    nrow = 41, byrow = TRUE))

# Add variable names
colnames(pct)[2:3] <- c("pct_verbal", "pct_quant")

# Subset and add conditional SEM from Table 5E
sem <- data.frame(pct[c(41, seq(36, 1, by = -5)), ],
  sem_verbal = c(3.9, 3.5, 2.9, 2.5, 2.3, 2.1, 2.1,
    2.0, 1.4),
  sem_quant = c(3.5, 2.9, 2.4, 2.2, 2.1, 2.0, 2.1,
    2.1, 1.0),
  row.names = NULL)

# Plot percentiles on x and GRE on y with
# error ribbons
ggplot(sem, aes(pct_verbal, gre)) +
  geom_ribbon(aes(ymin = gre - sem_verbal * 2,
    ymax = gre + sem_verbal * 2),
    fill = "blue", alpha = .2) +
  geom_ribbon(aes(ymin = gre - sem_verbal,
    ymax = gre + sem_verbal),
    fill = "red", alpha = .2) +
  geom_line()

Demo Code from Recent Paper in APM

A colleague and I recently published a paper in Applied Psychological Measurement titled Linking With External Covariates: Examining Accuracy by Anchor Type, Test Length, Ability Difference, and Sample Size. A pre-print copy is available here.

As the title suggests, we looked at some psychometric situations wherein the process of linking measurement scales could benefit from external information. Here’s the abstract.

Research has recently demonstrated the use of multiple anchor tests and external covariates to supplement or substitute for common anchor items when linking and equating with nonequivalent groups. This study examines the conditions under which external covariates improve linking and equating accuracy, with internal and external anchor tests of varying lengths and groups of differing abilities. Pseudo forms of a state science test were equated within a resampling study where sample size ranged from 1,000 to 10,000 examinees and anchor tests ranged in length from eight to 20 items, with reading and math scores included as covariates. Frequency estimation linking with an anchor test and external covariate was found to produce the most accurate results under the majority of conditions studied. Practical applications of linking with anchor tests and covariates are discussed.

The study is somewhat novel in its use of resampling at both the person and item levels. The result is a different sample of test takers taking a different sample of items at each study replication. I created an Rmarkdown file (saved as txt) that demonstrates the process for a reduced set of conditions.
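
To give a rough sense of the two-level resampling, here’s a hypothetical sketch; the score matrix, sample sizes, and item counts are placeholders, not the conditions from the paper.

# Hypothetical sketch of resampling at both the person and
# item levels; all values here are placeholders
scores <- matrix(sample(0:1, 200 * 40, replace = TRUE),
  nrow = 200, ncol = 40)
# Resample people with replacement, items without
person_index <- sample(nrow(scores), size = 100,
  replace = TRUE)
item_index <- sample(ncol(scores), size = 20)
resampled <- scores[person_index, item_index]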

multi-anchor-demo.txt
multi-anchor-demo.html