Limitations of Implicit Association Testing for Racial Bias

Apparently, implicit association testing (IAT) has been overhyped. Much like grit and power posing, two higher-profile letdowns in pop psychology, implicit bias seems to have attracted more attention than the research justifies. Twitter pointed me to a couple of articles from 2017 that clarify the limitations of the IAT for racial bias.

https://www.vox.com/identities/2017/3/7/14637626/implicit-association-test-racism

https://www.thecut.com/2017/01/psychologys-racism-measuring-tool-isnt-up-to-the-job.html

The Vox article covers these main points.

  • The IAT might work to assess bias in the aggregate, for a group of people or across repeated testing for the same person.
  • It can’t actually predict individual racial bias.
  • The limitations of the IAT don’t mean that racism isn’t real, just that implicit forms of it are hard to measure.
  • As a result, focusing on implicit bias may not help in fighting racism.

The second article from New York Magazine, The Cut, gives some helpful references and outlines a few measurement concepts.

There’s an entire field of psychology, psychometrics, dedicated to the creation and validation of psychological instruments, and instruments are judged based on whether they exceed certain broadly agreed-upon statistical benchmarks. The most important benchmarks pertain to a test’s reliability — that is, the extent to which the test has a reasonably low amount of measurement error (every test has some) — and to its validity, or the extent to which it is measuring what it claims to be measuring. A good psychological instrument needs both.

Reliability for the IAT appears to land below 0.50, based on test-retest correlations. Interpretations of reliability depend on context and there aren’t clear standards, but in my experience a reliability of 0.60 is usually considered too low to be useful. Here, 0.50 would indicate that 50% of the observed variance in scores can be attributed to consistent and meaningful measurement, whereas the other 50% is measurement error.

I haven’t seen reporting on the actual scores that determine whether someone has or does not have implicit bias. Psychometrically, there should be a scale, and it should incorporate decision points or cutoffs beyond which a person is reported to have a strong, weak, or negligible bias.

Until I find some info on scaling, let’s assume that the final IAT result is a z-score centered at 0 (no bias) with a standard deviation of 1 (capturing the average variability). A reliability of 0.50, the best-case scenario, gives us a standard error of measurement (SEM) of 0.71, calculated as the standard deviation times the square root of one minus the reliability, or $\sqrt{1 - 0.50} \approx 0.71$. This tells us that scores are expected to differ, on average, by 0.71 points due to random noise alone.

See also: Confidence Intervals in Measurement vs Political Polls.

Without knowing the score scale and how it’s implemented, we don’t know the ultimate impact of an SEM of 0.71, but we can say that score changes across much of the scale are uninterpretable. A score of +1, one standard deviation above the mean, still contains 0 within its 95% confidence interval. And a 95% confidence interval for a score of 0, in this case no bias, spans roughly two SEM in either direction, from -1.41 to +1.41.
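Here’s a quick sketch of these calculations in R, assuming the hypothetical z-score scale above (SD of 1, reliability of 0.50) and the rough plus or minus two SEM rule for a 95% interval.

# Hypothetical IAT score scale: z-scores with SD of 1 and an
# assumed reliability of 0.50 (the best case discussed above)
sd_x <- 1
rel <- .5

# Standard error of measurement
sem <- sd_x * sqrt(1 - rel)  # about 0.71

# Approximate 95% confidence intervals using +/- 2 SEM
c(0 - 2 * sem, 0 + 2 * sem)  # score of 0, about -1.41 to +1.41
c(1 - 2 * sem, 1 + 2 * sem)  # score of +1, still includes 0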

The authors of the test acknowledge that results can’t be interpreted reliably at the individual level, but the way results are used in practice suggests otherwise. I took the online test a few times (at https://implicit.harvard.edu/), and the score report at the end includes phrasing like, “your responses suggest a strong automatic preference…” This is followed by a disclaimer.

These IAT results are provided for educational purposes only. The results may fluctuate and should not be used to make important decisions. The results are influenced by variables related to the test (e.g., the words or images used to represent categories) and the person (e.g., being tired, what you were thinking about before the IAT).

The disclaimer is on track, but a more honest and transparent message would include a simple index of unreliability, like we see in reports for state achievement test scores.

Really though, if score interpretation at the individual level isn’t recommended, why are individuals provided with a score report?

Correlations between implicit bias scores and other variables, like explicit bias or discriminatory behavior, are also weaker than I’d expect given the amount of publicity the test has received. The original authors of the test reported an average validity coefficient (from meta-analysis) of 0.236 (Greenwald, Poehlman, Uhlmann, & Banaji, 2009; Greenwald, Banaji, & Nosek, 2015), whereas critics of the test reported a more conservative 0.148 (Oswald, Mitchell, Blanton, Jaccard, & Tetlock, 2013). At best, the IAT predicts 6% of the variability in these criterion measures; at worst, 2%.
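Converting a correlation to variance explained is just a matter of squaring it, as in this quick check.

# Proportion of criterion variance explained by IAT scores
0.236 ^ 2  # about 0.06, using the original authors' estimate
0.148 ^ 2  # about 0.02, using the critics' estimate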

The implication here is that implicit bias gets more coverage than it currently deserves. We don’t actually have a reliable way of measuring it, and even in aggregate form scores are only weakly correlated, if at all, with more overt measures of bias, discrimination, and stereotyping. Validity evidence is lacking.

This isn’t to say we shouldn’t investigate or talk about implicit racial bias. Instead, we should recognize that the IAT may not produce the clean, actionable results that we’re expecting, and that our time and resources may be better spent elsewhere if we want our trainings and education to have an impact.

References

Greenwald, A. G., Banaji, M. R., & Nosek, B. A. (2015). Statistically small effects of the Implicit Association Test can have societally large effects. Journal of Personality and Social Psychology, 108, 553–561.

Greenwald, A. G., Poehlman, T. A., Uhlmann, E. L., & Banaji, M. R. (2009). Understanding and using the Implicit Association Test: III. Meta-analysis of predictive validity. Journal of Personality and Social Psychology, 97, 17–41.

Oswald, F. L., Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2013). Predicting ethnic and racial discrimination: A meta-analysis of IAT criterion studies. Journal of Personality and Social Psychology, 105, 171–192.

When to Use Cronbach’s Coefficient Alpha? An Overview and Visualization with R Code

This post follows up on a previous one where I gave a brief overview of so-called coefficient alpha and recommended against its overuse and traditional attribution to Cronbach. Here, I’m going to cover when to use alpha, also known as tau-equivalent reliability $\rho_T$, and when not to use it, with some demonstrations and plotting in R.

We’re referring to alpha now as tau-equivalent reliability because it’s a more descriptive label that conveys the assumptions supporting its use, again following conventions from Cho (2016).

As I said last time, these concepts aren’t new. They’ve been debated in the literature since the 1940s, with the following conclusions.

  1. $\rho_T$ underestimates the actual reliability when the assumptions of tau-equivalence aren’t met, which is likely often the case.
  2. $\rho_T$ is not an index of unidimensionality, where multidimensional tests can still produce strong reliability estimates.
  3. $\rho_T$ is sensitive to test length, where long tests can produce strong reliability estimates even when items are weakly related to one another.

For each of these points I’ll give a summary and demonstration in R.

Assuming tau equivalence

The main assumption in tau-equivalence is that, in the population, all the items in our test have the same relationship with the underlying construct, which we label tau or $\tau$. This assumption can be expressed in terms of factor loadings or inter-item covariances, where factor loadings are equal or covariances are the same across all pairs of items.

The difference between the tau-equivalent model and the more stringent parallel model is that the latter additionally constrains residual item variances to be equal, whereas these are free to vary with tau-equivalence. The congeneric model is the least restrictive in that it allows both the factor loadings (or inter-item covariances) and the uniquenesses (residual item variances) to vary across items.
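For readers who think in model syntax, here’s a minimal sketch of the three models in lavaan (a package not used elsewhere in this post), for a hypothetical three-item test with variables x1 through x3. Each model would be fit with, for example, cfa(model, data = mydata).

# Load lavaan for the model syntax below
library("lavaan")

# Congeneric: loadings and residual variances both free
congeneric <- "f =~ x1 + x2 + x3"

# Tau-equivalent: loadings constrained equal via the shared
# label a, residual variances still free
tau_equivalent <- "f =~ a * x1 + a * x2 + a * x3"

# Parallel: equal loadings and equal residual variances
parallel <- "
  f =~ a * x1 + a * x2 + a * x3
  x1 ~~ e * x1
  x2 ~~ e * x2
  x3 ~~ e * x3
"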

Tau-equivalence is a strong assumption, one that isn’t typically evaluated in practice. Here’s what can happen when it is violated. I’m simulating a test with 20 items that correlate with a single underlying construct to different degrees. At one extreme, the true loadings range from 0.05 to 0.95. At the other extreme, loadings are all 0.50. The mean of the loadings is always 0.50.

This scatterplot shows the loadings per condition as they increase from varying at the bottom, as permitted with the congeneric model, to similar at the top, as required by the tau-equivalent model. Tau-equivalent or coefficient alpha reliability should be most accurate in the top condition, and least accurate in the bottom one.

# Load tidyverse package
# Note the epmr and psych packages are also required
# psych is on CRAN, epmr is on GitHub at talbano/epmr
library("tidyverse")

# Build list of factor loadings for 20 item test
ni <- 20
lm <- lapply(1:10, function(x)
  seq(0 + x * .05, 1 - x * .05, length = ni))

# Visualize the levels of factor loadings
tibble(condition = factor(rep(1:length(lm), each = ni)),
  loading = unlist(lm)) %>%
  ggplot(aes(loading, condition)) + geom_point()
Factor loadings across ten range conditions

For each of the ten loading conditions, the simulation involved generating 1,000 data sets, each with 200 test takers, and estimating congeneric and tau-equivalent reliability for each. The table below shows the means of the reliability estimates, labeled $\rho_T$ for tau-equivalent and $\rho_C$ for congeneric, per condition, labeled lm.

# Set seed, reps, and output container
set.seed(201210)
reps <- 1000
sim_out <- tibble(lm = numeric(), rep = numeric(),
  omega = numeric(), alpha = numeric())

# Simulate via two loops, j through levels of
# factor loadings, i through reps
for (j in seq_along(lm)) {
  for (i in 1:reps) {
    # Congeneric data are simulated using the psych package
    temp <- psych::sim.congeneric(loads = lm[[j]],
      N = 200, short = F)
    # Alpha and omega are estimated using the epmr package
    sim_out <- bind_rows(sim_out, tibble(lm = j, rep = i,
      omega = epmr::coef_omega(temp$r, sigma = T),
      alpha = epmr::coef_alpha(temp$observed)$alpha))
  }
}
lm $\rho_T$ $\rho_C$ diff
1 0.8662 0.8807 -0.0145
2 0.8663 0.8784 -0.0121
3 0.8665 0.8757 -0.0093
4 0.8668 0.8735 -0.0067
5 0.8673 0.8720 -0.0047
6 0.8673 0.8706 -0.0032
7 0.8680 0.8701 -0.0020
8 0.8688 0.8699 -0.0011
9 0.8686 0.8692 -0.0006
10 0.8681 0.8685 -0.0004
Mean reliabilities by condition

The last column in this table shows the difference between $\rho_T$ and $\rho_C$. Alpha or $\rho_T$ always underestimates omega or $\rho_C$, and the discrepancy is largest in condition lm 1, where the tau-equivalent assumption of equal loadings is most clearly violated. There, $\rho_T$ underestimates reliability on average by 0.0145. As we progress toward equal factor loadings in lm 10, $\rho_T$ approximates $\rho_C$.

Dimensionality

Tau-equivalent reliability is often misinterpreted as an index of unidimensionality. But $\rho_T$ doesn’t tell us directly how unidimensional our test is. Instead, like parallel and congeneric reliabilities, $\rho_T$ assumes our test measures a single construct or factor. If our items load on multiple distinct dimensions, $\rho_T$ will probably decrease but may still be strong.

Here’s a simple demonstration where I’ll estimate $\rho_T$ for tests simulated to have different amounts of multidimensionality, from completely unidimensional (correlation matrix is all 1s) to completely multidimensional across three factors (correlation matrix with three clusters of 1s). There are nine items.

The next table shows the generating correlation matrix for one of the 11 conditions examined. The three clusters of items (1 through 3, 4 through 6, and 7 through 9) always had perfect correlations, regardless of condition. The remaining off-cluster correlations were fixed within a condition to one of 0, 0.1, 0.2, …, 1.0. Here, they’re fixed to 0.2. This condition shows strong multidimensionality, via the three item clusters, along with a mild general factor, reflected in the off-cluster correlations of 0.2.

i1 i2 i3 i4 i5 i6 i7 i8 i9
i1 1.0 1.0 1.0 0.2 0.2 0.2 0.2 0.2 0.2
i2 1.0 1.0 1.0 0.2 0.2 0.2 0.2 0.2 0.2
i3 1.0 1.0 1.0 0.2 0.2 0.2 0.2 0.2 0.2
i4 0.2 0.2 0.2 1.0 1.0 1.0 0.2 0.2 0.2
i5 0.2 0.2 0.2 1.0 1.0 1.0 0.2 0.2 0.2
i6 0.2 0.2 0.2 1.0 1.0 1.0 0.2 0.2 0.2
i7 0.2 0.2 0.2 0.2 0.2 0.2 1.0 1.0 1.0
i8 0.2 0.2 0.2 0.2 0.2 0.2 1.0 1.0 1.0
i9 0.2 0.2 0.2 0.2 0.2 0.2 1.0 1.0 1.0
Correlation matrix showing some multidimensionality

The simulation again involved generating 1,000 tests, each with 200 test takers, for each condition.

# This will print out the correlation matrix for the
# condition shown in the table above
psych::sim.general(nvar = 9, nfact = 3, g = .2, r = .8)

# Set seed, reps, and output container
set.seed(201211)
reps <- 1000
dim_out <- tibble(dm = numeric(), rep = numeric(),
  alpha = numeric())

# Simulate via two loops, j through levels of
# dimensionality, i through reps
for (j in seq(0, 1, .1)) {
  for (i in 1:reps) {
    # Data are simulated using the psych package
    temp <- psych::sim.general(nvar = 9, nfact = 3,
      g = 1 - j, r = j, n = 200)
    # Estimate alpha with the epmr package
    dim_out <- bind_rows(dim_out, tibble(dm = j, rep = i,
      alpha = epmr::coef_alpha(temp)$alpha))
  }
}

Results below show that mean $\rho_T$ starts out at 1.00 in the unidimensional condition dm1, and decreases to 0.75 in the most multidimensional condition dm11, where the off-cluster correlations were all 0.

The example correlation matrix above corresponds to dm9, showing that a relatively weak general dimension, with prominent group dimensions, still produces mean $\rho_T$ of 0.86.

dm1 dm2 dm3 dm4 dm5 dm6 dm7 dm8 dm9 dm10 dm11
1.00 0.99 0.98 0.97 0.96 0.94 0.92 0.89 0.86 0.81 0.75
Mean alphas for 11 conditions of multidimensionality

Test Length

The last demonstration shows how $\rho_T$ gets stronger as test length increases, even when factor loadings and the relationships among items are weak. I’m simulating tests containing 10 to 200 items. For each test length condition, I generate 100 tests using a congeneric model with all loadings fixed to 0.20.

# Set seed, reps, and output container
set.seed(201212)
reps <- 100
tim_out <- tibble(tm = numeric(), rep = numeric(),
  alpha = numeric())

# Simulate via two loops, j through levels of
# test length, i through reps
for (j in 10:200) {
  for (i in 1:reps) {
    # Congeneric data are simulated using the psych package
    temp <- psych::sim.congeneric(loads = rep(.2, j),
      N = 200, short = F)
    tim_out <- bind_rows(tim_out, tibble(tm = j, rep = i,
      alpha = epmr::coef_alpha(temp$observed)$alpha))
  }
}

The plot below shows $\rho_T$ on the y-axis for each test length condition on x. The black line captures mean alpha and the ribbon captures the standard deviation over replications for a given condition.

# Summarize with mean and sd of alpha
tim_out %>% group_by(tm) %>%
  summarize(m = mean(alpha), se = sd(alpha)) %>%
  ggplot(aes(tm, m)) + geom_ribbon(aes(ymin = m - se, 
    ymax = m + se), fill = "lightblue") +
  geom_line() + xlab("test length") + ylab("alpha")
Alpha as a function of test length when factor loadings are fixed at 0.20

Mean $\rho_T$ starts out low at 0.30 for test length 10 items, but surpasses the 0.70 threshold once we hit 56 items. With test length 100 items, we have $\rho_T$ above 0.80, despite having the same weak factor loadings.
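These means line up with what we’d expect from the standardized alpha formula, treating the expected inter-item correlation as the squared loading, or $0.20^2 = 0.04$. Here’s a quick check in R, separate from the simulation above.

# Predicted alpha for j items with average inter-item
# correlation rbar, here the squared factor loading of 0.20
predicted_alpha <- function(j, rbar = .2^2) {
  j * rbar / (1 + (j - 1) * rbar)
}
predicted_alpha(10)   # about 0.29
predicted_alpha(56)   # about 0.70
predicted_alpha(100)  # about 0.81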

When to use tau-equivalent reliability?

These simple demonstrations highlight some of the main limitations of tau-equivalent or alpha reliability. To recap:

  1. As the assumption of tau-equivalence will rarely be met in practice, $\rho_T$ will tend to underestimate the actual reliability for our test, though the discrepancy may be small as shown in the first simulation.
  2. $\rho_T$ decreases somewhat with departures from unidimensionality, but stays relatively strong even with clear multidimensionality.
  3. Test length compensates surprisingly well for low factor loadings and inter-item relationships, producing respectable $\rho_T$ after 50 or so items.

The main benefit of $\rho_T$ is that it’s simpler to calculate than $\rho_C$. Tau-equivalent reliability is thus recommended when circumstances like small sample size make it difficult to fit a congeneric model. We just have to interpret tau-equivalent results with caution, and plan ahead for a more comprehensive evaluation of reliability.

References

Cho, E. (2016). Making reliability reliable: A systematic approach to reliability coefficients. Organizational Research Methods, 19, 651-682. https://doi.org/10.1177/1094428116656239

Thoughts on Cronbach’s Coefficient Alpha

I have a few thoughts to share on coefficient alpha, the ubiquitous and frequently misused psychometric index of internal consistency reliability. These thoughts aren’t new; people have thought and written about them before (references below). But they’re worth repeating, as the majority of those who cite Cronbach (1951) seem to be unaware that:

  1. alpha is not the only or best measure of internal consistency reliability,
  2. strong alpha does not indicate unidimensionality or a single underlying construct, and
  3. Cronbach ultimately regretted that his alpha became the preferred index.

What is alpha?

Coefficient alpha indexes the extent to which the components of a scale function together in a consistent way. Higher alpha (closer to 1) vs lower alpha (closer to 0) means higher vs lower consistency.

The most common use of alpha is with items or questions within an educational or psychological test, where the composite is a total summed score. If we can determine that a set of test items is internally consistent, with a strong alpha, we can be more confident that a total on our test will provide a cohesive summary of performance across items. Low alpha suggests we shouldn’t combine our items by summing. In this case, the total is expected to have less consistent meaning.

Alpha estimates reliability using the average of the relationships among scored items. This is contrasted with the overall variability for the composite, based on the variance $\sigma^2_X$ of the total score $X$. If we find the covariance for each distinct item pair $X_j$ and $X_{j'}$ and then get the mean as $\bar{\sigma}_{X_jX_{j'}}$, we have

$$\rho_T = J^2\frac{\bar{\sigma}_{X_jX_{j'}}}{\sigma^2_X}$$

where $J$ is the number of items in the test. I’m using the label $\rho_T$ instead of alpha, where the $T$ denotes tau-equivalent reliability, following conventions from Cho (2016).
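As a rough illustration, the formula translates directly into R for a scored item response matrix with people in rows and items in columns. This is just a sketch, not a replacement for established functions like psych::alpha.

# Tau-equivalent reliability (alpha) from the average inter-item
# covariance, for a scored item response matrix x
coef_alpha_sketch <- function(x) {
  j <- ncol(x)                            # number of items
  covs <- cov(x)                          # item covariance matrix
  cov_bar <- mean(covs[lower.tri(covs)])  # mean over distinct item pairs
  sigma2_x <- var(rowSums(x))             # variance of the total score
  j^2 * cov_bar / sigma2_x
}

# Example with simulated congeneric data from the psych package
# coef_alpha_sketch(psych::sim.congeneric(N = 200, short = FALSE)$observed)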

Alpha isn’t necessarily best

There are lots of papers outlining alpha as one among a variety of options for estimating reliability with scores from a single administration of a test. See the Wikipedia entries on tau-equivalent reliability, which encompasses alpha, and congeneric reliability for accessible summaries.

Most often, alpha is contrasted with what are called congeneric reliability estimates. A simple example is the ratio of the squared sum of standardized factor loadings $(\sum\lambda)^2$ from a unidimensional model, to total variance, or

$$\rho_C = \frac{(\sum\lambda)^2}{\sigma^2_X}.$$

Congeneric reliability indices are often recommended because they have less strict assumptions than tau-equivalent ones like alpha.

  • Tau-equivalent reliability, including alpha, allows individual item variances to differ, but assumes unidimensionality as well as equal inter-item covariances in the population.
  • Congeneric reliability allows individual item variances and inter-item covariances to differ, and only assumes unidimensionality in the population.
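To make the contrast concrete, here’s a minimal sketch of $\rho_C$ based on a one-factor model fit with psych::fa, using standardized loadings and the model-implied total variance in the denominator. Again, this is a sketch rather than a substitute for dedicated functions like psych::omega.

# Congeneric reliability from a one-factor model, using
# standardized loadings estimated with psych::fa
coef_omega_sketch <- function(x) {
  fit <- psych::fa(x, nfactors = 1)   # one-factor model
  lambda <- as.numeric(fit$loadings)  # standardized loadings
  sum(lambda)^2 / (sum(lambda)^2 + sum(1 - lambda^2))
}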

When the stricter assumptions of alpha aren’t met, which is typically the case in practice, alpha will underestimate and/or misrepresent reliability.

Cronbach and Shavelson (2004) recommended the more comprehensive generalizability theory in place of a narrow focus on alpha. More direct critiques of alpha include Sijtsma (2009), with a response from Revelle and Zinbarg (2009), and McNeish (2018), with a response from Raykov and Marcoulides (2019). Cho (2016) proposes a new perspective on the relationships among alpha and other reliability coefficients, as well as a new naming convention.

Alpha is not a direct measure of unidimensionality

A common misconception is that strong alpha is evidence of unidimensionality, that is, a single construct or factor underlying a set of items. The literature has thoroughly addressed this point, so I’ll just summarize by saying that

  • alpha assumes unidimensionality, and works best when it’s present, but
  • strong alpha does not confirm that a scale is unidimensional, instead, alpha can be strong with a multidimensional scale.

These and related points have led some (e.g., Sijtsma, 2009) to recommend against the term internal consistency reliability because it suggests that alpha reflects the internal structure of the test, which it does not do, at least not consistently (Cortina, 1993).

Cronbach’s comments on alpha

Cronbach (1951) didn’t invent tau-equivalent reliability or the foundations for what would become coefficient alpha. Instead, he gave an existing coefficient an accessible derivation, as well as a catchy, seemingly preeminent Greek label. The same or similar formulations were available in publications predating Cronbach’s article (for a summary, see the tau-equivalent reliability Wikipedia entry). This isn’t something Cronbach tried to hide, and it’s not necessarily a criticism of his work. But most people are unaware of these details, and we’ve gotten carried away with the attribution, a fact that Cronbach himself lamented (2004, p. 397):

To make so much use of an easily calculated translation of a well-established formula scarcely justifies the fame it has brought me. It is an embarrassment to me that the formula became conventionally known as Cronbach’s alpha.

I suggest we refer to alpha simply as coefficient alpha, or use a more specific term like tau-equivalent reliability. If we need a reference, we should use something more recent, comprehensive, and accessible, like one of the papers mentioned above or a measurement textbook (e.g., Albano, 2020; Bandalos, 2018). I also recommend considering alternative indices, and being more thoughtful about the choice. This may go against the grain, but it makes sense given the history and research.

If abandoning the Cronbach moniker isn’t rebellious enough for you, I also recommend against the omnipresent Likert scale for similar reasons which I’ll get into later.

[Update May 26, 2020: revised the formulas and added references.]

References

Albano, A. D. (2020). Introduction to Educational and Psychological Measurement Using R. https://thetaminusb.com/intro-measurement-r/

Bandalos, D. L. (2018). Measurement Theory and Applications for the Social Sciences. The Guilford Press.

Cho, E. (2016). Making reliability reliable: A systematic approach to reliability coefficients. Organizational Research Methods, 19, 651-682. https://doi.org/10.1177/1094428116656239

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98-104.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334. https://doi.org/10.1007/BF02310555

Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64, 391–418. https://doi.org/10.1177/0013164404266386

McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23, 412–433. https://doi.org/10.1037/met0000144

Raykov, T., & Marcoulides, G. A. (2019). Thanks coefficient alpha, we still need you! Educational and Psychological Measurement, 79, 200–210. https://doi.org/10.1177/0013164417725127

Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma. Psychometrika, 74, 145–154. https://doi.org/10.1007/s11336-008-9102-z

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74, 107–120. https://doi.org/10.1007/s11336-008-9101-0