Overview

This document provides code demonstrating the resampling process used in the article by Tony Albano and Marie Wiberg, “Linking With External Covariates: Examining Accuracy by Anchor Type, Test Length, Ability Difference, and Sample Size,” currently in press at Applied Psychological Measurement, DOI: 10.1177/0146621618824855.

For a summary of the article, here’s its abstract:

Research has recently demonstrated the use of multiple anchor tests and external covariates to supplement or substitute for common anchor items when linking and equating with nonequivalent groups. This study examines the conditions under which external covariates improve linking and equating accuracy, with internal and external anchor tests of varying lengths and groups of differing abilities. Pseudo forms of a state science test were equated within a resampling study where sample size ranged from 1,000 to 10,000 examinees and anchor tests ranged in length from eight to 20 items, with reading and math scores included as covariates. Frequency estimation linking with an anchor test and external covariate was found to produce the most accurate results under the majority of conditions studied. Practical applications of linking with anchor tests and covariates are discussed.

The code provided below aims to demonstrate the procedures used in study 2 of Albano and Wiberg. The general structure of the resampling process is the same as in the article: data from a population are split into two sub-populations that constitute nonequivalent groups, and items are sampled and distributed into pseudo test forms, with a different criterion equating function at each replication.

Note that the code only demonstrates procedures for a small part of the full study, and only at a high level. It does not consider conditions such as anchor length, sample size, internal versus external anchor, or equating method. Instead, it focuses on a single sample size and test length, with an internal anchor test and equipercentile equating. It also relies on simulated rather than real data.

The code is for demonstration purposes only, and is not optimized for speed or accuracy. R warnings will be raised during the loglinear presmoothing process. Otherwise, the resampling loop should run without errors given the data and setup below.

This document was compiled in RStudio using R markdown and the knitr package (Xie 2014). The code itself was developed in R (R Core Team 2019) version 3.4.4, with procedures from the equate package (Albano 2016), version 2.0.7. The document is distributed under a CC-BY-SA license.

# equate is available on CRAN
install.packages("equate")

# Load the package
library("equate")

Simulating data

This demonstration requires that we first simulate some testing data. Albano and Wiberg examined real data, based on results from a state testing program. However, these data are not publicly available. The simulated data are intended to resemble the data from study 2 in the article, which assumed a nonequivalent groups design.

Prior to simulating the data, we create two functions that are used to generate scored item responses based on person and item parameters presumed to come from an item response theory model.

# Item response function, taking a matrix of item parameters (columns
# for discrimination, difficulty, and lower asymptote) and a vector of
# person abilities, and returning a matrix of probabilities of correct
# response with a row per person and a column per item
irt_irf <- function(ip, theta) {
  ni <- nrow(ip)
  out <- rbind(sapply(1:ni, function(i) ip[i, 3] + (1 - ip[i, 3]) /
      (1 + exp(ip[i, 1] * (-theta + ip[i, 2])))))
  return(out)
}

# Item response simulation function, scoring responses by comparing
# probabilities of correct response against uniform random draws
irt_sim <- function(ip, theta) {
  p <- irt_irf(ip, theta)
  nr <- prod(dim(p))
  out <- ifelse(p > runif(nr), 1, 0)
  return(out)
}
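
As a quick check, the functions can be run on a toy example. The item parameters and abilities below are arbitrary and for illustration only; in the output, rows correspond to test takers and columns to items.

# Toy example (illustrative only): two Rasch items with arbitrary
# difficulties, and three test takers with arbitrary abilities
ip_demo <- cbind(1, c(-1, 0.5), 0)
round(irt_irf(ip_demo, c(-1, 0, 1)), 2)
irt_sim(ip_demo, c(-1, 0, 1))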

Data simulation involves sampling two groups of test takers from normal distributions, one with mean 1.5 and the other with mean 1, both having standard deviation 1. Difficulty parameters are generated for 60 items and used to find the probability of a correct response, and a simulated scored response, under a Rasch model for each item and test taker. Finally, a total score is obtained for each test taker, and another distribution of scores is created so as to correlate near 0.80 with the total score.

The first test is referred to as a science test, as in Albano and Wiberg, and the second as a reading test. The goal of equating in this scenario is to establish a statistical relationship between scores on two subsets of science items, referred to below as pseudo test forms, using information from both the science test and the reading test as an external covariate.

# Set random number generation seed
set.seed(190514)

# Population size
np <- 10000

# Simulate true science ability for sub-populations p and q, where p is
# slightly more able
theta_p <- rnorm(np / 2, 1.5, 1)
theta_q <- rnorm(np / 2, 1, 1)
theta <- c(theta_p, theta_q)

# A Rasch model was fit to the 60 science items, with item difficulties
# found to be distributed normally with mean near -1 and standard deviation
# 0.80
# Simulate item parameters and then use to generate item responses
ip <- cbind(1, rnorm(60, -1, .8), 0)
responses <- irt_sim(ip, theta)
colnames(responses) <- paste0("sq", 1:60)
st <- rowSums(responses)

# Simulate reading, which correlates with science around 0.80
rz <- ((st - mean(st)) * .8 + rnorm(np, 0, sd(st)) * sqrt(1 - .8^2)) / sd(st)
rt <- round((rz * 2) + 10)
rt[rt < 0] <- 0
rt[rt > 15] <- 15
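
As a rough check on the simulation, the correlation between the science and reading totals should be near 0.80, and the mean science total should be higher in group p than in group q. Exact values depend on the seed.

# Check the simulated data: science-reading correlation and group means
cor(st, rt)
mean(st[1:(np / 2)])
mean(st[(np / 2 + 1):np])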

Setup

The following code sets up the resampling loop by defining some preliminary values used in sampling test takers and items. In some cases, the level of detail is unnecessary for a simple demonstration like this. For example, a list of science item names is created to support the sampling of items by science domain. These data do not actually capture the relationships that would exist among items by domain. However, the code is provided here to show how sampling by domain was carried out in the article.

Equating will be conducted three times per replication: once with the internal science anchor as well as the external reading test as a supplemental anchoring variable, once with only the internal science anchor, and once with a single-group design and no anchor, where the full population completes both x and y. In the R code, the numbers 1 and 2 are appended to some object names, with 1 denoting equating without reading scores and 2 denoting equating with reading scores.

# Person indices, defining populations p and q
# These assume data in responses are stacked p then q
pindex <- 1:(np / 2)
qindex <- (np / 2 + 1):np

# Item names per four content domains, assuming items are ordered by domain
sdom <- list(d1 = paste0("sq", 1:13), d2 = paste0("sq", 14:29),
  d3 = paste0("sq", 30:46), d4 = paste0("sq", 47:60))
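
As a quick illustration of the domain-based sampling used in the resampling loop below, item names can be shuffled within each domain and the first five names per domain taken for a 20-item pseudo form. The object si_demo is for demonstration only and is not used later.

# Shuffle item names within each domain, then take the first five
# per domain
si_demo <- lapply(sdom, sample)
unlist(lapply(si_demo, "[", 1:5))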

# Sample size per replication
nj <- 1000

# Numbers of anchor items per content domain, for the 20-item anchor condition
nvi <- c(3, 6, 7, 4)

# Score scales for x, y, and anchor v, assuming 20 items in the internal anchor
xscale <- yscale <- 0:40
vscale <- 0:20

# Score functions for loglinear presmoothing without reading
# Created beforehand to save on processing time
# Use empty frequency tables as templates
xtab1 <- as.data.frame(freqtab(cbind(0, 0), scales = list(xscale, vscale)))
degrees1 <- list(c(6, 6), c(2, 2))
sfun1 <- equate:::sf(xtab1, degrees1, stepup = T)

# Score functions for loglinear presmoothing with reading as second anchor
xtab2 <- as.data.frame(freqtab(cbind(0, 0, 0),
  scales = list(xscale, vscale, 0:15)))
degrees2 <- list(c(6, 6, 6), c(2, 2, 2), c(2, 2, 2))
sfun2 <- equate:::sf(xtab2, degrees2, stepup = T)

# Models to be compared when choosing best fit in presmoothing
# Each value maps onto a column in the score function, denoting that
# column should be included with others having the same value when
# running a given model
models1 <- c(1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 6, 6, 6)
models2 <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5,
  1, 1, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6, 1, 7, 7, 7, 7, 7, 7, 7)
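
Each entry in a models vector corresponds to one column of the associated score function, so the lengths should match. A minimal consistency check, assuming the column layout produced by equate:::sf in version 2.0.7:

# Each models entry should map onto one score function column
stopifnot(length(models1) == ncol(sfun1))
stopifnot(length(models2) == ncol(sfun2))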

# Score function for criterion equating, with a single group design
# A models object is not needed, as the same presmoothing model will
# always be used 
xtabc <- as.data.frame(freqtab(cbind(0, 0), scales = list(xscale, yscale)))
sfunc <- equate:::sf(xtabc, degrees = list(c(6, 6), c(2, 2)))

# Replications
reps <- 100

# Output containers for estimated and criterion equating functions,
# and for the smoothed criterion distribution over x, used later in
# finding weighted error
out1 <- out2 <- outc <- outf <- matrix(nrow = length(xscale), ncol = reps)

Resampling study

The resampling studies in Albano and Wiberg involved replications of equating across multiple combinations of conditions. Here, we focus on a single loop that iterates through replications for a single condition. At each replication, we sample item names from the domain list and then arrange them into forms for x, y, and the anchor test v. x and y represent science items uniquely administered to groups p and q, respectively, and v represents the common items. Items are sampled with the constraint that the mean difficulty difference between x and y, based on total scores over unique items, is less than or equal to 2 points.

After building forms, we sample test takers from each group, without replacement. Total scores are then found and combined into trivariate frequency distributions, where the total score includes the unique and common anchor science items for a given group, the first anchor variable is the score over the common science items on the internal anchor test, and reading scores constitute a second anchor variable with a scale ranging from 0 to 15. The trivariate distribution is reduced to a bivariate distribution by collapsing over reading scores.

Loglinear models are used next to presmooth each score distribution. Equipercentile equating functions are then estimated, for the sample and for the population, given the pseudo forms for the particular replication.

Note that the demo results are saved to an R data file so this document can be compiled more efficiently. To compile this R markdown document to HTML locally, the code for the replication loop would have to be run manually.

for(r in 1:reps) {
  
  # Sample items per form by domain, repeating if there's a
  # difficulty difference greater than 2 points
  vcheck <- TRUE
  while(vcheck) {
    si <- lapply(sdom, sample)
    # First five by domain go to x
    xi <- unlist(lapply(si, "[", 1:5))
    # Next five by domain go to y
    yi <- unlist(lapply(si, "[", 6:10))
    vcheck <- abs(mean(rowSums(responses[, xi])) -
      mean(rowSums(responses[, yi]))) > 2
  }
  # Next nvi per domain go to v
  vi <- c(unlist(sapply(1:4, function(k)
    si[[k]][11:(10 + nvi[k])])))

  # Sample people as indices to later be used in subsetting
  xj <- sample(pindex, nj, replace = F)
  yj <- sample(qindex, nj, replace = F)

  # Total and anchor scores per group
  xt <- rowSums(responses[xj, xi])
  yt <- rowSums(responses[yj, yi])
  xv <- rowSums(responses[xj, vi])
  yv <- rowSums(responses[yj, vi])
  
  # Create frequency tables, assuming internal anchor, with reading
  xf2 <- freqtab(cbind(xt + xv, xv, rt[xj]),
    scales = list(xscale, vscale, 0:15))
  yf2 <- freqtab(cbind(yt + yv, yv, rt[yj]),
    scales = list(yscale, vscale, 0:15))

  # Reduce frequency tables for equating without reading
  xf1 <- margin(xf2, 1:2)
  yf1 <- margin(yf2, 1:2)
  
  # Smooth x and y, with AIC determining best fitting model
  xf2s <- loglinear(xf2, scorefun = sfun2, models = models2, choose = T,
    rmimpossible = 1:2, choosemethod = "aic")
  yf2s <- loglinear(yf2, scorefun = sfun2, models = models2, choose = T,
    rmimpossible = 1:2, choosemethod = "aic")
  
  # Smooth without reading
  xf1s <- loglinear(xf1, scorefun = sfun1, models = models1, choose = T,
    rmimpossible = 1:2, choosemethod = "aic")
  yf1s <- loglinear(yf1, scorefun = sfun1, models = models1, choose = T,
    rmimpossible = 1:2, choosemethod = "aic")
  
  # Equipercentile equating with reading
  out2[, r] <- equate(xf2s, yf2s, type = "equip", method = "freq",
    verbose = FALSE)
  
  # Equipercentile equating without reading
  out1[, r] <- equate(xf1s, yf1s, type = "equip", method = "freq",
    verbose = FALSE)
  
  # Frequency table for population based on single group design
  xyf <- freqtab(cbind(rowSums(responses[, c(xi, vi)]),
    rowSums(responses[, c(yi, vi)])), scales = list(xscale, yscale))
  
  # Smooth population and find criterion equating function
  xyfs <- loglinear(xyf, scorefun = sfunc)
  outc[, r] <- equate(xyfs, type = "equip", verbose = FALSE)
  
  # Save smoothed x
  outf[, r] <- margin(xyfs, 1)
}

# Save output
save(out1, out2, outc, outf, file = "multi-anchor-demo.rda")

Results

We can summarize results over rows in the output objects, comparing the estimated equating functions with the criterion function, which varies by replication, at each score point. Error is calculated as the difference between the two. Error can be summarized over score points for a given replication by taking the root mean squared error. Error for a condition is then summarized as the mean over replications.

Prior to averaging, errors can also be weighted by the smoothed population frequency distribution for x, which differs by replication. The weighted errors highlight accuracy where the majority of score information is provided, while de-emphasizing score points obtained by fewer test takers.

# Load output
load("multi-anchor-demo.rda")

# Check one of the smoothed population distributions
# Distribution is sparse below x of 20
plot(as.freqtab(cbind(xscale, outf[, 1])))


# Unweighted error by replication
rmse1 <- sqrt(colMeans((out1 - outc)^2))
rmse2 <- sqrt(colMeans((out2 - outc)^2))

# Overall unweighted
mean(rmse1)
## [1] 1.809826
mean(rmse2)
## [1] 1.748622

# Weighted error by replication
pw <- outf / 10000
wrmse1 <- sqrt(colMeans(((out1 - outc)^2) * pw))
wrmse2 <- sqrt(colMeans(((out2 - outc)^2) * pw))

# Overall weighted
mean(wrmse1)
## [1] 0.07039967
mean(wrmse2)
## [1] 0.05753582

References

Albano, Anthony D. 2016. “equate: An R Package for Observed-Score Linking and Equating.” Journal of Statistical Software 74 (8): 1–36. https://doi.org/10.18637/jss.v074.i08.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.

Xie, Yihui. 2014. “knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman and Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.