Tony – Page 3 – theta minus b

May 20, 2020 (updated September 20, 2021)

Thoughts on Cronbach’s Coefficient Alpha

I have a few thoughts to share on coefficient alpha, the ubiquitous and frequently misused psychometric index of internal consistency reliability. These thoughts aren’t new; people have thought and written about them before (references below). But they’re worth repeating, as the majority of those who cite Cronbach (1951) seem to be unaware that:

alpha is not the only or best measure of internal consistency reliability,
strong alpha does not indicate unidimensionality or a single underlying construct, and
Cronbach ultimately regretted that his alpha became the preferred index.

What is alpha?

Coefficient alpha indexes the extent to which the components of a scale function together in a consistent way. Higher alpha (closer to 1) vs lower alpha (closer to 0) means higher vs lower consistency.

The most common use of alpha is with items or questions within an educational or psychological test, where the composite is a total summed score. If we can determine that a set of test items is internally consistent, with a strong alpha, we can be more confident that a total on our test will provide a cohesive summary of performance across items. Low alpha suggests we shouldn’t combine our items by summing. In this case, the total is expected to have less consistent meaning.

Alpha estimates reliability using the average of the relationships among scored items. This is contrasted with the overall variability for the composite, based on the variance $\sigma^2_X$ of the total score $X$. If we find the covariance for each distinct item pair $X_j$ and $X_{j'}$ and then get the mean as $\bar{\sigma}_{X_jX_{j'}}$, we have

$$\rho_T = J^2\frac{\bar{\sigma}_{X_jX_{j'}}}{\sigma^2_X}$$

where $J$ is the number of items in the test. I’m using the label $\rho_T$ instead of alpha, where the $T$ denotes tau-equivalent reliability, following conventions from Cho (2016).
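As a minimal sketch of this formula in R, the code below simulates scores on five items, computes $\rho_T$ from the mean inter-item covariance, and checks the result against the more familiar variance-based formula for alpha. The data, seed, and object names are made up for illustration and don't come from any real test.

```r
# Simulate scores on 5 items for 500 people who share one
# common factor (hypothetical data, for illustration only)
set.seed(1)
n <- 500
J <- 5
common <- rnorm(n)
items <- sapply(1:J, function(j) common + rnorm(n))

# Covariance matrix for the items
S <- cov(items)

# Variance of the total score, which equals the sum of all
# elements in the item covariance matrix
var_total <- sum(S)

# Mean covariance over distinct item pairs (off-diagonal elements)
mean_cov <- mean(S[upper.tri(S)])

# Alpha via the mean covariance
rho_t <- J^2 * mean_cov / var_total

# Alpha via the familiar formula based on item variances
alpha_classic <- (J / (J - 1)) * (1 - sum(diag(S)) / var_total)

c(rho_t = rho_t, alpha_classic = alpha_classic)
```

The two versions are algebraically the same, so they should agree up to floating point error.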

Alpha isn’t necessarily best

There are lots of papers outlining alpha as one among a variety of options for estimating reliability with scores from a single administration of a test. See the Wikipedia entries on tau-equivalent reliability, which encompasses alpha, and congeneric reliability for accessible summaries.

Most often, alpha is contrasted with what are called congeneric reliability estimates. A simple example is the ratio of the squared sum of standardized factor loadings $(\sum\lambda)^2$ from a unidimensional model, to total variance, or

$$\rho_C = \frac{(\sum\lambda)^2}{\sigma^2_X}.$$

Congeneric reliability indices are often recommended because they have less strict assumptions than tau-equivalent ones like alpha.

Tau-equivalent reliability, including alpha, allows individual item variances to differ, but assumes unidimensionality as well as equal inter-item covariances in the population.
Congeneric reliability allows individual item variances and inter-item covariances to differ, and only assumes unidimensionality in the population.

When the stricter assumptions of alpha aren’t met, which is typically the case in practice, alpha will underestimate and/or misrepresent reliability.
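Here is a small sketch of that underestimation, working at the population level rather than with sample data. The loadings are hypothetical, and I'm assuming standardized items, so that total variance is the sum of the model-implied correlation matrix.

```r
# Hypothetical standardized loadings for a 4-item
# unidimensional scale (unequal, so not tau-equivalent)
lambda <- c(0.8, 0.7, 0.6, 0.5)
J <- length(lambda)

# Model-implied correlation matrix: products of loadings off
# the diagonal, ones on the diagonal
R_mat <- lambda %o% lambda
diag(R_mat) <- 1

# Variance of the sum of the standardized items
var_total <- sum(R_mat)

# Congeneric reliability
rho_c <- sum(lambda)^2 / var_total

# Tau-equivalent reliability (alpha) from the mean
# off-diagonal element
rho_t <- J^2 * mean(R_mat[upper.tri(R_mat)]) / var_total

round(c(rho_c = rho_c, rho_t = rho_t), 3)
```

With these loadings alpha lands slightly below the congeneric estimate. The gap widens as the loadings become more unequal, and it disappears when they are all the same.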

Cronbach and Shavelson (2004) recommended the more comprehensive generalizability theory in place of a narrow focus on alpha. More direct critiques of alpha include Sijtsma (2009), with a response from Revelle and Zinbarg (2009), and McNeish (2017), with a response from Raykov and Marcoulides (2019). Cho (2016) proposes a new perspective on the relationships among alpha and other reliability coefficients, as well as a new naming convention.

Alpha is not a direct measure of unidimensionality

A common misconception is that strong alpha is evidence of unidimensionality, that is, a single construct or factor underlying a set of items. The literature has thoroughly addressed this point, so I’ll just summarize by saying that

alpha assumes unidimensionality, and works best when it’s present, but
strong alpha does not confirm that a scale is unidimensional; alpha can be strong with a multidimensional scale.

These and related points have led some (e.g., Sijtsma, 2009) to recommend against the term internal consistency reliability because it suggests that alpha reflects the internal structure of the test, which it does not do, at least not consistently (Cortina, 1993).
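To illustrate the point about multidimensional scales, here is a sketch using a hypothetical population correlation matrix for 12 standardized items, where six items load on each of two completely uncorrelated factors. The loadings and test length are assumptions chosen for the example.

```r
# Hypothetical population: 12 standardized items, with 6 items
# loading 0.8 on each of two uncorrelated factors
J <- 12
lambda <- cbind(f1 = rep(c(0.8, 0), each = 6),
  f2 = rep(c(0, 0.8), each = 6))

# Model-implied correlation matrix, with zero correlations
# between items from different factors
R_mat <- lambda %*% t(lambda)
diag(R_mat) <- 1

# Alpha from the mean off-diagonal correlation
rho_t <- J^2 * mean(R_mat[upper.tri(R_mat)]) / sum(R_mat)
round(rho_t, 2)
```

Alpha comes out at about 0.83 here, a value many would call strong, even though the scale measures two unrelated constructs.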

Cronbach’s comments on alpha

Cronbach (1951) didn’t invent tau-equivalent reliability or the foundations for what would become coefficient alpha. Instead, he gave an existing coefficient an accessible derivation, as well as a catchy, seemingly preeminent Greek label. The same or similar formulations were available in publications predating Cronbach’s article (for a summary, see the tau-equivalent reliability Wikipedia entry). This isn’t something Cronbach tried to hide, and it’s not necessarily a criticism of his work, but most people are unaware of these details and we’ve gotten carried away with the attribution, a fact that Cronbach himself lamented (Cronbach & Shavelson, 2004, p. 397):

To make so much use of an easily calculated translation of a well-established formula scarcely justifies the fame it has brought me. It is an embarrassment to me that the formula became conventionally known as Cronbach’s alpha.

I suggest we refer to alpha simply as coefficient alpha, or use a more specific term like tau-equivalent reliability. If we need a reference, we should use something more recent, comprehensive, and accessible, like one of the papers mentioned above or a measurement textbook (e.g., Albano, 2020; Bandalos, 2018). I also recommend considering alternative indices, and being more thoughtful about the choice. This may go against the grain, but it makes sense given the history and research.

If abandoning the Cronbach moniker isn’t rebellious enough for you, I also recommend against the omnipresent Likert scale for similar reasons, which I’ll get into later.

[Update May 26, 2020: revised the formulas and added references.]

References

Albano, A. D. (2020). Introduction to Educational and Psychological Measurement Using R. https://thetaminusb.com/intro-measurement-r/

Bandalos, D. L. (2018). Measurement Theory and Applications for the Social Sciences. The Guilford Press.

Cho, E. (2016). Making reliability reliable: A systematic approach to reliability coefficients. Organizational Research Methods, 19, 651–682. https://doi.org/10.1177/1094428116656239

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98–104.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334. https://doi.org/10.1007/BF02310555

Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64, 391–418. https://doi.org/10.1177/0013164404266386

McNeish, D. (2017). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23, 412–433. https://doi.org/10.1037/met0000144

Raykov, T., & Marcoulides, G. A. (2019). Thanks coefficient alpha, we still need you! Educational and Psychological Measurement, 79, 200–210. https://doi.org/10.1177/0013164417725127

Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma. Psychometrika, 74, 145–154. https://doi.org/10.1007/s11336-008-9102-z

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74, 107–120. https://doi.org/10.1007/s11336-008-9101-0

April 25, 2020 (updated January 23, 2021)

Article in Frontiers in Computer Science

A colleague and I recently published an open-access article in Frontiers, titled Development and Evaluation of the Nebraska Assessment of Computing Knowledge. Abstract and link to full text are below.

One way to increase the quality of computing education research is to increase the quality of the measurement tools that are available to researchers, especially measures of students’ knowledge and skills. This paper represents a step toward increasing the number of available thoroughly-evaluated tests that can be used in computing education research by evaluating the psychometric properties of a multiple-choice test designed to differentiate undergraduate students in terms of their mastery of foundational computing concepts. Classical test theory and item response theory analyses are reported and indicate that the test is a reliable, psychometrically-sound instrument suitable for research with undergraduate students. Limitations and the importance of using standardized measures of learning in education research are discussed.

https://www.frontiersin.org/articles/10.3389/fcomp.2020.00011/full

March 25, 2020 (updated August 30, 2021)

Teaching and Learning Online During the Lockdown

Here are some pointers on transitioning college coursework to online delivery. I’m not an expert on the topic, and have never done it under threat of a pandemic, but I did figure out the basics through trial and error while teaching at Nebraska. For a few years I offered my intro measurement course via traditional in-person instruction in the spring semester and then online in the summer. Here’s what I learned.

Use technology to strengthen the online experience, not mimic the physical one

There’s no way to replicate the in-person experience from a distance, and that shouldn’t be the goal. Instead, we should become familiar with the available technology and consider how it can best be used to support the course objectives. When meeting in the same physical space, we’re hearing the same sounds and breathing the same air. We’re often seeing detailed facial expressions and picking up on subtle cues. None of this can be captured through a pixelated video call or static discussion post.

The learning environment is different online, and we should choose our technology based on its strengths.

Video or conference calls are good for presentations and lecture, and for efficiently communicating general information to a large audience.
Recorded presentations are good for presenting material in depth, since students can review as many times as needed. In this way, recordings can sometimes be more effective than live lecture, as exemplified in the flipped classroom movement.
Discussion forums can give everyone a voice, and are especially useful for encouraging thoughtful comments and questions that may be difficult for students to generate impromptu in class.

Prioritize accessibility

Providing all students with effective access to course materials is paramount across delivery modes, but when switching to online we may take it for granted that a given technology works equally well for all students. Here are some questions to consider.

Do all students have regular high-speed internet access as well as uninterrupted access to the required computing technology at home?
Does an increased digital reading load differentially impact multilingual students or students with visual impairment?
Do online formats enable less formal communication and the use of jargon that may be unfamiliar to international students?
Is getting to a testing center feasible for all students?

Facilitate independent study

My online courses involve much more independent work, as online allows students to proceed at their own pace. I expect this will be especially helpful when we’re on lockdown with extra responsibilities and different schedules at home. The trade-off with increased independence is decreased collaboration and less structure in pacing. It’s difficult to work together on an assignment or share the scoring key if some students haven’t completed it.

Here’s how my courses tend to work.

I try to post all of the course materials (slides, readings, assignments, rubrics, due dates) within the first week of class.
Group work is challenging from a distance, especially when students have never met in person and when they have very different schedules. I try to simplify it or avoid it online.
If I do have group assignments, they’re either brief or pushed to the end of the course. Students know about them early on, so they can plan accordingly. And students must commit to being caught up by the time a group assignment is given.
I still have a schedule for readings and assignments, but some of the due dates are flexible. I’ve found that the majority of students follow the suggested pacing, but some take advantage of the flexibility, especially in my summer courses. It might make sense to have some hard deadlines, with softer ones in between.

Lockdown considerations

UC Davis has provided lots of resources for teaching and learning during the lockdown, which I expect will extend into summer and may impact fall instruction as well. Many of these generalize to instruction in any college course. This link organizes most of what Davis has provided.

https://keepteaching.ucdavis.edu

March 5, 2020 (updated October 13, 2021)

Visualizing Conditional Standard Error in the GRE

Below is some R code for visualizing measurement error across the GRE score scale, plotted against percentiles. Data come from an ETS report at https://www.ets.org/s/gre/pdf/gre_guide.pdf.

The plot shows conditional standard error of measurement (SEM) for GRE verbal scores. SEM is the expected average variability in scores attributable to random error in the measurement process. For details, see my last post.

Here, the SEM is conditional on GRE score, with more error evident at lower verbal scores, and less at higher scores where measurement is more precise. As with other forms of standard error, the SEM can be used to build confidence intervals around an estimate. The plot has ribbons for 68% and 95% confidence intervals, based on +/- 1 and 2 SEM.

# Load ggplot2 package
library("ggplot2")

# Put percentiles into data frame, pasting from ETS
# report Table 1B
pct <- data.frame(gre = 170:130,
  matrix(c(99, 96, 99, 95, 98, 93, 98, 90, 97, 89,
    96, 86, 94, 84, 93, 82, 90, 79, 88, 76, 86, 73,
    83, 70, 80, 67, 76, 64, 73, 60, 68, 56, 64, 53,
    60, 49, 54, 45, 51, 41, 46, 37, 41, 34, 37, 30,
    33, 26, 29, 23, 26, 19, 22, 16, 19, 13, 16, 11,
    14, 9, 11, 7, 9, 6, 8, 4, 6, 3, 4, 2, 3, 2, 2,
    1, 2, 1, 1, 1, 1, 1, 1, 1),
    nrow = 41, byrow = TRUE))

# Add variable names
colnames(pct)[2:3] <- c("pct_verbal", "pct_quant")

# Subset and add conditional SEM from Table 5E
sem <- data.frame(pct[c(41, seq(36, 1, by = -5)), ],
  sem_verbal = c(3.9, 3.5, 2.9, 2.5, 2.3, 2.1, 2.1,
    2.0, 1.4),
  sem_quant = c(3.5, 2.9, 2.4, 2.2, 2.1, 2.0, 2.1,
    2.1, 1.0),
  row.names = NULL)

# Plot percentiles on x and GRE on y with
# error ribbons
ggplot(sem, aes(pct_verbal, gre)) +
  geom_ribbon(aes(ymin = gre - sem_verbal * 2,
    ymax = gre + sem_verbal * 2),
    fill = "blue", alpha = .2) +
  geom_ribbon(aes(ymin = gre - sem_verbal,
    ymax = gre + sem_verbal),
    fill = "red", alpha = .2) +
  geom_line()

February 7, 2020 (updated March 11, 2021)

Confidence Intervals in Measurement vs Political Polls

In class this week we covered reliability and went through some examples of how measurement error, the opposite of reliability, can be converted into a standard error for building confidence intervals (CI) around test scores. Students are often surprised to learn that, despite a moderate to strong reliability coefficient, a test can still introduce an unsettling amount of error into results.

Measurement

Here’s an example from testing before I get to error in political polling. The GRE verbal reasoning test has an internal consistency reliability of 0.92, with associated standard error of measurement (SEM) of 2.4 (see Table 5A in this ETS report).

Let’s say you get a score of $X = 154$ on the verbal reasoning test. This puts you in the 64th percentile among the norming sample (Table 1B). We can build a CI around your score as

$$CI = X \pm SEM \times z$$

$$CI = 154 \pm 2.4 \times 1.96$$

where the z of 1.96 comes from the unit normal curve.

After rounding, we have a range of about 10 points within which we’re 95% confident your true score should fall. That’s 154 – 4.7 = 149.3 at the bottom (41st percentile after rounding) and 154 + 4.7 = 158.7 at the top (83rd percentile after rounding).
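The interval calculation can be sketched in R as follows. The function name `score_ci` and its arguments are made up for this example, and `qnorm()` supplies the z value from the normal curve rather than a hard-coded 1.96.

```r
# Confidence interval around an observed score, given the SEM
# (score_ci is a hypothetical helper, not from any package)
score_ci <- function(x, sem, level = 0.95) {
  # Two-sided critical value from the standard normal curve
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = x - z * sem, upper = x + z * sem)
}

# GRE verbal example: observed score of 154, SEM of 2.4
ci_verbal <- score_ci(154, 2.4)
round(ci_verbal, 1)
```

Changing `level` to 0.68 gives an interval of roughly plus or minus one SEM.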

I’ll leave it as an exercise for you to run the same calculations on the analytical writing component of the GRE, which has a reliability of 0.86 and standard error of 0.32. In either case, the CI will capture a significant chunk of scores, which calls into question the utility of tests like the GRE for comparisons among individuals.

I should mention that the GRE is based on item response theory, which presents error as a function of the construct being measured, where the SEM and CI would vary over the score scale. The example above is simplified to a single overall reliability and SEM.

Polling

Moving on to political polls, Monmouth University is reporting the following results for Democratic candidate preference from a phone poll conducted this week with 503 prospective voters in New Hampshire (full report here).

Sanders with 24%
Buttigieg with 20%
Biden with 17%
Warren with 13%

This is the ranking for the top four candidates. Percentages decrease for the remaining choices.

Toward the end of the article, the margin of error is reported as 4.4 percentage points. This was probably found based on a generic standard error (SE), calculated as

$$SE = \frac{\sqrt{p \times q}}{\sqrt{n}}$$

$$SE = \frac{\sqrt{.5 \times .5}}{\sqrt{503}}$$

where p is the proportion (percentage / 100), set here to 0.5 because that value produces the largest possible variability and SE, and q = 1 – p. This gives us SE = 0.022 or 2.2%.

The 4.4, found with $SE \times 1.96$, is only half the width of the confidence interval. So, we’re 95% confident that the actual result for Sanders falls between 24 – 4.4 = 19.6% and 24 + 4.4 = 28.4%, a range which captures the result for Buttigieg.
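Here is a short R sketch of the margin of error calculation, using the sample size and percentages reported above. Object names are arbitrary.

```r
# Sample size and the worst-case proportion
n <- 503
p <- 0.5

# Generic standard error and 95% margin of error
se <- sqrt(p * (1 - p)) / sqrt(n)
moe <- qnorm(0.975) * se

# Interval for Sanders, in percentage points
sanders <- 24
ci_sanders <- sanders + c(lower = -1, upper = 1) * moe * 100

# SE and margin of error as percentages, then the interval
round(c(se = se, moe = moe) * 100, 1)
round(ci_sanders, 1)
```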

All of the point differences for adjacent candidates in the rankings, which are currently being showcased by major news outlets, are within this margin of error.

Note that we could calculate SE and confidence intervals that are specific to the percentages for each candidate. For Sanders we get an SE of 1.9%, for Buttigieg we get 1.8%. We could also use statistical tests to compare points more formally. Whatever the approach, we need to be clearer about the impacts of sampling error and discuss results like these in context.
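The candidate-specific standard errors can be sketched in R as below. The last few lines go one step further with a simple z statistic for the difference between two proportions from the same sample. That choice of test is my assumption about one reasonable approach, not something taken from the Monmouth report.

```r
n <- 503
p <- c(Sanders = 0.24, Buttigieg = 0.20, Biden = 0.17,
  Warren = 0.13)

# Candidate-specific standard errors, as percentages
se <- sqrt(p * (1 - p) / n)
round(se * 100, 1)

# z statistic for Sanders vs Buttigieg. Because both proportions
# come from the same sample, the variance of the difference is
# (p1 + p2 - (p1 - p2)^2) / n
d <- p[["Sanders"]] - p[["Buttigieg"]]
se_d <- sqrt((p[["Sanders"]] + p[["Buttigieg"]] - d^2) / n)
z <- d / se_d
z
```

A z below 1.96 in absolute value would mean the four point gap isn't statistically significant at the 0.05 level.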

December 10, 2019 (updated January 5, 2021)

Should We Drop the SAT/ACT as Requirements for Admissions?

California is reconsidering the role of tests like the SAT and ACT in its college admissions. Around 1,000 other colleges have already gone test-optional according to fairtest.org, but a shift for California would be big news, considering the size of the state university systems, which combined enrolled over 700,000 students for fall 2018.

I’m trying to get up to speed on this somewhat controversial issue. My research in testing focuses mainly on development and validation at the item level, and I’m less familiar with validity research on admissions policies and the broader consequences of test use in this area.

This week, I’ve gone through the following documents, all available online.

A recent LA Times report, Drop the SAT and ACT as a Requirement for Admission, Top UC Officials Say
A 2017 article by Saul Geiser summarizing the issue, Norm-referenced tests and race-blind admissions
A 2019 analysis of UC and CSU data by Michal Kurlaender and Kramer Cohen, Predicting College Success: How Do Different High School Assessments Measure Up?
A statement on Misconceptions about Group Differences in Average Test Scores from the National Council on Measurement in Education in response to the UC news
A summary of Validity Studies by the College Board, which owns the SAT

These documents seem to capture the gist of the debate, which centers on a few key issues. I’ll summarize here and then dig deeper in future posts.

Those in favor of norm-referenced admissions tests argue that the tests contribute to predicting undergraduate performance above and beyond other admissions variables like high school GPA and criterion-referenced tests, and they do so in a standardized way, with proctored administration, and using metrics that are independent of program or state.

Those in favor of dropping admissions tests, or making them optional, argue that the tests are more reflective of group differences than are other admissions variables. The costs, in terms of potential for bias, outweigh the benefits, in terms of incremental increases in predictive power.

In the end, the main question is, do we need a standardized measure of general content in the admissions process?

If so, what other options meet this need, and are available on an international scale, but don’t suffer from the same limitations as the SAT and ACT? Alternatively, is there room for improvement in current norm-referenced tests?

If not, how do we address limitations in the remaining admissions metrics, some of which may also be susceptible to misuse?

December 5, 2019

Demo Code from Recent Paper in APM

A colleague and I recently published a paper in Applied Psychological Measurement titled Linking With External Covariates: Examining Accuracy by Anchor Type, Test Length, Ability Difference, and Sample Size. A pre-print copy is available here.

As the title suggests, we looked at some psychometric situations wherein the process of linking measurement scales could benefit from external information. Here’s the abstract.

Research has recently demonstrated the use of multiple anchor tests and external covariates to supplement or substitute for common anchor items when linking and equating with nonequivalent groups. This study examines the conditions under which external covariates improve linking and equating accuracy, with internal and external anchor tests of varying lengths and groups of differing abilities. Pseudo forms of a state science test were equated within a resampling study where sample size ranged from 1,000 to 10,000 examinees and anchor tests ranged in length from eight to 20 items, with reading and math scores included as covariates. Frequency estimation linking with an anchor test and external covariate was found to produce the most accurate results under the majority of conditions studied. Practical applications of linking with anchor tests and covariates are discussed.

The study is somewhat novel in its use of resampling at both the person and item levels. The result is a different sample of test takers taking a different sample of items at each study replication. I created an R Markdown file (saved as txt) that demonstrates the process for a reduced set of conditions.

multi-anchor-demo.txt
multi-anchor-demo.html
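For a rough idea of what resampling at both levels can look like, here is a minimal R sketch with simulated data. It is not the code from the paper or the demo file. The matrix dimensions, the function name `resample_once`, and the choice to sample persons with replacement and items without are all assumptions made for illustration.

```r
# Hypothetical scored responses: 1000 test takers by 40 items
set.seed(2019)
n_persons <- 1000
n_items <- 40
scores <- matrix(rbinom(n_persons * n_items, 1, 0.6),
  nrow = n_persons, ncol = n_items)

# One replication: sample persons with replacement and
# items without replacement
resample_once <- function(x, n_p, n_i) {
  rows <- sample(nrow(x), n_p, replace = TRUE)
  cols <- sample(ncol(x), n_i)
  x[rows, cols, drop = FALSE]
}

# Run a few replications and store the mean total score
# from each resampled form
reps <- replicate(5,
  mean(rowSums(resample_once(scores, 500, 20))))
reps
```

In a full study, each replication would feed the resampled forms into the linking methods being compared instead of just taking a mean.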

November 5, 2018

Getting Things Started

This is the first blog post on my new academic site. The main purpose of the site is to share educational and psychological measurement info and resources developed through my teaching and research.

My intro measurement textbook is available in HTML and PDF formats at https://www.thetaminusb.com/intro-measurement-r/. The book is designed for advanced undergraduate or beginner graduate courses in the theory and applications of measurement in education and psychology. Instructions and examples are given throughout on conducting psychometric analyses in R. If you’d like to contribute, email me or see the GitHub repository at https://github.com/talbano/intro-measurement.

Note that the book is going to be updated in December 2018 with revisions to the chapters on factor analysis, validity, and test evaluation. A Spanish translation is also underway and should be ready for 2019 at https://www.thetaminusb.com/intro-measurement-r-sp/.

I’m also working on forums for questions and conversations around measurement topics, deriving from the book, and equating topics, deriving from my R package and documentation. Stay tuned for links.