These guidelines for developing selected-response (SR) items are organized into four groups addressing issues in overall item content, formatting and style, creating the item stem, and creating response options. The guidelines are best understood within the context of an actual educational assessment. So, before using them to develop your own items, you should first establish
- The purpose for the assessment you are developing,
- An outline specifying the structure and format of your assessment, and
- Learning objectives that meet your assessment purposes and fit within your outline.
The overarching goal of these guidelines is to develop items that measure the target construct for a test or assessment while minimizing the impact of potentially irrelevant constructs.
The guidelines are based on
Rodriguez, M. C., & Albano, A. D. (2017). The college instructor’s guide to writing test items: Measuring student learning. New York, NY: Routledge.
Content
Item content refers generally to the information, materials, and cognitive tasks addressed within an item. An item should be carefully constructed to measure only important and relevant content, as specified within the corresponding learning objective. Any content unrelated to the target learning objective, including opinions, misinformation, trivial information, or content or tasks pertaining to different learning objectives, should be avoided. Different items assessing different content may require different levels of cognitive demand, from simple recall of knowledge to understanding and using complex skills. Cognitive tasks are sometimes referred to as knowledge, skills, and abilities (KSA).
1. Focus on one type of content and cognitive demand.
Effective SR items typically focus on a single task requiring a specific cognitive demand. An item assessing multiple types of content or multiple levels of cognitive demand becomes more complex and thus more difficult for the student. This added complexity may be unrelated to what we’re trying to measure. As a result, students may respond incorrectly not because they lack the targeted KSA, but because they’re unable to disentangle and interpret the item content.
For example, a question on photosynthesis should only assess the intended task, whether it involves the definition of a term like chlorophyll or the interpretation of results from an experiment. Photosynthesis, chlorophyll, and the experiment details constitute item content. Defining and interpreting constitute cognitive tasks. The question then should not require understanding of other contexts, such as solar power, or other terms, such as mitosis, unless they are relevant and specified within the learning objective.
Note that a good learning objective lends itself to a specific type of content and level of cognitive demand.
2. Keep content unique and independent across items.
In addition to targeting specific content from the learning objective, each item should be independent of the others. When the content of one item depends on another, students can use this dependence to deduce the correct response without actually having the targeted KSA.
3. Assess important content, avoiding content that is trivial, overly specific, or overly general.
Trivial, specific, or general content often appears when an item does not align well with its learning objective, or when the learning objective itself is flawed. A good learning objective focuses on specific information or generalities only when they are important to the course or curriculum.
4. Use novel material and applications to engage higher-level thinking and depth of knowledge.
Items tend to assess low levels of cognitive demand when they include basic content that is more likely to be recalled or recognized by students. Comprehension and understanding can be assessed by incorporating novel contexts, examples, and demonstrations, where students have to apply KSA in a situation they may not have encountered.
5. Avoid referencing unqualified opinions.
Items referencing unqualified opinions can be difficult or impossible to answer objectively. Consider a question about the best book of all time. Without a clear metric for what constitutes best, such as best according to a particular reviewer or best based on total copies sold, no response option can be identified as most correct.
Items referencing opinions, whether qualified or not, may also violate guideline 3 by assessing trivial, overly specific, or overly general content.
6. Avoid trick items that intentionally mislead students.
Trick items don’t belong in assessments that students are expected to take seriously. Items should be difficult only because they involve challenging content and tasks, not because they introduce misleading or deceptive information.
Format and style
Guidelines on formatting and style cover best practices for the wording and presentation of item content.
7. Format the item vertically instead of horizontally, with the response options listed in a column.
Presenting an item and response options horizontally may save space, but it can lead to difficulties aligning response options with their labels. This tends to be problematic for younger test takers. To avoid confusion, SR items are typically presented vertically.
8. Edit and proof all items, including for correct grammar, punctuation, capitalization, and spelling.
Items should be proofread and edited for correctness and clarity of wording. Misspellings, grammatical errors, and poor word choice can be distracting at the very least. If possible, items should be reviewed by peers with expertise in the subject matter.
9. Minimize the amount of reading required by each item.
Reading load should be minimized for items that are not intended to measure reading ability or related constructs. By removing unnecessary text, we can focus more on important content while reducing testing time. Items that involve scenarios or examples often include superfluous details and descriptions that can be eliminated without changing the nature of the item. Any information that is not essential to identifying the correct response should be removed.
10. Keep vocabulary and linguistic complexity appropriate for the target construct and target population.
Our target construct and population should lead us to an appropriate level of linguistic complexity when writing an item. The appropriate level of complexity will depend on the age and reading ability of students, and the role of reading ability within the learning objectives. Reading ability should not interfere with item content. Knowledge of difficult vocabulary should only be required when it is specified within the learning objectives.
The stem
The stem of an SR item includes the initial description, context, and question statement that a student refers to when choosing an option. Whereas the guidelines on item content, formatting, and style pertain to both the stem and the response options, the two guidelines in this section pertain specifically to writing the item stem.
11. Include the main idea for the item within the stem instead of the options.
The stem should usually include the majority of the information in an item. Long and wordy response options are more difficult to compare and contrast, especially when it’s unclear what question they are addressing. An effective SR item stem can typically stand alone as a constructed-response item; if you remove the response options and minimally reword the stem, a student should still be able to respond.
12. Word the stem positively, avoiding negative phrasing such as NOT or EXCEPT.
Negatives in an item stem can be easily overlooked, resulting in an incorrect response even from students with the required KSA. Negatives can also increase cognitive load. It is usually best to phrase the stem positively rather than negatively. When negatives cannot be avoided, they should appear capitalized and in boldface.
The response options
Writing response options can be the most challenging part of constructing an SR item. Incorrect options, also known as distractors, can be especially difficult to write. When an item turns out to be too easy, it is often because the correct option stands out from the rest or because the incorrect options can be easily discounted.
13. Use only options that are plausible and discriminating. Three options are usually sufficient.
Good response options follow directly from the question or segue provided in the stem. If the stem asks for the most appropriate interpretation of a statistical result, each option should be worded as a plausible interpretation. If the stem describes a classroom assessment procedure, and then asks for the term that correctly defines it, each response should be a term related to assessment.
Discriminating items are ones that distinguish well between students of high and low ability. Students with the required KSA should select the correct option, and other students should select the incorrect options. When this trend is reversed or not apparent, the item provides less useful information about the construct.
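One common way to quantify discrimination is the corrected item-total correlation: the correlation between scores on a single item and students’ total scores on the remaining items. The Python sketch below is only an illustration of that idea; the scored responses and the function name are hypothetical, not part of the original guidelines.

```python
import numpy as np

# Hypothetical scored responses: rows are students, columns are items
# (1 = correct, 0 = incorrect).
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

def corrected_item_total(scores: np.ndarray, item: int) -> float:
    """Correlate one item with the total score on the remaining items.

    Values near zero or below suggest the item does not distinguish
    between higher- and lower-scoring students.
    """
    item_scores = scores[:, item]
    rest_total = scores.sum(axis=1) - item_scores
    return float(np.corrcoef(item_scores, rest_total)[0, 1])

for j in range(scores.shape[1]):
    print(f"Item {j + 1}: discrimination = {corrected_item_total(scores, j):.2f}")
```

Items with low or negative values warrant a second look at both the keyed option and the distractors.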
Although increasing the number of options will tend to make an item more difficult, you shouldn’t feel bound to write four or more options per item. The number of options can differ by item, and three options may suffice.
14. Ensure that only one option per item is the correct or keyed response.
An SR item that asks for a single response may end up having more than one correct answer. This can happen when the stem does not provide sufficient information for disqualifying distractors as incorrect. The problem is best avoided through proofreading and peer review.
15. Vary the location of the correct response according to the number of options.
Patterns in the placement of correct responses should be avoided. Instead, the placement of the correct choice should be varied across items, ideally with an even distribution.
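As a quick check on this guideline, you can tally how often each position holds the keyed response across a form and look for positions that are over- or under-used. The Python sketch below assumes a hypothetical answer key stored as a list of option letters; the threshold for flagging an imbalance is arbitrary.

```python
from collections import Counter

# Hypothetical answer key for a 20-item form; each letter marks the keyed option.
answer_key = ["B", "D", "A", "C", "B", "A", "D", "C", "B", "A",
              "C", "D", "B", "A", "C", "D", "A", "B", "C", "D"]

position_counts = Counter(answer_key)
expected = len(answer_key) / len(position_counts)

for option in sorted(position_counts):
    count = position_counts[option]
    flag = "  <- review placement" if abs(count - expected) > 2 else ""
    print(f"Option {option}: keyed {count} times (expected about {expected:.0f}){flag}")
```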
16. Put options in logical or numerical order when possible.
When responses are all numeric, they should be formatted similarly, for example, with the same number of decimal places. They should also be presented in ascending or descending numerical order. Text responses that capture labels or terms should be ordered alphabetically when possible.
17. Keep options independent; options should not overlap in content.
Overlapping or dependent response options typically result in more than one option being correct. For example, consider a question asking for the correct scientific classification of a domestic cat. The options Felis and Mammalia would both be correct, as one is a subset of the other. The question would be improved by writing mutually exclusive options. The stem itself should also be clarified to ask for a specific level of classification.
18. Avoid using the options all of the above, none of the above, and I don’t know.
Research shows that these response options are difficult to implement, and tend to lower the quality of SR items. Alternatives to all of the above and none of the above are multiple true/false items and items where students select all that apply. These SR item types are more complicated to score, but can provide more reliable results.
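To illustrate why these formats are more complicated to score, the Python sketch below shows two common approaches for a select-all-that-apply item: all-or-nothing credit and per-option partial credit. The item, keyed set, and scoring rules are hypothetical examples rather than recommendations from the source.

```python
def score_all_or_nothing(selected: set[str], keyed: set[str]) -> float:
    """Full credit only when the selected options exactly match the keyed set."""
    return 1.0 if selected == keyed else 0.0

def score_partial(selected: set[str], keyed: set[str], n_options: int) -> float:
    """Credit for each option handled correctly: selected if keyed, left alone if not."""
    all_options = {chr(ord("A") + i) for i in range(n_options)}
    correct_decisions = sum(
        1 for option in all_options
        if (option in selected) == (option in keyed)
    )
    return correct_decisions / n_options

# Hypothetical select-all-that-apply item with keyed options A and C out of A-D.
keyed = {"A", "C"}
response = {"A", "C", "D"}
print(score_all_or_nothing(response, keyed))        # 0.0
print(score_partial(response, keyed, n_options=4))  # 0.75
```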
19. Phrase options positively, avoiding negatives such as NOT.
Response options, like the SR stem, should be phrased positively whenever possible. The rationale is the same as with the guideline for positive phrasing in the stem. Negatives can be easily overlooked and can make it more difficult to compare and contrast response options.
20. Make options as similar as possible in length, grammatical structure, and content.
Teachers often provide more justification for the correct response than for the incorrect ones, whether to ensure that the correct response is completely correct or because incorrect options are more difficult to write. The longer response option then stands out from the rest and tends to be the correct one. Students can pick up on this cue and respond correctly without the requisite KSA. In addition to being similar in length, options should also be similar in grammatical structure and content.
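One way to catch this pattern before a test is administered is to compare the length of the keyed option with the lengths of the distractors. The Python sketch below is a rough screening aid; the option text and the 1.5x threshold are hypothetical.

```python
# Hypothetical item with keyed option "B"; the option text is placeholder content.
options = {
    "A": "Mitosis",
    "B": "The process by which chlorophyll absorbs light energy to drive the synthesis of glucose",
    "C": "Respiration",
    "D": "Osmosis",
}
keyed = "B"

distractor_lengths = [len(text) for letter, text in options.items() if letter != keyed]
mean_distractor_length = sum(distractor_lengths) / len(distractor_lengths)

# Flag the item when the keyed option is much longer than the typical distractor.
if len(options[keyed]) > 1.5 * mean_distractor_length:
    print("Keyed option is noticeably longer than the distractors; consider rebalancing.")
```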
21. Avoid specific determiners, including always, never, completely, and absolutely.
Extreme determiners like these more often signal incorrect responses than correct ones. Students can recognize that less extreme options are more likely correct.
22. Avoid repetition and association, where options are identical to or resemble parts of the stem.
Response options are more likely to be correct when they contain the same or similar wording as the stem. For example, an item could ask for the type of reliability that is measured within a single test form. The option internal consistency may attract more attention than stability simply because of the association between the terms within and internal.
23. Avoid groups of options that are similar in structure or terminology.
Subsets of item responses that share the same wording can be distracting. In some cases, they can also appear to include or exclude the correct option, depending on what features make them similar. Groupings among options should be avoided in favor of consistency in length, structure, and content across all options.
24. Avoid blatantly absurd, ridiculous options.
Options should not be used if they can easily be discounted by students. Absurd options are sometimes included for humor or to round out the number of options so that it matches other items. Neither reason warrants their use.
25. Make all incorrect options plausible. Base incorrect options on typical errors of test takers.
This extends guideline 13 by focusing specifically on incorrect options, which should be designed to capture common misconceptions and errors among students. When distractors are written to target common student errors, rather than unrealistic responses, they can provide more diagnostic information to inform future teaching and learning. Items with distractors based on common student errors can also be more discriminating.
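Distractor quality can also be examined after a test is given by tabulating how often each option is chosen in lower- and higher-scoring groups; useful distractors draw mainly lower-scoring students, while the keyed option should draw the higher-scoring ones. The Python sketch below uses hypothetical response data and a simple median split.

```python
from collections import Counter

# Hypothetical responses to one item: (chosen option, student's total test score).
responses = [("A", 12), ("B", 25), ("B", 28), ("C", 15), ("A", 10),
             ("B", 30), ("D", 14), ("C", 22), ("B", 27), ("A", 11)]
keyed = "B"

# Split students at the median total score into lower and upper groups.
totals = sorted(score for _, score in responses)
median = totals[len(totals) // 2]
lower = Counter(option for option, score in responses if score < median)
upper = Counter(option for option, score in responses if score >= median)

for option in sorted(set(lower) | set(upper)):
    label = "keyed" if option == keyed else "distractor"
    print(f"{option} ({label}): lower group {lower[option]}, upper group {upper[option]}")
```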
26. Avoid the use of humor.
In smaller scale testing, such as low-stakes classroom assessment where teachers are familiar with their students, humor may not be a problem. In these cases, humor may reduce test anxiety for some students. However, it may end up being a distraction for others, which could have implications for fairness. Given the trend toward higher stakes in testing, even with teacher-made assessments, it is recommended that humor be avoided.