Multiple choice (MC) results are inherently probabilistic outcomes, as correct responses reflect a combination of knowledge and guessing, while incorrect responses additionally reflect blunder, a confidently committed mistake. To resolve knowledge objectively from responses in an MC test structure, we evaluated probabilistic models that explicitly account for guessing, knowledge, and blunder using eight assessments (>9,000 responses) from an undergraduate biotechnology curriculum. A Bayesian implementation of the models, aimed at assessing their robustness to prior beliefs in examinee knowledge, showed that explicit estimators of knowledge are markedly sensitive to prior beliefs when scores are the sole input. To overcome this limitation, we examined self-ranked confidence as a proxy knowledge indicator. For our test set, three levels of self-ranked confidence resolved test performance. Responses rated as least confident were correct more frequently than expected from random selection, reflecting partial knowledge, an effect balanced by blunder among the most confident responses. By translating evidence-based guessing and blunder rates into pass marks that statistically qualify a desired level of examinee knowledge, our approach finds practical utility in test analysis and design.
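To make the model structure concrete, the sketch below shows one way a knowledge-guessing-blunder response probability and a statistically qualified pass mark could be computed. It is a minimal illustration under stated assumptions, not the study's implementation: the parameter names (k, g, b), the binomial pass-mark criterion, the significance level, and the example values are all hypothetical.

```python
# Minimal sketch of a knowledge-guessing-blunder model for MC responses,
# assuming independent items. Names and values are illustrative only.
from scipy.stats import binom

def p_correct(k, g, b):
    """Probability of a correct response when an examinee knows the answer
    with probability k, guesses correctly with probability g otherwise,
    and blunders (answers wrongly despite knowing) with rate b."""
    return k * (1.0 - b) + (1.0 - k) * g

def pass_mark(n_items, k_required, g, b, alpha=0.05):
    """Smallest score that an examinee whose knowledge is only k_required
    would reach with probability < alpha under a binomial score model."""
    p = p_correct(k_required, g, b)
    score = 0
    # binom.sf(score - 1, n, p) equals P(X >= score)
    while binom.sf(score - 1, n_items, p) >= alpha:
        score += 1
    return score

# Example: 40 four-option items, guessing rate 0.25, blunder rate 0.05;
# find the pass mark that statistically excludes knowledge below 50%.
print(pass_mark(40, 0.5, 0.25, 0.05))
```

Under this kind of model, raising the assumed blunder rate lowers the expected score of knowledgeable examinees, which in turn lowers the pass mark needed to qualify the same level of knowledge.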
ABSTRACT
Testing programs are confronted with the decision of whether to report individual scores for examinees who have engaged in rapid guessing (RG). As noted by the Standards for Educational and Psychological Testing, this decision should be based on a documented criterion that determines score exclusion. To this end, a number of heuristic criteria (e.g., exclude all examinees with RG rates of 10% or more) have been adopted in the literature. Given that these criteria lack strong methodological support, the objective of this simulation study was to evaluate their appropriateness in terms of the accuracy of individual ability estimates and classifications when manipulating both assessment and RG characteristics. The findings provide evidence that employing a common criterion for all examinees may be an ineffective strategy because a given RG percentage may have differing degrees of biasing effects based on test difficulty, examinee ability, and RG pattern. These results suggest that practitioners may benefit from establishing context-specific exclusion criteria that consider test purpose, score use, and targeted examinee trait levels.
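For illustration, the sketch below shows how a fixed-percentage exclusion heuristic of the kind discussed above could be applied to an examinee's responses. It is an assumed implementation, not taken from the study: the function names, the default threshold, and the upstream response-time flagging it presumes are all hypothetical.

```python
# Minimal sketch of a fixed-percentage rapid-guessing (RG) exclusion rule.
# Each response is assumed to have been flagged as RG or not upstream,
# e.g. by a response-time threshold procedure.
from typing import List

def rg_rate(rg_flags: List[bool]) -> float:
    """Proportion of an examinee's responses flagged as rapid guesses."""
    return sum(rg_flags) / len(rg_flags)

def exclude_score(rg_flags: List[bool], threshold: float = 0.10) -> bool:
    """Return True if the score should be withheld under a fixed-percentage
    criterion; the threshold is a context-specific choice, not a standard."""
    return rg_rate(rg_flags) >= threshold

# Example: 40-item test with 5 responses flagged as RG (12.5%).
flags = [False] * 35 + [True] * 5
print(exclude_score(flags))  # True under the common 10% heuristic
```

The study's point is that a single threshold like this treats all examinees alike, even though the same RG percentage can bias ability estimates very differently depending on test difficulty, examinee ability, and the pattern of the rapid guessing.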