This study examines potential assessment bias related to students' primary language status in PISA 2018. Specifically, multilingual (MLs) and nonmultilingual (non-MLs) students in the United States are compared with regard to their response times and scored responses across three cognitive domains (reading, mathematics, and science). Differential item functioning (DIF) analysis reveals that 7–14% of items exhibit DIF-related problems in scored responses between the two groups, aligning with PISA technical report results. While MLs generally spend more time on the test than non-MLs across cognitive levels, differential response time (DRT) functioning analysis identifies significant time differences in 7–10% of items for students with similar cognitive levels. Notably, items with DIF and DRT issues show limited overlap, suggesting diverse reasons for student struggles on the assessment. A deeper examination of item characteristics is recommended so that test developers and teachers can gain a better understanding of these nuances.
In a 1993 EM:IP article, I made six predictions related to measurement policy issues for the approaching millennium. In this article, I evaluate the accuracy of those predictions (spoiler: I was only modestly accurate) and proffer a mix of seven contemporary predictions, recommendations, and aspirations regarding assessment generally, NCME as an association, and specific psychometric practices.
The goal of personalized assessment is to best fit the needs of each individual test taker, given the assessment purposes. Design-In-Real-Time (DIRTy) assessment reflects the progressive evolution in testing from a single test, to an adaptive test, to an adaptive assessment system. In this article, we lay the foundation for DIRTy assessment and illustrate how it meets the complex needs of each individual learner. The assessment framework incorporates culturally responsive assessment principles, making it innovative with respect to both technology and equity. Key aspects are (a) assessment building blocks called “assessment task modules” (ATMs) linked to multiple content standards and skill domains, (b) gathering information on test takers’ characteristics and preferences and using this information to improve their testing experience, and (c) selecting, modifying, and compiling ATMs to create a personalized test that best serves the testing purpose and the individual test taker.
This paper explores the psychometric properties of scores derived from autogenerated test forms by introducing three conceptual frameworks: Alternate Test Forms, Randomly Parallel Forms, and Approximately Parallel Forms. Each framework provides a distinct perspective on score comparability, definitions of true score and standard error of measurement (SEM), and the necessity of equating. Through a simulation study, we illustrate how these frameworks compare in terms of true scores and SEMs, while also assessing the impact of equating on score comparability across varying levels of form variability. Ultimately, this study seeks to lay the groundwork for implementing scoring practices in large-scale standardized assessments that use autogenerated forms.
Random Equating (RE) and the Heuristic Approach (HA) are two linking procedures that may be used to compare the scores of individuals on two tests that measure the same latent trait when there are no common items or common persons. In this study, RE, which can only be used when the individuals taking the two tests come from the same population, served as a benchmark for evaluating HA, which does not require any distributional assumptions. The comparison was based on both simulated and empirical data. Simulations showed that HA reproduced the link shift connecting the difficulty parameters of the two sets of items well, performing similarly to RE when the distributional assumption was only slightly violated. Empirical results showed satisfactory correspondence between the item and person parameter estimates obtained via the two procedures.