In computerized adaptive testing, overexposure of items in the bank is a serious problem and can lead to item compromise. We develop an item selection algorithm that makes fuller use of the entire bank and reduces item overexposure. The algorithm is based on collaborative filtering and selects an item in two stages. In the first stage, a set of candidate items whose expected performance matches the examinee's current performance is selected. In the second stage, an item approximately matched to the examinee's observed performance is selected from the candidate set. An examinee's expected performance on an item is predicted by autoencoders. Experimental results show that the proposed algorithm outperforms existing item selection algorithms in terms of item exposure while incurring only a small loss in measurement precision.
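A minimal sketch of the two-stage selection described above, not the authors' exact procedure: the autoencoder is abstracted as a hypothetical `predict_performance(observed_perf, item)` callable, and the matching criterion and candidate-set size are illustrative assumptions.

```python
import numpy as np

def select_item(bank, administered, observed_perf, predict_performance,
                k_candidates=20, rng=None):
    """Two-stage collaborative-filtering item selection (sketch).

    Stage 1: keep the k items whose autoencoder-predicted performance is
    closest to the examinee's current performance.
    Stage 2: draw one item from the candidate set; randomizing the final
    pick spreads exposure across the bank.
    """
    rng = rng or np.random.default_rng()
    available = [i for i in bank if i not in administered]
    # Predicted performance on each remaining item (autoencoder output).
    preds = {i: predict_performance(observed_perf, i) for i in available}
    # Stage 1: candidate set matched to current performance.
    candidates = sorted(available,
                        key=lambda i: abs(preds[i] - observed_perf))
    candidates = candidates[:k_candidates]
    # Stage 2: approximate match via a random draw within the candidates.
    return candidates[rng.integers(len(candidates))]
```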
The increasing use of computerization in the testing industry and the need for items that can measure higher-order skills have led educational measurement communities to develop technology-enhanced (TE) items and to conduct validity studies on their use. In line with this goal, the purpose of this study was to collect validity evidence comparing item information functions, expected information values, and measurement efficiencies (item information per time unit) between multiple-choice (MC) and TE items. The data came from K–12 mathematics large-scale accountability assessments. The study results were interpreted mainly descriptively, and the presence of specific patterns between MC and TE items was examined across grades and depth-of-knowledge levels. Although many earlier researchers have argued that TE items are not as efficient as MC items, the results of this study suggest that TE items can provide more information and were at least as efficient as MC items overall.
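To make the efficiency index concrete, here is a small sketch of item information per time unit, assuming a 2PL item response model; the parameter values and response times are illustrative, not from the study's data.

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def measurement_efficiency(theta, a, b, seconds_per_item):
    """Item information per time unit, the efficiency index above."""
    return item_information_2pl(theta, a, b) / seconds_per_item

# Hypothetical MC vs. TE comparison at theta = 0: the TE item is more
# informative but also slower, so efficiency can go either way.
eff_mc = measurement_efficiency(0.0, a=1.2, b=0.0, seconds_per_item=45)
eff_te = measurement_efficiency(0.0, a=1.6, b=0.2, seconds_per_item=90)
print(eff_mc, eff_te)
```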
Sophisticated analytic strategies have been proposed as viable methods for improving the quantification of student improvement and assisting educators in making treatment decisions. We compared the performance of three categories of latent growth modeling techniques (linear, quadratic, and dual change) in capturing growth in oral reading fluency in response to a 12-week structured supplemental reading intervention among 280 grade 3 students at risk for learning disabilities. Although the most complex approach (dual change) yielded the best model fit indices, there were few practical differences between its predicted values and those from simpler linear models. We offer a discussion of the relative benefits and appropriateness of increasingly complex growth modeling strategies for evaluating individual students' responses to intervention.
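A sketch of the model-implied mean trajectories for the three growth specifications, under a common dual change score formulation (constant gain plus a gain proportional to the current level); all parameter values are illustrative, not the study's estimates.

```python
import numpy as np

weeks = np.arange(12)  # 12-week intervention, week 0 = baseline

def linear_traj(t, intercept, slope):
    return intercept + slope * t

def quadratic_traj(t, intercept, slope, quad):
    return intercept + slope * t + quad * t**2

def dual_change_traj(n_weeks, intercept, slope, beta):
    """Each week's gain = constant component + beta * current level."""
    y = [intercept]
    for _ in range(n_weeks - 1):
        y.append(y[-1] + slope + beta * y[-1])
    return np.array(y)

# Illustrative oral-reading-fluency (words/minute) values.
lin = linear_traj(weeks, 40.0, 1.5)
qua = quadratic_traj(weeks, 40.0, 2.0, -0.05)
dcs = dual_change_traj(12, 40.0, 3.0, -0.02)
```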
Research shows that the intensity of high school course-taking is related to postsecondary outcomes. However, there are various approaches to measuring the intensity of students' course-taking. This study presents new measures of coursework intensity that rely on differing levels of the quantity and quality of coursework. We used these new indices to provide a current description of variation in high school course-taking across grades and student subgroups, using a nationally representative dataset, the High School Longitudinal Study of 2009. Results showed that, for measures emphasizing the quality of coursework, gaps among underserved students were larger and there was less upward movement in rigor across grades.
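The abstract does not spell out the indices, so the following is only one plausible form of a quantity-based versus quality-weighted intensity measure; the field names and rigor weights are hypothetical.

```python
def coursework_intensity(courses, rigor_weights, quality=True):
    """Hypothetical intensity index: quantity sums credits; quality
    additionally weights each course by its rigor level."""
    if quality:
        return sum(rigor_weights.get(c["level"], 1.0) * c["credits"]
                   for c in courses)
    return sum(c["credits"] for c in courses)

transcript = [{"level": "AP", "credits": 1.0},
              {"level": "regular", "credits": 1.0}]
weights = {"AP": 3.0, "honors": 2.0, "regular": 1.0}
print(coursework_intensity(transcript, weights, quality=False))  # 2.0
print(coursework_intensity(transcript, weights, quality=True))   # 4.0
```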
In computer-based tests that allow revision and review, examinees' sequences of visits to questions and their answer changes can be recorded. These variable-length revision log data introduce new complexities but, at the same time, provide additional information on examinees' test-taking behavior that can inform test development and instructions. In the current study, we used recently proposed statistical learning methods for sequence data to conduct an exploratory analysis of item-level revision and review log data. Based on revision log data collected from computer-based classroom assessments, we identified common prototypes of revisit and review behavior. We further explored the relationship between revision behavior and various item, test, and individual covariates under a Bayesian multivariate generalized linear mixed model.
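A minimal sketch of how a variable-length revision log might be summarized before modeling; the event schema and feature names are assumptions for illustration, not the study's actual encoding or its sequence-learning method.

```python
from collections import Counter

def revision_features(log):
    """Summarize one examinee's revision log.

    `log` is a list of (question, action) events, e.g. ("Q3", "revisit");
    variable length is collapsed into fixed-length count features.
    """
    actions = Counter(action for _, action in log)
    return {
        "n_events": len(log),
        "n_revisits": actions.get("revisit", 0),
        "n_answer_changes": actions.get("change", 0),
    }

log = [("Q1", "answer"), ("Q2", "answer"),
       ("Q1", "revisit"), ("Q1", "change")]
print(revision_features(log))
```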
The subjective aspect of standard setting is often criticized, yet data-driven standard-setting methods are rarely applied. We therefore applied a mixture Rasch model approach to setting performance standards across several testing programs of various sizes and compared the results to existing passing standards derived from traditional standard-setting methods. We found that sample heterogeneity is clearly necessary for the mixture Rasch approach to standard setting to be useful. While these data-driven models may not be sufficient to determine passing standards on their own, they may provide additional validity evidence to support decision-making bodies entrusted with establishing cut scores. They may also offer a useful tool for evaluating existing cut scores and determining whether they continue to be supported or whether a new study is warranted.
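One small piece of such an approach, sketched under stated assumptions: given posterior class-membership probabilities from an already fitted two-class mixture Rasch model (fitting itself is omitted), locate the total score at which examinees flip to the higher latent class as a candidate data-driven cut. The 0.5 threshold and function name are hypothetical.

```python
import numpy as np

def cut_score_from_mixture(total_scores, p_upper_class):
    """Find the lowest total score at which the posterior probability of
    the upper latent class reaches 0.5 (a hypothetical cut-score rule)."""
    scores = np.asarray(total_scores)
    probs = np.asarray(p_upper_class)
    above = scores[probs >= 0.5]
    return above.min() if above.size else None

# Illustrative posteriors for examinees with total scores 10..19.
scores = np.arange(10, 20)
posteriors = np.array([.05, .10, .20, .35, .48, .61, .75, .86, .93, .97])
print(cut_score_from_mixture(scores, posteriors))  # 15
```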
To assemble a high-quality test, psychometricians rely on subject matter experts (SMEs) to write high-quality items. However, SMEs are not typically given the opportunity to provide input on which content standards are most suitable for multiple-choice questions (MCQs). In the present study, we explored the relationship between perceived MCQ suitability for a given content standard and the associated item characteristics. Prior to item writing, we surveyed SMEs on MCQ suitability for each content standard. Following field testing, we then used SMEs' average ratings for each content standard to predict item characteristics for the tests. We fit multilevel models predicting item difficulty (p value), discrimination, and the presence of nonfunctioning distractors, with items nested within courses and content standards. There was a curvilinear relationship between SMEs' ratings and item difficulty such that very low MCQ suitability ratings were predictive of easier items. After controlling for item difficulty, items with higher MCQ suitability ratings had higher discrimination and were less likely to have one or more nonfunctioning distractors. This research has practical implications for optimizing test blueprints. Additionally, psychometricians may use these ratings to better prepare for coaching SMEs during item writing.
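A sketch of the kind of multilevel model described (curvilinear fixed effect of suitability, items nested in courses), run on synthetic data; variable names are hypothetical, and the second nesting factor (content standards) is omitted for brevity.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic illustration only: one row per item.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "suitability": rng.uniform(1, 5, n),           # mean SME rating
    "course": rng.integers(0, 10, n).astype(str),  # nesting factor
})
df["p_value"] = (0.5 + 0.10 * df["suitability"]
                 - 0.01 * df["suitability"] ** 2
                 + rng.normal(0, 0.05, n))
df["suitability_sq"] = df["suitability"] ** 2

# Random intercept for course; the quadratic term carries the
# curvilinear suitability-difficulty relationship.
model = smf.mixedlm("p_value ~ suitability + suitability_sq",
                    data=df, groups=df["course"]).fit()
print(model.summary())
```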