Sophisticated analytic strategies have been proposed as viable methods to improve the quantification of student improvement and to assist educators in making treatment decisions. The performance of three categories of latent growth modeling techniques (linear, quadratic, and dual change) in capturing growth in oral reading fluency in response to a 12-week structured supplemental reading intervention among 280 grade three students at risk for learning disabilities was compared. Although the most complex approach (dual change) yielded the best model fit indices, there were few practical differences between its predicted values and those from simpler linear models. A discussion of the relative benefits and appropriateness of increasingly complex growth modeling strategies for evaluating individual student responses to intervention is offered.
{"title":"Weighing the Value of Complex Growth Estimation Methods to Evaluate Individual Student Response to Instruction","authors":"Ethan R. Van Norman","doi":"10.1111/emip.12579","DOIUrl":"10.1111/emip.12579","url":null,"abstract":"<p>Sophisticated analytic strategies have been proposed as viable methods to improve the quantification of student improvement and to assist educators in making treatment decisions. The performance of three categories of latent growth modeling techniques (linear, quadratic, and dual change) to capture growth in oral reading fluency in response to a 12-week structured supplemental reading intervention among 280 grade three students at-risk for learning disabilities were compared. Although the most complex approach (dual-change) yielded the best model fit indices, there were few practical differences between predicted values from simpler linear models. A discussion to carefully consider the relative benefits and appropriateness of increasingly complex growth modeling strategies to evaluate individual student responses to intervention is offered.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"42 4","pages":"33-41"},"PeriodicalIF":2.0,"publicationDate":"2023-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12579","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47957423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Research shows that the intensity of high school course-taking is related to postsecondary outcomes. However, there are various approaches to measuring the intensity of students' course-taking. This study presents new measures of coursework intensity that rely on differing levels of quantity and quality of coursework. We used these new indices to provide a current description of variations in high school course-taking across grades and student subgroups using a nationally representative dataset, the High School Longitudinal Study of 2009. Results showed that, for measures emphasizing the quality of coursework, gaps in coursework for underserved students were larger and there was less upward movement in rigor across grades.
{"title":"Does It Matter How the Rigor of High School Coursework Is Measured? Gaps in Coursework Among Students and Across Grades","authors":"Burhan Ogut, Darrick Yee, Ruhan Circi, Nevin Dizdari","doi":"10.1111/emip.12577","DOIUrl":"10.1111/emip.12577","url":null,"abstract":"<p>Research\u0000shows that the intensity of high school course-taking is related to postsecondary outcomes. However, there are various approaches to measuring the intensity of students’ course-taking. This study presents new measures of coursework intensity that rely on differing levels of quantity and quality of coursework. We used these new indices to provide a current description of variations in high school course-taking across grades and student subgroups using a nationally representative dataset, the High School Longitudinal Study of 2009. Results showed that for measures emphasizing the quality of coursework the gaps in coursework among underserved students were larger and there was less upward movement in rigor across grades.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"42 4","pages":"42-52"},"PeriodicalIF":2.0,"publicationDate":"2023-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43299184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In computer-based tests that allow revision and review, examinees' sequences of item visits and answer changes can be recorded. The variable-length revision logs introduce new complexities to the collected data but, at the same time, provide additional information on examinees' test-taking behavior, which can inform test development and instructions. In the current study, we used recently proposed statistical learning methods for sequence data to provide an exploratory analysis of item-level revision and review log data. Based on revision log data collected from computer-based classroom assessments, common prototypes of revisit and review behavior were identified. The relationship between revision behavior and various item, test, and individual covariates was further explored under a Bayesian multivariate generalized linear mixed model.
{"title":"Exploration of Latent Structure in Test Revision and Review Log Data","authors":"Susu Zhang, Anqi Li, Shiyu Wang","doi":"10.1111/emip.12576","DOIUrl":"10.1111/emip.12576","url":null,"abstract":"<p>In computer-based tests allowing revision and reviews, examinees' sequence of visits and answer changes to questions can be recorded. The variable-length revision log data introduce new complexities to the collected data but, at the same time, provide additional information on examinees' test-taking behavior, which can inform test development and instructions. In the current study, we used recently proposed statistical learning methods for sequence data to provide an exploratory analysis of item-level revision and review log data. Based on the revision log data collected from computer-based classroom assessments, common prototypes of revisit and review behavior were identified. The relationship between revision behavior and various item, test, and individual covariates was further explored under a Bayesian multivariate generalized linear mixed model.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"42 4","pages":"53-65"},"PeriodicalIF":2.0,"publicationDate":"2023-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12576","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45210991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The subjective aspect of standard setting is often criticized, yet data-driven standard-setting methods are rarely applied. Therefore, we applied a mixture Rasch model approach to setting performance standards across several testing programs of various sizes and compared the results to existing passing standards derived from traditional standard-setting methods. We found that heterogeneity in the sample is clearly necessary for the mixture Rasch model approach to standard setting to be useful. While these data-driven models may not be sufficient to determine passing standards on their own, they may provide additional validity evidence to support decision-making bodies entrusted with establishing cut scores. They may also provide a useful tool for evaluating existing cut scores and determining whether they continue to be supported or whether a new study is warranted.
{"title":"Applying a Mixture Rasch Model-Based Approach to Standard Setting","authors":"Michael R. Peabody, Timothy J. Muckle, Yu Meng","doi":"10.1111/emip.12571","DOIUrl":"10.1111/emip.12571","url":null,"abstract":"<p>The subjective aspect of standard-setting is often criticized, yet data-driven standard-setting methods are rarely applied. Therefore, we applied a mixture Rasch model approach to setting performance standards across several testing programs of various sizes and compared the results to existing passing standards derived from traditional standard-setting methods. We found that heterogeneity of the sample is clearly necessary for the mixture Rasch model approach to standard setting to be useful. While possibly not sufficient to determine passing standards on their own, there may be value in these data-driven models for providing additional validity evidence to support decision-making bodies entrusted with establishing cut scores. They may also provide a useful tool for evaluating existing cut scores and determining if they continue to be supported or if a new study is warranted.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"42 3","pages":"5-12"},"PeriodicalIF":2.0,"publicationDate":"2023-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43146823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To assemble a high-quality test, psychometricians rely on subject matter experts (SMEs) to write high-quality items. However, SMEs are not typically given the opportunity to provide input on which content standards are most suitable for multiple-choice questions (MCQs). In the present study, we explored the relationship between perceived MCQ suitability for a given content standard and the associated item characteristics. Prior to item writing, we surveyed SMEs on MCQ suitability for each content standard. Following field testing, we then used SMEs’ average ratings for each content standard to predict item characteristics for the tests. We analyzed multilevel models predicting item difficulty (p value), discrimination, and nonfunctioning distractor presence. Items were nested within courses and content standards. There was a curvilinear relationship between SMEs’ ratings and item difficulty such that very low MCQ suitability ratings were predictive of easier items. After controlling for item difficulty, items with higher MCQ suitability ratings had higher discrimination and were less likely to have one or more nonfunctioning distractors. This research has practical implications for optimizing test blueprints. Additionally, psychometricians may use these ratings to better prepare for coaching SMEs during item writing.
{"title":"Do Subject Matter Experts’ Judgments of Multiple-Choice Format Suitability Predict Item Quality?","authors":"Rebecca F. Berenbon, Bridget C. McHugh","doi":"10.1111/emip.12570","DOIUrl":"10.1111/emip.12570","url":null,"abstract":"<p>To assemble a high-quality test, psychometricians rely on subject matter experts (SMEs) to write high-quality items. However, SMEs are not typically given the opportunity to provide input on which content standards are most suitable for multiple-choice questions (MCQs). In the present study, we explored the relationship between perceived MCQ suitability for a given content standard and the associated item characteristics. Prior to item writing, we surveyed SMEs on MCQ suitability for each content standard. Following field testing, we then used SMEs’ average ratings for each content standard to predict item characteristics for the tests. We analyzed multilevel models predicting item difficulty (<i>p</i> value), discrimination, and nonfunctioning distractor presence. Items were nested within courses and content standards. There was a curvilinear relationship between SMEs’ ratings and item difficulty such that very low MCQ suitability ratings were predictive of easier items. After controlling for item difficulty, items with higher MCQ suitability ratings had higher discrimination and were less likely to have one or more nonfunctioning distractors. This research has practical implications for optimizing test blueprints. Additionally, psychometricians may use these ratings to better prepare for coaching SMEs during item writing.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"42 3","pages":"13-21"},"PeriodicalIF":2.0,"publicationDate":"2023-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12570","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46840085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Validity is a fundamental consideration in test development and test evaluation. The purpose of this study is to define and reify three key aspects of validity and validation, namely test-score interpretation, test-score use, and the claims supporting interpretation and use. This study employed a Delphi methodology to explore how experts in validity and validation conceptualize test-score interpretation, use, and claims. Definitions were developed through multiple iterations of data collection and analysis. Clarifying the language used when conducting validation may make validation more accessible to a broader audience, including but not limited to test developers, test users, and test consumers.
{"title":"Defining Test-Score Interpretation, Use, and Claims: Delphi Study for the Validity Argument","authors":"Timothy D. Folger, Jonathan Bostic, Erin E. Krupa","doi":"10.1111/emip.12569","DOIUrl":"10.1111/emip.12569","url":null,"abstract":"<p>Validity is a fundamental consideration of test development and test evaluation. The purpose of this study is to define and reify three key aspects of validity and validation, namely test-score interpretation, test-score use, and the claims supporting interpretation and use. This study employed a Delphi methodology to explore how experts in validity and validation conceptualize test-score interpretation, use, and claims. Definitions were developed through multiple iterations of data collection and analysis. By clarifying the language used when conducting validation, validation may be more accessible to a broader audience, including but not limited to test developers, test users, and test consumers.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"42 3","pages":"22-38"},"PeriodicalIF":2.0,"publicationDate":"2023-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12569","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44066378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There has been growing interest in machine learning (ML) approaches for detecting test collusion as an alternative to traditional methods. Cluster analysis, an unsupervised learning technique, appears especially promising for detecting group collusion. In this study, the effectiveness of hierarchical agglomerative clustering (HAC) for detecting aberrant test takers on computer-based tests (CBT) is explored. Random forest ensembles are used to evaluate the accuracy of the clustering and to identify the features that are most important for classifying aberrant test takers. Testing data from a certification exam are used. The overlap between the HAC results and exact response matches on incorrectly keyed items in the exam preparation material is examined. Integrating HAC as an investigative tool is a promising way to improve the accuracy with which aberrant test takers are classified.
{"title":"Hierarchical Agglomerative Clustering to Detect Test Collusion on Computer-Based Tests","authors":"Soo Jeong Ingrisone, James N. Ingrisone","doi":"10.1111/emip.12568","DOIUrl":"10.1111/emip.12568","url":null,"abstract":"<p>There has been a growing interest in approaches based on machine learning (ML) for detecting test collusion as an alternative to the traditional methods. Clustering analysis under an unsupervised learning technique appears especially promising to detect group collusion. In this study, the effectiveness of hierarchical agglomerative clustering (HAC) for detecting aberrant test takers on Computer-Based Testing (CBT) is explored. Random forest ensembles are used to evaluate the accuracy of the clustering and find the important features to classify the aberrant test takers. Testing data from a certification exam is used. The level of overlap between the exact response matches on incorrectly keyed items in the exam preparation material and HAC are compared. Integrating HAC as an investigation mean is promising in this field to improve the accuracy of classification of aberrant test takers.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"42 3","pages":"39-49"},"PeriodicalIF":2.0,"publicationDate":"2023-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49103459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Common response-time-based approaches to handling non-effortful response behavior (NRB) in educational achievement tests filter out responses with response times below some threshold. These approaches are, however, limited in that they require a binary decision on whether a response is classified as stemming from NRB, thereby ignoring classification uncertainty in the resulting parameter estimates. We developed a response-time-based probabilistic filtering procedure that overcomes this limitation. The procedure is rooted in the principles of multiple imputation. Instead of creating multiple plausible replacements of missing data, however, multiple data sets are created that represent plausible filtered response data. We propose two different approaches to filtering models, originating in different research traditions and conceptualizations of response-time-based identification of NRB. The first approach uses Gaussian mixture modeling to identify a response time subcomponent stemming from NRB. Plausible filtered data sets are created based on examinees' posterior probabilities of belonging to the NRB subcomponent. The second approach defines a plausible range of response time thresholds and creates plausible filtered data sets by drawing multiple response time thresholds from the defined range. We illustrate the workings of the proposed procedure, as well as differences between the proposed filtering models, based on both simulated data and empirical data from PISA 2018.
{"title":"A Probabilistic Filtering Approach to Non-Effortful Responding","authors":"Esther Ulitzsch, Benjamin W. Domingue, Radhika Kapoor, Klint Kanopka, Joseph A. Rios","doi":"10.1111/emip.12567","DOIUrl":"10.1111/emip.12567","url":null,"abstract":"<p>Common response-time-based approaches for non-effortful response behavior (NRB) in educational achievement tests filter responses that are associated with response times below some threshold. These approaches are, however, limited in that they require a binary decision on whether a response is classified as stemming from NRB; thus ignoring potential classification uncertainty in resulting parameter estimates. We developed a response-time-based probabilistic filtering procedure that overcomes this limitation. The procedure is rooted in the principles of multiple imputation. Instead of creating multiple plausible replacements of missing data, however, multiple data sets are created that represent plausible filtered response data. We propose two different approaches to filtering models, originating in different research traditions and conceptualizations of response-time-based identification of NRB. The first approach uses Gaussian mixture modeling to identify a response time subcomponent stemming from NRB. Plausible filtered data sets are created based on examinees' posterior probabilities of belonging to the NRB subcomponent. The second approach defines a plausible range of response time thresholds and creates plausible filtered data sets by drawing multiple response time thresholds from the defined range. We illustrate the workings of the proposed procedure as well as differences between the proposed filtering models based on both simulated data and empirical data from PISA 2018.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"42 3","pages":"50-64"},"PeriodicalIF":2.0,"publicationDate":"2023-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12567","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46209020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most individuals who take, interpret, design, or score tests are aware that examinees do not always provide full effort when responding to items. However, many such individuals are not aware of how pervasive the issue is, what its consequences are, and how to address it. In this digital ITEMS module, Dr. James Soland helps fill these gaps in the knowledge base. Specifically, the module enumerates how frequently behaviors associated with low effort occur and some of the ways they can distort inferences based on test scores. The module then explains some of the most common approaches for identifying low effort and correcting for it when examining test scores. Brief discussion is also given to how these methods align with, and diverge from, those used to deal with low respondent effort in self-report contexts. Data and code are provided so that readers can implement some of these methods in their own work.
{"title":"Digital Module 32: Understanding and Mitigating the Impact of Low Effort on Common Uses of Test and Survey Scores","authors":"James Soland","doi":"10.1111/emip.12555","DOIUrl":"10.1111/emip.12555","url":null,"abstract":"<p>Most individuals who take, interpret, design, or score tests are aware that examinees do not always provide full effort when responding to items. However, many such individuals are not aware of how pervasive the issue is, what its consequences are, and how to address it. In this digital ITEMS module, Dr. James Soland will help fill these gaps in the knowledge base. Specifically, the module enumerates how frequently behaviors associated with low effort occur, and some of the ways they can distort inferences based on test scores. Then, the module explains some of the most common approaches for identifying low effort, and correcting for it when examining test scores. Brief discussion is also given to how these methods align with, and diverge from, those used to deal with low respondent effort in self-report contexts. Data and code are also provided such that readers can better implement some of the desired methods in their own work.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"42 2","pages":"75-76"},"PeriodicalIF":2.0,"publicationDate":"2023-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45513786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}