Defining Test-Score Interpretation, Use, and Claims: Delphi Study for the Validity Argument
Timothy D. Folger, Jonathan Bostic, Erin E. Krupa
Educational Measurement: Issues and Practice, 42(3), 22–38. Published June 27, 2023. DOI: 10.1111/emip.12569

Validity is a fundamental consideration in test development and test evaluation. The purpose of this study is to define and reify three key aspects of validity and validation, namely test-score interpretation, test-score use, and the claims supporting interpretation and use. This study employed a Delphi methodology to explore how experts in validity and validation conceptualize test-score interpretation, use, and claims. Definitions were developed through multiple iterations of data collection and analysis. By clarifying the language used when conducting validation, this work may make validation more accessible to a broader audience, including but not limited to test developers, test users, and test consumers.
Hierarchical Agglomerative Clustering to Detect Test Collusion on Computer-Based Tests
Soo Jeong Ingrisone, James N. Ingrisone
Educational Measurement: Issues and Practice, 42(3), 39–49. Published June 19, 2023. DOI: 10.1111/emip.12568

There has been growing interest in machine learning (ML) approaches for detecting test collusion as an alternative to traditional methods. Cluster analysis, an unsupervised learning technique, appears especially promising for detecting group collusion. In this study, the effectiveness of hierarchical agglomerative clustering (HAC) for detecting aberrant test takers in computer-based testing (CBT) is explored. Random forest ensembles are used to evaluate the accuracy of the clustering and to identify the features most important for classifying aberrant test takers. Testing data from a certification exam are used. The level of overlap between HAC cluster membership and exact response matches on incorrectly keyed items in the exam preparation material is examined. Integrating HAC as an investigative tool is a promising way to improve the accuracy with which aberrant test takers are classified.
A Probabilistic Filtering Approach to Non-Effortful Responding
Esther Ulitzsch, Benjamin W. Domingue, Radhika Kapoor, Klint Kanopka, Joseph A. Rios
Educational Measurement: Issues and Practice, 42(3), 50–64. Published June 16, 2023. DOI: 10.1111/emip.12567

Common response-time-based approaches for non-effortful response behavior (NRB) in educational achievement tests filter responses that are associated with response times below some threshold. These approaches are, however, limited in that they require a binary decision on whether a response is classified as stemming from NRB, thus ignoring potential classification uncertainty in the resulting parameter estimates. We developed a response-time-based probabilistic filtering procedure that overcomes this limitation. The procedure is rooted in the principles of multiple imputation. Instead of creating multiple plausible replacements of missing data, however, multiple data sets are created that represent plausible filtered response data. We propose two different approaches to filtering models, originating in different research traditions and conceptualizations of response-time-based identification of NRB. The first approach uses Gaussian mixture modeling to identify a response time subcomponent stemming from NRB. Plausible filtered data sets are created based on examinees' posterior probabilities of belonging to the NRB subcomponent. The second approach defines a plausible range of response time thresholds and creates plausible filtered data sets by drawing multiple response time thresholds from the defined range. We illustrate the workings of the proposed procedure, as well as differences between the proposed filtering models, using both simulated data and empirical data from PISA 2018.
Digital Module 32: Understanding and Mitigating the Impact of Low Effort on Common Uses of Test and Survey Scores
James Soland
Educational Measurement: Issues and Practice, 42(2), 75–76. Published June 9, 2023. DOI: 10.1111/emip.12555

Most individuals who take, interpret, design, or score tests are aware that examinees do not always provide full effort when responding to items. However, many such individuals are not aware of how pervasive the issue is, what its consequences are, and how to address it. In this digital ITEMS module, Dr. James Soland helps fill these gaps in the knowledge base. Specifically, the module enumerates how frequently behaviors associated with low effort occur and some of the ways they can distort inferences based on test scores. The module then explains some of the most common approaches for identifying low effort and correcting for it when examining test scores. Brief discussion is also given to how these methods align with, and diverge from, those used to address low respondent effort in self-report contexts. Data and code are provided so that readers can implement some of these methods in their own work.
{"title":"Visualizing Distributions Across Grades","authors":"Yuan-Ling Liaw","doi":"10.1111/emip.12558","DOIUrl":"10.1111/emip.12558","url":null,"abstract":"","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"42 2","pages":"4"},"PeriodicalIF":2.0,"publicationDate":"2023-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42774382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ITEMS Corner Update: The Initial Steps in the ITEMS Development Process
Brian C. Leventhal
Educational Measurement: Issues and Practice, 42(2), 74. Published June 9, 2023. DOI: 10.1111/emip.12556

In the previous issue of Educational Measurement: Issues and Practice (EM:IP), I outlined the ten steps to authoring and producing a digital module for the Instructional Topics in Educational Measurement Series (ITEMS). In the current piece, I detail the first three steps: Step 1 (Content Outline), Step 2 (Content Development), and Step 3 (Draft Review). After an in-depth discussion of these three steps, I introduce the newest ITEMS module.

Prior to beginning the ten-step process, ITEMS module development starts with an initial meeting between myself (as editor) and the lead author(s). During this meeting, I discuss the development process in detail, showcasing what a final product looks like from the learners' perspective along with a behind-the-scenes look at what the final product looks like from the editorial perspective. After discussing the end product, the remaining conversation focuses on the ten-step process and the user-friendly templates that guide authors. The conversation concludes once we have agreed on the topic and general scope of the module.

Authors then independently work through a module outline template to refine the scope and sequencing of the module (Step 1). During this step, authors are encouraged to first specify their audience before setting the learning objectives of the module. Once learning objectives are set, authors are tasked with determining the prerequisite knowledge for learners. In the next section of the template, authors outline the content and sequencing of the 4–6 sections of the module. Each section has its own learning objectives that map to the objectives of the module. One of the sections is a learner-focused interactive activity, whether a demonstration of software or a case study relevant to the content of the other sections. Once the outline is completed, the authors receive feedback to ensure adequate sequencing, feasibility of module development (e.g., covering a reasonable amount of content), and appropriateness for the audience. This is an example of the unique nature of ITEMS module development: unlike most other publications, ITEMS module development involves regular communication and feedback from the editor. Once the scope and outline of content have been agreed to, the authors move on to Step 2: Content Development.

For Step 2, authors are provided a slide deck template to assist in developing content consistent with the ITEMS format and brand. Using this slide deck, authors maintain creative flexibility by choosing among many slide layouts, each preprogrammed with consistent font, sizing, and color. Authors create individual slide decks for each section of the module, embedding media (e.g., pictures and figures) wherever necessary to assist learner understanding. At this stage, authors are not expected to record audio or add animations. The primary focus for the authors i…
The Role of Response Style Adjustments in Cross-Country Comparisons—A Case Study Using Data from the PISA 2015 Questionnaire
Esther Ulitzsch, Oliver Lüdtke, Alexander Robitzsch
Educational Measurement: Issues and Practice, 42(3), 65–79. Published May 1, 2023. DOI: 10.1111/emip.12552

Country differences in response styles (RS) may jeopardize cross-country comparability of Likert-type scales. When adjusting for rather than investigating RS is the primary goal, it seems advantageous to impose minimal assumptions on RS structures and leverage information from multiple scales for RS measurement. Using PISA 2015 background questionnaire data, we investigate such an adjustment procedure and explore its impact on cross-country comparisons in contrast to customary analyses and RS adjustments that (a) leave RS unconsidered, (b) incorporate stronger assumptions on RS structure, and/or (c) only use some selected scales for RS measurement. Our findings suggest that not only the decision as to whether to adjust for RS but also how to adjust may heavily impact cross-country comparisons. This concerns both the assumptions on RS structures and the scales employed for RS measurement. Implications for RS adjustments in cross-country comparisons are derived, strongly advocating for taking model uncertainty into account.
Diving Into Students' Transcripts: High School Course-Taking Sequences and Postsecondary Enrollment
Burhan Ogut, Ruhan Circi
Educational Measurement: Issues and Practice, 42(2), 21–31. Published April 23, 2023. DOI: 10.1111/emip.12554

The purpose of this study was to explore high school course-taking sequences and their relationship to college enrollment. Specifically, we implemented sequence analysis to discover common course-taking trajectories in math, science, and English language arts using high school transcript data from a recent nationally representative survey. Through sequence clustering, we reduced the complexity of the sequences and examined representative course-taking sequences. Classification tree, random forest, and multinomial logistic regression analyses were used to explore the relationship between the course sequences students complete and their postsecondary outcomes. Results showed that distinct representative course-taking sequences can be identified for all students as well as for student subgroups. More advanced and complex course-taking sequences were associated with postsecondary enrollment.
Validation as Evaluating Desired and Undesired Effects: Insights From Cross-Classified Mixed Effects Model
Xuejun Ryan Ji, Amery D. Wu
Educational Measurement: Issues and Practice, 42(2), 12–20. Published April 5, 2023. DOI: 10.1111/emip.12553

Measurement specialists have demonstrated that the Cross-Classified Mixed Effects Model (CCMEM) is a flexible framework for evaluating reliability. Reliability can be estimated from the variance components of the test scores. Building on that work, this study extends the CCMEM to the evaluation of validity evidence. Validity is viewed as the coherence among the elements of a measurement system. As such, validity can be evaluated through the fixed and random effects that the test user reasons to be desired or undesired. Using data from the ePIRLS 2016 Reading Assessment, we demonstrate how to obtain evidence for reliability and validity with the CCMEM. We conclude with a discussion of the practicality and benefits of this validation method.