Optimizing Implementation of Artificial-Intelligence-Based Automated Scoring: An Evidence Centered Design Approach for Designing Assessments for AI-based Scoring
Kadriye Ercikan, Daniel F. McCaffrey
Journal of Educational Measurement, 59(3), 272-287. doi:10.1111/jedm.12332. Published June 12, 2022.

Artificial-intelligence-based automated scoring is often an afterthought, considered only after assessments have been developed, which limits the possibilities for implementing automated scoring solutions effectively. In this article, we provide a review of artificial intelligence (AI)-based methodologies for scoring in educational assessments. We then propose an evidence-centered design framework for developing assessments that aligns construct conceptualization, scoring, and the ultimate interpretation and use of the assessment with the advantages and limitations of AI-based scoring in mind. We provide recommendations for defining construct, task, and evidence models to guide task and assessment design in ways that optimize the development and implementation of AI-based automated scoring of constructed-response items and support the validity of inferences from and uses of scores.

Validity Arguments Meet Artificial Intelligence in Innovative Educational Assessment: A Discussion and Look Forward
David W. Dorsey, Hillary R. Michaels
Journal of Educational Measurement, 59(3), 389-394. doi:10.1111/jedm.12330. Published June 9, 2022.

In this concluding article of the special issue, we provide an overall discussion and point to future emerging trends in AI that might shape our approach to validity and building validity arguments.

Validity Arguments for AI-Based Automated Scores: Essay Scoring as an Illustration
Steve Ferrara, Saed Qunbar
Journal of Educational Measurement, 59(3), 288-313. doi:10.1111/jedm.12333. Published June 8, 2022.

In this article, we argue that automated scoring engines should be transparent and construct relevant—that is, as much as is currently feasible. Many current automated scoring engines cannot achieve high degrees of scoring accuracy without allowing in some features that may not be easily explained and understood and may not be obviously and directly relevant to the target assessment construct. We address the current limitations on evidence and validity arguments for scores from automated scoring engines from the points of view of the Standards for Educational and Psychological Testing (i.e., construct relevance, construct representation, and fairness) and emerging principles in artificial intelligence (e.g., explainable AI, an examinee's right to explanations, and principled AI). We illustrate these concepts and arguments for automated essay scores.

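One transparency device consistent with the explainability principles mentioned in this abstract can be illustrated with a toy example. The sketch below is purely hypothetical and is not the authors' proposal or any operational engine: for a linear scoring model, a score "explanation" can be given as the additive contribution of each feature, which also exposes whether the influential features look construct relevant. The feature names, weights, and values are all assumptions made for illustration.

```python
# Toy illustration of score transparency for a linear automated essay scoring model.
# Feature names, weights, and values are hypothetical, not any operational engine's.
import numpy as np

feature_names = ["essay_length", "type_token_ratio", "avg_sentence_length", "spelling_errors"]
weights = np.array([0.010, 2.5, 0.05, -0.20])   # assumed calibrated coefficients
intercept = 1.0

def explain_score(features: np.ndarray) -> float:
    """Print the predicted score and each feature's additive contribution to it."""
    contributions = weights * features
    score = intercept + contributions.sum()
    print(f"predicted score: {score:.2f}")
    for name, contribution in zip(feature_names, contributions):
        print(f"  {name:>20}: {contribution:+.2f}")
    return score

explain_score(np.array([300.0, 0.55, 18.0, 4.0]))
```

Whether the influential features carry a clear construct rationale (a spelling-error count arguably does; many engine features do not) is exactly the tension between accuracy and construct relevance that the authors examine.
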
Psychometric Methods to Evaluate Measurement and Algorithmic Bias in Automated Scoring
Matthew S. Johnson, Xiang Liu, Daniel F. McCaffrey
Journal of Educational Measurement, 59(3), 338-361. doi:10.1111/jedm.12335. Published June 1, 2022.

With the increasing use of automated scores in operational testing settings comes the need to understand the ways in which they can yield biased and unfair results. In this paper, we provide a brief survey of some of the ways in which the predictive methods used in automated scoring can lead to biased, and thus unfair, automated scores. After providing definitions of fairness from machine learning and a psychometric framework in which to study them, we demonstrate how modeling decisions, such as omitting variables, using proxy measures or confounded variables, and even the choice of optimization criterion in estimation, can lead to biased and unfair automated scores. We then introduce two simple methods for evaluating bias, evaluate their statistical properties through simulation, and apply them to an item from a large-scale reading assessment.

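The mechanisms named in this abstract (omitted variables, proxy measures, the optimization criterion) can be made concrete with a small simulation. The sketch below is only an illustration under assumed data and is not the estimators proposed in the article: an engine trained to predict human ratings from features that omit one component of the construct systematically underpredicts the group that is stronger on the omitted component, and a group-wise mean residual serves as one simple check for that differential prediction.

```python
# Simulation of omitted-variable bias in an automated score (illustrative only;
# not the estimators proposed in the article). Data and groups are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50_000
group = rng.integers(0, 2, n)                   # examinee group indicator
theta1 = rng.normal(0.0, 1.0, n)                # construct component the engine's features capture
theta2 = 0.5 * group + rng.normal(0.0, 1.0, n)  # omitted component; groups differ on it

human = 0.5 * theta1 + 0.5 * theta2 + rng.normal(0.0, 0.3, n)  # human ratings reflect both components
x = (theta1 + rng.normal(0.0, 0.3, n)).reshape(-1, 1)          # engine features measure theta1 only

engine = LinearRegression().fit(x, human)       # calibrate the engine against human ratings
auto = engine.predict(x)                        # automated scores

# Simple check for differential prediction: mean residual (automated minus human) by group.
resid = auto - human
for g in (0, 1):
    print(f"group {g}: mean(auto - human) = {resid[group == g].mean():+.3f}")
```

Running the sketch shows over-prediction for group 0 and under-prediction for group 1 even though the engine never sees group membership, which is the sense in which modeling decisions alone can produce unfair automated scores.
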
Toward Argument-Based Fairness with an Application to AI-Enhanced Educational Assessments
A. Corinne Huggins-Manley, Brandon M. Booth, Sidney K. D'Mello
Journal of Educational Measurement, 59(3), 362-388. doi:10.1111/jedm.12334. Published June 1, 2022.

The field of educational measurement places validity and fairness as central concepts of assessment quality. Prior research has proposed embedding fairness arguments within argument-based validity processes, particularly when fairness is conceived as comparability in assessment properties across groups. However, we argue that a more flexible approach to fairness arguments that occurs outside of and complementary to validity arguments is required to address many of the views on fairness that a set of assessment stakeholders may hold. Accordingly, we focus this manuscript on two contributions: (a) introducing the argument-based fairness approach to complement argument-based validity for both traditional and artificial intelligence (AI)-enhanced assessments and (b) applying it in an illustrative AI assessment of perceived hireability in automated video interviews used to prescreen job candidates. We conclude with recommendations for further advancing argument-based fairness approaches.

Linking and Comparability across Conditions of Measurement: Established Frameworks and Proposed Updates
Tim Moses
Journal of Educational Measurement, 59(2), 231-250. doi:10.1111/jedm.12322. Published May 30, 2022.

One result of recent changes in testing is that previously established linking frameworks may not adequately address challenges in current linking situations. Test linking through equating, concordance, vertical scaling, or battery scaling may not represent linkings for the scores of tests developed to measure constructs differently for different examinees, or tests that are administered in different modes and data collection designs. This article considers how previously proposed linking frameworks might be updated to address more recent testing situations. The first section summarizes the definitions and frameworks described in previous test linking discussions. Additional sections consider some sources of more disparate approaches to test development and administration, as well as their implications for test linking. Possibilities for reflecting these features in an expanded test linking framework are proposed. These possibilities encourage limited comparability, such as comparability restricted to subgroups or to the conditions of a linking study when a linking is produced, or comparability within, but not across, tests or test forms when an empirical linking based on examinee data is not produced. The implications of an updated framework of previously established linking approaches are further described in a final discussion.

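As a point of reference for the updates discussed in this abstract, the sketch below shows one of the simplest empirical linkings based on examinee data that the established frameworks describe: a linear (mean-sigma) linking that places Form Y scores on the Form X scale. The scores are simulated and the random-groups data collection is assumed; the article's concern is precisely the situations in which such a data collection is unavailable or the resulting comparability must be restricted.

```python
# Minimal linear (mean-sigma) linking of two test forms under an assumed
# random-groups data collection; scores are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
form_x = rng.normal(30.0, 6.0, 2_000)   # observed scores on Form X (hypothetical group A)
form_y = rng.normal(27.0, 5.0, 2_000)   # observed scores on Form Y (hypothetical group B)

# Match means and standard deviations to place Form Y scores on the Form X scale.
slope = form_x.std(ddof=1) / form_y.std(ddof=1)
intercept = form_x.mean() - slope * form_y.mean()

def y_to_x(y_score: float) -> float:
    """Transform a Form Y score to the Form X scale via the linear linking."""
    return slope * y_score + intercept

print(f"Form Y score of 27 maps to about {y_to_x(27.0):.1f} on the Form X scale")
```

The comparability such a linking supports is tied to the groups and conditions of the data collection that produced it, which is the kind of restriction the proposed framework updates make explicit.
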
Introduction to the Special Issue Maintaining Score Comparability: Recent Challenges and Some Possible Solutions
Tim Moses, Gautam Puhan
Journal of Educational Measurement, 59(2), 137-139. doi:10.1111/jedm.12323. Published May 26, 2022.

Anchoring Validity Evidence for Automated Essay Scoring
Mark D. Shermis
Journal of Educational Measurement, 59(3), 314-337. doi:10.1111/jedm.12336. Published May 15, 2022.

One of the challenges of discussing validity arguments for machine scoring of essays centers on the absence of a commonly held definition and theory of good writing. At best, the algorithms attempt to measure select attributes of writing and calibrate them against human ratings with the goal of accurate prediction of scores for new essays. Sometimes these attributes are based on the fundamentals of writing (e.g., fluency), but quite often they are based on locally developed rubrics that may be confounded with specific content coverage expectations. This lack of transparency makes it difficult to provide systematic evidence that machine scoring is assessing writing itself rather than slices or correlates of writing performance.

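The calibration described in this abstract, measuring select attributes and fitting them against human ratings to predict scores for new essays, can be sketched in a few lines. The code below is a deliberately crude illustration of that pipeline, not any operational engine: the surface_features and calibrate_and_check helpers are hypothetical, the essays and human arguments are assumed to be an existing hand-scored sample with integer rubric scores, and human-machine agreement is checked with quadratic weighted kappa on held-out essays.

```python
# Sketch of feature-based calibration against human ratings, with agreement checked
# on held-out essays via quadratic weighted kappa. Features and rubric are hypothetical.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

def surface_features(essay: str) -> list[float]:
    """A few surface attributes of the response (fluency and vocabulary proxies)."""
    tokens = essay.split()
    n_tokens = len(tokens)
    n_types = len({t.lower() for t in tokens})
    n_sentences = max(essay.count(".") + essay.count("!") + essay.count("?"), 1)
    return [
        float(n_tokens),                 # length as a fluency proxy
        n_types / max(n_tokens, 1),      # type-token ratio as a vocabulary proxy
        n_tokens / n_sentences,          # average sentence length
    ]

def calibrate_and_check(essays: list[str], human: np.ndarray, low: int = 1, high: int = 6) -> float:
    """Fit features to human ratings; return quadratic weighted kappa on a holdout set."""
    X = np.array([surface_features(e) for e in essays])
    X_train, X_test, y_train, y_test = train_test_split(X, human, test_size=0.25, random_state=0)
    model = Ridge(alpha=1.0).fit(X_train, y_train)
    predicted = np.clip(np.rint(model.predict(X_test)), low, high).astype(int)
    return cohen_kappa_score(y_test, predicted, weights="quadratic")
```

The construct concern raised above is visible in the sketch: nothing in these features distinguishes well-reasoned writing from fluent but empty prose, so even high agreement with human ratings would leave open what the scores actually measure.
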
Historical Perspectives on Score Comparability Issues Raised by Innovations in Testing
Peter Baldwin, Brian E. Clauser
Journal of Educational Measurement, 59(2), 140-160. doi:10.1111/jedm.12318. Published May 11, 2022.

While score comparability across test forms typically relies on common (or randomly equivalent) examinees or items, innovations in item formats, test delivery, and efforts to extend the range of score interpretation may require a special data collection before examinees or items can be used in this way—or may be incompatible with common examinee or item designs altogether. When comparisons are necessary under these nonroutine conditions, forms still must be connected by something, and this article focuses on these form-invariant connective somethings. A conceptual framework for thinking about the problem of score comparability in this way is given, followed by a description of three classes of connectives. Examples from the history of innovations in testing are given for each class.

Recent Challenges to Maintaining Score Comparability: A Commentary
Neil J. Dorans, Shelby J. Haberman
Journal of Educational Measurement, 59(2), 251-264. doi:10.1111/jedm.12319. Published May 10, 2022.