Journal of Educational Measurement最新文献

英文中文

Optimizing Implementation of Artificial-Intelligence-Based Automated Scoring: An Evidence Centered Design Approach for Designing Assessments for AI-based Scoring 基于人工智能的自动评分的优化实现:基于人工智能评分的评估设计的循证设计方法

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2022-06-12 DOI: 10.1111/jedm.12332

Kadriye Ercikan, Daniel F. McCaffrey

Artificial-intelligence-based automated scoring is often an afterthought and is considered after assessments have been developed, resulting in nonoptimal possibility of implementing automated scoring solutions. In this article, we provide a review of Artificial intelligence (AI)-based methodologies for scoring in educational assessments. We then propose an evidence-centered design framework for developing assessments to align conceptualization, scoring, and ultimate assessment interpretation and use with the advantages and limitations of AI-based scoring in mind. We provide recommendations for defining construct, task, and evidence models to guide task and assessment design that optimize the development and implementation of AI-based automated scoring of constructed response items and support the validity of inferences from and uses of scores.

基于人工智能的自动评分通常是事后才考虑的，并且是在评估开发之后才考虑的，这导致了实施自动评分解决方案的非最佳可能性。在本文中，我们对基于人工智能(AI)的教育评估评分方法进行了综述。然后，我们提出了一个以证据为中心的设计框架，用于开发评估，以使概念化，评分，最终评估解释和使用与基于人工智能的评分的优点和局限性保持一致。我们提供了定义构建、任务和证据模型的建议，以指导任务和评估设计，优化基于人工智能的构建响应项目自动评分的开发和实现，并支持从得分中推断和使用的有效性。

引用次数: 2

Validity Arguments Meet Artificial Intelligence in Innovative Educational Assessment: A Discussion and Look Forward 创新教育评估中的有效性论证与人工智能:探讨与展望

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2022-06-09 DOI: 10.1111/jedm.12330

David W. Dorsey, Hillary R. Michaels

In this concluding article of the special issue, we provide an overall discussion and point to future emerging trends in AI that might shape our approach to validity and building validity arguments.

在本期特刊的最后一篇文章中，我们进行了全面的讨论，并指出了人工智能未来的新兴趋势，这些趋势可能会影响我们对有效性和建立有效性论证的方法。

引用次数: 1

Validity Arguments for AI-Based Automated Scores: Essay Scoring as an Illustration 基于人工智能的自动评分的有效性论证:以论文评分为例

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2022-06-08 DOI: 10.1111/jedm.12333

Steve Ferrara, Saed Qunbar

In this article, we argue that automated scoring engines should be transparent and construct relevant—that is, as much as is currently feasible. Many current automated scoring engines cannot achieve high degrees of scoring accuracy without allowing in some features that may not be easily explained and understood and may not be obviously and directly relevant to the target assessment construct. We address the current limitations on evidence and validity arguments for scores from automated scoring engines from the points of view of the Standards for Educational and Psychological Testing (i.e., construct relevance, construct representation, and fairness) and emerging principles in Artificial Intelligence (e.g., explainable AI, an examinee's right to explanations, and principled AI). We illustrate these concepts and arguments for automated essay scores.

在本文中，我们认为自动评分引擎应该是透明的和结构相关的——也就是说，尽可能多的是当前可行的。如果不考虑一些可能不容易解释和理解的特征，并且可能与目标评估结构不明显和直接相关，那么许多当前的自动评分引擎无法实现高度的评分准确性。我们从教育和心理测试标准(即构建相关性，构建代表性和公平性)和人工智能新兴原则(例如，可解释的人工智能，考生的解释权和原则性人工智能)的角度解决了自动评分引擎得分的证据和有效性论点的当前限制。我们举例说明这些概念和论点自动作文分数。

引用次数: 4

Psychometric Methods to Evaluate Measurement and Algorithmic Bias in Automated Scoring 评估自动评分测量和算法偏差的心理测量学方法

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2022-06-01 DOI: 10.1111/jedm.12335

Matthew S. Johnson, Xiang Liu, Daniel F. McCaffrey

With the increasing use of automated scores in operational testing settings comes the need to understand the ways in which they can yield biased and unfair results. In this paper, we provide a brief survey of some of the ways in which the predictive methods used in automated scoring can lead to biased, and thus unfair automated scores. After providing definitions of fairness from machine learning and a psychometric framework to study them, we demonstrate how modeling decisions, like omitting variables, using proxy measures or confounded variables, and even the optimization criterion in estimation can lead to biased and unfair automated scores. We then introduce two simple methods for evaluating bias, evaluate their statistical properties through simulation, and apply to an item from a large-scale reading assessment.

随着在操作测试设置中越来越多地使用自动化分数，需要了解它们可能产生偏见和不公平结果的方式。在本文中，我们简要介绍了自动评分中使用的预测方法可能导致有偏见，从而导致不公平的自动评分的一些方式。在提供了机器学习公平性的定义和研究它们的心理测量框架之后，我们展示了建模决策，如省略变量，使用代理度量或混淆变量，甚至是估计中的优化标准，如何导致有偏见和不公平的自动分数。然后，我们介绍了两种评估偏差的简单方法，通过模拟评估它们的统计特性，并将其应用于大规模阅读评估中的一个项目。

引用次数: 4

Toward Argument-Based Fairness with an Application to AI-Enhanced Educational Assessments 基于论证的公平性及其在人工智能增强教育评估中的应用

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2022-06-01 DOI: 10.1111/jedm.12334

A. Corinne Huggins-Manley, Brandon M. Booth, Sidney K. D'Mello

The field of educational measurement places validity and fairness as central concepts of assessment quality. Prior research has proposed embedding fairness arguments within argument-based validity processes, particularly when fairness is conceived as comparability in assessment properties across groups. However, we argue that a more flexible approach to fairness arguments that occurs outside of and complementary to validity arguments is required to address many of the views on fairness that a set of assessment stakeholders may hold. Accordingly, we focus this manuscript on two contributions: (a) introducing the argument-based fairness approach to complement argument-based validity for both traditional and artificial intelligence (AI)-enhanced assessments and (b) applying it in an illustrative AI assessment of perceived hireability in automated video interviews used to prescreen job candidates. We conclude with recommendations for further advancing argument-based fairness approaches.

教育测量领域将有效性和公平性作为评估质量的核心概念。先前的研究已经提出在基于论证的有效性过程中嵌入公平论证，特别是当公平被认为是跨群体评估属性的可比性时。然而，我们认为，为了解决一组评估利益相关者可能持有的许多关于公平性的观点，需要一种更灵活的方法来解决有效性论点之外的公平性论点并与之互补。因此，我们将本文的重点放在两个方面:(a)引入基于论证的公平性方法，以补充传统评估和人工智能(AI)增强评估的基于论证的有效性;(b)将其应用于用于预筛选求职者的自动视频面试中感知可雇佣性的说明性AI评估。最后，我们提出了进一步推进基于论证的公平方法的建议。

引用次数: 4

Linking and Comparability across Conditions of Measurement: Established Frameworks and Proposed Updates 测量条件之间的联系和可比性:已建立的框架和建议的更新

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2022-05-30 DOI: 10.1111/jedm.12322

Tim Moses

One result of recent changes in testing is that previously established linking frameworks may not adequately address challenges in current linking situations. Test linking through equating, concordance, vertical scaling or battery scaling may not represent linkings for the scores of tests developed to measure constructs differently for different examinees, or tests that are administered in different modes and data collection designs. This article considers how previously proposed linking frameworks might be updated to address more recent testing situations. The first section summarizes the definitions and frameworks described in previous test linking discussions. Additional sections consider some sources of more disparate approaches to test development and administrations, as well as the implications of these for test linking. Possibilities for reflecting these features in an expanded test linking framework are proposed that encourage limited comparability, such as comparability that is restricted to subgroups or to the conditions of a linking study when a linking is produced, or within, but not across tests or test forms when an empirical linking based on examinee data is not produced. The implications of an updated framework of previously established linking approaches are further described in a final discussion.

最近测试变化的一个结果是，以前建立的链接框架可能无法充分应对当前链接情况下的挑战。通过等号法、一致性法、垂直缩放法或单元缩放法进行的考试联系，可能不代表为不同考生开发的不同构式的考试成绩之间的联系，也不代表以不同模式和数据收集设计进行的考试成绩之间的联系。本文考虑如何更新先前提出的链接框架，以解决最近的测试情况。第一部分总结了前面测试链接讨论中描述的定义和框架。另外的部分将考虑更多不同的测试开发和管理方法的一些来源，以及这些方法对测试链接的影响。建议在扩展的测试链接框架中反映这些特征的可能性，以鼓励有限的可比性，例如，当产生链接时，仅限于子组或链接研究条件的可比性，或者当没有产生基于考生数据的经验链接时，在测试或测试形式内而不是跨测试或测试形式的可比性。在最后的讨论中进一步描述了先前建立的联系方法的更新框架的含义。

{"title":"Linking and Comparability across Conditions of Measurement: Established Frameworks and Proposed Updates","authors":"Tim Moses","doi":"10.1111/jedm.12322","DOIUrl":"10.1111/jedm.12322","url":null,"abstract":"<p>One result of recent changes in testing is that previously established linking frameworks may not adequately address challenges in current linking situations. Test linking through equating, concordance, vertical scaling or battery scaling may not represent linkings for the scores of tests developed to measure constructs differently for different examinees, or tests that are administered in different modes and data collection designs. This article considers how previously proposed linking frameworks might be updated to address more recent testing situations. The first section summarizes the definitions and frameworks described in previous test linking discussions. Additional sections consider some sources of more disparate approaches to test development and administrations, as well as the implications of these for test linking. Possibilities for reflecting these features in an expanded test linking framework are proposed that encourage limited comparability, such as comparability that is restricted to subgroups or to the conditions of a linking study when a linking is produced, or within, but not across tests or test forms when an empirical linking based on examinee data is not produced. The implications of an updated framework of previously established linking approaches are further described in a final discussion.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"59 2","pages":"231-250"},"PeriodicalIF":1.3,"publicationDate":"2022-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46845951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Introduction to the Special Issue Maintaining Score Comparability: Recent Challenges and Some Possible Solutions 保持分数可比性特刊导论:最近的挑战和一些可能的解决方案

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2022-05-26 DOI: 10.1111/jedm.12323

Tim Moses, Gautam Puhan

引用次数: 1

Anchoring Validity Evidence for Automated Essay Scoring 锚定有效性证据的自动作文评分

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2022-05-15 DOI: 10.1111/jedm.12336

Mark D. Shermis

One of the challenges of discussing validity arguments for machine scoring of essays centers on the absence of a commonly held definition and theory of good writing. At best, the algorithms attempt to measure select attributes of writing and calibrate them against human ratings with the goal of accurate prediction of scores for new essays. Sometimes these attributes are based on the fundamentals of writing (e.g., fluency), but quite often they are based on locally developed rubrics that may be confounded with specific content coverage expectations. This lack of transparency makes it difficult to provide systematic evidence that machine scoring is assessing writing, but slices or correlates of writing performance.

讨论论文机器评分的有效性论点的挑战之一集中在缺乏一个普遍持有的定义和理论的好写作。在最好的情况下，算法试图衡量写作的选择属性，并将它们与人类评分进行校准，目标是准确预测新文章的分数。有时，这些属性是基于写作的基础(例如，流畅性)，但更多时候，它们是基于当地开发的标准，可能会与具体的内容覆盖预期相混淆。由于缺乏透明度，很难提供系统的证据来证明机器评分是在评估写作，而是在评估写作表现的片段或相关性。

引用次数: 2

Historical Perspectives on Score Comparability Issues Raised by Innovations in Testing 从历史的角度看考试创新引起的分数可比性问题

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2022-05-11 DOI: 10.1111/jedm.12318

Peter Baldwin, Brian E. Clauser

While score comparability across test forms typically relies on common (or randomly equivalent) examinees or items, innovations in item formats, test delivery, and efforts to extend the range of score interpretation may require a special data collection before examinees or items can be used in this way—or may be incompatible with common examinee or item designs altogether. When comparisons are necessary under these nonroutine conditions, forms still must be connected by something and this article focuses on these form-invariant connective somethings. A conceptual framework for thinking about the problem of score comparability in this way is given followed by a description of three classes of connectives. Examples from the history of innovations in testing are given for each class.

虽然考试形式之间的分数可比性通常依赖于共同的(或随机相等的)考生或项目，但在以这种方式使用考生或项目之前，项目格式的创新、考试交付和扩大分数解释范围的努力可能需要特殊的数据收集，或者可能与共同的考生或项目设计完全不兼容。当在这些非常规条件下需要比较时，形式仍然必须通过某些东西连接起来，本文主要讨论这些形式不变的连接物。以这种方式给出了思考分数可比性问题的概念框架，然后描述了三类连接词。每个班级都给出了测试创新历史上的例子。

引用次数: 2

Recent Challenges to Maintaining Score Comparability: A Commentary 保持分数可比性的最新挑战:评论

IF 1.3 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement

Pub Date : 2022-05-10 DOI: 10.1111/jedm.12319

Neil J. Dorans, Shelby J. Haberman

引用次数: 0

首页上一页

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Journal of Educational Measurement

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀