Educational Measurement-Issues and Practice最新文献

英文中文

In the beginning, there was an item…

IF 2.7 4区教育学 Q1 EDUCATION & EDUCATIONAL RESEARCH

Educational Measurement-Issues and Practice

Pub Date : 2024-11-14 DOI: 10.1111/emip.12647

Deborah J. Harris, Catherine J. Welch, Stephen B. Dunbar

As educational researchers, we take scored item responses, create data sets to analyze, draw inferences from those analyses, and make decisions, about students’ educational knowledge and future success, judge how successful educational programs are, determine what to teach tomorrow, and so on. It is good to remind ourselves that the basis for all our analyses, from simple means to complex multilevel, multidimensional modeling, interpretations of those analyses, and decisions we make based on the analyses are at the core based on a test taker responding to an item. With all the emphasis on modeling, analyses, big data, machine learning, etc., we need to remember it all starts with the items we collect information on. If we get those wrong, then the results of subsequent analyses are unlikely to provide the information we are seeking.It is true that how students and educators interact with items has changed, and continues to change. More and more of the student-item interactions are happening online, and the days when an educator had relatively easy access to the actual test items, often after test administration, are in the past. This lack of access is also true for the researchers analyzing the response data: instead of a single test booklet aligned to a data file of test taker responses, there are large pools of items, and while the researcher may know a test taker was administered, say, item #SK-65243-0273A and what the response was, they do not know what the text of the item actually was, which can make it challenging to interpret analysis results at times.From having a test author write the items for an assessment, to contracting with content specialists to draft items, to cloning items from a template, to having large language models/artificial intelligence produce items, item development has morphed over the past and present, and will continue to morph into the future. Item tryouts for pretesting the quality and functioning of an item, including gathering data for generating item statistics to aid in forms construction and in some instances scoring, now attempt to develop algorithms that can accurately predict item characteristics, including item statistics, without gathering item data in advance of operational use (or at all). We are developing more innovative item types, and collecting more data, such as latencies, click streams, and other process data on student responses to those items.Sometimes we are so enamored of what we can do with the data, the analyses seem distant from the actual experience: a test taker responding to an item. And this makes it challenging at times to interpret analysis results in terms of actionable steps. Our aim here is to examine the evolution of how items are developed and considered, concentrating on large-scale, K–12 educational assessments.The Standards for Educational and Psychological Testing (Standards; American Educational Research Association [AERA], the

{"title":"In the beginning, there was an item…","authors":"Deborah J. Harris, Catherine J. Welch, Stephen B. Dunbar","doi":"10.1111/emip.12647","DOIUrl":"https://doi.org/10.1111/emip.12647","url":null,"abstract":"As educational researchers, we take scored item responses, create data sets to analyze, draw inferences from those analyses, and make decisions, about students’ educational knowledge and future success, judge how successful educational programs are, determine what to teach tomorrow, and so on. It is good to remind ourselves that the basis for all our analyses, from simple means to complex multilevel, multidimensional modeling, interpretations of those analyses, and decisions we make based on the analyses are at the core based on a test taker responding to an item. With all the emphasis on modeling, analyses, big data, machine learning, etc., we need to remember it all starts with the items we collect information on. If we get those wrong, then the results of subsequent analyses are unlikely to provide the information we are seeking.It is true that how students and educators interact with items has changed, and continues to change. More and more of the student-item interactions are happening online, and the days when an educator had relatively easy access to the actual test items, often after test administration, are in the past. This lack of access is also true for the researchers analyzing the response data: instead of a single test booklet aligned to a data file of test taker responses, there are large pools of items, and while the researcher may know a test taker was administered, say, item #SK-65243-0273A and what the response was, they do not know what the text of the item actually was, which can make it challenging to interpret analysis results at times.From having a test author write the items for an assessment, to contracting with content specialists to draft items, to cloning items from a template, to having large language models/artificial intelligence produce items, item development has morphed over the past and present, and will continue to morph into the future. Item tryouts for pretesting the quality and functioning of an item, including gathering data for generating item statistics to aid in forms construction and in some instances scoring, now attempt to develop algorithms that can accurately predict item characteristics, including item statistics, without gathering item data in advance of operational use (or at all). We are developing more innovative item types, and collecting more data, such as latencies, click streams, and other process data on student responses to those items.Sometimes we are so enamored of what we can do with the data, the analyses seem distant from the actual experience: a test taker responding to an item. And this makes it challenging at times to interpret analysis results in terms of actionable steps. Our aim here is to examine the evolution of how items are developed and considered, concentrating on large-scale, K–12 educational assessments.The Standards for Educational and Psychological Testing (Standards; American Educational Research Association [AERA], the ","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 4","pages":"40-45"},"PeriodicalIF":2.7,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12647","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Measurement Invariance for Multilingual Learners Using Item Response and Response Time in PISA 2018

IF 2.7 4区教育学 Q1 EDUCATION & EDUCATIONAL RESEARCH

Educational Measurement-Issues and Practice

Pub Date : 2024-11-10 DOI: 10.1111/emip.12640

Jung Yeon Park, Sean Joo, Zikun Li, Hyejin Yoon

This study examines potential assessment bias based on students' primary language status in PISA 2018. Specifically, multilingual (MLs) and nonmultilingual (non-MLs) students in the United States are compared with regard to their response time as well as scored responses across three cognitive domains (reading, mathematics, and science). Differential item functioning (DIF) analysis reveals that 7–14% of items exhibit DIF-related problems in scored responses between the two groups, aligning with PISA technical report results. While MLs generally spend more time on the test than non-MLs across cognitive levels, differential response time (DRT) functioning identifies significant time differences in 7–10% of items for students with similar cognitive levels. It was noticeable that items with DIF and DRT issues show limited overlap, suggesting diverse reasons for student struggles in the assessment. A deeper examination of item characteristics is recommended for test developers and teachers to gain a better understanding of these nuances.

本研究探讨了 2018 年国际学生评估项目（PISA）中基于学生主要语言状况的潜在评估偏差。具体而言，研究比较了美国多语种（MLs）和非多语种（non-MLs）学生在三个认知领域（阅读、数学和科学）的回答时间和得分情况。差异项目功能（DIF）分析表明，7%-14% 的项目在两组学生的计分回答中表现出与 DIF 相关的问题，这与国际学生评估项目（PISA）技术报告的结果一致。在不同认知水平的学生中，多语种学生通常比非多语种学生花费更多的时间在测试上，但差异反应时间（DRT）功能发现，在认知水平相似的学生中，有 7-10%的项目存在显著的时间差异。值得注意的是，存在 DIF 和 DRT 问题的题目显示出有限的重叠，这表明学生在测评中遇到困难的原因多种多样。建议测试开发人员和教师对项目特征进行更深入的研究，以便更好地了解这些细微差别。

{"title":"Measurement Invariance for Multilingual Learners Using Item Response and Response Time in PISA 2018","authors":"Jung Yeon Park, Sean Joo, Zikun Li, Hyejin Yoon","doi":"10.1111/emip.12640","DOIUrl":"https://doi.org/10.1111/emip.12640","url":null,"abstract":"This study examines potential assessment bias based on students' primary language status in PISA 2018. Specifically, multilingual (MLs) and nonmultilingual (non-MLs) students in the United States are compared with regard to their response time as well as scored responses across three cognitive domains (reading, mathematics, and science). Differential item functioning (DIF) analysis reveals that 7–14% of items exhibit DIF-related problems in scored responses between the two groups, aligning with PISA technical report results. While MLs generally spend more time on the test than non-MLs across cognitive levels, differential response time (DRT) functioning identifies significant time differences in 7–10% of items for students with similar cognitive levels. It was noticeable that items with DIF and DRT issues show limited overlap, suggesting diverse reasons for student struggles in the assessment. A deeper examination of item characteristics is recommended for test developers and teachers to gain a better understanding of these nuances.","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"44 1","pages":"55-65"},"PeriodicalIF":2.7,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12640","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143423536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

You Win Some, You Lose Some

IF 2.7 4区教育学 Q1 EDUCATION & EDUCATIONAL RESEARCH

Educational Measurement-Issues and Practice

Pub Date : 2024-11-10 DOI: 10.1111/emip.12643

Gregory J. Cizek

In a 1993 EM:IP article, I made six predictions related to measurement policy issues for the approaching millenium. In this article, I evaluate the accuracy of those predictions (Spoiler: I was only modestly accurate) and I proffer a mix of seven contemporary predictions, recommendations, and aspirations regarding assessment generally, NCME as an association, and specific psychometric practices.

引用次数: 0

The 2025 EM:IP Cover Graphic/Data Visualization Competition

IF 2.7 4区教育学 Q1 EDUCATION & EDUCATIONAL RESEARCH

Educational Measurement-Issues and Practice

Pub Date : 2024-11-10 DOI: 10.1111/emip.12658

Yuan-Ling Liaw

引用次数: 0

Introduction to the Special Section on the Past, Present, and Future of Educational Measurement

IF 2.7 4区教育学 Q1 EDUCATION & EDUCATIONAL RESEARCH

Educational Measurement-Issues and Practice

Pub Date : 2024-11-10 DOI: 10.1111/emip.12660

Zhongmin Cui

引用次数: 0

Evolving Educational Testing to Meet Students’ Needs: Design-in-Real-Time Assessment

IF 2.7 4区教育学 Q1 EDUCATION & EDUCATIONAL RESEARCH

Educational Measurement-Issues and Practice

Pub Date : 2024-11-10 DOI: 10.1111/emip.12653

Stephen G. Sireci, Javier Suárez-Álvarez, April L. Zenisky, Maria Elena Oliveri

The goal in personalized assessment is to best fit the needs of each individual test taker, given the assessment purposes. Design-In-Real-Time (DIRTy) assessment reflects the progressive evolution in testing from a single test, to an adaptive test, to an adaptive assessment system. In this article, we lay the foundation for DIRTy assessment and illustrate how it meets the complex needs of each individual learner. The assessment framework incorporates culturally responsive assessment principles, thus making it innovative with respect to both technology and equity. Key aspects are (a) assessment building blocks called “assessment task modules” (ATMs) linked to multiple content standards and skill domains, (b) gathering information on test takers’ characteristics and preferences and using this information to improve their testing experience, and (c) selecting, modifying, and compiling ATMs to create a personalized test that best meets the needs of the testing purpose and individual test taker.

引用次数: 0

AI: Can You Help Address This Issue?

IF 2.7 4区教育学 Q1 EDUCATION & EDUCATIONAL RESEARCH

Educational Measurement-Issues and Practice

Pub Date : 2024-11-10 DOI: 10.1111/emip.12655

Deborah J. Harris

Linking across test forms or pools of items is necessary to ensure scores that are reported across different administrations are comparable and lead to consistent decisions for examinees whose abilities are the same, but who were administered different items. Most of these linkages consist of equating test forms or scaling calibrated items or pools to be on the same theta scale. The typical methodology to accomplish this linking makes use of common examinees or common items, where common examinees are understood to be groups of examinees of comparable ability, whether obtained through a single group (where the same examinees are administered multiple assessments) or a random groups design, where random assignment or pseudo random assignment is done (such as spiraling the test forms, say 1, 2, 3, 4, 5, and distributing them such that every 5th examinee receives the same form). Common item methodology is usually implemented by having identical items in multiple forms and using those items to link across forms or pools. These common items may be scored or unscored in terms of whether they are treated as internal or external anchors (i.e., whether they are contributing to the examinee's score).There are situations where it is not practical to have either common examinees nor common items. Typically, these are high-stakes settings, where the security of the assessment questions would likely be at risk if any were repeated. This would include scenarios where the entire assessment is released after administration to promote transparency. In some countries, a single form of a national test may be administered to all examinees during a single administration time. While in some cases a student who does not do as well as they had hoped may retest the following year, this may be a small sample and these students would not be considered representative of the entire body of test-takers. In addition, it is presumed they would have spent the intervening year studying for the exam, and so they could not really be considered common examinees across years and assessment forms.Although the decisions (such as university admissions) based on the assessment scores are comparable within the year, because all examinees are administered the same set of items on the same date, it is difficult to monitor trends over time as there is no linkage between forms across years. Although the general populations may be similar (e.g., 2024 secondary school graduates versus 2023 secondary school graduates), there is no evidence that the groups are strictly equivalent across years. Similarly, comparing how examinees perform across years (e.g., highest scores, average raw score, and so on) is challenging as there is no adjustment for yearly fluctuations in form difficulty across years.There have been variations of both common item and common examinee linking, such as using similar items, rather than identical items, including where perhaps these similar items are

{"title":"AI: Can You Help Address This Issue?","authors":"Deborah J. Harris","doi":"10.1111/emip.12655","DOIUrl":"https://doi.org/10.1111/emip.12655","url":null,"abstract":"Linking across test forms or pools of items is necessary to ensure scores that are reported across different administrations are comparable and lead to consistent decisions for examinees whose abilities are the same, but who were administered different items. Most of these linkages consist of equating test forms or scaling calibrated items or pools to be on the same theta scale. The typical methodology to accomplish this linking makes use of common examinees or common items, where common examinees are understood to be groups of examinees of comparable ability, whether obtained through a single group (where the same examinees are administered multiple assessments) or a random groups design, where random assignment or pseudo random assignment is done (such as spiraling the test forms, say 1, 2, 3, 4, 5, and distributing them such that every 5th examinee receives the same form). Common item methodology is usually implemented by having identical items in multiple forms and using those items to link across forms or pools. These common items may be scored or unscored in terms of whether they are treated as internal or external anchors (i.e., whether they are contributing to the examinee's score).There are situations where it is not practical to have either common examinees nor common items. Typically, these are high-stakes settings, where the security of the assessment questions would likely be at risk if any were repeated. This would include scenarios where the entire assessment is released after administration to promote transparency. In some countries, a single form of a national test may be administered to all examinees during a single administration time. While in some cases a student who does not do as well as they had hoped may retest the following year, this may be a small sample and these students would not be considered representative of the entire body of test-takers. In addition, it is presumed they would have spent the intervening year studying for the exam, and so they could not really be considered common examinees across years and assessment forms.Although the decisions (such as university admissions) based on the assessment scores are comparable within the year, because all examinees are administered the same set of items on the same date, it is difficult to monitor trends over time as there is no linkage between forms across years. Although the general populations may be similar (e.g., 2024 secondary school graduates versus 2023 secondary school graduates), there is no evidence that the groups are strictly equivalent across years. Similarly, comparing how examinees perform across years (e.g., highest scores, average raw score, and so on) is challenging as there is no adjustment for yearly fluctuations in form difficulty across years.There have been variations of both common item and common examinee linking, such as using similar items, rather than identical items, including where perhaps these similar items are","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 4","pages":"9-12"},"PeriodicalIF":2.7,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12655","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Comparative Analysis of Psychometric Frameworks and Properties of Scores from Autogenerated Test Forms

IF 2.7 4区教育学 Q1 EDUCATION & EDUCATIONAL RESEARCH

Educational Measurement-Issues and Practice

Pub Date : 2024-11-10 DOI: 10.1111/emip.12648

Won-Chan Lee, Stella Y. Kim

This paper explores the psychometric properties of scores derived from autogenerated test forms by introducing three conceptual frameworks: Alternate Test Forms, Randomly Parallel Forms, and Approximately Parallel Forms. Each framework provides a distinct perspective on score comparability, definitions of true score and standard error of measurement (SEM), and the necessity of equating. Through a simulation study, we illustrate how these frameworks compare in terms of true scores and SEMs, while also assessing the impact of equating on score comparability across varying levels of form variability. Ultimately, this study seeks to lay the groundwork for implementing scoring practices in large-scale standardized assessments that use autogenerated forms.

引用次数: 0

Linking Unlinkable Tests: A Step Forward

IF 2.7 4区教育学 Q1 EDUCATION & EDUCATIONAL RESEARCH

Educational Measurement-Issues and Practice

Pub Date : 2024-11-10 DOI: 10.1111/emip.12638

Silvia Testa, Renato Miceli, Renato Miceli

Random Equating (RE) and Heuristic Approach (HA) are two linking procedures that may be used to compare the scores of individuals in two tests that measure the same latent trait, in conditions where there are no common items or individuals. In this study, RE—that may only be used when the individuals taking the two tests come from the same population—was used as a benchmark for evaluating HA, which, in contrast, does not require any distributional assumptions. The comparison was based on both simulated and empirical data. Simulations showed that HA was good at reproducing the link shift connecting the difficulty parameters of the two sets of items, performing similarly to RE under the condition of slight violation of the distributional assumption. Empirical results showed satisfactory correspondence between the estimates of item and person parameters obtained via the two procedures.

引用次数: 0

From Mandated to Test-Optional College Admissions Testing: Where Do We Go from Here?

IF 2.7 4区教育学 Q1 EDUCATION & EDUCATIONAL RESEARCH

Educational Measurement-Issues and Practice

Pub Date : 2024-11-10 DOI: 10.1111/emip.12649

Kyndra V. Middleton, Comfort H. Omonkhodion, Ernest Y. Amoateng, Lucy O. Okam, Daniela Cardoza, Alexis Oakley

引用次数: 0

首页上一页

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Educational Measurement-Issues and Practice

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀