Bharati B. Belwalkar, Matthew Schultz, Christina Curnow, J. Carl Setzer
There is a growing integration of technology in the workplace (World Economic Forum), and with it, organizations are increasingly relying on advanced technological approaches to improve their human capital processes and stay relevant and competitive in complex environments. All professions must keep up with this transition and begin integrating technology into their tools and processes. This paper centers on how advanced technological approaches, such as natural language processing (NLP) and data mining, have complemented a traditional practice analysis of the accounting profession. We also discuss the strategic selection and use of subject-matter experts (SMEs) for a more efficient practice analysis. The authors adopted a triangulation process: gathering information from a traditional practice analysis, using strategically selected SMEs, and confirming findings with a novel NLP-based approach. These methods collectively contributed to the revision of the Uniform CPA Exam blueprint and to a better understanding of accounting trends.
{"title":"Blending Strategic Expertise and Technology: A Case Study for Practice Analysis","authors":"Bharati B. Belwalkar, Matthew Schultz, Christina Curnow, J. Carl Setzer","doi":"10.1111/emip.12607","DOIUrl":"10.1111/emip.12607","url":null,"abstract":"<p>There is a growing integration of technology in the workplace (World Economic Forum), and with it, organizations are increasingly relying on advanced technological approaches for improving their human capital processes to stay relevant and competitive in complex environments. All professions must keep up with this transition and begin integrating technology into their tools and processes. This paper centers on how advanced technological approaches (such as natural language processing (NLP) and data mining) have complemented a traditional practice analysis of the accounting profession. We also discuss strategic selection and use of subject-matter experts (SMEs) for more efficient practice analysis. The authors have adopted a triangulation process—gathering information from traditional practice analysis, using selected SMEs, and confirming findings with a novel NLP-based approach. These methods collectively contributed to the revision of the Uniform CPA Exam blueprint and in understanding accounting trends.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 3","pages":"85-94"},"PeriodicalIF":2.7,"publicationDate":"2024-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141122285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article is based on my 2023 NCME Presidential Address, where I talked a bit about my journey into the profession and, more substantively, about comparable scores. Specifically, I discussed some of the different ways ‘comparable scores’ are defined, highlighted some areas I think we as a profession need to pay more attention to when considering score comparability, and emphasized that comparability in this context is a matter of degree that varies according to the decisions we plan to base on particular scores.
{"title":"2023 NCME Presidential Address: Some Musings on Comparable Scores","authors":"Deborah J. Harris","doi":"10.1111/emip.12609","DOIUrl":"10.1111/emip.12609","url":null,"abstract":"<p>This article is based on my 2023 NCME Presidential Address, where I talked a bit about my journey into the profession, and more substantively about comparable scores. Specifically, I discussed some of the different ways ‘comparable scores’ are defined, highlighted some areas I think we as a profession need to pay more attention to when considering score comparability, and emphasized that comparability in this context is a matter of degree which varies according to the decisions we plan to make on particular scores.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 2","pages":"6-15"},"PeriodicalIF":2.0,"publicationDate":"2024-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12609","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140929271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study capitalizes on response and process data from the computer-based TIMSS 2019 Problem Solving and Inquiry tasks to investigate gender differences in test-taking behaviors and their association with eighth-grade mathematics achievement. Specifically, a recently proposed hierarchical speed-accuracy-revisits (SAR) model was adapted to multiple country-by-gender groups to examine the extent to which mathematics ability, response speed, revisit propensity, and the relationships among them differ between boys and girls. Results across 10 countries showed that boys responded to items faster on average than girls, and that boys’ response speed varied more across students. A mixture distribution of revisit propensity was found for all country-by-gender groups. Both genders showed moderate to strong negative correlations between mathematics ability and response speed, supporting the speed-accuracy tradeoff pattern reported in the literature. Results are discussed in the context of low-stakes assessments and in relation to the utility of the multiple-group SAR model.
{"title":"Examining Gender Differences in TIMSS 2019 Using a Multiple-Group Hierarchical Speed-Accuracy-Revisits Model","authors":"Dihao Leng, Ummugul Bezirhan, Lale Khorramdel, Bethany Fishbein, Matthias von Davier","doi":"10.1111/emip.12606","DOIUrl":"10.1111/emip.12606","url":null,"abstract":"<p>This study capitalizes on response and process data from the computer-based TIMSS 2019 Problem Solving and Inquiry tasks to investigate gender differences in test-taking behaviors and their association with mathematics achievement at the eighth grade. Specifically, a recently proposed hierarchical speed-accuracy-revisits (SAR) model was adapted to multiple country-by-gender groups to examine the extent to which mathematics ability, response speed, revisit propensity, and the relationship among them differ between boys and girls. Results across 10 countries showed that boys responded to items faster on average than girls, and there was greater variation in boys’ response speed across students. A mixture distribution of revisit propensity was found for all country-by-gender groups. Both genders had moderate to strong negative correlations between mathematics ability and response speed, supporting the speed-accuracy tradeoff pattern reported in the literature. Results are discussed in the context of low-stakes assessments and in relation to the utility of the multiple-group SAR model.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 3","pages":"64-75"},"PeriodicalIF":2.7,"publicationDate":"2024-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12606","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140663098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traditional approaches to modeling multiple-choice item response data (e.g., 3PL, 4PL models) treat slips and guesses as random events. In this paper, an item response model is presented that characterizes both disjunctively interacting guessing and conjunctively interacting slipping processes as proficiency-related phenomena. We show how evidence for this perspective appears in the systematic form of invariance violations for item slip and guess parameters under four-parameter IRT models when compared across populations with different mean proficiency levels. Specifically, higher-proficiency populations tend to show higher guess and lower slip probabilities than lower-proficiency populations. The results undermine the use of traditional models for IRT applications that require invariance and suggest that alternatives deserve greater attention.
{"title":"Guesses and Slips as Proficiency-Related Phenomena and Impacts on Parameter Invariance","authors":"Xiangyi Liao, Daniel M Bolt","doi":"10.1111/emip.12605","DOIUrl":"10.1111/emip.12605","url":null,"abstract":"<p>Traditional approaches to the modeling of multiple-choice item response data (e.g., 3PL, 4PL models) emphasize slips and guesses as random events. In this paper, an item response model is presented that characterizes both disjunctively interacting guessing and conjunctively interacting slipping processes as proficiency-related phenomena. We show how evidence for this perspective is seen in the systematic form of invariance violations for item slip and guess parameters under four-parameter IRT models when compared across populations of different mean proficiency levels. Specifically, higher proficiency populations tend to show higher guess and lower slip probabilities than lower proficiency populations. The results undermine the use of traditional models for IRT applications that require invariance and would suggest greater attention to alternatives.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 3","pages":"76-84"},"PeriodicalIF":2.7,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12605","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140589673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiangang Hao, Alina A. von Davier, Victoria Yaneva, Susan Lottridge, Matthias von Davier, Deborah J. Harris
The remarkable strides in artificial intelligence (AI), exemplified by ChatGPT, have unveiled a wealth of opportunities and challenges in assessment. Applying cutting-edge large language models (LLMs) and generative AI to assessment holds great promise in boosting efficiency, mitigating bias, and facilitating customized evaluations. Conversely, these innovations raise significant concerns regarding validity, reliability, transparency, fairness, equity, and test security, necessitating careful thinking when applying them in assessments. In this article, we discuss the impacts and implications of LLMs and generative AI on critical dimensions of assessment with example use cases and call for a community effort to equip assessment professionals with the AI literacy needed to harness this potential effectively.
{"title":"Transforming Assessment: The Impacts and Implications of Large Language Models and Generative AI","authors":"Jiangang Hao, Alina A. von Davier, Victoria Yaneva, Susan Lottridge, Matthias von Davier, Deborah J. Harris","doi":"10.1111/emip.12602","DOIUrl":"10.1111/emip.12602","url":null,"abstract":"<p>The remarkable strides in artificial intelligence (AI), exemplified by ChatGPT, have unveiled a wealth of opportunities and challenges in assessment. Applying cutting-edge large language models (LLMs) and generative AI to assessment holds great promise in boosting efficiency, mitigating bias, and facilitating customized evaluations. Conversely, these innovations raise significant concerns regarding validity, reliability, transparency, fairness, equity, and test security, necessitating careful thinking when applying them in assessments. In this article, we discuss the impacts and implications of LLMs and generative AI on critical dimensions of assessment with example use cases and call for a community effort to equip assessment professionals with the needed AI literacy to harness the potential effectively.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 2","pages":"16-29"},"PeriodicalIF":2.0,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140589684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Short scales are time-efficient for participants and cost-effective in research. However, researchers often mistakenly expect short scales to have the same reliability as long ones without considering the effect of scale length. We argue that applying a universal benchmark for alpha is problematic, as the impact of low-quality items is greater on shorter scales. In this study, we propose simple guidelines for item reduction using the “alpha-if-item-deleted” procedure in scale construction. An item can be removed if alpha increases or decreases by less than .02, especially for short scales. Conversely, an item should be retained if alpha decreases by more than .04 upon its removal. For reliability benchmarks, .80 is relatively safe in most conditions, but higher benchmarks are recommended for longer scales and smaller sample sizes. Supplementary analyses, including item content, face validity, and content coverage, are critical to ensure scale quality.
{"title":"Revisiting the Usage of Alpha in Scale Evaluation: Effects of Scale Length and Sample Size","authors":"Leifeng Xiao, Kit-Tai Hau, Melissa Dan Wang","doi":"10.1111/emip.12604","DOIUrl":"10.1111/emip.12604","url":null,"abstract":"<p>Short scales are time-efficient for participants and cost-effective in research. However, researchers often mistakenly expect short scales to have the same reliability as long ones without considering the effect of scale length. We argue that applying a universal benchmark for alpha is problematic as the impact of low-quality items is greater on shorter scales. In this study, we proposed simple guidelines for item reduction using the “alpha-if-item-deleted” procedure in scale construction. An item can be removed if alpha increases or decreases by less than .02, especially for short scales. Conversely, an item should be retained if alpha decreases by more than .04 upon its removal. For reliability benchmarks, .80 is relatively safe in most conditions, but higher benchmarks are recommended for longer scales and smaller sample sizes. Supplementary analyses, including item content, face validity, and content coverage, are critical to ensure scale quality.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 2","pages":"74-81"},"PeriodicalIF":2.0,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12604","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140173185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Measuring opportunities to learn (OTL) is crucial for evaluating education quality and equity, but obtaining accurate and comprehensive OTL data at a large scale remains challenging. We attempt to address this issue by investigating measurement concerns in data collection and sampling. With the primary goal of estimating group-level OTLs for large populations of classrooms and the secondary goal of estimating classroom-level OTLs, we propose forming a teacher panel and using an online log-type survey to collect content and time data on sampled days throughout the school year. We compared various sampling schemes in a simulation study with real daily log data from 66 fourth-grade math teachers. The findings from this study indicate that sampling 1 day per week or 1 day every other week provided accurate group-level estimates, while sampling 1 day per week yielded satisfactory classroom-level estimates. The proposed approach aids in effectively monitoring large-scale classroom OTL.
{"title":"What Mathematics Content Do Teachers Teach? Optimizing Measurement of Opportunities to Learn in the Classroom","authors":"Jiahui Zhang, William H. Schmidt","doi":"10.1111/emip.12603","DOIUrl":"10.1111/emip.12603","url":null,"abstract":"<p>Measuring opportunities to learn (OTL) is crucial for evaluating education quality and equity, but obtaining accurate and comprehensive OTL data at a large scale remains challenging. We attempt to address this issue by investigating measurement concerns in data collection and sampling. With the primary goal of estimating group-level OTLs for large populations of classrooms and the secondary goal of estimating classroom-level OTLs, we propose forming a teacher panel and using an online log-type survey to collect content and time data on sampled days throughout the school year. We compared various sampling schemes in a simulation study with real daily log data from 66 fourth-grade math teachers. The findings from this study indicate that sampling 1 day per week or 1 day every other week provided accurate group-level estimates, while sampling 1 day per week yielded satisfactory classroom-level estimates. The proposed approach aids in effectively monitoring large-scale classroom OTL.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 2","pages":"40-54"},"PeriodicalIF":2.0,"publicationDate":"2024-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12603","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140097883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Angela Johnson, Elizabeth Barker, Marcos Viveros Cespedes
Educators and researchers strive to build policies and practices on data and evidence, especially on academic achievement scores. When assessment scores are inaccurate for specific student populations or when scores are inappropriately used, even data-driven decisions will be misinformed. To maximize the impact of the research-practice-policy collaborative, every stage of the assessment and research process needs to be critically interrogated. In this paper, we highlight the need to reframe assessment and research for multilingual learners, students with disabilities, and multilingual students with disabilities. We outline a framework that integrates three critical perspectives (QuantCrit, DisCrit, and critical multiculturalism) and discuss how this framework can be applied to assessment creation and research.
{"title":"Reframing Research and Assessment Practices: Advancing an Antiracist and Anti-Ableist Research Agenda","authors":"Angela Johnson, Elizabeth Barker, Marcos Viveros Cespedes","doi":"10.1111/emip.12601","DOIUrl":"10.1111/emip.12601","url":null,"abstract":"<p>Educators and researchers strive to build policies and practices on data and evidence, especially on academic achievement scores. When assessment scores are inaccurate for specific student populations or when scores are inappropriately used, even data-driven decisions will be misinformed. To maximize the impact of the research-practice-policy collaborative, every stage of the assessment and research process needs to be critically interrogated. In this paper, we highlight the need to reframe assessment and research for multilingual learners, students with disabilities, and multilingual students with disabilities. We outline a framework that integrates three critical perspectives (QuantCrit, DisCrit, and critical multiculturalism) and discuss how this framework can be applied to assessment creation and research.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 3","pages":"95-105"},"PeriodicalIF":2.7,"publicationDate":"2024-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140036508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}