The purpose of this study was to investigate multidimensional DIF with simple and nonsimple structures in the context of the multidimensional graded response model (MGRM). The study examined and compared the performance of the IRT-LR and Wald tests using MML-EM and MHRM estimation approaches under different test factors and test structures, both in simulation studies and in applications to real data sets. When the test structure included two dimensions, the IRT-LR (MML-EM) generally performed better than the Wald test and provided higher power rates. When the test included three dimensions, the methods performed similarly in DIF detection. In contrast, when the test included four dimensions, MML-EM estimation completely lost precision in estimating nonuniform DIF, even with large sample sizes. The Wald test with the MHRM estimation approach outperformed both the Wald test (MML-EM) and the IRT-LR (MML-EM), showing higher power rates and acceptable Type I error rates for nonuniform DIF. Small and/or unbalanced sample sizes, small DIF magnitudes, unequal ability distributions between groups, the number of dimensions, the estimation method, and the test structure were identified as important test factors for detecting multidimensional DIF.
{"title":"Detecting Multidimensional DIF in Polytomous Items with IRT Methods and Estimation Approaches","authors":"Güler Yavuz Temel","doi":"10.1111/jedm.12377","DOIUrl":"10.1111/jedm.12377","url":null,"abstract":"<p>The purpose of this study was to investigate multidimensional DIF with a simple and nonsimple structure in the context of multidimensional Graded Response Model (MGRM). This study examined and compared the performance of the IRT-LR and Wald test using MML-EM and MHRM estimation approaches with different test factors and test structures in simulation studies and applying real data sets. When the test structure included two dimensions, the IRT-LR (MML-EM) generally performed better than the Wald test and provided higher power rates. If the test included three dimensions, the methods provided similar performance in DIF detection. In contrast to these results, when the number of dimensions in the test was four, MML-EM estimation completely lost precision in estimating the nonuniform DIF, even with large sample sizes. The Wald with MHRM estimation approaches outperformed the Wald test (MML-EM) and IRT-LR (MML-EM). The Wald test had higher power rate and acceptable type I error rates for nonuniform DIF with the MHRM estimation approach.The small and/or unbalanced sample sizes, small DIF magnitudes, unequal ability distributions between groups, number of dimensions, estimation methods and test structure were evaluated as important test factors for detecting multidimensional DIF.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 1","pages":"69-98"},"PeriodicalIF":1.3,"publicationDate":"2023-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136185515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jia Liu, Xiangbin Meng, Gongjun Xu, Wei Gao, Ningzhong Shi
In this paper, we develop a mixed stochastic approximation expectation-maximization (MSAEM) algorithm coupled with a Gibbs sampler to compute the marginalized maximum a posteriori estimate (MMAPE) of a confirmatory multidimensional four-parameter normal ogive (M4PNO) model. The proposed MSAEM algorithm not only retains the computational advantages of the stochastic approximation expectation-maximization (SAEM) algorithm for multidimensional data, but also alleviates the potential instability caused by label switching, thereby improving estimation accuracy. Simulation studies illustrate the good performance of the proposed MSAEM method, which consistently performs better than SAEM and some other existing methods in multidimensional item response theory. Moreover, the proposed method is applied to a real data set from the 2018 Programme for International Student Assessment (PISA) to demonstrate the usefulness of the 4PNO model as well as MSAEM in practice.
{"title":"MSAEM Estimation for Confirmatory Multidimensional Four-Parameter Normal Ogive Models","authors":"Jia Liu, Xiangbin Meng, Gongjun Xu, Wei Gao, Ningzhong Shi","doi":"10.1111/jedm.12378","DOIUrl":"10.1111/jedm.12378","url":null,"abstract":"<p>In this paper, we develop a mixed stochastic approximation expectation-maximization (MSAEM) algorithm coupled with a Gibbs sampler to compute the marginalized maximum a posteriori estimate (MMAPE) of a confirmatory multidimensional four-parameter normal ogive (M4PNO) model. The proposed MSAEM algorithm not only has the computational advantages of the stochastic approximation expectation-maximization (SAEM) algorithm for multidimensional data, but it also alleviates the potential instability caused by label-switching, and then improved the estimation accuracy. Simulation studies are conducted to illustrate the good performance of the proposed MSAEM method, where MSAEM consistently performs better than SAEM and some other existing methods in multidimensional item response theory. Moreover, the proposed method is applied to a real data set from the 2018 Programme for International Student Assessment (PISA) to demonstrate the usefulness of the 4PNO model as well as MSAEM in practice.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 1","pages":"99-124"},"PeriodicalIF":1.3,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135146227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The usual interpretation of the person and task variables in between-persons measurement models such as item response theory (IRT) is as attributes of persons and tasks, respectively. They can instead be viewed as ensemble descriptors of patterns of interactions among persons and situations that arise from sociocognitive complex adaptive systems (CASs). This view offers insights for interpreting and using between-persons measurement models and for connecting with sociocognitive research. In this article, we use data generated from an agent-based model to illustrate relations between “social” and “cognitive” features of a simple underlying CAS and the variables of an IRT model fit to the resulting data. We note how these ideas connect to explanatory item response modeling and briefly comment on implications for score interpretations and uses in practice.
{"title":"Sociocognitive Processes and Item Response Models: A Didactic Example","authors":"Tao Gong, Lan Shuai, Robert J. Mislevy","doi":"10.1111/jedm.12376","DOIUrl":"10.1111/jedm.12376","url":null,"abstract":"<p>The usual interpretation of the person and task variables in between-persons measurement models such as item response theory (IRT) is as attributes of persons and tasks, respectively. They can be viewed instead as ensemble descriptors of patterns of interactions among persons and situations that arise from sociocognitive complex adaptive system (CASs). This view offers insights for interpreting and using between-persons measurement models and connecting with sociocognitive research. In this article, we use data generated from an agent-based model to illustrate relations between “social” and “cognitive” features of a simple underlying CAS and the variables of an IRT model fit to resulting data. We note how the ideas connect to explanatory item response modeling and briefly comment on implications for score interpretations and uses in practice.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 1","pages":"150-173"},"PeriodicalIF":1.3,"publicationDate":"2023-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135397635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many language proficiency tests include group oral assessments involving peer interaction. In such an assessment, examinees discuss a common topic with others. Human raters score each examinee's spoken performance on specially designed criteria. However, measurement models for analyzing group assessment data usually assume local person independence and thus fail to consider the impact of peer interaction on the assessment outcomes. This research advances an extended many-facet Rasch model for group assessments (MFRM-GA), accounting for local person dependence. In a series of simulations, we examined the MFRM-GA's parameter recovery and the consequences of ignoring peer interactions under the traditional modeling approach. We also used a real dataset from the English-speaking test of the Language Proficiency Assessment for Teachers (LPAT) routinely administered in Hong Kong to illustrate the efficiency of the new model. The discussion focuses on the model's usefulness for measuring oral language proficiency, practical implications, and future research perspectives.
{"title":"Measuring the Impact of Peer Interaction in Group Oral Assessments with an Extended Many-Facet Rasch Model","authors":"Kuan-Yu Jin, Thomas Eckes","doi":"10.1111/jedm.12375","DOIUrl":"10.1111/jedm.12375","url":null,"abstract":"<p>Many language proficiency tests include group oral assessments involving peer interaction. In such an assessment, examinees discuss a common topic with others. Human raters score each examinee's spoken performance on specially designed criteria. However, measurement models for analyzing group assessment data usually assume local person independence and thus fail to consider the impact of peer interaction on the assessment outcomes. This research advances an extended many-facet Rasch model for group assessments (MFRM-GA), accounting for local person dependence. In a series of simulations, we examined the MFRM-GA's parameter recovery and the consequences of ignoring peer interactions under the traditional modeling approach. We also used a real dataset from the English-speaking test of the Language Proficiency Assessment for Teachers (LPAT) routinely administered in Hong Kong to illustrate the efficiency of the new model. The discussion focuses on the model's usefulness for measuring oral language proficiency, practical implications, and future research perspectives.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 1","pages":"47-68"},"PeriodicalIF":1.3,"publicationDate":"2023-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135352749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Derek C. Briggs Historical and Conceptual Foundations of Measurement in the Human Sciences: Credos and Controversies","authors":"David Torres Irribarra","doi":"10.1111/jedm.12374","DOIUrl":"10.1111/jedm.12374","url":null,"abstract":"","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 4","pages":"739-746"},"PeriodicalIF":1.3,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In multidimensional computerized adaptive testing (MCAT), item selection strategies are generally constructed based on responses alone and do not consider the response times required by items. This study constructed two new criteria (referred to as DT-inc and DT) for MCAT item selection that utilize information from response times. The new designs maximize the amount of information per unit time. Furthermore, these two new designs were extended to the DT_S-inc and DT_S designs to efficiently estimate intentional abilities. Moreover, the EAP method for ability estimation was also extended to incorporate response times. The performance of the response-time-based EAP (RT-based EAP) and the new designs was evaluated in simulation and empirical studies. The results showed that the RT-based EAP significantly improved ability estimation precision compared with the EAP without response times, and that the new designs dramatically reduced testing times for examinees at a small cost in ability estimation precision and item pool usage.
{"title":"Using Response Time in Multidimensional Computerized Adaptive Testing","authors":"Yinhong He, Yuanyuan Qi","doi":"10.1111/jedm.12373","DOIUrl":"10.1111/jedm.12373","url":null,"abstract":"<p>In multidimensional computerized adaptive testing (MCAT), item selection strategies are generally constructed based on responses, and they do not consider the response times required by items. This study constructed two new criteria (referred to as DT-inc and DT) for MCAT item selection by utilizing information from response times. The new designs maximize the amount of information per unit time. Furthermore, these two new designs were extended to the DT<sub>S</sub>-inc and DT<sub>S</sub> designs to efficiently estimate intentional abilities. Moreover, the EAP method for ability estimation was also equipped with response time. The performances of the response-time-based EAP (RT-based EAP) and the new designs were evaluated in simulation and empirical studies. The results showed that the RT-based EAP significantly improved the ability estimation precision compared with the EAP without using response time, and the new designs dramatically saved testing times for examinees with a small sacrifice of ability estimation precision and item pool usage.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 4","pages":"697-738"},"PeriodicalIF":1.3,"publicationDate":"2023-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48931962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the COVID-19 pandemic lockdowns forced populations across the world to become completely dependent on digital devices for working, studying, and socializing, there has been no shortage of published studies about the possible negative effects of the increased use of digital devices during this exceptional period. In seeking to empirically address how the concern with digital dependency has been experienced during the pandemic, we present findings from a study of daily self-reported logbooks by 59 university students in Copenhagen, Denmark, over 4 weeks in April and May 2020, investigating their everyday use of digital devices. We highlight two main findings. First, students report high levels of online fatigue, expressed as frustration with their constant reliance on digital devices. On the other hand, students found creative ways of using digital devices for maintaining social relations, helping them to cope with isolation. Such online interactions were nevertheless seen as a poor substitute for physical interactions in the long run. Our findings show how the dependence on digital devices was marked by ambivalence, where digital communication was seen as both the cure against, and cause of, feeling isolated and estranged from a sense of normality.
{"title":"Digital dependence: Online fatigue and coping strategies during the COVID-19 lockdown.","authors":"Emilie Munch Gregersen, Sofie Læbo Astrupgaard, Malene Hornstrup Jespersen, Tobias Priesholm Gårdhus, Kristoffer Albris","doi":"10.1177/01634437231154781","DOIUrl":"10.1177/01634437231154781","url":null,"abstract":"<p><p>As the COVID-19 pandemic lockdowns forced populations across the world to become completely dependent on digital devices for working, studying, and socializing, there has been no shortage of published studies about the possible negative effects of the increased use of digital devices during this exceptional period. In seeking to empirically address how the concern with digital dependency has been experienced during the pandemic, we present findings from a study of daily self-reported logbooks by 59 university students in Copenhagen, Denmark, over 4 weeks in April and May 2020, investigating their everyday use of digital devices. We highlight two main findings. First, students report high levels of online fatigue, expressed as frustration with their constant reliance on digital devices. On the other hand, students found creative ways of using digital devices for maintaining social relations, helping them to cope with isolation. Such online interactions were nevertheless seen as a poor substitute for physical interactions in the long run. Our findings show how the dependence on digital devices was marked by ambivalence, where digital communication was seen as both the cure against, and cause of, feeling isolated and estranged from a sense of normality.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"33 1","pages":"967-984"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9922647/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85419232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale standardized tests are regularly used to measure student achievement overall and for student subgroups. These uses assume tests provide comparable measures of outcomes across student subgroups, but prior research suggests score comparisons across gender groups may be complicated by the type of test items used. This paper presents evidence that among nationally representative samples of 15-year-olds in the United States participating in the 2009, 2012, and 2015 PISA math and reading tests, there are consistent item format by gender differences. On average, male students answer multiple-choice items correctly relatively more often and female students answer constructed-response items correctly relatively more often. These patterns were consistent across 34 additional participating PISA jurisdictions, although the size of the format differences varied and were larger on average in reading than math. The average magnitude of the format differences is not large enough to be flagged in routine differential item functioning analyses intended to detect test bias but is large enough to raise questions about the validity of inferences based on comparisons of scores across gender groups. Researchers and other test users should account for test item format, particularly when comparing scores across gender groups.
{"title":"Gender Bias in Test Item Formats: Evidence from PISA 2009, 2012, and 2015 Math and Reading Tests","authors":"Benjamin R. Shear","doi":"10.1111/jedm.12372","DOIUrl":"10.1111/jedm.12372","url":null,"abstract":"<p>Large-scale standardized tests are regularly used to measure student achievement overall and for student subgroups. These uses assume tests provide comparable measures of outcomes across student subgroups, but prior research suggests score comparisons across gender groups may be complicated by the type of test items used. This paper presents evidence that among nationally representative samples of 15-year-olds in the United States participating in the 2009, 2012, and 2015 PISA math and reading tests, there are consistent item format by gender differences. On average, male students answer multiple-choice items correctly relatively more often and female students answer constructed-response items correctly relatively more often. These patterns were consistent across 34 additional participating PISA jurisdictions, although the size of the format differences varied and were larger on average in reading than math. The average magnitude of the format differences is not large enough to be flagged in routine differential item functioning analyses intended to detect test bias but is large enough to raise questions about the validity of inferences based on comparisons of scores across gender groups. Researchers and other test users should account for test item format, particularly when comparing scores across gender groups.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 4","pages":"676-696"},"PeriodicalIF":1.3,"publicationDate":"2023-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42035945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The residual differential item functioning (RDIF) detection framework was developed recently under a linear testing context. To explore the potential application of this framework to computerized adaptive testing (CAT), the present study investigated the utility of the RDIF_R statistic both as an index for detecting uniform DIF of pretest items in CAT and as a direct measure of the effect size of uniform DIF. Extensive CAT simulations revealed RDIF_R to have well-controlled Type I error and slightly higher power to detect uniform DIF compared with CATSIB, especially when pretest items were calibrated using fixed-item parameter calibration. Moreover, RDIF_R accurately estimated the amount of uniform DIF irrespective of the presence of impact. Therefore, RDIF_R demonstrates its potential as a useful tool for evaluating both the statistical and practical significance of uniform DIF in CAT.
{"title":"Detecting Differential Item Functioning in CAT Using IRT Residual DIF Approach","authors":"Hwanggyu Lim, Edison M. Choe","doi":"10.1111/jedm.12366","DOIUrl":"10.1111/jedm.12366","url":null,"abstract":"<p>The residual differential item functioning (RDIF) detection framework was developed recently under a linear testing context. To explore the potential application of this framework to computerized adaptive testing (CAT), the present study investigated the utility of the RDIF<sub>R</sub> statistic both as an index for detecting uniform DIF of pretest items in CAT and as a direct measure of the effect size of uniform DIF. Extensive CAT simulations revealed RDIF<sub>R</sub> to have well-controlled Type I error and slightly higher power to detect uniform DIF compared with CATSIB, especially when pretest items were calibrated using fixed-item parameter calibration. Moreover, RDIF<sub>R</sub> accurately estimated the amount of uniform DIF irrespective of the presence of impact. Therefore, RDIF<sub>R</sub> demonstrates its potential as a useful tool for evaluating both the statistical and practical significance of uniform DIF in CAT.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 4","pages":"626-650"},"PeriodicalIF":1.3,"publicationDate":"2023-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45693936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benjamin Becker, Sebastian Weirich, Frank Goldhammer, Dries Debeer
When designing or modifying a test, an important challenge is controlling its speededness. To achieve this, van der Linden (2011a, 2011b) proposed using a lognormal response time model, more specifically the two-parameter lognormal model, together with automated test assembly (ATA) via mixed integer linear programming. However, this approach has a severe limitation: the two-parameter lognormal model lacks a slope parameter, which means the model assumes that all items are equally speed sensitive. From a conceptual perspective, this assumption seems very restrictive. Furthermore, various other empirical studies and new data analyses performed by us show that this assumption almost never holds in practice. To overcome this shortcoming, we bring together the already frequently used three-parameter lognormal model for response times, which contains a slope parameter, and van der Linden's ATA approach for controlling speededness. The proposed extension is demonstrated through multiple empirically based illustrations, including complete and documented R code. Both the original van der Linden approach and our newly proposed approach are available to practitioners in the freely available R package eatATA.
{"title":"Controlling the Speededness of Assembled Test Forms: A Generalization to the Three-Parameter Lognormal Response Time Model","authors":"Benjamin Becker, Sebastian Weirich, Frank Goldhammer, Dries Debeer","doi":"10.1111/jedm.12364","DOIUrl":"10.1111/jedm.12364","url":null,"abstract":"<p>When designing or modifying a test, an important challenge is controlling its speededness. To achieve this, van der Linden (2011a, 2011b) proposed using a lognormal response time model, more specifically the two-parameter lognormal model, and automated test assembly (ATA) via mixed integer linear programming. However, this approach has a severe limitation, in that the two-parameter lognormal model lacks a slope parameter. This means that the model assumes that all items are equally speed sensitive. From a conceptual perspective, this assumption seems very restrictive. Furthermore, various other empirical studies and new data analyses performed by us show that this assumption almost never holds in practice. To overcome this shortcoming, we bring together the already frequently used three-parameter lognormal model for response times, which contains a slope parameter, and the ATA approach for controlling speededness by van der Linden. Using multiple empirically based illustrations, the proposed extension is illustrated, including complete and documented R code. Both the original van der Linden approach and our newly proposed approach are available to practitioners in the freely available R package <span>eatATA</span>.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 4","pages":"551-574"},"PeriodicalIF":1.3,"publicationDate":"2023-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12364","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49199830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}