Computation and Accuracy Evaluation of Comparable Scores on Culturally Responsive Assessments
Sandip Sinharay, Matthew S. Johnson. Journal of Educational Measurement (DOI: 10.1111/jedm.12381; published 2023-11-16).

Culturally responsive assessments have been proposed as tools to ensure equity and fairness for examinees from all backgrounds, including those from traditionally underserved or minoritized groups. However, these assessments are relatively new and, with few exceptions, have yet to be implemented at large scale. Consequently, there is little guidance on how to compute comparable scores on the various versions of these assessments. In this paper, the multigroup multidimensional Rasch model is repurposed for modeling data originating from the various versions of a culturally responsive assessment and for analyzing such data to compute comparable scores. Two simulation studies evaluate the performance of the model on data simulated from hypothetical culturally responsive assessments and identify the conditions under which the computed scores are accurate. Recommendations are made for measurement practitioners interested in culturally responsive assessments.
Incorporating Test-Taking Engagement into Multistage Adaptive Testing Design for Large-Scale Assessments
Okan Bulut, Guher Gorgun, Hacer Karamese. Journal of Educational Measurement (DOI: 10.1111/jedm.12380; published 2023-11-10).

The use of multistage adaptive testing (MST) has gradually increased in large-scale testing programs, as MST achieves a balanced compromise between linear test design and item-level adaptive testing. MST works on the premise that each examinee gives their best effort when attempting the items and that their responses truly reflect what they know or can do. However, research shows that large-scale assessments may suffer from a lack of test-taking engagement, especially if they are low stakes. Examinees with low test-taking engagement are likely to show noneffortful responding (e.g., answering items very rapidly without reading the item stem or response options). To alleviate the impact of noneffortful responses on the measurement accuracy of MST, test-taking engagement can be operationalized as a latent trait based on response times and incorporated into the on-the-fly module assembly procedure. To demonstrate the proposed approach, a Monte Carlo simulation study was conducted based on item parameters from an international large-scale assessment. The results indicated that on-the-fly module assembly considering both ability and test-taking engagement could minimize the impact of noneffortful responses, yielding more accurate ability estimates and classifications. Implications for practice and directions for future research are discussed.
Information Functions of Rank-2PL Models for Forced-Choice Questionnaires
Jianbin Fu, Xuan Tan, Patrick C. Kyllonen. Journal of Educational Measurement (DOI: 10.1111/jedm.12379; published 2023-10-29).

This paper presents the item and test information functions of the Rank two-parameter logistic models (Rank-2PLM) for items with two (pair) and three (triplet) statements in forced-choice questionnaires. The Rank-2PLM model for pairs is the MUPP-2PLM (Multi-Unidimensional Pairwise Preference) model and, for triplets, the Triplet-2PLM. Fisher's information and directional information are described, and the test information for Maximum Likelihood (ML), Maximum A Posteriori (MAP), and Expected A Posteriori (EAP) trait score estimates is distinguished. Expected item/test information indexes at various levels are proposed and plotted to provide diagnostic information on items and tests. The expected test information indexes for EAP scores may be difficult to compute because of the vast number of item response patterns on a typical test. The relationships of item/test information with the discrimination parameters of statements, standard errors, and reliability estimates of trait scores are discussed and demonstrated using real data. Practical suggestions for checking the various expected item/test information indexes and plots are provided.
Detecting Multidimensional DIF in Polytomous Items with IRT Methods and Estimation Approaches
Güler Yavuz Temel. Journal of Educational Measurement (DOI: 10.1111/jedm.12377; published 2023-10-15).

The purpose of this study was to investigate multidimensional DIF with simple and nonsimple structure in the context of the multidimensional Graded Response Model (MGRM). The study examined and compared the performance of the IRT-LR and Wald tests under MML-EM and MHRM estimation approaches across different test factors and test structures, using both simulation studies and real data sets. When the test included two dimensions, the IRT-LR (MML-EM) generally performed better than the Wald test and provided higher power rates. When the test included three dimensions, the methods performed similarly in DIF detection. In contrast, when the test included four dimensions, MML-EM estimation completely lost precision in estimating nonuniform DIF, even with large sample sizes. The Wald test with the MHRM estimation approach outperformed both the Wald test (MML-EM) and the IRT-LR (MML-EM), showing higher power rates and acceptable Type I error rates for nonuniform DIF. Small and/or unbalanced sample sizes, small DIF magnitudes, unequal ability distributions between groups, the number of dimensions, the estimation method, and the test structure were all found to be important factors in detecting multidimensional DIF.
MSAEM Estimation for Confirmatory Multidimensional Four-Parameter Normal Ogive Models
Jia Liu, Xiangbin Meng, Gongjun Xu, Wei Gao, Ningzhong Shi. Journal of Educational Measurement (DOI: 10.1111/jedm.12378; published 2023-10-09).

In this paper, we develop a mixed stochastic approximation expectation-maximization (MSAEM) algorithm coupled with a Gibbs sampler to compute the marginalized maximum a posteriori estimate (MMAPE) of a confirmatory multidimensional four-parameter normal ogive (M4PNO) model. The proposed MSAEM algorithm not only retains the computational advantages of the stochastic approximation expectation-maximization (SAEM) algorithm for multidimensional data but also alleviates the potential instability caused by label switching, thereby improving estimation accuracy. Simulation studies illustrate the good performance of the proposed MSAEM method, which consistently outperforms SAEM and other existing methods in multidimensional item response theory. Moreover, the proposed method is applied to a real data set from the 2018 Programme for International Student Assessment (PISA) to demonstrate the usefulness of the 4PNO model as well as MSAEM in practice.
Sociocognitive Processes and Item Response Models: A Didactic Example
Tao Gong, Lan Shuai, Robert J. Mislevy. Journal of Educational Measurement (DOI: 10.1111/jedm.12376; published 2023-09-15).

The usual interpretation of the person and task variables in between-persons measurement models such as item response theory (IRT) is as attributes of persons and tasks, respectively. They can instead be viewed as ensemble descriptors of patterns of interactions among persons and situations that arise from sociocognitive complex adaptive systems (CASs). This view offers insights for interpreting and using between-persons measurement models and for connecting with sociocognitive research. In this article, we use data generated from an agent-based model to illustrate relations between "social" and "cognitive" features of a simple underlying CAS and the variables of an IRT model fit to the resulting data. We note how these ideas connect to explanatory item response modeling and briefly comment on implications for score interpretations and uses in practice.
Measuring the Impact of Peer Interaction in Group Oral Assessments with an Extended Many-Facet Rasch Model
Kuan-Yu Jin, Thomas Eckes. Journal of Educational Measurement (DOI: 10.1111/jedm.12375; published 2023-09-15).

Many language proficiency tests include group oral assessments involving peer interaction. In such an assessment, examinees discuss a common topic with others, and human raters score each examinee's spoken performance on specially designed criteria. However, measurement models for analyzing group assessment data usually assume local person independence and thus fail to consider the impact of peer interaction on the assessment outcomes. This research advances an extended many-facet Rasch model for group assessments (MFRM-GA), accounting for local person dependence. In a series of simulations, we examined the MFRM-GA's parameter recovery and the consequences of ignoring peer interactions under the traditional modeling approach. We also used a real dataset from the English-speaking test of the Language Proficiency Assessment for Teachers (LPAT), routinely administered in Hong Kong, to illustrate the efficiency of the new model. The discussion focuses on the model's usefulness for measuring oral language proficiency, practical implications, and future research perspectives.
{"title":"Derek C. Briggs Historical and Conceptual Foundations of Measurement in the Human Sciences: Credos and Controversies","authors":"David Torres Irribarra","doi":"10.1111/jedm.12374","DOIUrl":"10.1111/jedm.12374","url":null,"abstract":"","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Response Time in Multidimensional Computerized Adaptive Testing
Yinhong He, Yuanyuan Qi. Journal of Educational Measurement (DOI: 10.1111/jedm.12373; published 2023-07-07).

In multidimensional computerized adaptive testing (MCAT), item selection strategies are generally constructed from responses alone and do not consider the response times that items require. This study constructed two new criteria (referred to as DT-inc and DT) for MCAT item selection that utilize information from response times. The new designs maximize the amount of information per unit time. These two designs were further extended to the DT_S-inc and DT_S designs to efficiently estimate intentional abilities. Moreover, the EAP method for ability estimation was extended to incorporate response time. The performance of the response-time-based EAP (RT-based EAP) and the new designs was evaluated in simulation and empirical studies. The results showed that the RT-based EAP significantly improved ability estimation precision compared with the EAP without response time, and that the new designs dramatically reduced testing time for examinees at a small cost in ability estimation precision and item pool usage.