Pub Date: 2020-04-02 | DOI: 10.1080/15305058.2019.1692212
The Recovery of Correlation Between Latent Abilities Using Compensatory and Noncompensatory Multidimensional IRT Models
Yanyan Fu, Tyler Strachan, E. Ip, John T. Willse, Shyh-Huei Chen, Terry A. Ackerman
This research examined correlation estimates between latent abilities when using two-dimensional and three-dimensional compensatory and noncompensatory item response theory models. Simulation results showed that, for all models and conditions, recovery of the latent correlation was best when the test consisted entirely of simple-structure items. When a test measured weakly discriminated dimensions, the latent correlation was harder to recover. Results also showed that increasing the sample size or test length, or using simpler models (i.e., two-parameter logistic rather than three-parameter logistic, compensatory rather than noncompensatory), improved recovery of the latent correlation.
{"title":"The Recovery of Correlation Between Latent Abilities Using Compensatory and Noncompensatory Multidimensional IRT Models","authors":"Yanyan Fu, Tyler Strachan, E. Ip, John T. Willse, Shyh-Huei Chen, Terry A. Ackerman","doi":"10.1080/15305058.2019.1692212","DOIUrl":"https://doi.org/10.1080/15305058.2019.1692212","url":null,"abstract":"This research examined correlation estimates between latent abilities when using the two-dimensional and three-dimensional compensatory and noncompensatory item response theory models. Simulation study results showed that the recovery of the latent correlation was best when the test contained 100% of simple structure items for all models and conditions. When a test measured weakly discriminated dimensions, it became harder to recover the latent correlation. Results also showed that increasing the sample size, test length, or using simpler models (i.e., two-parameter logistic rather than three-parameter logistic, compensatory rather than noncompensatory) could improve the recovery of latent correlation.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 1","pages":"169 - 186"},"PeriodicalIF":1.7,"publicationDate":"2020-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2019.1692212","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44045191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-02-12 | DOI: 10.1080/15305058.2020.1720216
Effect of Quality Characteristics of Peer Raters on Rating Errors in Peer Assessment
Xiuyan Guo, Pui‐wa Lei
Little research has examined how peer raters’ quality characteristics affect the quality of their ratings. This study addresses that gap by investigating the effects of key rater-quality variables: content knowledge, previous rating experience, training on the rating task, and rating motivation. In an experiment in which training and motivation interventions were manipulated, 24 classes comprising 838 high school students were randomly assigned to study conditions. Inter-rater error, intra-rater error, and criterion error indices for peer ratings of four selected essays were analyzed using hierarchical linear models. Results indicated that peer raters’ content knowledge, previous rating experience, and rating motivation were associated with rating errors. The study also found some significant interactions between peer raters’ quality characteristics. Implications for in-person and online peer assessment, as well as future directions, are discussed.
{"title":"Effect of Quality Characteristics of Peer Raters on Rating Errors in Peer Assessment","authors":"Xiuyan Guo, Pui‐wa Lei","doi":"10.1080/15305058.2020.1720216","DOIUrl":"https://doi.org/10.1080/15305058.2020.1720216","url":null,"abstract":"Little research has been done on the effects of peer raters’ quality characteristics on peer rating qualities. This study aims to address this gap and investigate the effects of key variables related to peer raters’ qualities, including content knowledge, previous rating experience, training on rating tasks, and rating motivation. In an experiment where training and motivation interventions were manipulated, 24 classes with 838 high school students were randomly assigned to study conditions. Inter-rater error, intra-rater error and criterion error indices for peer ratings on four selected essays were analyzed using hierarchical linear models. Results indicated that peer raters’ content knowledge, previous rating experience, and rating motivation were associated with rating errors. This study also found some significant interactions between peer raters’ quality characteristics. Implications for in-person and online peer assessments as well as future directions are discussed.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 1","pages":"206 - 230"},"PeriodicalIF":1.7,"publicationDate":"2020-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2020.1720216","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43660947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-01-10 | DOI: 10.1080/15305058.2019.1706529
The Relationship between Response-Time Effort and Accuracy in PISA Science Multiple Choice Items
M. Michaelides, M. Ivanova, C. Nicolaou
The study examined the relationship between examinees’ test-taking effort and their accuracy rate on items from the PISA 2015 assessment. The 10% normative threshold method was applied to Science multiple-choice items in the Cyprus sample to detect rapid-guessing behavior. Results showed that the extent of rapid guessing across simple and complex multiple-choice items averaged less than 6% per item. Rapid guessers were identified, and on most items their accuracy was lower than that of students engaging in solution-based behavior. Examinees with higher overall performance on the test items tended to engage in less rapid guessing than their lower-performing peers. Overall, this empirical investigation presents original evidence on test-taking effort as measured by response time on PISA items and tests propositions of Wise’s (2017) Test-Taking Theory.
{"title":"The Relationship between Response-Time Effort and Accuracy in PISA Science Multiple Choice Items","authors":"M. Michaelides, M. Ivanova, C. Nicolaou","doi":"10.1080/15305058.2019.1706529","DOIUrl":"https://doi.org/10.1080/15305058.2019.1706529","url":null,"abstract":"The study examined the relationship between examinees’ test-taking effort and their accuracy rate on items from the PISA 2015 assessment. The 10% normative threshold method was applied on Science multiple-choice items in the Cyprus sample to detect rapid guessing behavior. Results showed that the extent of rapid guessing across simple and complex multiple-choice items was on average less than 6% per item. Rapid guessers were identified, and for most items their accuracy was lower than the accuracy for students engaging in solution-based behavior. Examinees with higher overall performance on the test items tended to engage in less rapid guessing than their lower performing peers. Overall, this empirical investigation presents original evidence on test-taking effort as measured by response time in PISA items and tests propositions of Wise’s (2017) Test-Taking Theory.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 1","pages":"187 - 205"},"PeriodicalIF":1.7,"publicationDate":"2020-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2019.1706529","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43585415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-01-02 | DOI: 10.1080/15305058.2018.1551225
Log Data Analysis with ANFIS: A Fuzzy Neural Network Approach
Ying Cui, Qi Guo, Jacqueline P. Leighton, Man-Wai Chu
This study explores the use of the Adaptive Neuro-Fuzzy Inference System (ANFIS), a neuro-fuzzy approach, to analyze log data from technology-based assessments, extract relevant features of student problem-solving processes, and develop and refine a set of fuzzy logic rules for interpreting student performance. Log data recording student response processes while solving a science simulation task were analyzed with ANFIS. Results indicate that the ANFIS analysis could generate and refine a set of fuzzy rules that shed light on how students solve the simulation task. We conclude by discussing the advantages of combining human judgment with the learning capacity of ANFIS for log data analysis and by outlining the limitations of the current study and areas for future research.
{"title":"Log Data Analysis with ANFIS: A Fuzzy Neural Network Approach","authors":"Ying Cui, Qi Guo, Jacqueline P. Leighton, Man-Wai Chu","doi":"10.1080/15305058.2018.1551225","DOIUrl":"https://doi.org/10.1080/15305058.2018.1551225","url":null,"abstract":"This study explores the use of the Adaptive Neuro-Fuzzy Inference System (ANFIS), a neuro-fuzzy approach, to analyze the log data of technology-based assessments to extract relevant features of student problem-solving processes, and develop and refine a set of fuzzy logic rules that could be used to interpret student performance. The log data that record student response processes while solving a science simulation task were analyzed with ANFIS. Results indicate the ANFIS analysis could generate and refine a set of fuzzy rules that shed lights on the process of how students solve the simulation task. We conclude the article by discussing the advantages of combining human judgments with the learning capacity of ANFIS for log data analysis and outlining the limitations of the current study and areas of future research.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 1","pages":"78 - 96"},"PeriodicalIF":1.7,"publicationDate":"2020-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1551225","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48938428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-01-02 | DOI: 10.1080/15305058.2018.1551224
Engineering a Twenty-First Century Reading Comprehension Assessment System Utilizing Scenario-Based Assessment Techniques
J. Sabatini, T. O’Reilly, Jonathan P. Weeks, Zuowei Wang
The construct of reading comprehension has changed significantly in the twenty-first century; however, some test designs have not evolved sufficiently to capture these changes. Specifically, the nature of literacy sources and the skills required has changed, a shift brought about primarily by the widespread use of digital technologies. Modern theories of comprehension and discourse processes have been developed to accommodate these changes, and the learning sciences have followed suit. These influences have significant implications for how we think about the development of comprehension proficiency across grades. In this paper, we describe a theoretically driven, developmentally sensitive assessment system based on a scenario-based assessment paradigm and present evidence for its feasibility and psychometric soundness.
{"title":"Engineering a Twenty-First Century Reading Comprehension Assessment System Utilizing Scenario-Based Assessment Techniques","authors":"J. Sabatini, T. O’Reilly, Jonathan P. Weeks, Zuowei Wang","doi":"10.1080/15305058.2018.1551224","DOIUrl":"https://doi.org/10.1080/15305058.2018.1551224","url":null,"abstract":"The construct of reading comprehension has changed significantly in the twenty-first century; however, some test designs have not evolved sufficiently to capture these changes. Specifically, the nature of literacy sources and skills required has changed (wrought primarily by widespread use of digital technologies). Modern theories of comprehension and discourse processes have been developed to accommodate these changes, and the learning sciences have followed suit. These influences have significant implications for how we think about the development of comprehension proficiency across grades. In this paper, we describe a theoretically driven, developmentally sensitive assessment system based on a scenario-based assessment paradigm, and present evidence for its feasibility and psychometric soundness.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 1","pages":"1 - 23"},"PeriodicalIF":1.7,"publicationDate":"2020-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1551224","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47386975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-01-02 | DOI: 10.1080/15305058.2019.1605999
The (Non)Impact of Differential Test Taker Engagement on Aggregated Scores
S. Wise, J. Soland, Y. Bo
Disengaged test taking tends to be most prevalent on low-stakes tests. This has led to questions about the validity of aggregated scores from large-scale international assessments such as PISA and TIMSS, as previous research has found a meaningful correlation between countries’ mean engagement and mean performance. The current study, using data from the computer-based version of the PISA-Based Test for Schools, examined the distortive effects of differential engagement on aggregated school-level scores. The results showed that, although there was considerable differential engagement among schools, the school means were highly stable due to two factors. First, any distortive effects of disengagement in a school were diluted by the high proportion of students exhibiting no non-effortful behavior. Second, and most interestingly, disengagement produced both positive and negative distortion of individual student scores, which tended to cancel out much of the net distortive effect on the school’s mean.
{"title":"The (Non)Impact of Differential Test Taker Engagement on Aggregated Scores","authors":"S. Wise, J. Soland, Y. Bo","doi":"10.1080/15305058.2019.1605999","DOIUrl":"https://doi.org/10.1080/15305058.2019.1605999","url":null,"abstract":"Disengaged test taking tends to be most prevalent with low-stakes tests. This has led to questions about the validity of aggregated scores from large-scale international assessments such as PISA and TIMSS, as previous research has found a meaningful correlation between the mean engagement and mean performance of countries. The current study, using data from the computer-based version of the PISA-Based Test for Schools, examined the distortive effects of differential engagement on aggregated school-level scores. The results showed that, although there was considerable differential engagement among schools, the school means were highly stable due to two factors. First, any distortive effects of disengagement in a school were diluted by a high proportion of the students exhibiting no non-effortful behavior. Second, and most interestingly, disengagement produced both positive and negative distortion of individual student scores, which tended to cancel out much of the net distortive effect on the school’s mean.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 1","pages":"57 - 77"},"PeriodicalIF":1.7,"publicationDate":"2020-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2019.1605999","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47045097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-01-01
Stopping Rules for Computer Adaptive Testing When Item Banks Have Nonuniform Information
Scott B Morris, Michael Bass, Elizabeth Howard, Richard E Neapolitan
The standard error (SE) stopping rule, which terminates a computer adaptive test (CAT) when the SE is less than a threshold, is effective when there are informative questions for all trait levels. However, in domains such as patient reported outcomes, the items in a bank might all target one end of the trait continuum (e.g., negative symptoms), and the bank may lack depth for many individuals. In such cases, the predicted standard error reduction (PSER) stopping rule will stop the CAT even if the SE threshold has not been reached, and can avoid administering excessive questions that provide little additional information. By tuning the parameters of the PSER algorithm, a practitioner can specify a desired tradeoff between accuracy and efficiency. Using simulated data for the PROMIS Anxiety and Physical Function banks, we demonstrate that these parameters can substantially impact CAT performance. When the parameters were optimally tuned, the PSER stopping rule was found to outperform the SE stopping rule overall and particularly for individuals not targeted by the bank, and presented roughly the same number of items across the trait continuum. Therefore, the PSER stopping rule provides an effective method for balancing the precision and efficiency of a CAT.
{"title":"Stopping Rules for Computer Adaptive Testing When Item Banks Have Nonuniform Information.","authors":"Scott B Morris, Michael Bass, Elizabeth Howard, Richard E Neapolitan","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>The <i>standard error</i> (<i>SE</i>) stopping rule, which terminates a <i>computer adaptive test</i> (CAT) when the SE is less than a threshold, is effective when there are informative questions for all trait levels. However, in domains such as patient reported outcomes, the items in a bank might all target one end of the trait continuum (e.g., negative symptoms), and the bank may lack depth for many individuals. In such cases, the <i>predicted standard error reduction</i> (PSER) stopping rule will stop the CAT even if the <i>SE</i> threshold has not been reached, and can avoid administering excessive questions that provide little additional information. By tuning the parameters of the PSER algorithm, a practitioner can specify a desired tradeoff between accuracy and efficiency<i>.</i> Using simulated data for the PROMIS <i>Anxiety</i> and <i>Physical Function</i> banks, we demonstrate that these parameters can substantially impact CAT performance. When the parameters were optimally tuned, the PSER stopping rule was found to outperform the <i>SE</i> stopping rule overall and particularly for individuals not targeted by the bank, and presented roughly the same number of items across the trait continuum. Therefore, the PSER stopping rule provides an effective method for balancing the precision and efficiency of a CAT.</p>","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 2","pages":"146-168"},"PeriodicalIF":1.7,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7518406/pdf/nihms-1534260.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38521672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-10-02 | DOI: 10.1080/15305058.2019.1631024
ITC Guidelines for the Large-Scale Assessment of Linguistically and Culturally Diverse Populations
M. Oliveri
These guidelines describe considerations relevant to the assessment of test takers in or across countries or regions that are linguistically or culturally diverse. The guidelines were developed by a committee of experts to inform test developers, psychometricians, test users, and test administrators about fairness issues, in support of the fair and valid assessment of linguistically or culturally diverse populations. They are meant to apply to most, if not all, aspects of the development, administration, scoring, and use of assessments, and are intended to supplement other existing professional standards or guidelines for testing and assessment. That is, these guidelines focus on the types of adaptations and considerations to use when developing, reviewing, and interpreting items and test scores from tests administered to linguistically or culturally diverse populations. Other guidelines, such as the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) or the Guidelines for Best Practice in Cross-Cultural Surveys (Survey Research Center, 2016), may also be relevant to testing linguistically and culturally diverse populations.
{"title":"ITC Guidelines for the Large-Scale Assessment of Linguistically and Culturally Diverse Populations","authors":"M. Oliveri","doi":"10.1080/15305058.2019.1631024","DOIUrl":"https://doi.org/10.1080/15305058.2019.1631024","url":null,"abstract":"These guidelines describe considerations relevant to the assessment of test takers in or across countries or regions that are linguistically or culturally diverse. The guidelines were developed by a committee of experts to help inform test developers, psychometricians, test users, and test administrators about fairness issues in support of the fair and valid assessment of linguistically or culturally diverse populations. They are meant to apply to most, if not all, aspects of the development, administration, scoring, and use of assessments; and are intended to supplement other existing professional standards or guidelines for testing and assessment. That is, these guidelines focus on the types of adaptations and considerations to use when developing, reviewing, and interpreting items and test scores from tests administered to culturally and linguistically or culturally diverse populations. Other guidelines such as the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) or Guidelines for Best Practice in Cross-Cultural Surveys (Survey Research Center, 2016) may also be relevant to testing linguistically and culturally diverse populations.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"19 1","pages":"301 - 336"},"PeriodicalIF":1.7,"publicationDate":"2019-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2019.1631024","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49265430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-07-16 | DOI: 10.1080/15305058.2019.1632316
Migration Background in PISA’s Measure of Social Belonging: Using a Diffractive Lens to Interpret Multi-Method DIF Studies
Nathan D. Roberson, B. Zumbo
This paper investigates measurement invariance as it relates to migration background, using the Program for International Student Assessment (PISA) measure of social belonging. We explore how two measurement invariance techniques, the alignment method used in conjunction with logistic regression, provide insights into differential item functioning in the case of multiple group comparisons. Social belonging is a central human need, and we argue that immigration background is an important factor in how an individual interacts with a survey and items about belonging. Overall, results from both the alignment method and ordinal logistic regression, interpreted through a diffractive lens, suggest that it is inappropriate to treat peoples of four different immigration backgrounds within the analyzed countries as exchangeable groups.
{"title":"Migration Background in PISA’s Measure of Social Belonging: Using a Diffractive Lens to Interpret Multi-Method DIF Studies","authors":"Nathan D. Roberson, B. Zumbo","doi":"10.1080/15305058.2019.1632316","DOIUrl":"https://doi.org/10.1080/15305058.2019.1632316","url":null,"abstract":"This paper investigates measurement invariance as it relates to migration background using the Program for International Student Assessment measure of social belonging. We explore how the use of two measurement invariance techniques provide insights into differential item functioning using the alignment method in conjunction with logistic regression in the case of multiple group comparisons. Social belonging is a central human need, and we argue that immigration background is important factor when considering how an individual interacts with a survey/items about belonging. Overall results from both the alignment method and ordinal logistic regression, interpreted through a diffractive lens, suggest that it is inappropriate to treat peoples of four different immigration backgrounds within the countries analyzed as exchangeable groups.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"19 1","pages":"363 - 389"},"PeriodicalIF":1.7,"publicationDate":"2019-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2019.1632316","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44180342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-07-03 | DOI: 10.1080/15305058.2019.1621871
Dynamic Multistage Testing: A Highly Efficient and Regulated Adaptive Testing Method
Xiao Luo, Xinrui Wang
This study introduced dynamic multistage testing (dy-MST) as an improvement to existing adaptive testing methods. dy-MST combines the advantages of computerized adaptive testing (CAT) and computerized adaptive multistage testing (ca-MST) to create a highly efficient and regulated adaptive testing method. In the test construction phase, multistage panels are assembled using design principles and assembly techniques similar to those of ca-MST. In the administration phase, items are adaptively administered from a dynamic interim pool. A large-scale simulation study evaluating the merits of dy-MST found that it significantly reduced test length while maintaining classification accuracy identical to that of the full-length tests and meeting all content requirements effectively. Psychometrically, the testing efficiency of dy-MST was comparable to that of CAT. Operationally, dy-MST allows for holistic pre-administration management of test content directly at the test level. Thus, dy-MST is deemed appropriate for delivering adaptive tests with high efficiency and well-controlled content.
{"title":"Dynamic Multistage Testing: A Highly Efficient and Regulated Adaptive Testing Method","authors":"Xiao Luo, Xinrui Wang","doi":"10.1080/15305058.2019.1621871","DOIUrl":"https://doi.org/10.1080/15305058.2019.1621871","url":null,"abstract":"This study introduced dynamic multistage testing (dy-MST) as an improvement to existing adaptive testing methods. dy-MST combines the advantages of computerized adaptive testing (CAT) and computerized adaptive multistage testing (ca-MST) to create a highly efficient and regulated adaptive testing method. In the test construction phase, multistage panels are assembled using similar design principles and assembly techniques with ca-MST. In the administration phase, items are adaptively administered from a dynamic interim pool. A large-scale simulation study was conducted to evaluate the merits of dy-MST, and it found that dy-MST significantly reduced test length while maintaining the identical classification accuracy with the full-length tests and meeting all content requirements effectively. Psychometrically, the testing efficiency in dy-MST was comparable to CAT. Operationally, dy-MST allows for holistic pre-administration management of test content directly at the test level. Thus, dy-MST is deemed appropriate for delivering adaptive tests with high efficiency and well-controlled content.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"19 1","pages":"227 - 247"},"PeriodicalIF":1.7,"publicationDate":"2019-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2019.1621871","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48949313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}