Exploring Rater Accuracy Using Unfolding Models Combined with Topic Models: Incorporating Supervised Latent Dirichlet Allocation
Pub Date: 2022-01-02 | DOI: 10.1080/15366367.2021.1915094
Jordan M. Wheeler, G. Engelhard, Jue Wang
ABSTRACT Objectively scoring constructed-response items on educational assessments has long been a challenge because of the reliance on human raters. Even well-trained raters using a rubric can assess essays inaccurately. Unfolding models measure raters' scoring accuracy by capturing the discrepancy between criterion and operational ratings, placing each essay at an ideal-point location on an unfolding continuum. An essay's unfolding location indicates how difficult it is for raters to score that essay accurately. This study explores a substantive interpretation of the unfolding scale based on a supervised Latent Dirichlet Allocation (sLDA) model. We investigate the relationship between latent topics extracted using sLDA and unfolding locations with a sample of essays (n = 100) obtained from an integrated writing assessment. Results show that (a) three latent topics moderately explain (r² = 0.561) essay locations on the unfolding scale and (b) failing to use and/or cite the source articles led to essays that are difficult to score accurately.
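The kind of analysis described here can be approximated in a minimal sketch, assuming a corpus of essay texts and a vector of previously estimated essay unfolding locations. This is a two-step stand-in (unsupervised LDA followed by regression of locations on topic proportions), not the joint sLDA estimation the study uses, and all variable names are illustrative.

```python
# Two-step approximation: extract latent topics with LDA, then regress essay
# unfolding locations on the per-essay topic proportions. The article fits a
# joint sLDA model; this sketch only illustrates the relationship being studied.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LinearRegression

def topics_vs_unfolding(essays, unfolding_locations, n_topics=3):
    # Bag-of-words document-term matrix from the raw essay texts
    counts = CountVectorizer(stop_words="english").fit_transform(essays)
    # Per-essay topic proportions from an n_topics-topic LDA
    theta = LatentDirichletAllocation(n_components=n_topics,
                                      random_state=0).fit_transform(counts)
    # Regress unfolding locations on topic proportions; .score() returns R^2
    reg = LinearRegression().fit(theta, unfolding_locations)
    return reg, reg.score(theta, unfolding_locations)
```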
{"title":"Exploring Rater Accuracy Using Unfolding Models Combined with Topic Models: Incorporating Supervised Latent Dirichlet Allocation","authors":"Jordan M. Wheeler, G. Engelhard, Jue Wang","doi":"10.1080/15366367.2021.1915094","DOIUrl":"https://doi.org/10.1080/15366367.2021.1915094","url":null,"abstract":"ABSTRACT Objectively scoring constructed-response items on educational assessments has long been a challenge due to the use of human raters. Even well-trained raters using a rubric can inaccurately assess essays. Unfolding models measure rater’s scoring accuracy by capturing the discrepancy between criterion and operational ratings by placing essays on an unfolding continuum with an ideal-point location. Essay unfolding locations indicate how difficult it is for raters to score an essay accurately. This study aims to explore a substantive interpretation of the unfolding scale based on a supervised Latent Dirichlet Allocation (sLDA) model. We investigate the relationship between latent topics extracted using sLDA and unfolding locations with a sample of essays (n = 100) obtained from an integrated writing assessment. Results show that (a) three latent topics moderately explain (r 2 = 0.561) essay locations defined by the unfolding scale and (b) failing to use and/or cite the source articles led to essays that are difficult-to-score accurately.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2022-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87560836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Now in JMP® Pro: Structural Equation Modeling
Pub Date: 2022-01-02 | DOI: 10.1080/15366367.2022.2014446
{"title":"Now in JMP® Pro: Structual Equation Modeling","authors":"","doi":"10.1080/15366367.2022.2014446","DOIUrl":"https://doi.org/10.1080/15366367.2022.2014446","url":null,"abstract":"","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2022-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84870253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using SAS PROC IRT for Multidimensional Item Response Theory Analysis
Pub Date: 2022-01-02 | DOI: 10.1080/15366367.2021.1976090
Ki Cole, Insu Paek
ABSTRACT Statistical Analysis Software (SAS) is a widely used tool for data management and analysis across a variety of fields. Its item response theory procedure (PROC IRT) performs unidimensional and multidimensional item response theory (IRT) analyses for dichotomous and polytomous data. This review summarizes the features of PROC IRT for multidimensional data, with examples for simple structure, complex structure, and bifactor data. Instructive examples for dichotomous data (using the Rasch and 2-parameter logistic models) and polytomous data (using the graded response model) are given, along with explanations of the syntax.
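For readers new to the models the review covers, the following is a minimal sketch of the multidimensional two-parameter logistic (M2PL) response function underlying such analyses. It is written in Python rather than SAS, and the parameterization and variable names are illustrative, not PROC IRT syntax.

```python
import numpy as np

def m2pl_prob(theta, a, d):
    """Multidimensional 2PL response probabilities.

    theta : (n_persons, n_dims) latent trait scores
    a     : (n_items, n_dims) discrimination (slope) parameters
    d     : (n_items,) intercepts
    Returns an (n_persons, n_items) matrix of correct-response probabilities.
    """
    logits = theta @ a.T + d          # linear predictor for every person-item pair
    return 1.0 / (1.0 + np.exp(-logits))

# Example: two correlated dimensions with simple-structure items,
# each item loading on a single dimension (placeholder values)
theta = np.random.multivariate_normal([0, 0], [[1, .5], [.5, 1]], size=500)
a = np.array([[1.2, 0.0], [0.0, 0.8]])
d = np.array([0.3, -0.5])
p = m2pl_prob(theta, a, d)
```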
{"title":"Using SAS PROC IRT for Multidimensional Item Response Theory Analysis","authors":"Ki Cole, Insu Paek","doi":"10.1080/15366367.2021.1976090","DOIUrl":"https://doi.org/10.1080/15366367.2021.1976090","url":null,"abstract":"ABSTRACT Statistical Analysis Software (SAS) is a widely used tool for data management analysis across a variety of fields. The procedure for item response theory (PROC IRT) is one to perform unidimensional and multidimensional item response theory (IRT) analysis for dichotomous and polytomous data. This review provides a summary of the features of PROC IRT specifically for multidimensional data with examples provided for simple structure data, complex structure data, and bifactor data. Instructive examples for dichotomous data (using the Rasch and 2-parameter logistic models) and polytomous data (using the graded response model) are given. Explanations of the syntax are also presented.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2022-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85032110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Comparison of Common IRT Model-selection Methods with Mixed-Format Tests
Pub Date: 2021-10-02 | DOI: 10.1080/15366367.2021.1878779
Yong Luo
ABSTRACT To date, only frequentist model-selection methods have been studied with mixed-format data in the context of IRT model selection, and it is unknown how popular Bayesian model-selection methods such as DIC, WAIC, and LOO perform. In this study, we present the results of a comprehensive simulation study that compared the performance of eight model-selection methods with mixed-format data in selecting the correct combination of IRT models. Findings indicate that DIC, WAIC, and LOO had excellent statistical power to choose the correct IRT model combination. They performed comparably to the LRT, slightly better than AIC, and considerably better than BIC, AICc, and SABIC. In addition, the performance of the three Bayesian methods was more stable than that of AIC and LRT regardless of sample size and ability distribution. The eight model-selection methods were also applied to a real dataset for demonstration purposes.
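As a reminder of what one of the Bayesian criteria computes, here is a minimal sketch of WAIC from a matrix of pointwise posterior log-likelihoods. It follows the standard formula rather than any code from the study, and the input array name is an assumption.

```python
import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    """WAIC on the deviance scale (lower is better) for model selection.

    log_lik : (n_posterior_draws, n_observations) array of log p(y_i | theta_s)
    """
    n_draws = log_lik.shape[0]
    # Log pointwise predictive density: log of the posterior-mean likelihood per observation
    lppd = logsumexp(log_lik, axis=0) - np.log(n_draws)
    # Effective number of parameters: posterior variance of the log-likelihood per observation
    p_waic = log_lik.var(axis=0, ddof=1)
    return -2 * np.sum(lppd - p_waic)
```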
{"title":"A Comparison of Common IRT Model-selection Methods with Mixed-Format Tests","authors":"Yong Luo","doi":"10.1080/15366367.2021.1878779","DOIUrl":"https://doi.org/10.1080/15366367.2021.1878779","url":null,"abstract":"ABSTRACT To date, only frequentist model-selection methods have been studied with mixed-format data in the context of IRT model-selection, and it is unknown how popular Bayesian model-selection methods such as DIC, WAIC, and LOO perform. In this study, we present the results of a comprehensive simulation study that compared the performances of eight model-selection methods with mixed-format data to select the correct combination of IRT models. Findings of the simulation study indicate that DIC, WAIC, and LOO had excellent statistical power to choose the correct IRT model combination. They performed comparably with LRT and slightly preferably than AIC, and considerably better than BIC, AICc, and SABIC. In addition, the performances of the three Bayesian methods were more stable than those of AIC and LRT regardless of the sample size and ability distribution. The eight model-selection methods were applied to a real dataset for demonstration purpose.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2021-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84754702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Now in JMP® Pro: Structural Equation Modeling
Pub Date: 2021-10-02 | DOI: 10.1080/15366367.2021.1982169
{"title":"Now in JMP® Pro: Structual Equation Modeling","authors":"","doi":"10.1080/15366367.2021.1982169","DOIUrl":"https://doi.org/10.1080/15366367.2021.1982169","url":null,"abstract":"","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2021-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87777285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Applying the Rasch Model in Social Sciences Using R and BlueSky Statistics
Pub Date: 2021-10-02 | DOI: 10.1080/15366367.2021.1940667
David Torres Irribarra
{"title":"Applying the Rasch Model in Social Sciences Using R and BlueSky Statistics","authors":"David Torres Irribarra","doi":"10.1080/15366367.2021.1940667","DOIUrl":"https://doi.org/10.1080/15366367.2021.1940667","url":null,"abstract":"","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2021-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89774696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Resources for Identifying Measurement Instruments for Social Science Research
Pub Date: 2021-10-02 | DOI: 10.1080/15366367.2021.1950486
R. Schumacker, Stefanie A. Wind, Lauren F. Holmes
ABSTRACT A variety of resources are available from which researchers can identify measurement instruments, including peer-reviewed journal articles, collections of technical information about published instruments, and electronic databases that are sponsored by universities, testing organizations, and other groups. Although these resources are widespread, many researchers are not aware of them. We provide a brief overview of several selected resources that researchers can use to identify measurement instruments for social science research.
{"title":"Resources for Identifying Measurement Instruments for Social Science Research","authors":"R. Schumacker, Stefanie A. Wind, Lauren F. Holmes","doi":"10.1080/15366367.2021.1950486","DOIUrl":"https://doi.org/10.1080/15366367.2021.1950486","url":null,"abstract":"ABSTRACT A variety of resources are available from which researchers can identify measurement instruments, including peer-reviewed journal articles, collections of technical information about published instruments, and electronic databases that are sponsored by universities, testing organizations, and other groups. Although these resources are widespread, many researchers are not aware of them. We provide a brief overview of several selected resources that researchers can use to identify measurement instruments for social science research.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2021-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77255688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating Six Approaches to Handling Zero-Frequency Scores under Equipercentile Equating
Pub Date: 2021-10-02 | DOI: 10.1080/15366367.2020.1855034
Ting Sun, S. Y. Kim
ABSTRACT In many large testing programs, equipercentile equating has been widely used under a random groups design to adjust for differences in test difficulty between forms. However, a thorny issue arises when a particular score has no observed frequency. The purpose of this study is to propose and evaluate six potential methods for equipercentile equating when an observed-score distribution involves zero-frequency scores. A simulation study involving two test lengths (30 and 50 items), five sample sizes (100, 500, 1000, 3000, and 5000), and two levels of similarity in score distributions between forms was conducted to assess these methods in terms of equating accuracy. Results revealed that presmoothing was the most accurate method for estimating the equipercentile equating relationship when the population score distributions for the two forms differ. When the populations have similar score distributions, presmoothing was also the most accurate method with longer tests (50 items). Furthermore, the performance of these methods does not vary as a function of the number of zero-frequency scores. This study informs practitioners of approaches to handling zero-frequency scores in equipercentile equating that lead to more accurate equating results.
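To see where zero-frequency scores cause trouble, here is a bare-bones sketch of equipercentile equating under a random groups design. It only illustrates the mechanics, not any of the six methods evaluated in the article, and the inputs are assumed to be observed number-correct score frequency distributions on forms X and Y.

```python
import numpy as np

def percentile_ranks(freq):
    """Mid-point percentile ranks for integer scores 0..K, given score frequencies."""
    prop = freq / freq.sum()
    cum = np.cumsum(prop)
    return 100 * (cum - prop / 2)

def equipercentile_equate(freq_x, freq_y):
    """Map each form-X score to the form-Y scale by matching percentile ranks."""
    pr_x = percentile_ranks(np.asarray(freq_x, dtype=float))
    pr_y = percentile_ranks(np.asarray(freq_y, dtype=float))
    scores_y = np.arange(len(freq_y))
    # Invert form Y's percentile-rank function by linear interpolation. When a
    # form-Y score has zero observed frequency, pr_y contains tied values and
    # the inverse is not uniquely defined -- the problem the six methods address
    # (e.g., presmoothing the score distributions before equating).
    return np.interp(pr_x, pr_y, scores_y)
```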
{"title":"Evaluating Six Approaches to Handling Zero-Frequency Scores under Equipercentile Equating","authors":"Ting Sun, S. Y. Kim","doi":"10.1080/15366367.2020.1855034","DOIUrl":"https://doi.org/10.1080/15366367.2020.1855034","url":null,"abstract":"ABSTRACT In many large testing programs, equipercentile equating has been widely used under a random groups design to adjust test difficulty between forms. However, one thorny issue occurs with equipercentile equating when a particular score has no observed frequency. The purpose of this study is to suggest and evaluate six potential methods in equipercentile equating when an observed-score distribution involves zero-frequency scores. A simulation study involving two levels of test lengths (30 and 50 items), five levels of sample sizes (100, 500, 1000, 3000, and 5000), and two levels of similarity in score distributions between two forms, was conducted to assess these methods in terms of equating accuracy. Results revealed that presmoothing was the most accurate method in estimating the equipercentile equating relationship when the population distributions for two forms differ with respect to the form of score distributions. When the populations have a similar score distribution, the presmoothing method was also found to be the most accurate method with longer tests (50 items). Furthermore, the performance of these methods does not vary as a function of the number of zero-frequency scores. This study informs practitioners of approaches to handling a zero-frequency issue with equipercentile equating that leads to more accurate equating results.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2021-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82193167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The 2013-15 Decline in NAEP Mathematics: What it Teaches Us about NAEP and the Common Core
Pub Date: 2021-10-02 | DOI: 10.1080/15366367.2021.1873062
Gregory Camilli
ABSTRACT After 25 years of small to moderate gains in mathematics performance, scores on the National Assessment of Educational Progress (NAEP) main assessment declined between 2013 and 2015 in Grades 4 and 8. Previous research has suggested the decline may be linked to the implementation of the Common Core State Standards (CCSS). In this article, the decline in the NAEP composite score is shown to be driven primarily by losses in the content strands of Geometry and of Data Analysis, Statistics, and Probability. A gain in fractions achievement is also evident in an item-level examination of the NAEP results, but not in reported NAEP scores. These effects are discussed with respect to the CCSS, the rationale for evaluating national progress, and a potential redesign of the NAEP assessment.
{"title":"The 2013-15 Decline in NAEP Mathematics: What it Teaches Us about NAEP and the Common Core","authors":"Gregory Camilli","doi":"10.1080/15366367.2021.1873062","DOIUrl":"https://doi.org/10.1080/15366367.2021.1873062","url":null,"abstract":"ABSTRACT After 25 years with small to moderate gains in performance in mathematics, scores on the National Assessment of Educational Progress (NAEP) main assessment declined between 2013 and 2015 in Grades 4 and 8. Previous research has suggested the decline may be linked to the implementation of the Common Core state standards. In this article, the decline in the NAEP composite score is shown to be driven primarily by losses in the content strands of Geometry and of Data Analysis, Statistics, and Probability. A gain in fractions achievement is also evident in an item-level examination of the NAEP results, but not in reported NAEP scores. These effects are discussed with respect to the CCSS, the rationale for evaluating national progress, and a potential redesign of the NAEP assessment.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2021-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87697889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Investigation of Item Calibration Methods in Multistage Testing
Pub Date: 2021-07-03 | DOI: 10.1080/15366367.2021.1878778
L. Cai, Anthony D. Albano, L. Roussos
ABSTRACT Multistage testing (MST), an adaptive test delivery mode that involves algorithmic selection of predefined item modules rather than individual items, offers a practical alternative to linear and fully computerized adaptive testing. However, interactions across stages between item modules and examinee groups can lead to challenges in item calibration with MST. This study used simulated data based on an operational program to investigate the performance of four item calibration methods under a 1–3 MST design. Conditions included routing module length, routing rule, and sample size. Calibration methods were evaluated based on item and person parameter recovery and classification accuracy. Results indicated that calibration with fixed common item parameters and concurrent calibration assuming a single ability distribution similarly outperformed both separate calibration with linking and concurrent calibration with the multiple-group procedure.
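For readers unfamiliar with the design, the following is a toy illustration of routing in a 1-3 MST design: examinees take a stage-1 routing module and are sent to an easy, medium, or hard second-stage module based on their routing raw score. The cut points and module lengths are arbitrary placeholders, not those of the operational program or routing rules studied in the article.

```python
import numpy as np

def route_1_3(routing_scores, cuts=(16, 22)):
    """Assign examinees to one of three second-stage modules in a 1-3 MST design.

    routing_scores : raw scores on the stage-1 routing module
    cuts           : (low, high) raw-score cut points -- placeholders here
    Returns 0 (easy), 1 (medium), or 2 (hard) for each examinee.
    """
    scores = np.asarray(routing_scores)
    return np.digitize(scores, bins=cuts)  # < low -> 0, [low, high) -> 1, >= high -> 2

# Example: a 30-item routing module with number-correct routing
assignments = route_1_3(np.random.binomial(30, 0.6, size=1000))
```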
{"title":"An Investigation of Item Calibration Methods in Multistage Testing","authors":"L. Cai, Anthony D. Albano, L. Roussos","doi":"10.1080/15366367.2021.1878778","DOIUrl":"https://doi.org/10.1080/15366367.2021.1878778","url":null,"abstract":"ABSTRACT Multistage testing (MST), an adaptive test delivery mode that involves algorithmic selection of predefined item modules rather than individual items, offers a practical alternative to linear and fully computerized adaptive testing. However, interactions across stages between item modules and examinee groups can lead to challenges in item calibration with MST. This study used simulated data based on an operational program to investigate the performance of four item calibration methods under a 1–3 MST design. Conditions included routing module length, routing rule, and sample size. Calibration methods were evaluated based on item and person parameter recovery and classification accuracy. Results indicated that calibration with fixed common item parameters and concurrent calibration assuming a single ability distribution similarly outperformed both separate calibration with linking and concurrent calibration with the multiple-group procedure.","PeriodicalId":46596,"journal":{"name":"Measurement-Interdisciplinary Research and Perspectives","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2021-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73776718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}