Reviewing the Test Reviews: Quality Judgments and Reviewer Agreements in the Mental Measurements Yearbook
Pub Date: 2021-02-25 | DOI: 10.1080/08957347.2021.1890742 | Applied Measurement in Education, 34, 75–84
T. Hogan, Marissa DeStefano, Caitlin Gilby, Dana C. Kosman, Joshua Peri
ABSTRACT Buros’ Mental Measurements Yearbook (MMY) has provided professional reviews of commercially published psychological and educational tests for over 80 years. It serves as a kind of conscience for the testing industry. For a random sample of 50 entries in the 19th MMY (a total of 100 separate reviews), this study determined the level of qualitative judgment rendered by reviewers and the consistency of those independent reviewers in rendering judgments. Judgments of quality were distributed almost uniformly from very good to very bad across the 100 reviews. Agreement among reviewers for a given test was positive but relatively weak. We explore implications of the results and suggest follow-up investigations.
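The abstract does not state which agreement index the authors computed. Purely as an illustration of how pairwise agreement between two independent reviewers might be summarized (a sketch using invented data and an assumed rating scale, not the article's method), a rank correlation and an exact-agreement rate could be computed from ratings on a common ordinal quality scale:

```python
# Hypothetical example: summarizing agreement between two reviewers' quality ratings.
# The 1-5 scale (1 = very bad ... 5 = very good) and the data are invented for illustration;
# the article's actual coding scheme and agreement index are not given in the abstract.
import numpy as np
from scipy.stats import spearmanr

reviewer_a = np.array([5, 4, 2, 1, 3, 4, 2, 5, 1, 3])   # one rating per test entry
reviewer_b = np.array([4, 4, 3, 2, 2, 5, 1, 4, 2, 2])

rho, p_value = spearmanr(reviewer_a, reviewer_b)
exact = np.mean(reviewer_a == reviewer_b)                # cruder index: exact agreement rate
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f}); exact agreement = {exact:.0%}")
```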
{"title":"Reviewing the Test Reviews: Quality Judgments and Reviewer Agreements in the Mental Measurements Yearbook","authors":"T. Hogan, Marissa DeStefano, Caitlin Gilby, Dana C. Kosman, Joshua Peri","doi":"10.1080/08957347.2021.1890742","DOIUrl":"https://doi.org/10.1080/08957347.2021.1890742","url":null,"abstract":"ABSTRACT Buros’ Mental Measurements Yearbook (MMY) has provided professional reviews of commercially published psychological and educational tests for over 80 years. It serves as a kind of conscience for the testing industry. For a random sample of 50 entries in the 19th MMY (a total of 100 separate reviews) this study determined the level of qualitative judgment rendered by reviewers and the consistency of those independent reviewers in rendering judgments. Judgments of quality distributed themselves almost uniformly from very good to very bad across the 100 reviews. Agreement among reviewers for a given test was positive but relatively weak. We explore implications of the results and suggest follow-up investigations.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"34 1","pages":"75 - 84"},"PeriodicalIF":1.5,"publicationDate":"2021-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2021.1890742","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47432659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Think Alouds: Informing Scholarship and Broadening Partnerships through Assessment
Pub Date: 2021-01-02 | DOI: 10.1080/08957347.2020.1835914 | Applied Measurement in Education, 34, 1–9
J. Bostic
ABSTRACT Think alouds are valuable tools for academicians, test developers, and practitioners as they provide a unique window into a respondent’s thinking during an assessment. The purpose of this special issue is to highlight novel ways to use think alouds as a means to gather evidence about respondents’ thinking. An intended outcome from this special issue is that readers may better understand think alouds and feel better equipped to use them in practical and research settings.
{"title":"Think Alouds: Informing Scholarship and Broadening Partnerships through Assessment","authors":"J. Bostic","doi":"10.1080/08957347.2020.1835914","DOIUrl":"https://doi.org/10.1080/08957347.2020.1835914","url":null,"abstract":"ABSTRACT Think alouds are valuable tools for academicians, test developers, and practitioners as they provide a unique window into a respondent’s thinking during an assessment. The purpose of this special issue is to highlight novel ways to use think alouds as a means to gather evidence about respondents’ thinking. An intended outcome from this special issue is that readers may better understand think alouds and feel better equipped to use them in practical and research settings.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"13 1","pages":"1 - 9"},"PeriodicalIF":1.5,"publicationDate":"2021-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1835914","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41258007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Formative Assessment of Computational Thinking: Cognitive and Metacognitive Processes
Pub Date: 2021-01-02 | DOI: 10.1080/08957347.2020.1835912 | Applied Measurement in Education, 34, 27–45
Sarah M. Bonner, Peggy P. Chen, Kristi Jones, Brandon Milonovich
ABSTRACT We describe the use of think alouds to examine substantive processes involved in performance on a formative assessment of computational thinking (CT) designed to support self-regulated learning (SRL). Our task design model included three phases of work on a computational thinking problem: forethought, performance, and reflection. The cognitive processes of seven students who reported their thinking during all three phases were analyzed. Ratings of artifacts of code indicated the computational thinking problem was moderately difficult to solve (M = 15, SD = 5) on a scale of 0 to 21 points. Profiles were created to illustrate length and sequence of different types of cognitive processes during the think-aloud. Results provide construct validity evidence for the tasks as formative assessments of CT, elucidate the way learners at different levels of skill use SRL, shed light on the nature of computational thinking, and point out areas for improvement in assessment design.
{"title":"Formative Assessment of Computational Thinking: Cognitive and Metacognitive Processes","authors":"Sarah M. Bonner, Peggy P. Chen, Kristi Jones, Brandon Milonovich","doi":"10.1080/08957347.2020.1835912","DOIUrl":"https://doi.org/10.1080/08957347.2020.1835912","url":null,"abstract":"ABSTRACT We describe the use of think alouds to examine substantive processes involved in performance on a formative assessment of computational thinking (CT) designed to support self-regulated learning (SRL). Our task design model included three phases of work on a computational thinking problem: forethought, performance, and reflection. The cognitive processes of seven students who reported their thinking during all three phases were analyzed. Ratings of artifacts of code indicated the computational thinking problem was moderately difficult to solve (M = 15, SD = 5) on a scale of 0 to 21 points. Profiles were created to illustrate length and sequence of different types of cognitive processes during the think-aloud. Results provide construct validity evidence for the tasks as formative assessments of CT, elucidate the way learners at different levels of skill use SRL, shed light on the nature of computational thinking, and point out areas for improvement in assessment design.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"18 2","pages":"27 - 45"},"PeriodicalIF":1.5,"publicationDate":"2021-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1835912","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41299261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Think-Alouds for Response Process Evidence of Teacher Attentiveness
Pub Date: 2021-01-02 | DOI: 10.1080/08957347.2020.1835910 | Applied Measurement in Education, 34, 10–26
Ya Mo, Michele B. Carney, L. Cavey, Tatia Totorica
ABSTRACT There is a need for assessment items that assess complex constructs but can also be efficiently scored for evaluation of teacher education programs. In an effort to measure the construct of teacher attentiveness in an efficient and scalable manner, we are using exemplar responses elicited by constructed-response item prompts to develop selected-response assessment items. Through analyses of think-aloud interview data, this study examines the alignment between participant responses to, and scores arising from, the two item types. The interview protocol was administered to 12 mathematics teachers and teacher candidates who were first presented a constructed-response version of an item followed by the selected-response version of the same item stem. Our analyses focus on the alignment between responses and scores for eight item stems across the two item types and the identification of items in need of modification. The results have the potential to influence the way test developers generate and use response process evidence to support or refute the assumptions inherent in a particular score interpretation and use.
{"title":"Using Think-Alouds for Response Process Evidence of Teacher Attentiveness","authors":"Ya Mo, Michele B. Carney, L. Cavey, Tatia Totorica","doi":"10.1080/08957347.2020.1835910","DOIUrl":"https://doi.org/10.1080/08957347.2020.1835910","url":null,"abstract":"ABSTRACT There is a need for assessment items that assess complex constructs but can also be efficiently scored for evaluation of teacher education programs. In an effort to measure the construct of teacher attentiveness in an efficient and scalable manner, we are using exemplar responses elicited by constructed-response item prompts to develop selected-response assessment items. Through analyses of think-aloud interview data, this study examines the alignment between participant responses to, and scores arising from, the two item types. The interview protocol was administered to 12 mathematics teachers and teacher candidates who were first presented a constructed-response version of an item followed by the selected-response version of the same item stem. Our analyses focus on the alignment between responses and scores for eight item stems across the two item types and the identification of items in need of modification. The results have the potential to influence the way test developers generate and use response process evidence to support or refute the assumptions inherent in a particular score interpretation and use.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"34 1","pages":"10 - 26"},"PeriodicalIF":1.5,"publicationDate":"2021-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1835910","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42248907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gathering Response Process Data for a Problem-Solving Measure through Whole-Class Think Alouds
Pub Date: 2021-01-02 | DOI: 10.1080/08957347.2020.1835913 | Applied Measurement in Education, 34, 46–60
J. Bostic, T. Sondergeld, G. Matney, G. Stone, Tiara Hicks
ABSTRACT Response process validity evidence provides a window into a respondent’s cognitive processing. The purpose of this study is to describe a new data collection tool called a whole-class think aloud (WCTA). This work was performed as part of test development for a series of problem-solving measures to be used in elementary and middle grades. Data from third-grade students were collected in a 1–1 think-aloud setting and compared to data from similar students who participated in WCTAs. Findings indicated that students performed similarly on the items across the two think-aloud settings. Respondents also needed less encouragement to share ideas aloud during the WCTA than in the 1–1 think aloud, and they communicated feeling more comfortable in the WCTA setting. Drawing the findings together, WCTAs functioned as well as, if not better than, 1–1 think alouds for the purpose of contextualizing third-grade students’ cognitive processes. Future studies using WCTAs are recommended to explore their limitations and other factors that might impact their success as data-gathering tools.
{"title":"Gathering Response Process Data for a Problem-Solving Measure through Whole-Class Think Alouds","authors":"J. Bostic, T. Sondergeld, G. Matney, G. Stone, Tiara Hicks","doi":"10.1080/08957347.2020.1835913","DOIUrl":"https://doi.org/10.1080/08957347.2020.1835913","url":null,"abstract":"ABSTRACT Response process validity evidence provides a window into a respondent’s cognitive processing. The purpose of this study is to describe a new data collection tool called a whole-class think aloud (WCTA). This work is performed as part of test development for a series of problem-solving measures to be used in elementary and middle grades. Data from third-grade students were collected in a 1–1 think-aloud setting and compared to data from similar students as part of WCTAs. Findings indicated that students performed similarly on the items when the two think-aloud settings were compared. Respondents also needed less encouragement to share ideas aloud during the WCTA compared to the 1–1 think aloud. They also communicated feeling more comfortable in the WCTA setting compared to the 1–1 think aloud. Drawing the findings together, WCTAs functioned as well if not better, than 1–1 think alouds for the purpose of contextualizing third-grade students’ cognitive processes. Future studies using WCTAs are recommended to explore their limitations and other factors that might impact their success as data gathering tools.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"34 1","pages":"46 - 60"},"PeriodicalIF":1.5,"publicationDate":"2021-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1835913","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47052381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rethinking Think-Alouds: The Often-Problematic Collection of Response Process Data
Pub Date: 2021-01-02 | DOI: 10.1080/08957347.2020.1835911 | Applied Measurement in Education, 34, 61–74
Jacqueline P. Leighton
ABSTRACT The objective of this paper is to comment on the think-aloud methods presented in the three papers included in this special issue. The commentary offered stems from the author’s own psychological investigations of unobservable information processes and the conditions under which the most defensible claims can be advanced. The structure of this commentary is as follows: First, the objective of think-alouds in light of test development and validation goals is considered for each of the three papers in the volume. Second, the response processes (psychological constructs) described in the three studies are assessed vis-à-vis think-aloud methods. Third, the methodological details that are essential to properly evaluate response processing data for educational assessment goals are elaborated. Fourth, the possible impasse of using a psychological technique to collect psychological data about non-psychological content forms the basis of the commentary’s conclusion.
{"title":"Rethinking Think-Alouds: The Often-Problematic Collection of Response Process Data","authors":"Jacqueline P. Leighton","doi":"10.1080/08957347.2020.1835911","DOIUrl":"https://doi.org/10.1080/08957347.2020.1835911","url":null,"abstract":"ABSTRACT The objective of this paper is to comment on the think-aloud methods presented in the three papers included in this special issue. The commentary offered stems from the author’s own psychological investigations of unobservable information processes and the conditions under which the most defensible claims can be advanced. The structure of this commentary is as follows: First, the objective of think-alouds in light of test development and validation goals are considered for each of the three papers in the volume. Second, the response processes (psychological constructs) described in the three studies are assessed vis à vis think-aloud methods. Third, the methodological details that are essential to properly evaluate response processing data for educational assessment goals are elaborated. Fourth, the possible impasse of using a psychological technique to collect psychological data about non-psychological content forms the basis of the commentary’s conclusion.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"34 1","pages":"61 - 74"},"PeriodicalIF":1.5,"publicationDate":"2021-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1835911","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42774068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Asymptotic Standard Errors of Equating Coefficients Using the Characteristic Curve Methods for the Graded Response Model
Pub Date: 2020-08-25 | DOI: 10.1080/08957347.2020.1789142 | Applied Measurement in Education, 33, 309–330
Zhonghua Zhang
ABSTRACT The characteristic curve methods have been applied to estimate the equating coefficients in test equating under the graded response model (GRM). However, the approaches for obtaining the standard errors for the estimates of these coefficients have not been developed and examined. In this study, the delta method was applied to derive the mathematical formulas for computing the asymptotic standard errors for the parameter scale transformation coefficients and the true score equating coefficients that are estimated using the characteristic curve methods in test equating under the GRM in the context of the common-item nonequivalent groups equating design. Simulated and real data were further used to examine the accuracy of the derivations and compare the performance of the newly developed delta method with that of the multiple imputation method. The results indicated that the standard errors produced by the delta method were extremely close to the criterion empirical standard errors as well as those yielded by the multiple imputation method. The development of the standard error expressions by the delta method in the study has important practical implications.
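As background for the derivation summarized above, the first-order multivariate delta method can be stated generically as follows. This is a sketch only; the article's specific gradient expressions for the GRM scale transformation and true-score equating coefficients are not reproduced in the abstract.

```latex
% Generic first-order delta method; g and Sigma stand in for the specific equating
% coefficients and item-parameter covariance matrices treated in the article.
If $\hat{\boldsymbol{\theta}}$ is approximately normal with mean $\boldsymbol{\theta}$ and
covariance matrix $\boldsymbol{\Sigma}$, then for a differentiable coefficient
$A = g(\boldsymbol{\theta})$,
\[
  \operatorname{Var}\!\bigl(g(\hat{\boldsymbol{\theta}})\bigr)
    \;\approx\; \nabla g(\boldsymbol{\theta})^{\top}\,\boldsymbol{\Sigma}\,\nabla g(\boldsymbol{\theta}),
  \qquad
  \operatorname{SE}(\hat{A})
    \;\approx\; \sqrt{\nabla g(\hat{\boldsymbol{\theta}})^{\top}\,\widehat{\boldsymbol{\Sigma}}\,\nabla g(\hat{\boldsymbol{\theta}})}\,.
\]
```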
{"title":"Asymptotic Standard Errors of Equating Coefficients Using the Characteristic Curve Methods for the Graded Response Model","authors":"Zhonghua Zhang","doi":"10.1080/08957347.2020.1789142","DOIUrl":"https://doi.org/10.1080/08957347.2020.1789142","url":null,"abstract":"ABSTRACT The characteristic curve methods have been applied to estimate the equating coefficients in test equating under the graded response model (GRM). However, the approaches for obtaining the standard errors for the estimates of these coefficients have not been developed and examined. In this study, the delta method was applied to derive the mathematical formulas for computing the asymptotic standard errors for the parameter scale transformation coefficients and the true score equating coefficients that are estimated using the characteristic curve methods in test equating under the GRM in the context of the common-item nonequivalent groups equating design. Simulated and real data were further used to examine the accuracy of the derivations and compare the performance of the newly developed delta method with that of the multiple imputation method. The results indicated that the standard errors produced by the delta method were extremely close to the criterion empirical standard errors as well as those yielded by the multiple imputation method. The development of the standard error expressions by the delta method in the study has important practical implications.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"309 - 330"},"PeriodicalIF":1.5,"publicationDate":"2020-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1789142","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49604565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Can Culture Be a Salient Predictor of Test-Taking Engagement? An Analysis of Differential Noneffortful Responding on an International College-Level Assessment of Critical Thinking
Pub Date: 2020-07-29 | DOI: 10.1080/08957347.2020.1789141 | Applied Measurement in Education, 33, 263–279
Joseph A. Rios, Hongwen Guo
ABSTRACT The objective of this study was to evaluate whether differential noneffortful responding (identified via response latencies) was present in four countries that were administered a low-stakes college-level critical thinking assessment. Results indicated significant differences (as large as .90 SD) between nearly all country pairings in the average number of noneffortful responses per test taker. Furthermore, noneffortful responding was found to be associated with a number of individual-level predictors, such as demographics (both gender and academic year), prior ability, and perceived difficulty of the test, though these predictors were found to differ across countries. Ignoring the presence of noneffortful responses was associated with (a) model fit deterioration as well as inflation of reliability, and (b) the inclusion of non-invariant items in the score-linking anchor set. However, no meaningful differences in relative performance were noted once noneffortful responses were accounted for. Implications for test development and improving the validity of score-based inferences from international assessments are discussed.
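The abstract identifies noneffortful responses via response latencies. A common operationalization is to flag a response whose time falls below an item-specific threshold; the sketch below assumes one such rule (10% of each item's median response time, floored at 2 seconds) and simulated data, and is not necessarily the rule used in the article.

```python
# Illustrative latency-based flagging of noneffortful (rapid-guessing) responses.
# Threshold rule and data are assumptions for this sketch, not the article's procedure.
import numpy as np

def flag_noneffortful(rt_matrix, fraction=0.10, floor_seconds=2.0):
    """rt_matrix: (n_examinees, n_items) response times in seconds.
    Returns a boolean matrix; True marks a flagged (noneffortful) response."""
    item_thresholds = np.maximum(fraction * np.median(rt_matrix, axis=0), floor_seconds)
    return rt_matrix < item_thresholds            # thresholds broadcast across examinees

rng = np.random.default_rng(0)
rt = rng.lognormal(mean=3.0, sigma=0.6, size=(500, 30))   # plausible effortful times (seconds)
mask = rng.random(rt.shape) < 0.05                        # inject ~5% rapid guesses
rt[mask] = rng.uniform(0.5, 2.0, size=mask.sum())

flags = flag_noneffortful(rt)
print("Mean noneffortful responses per test taker:", round(flags.sum(axis=1).mean(), 2))
```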
{"title":"Can Culture Be a Salient Predictor of Test-Taking Engagement? An Analysis of Differential Noneffortful Responding on an International College-Level Assessment of Critical Thinking","authors":"Joseph A. Rios, Hongwen Guo","doi":"10.1080/08957347.2020.1789141","DOIUrl":"https://doi.org/10.1080/08957347.2020.1789141","url":null,"abstract":"ABSTRACT The objective of this study was to evaluate whether differential noneffortful responding (identified via response latencies) was present in four countries administered a low-stakes college-level critical thinking assessment. Results indicated significant differences (as large as .90 SD) between nearly all country pairings in the average number of noneffortful responses per test taker. Furthermore, noneffortful responding was found to be associated with a number of individual-level predictors, such as demographics (both gender and academic year), prior ability, and perceived difficulty of the test, though, these predictors were found to differ across countries. Ignoring the presence of noneffortful responses was associated with: (a) model fit deterioration as well as inflation of reliability, and (b) the inclusion of non-invariant items in the score linking anchor set. However, no meaningful differences in relative performance were noted once accounting for noneffortful responses. Implications for test development and improving the validity of score-based inferences from international assessments are discussed.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"263 - 279"},"PeriodicalIF":1.5,"publicationDate":"2020-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1789141","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"59806029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Reliable Identification and Effectiveness of Computer-Based, Pop-Up Glossaries in Large-Scale Assessments
Pub Date: 2020-07-27 | DOI: 10.1080/08957347.2020.1789137 | Applied Measurement in Education, 33, 378–389
D. Cohen, Alesha D. Ballman, F. Rijmen, Jon Cohen
ABSTRACT Computer-based, pop-up glossaries are perhaps the most promising accommodation aimed at mitigating the influence of linguistic structure and cultural bias on the performance of English Learner (EL) students on statewide assessments. To date, there has been no sufficiently reliable, established procedure for identifying the words that require a glossary for EL students. In the coding procedure, we developed a method to reliably identify words and phrases that require a glossary. The method developed in the coding procedure was then used to provide glossaries for the field-test items of statewide English language arts (ELA) and mathematics assessments across grades 3–11 (Current Study). In the Current Study, we assess the effectiveness, and the influence on construct validity, of a pop-up glossary of the words identified in the coding procedure in a large-scale randomized controlled trial. The results demonstrated that the pop-up glossary accommodation was generally effective for both the ELA and mathematics assessments and did not harm the construct being measured.
{"title":"On the Reliable Identification and Effectiveness of Computer-Based, Pop-Up Glossaries in Large-Scale Assessments","authors":"D. Cohen, Alesha D. Ballman, F. Rijmen, Jon Cohen","doi":"10.1080/08957347.2020.1789137","DOIUrl":"https://doi.org/10.1080/08957347.2020.1789137","url":null,"abstract":"ABSTRACT Computer-based, pop-up glossaries are perhaps the most promising accommodation aimed at mitigating the influence of linguistic structure and cultural bias on the performance of English Learner (EL) students on statewide assessments. To date, there is no established procedure for identifying the words that require a glossary for EL students that is sufficiently reliable. In the coding procedure, we developed a method to reliably identify words and phrases that require a glossary. The method developed in the coding procedure was used to provide glossaries for the field-test items of statewide English language arts (ELA) and mathematics assessments across grades 3–11 (Current Study). In the Current Study, we assess the effectiveness and influence on construct validity of a pop-up glossary of the words identified in the coding procedure in a large scale, randomized controlled trial. The results demonstrated that generally the pop-up glossary accommodation was effective for both the ELA and mathematics assessments and did not harm the construct being measured.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"378 - 389"},"PeriodicalIF":1.5,"publicationDate":"2020-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1789137","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"59806226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Applying a Multiple Comparison Control to IRT Item-fit Testing
Pub Date: 2020-07-23 | DOI: 10.1080/08957347.2020.1789138 | Applied Measurement in Education, 33, 362–377
D. Sauder, Christine E. DeMars
ABSTRACT We used simulation techniques to assess the item-level and familywise Type I error control and power of an IRT item-fit statistic, the S-X2 . Previous research indicated that the S-X2 has good Type I error control and decent power, but no previous research examined familywise Type I error control. We varied percentage of misfitting items, sample size, and test length, and computed familywise Type I error with no correction, a Bonferroni correction, and a Benjamini-Hochberg correction. The S-X2 controlled item-level and familywise Type I errors when corrections were applied to conditions with no misfitting items. In the presence of misfitting items, the S-X2 exhibited inflated item-level and familywise false hit rates in many conditions, even with familywise Type I error corrections. Lastly, power was low and negatively impacted when either of the familywise Type I error corrections was applied. We suggest using the S-X2 with no familywise Type I error control in conjunction with other methods of assessing item fit (e.g., visual analysis).
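For readers unfamiliar with the two corrections named above, the sketch below shows generic Bonferroni and Benjamini-Hochberg adjustments applied to a vector of item-fit p-values; the p-values and cutoffs are invented for illustration, and this is not the simulation code used in the article.

```python
# Generic Bonferroni and Benjamini-Hochberg adjustments applied to item-fit p-values.
# Illustrative sketch only; the article's simulation design is summarized in the abstract.
import numpy as np

def bonferroni_flags(p_values, alpha=0.05):
    p = np.asarray(p_values, dtype=float)
    return p < alpha / p.size                      # True = item flagged as misfitting

def benjamini_hochberg_flags(p_values, alpha=0.05):
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ranks p-values from smallest to largest
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    flags = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.flatnonzero(passed).max()           # largest rank i with p_(i) <= alpha * i / m
        flags[order[: k + 1]] = True               # reject all hypotheses up to that rank
    return flags

p_vals = [0.001, 0.012, 0.030, 0.045, 0.20, 0.64, 0.81]
print("Bonferroni flags:        ", bonferroni_flags(p_vals))
print("Benjamini-Hochberg flags:", benjamini_hochberg_flags(p_vals))
```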
{"title":"Applying a Multiple Comparison Control to IRT Item-fit Testing","authors":"D. Sauder, Christine E. DeMars","doi":"10.1080/08957347.2020.1789138","DOIUrl":"https://doi.org/10.1080/08957347.2020.1789138","url":null,"abstract":"ABSTRACT We used simulation techniques to assess the item-level and familywise Type I error control and power of an IRT item-fit statistic, the S-X2 . Previous research indicated that the S-X2 has good Type I error control and decent power, but no previous research examined familywise Type I error control. We varied percentage of misfitting items, sample size, and test length, and computed familywise Type I error with no correction, a Bonferroni correction, and a Benjamini-Hochberg correction. The S-X2 controlled item-level and familywise Type I errors when corrections were applied to conditions with no misfitting items. In the presence of misfitting items, the S-X2 exhibited inflated item-level and familywise false hit rates in many conditions, even with familywise Type I error corrections. Lastly, power was low and negatively impacted when either of the familywise Type I error corrections was applied. We suggest using the S-X2 with no familywise Type I error control in conjunction with other methods of assessing item fit (e.g., visual analysis).","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"362 - 377"},"PeriodicalIF":1.5,"publicationDate":"2020-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1789138","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47875393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}