Item Response Modeling of Clinical Instruments With Filter Questions: Disentangling Symptom Presence and Severity
Pub Date: 2024-09-01 | Epub Date: 2024-06-17 | DOI: 10.1177/01466216241261709
Brooke E Magnus
Clinical instruments that use a filter/follow-up response format often produce data with excess zeros, especially when administered to nonclinical samples. When the unidimensional graded response model (GRM) is then fit to these data, parameter estimates and scale scores tend to suggest that the instrument measures individual differences only among individuals with severe levels of the psychopathology. In such scenarios, alternative item response models that explicitly account for excess zeros may be more appropriate. The multivariate hurdle graded response model (MH-GRM), which has been previously proposed for handling zero-inflated questionnaire data, includes two latent variables: susceptibility, which underlies responses to the filter question, and severity, which underlies responses to the follow-up question. Using both simulated and empirical data, the current research shows that compared to unidimensional GRMs, the MH-GRM is better able to capture individual differences across a wider range of psychopathology, and that when unidimensional GRMs are fit to data from questionnaires that include filter questions, individual differences at the lower end of the severity continuum largely go unmeasured. Practical implications are discussed.
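For readers unfamiliar with the hurdle structure, a minimal sketch of the decomposition for one filter/follow-up item is given below. The notation (susceptibility trait theta_sus, severity trait theta_sev, filter endorsement probability pi_j) is illustrative and is not the paper's exact MH-GRM parameterization.

```latex
% Illustrative hurdle decomposition for filter/follow-up item j
% (a sketch under assumed notation, not the paper's exact specification)
P(Y_j = 0 \mid \theta_{\mathrm{sus}}, \theta_{\mathrm{sev}}) = 1 - \pi_j(\theta_{\mathrm{sus}}),
\qquad
P(Y_j = k \mid \theta_{\mathrm{sus}}, \theta_{\mathrm{sev}})
  = \pi_j(\theta_{\mathrm{sus}})
    \left[ P^{*}_{j,k}(\theta_{\mathrm{sev}}) - P^{*}_{j,k+1}(\theta_{\mathrm{sev}}) \right],
\quad k = 1, \dots, K_j
```

Here pi_j(theta_sus) is a 2PL-type probability of endorsing the filter question (symptom present), and P*_{j,k}(theta_sev) are cumulative graded-response functions for the follow-up severity ratings, with P*_{j,1} = 1 and P*_{j,K_j+1} = 0. A zero score can thus arise only through low susceptibility, while variation among nonzero scores is governed by severity.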
{"title":"Item Response Modeling of Clinical Instruments With Filter Questions: Disentangling Symptom Presence and Severity.","authors":"Brooke E Magnus","doi":"10.1177/01466216241261709","DOIUrl":"10.1177/01466216241261709","url":null,"abstract":"<p><p>Clinical instruments that use a filter/follow-up response format often produce data with excess zeros, especially when administered to nonclinical samples. When the unidimensional graded response model (GRM) is then fit to these data, parameter estimates and scale scores tend to suggest that the instrument measures individual differences only among individuals with severe levels of the psychopathology. In such scenarios, alternative item response models that explicitly account for excess zeros may be more appropriate. The multivariate hurdle graded response model (MH-GRM), which has been previously proposed for handling zero-inflated questionnaire data, includes two latent variables: susceptibility, which underlies responses to the filter question, and severity, which underlies responses to the follow-up question. Using both simulated and empirical data, the current research shows that compared to unidimensional GRMs, the MH-GRM is better able to capture individual differences across a wider range of psychopathology, and that when unidimensional GRMs are fit to data from questionnaires that include filter questions, individual differences at the lower end of the severity continuum largely go unmeasured. Practical implications are discussed.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11331747/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142009739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Note on Standard Errors for Multidimensional Two-Parameter Logistic Models Using Gaussian Variational Estimation
Pub Date: 2024-07-24 | DOI: 10.1177/01466216241265757
Jiaying Xiao, Chun Wang, Gongjun Xu
Accurate item parameters and standard errors (SEs) are crucial for many multidimensional item response theory (MIRT) applications. A recent study proposed the Gaussian Variational Expectation Maximization (GVEM) algorithm to improve computational efficiency and estimation accuracy (Cho et al., 2021). However, the SE estimation procedure has yet to be fully addressed. To tackle this issue, the present study proposed an updated supplemented expectation maximization (USEM) method and a bootstrap method for SE estimation. These two methods were compared in terms of SE recovery accuracy. The simulation results demonstrated that the GVEM algorithm with bootstrap and item priors (GVEM-BSP) outperformed the other methods, exhibiting less bias and relative bias for SE estimates under most conditions. Although the GVEM with USEM (GVEM-USEM) was the most computationally efficient method, it yielded an upward bias for SE estimates.
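As a rough illustration of the bootstrap idea for SE estimation (not the GVEM-BSP algorithm itself), the sketch below resamples examinees with replacement, refits the model with a hypothetical `fit_mirt_2pl` routine, and takes the standard deviation of the replicate estimates as the SE.

```python
import numpy as np

def bootstrap_item_se(responses, fit_mirt_2pl, n_boot=200, seed=0):
    """Bootstrap SEs for item parameter estimates.

    responses    : (n_persons, n_items) array of 0/1 responses
    fit_mirt_2pl : hypothetical estimation routine returning an
                   (n_items, n_params) array of item parameter estimates
    """
    rng = np.random.default_rng(seed)
    n_persons = responses.shape[0]
    replicates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_persons, size=n_persons)   # resample examinees
        replicates.append(fit_mirt_2pl(responses[idx]))
    replicates = np.stack(replicates)                      # (n_boot, n_items, n_params)
    return replicates.std(axis=0, ddof=1)                  # SE for each item parameter
```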
{"title":"A Note on Standard Errors for Multidimensional Two-Parameter Logistic Models Using Gaussian Variational Estimation","authors":"Jiaying Xiao, Chun Wang, Gongjun Xu","doi":"10.1177/01466216241265757","DOIUrl":"https://doi.org/10.1177/01466216241265757","url":null,"abstract":"Accurate item parameters and standard errors (SEs) are crucial for many multidimensional item response theory (MIRT) applications. A recent study proposed the Gaussian Variational Expectation Maximization (GVEM) algorithm to improve computational efficiency and estimation accuracy ( Cho et al., 2021 ). However, the SE estimation procedure has yet to be fully addressed. To tackle this issue, the present study proposed an updated supplemented expectation maximization (USEM) method and a bootstrap method for SE estimation. These two methods were compared in terms of SE recovery accuracy. The simulation results demonstrated that the GVEM algorithm with bootstrap and item priors (GVEM-BSP) outperformed the other methods, exhibiting less bias and relative bias for SE estimates under most conditions. Although the GVEM with USEM (GVEM-USEM) was the most computationally efficient method, it yielded an upward bias for SE estimates.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141809630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Measurement Invariance Testing Works
Pub Date: 2024-06-14 | DOI: 10.1177/01466216241261708
J. Lasker
Psychometricians have argued that measurement invariance (MI) testing is needed to know whether the same psychological constructs are measured in different groups. Data from five experiments allowed that position to be tested. In the first, participants answered questionnaires on belief in free will and either the meaning of life or the meaning of a nonsense concept called “gavagai.” Since the meaning of life and the meaning of gavagai differ conceptually, MI should have been violated when the groups were treated as though their measurements were identical. MI was severely violated, indicating that the questionnaires were interpreted differently. In the second and third experiments, participants were randomized to watch treatment videos explaining figural matrices rules or task-irrelevant control videos, and then took intelligence and figural matrices tests. The intervention worked: knowing the matrix rules gave the experimental group an additional influence on figural matrix performance, so their matrices scores violated MI and were anomalously high for their intelligence levels. In both experiments, MI was severely violated. In the fourth and fifth experiments, individuals were exposed to growth mindset interventions that a twin study showed changed the amount of genetic variance in the target mindset measure without affecting other variables. When comparing treatment and control groups, MI was attainable before but not after treatment. Moreover, the control group showed longitudinal invariance, whereas the treatment group did not. MI testing is likely able to show whether the same things are measured in different groups.
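As a reminder of the mechanics behind these comparisons, MI testing typically proceeds by fitting a sequence of nested multigroup models (configural, metric, scalar) and testing whether the added equality constraints worsen fit. The sketch below shows only the chi-square difference step, with made-up fit statistics; the model fitting itself is assumed to come from whatever SEM software is in use.

```python
from scipy.stats import chi2

def chisq_difference_test(chisq_restricted, df_restricted, chisq_free, df_free):
    """Likelihood-ratio (chi-square difference) test between nested invariance
    models, e.g., scalar (restricted) versus configural (free)."""
    d_chisq = chisq_restricted - chisq_free
    d_df = df_restricted - df_free
    return d_chisq, d_df, chi2.sf(d_chisq, d_df)

# Hypothetical fit statistics, for illustration only:
print(chisq_difference_test(chisq_restricted=312.4, df_restricted=120,
                            chisq_free=250.1, df_free=108))
```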
{"title":"Measurement Invariance Testing Works","authors":"J. Lasker","doi":"10.1177/01466216241261708","DOIUrl":"https://doi.org/10.1177/01466216241261708","url":null,"abstract":"Psychometricians have argued that measurement invariance (MI) testing is needed to know if the same psychological constructs are measured in different groups. Data from five experiments allowed that position to be tested. In the first, participants answered questionnaires on belief in free will and either the meaning of life or the meaning of a nonsense concept called “gavagai.” Since the meaning of life and the meaning of gavagai conceptually differ, MI should have been violated when groups were treated like their measurements were identical. MI was severely violated, indicating the questionnaires were interpreted differently. In the second and third experiments, participants were randomized to watch treatment videos explaining figural matrices rules or task-irrelevant control videos. Participants then took intelligence and figural matrices tests. The intervention worked and the experimental group had an additional influence on figural matrix performance in the form of knowing matrix rules, so their performance on the matrices tests violated MI and was anomalously high for their intelligence levels. In both experiments, MI was severely violated. In the fourth and fifth experiments, individuals were exposed to growth mindset interventions that a twin study revealed changed the amount of genetic variance in the target mindset measure without affecting other variables. When comparing treatment and control groups, MI was attainable before but not after treatment. Moreover, the control group showed longitudinal invariance, but the same was untrue for the treatment group. MI testing is likely able to show if the same things are measured in different groups.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141343348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accommodating and Extending Various Models for Special Effects Within the Generalized Partially Confirmatory Factor Analysis Framework
Pub Date: 2024-06-12 | DOI: 10.1177/01466216241261704
Yifan Zhang, Jinsong Chen
Special measurement effects, including method and testlet effects, are common issues in educational and psychological measurement. They are typically handled by various bifactor models or models for the multiple-traits multiple-methods (MTMM) structure for continuous data, and by various testlet effect models for categorical data. However, existing models have some limitations in accommodating different types of effects. With slight modification, the generalized partially confirmatory factor analysis (GPCFA) framework can flexibly accommodate special effects for continuous and categorical cases with added benefits. Various bifactor, MTMM, and testlet effect models can be linked to different variants of the revised GPCFA model. Compared to existing approaches, GPCFA offers multidimensionality for both the general and effect factors (or traits) and can address local dependence, mixed-type formats, and missingness jointly. Moreover, the partially confirmatory approach allows for regularization of the loading patterns, resulting in a simpler structure in both the general and special parts. We also provide a subroutine to compute the equivalent effect size. Simulation studies and real-data examples are used to demonstrate the performance and usefulness of the proposed approach under different situations.
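For orientation, the generic measurement equation underlying bifactor- and testlet-style models is sketched below in illustrative notation; the GPCFA specification in the article generalizes this to categorical indicators, regularized (partially confirmatory) loadings, and multidimensional general factors.

```latex
% Generic bifactor / testlet-style decomposition of indicator i (illustrative notation)
y_i = \lambda_i^{(g)} \, \eta^{(g)} + \lambda_i^{(s)} \, \eta^{(s(i))} + \varepsilon_i,
\qquad
\eta^{(g)} \;\perp\; \eta^{(s(1))}, \dots, \eta^{(s(M))}
```

Here eta^(g) is the general trait, eta^(s(i)) is the method or testlet factor to which indicator i belongs, and the orthogonality between the general and specific factors identifies the decomposition.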
{"title":"Accommodating and Extending Various Models for Special Effects Within the Generalized Partially Confirmatory Factor Analysis Framework","authors":"Yifan Zhang, Jinsong Chen","doi":"10.1177/01466216241261704","DOIUrl":"https://doi.org/10.1177/01466216241261704","url":null,"abstract":"Special measurement effects including the method and testlet effects are common issues in educational and psychological measurement. They are typically covered by various bifactor models or models for the multiple traits multiple methods (MTMM) structure for continuous data and by various testlet effect models for categorical data. However, existing models have some limitations in accommodating different type of effects. With slight modification, the generalized partially confirmatory factor analysis (GPCFA) framework can flexibly accommodate special effects for continuous and categorical cases with added benefits. Various bifactor, MTMM and testlet effect models can be linked to different variants of the revised GPCFA model. Compared to existing approaches, GPCFA offers multidimensionality for both the general and effect factors (or traits) and can address local dependence, mixed-type formats, and missingness jointly. Moreover, the partially confirmatory approach allows for regularization of the loading patterns, resulting in a simpler structure in both the general and special parts. We also provide a subroutine to compute the equivalent effect size. Simulation studies and real-data examples are used to demonstrate the performance and usefulness of the proposed approach under different situations.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141353380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Investigating Directional Invariance in an Item Response Tree Model for Extreme Response Style and Trait-Based Unfolding Responses
Pub Date: 2024-06-11 | DOI: 10.1177/01466216241261705
Siqi He, Justin L. Kern
Item response tree (IRTree) approaches have received increasing attention in the response style literature due to their capability to partial out response style latent traits from content-related latent traits by considering separate decisions for agreement and level of agreement. Additionally, it has been shown that the functioning of the intensity-of-agreement decision may depend on the agreement decision for an item, so that the item parameters and person parameters may differ by direction of agreement; when the parameters are the same across directions, this is called directional invariance. Furthermore, for non-cognitive psychological constructs, it has been argued that the response process may be best described as an unfolding process. In this study, a family of IRTree models for unfolding responses is investigated, in which the agreement decision follows the hyperbolic cosine model and the intensity-of-agreement decision follows a graded response model. This model family also allows for investigation of item- and person-level directional invariance. A simulation study is conducted to evaluate parameter recovery; model parameters are estimated with a fully Bayesian approach using JAGS (Just Another Gibbs Sampler). The proposed modeling scheme is demonstrated with two data examples, with multiple model comparisons allowing for varying levels of directional invariance and unfolding versus dominance processes. An approach to visualizing the item response functioning of the final model is also developed. The article closes with a short discussion of the results.
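To make the tree structure concrete, the sketch below gives the generic two-node decomposition implied by separating the agreement decision from the intensity decision. The notation is illustrative; in the proposed family the agreement node follows the hyperbolic cosine (unfolding) model and the intensity node follows a graded response model.

```latex
% Illustrative two-node IRTree decomposition of an observed response Y_i
P\bigl(Y_i = (a, k) \mid \boldsymbol{\theta}\bigr)
  = P(A_i = a \mid \theta_A)\; P(K_i = k \mid A_i = a, \theta_K),
\qquad a \in \{0, 1\},\; k = 1, \dots, K
```

Here A_i is the agree/disagree node, K_i is the intensity node, and directional invariance holds when the intensity-node item and person parameters do not depend on the value of A_i.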
{"title":"Investigating Directional Invariance in an Item Response Tree Model for Extreme Response Style and Trait-Based Unfolding Responses","authors":"Siqi He, Justin L. Kern","doi":"10.1177/01466216241261705","DOIUrl":"https://doi.org/10.1177/01466216241261705","url":null,"abstract":"Item response tree (IRTree) approaches have received increasing attention in the response style literature due to their capability to partial out response style latent traits from content-related latent traits by considering separate decisions for agreement and level of agreement. Additionally, it has shown that the functioning of the intensity of agreement decision may depend upon the agreement decision with an item, so that the item parameters and person parameters may differ by direction of agreement; when the parameters across direction are the same, this is called directional invariance. Furthermore, for non-cognitive psychological constructs, it has been argued that the response process may be best described as following an unfolding process. In this study, a family of IRTree models to handle unfolding responses with the agreement decision following the hyperbolic cosine model and the intensity of agreement decision following a graded response model is investigated. This model family also allows for investigation of item- and person-level directional invariance. A simulation study is conducted to evaluate parameter recovery; model parameters are estimated with a fully Bayesian approach using JAGS (Just Another Gibbs Sampler). The proposed modeling scheme is demonstrated with two data examples with multiple model comparisons allowing for varying levels of directional invariance and unfolding versus dominance processes. An approach to visualizing the final model item response functioning is also developed. The article closes with a short discussion about the results.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141356467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
aberrance: An R Package for Detecting Aberrant Behavior in Test Data
Pub Date: 2024-06-05 | DOI: 10.1177/01466216241261707
Kylie Gorney, Jiayi Deng
{"title":"aberrance: An R Package for Detecting Aberrant Behavior in Test Data","authors":"Kylie Gorney, Jiayi Deng","doi":"10.1177/01466216241261707","DOIUrl":"https://doi.org/10.1177/01466216241261707","url":null,"abstract":"","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141385802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Are Large-Scale Test Scores Comparable for At-Home Versus Test Center Testing?
Pub Date: 2024-05-11 | DOI: 10.1177/01466216241253795
Katherine E. Castellano, Matthew S. Johnson, Rene Lawless
The COVID-19 pandemic led to a proliferation of remote-proctored (or “at-home”) assessments. The lack of a standardized setting, device, or in-person proctor during at-home testing makes it markedly distinct from testing at a test center. Comparability studies of at-home and test center scores are important for understanding whether these distinctions impact test scores. This study found no significant differences between at-home and test center scores on a large-scale admissions test, whether using a randomized controlled trial or an observational study that adjusted for differences in sample composition along baseline characteristics.
{"title":"Are Large-Scale Test Scores Comparable for At-Home Versus Test Center Testing?","authors":"Katherine E. Castellano, Matthew S. Johnson, Rene Lawless","doi":"10.1177/01466216241253795","DOIUrl":"https://doi.org/10.1177/01466216241253795","url":null,"abstract":"The COVID-19 pandemic led to a proliferation of remote-proctored (or “at-home”) assessments. The lack of standardized setting, device, or in-person proctor during at-home testing makes it markedly distinct from testing at a test center. Comparability studies of at-home and test center scores are important in understanding whether these distinctions impact test scores. This study found no significant differences in at-home versus test center test scores on a large-scale admissions test using either a randomized controlled trial or an observational study after adjusting for differences in sample composition along baseline characteristics.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140989974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Test Security and the Pandemic: Comparison of Test Center and Online Proctor Delivery Modalities
Pub Date: 2024-04-23 | DOI: 10.1177/01466216241248826
Kirk A. Becker, Jinghua Liu, Paul E. Jones
Published information on the security of testing programs is limited, and even less is available on the relative security of different testing modalities: in-person testing at test centers (TC) versus remote online proctored (OP) testing. This article begins by examining indicators of test security violations across a wide range of programs in professional, admissions, and IT fields. We look at high levels of response overlap as a potential indicator of collusion to cheat on the exam and compare rates by modality and between test center types. Next, we scrutinize indicators of potential test security violations for a single large testing program over the course of 14 months, during which the program went from exclusively in-person TC testing to a mix of OP and TC testing. Test security indicators include high response overlap, large numbers of fast correct responses, large numbers of slow correct responses, large test-retest score gains, unusually fast response times for passing candidates, and measures of differential person functioning. These indicators are examined and compared before and after the introduction of OP testing. In addition, test-retest modality is examined for candidates who fail and retest subsequent to the introduction of OP testing, with special attention paid to test takers who change modality between the initial attempt and the retest. These data allow us to understand whether indications of content exposure increase with the introduction of OP testing, and whether testing modalities affect potential score increases in a similar way.
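As an illustration of the simplest of these indices, the sketch below computes pairwise response overlap (the proportion of items on which two candidates selected the same option) and flags pairs above a threshold. The flagging cutoff is an assumption for illustration, not the article's operational rule, and real screening would typically condition on form and ability.

```python
import numpy as np
from itertools import combinations

def flag_high_overlap(responses, threshold=0.95):
    """Flag candidate pairs with unusually similar answer patterns.

    responses : (n_persons, n_items) array of selected option codes
    threshold : assumed flagging cutoff (illustrative only)
    """
    flagged = []
    for i, j in combinations(range(responses.shape[0]), 2):
        overlap = float(np.mean(responses[i] == responses[j]))
        if overlap >= threshold:
            flagged.append((i, j, overlap))
    return flagged
```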
{"title":"Test Security and the Pandemic: Comparison of Test Center and Online Proctor Delivery Modalities","authors":"Kirk A. Becker, Jinghua Liu, Paul E. Jones","doi":"10.1177/01466216241248826","DOIUrl":"https://doi.org/10.1177/01466216241248826","url":null,"abstract":"Published information is limited regarding the security of testing programs, and even less on the relative security of different testing modalities: in-person at test centers (TC) versus remote online proctored (OP) testing. This article begins by examining indicators of test security violations across a wide range of programs in professional, admissions, and IT fields. We look at high levels of response overlap as a potential indicator of collusion to cheat on the exam and compare rates by modality and between test center types. Next, we scrutinize indicators of potential test security violations for a single large testing program over the course of 14 months, during which the program went from exclusively in-person TC testing to a mix of OP and TC testing. Test security indicators include high response overlap, large numbers of fast correct responses, large numbers of slow correct responses, large test-retest score gains, unusually fast response times for passing candidates, and measures of differential person functioning. These indicators are examined and compared prior to and after the introduction of OP testing. In addition, test-retest modality is examined for candidates who fail and retest subsequent to the introduction of OP testing, with special attention paid to test takers who change modality between the initial attempt and the retest. These data allow us to understand whether indications of content exposure increase with the introduction of OP testing, and whether testing modalities affect potential score increase in a similar way.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2024-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140667252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How Scoring Approaches Impact Estimates of Growth in the Presence of Survey Item Ceiling Effects
Pub Date: 2024-03-16 | DOI: 10.1177/01466216241238749
Kelly D. Edwards, J. Soland
Survey scores are often the basis for understanding how individuals grow psychologically and socio-emotionally. A known problem with many surveys is that the items are all “easy”—that is, individuals tend to use only the top one or two response categories on the Likert scale. Such an issue could be especially problematic, and lead to ceiling effects, when the same survey is administered repeatedly over time. In this study, we conduct simulation and empirical studies to (a) quantify the impact of these ceiling effects on growth estimates when using typical scoring approaches like sum scores and unidimensional item response theory (IRT) models and (b) examine whether approaches to survey design and scoring, including employing various longitudinal multidimensional IRT (MIRT) models, can mitigate any bias in growth estimates. We show that bias is substantial when using typical scoring approaches and that, while lengthening the survey helps somewhat, using a longitudinal MIRT model with plausible values scoring all but eliminates the issue. Results have implications for scoring surveys in growth studies going forward, as well as for understanding how Likert item ceiling effects may be contributing to replication failures.
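A small simulation conveys the basic mechanism: when items are easy enough that most respondents start near the top categories, sum scores cannot register true growth at the second wave. The item parameters, growth effect, and sample size below are arbitrary illustrations, not the study's simulation design.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_items, n_cats = 2000, 10, 4             # respondents, Likert items, categories 0..3

def simulate_grm(theta, a, b):
    """Simulate graded response model (GRM) data.
    theta: (n,); a: (n_items,) discriminations; b: (n_items, n_cats-1) ordered thresholds."""
    p_cum = 1.0 / (1.0 + np.exp(-a[None, :, None] * (theta[:, None, None] - b[None, :, :])))
    u = rng.uniform(size=(len(theta), len(a), 1))
    return (u < p_cum).sum(axis=2)            # category = number of cumulative curves exceeded

a = np.full(n_items, 1.5)
# "Easy" items: thresholds well below the average trait level, so most people sit near the top
b = np.sort(rng.normal(loc=-1.5, scale=0.5, size=(n_items, n_cats - 1)), axis=1)

theta_t1 = rng.normal(0.0, 1.0, n)
theta_t2 = theta_t1 + 0.5                     # true growth of 0.5 SD for everyone

sum_t1 = simulate_grm(theta_t1, a, b).sum(axis=1)
sum_t2 = simulate_grm(theta_t2, a, b).sum(axis=1)

print("max possible sum score:", n_items * (n_cats - 1))
print("mean sum score at t1 and t2:", sum_t1.mean(), sum_t2.mean())
# Both means sit near the scale maximum, compressing the observed gain relative to what
# the same 0.5 SD shift would produce with harder items.
```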
{"title":"How Scoring Approaches Impact Estimates of Growth in the Presence of Survey Item Ceiling Effects","authors":"Kelly D. Edwards, J. Soland","doi":"10.1177/01466216241238749","DOIUrl":"https://doi.org/10.1177/01466216241238749","url":null,"abstract":"Survey scores are often the basis for understanding how individuals grow psychologically and socio-emotionally. A known problem with many surveys is that the items are all “easy”—that is, individuals tend to use only the top one or two response categories on the Likert scale. Such an issue could be especially problematic, and lead to ceiling effects, when the same survey is administered repeatedly over time. In this study, we conduct simulation and empirical studies to (a) quantify the impact of these ceiling effects on growth estimates when using typical scoring approaches like sum scores and unidimensional item response theory (IRT) models and (b) examine whether approaches to survey design and scoring, including employing various longitudinal multidimensional IRT (MIRT) models, can mitigate any bias in growth estimates. We show that bias is substantial when using typical scoring approaches and that, while lengthening the survey helps somewhat, using a longitudinal MIRT model with plausible values scoring all but alleviates the issue. Results have implications for scoring surveys in growth studies going forward, as well as understanding how Likert item ceiling effects may be contributing to replication failures.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2024-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140236784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating the Douglas-Cohen IRT Goodness of Fit Measure With BIB Sampling of Items
Pub Date: 2024-03-14 | DOI: 10.1177/01466216241238740
John R. Donoghue, Adrienne N. Sgammato
Methods to detect item response theory (IRT) item-level misfit are typically derived assuming fixed test forms. However, IRT is also employed with more complicated test designs, such as the balanced incomplete block (BIB) design used in large-scale educational assessments. This study investigates two modifications of Douglas and Cohen’s 2001 nonparametric method of assessing item misfit, based on A) using block total scores and B) pooling booklet-level scores for analyzing BIB data. Block-level scores showed extreme inflation of Type I error for short blocks containing 5 or 10 items. The pooled booklet method yielded Type I error rates close to the nominal level in most conditions and had power to detect misfitting items. The study also found that the Douglas and Cohen procedure is only slightly affected by the presence of other misfitting items in the block. The pooled booklet method is recommended for practical applications of Douglas and Cohen’s method with BIB data.
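The heart of the Douglas and Cohen approach is a comparison between a kernel-smoothed (nonparametric) item characteristic curve, estimated by regressing item correctness on an observed score, and the curve implied by the parametric IRT model. The sketch below illustrates that comparison for one dichotomous item; the Gaussian kernel, bandwidth, and RMSD-style weighting are assumptions chosen for illustration, not the exact statistic studied in the article.

```python
import numpy as np

def smoothed_icc(scores, item_correct, eval_points, bandwidth=2.0):
    """Nadaraya-Watson kernel regression of item correctness on an observed score."""
    w = np.exp(-0.5 * ((eval_points[:, None] - scores[None, :]) / bandwidth) ** 2)
    return (w * item_correct[None, :]).sum(axis=1) / w.sum(axis=1)

def fit_discrepancy(scores, item_correct, model_icc, bandwidth=2.0):
    """RMSD-style discrepancy between the nonparametric and model-based curves,
    weighted by the observed score distribution (an illustrative summary)."""
    pts, counts = np.unique(scores, return_counts=True)
    pts = pts.astype(float)
    p_nonparametric = smoothed_icc(scores, item_correct, pts, bandwidth)
    p_model = model_icc(pts)          # user-supplied function: score -> model probability
    weights = counts / counts.sum()
    return float(np.sqrt(np.sum(weights * (p_nonparametric - p_model) ** 2)))
```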
{"title":"Evaluating the Douglas-Cohen IRT Goodness of Fit Measure With BIB Sampling of Items","authors":"John R. Donoghue, Adrienne N. Sgammato","doi":"10.1177/01466216241238740","DOIUrl":"https://doi.org/10.1177/01466216241238740","url":null,"abstract":"Methods to detect item response theory (IRT) item-level misfit are typically derived assuming fixed test forms. However, IRT is also employed with more complicated test designs, such as the balanced incomplete block (BIB) design used in large-scale educational assessments. This study investigates two modifications of Douglas and Cohen’s 2001 nonparametric method of assessing item misfit, based on A) using block total score and B) pooling booklet level scores for analyzing BIB data. Block-level scores showed extreme inflation of Type I error for short blocks containing 5 or 10 items. The pooled booklet method yielded Type I error rates close to nominal [Formula: see text] in most conditions and had power to detect misfitting items. The study also found that the Douglas and Cohen procedure is only slightly affected by the presence of other misfitting items in the block. The pooled booklet method is recommended for practical applications of Douglas and Cohen’s method with BIB data.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":1.2,"publicationDate":"2024-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140243145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}