Examination of ChatGPT's Performance as a Data Analysis Tool
Pub Date : 2025-01-03 DOI: 10.1177/00131644241302721
Duygu Koçak
This study examines the performance of ChatGPT, developed by OpenAI and widely used as an AI-based conversational tool, as a data analysis tool through exploratory factor analysis (EFA). To this end, simulated data were generated under various data conditions, including normal distribution, response category, sample size, test length, factor loading, and measurement models. The generated data were analyzed using ChatGPT-4o twice with a 1-week interval under the same prompt, and the results were compared with those obtained using R code. In data analysis, the Kaiser-Meyer-Olkin (KMO) value, total variance explained, and the number of factors estimated using the empirical Kaiser criterion, Hull method, and Kaiser-Guttman criterion, as well as factor loadings, were calculated. The findings obtained from ChatGPT at two different times were found to be consistent with those obtained using R. Overall, ChatGPT demonstrated good performance for steps that require only computational decisions without involving researcher judgment or theoretical evaluation (such as KMO, total variance explained, and factor loadings). However, for multidimensional structures, although the estimated number of factors was consistent across analyses, biases were observed, suggesting that researchers should exercise caution in such decisions.
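For readers who want to see what such an analysis pipeline looks like in practice, here is a minimal R sketch (assuming the psych package and simulated placeholder data; the study's own prompts and R code are not reproduced here). It computes the KMO value, the Kaiser-Guttman factor count, factor loadings, and total variance explained; the empirical Kaiser criterion and Hull method are noted in a comment as they live in other packages.

```r
# Minimal EFA sketch; 'dat' is placeholder simulated data, not the study's data.
library(psych)

set.seed(1)
dat <- as.data.frame(matrix(sample(1:5, 300 * 12, replace = TRUE), ncol = 12))

R <- cor(dat)
KMO(R)$MSA                              # Kaiser-Meyer-Olkin sampling adequacy
sum(eigen(R)$values > 1)                # Kaiser-Guttman criterion: eigenvalues > 1

efa <- fa(dat, nfactors = 2, rotate = "oblimin", fm = "minres")
print(efa$loadings, cutoff = 0.30)      # factor loadings
sum(efa$Vaccounted["Proportion Var", ]) # total variance explained
# Empirical Kaiser criterion and Hull method: see, e.g., EFAtools::EKC() and
# EFAtools::HULL() (assumed to be installed; not shown here).
```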
{"title":"Examination of ChatGPT's Performance as a Data Analysis Tool.","authors":"Duygu Koçak","doi":"10.1177/00131644241302721","DOIUrl":"https://doi.org/10.1177/00131644241302721","url":null,"abstract":"<p><p>This study examines the performance of ChatGPT, developed by OpenAI and widely used as an AI-based conversational tool, as a data analysis tool through exploratory factor analysis (EFA). To this end, simulated data were generated under various data conditions, including normal distribution, response category, sample size, test length, factor loading, and measurement models. The generated data were analyzed using ChatGPT-4o twice with a 1-week interval under the same prompt, and the results were compared with those obtained using R code. In data analysis, the Kaiser-Meyer-Olkin (KMO) value, total variance explained, and the number of factors estimated using the empirical Kaiser criterion, Hull method, and Kaiser-Guttman criterion, as well as factor loadings, were calculated. The findings obtained from ChatGPT at two different times were found to be consistent with those obtained using R. Overall, ChatGPT demonstrated good performance for steps that require only computational decisions without involving researcher judgment or theoretical evaluation (such as KMO, total variance explained, and factor loadings). However, for multidimensional structures, although the estimated number of factors was consistent across analyses, biases were observed, suggesting that researchers should exercise caution in such decisions.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241302721"},"PeriodicalIF":2.1,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11696938/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142931005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Effect of Modeling Missing Data With IRTree Approach on Parameter Estimates Under Different Simulation Conditions
Pub Date : 2024-12-23 DOI: 10.1177/00131644241306024
Yeşim Beril Soğuksu, Ergül Demir
This study explores the performance of the item response tree (IRTree) approach in modeling missing data, comparing its performance to the expectation-maximization (EM) algorithm and multiple imputation (MI) methods. Both simulation and empirical data were used to evaluate these methods across different missing data mechanisms, test lengths, sample sizes, and missing data proportions. Expected a posteriori was used for ability estimation, and bias and root mean square error (RMSE) were calculated. The findings indicate that IRTree provides more accurate ability estimates with lower RMSE than both EM and MI methods. Its overall performance was particularly strong under missing completely at random and missing not at random, especially with longer tests and lower proportions of missing data. However, IRTree was most effective with moderate levels of omitted responses and medium-ability test takers, though its accuracy decreased in cases of extreme omissions and abilities. The study highlights that IRTree is particularly well suited for low-stakes tests and has strong potential for providing deeper insights into the underlying missing data mechanisms within a data set.
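As a concrete illustration of the IRTree idea for omitted responses, the sketch below (base R, with invented example data) expands each item into two pseudo-items: one coding whether a response was given and one coding accuracy given a response. This is a generic recoding for exposition, not the authors' implementation.

```r
# Expand scored responses (0/1, NA = omitted) into IRTree pseudo-items.
expand_irtree <- function(resp) {
  node_respond <- ifelse(is.na(resp), 0, 1)      # node 1: responded vs. omitted
  node_correct <- ifelse(is.na(resp), NA, resp)  # node 2: accuracy, given a response
  colnames(node_respond) <- paste0(colnames(resp), "_respond")
  colnames(node_correct) <- paste0(colnames(resp), "_correct")
  cbind(node_respond, node_correct)
}

resp <- matrix(c(1, NA, 0, 1,
                 0, 1, NA, NA),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, paste0("item", 1:4)))
expand_irtree(resp)
# The pseudo-items are then fit with a multidimensional IRT model, and abilities
# are estimated with expected a posteriori (EAP), as in the study.
```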
{"title":"The Effect of Modeling Missing Data With IRTree Approach on Parameter Estimates Under Different Simulation Conditions.","authors":"Yeşim Beril Soğuksu, Ergül Demir","doi":"10.1177/00131644241306024","DOIUrl":"10.1177/00131644241306024","url":null,"abstract":"<p><p>This study explores the performance of the item response tree (IRTree) approach in modeling missing data, comparing its performance to the expectation-maximization (EM) algorithm and multiple imputation (MI) methods. Both simulation and empirical data were used to evaluate these methods across different missing data mechanisms, test lengths, sample sizes, and missing data proportions. Expected a posteriori was used for ability estimation, and bias and root mean square error (RMSE) were calculated. The findings indicate that IRTree provides more accurate ability estimates with lower RMSE than both EM and MI methods. Its overall performance was particularly strong under missing completely at random and missing not at random, especially with longer tests and lower proportions of missing data. However, IRTree was most effective with moderate levels of omitted responses and medium-ability test takers, though its accuracy decreased in cases of extreme omissions and abilities. The study highlights that IRTree is particularly well suited for low-stakes tests and has strong potential for providing deeper insights into the underlying missing data mechanisms within a data set.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241306024"},"PeriodicalIF":2.1,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11669122/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142892972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Treating Noneffortful Responses as Missing
Pub Date : 2024-11-29 DOI: 10.1177/00131644241297925
Christine E DeMars
This study investigates the treatment of rapid-guess (RG) responses as missing data within the context of the effort-moderated model. Through a series of illustrations, this study demonstrates that the effort-moderated model assumes missing at random (MAR) rather than missing completely at random (MCAR), explaining the conditions necessary for MAR. These examples show that RG responses, when treated as missing under the effort-moderated model, do not introduce bias into ability estimates if the missingness mechanism is properly accounted for. Conversely, using a standard item response theory (IRT) model (scoring RG responses as if they were valid) instead of the effort-moderated model leads to considerable biases, underestimating group means and overestimating standard deviations when the item parameters are known, or overestimating item difficulty if the item parameters are estimated.
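A hedged base-R sketch of the flagging step that precedes such analyses is shown below: responses faster than an item-level time threshold are treated as rapid guesses and set to missing before IRT scoring. The thresholds and data are illustrative assumptions, not values from the article.

```r
# Flag rapid-guess responses via response-time thresholds and treat them as missing.
flag_rapid_guesses <- function(resp, rt, thresholds) {
  rg <- sweep(rt, 2, thresholds, "<")  # TRUE where response time < item threshold
  resp[rg] <- NA                       # noneffortful responses become missing
  list(scored = resp, rg_rate = colMeans(rg))
}

set.seed(2)
resp <- matrix(rbinom(200 * 10, 1, 0.6), ncol = 10)          # 0/1 item scores
rt   <- matrix(rlnorm(200 * 10, meanlog = 2), ncol = 10)      # response times (s)
out  <- flag_rapid_guesses(resp, rt, thresholds = rep(3, 10)) # 3-s thresholds (assumed)
out$rg_rate                                                   # proportion flagged per item
# The cleaned matrix can then be scored with a standard IRT model; under the
# effort-moderated model the flagged responses likewise drop out of the likelihood.
```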
{"title":"Treating Noneffortful Responses as Missing.","authors":"Christine E DeMars","doi":"10.1177/00131644241297925","DOIUrl":"https://doi.org/10.1177/00131644241297925","url":null,"abstract":"<p><p>This study investigates the treatment of rapid-guess (RG) responses as missing data within the context of the effort-moderated model. Through a series of illustrations, this study demonstrates that the effort-moderated model assumes missing at random (MAR) rather than missing completely at random (MCAR), explaining the conditions necessary for MAR. These examples show that RG responses, when treated as missing under the effort-moderated model, do not introduce bias into ability estimates if the missingness mechanism is properly accounted for. Conversely, using a standard item response theory (IRT) model (scoring RG responses as if they were valid) instead of the effort-moderated model leads to considerable biases, underestimating group means and overestimating standard deviations when the item parameters are known, or overestimating item difficulty if the item parameters are estimated.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241297925"},"PeriodicalIF":2.1,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11607706/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142767511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring the Evidence to Interpret Differential Item Functioning via Response Process Data
Pub Date : 2024-11-29 DOI: 10.1177/00131644241298975
Ziying Li, Jinnie Shin, Huan Kuang, A Corinne Huggins-Manley
Evaluating differential item functioning (DIF) in assessments plays an important role in achieving measurement fairness across different subgroups, such as gender and native language. However, traditional DIF techniques that rely solely on item response scores pose challenges for researchers and practitioners in interpreting DIF. Recently, response process data, which carry valuable information about examinees' response behaviors, have offered an opportunity to further interpret DIF items by examining differences in response processes. This study aims to investigate the potential of response process data features for improving the interpretability of DIF items, with a focus on gender DIF, using data from the Programme for the International Assessment of Adult Competencies (PIAAC) 2012 computer-based numeracy assessment. We applied random forest and logistic regression with ridge regularization to investigate the association between process data features and DIF items, evaluating which features are most important for interpreting DIF. In addition, we evaluated model performance across varying percentages of DIF items to reflect practical scenarios. The results demonstrate that the combination of timing features and action-sequence features is informative for revealing response process differences between groups, thereby enhancing DIF item interpretability. Overall, this study introduces a feasible procedure for leveraging response process data to understand and interpret DIF items, shedding light on potential reasons for the low agreement between DIF statistics and expert reviews and revealing potentially irrelevant factors that can be addressed to enhance measurement equity.
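The sketch below illustrates the general modeling strategy (ridge-penalized logistic regression via glmnet with alpha = 0, and a random forest) on invented item-level process features; the feature names and data are placeholders, not PIAAC variables.

```r
# Relating item-level process features to DIF flags (placeholder data).
library(glmnet)
library(randomForest)

set.seed(3)
n_items  <- 60
features <- data.frame(
  median_time  = rlnorm(n_items),    # timing feature (assumed)
  n_actions    = rpois(n_items, 8),  # action-sequence feature (assumed)
  revisit_rate = runif(n_items)
)
dif_flag <- factor(rbinom(n_items, 1, 0.3))  # 1 = item flagged as DIF

ridge <- cv.glmnet(as.matrix(features), dif_flag,
                   family = "binomial", alpha = 0)  # alpha = 0 -> ridge penalty
coef(ridge, s = "lambda.min")

rf <- randomForest(dif_flag ~ ., data = data.frame(features, dif_flag))
importance(rf)  # which process features matter most for interpreting DIF
```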
{"title":"Exploring the Evidence to Interpret Differential Item Functioning via Response Process Data.","authors":"Ziying Li, Jinnie Shin, Huan Kuang, A Corinne Huggins-Manley","doi":"10.1177/00131644241298975","DOIUrl":"https://doi.org/10.1177/00131644241298975","url":null,"abstract":"<p><p>Evaluating differential item functioning (DIF) in assessments plays an important role in achieving measurement fairness across different subgroups, such as gender and native language. However, relying solely on the item response scores among traditional DIF techniques poses challenges for researchers and practitioners in interpreting DIF. Recently, response process data, which carry valuable information about examinees' response behaviors, offer an opportunity to further interpret DIF items by examining differences in response processes. This study aims to investigate the potential of response process data features in improving the interpretability of DIF items, with a focus on gender DIF using data from the Programme for International Assessment of Adult Competencies (PIAAC) 2012 computer-based numeracy assessment. We applied random forest and logistic regression with ridge regularization to investigate the association between process data features and DIF items, evaluating the important features to interpret DIF. In addition, we evaluated model performance across varying percentages of DIF items to reflect practical scenarios with different percentages of DIF items. The results demonstrate that the combination of timing features and action-sequence features is informative to reveal the response process differences between groups, thereby enhancing DIF item interpretability. Overall, this study introduces a feasible procedure to leverage response process data to understand and interpret DIF items, shedding light on potential reasons for the low agreement between DIF statistics and expert reviews and revealing potential irrelevant factors to enhance measurement equity.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241298975"},"PeriodicalIF":2.1,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11607718/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142767507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discriminant Validity of Interval Response Formats: Investigating the Dimensional Structure of Interval Widths
Pub Date : 2024-11-25 DOI: 10.1177/00131644241283400
Matthias Kloft, Daniel W Heck
In psychological research, respondents are usually asked to answer questions with a single response value. A useful alternative is the interval response format, such as the dual-range slider (DRS), where respondents provide an interval with a lower and an upper bound for each item. Interval responses may be used to measure psychological constructs such as variability in the domain of personality (e.g., self-ratings), uncertainty in estimation tasks (e.g., forecasting), and ambiguity in judgments (e.g., concerning the pragmatic use of verbal quantifiers). However, it is unclear whether respondents are sensitive to the requirements of a particular task and whether interval widths actually measure the constructs of interest. To test the discriminant validity of interval widths, we conducted a study in which respondents answered 92 items belonging to seven different tasks from the domains of personality, estimation, and judgment. We investigated the dimensional structure of interval widths by fitting exploratory and confirmatory factor models, using an appropriate multivariate logit function to transform the bounded interval responses. The estimated factorial structure closely followed the theoretically assumed structure of the tasks, which varied in their degree of similarity. We did not find a strong overarching general factor, which speaks against a response style influencing interval widths across all tasks and domains. Overall, this indicates that respondents are sensitive to the requirements of different tasks and domains when using interval response formats.
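One way to carry out the transformation step mentioned above is sketched below: an interval in (0, 1) is split into the proportions below, inside, and above it, and mapped to two unbounded coordinates with a multivariate (log-ratio) logit. This is a generic version for illustration; the exact transform used in the article may differ.

```r
# Map bounded interval responses [lower, upper] in (0, 1) to two unbounded coordinates.
interval_logit <- function(lower, upper, eps = 1e-3) {
  lower  <- pmin(pmax(lower, eps), 1 - 2 * eps)
  upper  <- pmin(pmax(upper, lower + eps), 1 - eps)
  below  <- lower
  inside <- upper - lower              # the interval width
  above  <- 1 - upper
  cbind(z1 = log(below / above),       # location-like coordinate
        z2 = log(inside / above))      # width-like coordinate
}

interval_logit(lower = c(0.2, 0.5), upper = c(0.6, 0.7))
# The transformed coordinates can be analyzed with ordinary exploratory and
# confirmatory factor models to study the dimensional structure of interval widths.
```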
{"title":"Discriminant Validity of Interval Response Formats: Investigating the Dimensional Structure of Interval Widths.","authors":"Matthias Kloft, Daniel W Heck","doi":"10.1177/00131644241283400","DOIUrl":"10.1177/00131644241283400","url":null,"abstract":"<p><p>In psychological research, respondents are usually asked to answer questions with a single response value. A useful alternative are interval response formats like the dual-range slider (DRS) where respondents provide an interval with a lower and an upper bound for each item. Interval responses may be used to measure psychological constructs such as variability in the domain of personality (e.g., self-ratings), uncertainty in estimation tasks (e.g., forecasting), and ambiguity in judgments (e.g., concerning the pragmatic use of verbal quantifiers). However, it is unclear whether respondents are sensitive to the requirements of a particular task and whether interval widths actually measure the constructs of interest. To test the discriminant validity of interval widths, we conducted a study in which respondents answered 92 items belonging to seven different tasks from the domains of personality, estimation, and judgment. We investigated the dimensional structure of interval widths by fitting exploratory and confirmatory factor models while using an appropriate multivariate logit function to transform the bounded interval responses. The estimated factorial structure closely followed the theoretically assumed structure of the tasks, which varied in their degree of similarity. We did not find a strong overarching general factor, which speaks against a response style influencing interval widths across all tasks and domains. Overall, this indicates that respondents are sensitive to the requirements of different tasks and domains when using interval response formats.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241283400"},"PeriodicalIF":2.1,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11586930/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142727066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Novick Meets Bayes: Improving the Assessment of Individual Students in Educational Practice and Research by Capitalizing on Assessors' Prior Beliefs
Pub Date : 2024-11-25 DOI: 10.1177/00131644241296139
Steffen Zitzmann, Gabe A Orona, Julian F Lohmann, Christoph König, Lisa Bardach, Martin Hecht
The assessment of individual students is not only crucial in the school setting but also at the core of educational research. Although classical test theory focuses on maximizing insights from student responses, the Bayesian perspective incorporates the assessor's prior belief, thereby enriching assessment with knowledge gained from previous interactions with the student or with similar students. We propose and illustrate a formal Bayesian approach that not only allows assessors to form a stronger belief about a student's competency but also offers a more accurate assessment than classical test theory. In addition, we propose a straightforward method for gauging prior beliefs using two specific items and point to the possibility of integrating additional information.
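The core updating idea can be written down in a few lines. The sketch below uses a conjugate normal model in R: the assessor's prior belief about a student's competency is combined with an observed score, weighted by their precisions. All numbers are illustrative assumptions, not values from the article.

```r
# Combine a prior belief about a student's competency with an observed score.
posterior_belief <- function(prior_mean, prior_sd, obs_score, error_sd) {
  prior_prec <- 1 / prior_sd^2
  obs_prec   <- 1 / error_sd^2
  w          <- prior_prec / (prior_prec + obs_prec)   # weight on the prior belief
  c(mean = w * prior_mean + (1 - w) * obs_score,
    sd   = sqrt(1 / (prior_prec + obs_prec)))
}

# Prior gauged from earlier interactions (e.g., two anchor items), then updated
# with a new observed score of 70 whose standard error of measurement is 5:
posterior_belief(prior_mean = 65, prior_sd = 8, obs_score = 70, error_sd = 5)
```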
{"title":"Novick Meets Bayes: Improving the Assessment of Individual Students in Educational Practice and Research by Capitalizing on Assessors' Prior Beliefs.","authors":"Steffen Zitzmann, Gabe A Orona, Julian F Lohmann, Christoph König, Lisa Bardach, Martin Hecht","doi":"10.1177/00131644241296139","DOIUrl":"10.1177/00131644241296139","url":null,"abstract":"<p><p>The assessment of individual students is not only crucial in the school setting but also at the core of educational research. Although classical test theory focuses on maximizing insights from student responses, the Bayesian perspective incorporates the assessor's prior belief, thereby enriching assessment with knowledge gained from previous interactions with the student or with similar students. We propose and illustrate a formal Bayesian approach that not only allows to form a stronger belief about a student's competency but also offers a more accurate assessment than classical test theory. In addition, we propose a straightforward method for gauging prior beliefs using two specific items and point to the possibility to integrate additional information.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241296139"},"PeriodicalIF":2.1,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11586934/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142727068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Differential Item Functioning Effect Size Use for Validity Information
Pub Date : 2024-11-22 DOI: 10.1177/00131644241293694
W Holmes Finch, Maria Dolores Hidalgo Montesinos, Brian F French, Maria Hernandez Finch
There has been an emphasis on effect sizes for differential item functioning (DIF) with the purpose of understanding the magnitude of the differences that are detected through statistical significance testing. Several different effect sizes have been suggested that correspond to the method used for analysis, as have different guidelines for their interpretation. The purpose of this simulation study was to compare the performance of the DIF effect size measures described for quantifying and comparing the amount of DIF in two assessments. Several factors were manipulated that were thought to influence the effect sizes or are known to influence DIF detection. This study asked the following two questions. First, do the effect sizes accurately capture aggregate DIF across items? Second, do effect sizes accurately identify which assessment has the least amount of DIF? We highlight effect sizes that performed well across several simulated conditions. We also apply these effect sizes to a real data set to provide an example. Results of the study revealed that the fixed-effects mean log odds ratio (ln ŌR_FE) and the variance of the Mantel-Haenszel log odds ratios (τ̂²) were most accurate for identifying which test contains more DIF. We point to future directions for this work to aid the continued focus on effect sizes for understanding DIF magnitude.
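As a small illustration of the two aggregate effect sizes named above, the base-R sketch below combines item-level Mantel-Haenszel log odds ratios into an inverse-variance-weighted (fixed-effects) mean and a simple between-item variance. The input values are placeholders, and the plain sample variance stands in for whichever τ² estimator a given study uses.

```r
# Aggregate item-level Mantel-Haenszel log odds ratios into test-level effect sizes.
aggregate_mh <- function(log_or, se) {
  w        <- 1 / se^2                        # inverse-variance (fixed-effects) weights
  ln_or_fe <- sum(w * log_or) / sum(w)        # weighted mean log odds ratio
  tau2     <- var(log_or)                     # variance of the item log odds ratios
  c(ln_OR_FE = ln_or_fe, tau2 = tau2)
}

log_or <- c(0.05, -0.40, 0.10, 0.65, -0.02)   # item MH log odds ratios (placeholders)
se     <- c(0.20, 0.25, 0.22, 0.30, 0.18)     # their standard errors (placeholders)
aggregate_mh(log_or, se)
# Larger values of either summary indicate more aggregate DIF in a test form.
```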
{"title":"Differential Item Functioning Effect Size Use for Validity Information.","authors":"W Holmes Finch, Maria Dolores Hidalgo Montesinos, Brian F French, Maria Hernandez Finch","doi":"10.1177/00131644241293694","DOIUrl":"10.1177/00131644241293694","url":null,"abstract":"<p><p>There has been an emphasis on effect sizes for differential item functioning (DIF) with the purpose to understand the magnitude of the differences that are detected through statistical significance testing. Several different effect sizes have been suggested that correspond to the method used for analysis, as have different guidelines for interpretation. The purpose of this simulation study was to compare the performance of the DIF effect size measures described for quantifying and comparing the amount of DIF in two assessments. Several factors were manipulated that were thought to influence the effect sizes or are known to influence DIF detection. This study asked the following two questions. First, do the effect sizes accurately capture aggregate DIF across items? Second, do effect sizes accurately identify which assessment has the least amount of DIF? We highlight effect sizes that had support for performing well across several simulated conditions. We also apply these effect sizes to a real data set to provide an example. Results of the study revealed that the log odds ratio of fixed effects (Ln <math> <mrow> <msub> <mrow> <mover><mrow><mi>OR</mi></mrow> <mo>¯</mo></mover> </mrow> <mrow><mi>FE</mi></mrow> </msub> </mrow> </math> ) and the variance of the Mantel-Haenszel log odds ratio ( <math> <mrow> <msup> <mrow> <mover><mrow><mi>τ</mi></mrow> <mo>^</mo></mover> </mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> ) were most accurate for identifying which test contains more DIF. We point to future directions with this work to aid the continued focus on effect sizes to understand DIF magnitude.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241293694"},"PeriodicalIF":2.1,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11583394/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142709569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal Number of Replications for Obtaining Stable Dynamic Fit Index Cutoffs
Pub Date : 2024-11-08 DOI: 10.1177/00131644241290172
Xinran Liu, Daniel McNeish
Factor analysis is commonly used in behavioral sciences to measure latent constructs, and researchers routinely consider approximate fit indices to ensure adequate model fit and to provide important validity evidence. Due to a lack of generalizable fit index cutoffs, methodologists suggest simulation-based methods to create customized cutoffs that allow researchers to assess model fit more accurately. However, simulation-based methods are computationally intensive. An open question is: How many simulation replications are needed for these custom cutoffs to stabilize? This Monte Carlo simulation study focuses on one such simulation-based method, dynamic fit index (DFI) cutoffs, to determine the optimal number of replications for obtaining stable cutoffs. Results indicated that the DFI approach generates stable cutoffs with 500 replications (the currently recommended number), but the process can be more efficient with fewer replications, especially in simulations with categorical data. Using fewer replications substantially reduces the computational time for determining cutoff values with minimal impact on the results. For one-factor or three-factor models, results suggested that in most conditions 200 DFI replications were optimal for balancing fit index cutoff stability and computational efficiency.
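The general simulation-based cutoff idea behind DFI can be sketched in a few lines of R with lavaan: simulate many data sets from a fitted (here, assumed) model, refit, and take a quantile of the resulting fit-index distribution. This is a generic illustration of the replication question, not the DFI implementation itself.

```r
# Simulation-based fit-index cutoff for a correctly specified one-factor model.
library(lavaan)

pop_model <- "f =~ 0.7*x1 + 0.7*x2 + 0.7*x3 + 0.7*x4 + 0.7*x5 + 0.7*x6"  # assumed
fit_model <- "f =~ x1 + x2 + x3 + x4 + x5 + x6"

n_reps <- 200   # the replication count whose stability the study examines
rmsea  <- replicate(n_reps, {
  d <- simulateData(pop_model, sample.nobs = 300)
  fitMeasures(cfa(fit_model, data = d), "rmsea")
})
quantile(rmsea, 0.95)   # tailored cutoff from the simulated distribution
# DFI additionally simulates from misspecified models and contrasts the two
# distributions; increasing n_reps stabilizes the cutoff at greater computational cost.
```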
{"title":"Optimal Number of Replications for Obtaining Stable Dynamic Fit Index Cutoffs.","authors":"Xinran Liu, Daniel McNeish","doi":"10.1177/00131644241290172","DOIUrl":"10.1177/00131644241290172","url":null,"abstract":"<p><p>Factor analysis is commonly used in behavioral sciences to measure latent constructs, and researchers routinely consider approximate fit indices to ensure adequate model fit and to provide important validity evidence. Due to a lack of generalizable fit index cutoffs, methodologists suggest simulation-based methods to create customized cutoffs that allow researchers to assess model fit more accurately. However, simulation-based methods are computationally intensive. An open question is: How many simulation replications are needed for these custom cutoffs to stabilize? This Monte Carlo simulation study focuses on one such simulation-based method-dynamic fit index (DFI) cutoffs-to determine the optimal number of replications for obtaining stable cutoffs. Results indicated that the DFI approach generates stable cutoffs with 500 replications (the currently recommended number), but the process can be more efficient with fewer replications, especially in simulations with categorical data. Using fewer replications significantly reduces the computational time for determining cutoff values with minimal impact on the results. For one-factor or three-factor models, results suggested that in most conditions 200 DFI replications were optimal for balancing fit index cutoff stability and computational efficiency.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241290172"},"PeriodicalIF":2.1,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562945/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Invariance: What Does Measurement Invariance Allow Us to Claim?
Pub Date : 2024-10-28 DOI: 10.1177/00131644241282982
John Protzko
Measurement involves numerous theoretical and empirical steps; ensuring that our measures operate the same way in different groups is one of those steps. Measurement invariance occurs when the factor loadings and item intercepts or thresholds of a scale operate similarly for people at the same level of the latent variable in different groups. This is commonly assumed to mean the scale is measuring the same thing in those groups. Here we test the assumption that measurement invariance implies common measurement by randomly assigning American adults (N = 1500) to fill out scales assessing either a coherent factor (search for meaning in life) or a nonsense factor measuring nothing. We find that a nonsense scale with items measuring nothing shows strong measurement invariance with the original scale, is reliable, and covaries with other constructs. We show measurement invariance can occur without measurement. Thus, we cannot infer that measurement invariance means one is measuring the same thing; it may be a necessary but not a sufficient condition.
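For reference, a typical multigroup invariance test of the kind discussed above looks like the lavaan sketch below, with simulated placeholder data and variable names: configural, metric (equal loadings), and scalar (equal intercepts) models are fit and compared.

```r
# Multigroup CFA invariance test on simulated placeholder data.
library(lavaan)

set.seed(4)
n <- 400
f <- rnorm(n)
dat <- data.frame(
  m1 = 0.8 * f + rnorm(n), m2 = 0.7 * f + rnorm(n), m3 = 0.7 * f + rnorm(n),
  m4 = 0.6 * f + rnorm(n), m5 = 0.6 * f + rnorm(n),
  condition = rep(c("coherent", "nonsense"), each = n / 2)
)

model <- "meaning =~ m1 + m2 + m3 + m4 + m5"

configural <- cfa(model, data = dat, group = "condition")
metric     <- cfa(model, data = dat, group = "condition", group.equal = "loadings")
scalar     <- cfa(model, data = dat, group = "condition",
                  group.equal = c("loadings", "intercepts"))
lavTestLRT(configural, metric, scalar)   # chi-square difference tests
# As the article argues, passing these tests shows the scale behaves the same way
# across groups; it does not by itself show what, if anything, is being measured.
```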
{"title":"Invariance: What Does Measurement Invariance Allow Us to Claim?","authors":"John Protzko","doi":"10.1177/00131644241282982","DOIUrl":"10.1177/00131644241282982","url":null,"abstract":"<p><p>Measurement involves numerous theoretical and empirical steps-ensuring our measures are operating the same in different groups is one step. Measurement invariance occurs when the factor loadings and item intercepts or thresholds of a scale operate similarly for people at the same level of the latent variable in different groups. This is commonly assumed to mean the scale is measuring the same thing in those groups. Here we test the assumption of extending measurement invariance to mean common measurement by randomly assigning American adults (<i>N</i> = 1500) to fill out scales assessing a coherent factor (search for meaning in life) or a nonsense factor measuring nothing. We find a nonsense scale with items measuring nothing shows strong measurement invariance with the original scale, is reliable, and covaries with other constructs. We show measurement invariance can occur without measurement. Thus, we cannot infer that measurement invariance means one is measuring the same thing, it may be a necessary but not a sufficient condition.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241282982"},"PeriodicalIF":2.1,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562939/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting Differential Item Functioning Using Response Time
Pub Date : 2024-10-26 DOI: 10.1177/00131644241280400
Qizhou Duan, Ying Cheng
This study investigated uniform differential item functioning (DIF) detection in response times. We proposed a regression analysis approach with both the working speed and the group membership as independent variables and log-transformed response times as the dependent variable. Effect size measures such as ΔR² and the percentage change in regression coefficients, in conjunction with statistical significance tests, were used to flag DIF items. A simulation study was conducted to assess the performance of three DIF detection criteria: (a) the significance test alone, (b) the significance test with ΔR², and (c) the significance test with the percentage change in regression coefficients. The simulation study considered factors such as sample size, the proportion of the focal group relative to the total sample size, the number of DIF items, and the amount of DIF. The results showed that the significance test alone was too strict; using the percentage change in regression coefficients as an effect size measure reduced the flagging rate when the sample size was large, but the effect was inconsistent across conditions; using ΔR² with the significance test reduced the flagging rate and was fairly consistent. The PISA 2018 data were used to illustrate the performance of the proposed method in a real dataset. Furthermore, we provide guidelines for conducting DIF studies with response time.
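A minimal base-R version of the proposed regression check is sketched below with simulated placeholder data: log response time is regressed on working speed with and without group membership, and ΔR², the percentage change in the speed coefficient, and the significance test for the group effect are inspected.

```r
# Regression-based DIF check for response times (simulated placeholder data).
set.seed(5)
n      <- 1000
speed  <- rnorm(n)                       # working speed
group  <- rbinom(n, 1, 0.4)              # 0 = reference, 1 = focal group
log_rt <- 3 - 0.5 * speed + 0.15 * group + rnorm(n, sd = 0.4)  # uniform DIF of 0.15

m0 <- lm(log_rt ~ speed)                 # reduced model
m1 <- lm(log_rt ~ speed + group)         # augmented model with group membership

delta_R2   <- summary(m1)$r.squared - summary(m0)$r.squared
pct_change <- 100 * (coef(m1)["speed"] - coef(m0)["speed"]) / coef(m0)["speed"]
summary(m1)$coefficients["group", ]      # significance test for the group effect
c(delta_R2 = delta_R2, pct_change_speed = unname(pct_change))
```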
{"title":"Detecting Differential Item Functioning Using Response Time.","authors":"Qizhou Duan, Ying Cheng","doi":"10.1177/00131644241280400","DOIUrl":"10.1177/00131644241280400","url":null,"abstract":"<p><p>This study investigated uniform differential item functioning (DIF) detection in response times. We proposed a regression analysis approach with both the working speed and the group membership as independent variables, and logarithm transformed response times as the dependent variable. Effect size measures such as Δ <math> <mrow> <msup><mrow><mi>R</mi></mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> and percentage change in regression coefficients in conjunction with the statistical significance tests were used to flag DIF items. A simulation study was conducted to assess the performance of three DIF detection criteria: (a) significance test, (b) significance test with Δ <math> <mrow> <msup><mrow><mi>R</mi></mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> , and (c) significance test with the percentage change in regression coefficients. The simulation study considered factors such as sample sizes, proportion of the focal group in relation to total sample size, number of DIF items, and the amount of DIF. The results showed that the significance test alone was too strict; using the percentage change in regression coefficients as an effect size measure reduced the flagging rate when the sample size was large, but the effect was inconsistent across different conditions; using Δ<i>R</i> <sup>2</sup> with significance test reduced the flagging rate and was fairly consistent. The PISA 2018 data were used to illustrate the performance of the proposed method in a real dataset. Furthermore, we provide guidelines for conducting DIF studies with response time.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241280400"},"PeriodicalIF":2.1,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562889/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142650502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}