Examination of ChatGPT's Performance as a Data Analysis Tool
Pub Date : 2025-01-03 DOI: 10.1177/00131644241302721
Duygu Koçak
This study examines the performance of ChatGPT, developed by OpenAI and widely used as an AI-based conversational tool, as a data analysis tool through exploratory factor analysis (EFA). To this end, simulated data were generated under various data conditions, including normal distribution, response category, sample size, test length, factor loading, and measurement models. The generated data were analyzed using ChatGPT-4o twice with a 1-week interval under the same prompt, and the results were compared with those obtained using R code. In data analysis, the Kaiser-Meyer-Olkin (KMO) value, total variance explained, and the number of factors estimated using the empirical Kaiser criterion, Hull method, and Kaiser-Guttman criterion, as well as factor loadings, were calculated. The findings obtained from ChatGPT at two different times were found to be consistent with those obtained using R. Overall, ChatGPT demonstrated good performance for steps that require only computational decisions without involving researcher judgment or theoretical evaluation (such as KMO, total variance explained, and factor loadings). However, for multidimensional structures, although the estimated number of factors was consistent across analyses, biases were observed, suggesting that researchers should exercise caution in such decisions.
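For readers who want to see what such an analysis pipeline looks like in practice, here is a minimal R sketch (assuming the psych package and simulated placeholder data; the study's own prompts and R code are not reproduced here). It computes the KMO value, the Kaiser-Guttman factor count, factor loadings, and total variance explained; the empirical Kaiser criterion and Hull method are noted in a comment as they live in other packages.

```r
# Minimal EFA sketch; 'dat' is placeholder simulated data, not the study's data.
library(psych)

set.seed(1)
dat <- as.data.frame(matrix(sample(1:5, 300 * 12, replace = TRUE), ncol = 12))

R <- cor(dat)
KMO(R)$MSA                              # Kaiser-Meyer-Olkin sampling adequacy
sum(eigen(R)$values > 1)                # Kaiser-Guttman criterion: eigenvalues > 1

efa <- fa(dat, nfactors = 2, rotate = "oblimin", fm = "minres")
print(efa$loadings, cutoff = 0.30)      # factor loadings
sum(efa$Vaccounted["Proportion Var", ]) # total variance explained
# Empirical Kaiser criterion and Hull method: see, e.g., EFAtools::EKC() and
# EFAtools::HULL() (assumed to be installed; not shown here).
```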
{"title":"Examination of ChatGPT's Performance as a Data Analysis Tool.","authors":"Duygu Koçak","doi":"10.1177/00131644241302721","DOIUrl":"https://doi.org/10.1177/00131644241302721","url":null,"abstract":"<p><p>This study examines the performance of ChatGPT, developed by OpenAI and widely used as an AI-based conversational tool, as a data analysis tool through exploratory factor analysis (EFA). To this end, simulated data were generated under various data conditions, including normal distribution, response category, sample size, test length, factor loading, and measurement models. The generated data were analyzed using ChatGPT-4o twice with a 1-week interval under the same prompt, and the results were compared with those obtained using R code. In data analysis, the Kaiser-Meyer-Olkin (KMO) value, total variance explained, and the number of factors estimated using the empirical Kaiser criterion, Hull method, and Kaiser-Guttman criterion, as well as factor loadings, were calculated. The findings obtained from ChatGPT at two different times were found to be consistent with those obtained using R. Overall, ChatGPT demonstrated good performance for steps that require only computational decisions without involving researcher judgment or theoretical evaluation (such as KMO, total variance explained, and factor loadings). However, for multidimensional structures, although the estimated number of factors was consistent across analyses, biases were observed, suggesting that researchers should exercise caution in such decisions.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241302721"},"PeriodicalIF":2.1,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11696938/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142931005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Effect of Modeling Missing Data With IRTree Approach on Parameter Estimates Under Different Simulation Conditions
Pub Date : 2024-12-23 DOI: 10.1177/00131644241306024
Yeşim Beril Soğuksu, Ergül Demir
This study explores the performance of the item response tree (IRTree) approach in modeling missing data, comparing its performance to the expectation-maximization (EM) algorithm and multiple imputation (MI) methods. Both simulation and empirical data were used to evaluate these methods across different missing data mechanisms, test lengths, sample sizes, and missing data proportions. Expected a posteriori was used for ability estimation, and bias and root mean square error (RMSE) were calculated. The findings indicate that IRTree provides more accurate ability estimates with lower RMSE than both EM and MI methods. Its overall performance was particularly strong under missing completely at random and missing not at random, especially with longer tests and lower proportions of missing data. However, IRTree was most effective with moderate levels of omitted responses and medium-ability test takers, though its accuracy decreased in cases of extreme omissions and abilities. The study highlights that IRTree is particularly well suited for low-stakes tests and has strong potential for providing deeper insights into the underlying missing data mechanisms within a data set.
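As a concrete illustration of the IRTree idea for omitted responses, the sketch below (base R, with invented example data) expands each item into two pseudo-items: one coding whether a response was given and one coding accuracy given a response. This is a generic recoding for exposition, not the authors' implementation.

```r
# Expand scored responses (0/1, NA = omitted) into IRTree pseudo-items.
expand_irtree <- function(resp) {
  node_respond <- ifelse(is.na(resp), 0, 1)      # node 1: responded vs. omitted
  node_correct <- ifelse(is.na(resp), NA, resp)  # node 2: accuracy, given a response
  colnames(node_respond) <- paste0(colnames(resp), "_respond")
  colnames(node_correct) <- paste0(colnames(resp), "_correct")
  cbind(node_respond, node_correct)
}

resp <- matrix(c(1, NA, 0, 1,
                 0, 1, NA, NA),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, paste0("item", 1:4)))
expand_irtree(resp)
# The pseudo-items are then fit with a multidimensional IRT model, and abilities
# are estimated with expected a posteriori (EAP), as in the study.
```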
{"title":"The Effect of Modeling Missing Data With IRTree Approach on Parameter Estimates Under Different Simulation Conditions.","authors":"Yeşim Beril Soğuksu, Ergül Demir","doi":"10.1177/00131644241306024","DOIUrl":"10.1177/00131644241306024","url":null,"abstract":"<p><p>This study explores the performance of the item response tree (IRTree) approach in modeling missing data, comparing its performance to the expectation-maximization (EM) algorithm and multiple imputation (MI) methods. Both simulation and empirical data were used to evaluate these methods across different missing data mechanisms, test lengths, sample sizes, and missing data proportions. Expected a posteriori was used for ability estimation, and bias and root mean square error (RMSE) were calculated. The findings indicate that IRTree provides more accurate ability estimates with lower RMSE than both EM and MI methods. Its overall performance was particularly strong under missing completely at random and missing not at random, especially with longer tests and lower proportions of missing data. However, IRTree was most effective with moderate levels of omitted responses and medium-ability test takers, though its accuracy decreased in cases of extreme omissions and abilities. The study highlights that IRTree is particularly well suited for low-stakes tests and has strong potential for providing deeper insights into the underlying missing data mechanisms within a data set.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241306024"},"PeriodicalIF":2.1,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11669122/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142892972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Treating Noneffortful Responses as Missing
Pub Date : 2024-11-29 DOI: 10.1177/00131644241297925
Christine E DeMars
This study investigates the treatment of rapid-guess (RG) responses as missing data within the context of the effort-moderated model. Through a series of illustrations, this study demonstrates that the effort-moderated model assumes missing at random (MAR) rather than missing completely at random (MCAR), explaining the conditions necessary for MAR. These examples show that RG responses, when treated as missing under the effort-moderated model, do not introduce bias into ability estimates if the missingness mechanism is properly accounted for. Conversely, using a standard item response theory (IRT) model (scoring RG responses as if they were valid) instead of the effort-moderated model leads to considerable biases, underestimating group means and overestimating standard deviations when the item parameters are known, or overestimating item difficulty if the item parameters are estimated.
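A hedged base-R sketch of the flagging step that precedes such analyses is shown below: responses faster than an item-level time threshold are treated as rapid guesses and set to missing before IRT scoring. The thresholds and data are illustrative assumptions, not values from the article.

```r
# Flag rapid-guess responses via response-time thresholds and treat them as missing.
flag_rapid_guesses <- function(resp, rt, thresholds) {
  rg <- sweep(rt, 2, thresholds, "<")  # TRUE where response time < item threshold
  resp[rg] <- NA                       # noneffortful responses become missing
  list(scored = resp, rg_rate = colMeans(rg))
}

set.seed(2)
resp <- matrix(rbinom(200 * 10, 1, 0.6), ncol = 10)          # 0/1 item scores
rt   <- matrix(rlnorm(200 * 10, meanlog = 2), ncol = 10)      # response times (s)
out  <- flag_rapid_guesses(resp, rt, thresholds = rep(3, 10)) # 3-s thresholds (assumed)
out$rg_rate                                                   # proportion flagged per item
# The cleaned matrix can then be scored with a standard IRT model; under the
# effort-moderated model the flagged responses likewise drop out of the likelihood.
```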
{"title":"Treating Noneffortful Responses as Missing.","authors":"Christine E DeMars","doi":"10.1177/00131644241297925","DOIUrl":"https://doi.org/10.1177/00131644241297925","url":null,"abstract":"<p><p>This study investigates the treatment of rapid-guess (RG) responses as missing data within the context of the effort-moderated model. Through a series of illustrations, this study demonstrates that the effort-moderated model assumes missing at random (MAR) rather than missing completely at random (MCAR), explaining the conditions necessary for MAR. These examples show that RG responses, when treated as missing under the effort-moderated model, do not introduce bias into ability estimates if the missingness mechanism is properly accounted for. Conversely, using a standard item response theory (IRT) model (scoring RG responses as if they were valid) instead of the effort-moderated model leads to considerable biases, underestimating group means and overestimating standard deviations when the item parameters are known, or overestimating item difficulty if the item parameters are estimated.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241297925"},"PeriodicalIF":2.1,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11607706/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142767511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring the Evidence to Interpret Differential Item Functioning via Response Process Data
Pub Date : 2024-11-29 DOI: 10.1177/00131644241298975
Ziying Li, Jinnie Shin, Huan Kuang, A Corinne Huggins-Manley
Evaluating differential item functioning (DIF) in assessments plays an important role in achieving measurement fairness across different subgroups, such as gender and native language. However, traditional DIF techniques that rely solely on item response scores pose challenges for researchers and practitioners in interpreting DIF. Recently, response process data, which carry valuable information about examinees' response behaviors, have offered an opportunity to further interpret DIF items by examining differences in response processes. This study aims to investigate the potential of response process data features for improving the interpretability of DIF items, with a focus on gender DIF, using data from the Programme for the International Assessment of Adult Competencies (PIAAC) 2012 computer-based numeracy assessment. We applied random forest and logistic regression with ridge regularization to investigate the association between process data features and DIF items, evaluating which features are most important for interpreting DIF. In addition, we evaluated model performance across varying percentages of DIF items to reflect practical scenarios. The results demonstrate that the combination of timing features and action-sequence features is informative for revealing response process differences between groups, thereby enhancing DIF item interpretability. Overall, this study introduces a feasible procedure for leveraging response process data to understand and interpret DIF items, shedding light on potential reasons for the low agreement between DIF statistics and expert reviews and revealing potentially irrelevant factors that can be addressed to enhance measurement equity.
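The sketch below illustrates the general modeling strategy (ridge-penalized logistic regression via glmnet with alpha = 0, and a random forest) on invented item-level process features; the feature names and data are placeholders, not PIAAC variables.

```r
# Relating item-level process features to DIF flags (placeholder data).
library(glmnet)
library(randomForest)

set.seed(3)
n_items  <- 60
features <- data.frame(
  median_time  = rlnorm(n_items),    # timing feature (assumed)
  n_actions    = rpois(n_items, 8),  # action-sequence feature (assumed)
  revisit_rate = runif(n_items)
)
dif_flag <- factor(rbinom(n_items, 1, 0.3))  # 1 = item flagged as DIF

ridge <- cv.glmnet(as.matrix(features), dif_flag,
                   family = "binomial", alpha = 0)  # alpha = 0 -> ridge penalty
coef(ridge, s = "lambda.min")

rf <- randomForest(dif_flag ~ ., data = data.frame(features, dif_flag))
importance(rf)  # which process features matter most for interpreting DIF
```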
{"title":"Exploring the Evidence to Interpret Differential Item Functioning via Response Process Data.","authors":"Ziying Li, Jinnie Shin, Huan Kuang, A Corinne Huggins-Manley","doi":"10.1177/00131644241298975","DOIUrl":"https://doi.org/10.1177/00131644241298975","url":null,"abstract":"<p><p>Evaluating differential item functioning (DIF) in assessments plays an important role in achieving measurement fairness across different subgroups, such as gender and native language. However, relying solely on the item response scores among traditional DIF techniques poses challenges for researchers and practitioners in interpreting DIF. Recently, response process data, which carry valuable information about examinees' response behaviors, offer an opportunity to further interpret DIF items by examining differences in response processes. This study aims to investigate the potential of response process data features in improving the interpretability of DIF items, with a focus on gender DIF using data from the Programme for International Assessment of Adult Competencies (PIAAC) 2012 computer-based numeracy assessment. We applied random forest and logistic regression with ridge regularization to investigate the association between process data features and DIF items, evaluating the important features to interpret DIF. In addition, we evaluated model performance across varying percentages of DIF items to reflect practical scenarios with different percentages of DIF items. The results demonstrate that the combination of timing features and action-sequence features is informative to reveal the response process differences between groups, thereby enhancing DIF item interpretability. Overall, this study introduces a feasible procedure to leverage response process data to understand and interpret DIF items, shedding light on potential reasons for the low agreement between DIF statistics and expert reviews and revealing potential irrelevant factors to enhance measurement equity.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241298975"},"PeriodicalIF":2.1,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11607718/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142767507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discriminant Validity of Interval Response Formats: Investigating the Dimensional Structure of Interval Widths
Pub Date : 2024-11-25 DOI: 10.1177/00131644241283400
Matthias Kloft, Daniel W Heck
In psychological research, respondents are usually asked to answer questions with a single response value. A useful alternative is the interval response format, such as the dual-range slider (DRS), where respondents provide an interval with a lower and an upper bound for each item. Interval responses may be used to measure psychological constructs such as variability in the domain of personality (e.g., self-ratings), uncertainty in estimation tasks (e.g., forecasting), and ambiguity in judgments (e.g., concerning the pragmatic use of verbal quantifiers). However, it is unclear whether respondents are sensitive to the requirements of a particular task and whether interval widths actually measure the constructs of interest. To test the discriminant validity of interval widths, we conducted a study in which respondents answered 92 items belonging to seven different tasks from the domains of personality, estimation, and judgment. We investigated the dimensional structure of interval widths by fitting exploratory and confirmatory factor models, using an appropriate multivariate logit function to transform the bounded interval responses. The estimated factorial structure closely followed the theoretically assumed structure of the tasks, which varied in their degree of similarity. We did not find a strong overarching general factor, which speaks against a response style influencing interval widths across all tasks and domains. Overall, this indicates that respondents are sensitive to the requirements of different tasks and domains when using interval response formats.
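One way to carry out the transformation step mentioned above is sketched below: an interval in (0, 1) is split into the proportions below, inside, and above it, and mapped to two unbounded coordinates with a multivariate (log-ratio) logit. This is a generic version for illustration; the exact transform used in the article may differ.

```r
# Map bounded interval responses [lower, upper] in (0, 1) to two unbounded coordinates.
interval_logit <- function(lower, upper, eps = 1e-3) {
  lower  <- pmin(pmax(lower, eps), 1 - 2 * eps)
  upper  <- pmin(pmax(upper, lower + eps), 1 - eps)
  below  <- lower
  inside <- upper - lower              # the interval width
  above  <- 1 - upper
  cbind(z1 = log(below / above),       # location-like coordinate
        z2 = log(inside / above))      # width-like coordinate
}

interval_logit(lower = c(0.2, 0.5), upper = c(0.6, 0.7))
# The transformed coordinates can be analyzed with ordinary exploratory and
# confirmatory factor models to study the dimensional structure of interval widths.
```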
{"title":"Discriminant Validity of Interval Response Formats: Investigating the Dimensional Structure of Interval Widths.","authors":"Matthias Kloft, Daniel W Heck","doi":"10.1177/00131644241283400","DOIUrl":"10.1177/00131644241283400","url":null,"abstract":"<p><p>In psychological research, respondents are usually asked to answer questions with a single response value. A useful alternative are interval response formats like the dual-range slider (DRS) where respondents provide an interval with a lower and an upper bound for each item. Interval responses may be used to measure psychological constructs such as variability in the domain of personality (e.g., self-ratings), uncertainty in estimation tasks (e.g., forecasting), and ambiguity in judgments (e.g., concerning the pragmatic use of verbal quantifiers). However, it is unclear whether respondents are sensitive to the requirements of a particular task and whether interval widths actually measure the constructs of interest. To test the discriminant validity of interval widths, we conducted a study in which respondents answered 92 items belonging to seven different tasks from the domains of personality, estimation, and judgment. We investigated the dimensional structure of interval widths by fitting exploratory and confirmatory factor models while using an appropriate multivariate logit function to transform the bounded interval responses. The estimated factorial structure closely followed the theoretically assumed structure of the tasks, which varied in their degree of similarity. We did not find a strong overarching general factor, which speaks against a response style influencing interval widths across all tasks and domains. Overall, this indicates that respondents are sensitive to the requirements of different tasks and domains when using interval response formats.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241283400"},"PeriodicalIF":2.1,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11586930/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142727066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Novick Meets Bayes: Improving the Assessment of Individual Students in Educational Practice and Research by Capitalizing on Assessors' Prior Beliefs
Pub Date : 2024-11-25 DOI: 10.1177/00131644241296139
Steffen Zitzmann, Gabe A Orona, Julian F Lohmann, Christoph König, Lisa Bardach, Martin Hecht
The assessment of individual students is not only crucial in the school setting but also at the core of educational research. Although classical test theory focuses on maximizing insights from student responses, the Bayesian perspective incorporates the assessor's prior belief, thereby enriching assessment with knowledge gained from previous interactions with the student or with similar students. We propose and illustrate a formal Bayesian approach that not only allows assessors to form a stronger belief about a student's competency but also offers a more accurate assessment than classical test theory. In addition, we propose a straightforward method for gauging prior beliefs using two specific items and point to the possibility of integrating additional information.
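The core updating idea can be written down in a few lines. The sketch below uses a conjugate normal model in R: the assessor's prior belief about a student's competency is combined with an observed score, weighted by their precisions. All numbers are illustrative assumptions, not values from the article.

```r
# Combine a prior belief about a student's competency with an observed score.
posterior_belief <- function(prior_mean, prior_sd, obs_score, error_sd) {
  prior_prec <- 1 / prior_sd^2
  obs_prec   <- 1 / error_sd^2
  w          <- prior_prec / (prior_prec + obs_prec)   # weight on the prior belief
  c(mean = w * prior_mean + (1 - w) * obs_score,
    sd   = sqrt(1 / (prior_prec + obs_prec)))
}

# Prior gauged from earlier interactions (e.g., two anchor items), then updated
# with a new observed score of 70 whose standard error of measurement is 5:
posterior_belief(prior_mean = 65, prior_sd = 8, obs_score = 70, error_sd = 5)
```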
{"title":"Novick Meets Bayes: Improving the Assessment of Individual Students in Educational Practice and Research by Capitalizing on Assessors' Prior Beliefs.","authors":"Steffen Zitzmann, Gabe A Orona, Julian F Lohmann, Christoph König, Lisa Bardach, Martin Hecht","doi":"10.1177/00131644241296139","DOIUrl":"10.1177/00131644241296139","url":null,"abstract":"<p><p>The assessment of individual students is not only crucial in the school setting but also at the core of educational research. Although classical test theory focuses on maximizing insights from student responses, the Bayesian perspective incorporates the assessor's prior belief, thereby enriching assessment with knowledge gained from previous interactions with the student or with similar students. We propose and illustrate a formal Bayesian approach that not only allows to form a stronger belief about a student's competency but also offers a more accurate assessment than classical test theory. In addition, we propose a straightforward method for gauging prior beliefs using two specific items and point to the possibility to integrate additional information.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241296139"},"PeriodicalIF":2.1,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11586934/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142727068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Differential Item Functioning Effect Size Use for Validity Information
Pub Date : 2024-11-22 DOI: 10.1177/00131644241293694
W Holmes Finch, Maria Dolores Hidalgo Montesinos, Brian F French, Maria Hernandez Finch
There has been an emphasis on effect sizes for differential item functioning (DIF) with the purpose of understanding the magnitude of the differences that are detected through statistical significance testing. Several different effect sizes have been suggested that correspond to the method used for analysis, as have different guidelines for their interpretation. The purpose of this simulation study was to compare the performance of the DIF effect size measures described for quantifying and comparing the amount of DIF in two assessments. Several factors were manipulated that were thought to influence the effect sizes or are known to influence DIF detection. This study asked the following two questions. First, do the effect sizes accurately capture aggregate DIF across items? Second, do effect sizes accurately identify which assessment has the least amount of DIF? We highlight effect sizes that performed well across several simulated conditions. We also apply these effect sizes to a real data set to provide an example. Results of the study revealed that the fixed-effects mean log odds ratio (ln ŌR_FE) and the variance of the Mantel-Haenszel log odds ratios (τ̂²) were most accurate for identifying which test contains more DIF. We point to future directions for this work to aid the continued focus on effect sizes for understanding DIF magnitude.
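As a small illustration of the two aggregate effect sizes named above, the base-R sketch below combines item-level Mantel-Haenszel log odds ratios into an inverse-variance-weighted (fixed-effects) mean and a simple between-item variance. The input values are placeholders, and the plain sample variance stands in for whichever τ² estimator a given study uses.

```r
# Aggregate item-level Mantel-Haenszel log odds ratios into test-level effect sizes.
aggregate_mh <- function(log_or, se) {
  w        <- 1 / se^2                        # inverse-variance (fixed-effects) weights
  ln_or_fe <- sum(w * log_or) / sum(w)        # weighted mean log odds ratio
  tau2     <- var(log_or)                     # variance of the item log odds ratios
  c(ln_OR_FE = ln_or_fe, tau2 = tau2)
}

log_or <- c(0.05, -0.40, 0.10, 0.65, -0.02)   # item MH log odds ratios (placeholders)
se     <- c(0.20, 0.25, 0.22, 0.30, 0.18)     # their standard errors (placeholders)
aggregate_mh(log_or, se)
# Larger values of either summary indicate more aggregate DIF in a test form.
```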
{"title":"Differential Item Functioning Effect Size Use for Validity Information.","authors":"W Holmes Finch, Maria Dolores Hidalgo Montesinos, Brian F French, Maria Hernandez Finch","doi":"10.1177/00131644241293694","DOIUrl":"10.1177/00131644241293694","url":null,"abstract":"<p><p>There has been an emphasis on effect sizes for differential item functioning (DIF) with the purpose to understand the magnitude of the differences that are detected through statistical significance testing. Several different effect sizes have been suggested that correspond to the method used for analysis, as have different guidelines for interpretation. The purpose of this simulation study was to compare the performance of the DIF effect size measures described for quantifying and comparing the amount of DIF in two assessments. Several factors were manipulated that were thought to influence the effect sizes or are known to influence DIF detection. This study asked the following two questions. First, do the effect sizes accurately capture aggregate DIF across items? Second, do effect sizes accurately identify which assessment has the least amount of DIF? We highlight effect sizes that had support for performing well across several simulated conditions. We also apply these effect sizes to a real data set to provide an example. Results of the study revealed that the log odds ratio of fixed effects (Ln <math> <mrow> <msub> <mrow> <mover><mrow><mi>OR</mi></mrow> <mo>¯</mo></mover> </mrow> <mrow><mi>FE</mi></mrow> </msub> </mrow> </math> ) and the variance of the Mantel-Haenszel log odds ratio ( <math> <mrow> <msup> <mrow> <mover><mrow><mi>τ</mi></mrow> <mo>^</mo></mover> </mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> ) were most accurate for identifying which test contains more DIF. We point to future directions with this work to aid the continued focus on effect sizes to understand DIF magnitude.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241293694"},"PeriodicalIF":2.1,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11583394/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142709569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal Number of Replications for Obtaining Stable Dynamic Fit Index Cutoffs
Pub Date : 2024-11-08 DOI: 10.1177/00131644241290172
Xinran Liu, Daniel McNeish
Factor analysis is commonly used in behavioral sciences to measure latent constructs, and researchers routinely consider approximate fit indices to ensure adequate model fit and to provide important validity evidence. Due to a lack of generalizable fit index cutoffs, methodologists suggest simulation-based methods to create customized cutoffs that allow researchers to assess model fit more accurately. However, simulation-based methods are computationally intensive. An open question is: How many simulation replications are needed for these custom cutoffs to stabilize? This Monte Carlo simulation study focuses on one such simulation-based method, dynamic fit index (DFI) cutoffs, to determine the optimal number of replications for obtaining stable cutoffs. Results indicated that the DFI approach generates stable cutoffs with 500 replications (the currently recommended number), but the process can be more efficient with fewer replications, especially in simulations with categorical data. Using fewer replications substantially reduces the computational time for determining cutoff values with minimal impact on the results. For one-factor or three-factor models, results suggested that in most conditions 200 DFI replications were optimal for balancing fit index cutoff stability and computational efficiency.
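The general simulation-based cutoff idea behind DFI can be sketched in a few lines of R with lavaan: simulate many data sets from a fitted (here, assumed) model, refit, and take a quantile of the resulting fit-index distribution. This is a generic illustration of the replication question, not the DFI implementation itself.

```r
# Simulation-based fit-index cutoff for a correctly specified one-factor model.
library(lavaan)

pop_model <- "f =~ 0.7*x1 + 0.7*x2 + 0.7*x3 + 0.7*x4 + 0.7*x5 + 0.7*x6"  # assumed
fit_model <- "f =~ x1 + x2 + x3 + x4 + x5 + x6"

n_reps <- 200   # the replication count whose stability the study examines
rmsea  <- replicate(n_reps, {
  d <- simulateData(pop_model, sample.nobs = 300)
  fitMeasures(cfa(fit_model, data = d), "rmsea")
})
quantile(rmsea, 0.95)   # tailored cutoff from the simulated distribution
# DFI additionally simulates from misspecified models and contrasts the two
# distributions; increasing n_reps stabilizes the cutoff at greater computational cost.
```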
{"title":"Optimal Number of Replications for Obtaining Stable Dynamic Fit Index Cutoffs.","authors":"Xinran Liu, Daniel McNeish","doi":"10.1177/00131644241290172","DOIUrl":"10.1177/00131644241290172","url":null,"abstract":"<p><p>Factor analysis is commonly used in behavioral sciences to measure latent constructs, and researchers routinely consider approximate fit indices to ensure adequate model fit and to provide important validity evidence. Due to a lack of generalizable fit index cutoffs, methodologists suggest simulation-based methods to create customized cutoffs that allow researchers to assess model fit more accurately. However, simulation-based methods are computationally intensive. An open question is: How many simulation replications are needed for these custom cutoffs to stabilize? This Monte Carlo simulation study focuses on one such simulation-based method-dynamic fit index (DFI) cutoffs-to determine the optimal number of replications for obtaining stable cutoffs. Results indicated that the DFI approach generates stable cutoffs with 500 replications (the currently recommended number), but the process can be more efficient with fewer replications, especially in simulations with categorical data. Using fewer replications significantly reduces the computational time for determining cutoff values with minimal impact on the results. For one-factor or three-factor models, results suggested that in most conditions 200 DFI replications were optimal for balancing fit index cutoff stability and computational efficiency.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241290172"},"PeriodicalIF":2.1,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562945/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Invariance: What Does Measurement Invariance Allow Us to Claim?
Pub Date : 2024-10-28 DOI: 10.1177/00131644241282982
John Protzko
Measurement involves numerous theoretical and empirical steps; ensuring that our measures operate the same way in different groups is one of those steps. Measurement invariance occurs when the factor loadings and item intercepts or thresholds of a scale operate similarly for people at the same level of the latent variable in different groups. This is commonly assumed to mean the scale is measuring the same thing in those groups. Here we test the assumption that measurement invariance implies common measurement by randomly assigning American adults (N = 1500) to fill out scales assessing either a coherent factor (search for meaning in life) or a nonsense factor measuring nothing. We find that a nonsense scale with items measuring nothing shows strong measurement invariance with the original scale, is reliable, and covaries with other constructs. We show measurement invariance can occur without measurement. Thus, we cannot infer that measurement invariance means one is measuring the same thing; it may be a necessary but not a sufficient condition.
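For reference, a typical multigroup invariance test of the kind discussed above looks like the lavaan sketch below, with simulated placeholder data and variable names: configural, metric (equal loadings), and scalar (equal intercepts) models are fit and compared.

```r
# Multigroup CFA invariance test on simulated placeholder data.
library(lavaan)

set.seed(4)
n <- 400
f <- rnorm(n)
dat <- data.frame(
  m1 = 0.8 * f + rnorm(n), m2 = 0.7 * f + rnorm(n), m3 = 0.7 * f + rnorm(n),
  m4 = 0.6 * f + rnorm(n), m5 = 0.6 * f + rnorm(n),
  condition = rep(c("coherent", "nonsense"), each = n / 2)
)

model <- "meaning =~ m1 + m2 + m3 + m4 + m5"

configural <- cfa(model, data = dat, group = "condition")
metric     <- cfa(model, data = dat, group = "condition", group.equal = "loadings")
scalar     <- cfa(model, data = dat, group = "condition",
                  group.equal = c("loadings", "intercepts"))
lavTestLRT(configural, metric, scalar)   # chi-square difference tests
# As the article argues, passing these tests shows the scale behaves the same way
# across groups; it does not by itself show what, if anything, is being measured.
```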
{"title":"Invariance: What Does Measurement Invariance Allow Us to Claim?","authors":"John Protzko","doi":"10.1177/00131644241282982","DOIUrl":"10.1177/00131644241282982","url":null,"abstract":"<p><p>Measurement involves numerous theoretical and empirical steps-ensuring our measures are operating the same in different groups is one step. Measurement invariance occurs when the factor loadings and item intercepts or thresholds of a scale operate similarly for people at the same level of the latent variable in different groups. This is commonly assumed to mean the scale is measuring the same thing in those groups. Here we test the assumption of extending measurement invariance to mean common measurement by randomly assigning American adults (<i>N</i> = 1500) to fill out scales assessing a coherent factor (search for meaning in life) or a nonsense factor measuring nothing. We find a nonsense scale with items measuring nothing shows strong measurement invariance with the original scale, is reliable, and covaries with other constructs. We show measurement invariance can occur without measurement. Thus, we cannot infer that measurement invariance means one is measuring the same thing, it may be a necessary but not a sufficient condition.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241282982"},"PeriodicalIF":2.1,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562939/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting Differential Item Functioning Using Response Time
Pub Date : 2024-10-26 DOI: 10.1177/00131644241280400
Qizhou Duan, Ying Cheng
This study investigated uniform differential item functioning (DIF) detection in response times. We proposed a regression analysis approach with both the working speed and the group membership as independent variables and log-transformed response times as the dependent variable. Effect size measures such as ΔR² and the percentage change in regression coefficients, in conjunction with statistical significance tests, were used to flag DIF items. A simulation study was conducted to assess the performance of three DIF detection criteria: (a) the significance test alone, (b) the significance test with ΔR², and (c) the significance test with the percentage change in regression coefficients. The simulation study considered factors such as sample size, the proportion of the focal group relative to the total sample size, the number of DIF items, and the amount of DIF. The results showed that the significance test alone was too strict; using the percentage change in regression coefficients as an effect size measure reduced the flagging rate when the sample size was large, but the effect was inconsistent across conditions; using ΔR² with the significance test reduced the flagging rate and was fairly consistent. The PISA 2018 data were used to illustrate the performance of the proposed method in a real dataset. Furthermore, we provide guidelines for conducting DIF studies with response time.
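A minimal base-R version of the proposed regression check is sketched below with simulated placeholder data: log response time is regressed on working speed with and without group membership, and ΔR², the percentage change in the speed coefficient, and the significance test for the group effect are inspected.

```r
# Regression-based DIF check for response times (simulated placeholder data).
set.seed(5)
n      <- 1000
speed  <- rnorm(n)                       # working speed
group  <- rbinom(n, 1, 0.4)              # 0 = reference, 1 = focal group
log_rt <- 3 - 0.5 * speed + 0.15 * group + rnorm(n, sd = 0.4)  # uniform DIF of 0.15

m0 <- lm(log_rt ~ speed)                 # reduced model
m1 <- lm(log_rt ~ speed + group)         # augmented model with group membership

delta_R2   <- summary(m1)$r.squared - summary(m0)$r.squared
pct_change <- 100 * (coef(m1)["speed"] - coef(m0)["speed"]) / coef(m0)["speed"]
summary(m1)$coefficients["group", ]      # significance test for the group effect
c(delta_R2 = delta_R2, pct_change_speed = unname(pct_change))
```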
{"title":"Detecting Differential Item Functioning Using Response Time.","authors":"Qizhou Duan, Ying Cheng","doi":"10.1177/00131644241280400","DOIUrl":"10.1177/00131644241280400","url":null,"abstract":"<p><p>This study investigated uniform differential item functioning (DIF) detection in response times. We proposed a regression analysis approach with both the working speed and the group membership as independent variables, and logarithm transformed response times as the dependent variable. Effect size measures such as Δ <math> <mrow> <msup><mrow><mi>R</mi></mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> and percentage change in regression coefficients in conjunction with the statistical significance tests were used to flag DIF items. A simulation study was conducted to assess the performance of three DIF detection criteria: (a) significance test, (b) significance test with Δ <math> <mrow> <msup><mrow><mi>R</mi></mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> , and (c) significance test with the percentage change in regression coefficients. The simulation study considered factors such as sample sizes, proportion of the focal group in relation to total sample size, number of DIF items, and the amount of DIF. The results showed that the significance test alone was too strict; using the percentage change in regression coefficients as an effect size measure reduced the flagging rate when the sample size was large, but the effect was inconsistent across different conditions; using Δ<i>R</i> <sup>2</sup> with significance test reduced the flagging rate and was fairly consistent. The PISA 2018 data were used to illustrate the performance of the proposed method in a real dataset. Furthermore, we provide guidelines for conducting DIF studies with response time.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241280400"},"PeriodicalIF":2.1,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562889/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142650502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}