Invariance: What Does Measurement Invariance Allow Us to Claim?
Pub Date: 2025-06-01. Epub Date: 2024-10-28. DOI: 10.1177/00131644241282982
John Protzko
Measurement involves numerous theoretical and empirical steps; ensuring our measures operate the same way in different groups is one of them. Measurement invariance occurs when the factor loadings and item intercepts or thresholds of a scale operate similarly for people at the same level of the latent variable in different groups. This is commonly assumed to mean the scale is measuring the same thing in those groups. Here we test the assumption that measurement invariance implies common measurement by randomly assigning American adults (N = 1,500) to fill out scales assessing either a coherent factor (search for meaning in life) or a nonsense factor measuring nothing. We find that a nonsense scale with items measuring nothing shows strong measurement invariance with the original scale, is reliable, and covaries with other constructs. We show that measurement invariance can occur without measurement. Thus, we cannot infer that measurement invariance means one is measuring the same thing; it may be a necessary but not a sufficient condition.
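In practice, invariance claims of this kind rest on chi-square difference tests between nested multigroup models (configural, metric, scalar). Below is a minimal sketch of that decision step; the fit statistics are hypothetical placeholders, not values from the article.

```python
from scipy import stats

def chisq_diff_test(chisq_restricted, df_restricted, chisq_free, df_free):
    """Likelihood-ratio (chi-square difference) test between nested invariance models."""
    d_chisq = chisq_restricted - chisq_free
    d_df = df_restricted - df_free
    p = stats.chi2.sf(d_chisq, d_df)
    return d_chisq, d_df, p

# hypothetical fit statistics: configural (free loadings) vs. metric (equal loadings)
print(chisq_diff_test(chisq_restricted=112.4, df_restricted=58,
                      chisq_free=104.9, df_free=48))
```

A non-significant difference is typically read as support for the more constrained (invariant) model; the article's point is that passing this test does not by itself establish measurement.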
{"title":"Invariance: What Does Measurement Invariance Allow Us to Claim?","authors":"John Protzko","doi":"10.1177/00131644241282982","DOIUrl":"10.1177/00131644241282982","url":null,"abstract":"<p><p>Measurement involves numerous theoretical and empirical steps-ensuring our measures are operating the same in different groups is one step. Measurement invariance occurs when the factor loadings and item intercepts or thresholds of a scale operate similarly for people at the same level of the latent variable in different groups. This is commonly assumed to mean the scale is measuring the same thing in those groups. Here we test the assumption of extending measurement invariance to mean common measurement by randomly assigning American adults (<i>N</i> = 1500) to fill out scales assessing a coherent factor (search for meaning in life) or a nonsense factor measuring nothing. We find a nonsense scale with items measuring nothing shows strong measurement invariance with the original scale, is reliable, and covaries with other constructs. We show measurement invariance can occur without measurement. Thus, we cannot infer that measurement invariance means one is measuring the same thing, it may be a necessary but not a sufficient condition.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"458-482"},"PeriodicalIF":2.3,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562939/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating the Performance of a Regularized Differential Item Functioning Method for Testlet-Based Polytomous Items.
Pub Date: 2025-05-31. DOI: 10.1177/00131644251342512
Jing Huang, M David Miller, Anne Corinne Huggins-Manley, Walter L Leite, Herman T Knopf, Albert D Ritzhaupt
This study investigated the effect of testlets on regularization-based differential item functioning (DIF) detection in polytomous items, focusing on the generalized partial credit model with lasso penalization (GPCMlasso) DIF method. Five factors were manipulated: sample size, magnitude of testlet effect, magnitude of DIF, number of DIF items, and type of DIF-inducing covariates. Model performance was evaluated using the false-positive rate (FPR) and true-positive rate (TPR). Results showed effective control of the FPR across conditions, while the TPR was differentially influenced by the manipulated factors. Generally, a small testlet effect did not noticeably affect the GPCMlasso model's performance on either FPR or TPR. The findings provide evidence of the effectiveness of the GPCMlasso method for DIF detection in polytomous items when testlets are present. Implications for future research and limitations are also discussed.
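The core idea of lasso-penalized DIF detection is that group-by-item DIF parameters are shrunk toward zero unless the data insist otherwise. The sketch below illustrates that idea in a deliberately simplified form: dichotomous items rather than the article's polytomous GPCM, and the true latent trait standing in for the estimated or proxy trait a real analysis would use. All data are simulated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, k = 500, 10
theta = rng.normal(size=n)             # latent trait (proxy in real analyses)
group = rng.integers(0, 2, size=n)     # 0 = reference, 1 = focal
b = rng.normal(size=k)                 # item difficulties
dif = np.zeros(k)
dif[0] = 0.8                           # only item 0 carries uniform DIF
logit = theta[:, None] - b + group[:, None] * dif
y = (rng.random((n, k)) < 1 / (1 + np.exp(-logit))).astype(int)

# long format: [trait, item dummies, group-by-item DIF terms]
X = np.zeros((n * k, 1 + 2 * k))
for i in range(n):
    for j in range(k):
        r = i * k + j
        X[r, 0] = theta[i]
        X[r, 1 + j] = 1.0
        X[r, 1 + k + j] = group[i]

# the L1 penalty shrinks non-DIF interaction coefficients toward zero
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y.ravel())
print(np.round(fit.coef_[0][1 + k:], 2))  # item 0's DIF coefficient should stand out
```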
{"title":"Evaluating the Performance of a Regularized Differential Item Functioning Method for Testlet-Based Polytomous Items.","authors":"Jing Huang, M David Miller, Anne Corinne Huggins-Manley, Walter L Leite, Herman T Knopf, Albert D Ritzhaupt","doi":"10.1177/00131644251342512","DOIUrl":"10.1177/00131644251342512","url":null,"abstract":"<p><p>This study investigated the effect of testlets on regularization-based differential item functioning (DIF) detection in polytomous items, focusing on the generalized partial credit model with lasso penalization (GPCMlasso) DIF method. Five factors were manipulated: sample size, magnitude of testlet effect, magnitude of DIF, number of DIF items, and type of DIF-inducing covariates. Model performance was evaluated using false-positive rate (FPR) and true-positive rate (TPR). Results showed that the simulation had effective control of FPR across conditions, while the TPR was differentially influenced by the manipulated factors. Generally, the small testlet effect did not noticeably affect the GPCMlasso model's performance regarding FPR and TPR. The findings provide evidence of the effectiveness of the GPCMlasso method for DIF detection in polytomous items when testlets were present. The implications for future research and limitations were also discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251342512"},"PeriodicalIF":2.1,"publicationDate":"2025-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12126468/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144207999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beta-Binomial Model for Count Data: An Application in Estimating Model-Based Oral Reading Fluency.
Pub Date: 2025-05-30. DOI: 10.1177/00131644251335914
Xin Qiao, Akihito Kamata, Yusuf Kara, Cornelis Potgieter, Joseph F T Nese
In this article, the beta-binomial model for count data is proposed and demonstrated in the context of oral reading fluency (ORF) assessment, where the number of words read correctly (WRC) is of interest. Existing studies adopted the binomial model for count data in similar assessment scenarios. The beta-binomial model, however, takes into account extra variability in count data that is neglected by the binomial model, thereby accommodating potential overdispersion. To estimate model-based ORF scores, WRC and response times were jointly modeled. The full Bayesian Markov chain Monte Carlo method was adopted for model parameter estimation. A simulation study showed adequate parameter recovery for the beta-binomial model and evaluated the performance of model fit indices in selecting the true data-generating models. Further, an empirical analysis illustrated the application of the proposed model using a dataset from a computerized ORF assessment. The findings were consistent with the simulation study and demonstrated the utility of the beta-binomial model for count-type item responses from assessment data.
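As a sketch of the overdispersion point (not the article's joint model of WRC and response times), scipy's betabinom distribution shows how person-to-person variation in accuracy inflates the count variance beyond what a single binomial allows; all numbers here are simulated.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
n_words, n_students = 50, 300
p = rng.beta(8, 4, size=n_students)    # accuracy varies across students
wrc = rng.binomial(n_words, p)         # words read correctly per student

# observed variance exceeds the single-binomial variance: overdispersion
print(wrc.var(), n_words * p.mean() * (1 - p.mean()))

def neg_loglik(params):
    a, b = np.exp(params)              # keep both shape parameters positive
    return -stats.betabinom.logpmf(wrc, n_words, a, b).sum()

res = optimize.minimize(neg_loglik, x0=np.log([2.0, 2.0]))
print(np.exp(res.x))                   # recovers roughly (8, 4)
```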
{"title":"Beta-Binomial Model for Count Data: An Application in Estimating Model-Based Oral Reading Fluency.","authors":"Xin Qiao, Akihito Kamata, Yusuf Kara, Cornelis Potgieter, Joseph F T Nese","doi":"10.1177/00131644251335914","DOIUrl":"10.1177/00131644251335914","url":null,"abstract":"<p><p>In this article, the beta-binomial model for count data is proposed and demonstrated in terms of its application in the context of oral reading fluency (ORF) assessment, where the number of words read correctly (WRC) is of interest. Existing studies adopted the binomial model for count data in similar assessment scenarios. The beta-binomial model, however, takes into account extra variability in count data that have been neglected by the binomial model. Therefore, it accommodates potential overdispersion in count data compared to the binomial model. To estimate model-based ORF scores, WRC and response times were jointly modeled. The full Bayesian Markov chain Monte Carlo method was adopted for model parameter estimation. A simulation study showed adequate parameter recovery of the beta-binomial model and evaluated the performance of model fit indices in selecting the true data-generating models. Further, an empirical analysis illustrated the application of the proposed model using a dataset from a computerized ORF assessment. The obtained findings were consistent with the simulation study and demonstrated the utility of adopting the beta-binomial model for count-type item responses from assessment data.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251335914"},"PeriodicalIF":2.1,"publicationDate":"2025-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12125017/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144198554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bayesian Thurstonian IRT Modeling: Logical Dependencies as an Accurate Reflection of Thurstone's Law of Comparative Judgment.
Pub Date: 2025-05-30. DOI: 10.1177/00131644251335586
Hannah Heister, Philipp Doebler, Susanne Frick
Thurstonian item response theory (Thurstonian IRT) is a well-established approach to latent trait estimation with forced choice data of arbitrary block lengths. In the forced choice format, test takers rank statements within each block. This rank is coded with binary variables. Since each rank is awarded exactly once per block, stochastic dependencies arise; for example, when options A and B have ranks 1 and 3, C must have rank 2 in a block of length 3. Although the original implementation of the Thurstonian IRT model can recover parameters well, it is not completely true to the mathematical model and Thurstone's law of comparative judgment, as impossible binary answer patterns have a positive probability. We refer to this problem as stochastic dependencies; it is due to unconstrained item intercepts. In addition, there are redundant binary comparisons, resulting in what we call logical dependencies: for example, if within a block A < B and B < C, then A < C must follow, and a binary variable for A < C is not needed. Since current Markov Chain Monte Carlo approaches to Bayesian computation are flexible and at the same time promise correct small-sample inference, we investigate an alternative Bayesian implementation of the Thurstonian IRT model considering both stochastic and logical dependencies. We show analytically that the same parameters maximize the posterior likelihood, regardless of the presence or absence of redundant binary comparisons. A comparative simulation reveals a large reduction in computational effort for the alternative implementation, which is due to respecting both dependencies. Therefore, this investigation suggests that when fitting the Thurstonian IRT model, all dependencies should be considered.
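A small sketch of the binary coding described above, and of the logical dependency that makes one of the three pairwise variables in a block redundant (labels and ranks are illustrative):

```python
from itertools import combinations

def rank_to_binaries(ranks):
    """Binary outcomes: 1 if statement i is preferred to (ranked above) statement j.

    ranks maps each statement label to its rank within the block (1 = top).
    """
    labels = sorted(ranks)
    return {(i, j): int(ranks[i] < ranks[j]) for i, j in combinations(labels, 2)}

block = {"A": 1, "B": 3, "C": 2}        # ranks within one block of length 3
y = rank_to_binaries(block)
print(y)   # {('A', 'B'): 1, ('A', 'C'): 1, ('B', 'C'): 0}

# logical dependency: A < B and B < C together imply A < C,
# so the (A, C) comparison carries no independent information
assert not (y[("A", "B")] == 1 and y[("B", "C")] == 1) or y[("A", "C")] == 1
```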
{"title":"Bayesian Thurstonian IRT Modeling: Logical Dependencies as an Accurate Reflection of Thurstone's Law of Comparative Judgment.","authors":"Hannah Heister, Philipp Doebler, Susanne Frick","doi":"10.1177/00131644251335586","DOIUrl":"10.1177/00131644251335586","url":null,"abstract":"<p><p>Thurstonian item response theory (Thurstonian IRT) is a well-established approach to latent trait estimation with forced choice data of arbitrary block lengths. In the forced choice format, test takers rank statements within each block. This rank is coded with binary variables. Since each rank is awarded exactly once per block, stochastic dependencies arise, for example, when options A and B have ranks 1 and 3, C must have rank 2 in a block of length 3. Although the original implementation of the Thurstonian IRT model can recover parameters well, it is not completely true to the mathematical model and Thurstone's law of comparative judgment, as impossible binary answer patterns have a positive probability. We refer to this problem as stochastic dependencies and it is due to unconstrained item intercepts. In addition, there are redundant binary comparisons resulting in what we call logical dependencies, for example, if within a block <math><mrow><mi>A</mi> <mo><</mo> <mi>B</mi></mrow> </math> and <math><mrow><mi>B</mi> <mo><</mo> <mi>C</mi></mrow> </math> , then <math><mrow><mi>A</mi> <mo><</mo> <mi>C</mi></mrow> </math> must follow and a binary variable for <math><mrow><mi>A</mi> <mo><</mo> <mi>C</mi></mrow> </math> is not needed. Since current Markov Chain Monte Carlo approaches to Bayesian computation are flexible and at the same time promise correct small sample inference, we investigate an alternative Bayesian implementation of the Thurstonian IRT model considering both stochastic and logical dependencies. We show analytically that the same parameters maximize the posterior likelihood, regardless of the presence or absence of redundant binary comparisons. A comparative simulation reveals a large reduction in computational effort for the alternative implementation, which is due to respecting both dependencies. Therefore, this investigation suggests that when fitting the Thurstonian IRT model, all dependencies should be considered.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251335586"},"PeriodicalIF":2.1,"publicationDate":"2025-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12125010/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144198553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Biclustering to Detect Cheating in Real Time on Mixed-Format Tests.
Pub Date: 2025-05-24. DOI: 10.1177/00131644251333143
Hyeryung Lee, Walter P Vispoel
We evaluated a real-time biclustering method for detecting cheating on mixed-format assessments that included dichotomous, polytomous, and multi-part items. Biclustering jointly groups examinees and items by identifying subgroups of test takers who exhibit similar response patterns on specific subsets of items. This method's flexibility and minimal assumptions about examinee behavior make it computationally efficient and highly adaptable. To further fine-tune accuracy and reduce false positives in real-time detection, enhanced statistical significance tests were incorporated into the illustrated algorithms. Two simulation studies were conducted to assess detection across varying testing conditions. In the first study, the method effectively detected cheating on tests composed entirely of either dichotomous or non-dichotomous items. In the second study, we examined tests with varying mixed item formats and again observed strong detection performance. In both studies, detection performance was examined at each timestamp in real time and evaluated under three varying conditions: proportion of cheaters, cheating group size, and proportion of compromised items. Across conditions, the method demonstrated strong computational efficiency, underscoring its suitability for real-time applications. Overall, these results highlight the adaptability, versatility, and effectiveness of biclustering in detecting cheating in real time while maintaining low false-positive rates.
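The article's algorithm is a purpose-built real-time method with added significance tests; as a generic stand-in, scikit-learn's spectral coclustering illustrates how a shared answer pattern on a compromised item subset surfaces as a joint examinee-by-item block in simulated 0/1 response data.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(2)
n_honest, n_cheat, n_items = 180, 20, 40
X = rng.integers(0, 2, size=(n_honest + n_cheat, n_items)).astype(float)
X[n_honest:, :10] = 1.0   # cheating group answers the compromised items identically

# biclustering groups examinees and items simultaneously
model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
for c in range(2):
    rows, cols = model.get_indices(c)
    print(f"bicluster {c}: {len(rows)} examinees x {len(cols)} items")
```

In a real-time setting, the article's approach would rerun detection at each timestamp and screen candidate biclusters with significance tests to keep false positives low.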
{"title":"Using Biclustering to Detect Cheating in Real Time on Mixed-Format Tests.","authors":"Hyeryung Lee, Walter P Vispoel","doi":"10.1177/00131644251333143","DOIUrl":"10.1177/00131644251333143","url":null,"abstract":"<p><p>We evaluated a real-time biclustering method for detecting cheating on mixed-format assessments that included dichotomous, polytomous, and multi-part items. Biclustering jointly groups examinees and items by identifying subgroups of test takers who exhibit similar response patterns on specific subsets of items. This method's flexibility and minimal assumptions about examinee behavior make it computationally efficient and highly adaptable. To further finetune accuracy and reduce false positives in real-time detection, enhanced statistical significance tests were incorporated into the illustrated algorithms. Two simulation studies were conducted to assess detection across varying testing conditions. In the first study, the method effectively detected cheating on tests composed entirely of either dichotomous or non-dichotomous items. In the second study, we examined tests with varying mixed item formats and again observed strong detection performance. In both studies, detection performance was examined at each timestamp in real time and evaluated under three varying conditions: proportion of cheaters, cheating group size, and proportion of compromised items. Across conditions, the method demonstrated strong computational efficiency, underscoring its suitability for real-time applications. Overall, these results highlight the adaptability, versatility, and effectiveness of biclustering in detecting cheating in real time while maintaining low false-positive rates.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251333143"},"PeriodicalIF":2.1,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12104213/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144156794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Deep Reinforcement Learning to Decide Test Length.
Pub Date: 2025-05-03. DOI: 10.1177/00131644251332972
James Zoucha, Igor Himelfarb, Nai-En Tang
This study explored the application of deep reinforcement learning (DRL) as an innovative approach to optimize test length. The primary focus was to evaluate whether the current length of the National Board of Chiropractic Examiners Part I Exam is justified. By modeling the problem as a combinatorial optimization task within a Markov Decision Process framework, an algorithm capable of constructing test forms from a finite set of items while adhering to critical structural constraints, such as content representation and item difficulty distribution, was used. The findings reveal that although the DRL algorithm succeeded in identifying shorter test forms with comparable ability estimation accuracy, the existing test length of 240 items remains advisable because the shorter forms did not satisfy the structural constraints. Furthermore, the study highlighted the inherent adaptability of DRL to continuously learn about a test-taker's latent abilities and dynamically adjust to their response patterns, making it well-suited for personalized testing environments. This dynamic capability supports real-time decision-making in item selection, improving both efficiency and precision in ability estimation. Future research is encouraged to focus on expanding the item bank and leveraging advanced computational resources to enhance the algorithm's search capacity for shorter, structurally compliant test forms.
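A bare-bones sketch of the combinatorial MDP framing: the state is the set of items selected so far, actions add items, and the terminal reward scores blueprint compliance. The bank, blueprint, and reward below are hypothetical, and a learned policy (e.g., a DQN) would replace the random one used here.

```python
import numpy as np

rng = np.random.default_rng(3)
N_BANK, TARGET_LEN = 200, 30
difficulty = rng.normal(size=N_BANK)
content = rng.integers(0, 4, size=N_BANK)   # four content areas, equal-share blueprint

def blueprint_reward(state):
    items = np.array(sorted(state))
    mix = np.bincount(content[items], minlength=4) / len(items)
    # penalize deviation from the content blueprint and off-center mean difficulty
    return -np.abs(mix - 0.25).sum() - abs(difficulty[items].mean())

def step(state, action):
    """State = set of selected item indices; action = index of the next item to add."""
    state = state | {action}
    done = len(state) == TARGET_LEN
    return state, (blueprint_reward(state) if done else 0.0), done

state, done, reward = set(), False, 0.0
while not done:
    action = int(rng.choice([i for i in range(N_BANK) if i not in state]))
    state, reward, done = step(state, action)   # random policy placeholder
print(len(state), round(reward, 3))
```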
{"title":"Using Deep Reinforcement Learning to Decide Test Length.","authors":"James Zoucha, Igor Himelfarb, Nai-En Tang","doi":"10.1177/00131644251332972","DOIUrl":"https://doi.org/10.1177/00131644251332972","url":null,"abstract":"<p><p>This study explored the application of deep reinforcement learning (DRL) as an innovative approach to optimize test length. The primary focus was to evaluate whether the current length of the National Board of Chiropractic Examiners Part I Exam is justified. By modeling the problem as a combinatorial optimization task within a Markov Decision Process framework, an algorithm capable of constructing test forms from a finite set of items while adhering to critical structural constraints, such as content representation and item difficulty distribution, was used. The findings reveal that although the DRL algorithm was successful in identifying shorter test forms that maintained comparable ability estimation accuracy, the existing test length of 240 items remains advisable as we found shorter test forms did not maintain structural constraints. Furthermore, the study highlighted the inherent adaptability of DRL to continuously learn about a test-taker's latent abilities and dynamically adjust to their response patterns, making it well-suited for personalized testing environments. This dynamic capability supports real-time decision-making in item selection, improving both efficiency and precision in ability estimation. Future research is encouraged to focus on expanding the item bank and leveraging advanced computational resources to enhance the algorithm's search capacity for shorter, structurally compliant test forms.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251332972"},"PeriodicalIF":2.1,"publicationDate":"2025-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12049363/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143988676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating Change in Adjusted R-Square and R-Square Indices: A Latent Variable Method Application.
Pub Date: 2025-04-11. DOI: 10.1177/00131644251329178
Tenko Raykov, Christine DiStefano
A procedure for interval estimation of the difference in the adjusted R-square index for nested linear models is discussed. The method yields as a byproduct confidence intervals for their standard R-square difference, as well as for the adjusted and standard R-squares associated with each model. The resulting interval estimate of the difference in adjusted R-square represents a useful and informative complement to the commonly used R-square change statistic and its significance test in model selection and contains substantially more information than that test. The outlined procedure is readily employed with popular software in empirical educational and psychological studies and is illustrated with numerical data.
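The article derives its interval via a latent variable method; as a simple stand-in, the sketch below computes the adjusted R-square difference for nested OLS models with statsmodels and brackets it with a percentile bootstrap. Data and effect sizes are simulated for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 0.5 * x1 + 0.2 * x2 + rng.normal(size=n)

def adj_r2_diff(idx):
    """Adjusted R-square gain from adding x2 to the model with x1."""
    X1 = sm.add_constant(x1[idx])
    X2 = sm.add_constant(np.column_stack([x1[idx], x2[idx]]))
    return (sm.OLS(y[idx], X2).fit().rsquared_adj
            - sm.OLS(y[idx], X1).fit().rsquared_adj)

boot = [adj_r2_diff(rng.integers(0, n, n)) for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(round(adj_r2_diff(np.arange(n)), 4), (round(lo, 4), round(hi, 4)))
```

Unlike the significance test of the R-square change alone, an interval like this conveys the plausible magnitude of the improvement, which is the complement the article argues for.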
{"title":"Evaluating Change in Adjusted <i>R</i>-Square and <i>R</i>-Square Indices: A Latent Variable Method Application.","authors":"Tenko Raykov, Christine DiStefano","doi":"10.1177/00131644251329178","DOIUrl":"https://doi.org/10.1177/00131644251329178","url":null,"abstract":"<p><p>A procedure for interval estimation of the difference in the adjusted <i>R</i>-square index for nested linear models is discussed. The method yields as a byproduct confidence intervals for their standard <i>R</i>-square difference, as well as for the adjusted and standard <i>R</i>-squares associated with each model. The resulting interval estimate of the difference in adjusted <i>R</i>-square represents a useful and informative complement to the commonly used <i>R</i>-square change statistic and its significance test in model selection and contains substantially more information than that test. The outlined procedure is readily employed with popular software in empirical educational and psychological studies and is illustrated with numerical data.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251329178"},"PeriodicalIF":2.1,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11993540/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143985479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Differential Item Functioning Effect Size Use for Validity Information.
Pub Date: 2025-04-01. Epub Date: 2024-11-22. DOI: 10.1177/00131644241293694
W Holmes Finch, Maria Dolores Hidalgo Montesinos, Brian F French, Maria Hernandez Finch
There has been an emphasis on effect sizes for differential item functioning (DIF) with the purpose of understanding the magnitude of the differences detected through statistical significance testing. Several different effect sizes have been suggested that correspond to the method used for analysis, as have different guidelines for interpretation. The purpose of this simulation study was to compare the performance of the described DIF effect size measures for quantifying and comparing the amount of DIF in two assessments. Several factors were manipulated that were thought to influence the effect sizes or are known to influence DIF detection. This study asked the following two questions. First, do the effect sizes accurately capture aggregate DIF across items? Second, do effect sizes accurately identify which assessment has the least amount of DIF? We highlight effect sizes that performed well across several simulated conditions. We also apply these effect sizes to a real data set to provide an example. Results of the study revealed that the log odds ratio of fixed effects ($\mathrm{Ln}\,\overline{\mathrm{OR}}_{\mathrm{FE}}$) and the variance of the Mantel-Haenszel log odds ratio ($\hat{\tau}^2$) were most accurate for identifying which test contains more DIF. We point to future directions for this work to aid the continued focus on effect sizes for understanding DIF magnitude.
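For orientation: the Mantel-Haenszel log odds ratio for a studied item pools 2x2 tables across matched score levels, and summaries like $\mathrm{Ln}\,\overline{\mathrm{OR}}_{\mathrm{FE}}$ and $\hat{\tau}^2$ then aggregate these per-item values across items. The sketch below uses hypothetical tables, and the unweighted mean stands in for the fixed-effects weighted mean.

```python
import numpy as np

def mantel_haenszel_log_or(tables):
    """Mantel-Haenszel common log odds ratio from K score-level 2x2 tables.

    Each table is [[a, b], [c, d]]: rows = reference/focal group,
    columns = correct/incorrect on the studied item.
    """
    t = np.asarray(tables, dtype=float)
    n = t.sum(axis=(1, 2))
    num = (t[:, 0, 0] * t[:, 1, 1] / n).sum()   # sum of a_k * d_k / n_k
    den = (t[:, 0, 1] * t[:, 1, 0] / n).sum()   # sum of b_k * c_k / n_k
    return np.log(num / den)

# hypothetical tables for one item at three matched score levels
tables = [[[30, 10], [20, 15]],
          [[25, 15], [18, 20]],
          [[12, 20], [8, 25]]]
print(round(mantel_haenszel_log_or(tables), 3))

# aggregating DIF across items: mean and variance of per-item MH log odds ratios
item_log_ors = np.array([0.12, -0.35, 0.60, 0.05])   # hypothetical per-item values
print(item_log_ors.mean(), item_log_ors.var(ddof=1))
```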
{"title":"Differential Item Functioning Effect Size Use for Validity Information.","authors":"W Holmes Finch, Maria Dolores Hidalgo Montesinos, Brian F French, Maria Hernandez Finch","doi":"10.1177/00131644241293694","DOIUrl":"10.1177/00131644241293694","url":null,"abstract":"<p><p>There has been an emphasis on effect sizes for differential item functioning (DIF) with the purpose to understand the magnitude of the differences that are detected through statistical significance testing. Several different effect sizes have been suggested that correspond to the method used for analysis, as have different guidelines for interpretation. The purpose of this simulation study was to compare the performance of the DIF effect size measures described for quantifying and comparing the amount of DIF in two assessments. Several factors were manipulated that were thought to influence the effect sizes or are known to influence DIF detection. This study asked the following two questions. First, do the effect sizes accurately capture aggregate DIF across items? Second, do effect sizes accurately identify which assessment has the least amount of DIF? We highlight effect sizes that had support for performing well across several simulated conditions. We also apply these effect sizes to a real data set to provide an example. Results of the study revealed that the log odds ratio of fixed effects (Ln <math> <mrow> <msub> <mrow> <mover><mrow><mi>OR</mi></mrow> <mo>¯</mo></mover> </mrow> <mrow><mi>FE</mi></mrow> </msub> </mrow> </math> ) and the variance of the Mantel-Haenszel log odds ratio ( <math> <mrow> <msup> <mrow> <mover><mrow><mi>τ</mi></mrow> <mo>^</mo></mover> </mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> ) were most accurate for identifying which test contains more DIF. We point to future directions with this work to aid the continued focus on effect sizes to understand DIF magnitude.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"258-276"},"PeriodicalIF":2.3,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11583394/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142709569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Field-Testing Multiple-Choice Questions With AI Examinees: English Grammar Items.
Pub Date: 2025-04-01. Epub Date: 2024-10-03. DOI: 10.1177/00131644241281053
Hotaka Maeda
Field-testing is an essential yet often resource-intensive step in the development of high-quality educational assessments. I introduce an innovative method for field-testing newly written exam items by substituting human examinees with artificially intelligent (AI) examinees. The proposed approach is demonstrated using 466 four-option multiple-choice English grammar questions. Pre-trained transformer language models are fine-tuned based on the 2-parameter logistic (2PL) item response model to respond like human test-takers. Each AI examinee is associated with a latent ability θ, and the item text is used to predict response selection probabilities for each of the four response options. For the best modeling approach identified, the overall correlation between the true and predicted 2PL correct response probabilities was .82 (bias = 0.00, root mean squared error = 0.18). The study results were promising, showing that item response data generated from AI can be used to calculate item proportion correct and item discrimination and to conduct item calibration with anchors, distractor analysis, dimensionality analysis, and latent trait scoring. However, the proposed approach did not achieve the level of accuracy obtainable with human examinee response data. If further refined, potential resource savings in transitioning from human to AI field-testing could be enormous. AI could shorten the field-testing timeline, prevent examinees from seeing low-quality field-test items in real exams, shorten test lengths, eliminate test security, item exposure, and sample size concerns, reduce overall cost, and help expand the item bank. Example Python code from this study is available on GitHub: https://github.com/hotakamaeda/ai_field_testing1.
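To make the 2PL side of the pipeline concrete: in the article, a fine-tuned language model supplies each AI examinee's option probabilities; in the sketch below, a plain 2PL curve stands in for that model, and the closing item statistics mirror the kinds of analyses listed above. All parameters are simulated.

```python
import numpy as np

rng = np.random.default_rng(5)

def p_correct_2pl(theta, a, b):
    """2PL probability of a correct response given ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

n_examinees, n_items = 1000, 40
theta = rng.normal(size=n_examinees)          # latent ability per AI examinee
a = rng.uniform(0.5, 2.0, size=n_items)       # discriminations
b = rng.normal(size=n_items)                  # difficulties

P = p_correct_2pl(theta[:, None], a, b)
responses = (rng.random(P.shape) < P).astype(int)

# classical item statistics from the simulated response matrix
prop_correct = responses.mean(axis=0)
total = responses.sum(axis=1)
item_total_r = np.array([np.corrcoef(responses[:, j], total)[0, 1]
                         for j in range(n_items)])  # uncorrected item-total r
print(prop_correct[:5].round(2), item_total_r[:5].round(2))
```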
{"title":"Field-Testing Multiple-Choice Questions With AI Examinees: English Grammar Items.","authors":"Hotaka Maeda","doi":"10.1177/00131644241281053","DOIUrl":"10.1177/00131644241281053","url":null,"abstract":"<p><p>Field-testing is an essential yet often resource-intensive step in the development of high-quality educational assessments. I introduce an innovative method for field-testing newly written exam items by substituting human examinees with artificially intelligent (AI) examinees. The proposed approach is demonstrated using 466 four-option multiple-choice English grammar questions. Pre-trained transformer language models are fine-tuned based on the 2-parameter logistic (2PL) item response model to respond like human test-takers. Each AI examinee is associated with a latent ability θ, and the item text is used to predict response selection probabilities for each of the four response options. For the best modeling approach identified, the overall correlation between the true and predicted 2PL correct response probabilities was .82 (bias = 0.00, root mean squared error = 0.18). The study results were promising, showing that item response data generated from AI can be used to calculate item proportion correct, item discrimination, conduct item calibration with anchors, distractor analysis, dimensionality analysis, and latent trait scoring. However, the proposed approach did not achieve the level of accuracy obtainable with human examinee response data. If further refined, potential resource savings in transitioning from human to AI field-testing could be enormous. AI could shorten the field-testing timeline, prevent examinees from seeing low-quality field-test items in real exams, shorten test lengths, eliminate test security, item exposure, and sample size concerns, reduce overall cost, and help expand the item bank. Example Python code from this study is available on Github: https://github.com/hotakamaeda/ai_field_testing1.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"221-244"},"PeriodicalIF":2.3,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562880/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Assessing the Speed-Accuracy Tradeoff in Psychological Testing Using Experimental Manipulations.
Pub Date: 2025-04-01. Epub Date: 2024-10-07. DOI: 10.1177/00131644241271309
Tobias Alfers, Georg Gittler, Esther Ulitzsch, Steffi Pohl
The speed-accuracy tradeoff (SAT), where increased response speed often leads to decreased accuracy, is well established in experimental psychology. However, its implications for psychological assessments, especially in high-stakes settings, remain less understood. This study presents an experimental approach to investigate the SAT within a high-stakes spatial ability assessment. By manipulating instructions in a within-subjects design to induce speed variations in a large sample (N = 1,305) of applicants for an air traffic controller training program, we demonstrate the feasibility of manipulating working speed. Our findings confirm the presence of the SAT for most participants, suggesting that traditional ability scores may not fully reflect performance in high-stakes assessments. Importantly, we observed individual differences in the SAT, challenging the assumption of uniform SAT functions across test takers. These results highlight the complexity of interpreting high-stakes assessment outcomes and the influence of test conditions on performance dynamics. This study offers a valuable addition to the methodological toolkit for assessing the intraindividual relationship between speed and accuracy in psychological testing (including SAT research), providing a controlled approach while acknowledging the need to address potential confounders. Future research may apply this method across various cognitive domains, populations, and testing contexts to deepen our understanding of the SAT's broader implications for psychological measurement.
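One common way to quantify the individual SAT differences reported here is a random-slope mixed model, with accuracy regressed on speed within person; the sketch below uses statsmodels on simulated data, with a linear accuracy outcome as a simplification (trial-level accuracy is usually binary).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n_persons, n_trials = 100, 20

# each simulated person carries their own speed-accuracy tradeoff slope
slope = rng.normal(-0.5, 0.3, size=n_persons)
rows = []
for i in range(n_persons):
    speed = rng.normal(size=n_trials)
    acc = 0.8 + slope[i] * 0.1 * speed + rng.normal(0, 0.05, size=n_trials)
    rows += [{"person": i, "speed": s, "accuracy": a} for s, a in zip(speed, acc)]
df = pd.DataFrame(rows)

# random slopes capture individual differences in the tradeoff
m = smf.mixedlm("accuracy ~ speed", df, groups=df["person"],
                re_formula="~speed").fit()
print(m.summary())
```

A non-trivial random-slope variance in such a model corresponds to the article's finding that SAT functions are not uniform across test takers.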
{"title":"Assessing the Speed-Accuracy Tradeoff in Psychological Testing Using Experimental Manipulations.","authors":"Tobias Alfers, Georg Gittler, Esther Ulitzsch, Steffi Pohl","doi":"10.1177/00131644241271309","DOIUrl":"10.1177/00131644241271309","url":null,"abstract":"<p><p>The speed-accuracy tradeoff (SAT), where increased response speed often leads to decreased accuracy, is well established in experimental psychology. However, its implications for psychological assessments, especially in high-stakes settings, remain less understood. This study presents an experimental approach to investigate the SAT within a high-stakes spatial ability assessment. By manipulating instructions in a within-subjects design to induce speed variations in a large sample (<i>N</i> = 1,305) of applicants for an air traffic controller training program, we demonstrate the feasibility of manipulating working speed. Our findings confirm the presence of the SAT for most participants, suggesting that traditional ability scores may not fully reflect performance in high-stakes assessments. Importantly, we observed individual differences in the SAT, challenging the assumption of uniform SAT functions across test takers. These results highlight the complexity of interpreting high-stakes assessment outcomes and the influence of test conditions on performance dynamics. This study offers a valuable addition to the methodological toolkit for assessing the intraindividual relationship between speed and accuracy in psychological testing (including SAT research), providing a controlled approach while acknowledging the need to address potential confounders. Future research may apply this method across various cognitive domains, populations, and testing contexts to deepen our understanding of the SAT's broader implications for psychological measurement.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"357-383"},"PeriodicalIF":2.3,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562887/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}