Pub Date : 2023-08-01Epub Date: 2022-06-30DOI: 10.1177/00131644221108180
Anthony A Mangino, Jocelyn H Bolin, W Holmes Finch
This study seeks to compare fixed and mixed effects models for the purposes of predictive classification in the presence of multilevel data. The first part of the study utilizes a Monte Carlo simulation to compare fixed and mixed effects logistic regression and random forests. An applied examination of the prediction of student retention in the public-use U.S. PISA data set was considered to verify the simulation findings. Results of this study indicate fixed effects models performed comparably with mixed effects models across both the simulation and PISA examinations. Results broadly suggest that researchers should be cognizant of the type of predictors and data structure being used, as these factors carried more weight than did the model type.
{"title":"Fixed Effects or Mixed Effects Classifiers? Evidence From Simulated and Archival Data.","authors":"Anthony A Mangino, Jocelyn H Bolin, W Holmes Finch","doi":"10.1177/00131644221108180","DOIUrl":"10.1177/00131644221108180","url":null,"abstract":"<p><p>This study seeks to compare fixed and mixed effects models for the purposes of predictive classification in the presence of multilevel data. The first part of the study utilizes a Monte Carlo simulation to compare fixed and mixed effects logistic regression and random forests. An applied examination of the prediction of student retention in the public-use U.S. PISA data set was considered to verify the simulation findings. Results of this study indicate fixed effects models performed comparably with mixed effects models across both the simulation and PISA examinations. Results broadly suggest that researchers should be cognizant of the type of predictors and data structure being used, as these factors carried more weight than did the model type.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"710-739"},"PeriodicalIF":2.1,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311958/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9747521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-08-01Epub Date: 2022-08-13DOI: 10.1177/00131644221117193
Todd Zhou, Hong Jiao
Cheating detection in large-scale assessment received considerable attention in the extant literature. However, none of the previous studies in this line of research investigated the stacking ensemble machine learning algorithm for cheating detection. Furthermore, no study addressed the issue of class imbalance using resampling. This study explored the application of the stacking ensemble machine learning algorithm to analyze the item response, response time, and augmented data of test-takers to detect cheating behaviors. The performance of the stacking method was compared with that of two other ensemble methods (bagging and boosting) as well as six base non-ensemble machine learning algorithms. Issues related to class imbalance and input features were addressed. The study results indicated that stacking, resampling, and feature sets including augmented summary data generally performed better than its counterparts in cheating detection. Compared with other competing machine learning algorithms investigated in this study, the meta-model from stacking using discriminant analysis based on the top two base models-Gradient Boosting and Random Forest-generally performed the best when item responses and the augmented summary statistics were used as the input features with an under-sampling ratio of 10:1 among all the study conditions.
{"title":"Exploration of the Stacking Ensemble Machine Learning Algorithm for Cheating Detection in Large-Scale Assessment.","authors":"Todd Zhou, Hong Jiao","doi":"10.1177/00131644221117193","DOIUrl":"10.1177/00131644221117193","url":null,"abstract":"<p><p>Cheating detection in large-scale assessment received considerable attention in the extant literature. However, none of the previous studies in this line of research investigated the stacking ensemble machine learning algorithm for cheating detection. Furthermore, no study addressed the issue of class imbalance using resampling. This study explored the application of the stacking ensemble machine learning algorithm to analyze the item response, response time, and augmented data of test-takers to detect cheating behaviors. The performance of the stacking method was compared with that of two other ensemble methods (bagging and boosting) as well as six base non-ensemble machine learning algorithms. Issues related to class imbalance and input features were addressed. The study results indicated that stacking, resampling, and feature sets including augmented summary data generally performed better than its counterparts in cheating detection. Compared with other competing machine learning algorithms investigated in this study, the meta-model from stacking using discriminant analysis based on the top two base models-Gradient Boosting and Random Forest-generally performed the best when item responses and the augmented summary statistics were used as the input features with an under-sampling ratio of 10:1 among all the study conditions.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"831-854"},"PeriodicalIF":2.1,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311957/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9747522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-08-01DOI: 10.1177/00131644221111402
Xijuan Zhang, Linnan Zhou, Victoria Savalei
Zhang and Savalei proposed an alternative scale format to the Likert format, called the Expanded format. In this format, response options are presented in complete sentences, which can reduce acquiescence bias and method effects. The goal of the current study was to compare the psychometric properties of the Rosenberg Self-Esteem Scale (RSES) in the Expanded format and in two other alternative formats, relative to several versions of the traditional Likert format. We conducted two studies to compare the psychometric properties of the RSES across the different formats. We found that compared with the Likert format, the alternative formats tend to have a unidimensional factor structure, less response inconsistency, and comparable validity. In addition, we found that the Expanded format resulted in the best factor structure among the three alternative formats. Researchers should consider the Expanded format, especially when creating short psychological scales such as the RSES.
{"title":"Comparing the Psychometric Properties of a Scale Across Three Likert and Three Alternative Formats: An Application to the Rosenberg Self-Esteem Scale.","authors":"Xijuan Zhang, Linnan Zhou, Victoria Savalei","doi":"10.1177/00131644221111402","DOIUrl":"https://doi.org/10.1177/00131644221111402","url":null,"abstract":"<p><p>Zhang and Savalei proposed an alternative scale format to the Likert format, called the Expanded format. In this format, response options are presented in complete sentences, which can reduce acquiescence bias and method effects. The goal of the current study was to compare the psychometric properties of the Rosenberg Self-Esteem Scale (RSES) in the Expanded format and in two other alternative formats, relative to several versions of the traditional Likert format. We conducted two studies to compare the psychometric properties of the RSES across the different formats. We found that compared with the Likert format, the alternative formats tend to have a unidimensional factor structure, less response inconsistency, and comparable validity. In addition, we found that the Expanded format resulted in the best factor structure among the three alternative formats. Researchers should consider the Expanded format, especially when creating short psychological scales such as the RSES.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"649-683"},"PeriodicalIF":2.7,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/0c/99/10.1177_00131644221111402.PMC10311935.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9802113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-08-01Epub Date: 2022-08-18DOI: 10.1177/00131644221117194
Qi Helen Huang, Daniel M Bolt
Previous studies have demonstrated evidence of latent skill continuity even in tests intentionally designed for measurement of binary skills. In addition, the assumption of binary skills when continuity is present has been shown to potentially create a lack of invariance in item and latent ability parameters that may undermine applications. In this article, we examine measurement of growth as one such application, and consider multidimensional item response theory (MIRT) as a competing alternative. Motivated by prior findings concerning the effects of skill continuity, we study the relative robustness of cognitive diagnostic models (CDMs) and (M)IRT models in the measurement of growth under both binary and continuous latent skill distributions. We find CDMs to be a less robust way of quantifying growth under misspecification, and subsequently provide a real-data example suggesting underestimation of growth as a likely consequence. It is suggested that researchers should regularly attend to the assumptions associated with the use of latent binary skills and consider (M)IRT as a potentially more robust alternative if unsure of their discrete nature.
{"title":"Relative Robustness of CDMs and (M)IRT in Measuring Growth in Latent Skills.","authors":"Qi Helen Huang, Daniel M Bolt","doi":"10.1177/00131644221117194","DOIUrl":"10.1177/00131644221117194","url":null,"abstract":"<p><p>Previous studies have demonstrated evidence of latent skill continuity even in tests intentionally designed for measurement of binary skills. In addition, the assumption of binary skills when continuity is present has been shown to potentially create a lack of invariance in item and latent ability parameters that may undermine applications. In this article, we examine measurement of growth as one such application, and consider multidimensional item response theory (MIRT) as a competing alternative. Motivated by prior findings concerning the effects of skill continuity, we study the relative robustness of cognitive diagnostic models (CDMs) and (M)IRT models in the measurement of growth under both binary and continuous latent skill distributions. We find CDMs to be a less robust way of quantifying growth under misspecification, and subsequently provide a real-data example suggesting underestimation of growth as a likely consequence. It is suggested that researchers should regularly attend to the assumptions associated with the use of latent binary skills and consider (M)IRT as a potentially more robust alternative if unsure of their discrete nature.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"808-830"},"PeriodicalIF":2.1,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311955/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9747520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-08-01DOI: 10.1177/00131644221111076
Andrea H Stoevenbelt, Jelte M Wicherts, Paulette C Flore, Lorraine A T Phillips, Jakob Pietschnig, Bruno Verschuere, Martin Voracek, Inga Schwabe
When cognitive and educational tests are administered under time limits, tests may become speeded and this may affect the reliability and validity of the resulting test scores. Prior research has shown that time limits may create or enlarge gender gaps in cognitive and academic testing. On average, women complete fewer items than men when a test is administered with a strict time limit, whereas gender gaps are frequently reduced when time limits are relaxed. In this study, we propose that gender differences in test strategy might inflate gender gaps favoring men, and relate test strategy to stereotype threat effects under which women underperform due to the pressure of negative stereotypes about their performance. First, we applied a Bayesian two-dimensional item response theory (IRT) model to data obtained from two registered reports that investigated stereotype threat in mathematics, and estimated the latent correlation between underlying test strategy (here, completion factor, a proxy for working speed) and mathematics ability. Second, we tested the gender gap and assessed potential effects of stereotype threat on female test performance. We found a positive correlation between the completion factor and mathematics ability, such that more able participants dropped out later in the test. We did not observe a stereotype threat effect but found larger gender differences on the latent completion factor than on latent mathematical ability, suggesting that test strategies affect the gender gap in timed mathematics performance. We argue that if the effect of time limits on tests is not taken into account, this may lead to test unfairness and biased group comparisons, and urge researchers to consider these effects in either their analyses or study planning.
{"title":"Are Speeded Tests Unfair? Modeling the Impact of Time Limits on the Gender Gap in Mathematics.","authors":"Andrea H Stoevenbelt, Jelte M Wicherts, Paulette C Flore, Lorraine A T Phillips, Jakob Pietschnig, Bruno Verschuere, Martin Voracek, Inga Schwabe","doi":"10.1177/00131644221111076","DOIUrl":"https://doi.org/10.1177/00131644221111076","url":null,"abstract":"<p><p>When cognitive and educational tests are administered under time limits, tests may become speeded and this may affect the reliability and validity of the resulting test scores. Prior research has shown that time limits may create or enlarge gender gaps in cognitive and academic testing. On average, women complete fewer items than men when a test is administered with a strict time limit, whereas gender gaps are frequently reduced when time limits are relaxed. In this study, we propose that gender differences in test strategy might inflate gender gaps favoring men, and relate test strategy to stereotype threat effects under which women underperform due to the pressure of negative stereotypes about their performance. First, we applied a Bayesian two-dimensional item response theory (IRT) model to data obtained from two registered reports that investigated stereotype threat in mathematics, and estimated the latent correlation between underlying test strategy (here, completion factor, a proxy for working speed) and mathematics ability. Second, we tested the gender gap and assessed potential effects of stereotype threat on female test performance. We found a positive correlation between the completion factor and mathematics ability, such that more able participants dropped out later in the test. We did not observe a stereotype threat effect but found larger gender differences on the latent completion factor than on latent mathematical ability, suggesting that test strategies affect the gender gap in timed mathematics performance. We argue that if the effect of time limits on tests is not taken into account, this may lead to test unfairness and biased group comparisons, and urge researchers to consider these effects in either their analyses or study planning.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"684-709"},"PeriodicalIF":2.7,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311959/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10299044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-08-01Epub Date: 2022-07-02DOI: 10.1177/00131644221105819
Matthias von Davier, Ummugul Bezirhan
Viable methods for the identification of item misfit or Differential Item Functioning (DIF) are central to scale construction and sound measurement. Many approaches rely on the derivation of a limiting distribution under the assumption that a certain model fits the data perfectly. Typical DIF assumptions such as the monotonicity and population independence of item functions are present even in classical test theory but are more explicitly stated when using item response theory or other latent variable models for the assessment of item fit. The work presented here provides a robust approach for DIF detection that does not assume perfect model data fit, but rather uses Tukey's concept of contaminated distributions. The approach uses robust outlier detection to flag items for which adequate model data fit cannot be established.
{"title":"A Robust Method for Detecting Item Misfit in Large-Scale Assessments.","authors":"Matthias von Davier, Ummugul Bezirhan","doi":"10.1177/00131644221105819","DOIUrl":"10.1177/00131644221105819","url":null,"abstract":"<p><p>Viable methods for the identification of item misfit or Differential Item Functioning (DIF) are central to scale construction and sound measurement. Many approaches rely on the derivation of a limiting distribution under the assumption that a certain model fits the data perfectly. Typical DIF assumptions such as the monotonicity and population independence of item functions are present even in classical test theory but are more explicitly stated when using item response theory or other latent variable models for the assessment of item fit. The work presented here provides a robust approach for DIF detection that does not assume perfect model data fit, but rather uses Tukey's concept of contaminated distributions. The approach uses robust outlier detection to flag items for which adequate model data fit cannot be established.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"740-765"},"PeriodicalIF":2.1,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311954/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9747519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-08-01Epub Date: 2022-09-19DOI: 10.1177/00131644221109796
José H Lozano, Javier Revuelta
The present paper introduces a general multidimensional model to measure individual differences in learning within a single administration of a test. Learning is assumed to result from practicing the operations involved in solving the items. The model accounts for the possibility that the ability to learn may manifest differently for correct and incorrect responses, which allows for distinguishing different types of learning effects in the data. Model estimation and evaluation is based on a Bayesian framework. A simulation study is presented that examines the performance of the estimation and evaluation methods. The results show accuracy in parameter recovery as well as good performance in model evaluation and selection. An empirical study illustrates the applicability of the model to data from a logical ability test.
{"title":"A Bayesian General Model to Account for Individual Differences in Operation-Specific Learning Within a Test.","authors":"José H Lozano, Javier Revuelta","doi":"10.1177/00131644221109796","DOIUrl":"10.1177/00131644221109796","url":null,"abstract":"<p><p>The present paper introduces a general multidimensional model to measure individual differences in learning within a single administration of a test. Learning is assumed to result from practicing the operations involved in solving the items. The model accounts for the possibility that the ability to learn may manifest differently for correct and incorrect responses, which allows for distinguishing different types of learning effects in the data. Model estimation and evaluation is based on a Bayesian framework. A simulation study is presented that examines the performance of the estimation and evaluation methods. The results show accuracy in parameter recovery as well as good performance in model evaluation and selection. An empirical study illustrates the applicability of the model to data from a logical ability test.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"782-807"},"PeriodicalIF":2.1,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311956/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10300370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-08-01Epub Date: 2022-07-20DOI: 10.1177/00131644221104972
Tenko Raykov, James C Anthony, Natalja Menold
The population relationship between coefficient alpha and scale reliability is studied in the widely used setting of unidimensional multicomponent measuring instruments. It is demonstrated that for any set of component loadings on the common factor, regardless of the extent of their inequality, the discrepancy between alpha and reliability can be arbitrarily small in any considered population and hence practically ignorable. In addition, the set of parameter values where this discrepancy is negligible is shown to possess the same dimensionality as that of the underlying model parameter space. The article contributes to the measurement and related literature by pointing out that (a) approximate or strict loading identity is not a necessary condition for the utility of alpha as a trustworthy index of scale reliability, and (b) coefficient alpha can be a dependable reliability measure with any extent of inequality in the component loadings.
{"title":"On the Importance of Coefficient Alpha for Measurement Research: Loading Equality Is Not Necessary for Alpha's Utility as a Scale Reliability Index.","authors":"Tenko Raykov, James C Anthony, Natalja Menold","doi":"10.1177/00131644221104972","DOIUrl":"10.1177/00131644221104972","url":null,"abstract":"<p><p>The population relationship between coefficient alpha and scale reliability is studied in the widely used setting of unidimensional multicomponent measuring instruments. It is demonstrated that for any set of component loadings on the common factor, regardless of the extent of their inequality, the discrepancy between alpha and reliability can be arbitrarily small in any considered population and hence practically ignorable. In addition, the set of parameter values where this discrepancy is negligible is shown to possess the same dimensionality as that of the underlying model parameter space. The article contributes to the measurement and related literature by pointing out that (a) approximate or strict loading identity is not a necessary condition for the utility of alpha as a trustworthy index of scale reliability, and (b) coefficient alpha can be a dependable reliability measure with any extent of inequality in the component loadings.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"766-781"},"PeriodicalIF":2.1,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311953/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9747518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-06-01DOI: 10.1177/00131644221089857
E Damiano D'Urso, Jesper Tijmstra, Jeroen K Vermunt, Kim De Roover
Assessing the measurement model (MM) of self-report scales is crucial to obtain valid measurements of individuals' latent psychological constructs. This entails evaluating the number of measured constructs and determining which construct is measured by which item. Exploratory factor analysis (EFA) is the most-used method to evaluate these psychometric properties, where the number of measured constructs (i.e., factors) is assessed, and, afterward, rotational freedom is resolved to interpret these factors. This study assessed the effects of an acquiescence response style (ARS) on EFA for unidimensional and multidimensional (un)balanced scales. Specifically, we evaluated (a) whether ARS is captured as an additional factor, (b) the effect of different rotation approaches on the content and ARS factors recovery, and (c) the effect of extracting the additional ARS factor on the recovery of factor loadings. ARS was often captured as an additional factor in balanced scales when it was strong. For these scales, ignoring extracting this additional ARS factor, or rotating to simple structure when extracting it, harmed the recovery of the original MM by introducing bias in loadings and cross-loadings. These issues were avoided by using informed rotation approaches (i.e., target rotation), where (part of) the rotation target is specified according to a priori expectations on the MM. Not extracting the additional ARS factor did not affect the loading recovery in unbalanced scales. Researchers should consider the potential presence of ARS when assessing the psychometric properties of balanced scales and use informed rotation approaches when suspecting that an additional factor is an ARS factor.
{"title":"Awareness Is Bliss: How Acquiescence Affects Exploratory Factor Analysis.","authors":"E Damiano D'Urso, Jesper Tijmstra, Jeroen K Vermunt, Kim De Roover","doi":"10.1177/00131644221089857","DOIUrl":"https://doi.org/10.1177/00131644221089857","url":null,"abstract":"<p><p>Assessing the measurement model (MM) of self-report scales is crucial to obtain valid measurements of individuals' latent psychological constructs. This entails evaluating the number of measured constructs and determining which construct is measured by which item. Exploratory factor analysis (EFA) is the most-used method to evaluate these psychometric properties, where the number of measured constructs (i.e., factors) is assessed, and, afterward, rotational freedom is resolved to interpret these factors. This study assessed the effects of an acquiescence response style (ARS) on EFA for unidimensional and multidimensional (un)balanced scales. Specifically, we evaluated (a) whether ARS is captured as an additional factor, (b) the effect of different rotation approaches on the content and ARS factors recovery, and (c) the effect of extracting the additional ARS factor on the recovery of factor loadings. ARS was often captured as an additional factor in balanced scales when it was strong. For these scales, ignoring extracting this additional ARS factor, or rotating to simple structure when extracting it, harmed the recovery of the original MM by introducing bias in loadings and cross-loadings. These issues were avoided by using informed rotation approaches (i.e., target rotation), where (part of) the rotation target is specified according to a priori expectations on the MM. Not extracting the additional ARS factor did not affect the loading recovery in unbalanced scales. Researchers should consider the potential presence of ARS when assessing the psychometric properties of balanced scales and use informed rotation approaches when suspecting that an additional factor is an ARS factor.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 3","pages":"433-472"},"PeriodicalIF":2.7,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10177316/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9846850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-06-01Epub Date: 2022-05-23DOI: 10.1177/00131644221098021
Matthias von Davier, Lillian Tyack, Lale Khorramdel
Automated scoring of free drawings or images as responses has yet to be used in large-scale assessments of student achievement. In this study, we propose artificial neural networks to classify these types of graphical responses from a TIMSS 2019 item. We are comparing classification accuracy of convolutional and feed-forward approaches. Our results show that convolutional neural networks (CNNs) outperform feed-forward neural networks in both loss and accuracy. The CNN models classified up to 97.53% of the image responses into the appropriate scoring category, which is comparable to, if not more accurate, than typical human raters. These findings were further strengthened by the observation that the most accurate CNN models correctly classified some image responses that had been incorrectly scored by the human raters. As an additional innovation, we outline a method to select human-rated responses for the training sample based on an application of the expected response function derived from item response theory. This paper argues that CNN-based automated scoring of image responses is a highly accurate procedure that could potentially replace the workload and cost of second human raters for international large-scale assessments (ILSAs), while improving the validity and comparability of scoring complex constructed-response items.
{"title":"Scoring Graphical Responses in TIMSS 2019 Using Artificial Neural Networks.","authors":"Matthias von Davier, Lillian Tyack, Lale Khorramdel","doi":"10.1177/00131644221098021","DOIUrl":"10.1177/00131644221098021","url":null,"abstract":"<p><p>Automated scoring of free drawings or images as responses has yet to be used in large-scale assessments of student achievement. In this study, we propose artificial neural networks to classify these types of graphical responses from a TIMSS 2019 item. We are comparing classification accuracy of convolutional and feed-forward approaches. Our results show that convolutional neural networks (CNNs) outperform feed-forward neural networks in both loss and accuracy. The CNN models classified up to 97.53% of the image responses into the appropriate scoring category, which is comparable to, if not more accurate, than typical human raters. These findings were further strengthened by the observation that the most accurate CNN models correctly classified some image responses that had been incorrectly scored by the human raters. As an additional innovation, we outline a method to select human-rated responses for the training sample based on an application of the expected response function derived from item response theory. This paper argues that CNN-based automated scoring of image responses is a highly accurate procedure that could potentially replace the workload and cost of second human raters for international large-scale assessments (ILSAs), while improving the validity and comparability of scoring complex constructed-response items.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 3","pages":"556-585"},"PeriodicalIF":2.7,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10177318/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9475856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}