Factor Retention in Exploratory Multidimensional Item Response Theory
Pub Date: 2025-01-04. eCollection Date: 2025-08-01. DOI: 10.1177/00131644241306680
Changsheng Chen, Robbe D'hondt, Celine Vens, Wim Van den Noortgate
Multidimensional Item Response Theory (MIRT) is applied routinely in developing educational and psychological assessment tools, for instance, for exploring the multidimensional structure of items using exploratory MIRT. A critical decision in exploratory MIRT analyses is the number of factors to retain. Unfortunately, the comparative properties of statistical methods and of innovative Machine Learning (ML) methods for factor retention in exploratory MIRT analyses are still not clear. This study aims to fill this gap by comparing a selection of statistical and ML methods, including the Kaiser Criterion (KC), Empirical Kaiser Criterion (EKC), Parallel Analysis (PA), scree plot (OC and AF), Very Simple Structure (VSS; C1 and C2), Minimum Average Partial (MAP), Exploratory Graph Analysis (EGA), Random Forest (RF), Histogram-based Gradient Boosted Decision Trees (HistGBDT), eXtreme Gradient Boosting (XGBoost), and Artificial Neural Network (ANN). The comparison was performed on 720,000 dichotomous response data sets simulated under MIRT models with various between-item and within-item structures, reflecting characteristics of large-scale assessments. The results show that MAP, RF, HistGBDT, XGBoost, and ANN substantially outperform the other methods, with HistGBDT generally performing best. Furthermore, including the statistical methods' results as training features improves the ML methods' performance. The methods' correct-factoring proportions decrease as missingness increases or sample size decreases. KC, PA, EKC, and the scree plot (OC) tend to over-factor, whereas EGA, the scree plot (AF), and VSS (C1) tend to under-factor. We recommend that practitioners use both MAP and HistGBDT to determine the number of factors when applying exploratory MIRT.
{"title":"Factor Retention in Exploratory Multidimensional Item Response Theory.","authors":"Changsheng Chen, Robbe D'hondt, Celine Vens, Wim Van den Noortgate","doi":"10.1177/00131644241306680","DOIUrl":"10.1177/00131644241306680","url":null,"abstract":"<p><p>Multidimensional Item Response Theory (MIRT) is applied routinely in developing educational and psychological assessment tools, for instance, for exploring multidimensional structures of items using exploratory MIRT. A critical decision in exploratory MIRT analyses is the number of factors to retain. Unfortunately, the comparative properties of statistical methods and innovative Machine Learning (ML) methods for factor retention in exploratory MIRT analyses are still not clear. This study aims to fill this gap by comparing a selection of statistical and ML methods, including Kaiser Criterion (KC), Empirical Kaiser Criterion (EKC), Parallel Analysis (PA), scree plot (OC and AF), Very Simple Structure (VSS; C1 and C2), Minimum Average Partial (MAP), Exploratory Graph Analysis (EGA), Random Forest (RF), Histogram-based Gradient Boosted Decision Trees (HistGBDT), eXtreme Gradient Boosting (XGBoost), and Artificial Neural Network (ANN). The comparison was performed using 720,000 dichotomous response data sets simulated by the MIRT, for various between-item and within-item structures and considering characteristics of large-scale assessments. The results show that MAP, RF, HistGBDT, XGBoost, and ANN tremendously outperform other methods. Among them, HistGBDT generally performs better than other methods. Furthermore, including statistical methods' results as training features improves ML methods' performance. The methods' correct-factoring proportions decrease with an increase in missingness or a decrease in sample size. KC, PA, EKC, and scree plot (OC) are over-factoring, while EGA, scree plot (AF), and VSS (C1) are under-factoring. We recommend that practitioners use both MAP and HistGBDT to determine the number of factors when applying exploratory MIRT.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"672-695"},"PeriodicalIF":2.3,"publicationDate":"2025-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11699551/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142931009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Examination of ChatGPT's Performance as a Data Analysis Tool
Pub Date: 2025-01-03. eCollection Date: 2025-08-01. DOI: 10.1177/00131644241302721
Duygu Koçak
This study examines the performance of ChatGPT, developed by OpenAI and widely used as an AI-based conversational tool, as a data analysis tool through exploratory factor analysis (EFA). To this end, simulated data were generated under various data conditions, including normal distribution, response category, sample size, test length, factor loading, and measurement models. The generated data were analyzed using ChatGPT-4o twice with a 1-week interval under the same prompt, and the results were compared with those obtained using R code. In data analysis, the Kaiser-Meyer-Olkin (KMO) value, total variance explained, and the number of factors estimated using the empirical Kaiser criterion, Hull method, and Kaiser-Guttman criterion, as well as factor loadings, were calculated. The findings obtained from ChatGPT at two different times were found to be consistent with those obtained using R. Overall, ChatGPT demonstrated good performance for steps that require only computational decisions without involving researcher judgment or theoretical evaluation (such as KMO, total variance explained, and factor loadings). However, for multidimensional structures, although the estimated number of factors was consistent across analyses, biases were observed, suggesting that researchers should exercise caution in such decisions.
{"title":"Examination of ChatGPT's Performance as a Data Analysis Tool.","authors":"Duygu Koçak","doi":"10.1177/00131644241302721","DOIUrl":"10.1177/00131644241302721","url":null,"abstract":"<p><p>This study examines the performance of ChatGPT, developed by OpenAI and widely used as an AI-based conversational tool, as a data analysis tool through exploratory factor analysis (EFA). To this end, simulated data were generated under various data conditions, including normal distribution, response category, sample size, test length, factor loading, and measurement models. The generated data were analyzed using ChatGPT-4o twice with a 1-week interval under the same prompt, and the results were compared with those obtained using R code. In data analysis, the Kaiser-Meyer-Olkin (KMO) value, total variance explained, and the number of factors estimated using the empirical Kaiser criterion, Hull method, and Kaiser-Guttman criterion, as well as factor loadings, were calculated. The findings obtained from ChatGPT at two different times were found to be consistent with those obtained using R. Overall, ChatGPT demonstrated good performance for steps that require only computational decisions without involving researcher judgment or theoretical evaluation (such as KMO, total variance explained, and factor loadings). However, for multidimensional structures, although the estimated number of factors was consistent across analyses, biases were observed, suggesting that researchers should exercise caution in such decisions.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"641-671"},"PeriodicalIF":2.3,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11696938/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142931005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring the Evidence to Interpret Differential Item Functioning via Response Process Data
Pub Date: 2024-11-29. eCollection Date: 2025-08-01. DOI: 10.1177/00131644241298975
Ziying Li, Jinnie Shin, Huan Kuang, A Corinne Huggins-Manley
Evaluating differential item functioning (DIF) in assessments plays an important role in achieving measurement fairness across subgroups, such as gender and native language. However, relying solely on item response scores, as traditional DIF techniques do, makes DIF difficult for researchers and practitioners to interpret. Response process data, which carry valuable information about examinees' response behaviors, offer an opportunity to further interpret DIF items by examining differences in response processes. This study investigates the potential of response process data features for improving the interpretability of DIF items, focusing on gender DIF in data from the Programme for the International Assessment of Adult Competencies (PIAAC) 2012 computer-based numeracy assessment. We applied random forests and logistic regression with ridge regularization to investigate the association between process-data features and DIF items and evaluated feature importance to interpret DIF. In addition, we evaluated model performance across varying percentages of DIF items to reflect a range of practical scenarios. The results demonstrate that the combination of timing features and action-sequence features is informative for revealing response-process differences between groups, thereby enhancing the interpretability of DIF items. Overall, this study introduces a feasible procedure for leveraging response process data to understand and interpret DIF items, shedding light on potential reasons for the low agreement between DIF statistics and expert reviews and revealing potentially irrelevant factors, with the aim of enhancing measurement equity.
{"title":"Exploring the Evidence to Interpret Differential Item Functioning via Response Process Data.","authors":"Ziying Li, Jinnie Shin, Huan Kuang, A Corinne Huggins-Manley","doi":"10.1177/00131644241298975","DOIUrl":"10.1177/00131644241298975","url":null,"abstract":"<p><p>Evaluating differential item functioning (DIF) in assessments plays an important role in achieving measurement fairness across different subgroups, such as gender and native language. However, relying solely on the item response scores among traditional DIF techniques poses challenges for researchers and practitioners in interpreting DIF. Recently, response process data, which carry valuable information about examinees' response behaviors, offer an opportunity to further interpret DIF items by examining differences in response processes. This study aims to investigate the potential of response process data features in improving the interpretability of DIF items, with a focus on gender DIF using data from the Programme for International Assessment of Adult Competencies (PIAAC) 2012 computer-based numeracy assessment. We applied random forest and logistic regression with ridge regularization to investigate the association between process data features and DIF items, evaluating the important features to interpret DIF. In addition, we evaluated model performance across varying percentages of DIF items to reflect practical scenarios with different percentages of DIF items. The results demonstrate that the combination of timing features and action-sequence features is informative to reveal the response process differences between groups, thereby enhancing DIF item interpretability. Overall, this study introduces a feasible procedure to leverage response process data to understand and interpret DIF items, shedding light on potential reasons for the low agreement between DIF statistics and expert reviews and revealing potential irrelevant factors to enhance measurement equity.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"783-813"},"PeriodicalIF":2.3,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11607718/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142767507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On Latent Structure Examination of Behavioral Measuring Instruments in Complex Empirical Settings
Pub Date: 2024-10-07. eCollection Date: 2025-10-01. DOI: 10.1177/00131644241281049
Tenko Raykov, Khaled Alkherainej
A multiple-step procedure is outlined that can be used for examining the latent structure of behavior measurement instruments in complex empirical settings. The method permits one to study their latent structure after assessing the need to account for clustering effects and the necessity of its examination within individual levels of fixed factors, such as gender or group membership of substantive relevance. The approach is readily applicable with binary or binary-scored items using popular and widely available software. The described procedure is illustrated with empirical data from a student behavior screening instrument.
{"title":"On Latent Structure Examination of Behavioral Measuring Instruments in Complex Empirical Settings.","authors":"Tenko Raykov, Khaled Alkherainej","doi":"10.1177/00131644241281049","DOIUrl":"10.1177/00131644241281049","url":null,"abstract":"<p><p>A multiple-step procedure is outlined that can be used for examining the latent structure of behavior measurement instruments in complex empirical settings. The method permits one to study their latent structure after assessing the need to account for clustering effects and the necessity of its examination within individual levels of fixed factors, such as gender or group membership of substantive relevance. The approach is readily applicable with binary or binary-scored items using popular and widely available software. The described procedure is illustrated with empirical data from a student behavior screening instrument.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"983-999"},"PeriodicalIF":2.3,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562891/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Benefits of Using Maximal Reliability in Educational and Behavioral Research
Pub Date: 2024-10-01. Epub Date: 2023-12-27. DOI: 10.1177/00131644231215771
Tenko Raykov
This note is concerned with the benefits that can result from using the maximal reliability and optimal linear combination concepts in educational and psychological research. Within the widely used framework of unidimensional multi-component measuring instruments, it is demonstrated that the linear combination of their components possessing the highest possible reliability can exhibit a level of consistency considerably exceeding that of the overall sum score that is routinely employed in contemporary empirical research. This optimal linear combination can be particularly useful in circumstances where one or more scale components are associated with relatively large error variances, but their removal from the instrument would lead to a notable loss in validity due to construct underrepresentation. The discussion is illustrated with a numerical example.
{"title":"On the Benefits of Using Maximal Reliability in Educational and Behavioral Research.","authors":"Tenko Raykov","doi":"10.1177/00131644231215771","DOIUrl":"10.1177/00131644231215771","url":null,"abstract":"<p><p>This note is concerned with the benefits that can result from the use of the maximal reliability and optimal linear combination concepts in educational and psychological research. Within the widely used framework of unidimensional multi-component measuring instruments, it is demonstrated that the linear combination of their components that possesses the highest possible reliability can exhibit a level of consistency considerably exceeding that of their overall sum score that is nearly routinely employed in contemporary empirical research. This optimal linear combination can be particularly useful in circumstances where one or more scale components are associated with relatively large error variances, but their removal from the instrument can lead to a notable loss in validity due to construct underrepresentation. The discussion is illustrated with a numerical example.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"84 5","pages":"994-1011"},"PeriodicalIF":2.3,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11418609/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142336340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Investigating the Ordering Structure of Clustered Items Using Nonparametric Item Response Theory
Pub Date: 2024-09-06. DOI: 10.1177/00131644241274122
Letty Koopman, Johan Braeken
Educational and psychological tests with an ordered item structure enable efficient test administration procedures and allow for intuitive score interpretation and monitoring. The effectiveness of the measurement instrument relies to a large extent on the validated strength of its ordering structure. We define three increasingly strict types of ordering for the ordering structure of a measurement instrument with clustered items: a weak and a strong invariant cluster ordering, and a clustered invariant item ordering. Following a nonparametric item response theory (IRT) approach, we propose a procedure to evaluate the ordering structure of a clustered item set along this three-fold continuum of order invariance. The basis of the procedure is (a) the local assessment of pairwise conditional expectations at both the cluster and item level and (b) the global assessment of the number of Guttman errors through new generalizations of the H-coefficient for this item-cluster context. The procedure, readily implemented in R, is illustrated and applied to an empirical example. Suggestions for test practice, further methodological developments, and future research are discussed.
{"title":"Investigating the Ordering Structure of Clustered Items Using Nonparametric Item Response Theory","authors":"Letty Koopman, Johan Braeken","doi":"10.1177/00131644241274122","DOIUrl":"https://doi.org/10.1177/00131644241274122","url":null,"abstract":"Educational and psychological tests with an ordered item structure enable efficient test administration procedures and allow for intuitive score interpretation and monitoring. The effectiveness of the measurement instrument relies to a large extent on the validated strength of its ordering structure. We define three increasingly strict types of ordering for the ordering structure of a measurement instrument with clustered items: a weak and a strong invariant cluster ordering and a clustered invariant item ordering. Following a nonparametric item response theory (IRT) approach, we proposed a procedure to evaluate the ordering structure of a clustered item set along this three-fold continuum of order invariance. The basis of the procedure is (a) the local assessment of pairwise conditional expectations at both cluster and item level and (b) the global assessment of the number of Guttman errors through new generalizations of the H-coefficient for this item-cluster context. The procedure, readily implemented in R, is illustrated and applied to an empirical example. Suggestions for test practice, further methodological developments, and future research are discussed.","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"108 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Added Value of Subscores for Tests With Polytomous Items
Pub Date: 2024-08-07. DOI: 10.1177/00131644241268128
Kylie Gorney, Sandip Sinharay
Test-takers, policymakers, teachers, and institutions are increasingly demanding that testing programs provide more detailed feedback regarding test performance. As a result, there has been a growing interest in the reporting of subscores that potentially provide such detailed feedback. Haberman developed a method based on classical test theory for determining whether a subscore has added value over the total score. Sinharay conducted a detailed study using both real and simulated data and concluded that it is not common for subscores to have added value according to Haberman’s criterion. However, Sinharay almost exclusively dealt with data from tests with only dichotomous items. In this article, we show that it is more common for subscores to have added value in tests with polytomous items.
{"title":"Added Value of Subscores for Tests With Polytomous Items","authors":"Kylie Gorney, Sandip Sinharay","doi":"10.1177/00131644241268128","DOIUrl":"https://doi.org/10.1177/00131644241268128","url":null,"abstract":"Test-takers, policymakers, teachers, and institutions are increasingly demanding that testing programs provide more detailed feedback regarding test performance. As a result, there has been a growing interest in the reporting of subscores that potentially provide such detailed feedback. Haberman developed a method based on classical test theory for determining whether a subscore has added value over the total score. Sinharay conducted a detailed study using both real and simulated data and concluded that it is not common for subscores to have added value according to Haberman’s criterion. However, Sinharay almost exclusively dealt with data from tests with only dichotomous items. In this article, we show that it is more common for subscores to have added value in tests with polytomous items.","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"3 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141933506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating the Predictive Reliability of Neural Networks in Psychological Research With Random Datasets
Pub Date: 2024-07-25. DOI: 10.1177/00131644241262964
Yongtian Cheng, K. V. Petrides
Psychologists are emphasizing the importance of predictive conclusions. Machine learning methods, such as supervised neural networks, have been used in psychological studies because they naturally fit prediction tasks. However, we are concerned about whether neural networks fitted to random datasets (i.e., datasets in which there is no relationship between the ordinal independent variables and the continuous or binary dependent variable) can provide an acceptable level of predictive performance from a psychologist's perspective. Through a Monte Carlo simulation study, we found that this kind of erroneous conclusion is unlikely to be drawn as long as the sample size is larger than 50 for continuous dependent variables. However, when the dependent variable is binary, keeping the decision error below .05 requires a minimum sample size of 500 when the criterion is balanced accuracy ≥ .6 or ≥ .65, and 200 when the criterion is balanced accuracy ≥ .7. When the area under the curve (AUC) is used as the metric, sample sizes of 100, 200, and 500 are necessary when the minimum acceptable performance level is set at AUC ≥ .7, AUC ≥ .65, and AUC ≥ .6, respectively. These results can be used for sample size planning by psychologists who wish to apply neural networks and draw qualitatively reliable conclusions. Further directions and limitations of the study are also discussed.
{"title":"Evaluating The Predictive Reliability of Neural Networks in Psychological Research With Random Datasets","authors":"Yongtian Cheng, K. V. Petrides","doi":"10.1177/00131644241262964","DOIUrl":"https://doi.org/10.1177/00131644241262964","url":null,"abstract":"Psychologists are emphasizing the importance of predictive conclusions. Machine learning methods, such as supervised neural networks, have been used in psychological studies as they naturally fit prediction tasks. However, we are concerned about whether neural networks fitted with random datasets (i.e., datasets where there is no relationship between ordinal independent variables and continuous or binary-dependent variables) can provide an acceptable level of predictive performance from a psychologist’s perspective. Through a Monte Carlo simulation study, we found that this kind of erroneous conclusion is not likely to be drawn as long as the sample size is larger than 50 with continuous-dependent variables. However, when the dependent variable is binary, the minimum sample size is 500 when the criteria are balanced accuracy ≥ .6 or balanced accuracy ≥ .65, and the minimum sample size is 200 when the criterion is balanced accuracy ≥ .7 for a decision error less than .05. In the case where area under the curve (AUC) is used as a metric, a sample size of 100, 200, and 500 is necessary when the minimum acceptable performance level is set at AUC ≥ .7, AUC ≥ .65, and AUC ≥ .6, respectively. The results found by this study can be used for sample size planning for psychologists who wish to apply neural networks for a qualitatively reliable conclusion. Further directions and limitations of the study are also discussed.","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"39 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Studying Factorial Invariance With Nominal Items: A Note on a Latent Variable Modeling Procedure
Pub Date: 2024-06-24. DOI: 10.1177/00131644241256626
Tenko Raykov
A latent variable modeling procedure for studying factorial invariance and differential item functioning for multi-component measuring instruments with nominal items is discussed. The method is based on a multiple testing approach utilizing the false discovery rate concept and likelihood ratio tests. The procedure complements the Revuelta, Franco-Martinez, and Ximenez approach to factorial invariance examination, and permits localization of individual invariance violations. The outlined method does not require the selection of a reference observed variable and is illustrated with empirical data.
{"title":"Studying Factorial Invariance With Nominal Items: A Note on a Latent Variable Modeling Procedure","authors":"Tenko Raykov","doi":"10.1177/00131644241256626","DOIUrl":"https://doi.org/10.1177/00131644241256626","url":null,"abstract":"A latent variable modeling procedure for studying factorial invariance and differential item functioning for multi-component measuring instruments with nominal items is discussed. The method is based on a multiple testing approach utilizing the false discovery rate concept and likelihood ratio tests. The procedure complements the Revuelta, Franco-Martinez, and Ximenez approach to factorial invariance examination, and permits localization of individual invariance violations. The outlined method does not require the selection of a reference observed variable and is illustrated with empirical data.","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"33 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141501613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Note on Evaluation of Polytomous Item Locations With the Rating Scale Model and Testing Its Fit
Pub Date: 2024-06-24. DOI: 10.1177/00131644241259026
Tenko Raykov, Martin Pusic
A procedure is outlined for point and interval estimation of location parameters associated with polytomous items, or raters assessing studied subjects or cases, which follow the rating scale model. The method is developed within the framework of latent variable modeling, and is readily applied in empirical research using popular software. The approach permits testing the goodness of fit of this widely used model, which represents a rather parsimonious item response theory model as a means of description and explanation of an analyzed data set. The procedure allows examination of important aspects of the functioning of measuring instruments with polytomous ordinal items, which may also constitute person assessments furnished by teachers, counselors, judges, raters, or clinicians. The described method is illustrated using an empirical example.
{"title":"A Note on Evaluation of Polytomous Item Locations With the Rating Scale Model and Testing Its Fit","authors":"Tenko Raykov, Martin Pusic","doi":"10.1177/00131644241259026","DOIUrl":"https://doi.org/10.1177/00131644241259026","url":null,"abstract":"A procedure is outlined for point and interval estimation of location parameters associated with polytomous items, or raters assessing studied subjects or cases, which follow the rating scale model. The method is developed within the framework of latent variable modeling, and is readily applied in empirical research using popular software. The approach permits testing the goodness of fit of this widely used model, which represents a rather parsimonious item response theory model as a means of description and explanation of an analyzed data set. The procedure allows examination of important aspects of the functioning of measuring instruments with polytomous ordinal items, which may also constitute person assessments furnished by teachers, counselors, judges, raters, or clinicians. The described method is illustrated using an empirical example.","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"18 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141501614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}