Faking in High-Stakes Personality Assessments: A Response-Time-Based Latent Response Mixture Modeling Approach
Pub Date: 2026-03-18 | DOI: 10.1177/00131644261422169
Timo Seitz, Esther Ulitzsch
When personality assessments are employed in high-stakes contexts, there is a risk that test-takers will provide overly positive descriptions of themselves. This response bias is known as faking and has often been addressed in latent variable models through an additional dimension capturing each test-taker's faking degree. Such models typically assume a homogeneous response strategy for all test-takers, with substantive traits and faking jointly influencing responses to all items. In this article, we present a latent response mixture item response theory (IRT) model of faking that accounts for changes in test-takers' response strategies over the course of the assessment. The model translates theoretical considerations about test-taker behavior into different model components for item responses and corresponding item-level response times (RT), thereby making it possible to account for, identify, and investigate different faking-related response strategies at the person-by-item level. In a parameter recovery study, we found that the model parameters can be estimated well under realistic conditions. We also applied the model to an empirical dataset (N = 1,824) from a job application context, showcasing its utility in real high-stakes assessment data. We conclude the article by discussing the role of the model for psychological measurement as well as substantive research.
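The person-by-item classification idea can be illustrated with a toy two-component calculation: given assumed component parameters for responses and log response times, Bayes' rule yields the posterior probability that a particular response was produced by a faking strategy. The function and all parameter values below are hypothetical and greatly simplified relative to the article's latent response mixture IRT model.

```python
import numpy as np
from scipy.stats import norm

def posterior_faking(y, log_rt, p_trait, p_fake, mu_rt_trait, mu_rt_fake,
                     sd_rt, pi_fake):
    """Posterior probability that one item response was generated by a faking
    strategy rather than trait-based responding, under a toy two-component model.

    y           : dichotomized item score (1 = desirable response)
    log_rt      : observed log response time for that item
    p_trait     : P(y = 1 | trait-based responding)
    p_fake      : P(y = 1 | faking), typically near 1 for desirable items
    mu_rt_*     : component means of log RT (faking assumed faster here)
    sd_rt       : common log-RT standard deviation (simplifying assumption)
    pi_fake     : prior probability of faking for this person-item pair
    """
    lik_fake = (p_fake ** y) * ((1 - p_fake) ** (1 - y)) * norm.pdf(log_rt, mu_rt_fake, sd_rt)
    lik_trait = (p_trait ** y) * ((1 - p_trait) ** (1 - y)) * norm.pdf(log_rt, mu_rt_trait, sd_rt)
    return pi_fake * lik_fake / (pi_fake * lik_fake + (1 - pi_fake) * lik_trait)

# A desirable response given unusually quickly looks more like faking.
print(posterior_faking(y=1, log_rt=np.log(2.0), p_trait=.55, p_fake=.95,
                       mu_rt_trait=np.log(5.0), mu_rt_fake=np.log(2.5),
                       sd_rt=.4, pi_fake=.30))
```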
Conditional Dependencies Between Response Time and Item Discrimination: An Item-Level Meta-Analysis
Pub Date: 2026-03-17 | DOI: 10.1177/00131644261426972
Joshua B Gilbert, William S Young, Zachary Himmelsbach, Esther Ulitzsch, Benjamin W Domingue
The use of process data, such as response time (RT), in psychometrics has generally focused on the relationship between speed and accuracy. The potential relationships between RT and item discrimination remain less explored. In this study, we propose a model for simultaneously estimating the relationships between RT and item discrimination at the person, item, and person-by-item (residual) levels and illustrate our approach through an item-level meta-analysis of 40 empirical data sets comprising 1.84 million item responses. We find no evidence of average differences in item discrimination between items of different time intensity or persons of different average RT, while residual RT strongly and negatively predicts item discrimination (pooled coef. = -.27% per 1% difference in RT, SE = .04, τ = .17). While heterogeneity is high, we find little evidence of moderation by overall data set characteristics. Flexible generalized additive models show that the relationship between residual RT and item discrimination is generally curvilinear, with discrimination maximized just below average RT and minimized at the extremes. Our results suggest that RT data can provide insights into the measurement properties of educational and psychological assessments, but that the relationships between RT and item discrimination are highly variable.
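The person-by-item (residual) RT component referred to above is commonly obtained by double-centering log response times, removing each person's overall speed and each item's time intensity. The sketch below shows only this step with simulated data; the function name is an illustrative assumption, and the article's full model additionally links residual RT to item discrimination.

```python
import numpy as np

def residual_log_rt(rt):
    """Person-by-item residual log response times for a persons x items array.

    Double-centering removes each person's overall speed and each item's time
    intensity; what remains is the residual RT component discussed above.
    Missing responses would require masked arrays and are omitted here.
    """
    log_rt = np.log(rt)
    person_mean = log_rt.mean(axis=1, keepdims=True)   # person speed
    item_mean = log_rt.mean(axis=0, keepdims=True)     # item time intensity
    return log_rt - person_mean - item_mean + log_rt.mean()

rng = np.random.default_rng(1)
rt = rng.lognormal(mean=3.0, sigma=0.5, size=(200, 20))  # 200 persons x 20 items, in seconds
resid = residual_log_rt(rt)
print(resid.mean(axis=0).round(3))  # item means are ~0 by construction
```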
Estimating Trends With Differential Item Functioning: A Comparison of Five IRT-Based Approaches
Pub Date: 2026-03-13 | DOI: 10.1177/00131644251408818
Oskar Engels, Oliver Lüdtke, Alexander Robitzsch
In longitudinal assessments, tests are frequently used to estimate trends over time. However, when item parameters lack invariance, time-point comparisons can be distorted, necessitating appropriate statistical methods to achieve accurate estimation. This study compares trend estimates using the two-parameter logistic (2PL) model under item parameter drift (IPD) across five trend-estimation approaches for two time points: First, concurrent calibration, which jointly estimates item parameters across multiple time points. Second, fixed calibration, which estimates item parameters at a single time point and fixes them at the other time point. Third, robust linking with Haberman and Haebara as linking methods with L_p or L_0 losses. Fourth, non-invariant items are detected using likelihood-ratio tests or the root mean square deviation statistic with fixed or data-driven cutoffs, and trend estimates are then recomputed using only the detected invariant items under partial invariance. Fifth, regularized estimation under a smooth Bayesian information criterion (SBIC) is applied, shrinking small or null IPD effects toward zero while estimating all others as nonzero. Bias and relative root mean square error (RMSE) were evaluated for the mean and SD at T2. An empirical example using synthetic longitudinal reading data, applying the trend-estimation approaches, is provided. The results indicate that the regularized estimation with SBIC performed best across conditions, maintaining low bias and RMSE, followed by robust linking methods. Specifically, Haberman linking with the L_0 loss function showed superior performance under unbalanced IPD, outperforming the partial invariance approaches. Concurrent and fixed calibration showed the poorest trend recovery under unbalanced IPD conditions.
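To convey the intuition behind robust linking with different losses, the following toy sketch estimates the trend (mean shift) from paired item difficulties by minimizing an L_p loss over a grid: p = 2 corresponds to an ordinary mean-mean link, while small p down-weights drifted items. It is a simplified stand-in for the Haberman/Haebara procedures compared in the article, with made-up item parameters.

```python
import numpy as np

def robust_shift(b_t1, b_t2, p=0.5):
    """Estimate the trend (latent mean shift) between two time points from paired
    2PL item difficulties by minimizing sum_j |b_t2[j] - b_t1[j] - shift|^p on a grid.

    p = 2 reproduces an ordinary mean-mean link; small p (< 1) down-weights items
    with large drift, which is the intuition behind robust L_p / L_0 linking.
    """
    grid = np.linspace(-3, 3, 6001)
    loss = (np.abs((b_t2 - b_t1)[:, None] - grid[None, :]) ** p).sum(axis=0)
    return grid[loss.argmin()]

rng = np.random.default_rng(7)
b1 = rng.normal(0, 1, 20)
b2 = b1 + 0.30                       # true trend of 0.30 logits
b2[:3] += rng.normal(1.0, 0.2, 3)    # three items with unbalanced positive drift
print(robust_shift(b1, b2, p=2.0))   # pulled away from 0.30 by the drifted items
print(robust_shift(b1, b2, p=0.5))   # close to the true trend
```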
Discriminating Between Attribute, Item-Position, and Wording Effects by the Congeneric and Tau-Equivalent Confirmatory Factor Analysis Models
Pub Date: 2026-03-11 | DOI: 10.1177/00131644261419028
Karl Schweizer, Xuezhu Ren, Tengfei Wang
The capability of confirmatory factor analysis to discriminate common systematic variation of attribute, item-position, and wording effects was investigated using the congeneric and tau-equivalent models. The simulated data, generated according to four approaches, included gradually increasing amounts of item-position or wording effect variation while the amount of attribute variation was kept constant. The congeneric model always indicated good model fit independently of the type and amount of additional common systematic variation, that is, there was no discrimination. In applications of the tau-equivalent model, the increase of the item-position or wording effect variation led to a change from good to bad model fit, that is, there was negative discrimination. In contrast, the additionally considered two-factor tau model discriminated positively. As a consequence of these results, we recommend the pre-screening of data for method effects.
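For readers who want to reproduce the general setup, the snippet below sketches one way to generate data with an item-position effect whose loadings grow over the course of the test. The generation scheme and all parameter values are illustrative assumptions, not the article's exact simulation design.

```python
import numpy as np

def simulate_position_effect(n=500, p=10, attr_load=0.6, pos_max=0.4, seed=0):
    """Generate continuous item scores from one attribute factor plus an
    item-position factor whose loadings grow linearly across the test.
    All values are illustrative, not the article's simulation design.
    """
    rng = np.random.default_rng(seed)
    attribute = rng.normal(size=(n, 1))
    position = rng.normal(size=(n, 1))
    pos_load = np.linspace(0.0, pos_max, p)   # position effect builds up over items
    return attribute * attr_load + position * pos_load + rng.normal(size=(n, p))

data = simulate_position_effect()
# Correlations among later items are inflated by the shared position factor.
print(np.corrcoef(data, rowvar=False).round(2))
```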
Estimation of Conditional Standard Errors of Measurement for MLE Scores in MST
Pub Date: 2026-02-25 | DOI: 10.1177/00131644261420391
Yuanyuan J Stirn, Won-Chan Lee
This paper proposes an information-based analytic method for calculating the conditional standard error of measurement (CSEM) in multistage testing (MST) using maximum likelihood estimation. The accuracy of the proposed method was evaluated by comparing CSEMs computed using the analytic method with those obtained from simulation across four MST designs. The results show that analytic and simulation-based CSEMs converge as test length increases, indicating that the proposed method provides a reliable approximation for longer tests. However, shorter tests and more complex MST designs require additional items to achieve comparable accuracy. The study also compared the proposed method with Park et al.'s analytic approach. Practical implications of the proposed method are discussed.
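The information-based logic can be illustrated at a single ability point: under the 2PL, the CSEM of the maximum likelihood estimate is the inverse square root of the test information accumulated over the items an examinee actually received along their MST route. The sketch below uses made-up item parameters and ignores routing probabilities across modules, which the proposed analytic method accounts for.

```python
import numpy as np

def csem_2pl(theta, a, b, D=1.702):
    """CSEM of the maximum likelihood ability estimate under the 2PL, given the
    items an examinee actually saw along their MST route:
    CSEM(theta) = 1 / sqrt(test information at theta)."""
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    info = np.sum((D * a) ** 2 * p * (1 - p))
    return 1.0 / np.sqrt(info)

# Routing module plus the second-stage module reached by an examinee at theta = 0.5
# (all item parameters are made up for illustration).
a = np.array([1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 1.2])
b = np.array([-0.5, 0.0, 0.3, 0.6, 0.4, 0.2, 0.8, 1.0])
print(round(csem_2pl(0.5, a, b), 3))
```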
Misclassification Produced by Rapid-Guessing Identification Methods and Their Suitability Under Various Conditions
Pub Date: 2026-02-23 | DOI: 10.1177/00131644261419426
Santeri Holopainen, Jari Metsämuuronen, Mikko-Jussi Laakso, Janne Kujala
Response Time Threshold Methods (RTTMs) are widely used to identify rapid-guessing behavior (RG) in low-stakes assessments, yet face two key challenges: (a) inevitable misclassifications due to overlapping response time distributions of engaged and disengaged responses, and (b) lack of agreement on which method to use under varying conditions. This simulation study evaluated five RTTMs. Item responses and response times were generated from either a one-component model without RG or a two-component mixture model with RG in the population. Distribution, item, and person parameters were varied. Results showed that when the population contained RG, the mixture lognormal distribution-based method (MLN) was the most robust approach and estimated precise thresholds closest to the time points at which the misclassification rates were minimized, even when bimodality was more difficult to detect. The cumulative proportion method (CUMP) was less robust but also accurate when successful, though less precise. In addition, when the population did not include RG, CUMP was the only method to set thresholds for a notable proportion of cases. The methods were generally more conservative than liberal, though the mixture response time quantile method (MRTQ) was neither. The results are discussed in the light of prior RG research and the methods' characteristics, and future directions are suggested. Ultimately, for practical settings, we recommend a six-step process for RG identification that utilizes both a mixture modeling approach (MLN or MRTQ) and the CUMP method.
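A minimal version of the mixture-lognormal idea fits a two-component normal mixture to log response times and places the threshold where the faster (rapid-guessing) component stops being the more likely generator. The sketch below is a simplified illustration with simulated data, not the exact MLN implementation evaluated in the study.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mln_threshold(rt):
    """Fit a two-component normal mixture to log response times and return the
    time below which the faster (rapid-guessing) component is the more likely
    generator, i.e., where the components' posterior probabilities cross 0.5."""
    log_rt = np.log(np.asarray(rt)).reshape(-1, 1)
    gm = GaussianMixture(n_components=2, random_state=0).fit(log_rt)
    fast = int(np.argmin(gm.means_.ravel()))
    grid = np.linspace(log_rt.min(), log_rt.max(), 2000).reshape(-1, 1)
    post_fast = gm.predict_proba(grid)[:, fast]
    crossing = grid[np.argmin(np.abs(post_fast - 0.5)), 0]
    return float(np.exp(crossing))

rng = np.random.default_rng(3)
rt = np.concatenate([rng.lognormal(0.5, 0.3, 300),    # rapid guesses (median ~1.6 s)
                     rng.lognormal(2.5, 0.4, 1700)])  # engaged responses (median ~12 s)
print(round(mln_threshold(rt), 2))
```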
From Agreement to Epistemic Alignment: A Signal Detection-Theoretic Model of Inter-Rater Reliability
Pub Date: 2026-02-16 | DOI: 10.1177/00131644261417643
Irene Gianeselli
Inter-rater reliability is commonly assessed using chance-corrected agreement coefficients such as Cohen's κ, which summarize concordance among categorical judgments without modeling the inferential processes that generate them. As a result, κ is sensitive to prevalence imbalance, task difficulty, and heterogeneity in decision criteria and is often misinterpreted as a proxy for diagnostic accuracy or rater competence. This paper reframes inter-rater reliability within a signal detection-theoretic (SDT) framework in which categorical judgments arise from comparisons between latent continuous evidence and rater-specific decision thresholds. Within this generative model, κ can be interpreted as a bounded transformation of discrete strategic variance (i.e., the observable consequence of dispersion in latent decision criteria) rather than as a direct measure of epistemic alignment. To make this structure explicit, we introduce the Strategic Convergence Index (SCI), a normalized functional summarizing convergence in rater decision thresholds under an SDT generative process. SCI is not proposed as a standalone agreement coefficient but as a model-implied quantity whose interpretation depends on explicit assumptions about evidence distributions and decision rules. Monte Carlo simulations show that κ varies systematically with prevalence and perceptual discriminability even when decision-policy alignment is held constant, whereas SCI selectively tracks epistemic alignment and remains invariant to these factors. Supplementary model-based analyses further illustrate that SCI can be recovered as a stable system-level property even under latent-truth uncertainty, whereas individual thresholds may be weakly identified. Together, these results clarify the epistemic meaning of κ and motivate a decomposition of inter-rater reliability into outcome-level agreement and process-level alignment. By linking classical agreement statistics to an explicit generative model of judgment, the Strategic Convergence framework advances reliability assessment from description toward explanation.
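The sensitivity of κ to prevalence under fixed decision policies can be demonstrated with a small SDT simulation: two raters apply constant thresholds to independently noisy evidence, yet κ shifts as the base rate changes. The parameter values below are arbitrary, and the snippet does not compute the SCI itself.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def simulate_kappa(prevalence, d_prime=1.5, thresholds=(0.6, 0.9), n=20000, seed=0):
    """Two raters classify the same cases by comparing independently noisy evidence
    to fixed, rater-specific thresholds (equal-variance SDT). Decision policies are
    held constant, yet Cohen's kappa changes with the base rate."""
    rng = np.random.default_rng(seed)
    truth = rng.random(n) < prevalence
    signal = np.where(truth, d_prime, 0.0)
    r1 = signal + rng.normal(0.0, 1.0, n) > thresholds[0]
    r2 = signal + rng.normal(0.0, 1.0, n) > thresholds[1]
    return cohen_kappa_score(r1, r2)

for prev in (0.5, 0.2, 0.05):
    print(prev, round(simulate_kappa(prev), 3))
```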
On the Consistency of Automatic Scoring with Large Language Models
Pub Date: 2026-02-16 | DOI: 10.1177/00131644261418138
Mingfeng Xue, Xingyao Xiao, Yunting Liu, Mark Wilson
Large language models (LLMs) have shown great potential in automatic scoring. However, due to model characteristics and variation in training materials and pipelines, scoring inconsistency can exist within an LLM and across LLMs when rating the same response multiple times. This study investigates the intra-LLM and inter-LLM consistency in scoring with five LLMs (i.e., Claude, DeepSeek, Gemini, GPT, and Qwen), variability under different temperatures, and their relationship with scoring accuracy. Moreover, a voting strategy that assembles information from different LLMs was proposed to address inconsistent scoring. Using constructed-response items from a science education assessment and open-source data from the Automated Student Assessment Prize (ASAP), we find that: (a) LLMs generally exhibited almost perfect intra-LLM consistency regardless of temperature; (b) inter-LLM consistency was moderate, with higher agreement observed for items that were easier to score; (c) intra-LLM consistency consistently exceeded inter-LLM consistency, supporting the expectation that within-model consistency represents an upper bound for cross-model agreement; (d) intra-LLM consistency was not associated with scoring accuracy, whereas inter-LLM consistency showed a strong positive relationship with accuracy; and (e) majority voting across LLMs improved scoring accuracy by leveraging complementary strengths of different models.
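The voting strategy can be illustrated with a small helper that takes the item scores assigned by several LLMs and returns the majority label; the median tie-break shown here is an assumption for illustration, not necessarily the rule used in the study.

```python
from collections import Counter
from statistics import median

def vote(scores):
    """Aggregate the scores several LLMs assigned to one response by majority vote;
    break ties with the median so the result stays on the rubric scale."""
    top = Counter(scores).most_common()
    if len(top) == 1 or top[0][1] > top[1][1]:
        return top[0][0]
    return int(round(median(scores)))

print(vote([2, 2, 3, 2, 1]))  # clear majority -> 2
print(vote([1, 2, 2, 3, 3]))  # tie between 2 and 3 -> median tie-break -> 2
```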
Comparing Different Approaches of (Not) Accounting for Rapid Guessing in Plausible Values Estimation
Pub Date: 2026-01-13 | DOI: 10.1177/00131644251395590
Jana Welling, Eva Zink, Timo Gnambs
Educational large-scale assessments provide information on ability differences between groups, informing policies and shaping educational decisions. However, some of these differences might partly reflect variations in test-taking motivation rather than in actual abilities. Existing approaches for mitigating the distorting effects of rapid guessing focus mainly on point estimates of abilities, although research questions often refer to latent variables. The present study seeks to (a) determine the bias introduced by rapid guessing in group comparisons based on plausible value estimates and (b) introduce and evaluate different approaches of handling rapid guessing in the estimation of plausible values. In a simulation study, four models were compared: (1) a baseline model did not account for rapid guessing, (2) a person-level model incorporated rapid guessing as a respondent characteristic in the background model, (3) a response-level model filtered responses with item response times lower than a predetermined threshold, and (4) a combined model merged the person- and response-level approaches. Results show that the response-level and combined model performed best while accounting for rapid guessing on the person level did not suffice. An empirical example using data from a German large-scale assessment (N = 478) demonstrates the applicability of all approaches in practice. Recommendations for future research are given to improve ability estimation.
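The response-level approach amounts to treating responses faster than an item-specific threshold as not administered before plausible values are drawn. The sketch below shows only this filtering step, with hypothetical thresholds; the background model and the PV estimation itself are omitted.

```python
import numpy as np

def filter_rapid_guesses(responses, rt, thresholds):
    """Set responses faster than an item-specific RT threshold to missing so they
    are treated as not administered in the subsequent measurement model.

    responses, rt : persons x items arrays; thresholds : per-item cutoffs in seconds.
    """
    filtered = responses.astype(float)
    filtered[rt < np.asarray(thresholds)[None, :]] = np.nan
    return filtered

resp = np.array([[1, 0, 1], [1, 1, 0]])
rt = np.array([[12.4, 1.1, 9.8], [8.0, 7.5, 0.9]])
print(filter_rapid_guesses(resp, rt, thresholds=[3.0, 3.0, 3.0]))
```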
Consistent Factor Score Regression: A Better Alternative for Uncorrected Factor Score Regression?
Pub Date: 2026-01-04 | DOI: 10.1177/00131644251399588
Jasper Bogaert, Wen Wei Loh, Yves Rosseel
Researchers in the behavioral, educational, and social sciences often aim to analyze relationships among latent variables. Structural equation modeling (SEM) is widely regarded as the gold standard for this purpose. A straightforward alternative for estimating the structural model parameters is uncorrected factor score regression (UFSR), where factor scores are first computed and then employed in regression or path analysis. Unfortunately, the most commonly used factor scores (i.e., Regression and Bartlett factor scores) may yield biased estimates and invalid inferences when using this approach. In recent years, factor score regression (FSR) has enjoyed several methodological advancements to address this inconsistency. Despite these advancements, the use of FSR with correlation-preserving factor scores, here termed consistent factor score regression (cFSR), has received limited attention. In this paper, we revisit cFSR and compare its advantages and disadvantages relative to other recent FSR and SEM methods. We conducted an extensive simulation study comparing cFSR with other estimation approaches, assessing their performance in terms of convergence rate, bias, efficiency, and type I error rate. The findings indicate that cFSR outperforms UFSR while maintaining the conceptual simplicity of UFSR. We encourage behavioral, educational, and social science researchers to avoid UFSR and adopt cFSR as an alternative to SEM.
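For orientation, the two factor score types mentioned above have closed-form weight matrices given the loadings, unique variances, and latent covariances. The sketch below computes them for a made-up two-factor model; the additional transformation that makes scores correlation-preserving, as used in cFSR, is not shown.

```python
import numpy as np

def factor_score_weights(lam, psi, phi):
    """Weight matrices for regression and Bartlett factor scores (scores = Y @ W).

    lam : p x q loading matrix
    psi : length-p vector of unique (residual) variances
    phi : q x q latent covariance matrix
    """
    theta = np.diag(psi)
    sigma = lam @ phi @ lam.T + theta                 # model-implied indicator covariance
    w_regression = np.linalg.solve(sigma, lam @ phi)  # Sigma^{-1} Lambda Phi
    theta_inv = np.diag(1.0 / psi)
    w_bartlett = theta_inv @ lam @ np.linalg.inv(lam.T @ theta_inv @ lam)
    return w_regression, w_bartlett

lam = np.array([[.8, 0], [.7, 0], [.6, 0], [0, .8], [0, .7], [0, .6]])
psi = 1 - (lam ** 2).sum(axis=1)
phi = np.array([[1.0, .4], [.4, 1.0]])
w_reg, w_bart = factor_score_weights(lam, psi, phi)
print(w_reg.round(3), w_bart.round(3), sep="\n\n")
```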