Impact of Parameter Predictability and Joint Modeling of Response Accuracy and Response Time on Ability Estimates
Maryam Pezeshki, Susan Embretson
Pub Date: 2025-02-26 | DOI: 10.1177/01466216251322646
To maintain test quality, a large supply of items is typically desired. Automatic item generation can reduce cost and labor, especially if the generated items have predictable item parameters, thereby possibly reducing or eliminating the need for empirical tryout. However, the effect of different levels of item parameter predictability on the accuracy of trait estimation using item response theory models is unclear. If predictability is lower, adding response time as a collateral source of information may mitigate the effect on trait estimation accuracy. The present study investigates the impact of varying item parameter predictability on trait estimation accuracy, along with the impact of adding response time as a collateral source of information. Results indicated that trait estimation accuracy using item family model-based item parameters differed only slightly from using known item parameters. Somewhat larger trait estimation errors resulted from using cognitive complexity features to predict item parameters. Further, adding response times to the model resulted in more accurate trait estimation for tests with lower item difficulty levels (e.g., achievement tests). Implications for item generation and the response processes aspect of validity are discussed.
{"title":"Impact of Parameter Predictability and Joint Modeling of Response Accuracy and Response Time on Ability Estimates.","authors":"Maryam Pezeshki, Susan Embretson","doi":"10.1177/01466216251322646","DOIUrl":"https://doi.org/10.1177/01466216251322646","url":null,"abstract":"<p><p>To maintain test quality, a large supply of items is typically desired. Automatic item generation can result in a reduction in cost and labor, especially if the generated items have predictable item parameters and thus possibly reducing or eliminating the need for empirical tryout. However, the effect of different levels of item parameter predictability on the accuracy of trait estimation using item response theory models is unclear. If predictability is lower, adding response time as a collateral source of information may mitigate the effect on trait estimation accuracy. The present study investigates the impact of varying item parameter predictability on trait estimation accuracy, along with the impact of adding response time as a collateral source of information. Results indicated that trait estimation accuracy using item family model-based item parameters differed only slightly from using known item parameters. Somewhat larger trait estimation errors resulted from using cognitive complexity features to predict item parameters. Further, adding response times to the model resulted in more accurate trait estimation for tests with lower item difficulty levels (e.g., achievement tests). Implications for item generation and response processes aspect of validity are discussed.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251322646"},"PeriodicalIF":1.0,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11866334/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143543104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Few and Different: Detecting Examinees With Preknowledge Using Extended Isolation Forests
Nate R Smith, Lisa A Keller, Richard A Feinberg, Chunyan Liu
Pub Date: 2025-02-20 | DOI: 10.1177/01466216251320403
Item preknowledge refers to the case where examinees have advance knowledge of test material before taking the examination. When examinees have item preknowledge, the scores that result from those item responses are not true reflections of the examinee's proficiency. Further, this contamination in the data also affects the item parameter estimates and therefore the scores of all examinees, regardless of whether they had prior knowledge. To ensure the validity of test scores, it is essential to identify both issues: compromised items (CIs) and examinees with preknowledge (EWPs). In some cases, the CIs are known, and the task is reduced to determining the EWPs. However, because of the potential threat to validity, it is critical for high-stakes testing programs to have a process for routinely monitoring for evidence of EWPs, often when CIs are unknown. Further, even knowing that specific items may have been compromised does not guarantee that any examinees had prior access to those items, or that the examinees who did have prior access know how to use the preknowledge effectively. Therefore, this paper attempts to use response behavior to identify item preknowledge without knowledge of which items may or may not have been compromised. While most research in this area has relied on traditional psychometric models, we investigate the utility of an unsupervised machine learning algorithm, the extended isolation forest (EIF), to detect EWPs. As in previous research, the response behaviors analyzed are response time (RT) and response accuracy (RA).
{"title":"Few and Different: Detecting Examinees With Preknowledge Using Extended Isolation Forests.","authors":"Nate R Smith, Lisa A Keller, Richard A Feinberg, Chunyan Liu","doi":"10.1177/01466216251320403","DOIUrl":"10.1177/01466216251320403","url":null,"abstract":"<p><p>Item preknowledge refers to the case where examinees have advanced knowledge of test material prior to taking the examination. When examinees have item preknowledge, the scores that result from those item responses are not true reflections of the examinee's proficiency. Further, this contamination in the data also has an impact on the item parameter estimates and therefore has an impact on scores for all examinees, regardless of whether they had prior knowledge. To ensure the validity of test scores, it is essential to identify both issues: compromised items (CIs) and examinees with preknowledge (EWPs). In some cases, the CIs are known, and the task is reduced to determining the EWPs. However, due to the potential threat to validity, it is critical for high-stakes testing programs to have a process for routinely monitoring for evidence of EWPs, often when CIs are unknown. Further, even knowing that specific items may have been compromised does not guarantee that any examinees had prior access to those items, or that those examinees that did have prior access know how to effectively use the preknowledge. Therefore, this paper attempts to use response behavior to identify item preknowledge without knowledge of which items may or may not have been compromised. While most research in this area has relied on traditional psychometric models, we investigate the utility of an unsupervised machine learning algorithm, extended isolation forest (EIF), to detect EWPs. Similar to previous research, the response behavior being analyzed is response time (RT) and response accuracy (RA).</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251320403"},"PeriodicalIF":1.0,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11843570/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143484553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Application of Bayesian Decision Theory in Detecting Test Fraud
Sandip Sinharay, Matthew S Johnson
Pub Date: 2025-01-27 | DOI: 10.1177/01466216251316559
This article suggests a new approach based on Bayesian decision theory (e.g., Cronbach & Gleser, 1965; Ferguson, 1967) for detection of test fraud. The approach leads to a simple decision rule that involves the computation of the posterior probability that an examinee committed test fraud given the data. The suggested approach was applied to a real data set that involved actual test fraud.
{"title":"Application of Bayesian Decision Theory in Detecting Test Fraud.","authors":"Sandip Sinharay, Matthew S Johnson","doi":"10.1177/01466216251316559","DOIUrl":"https://doi.org/10.1177/01466216251316559","url":null,"abstract":"<p><p>This article suggests a new approach based on Bayesian decision theory (e.g., Cronbach & Gleser, 1965; Ferguson, 1967) for detection of test fraud. The approach leads to a simple decision rule that involves the computation of the posterior probability that an examinee committed test fraud given the data. The suggested approach was applied to a real data set that involved actual test fraud.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251316559"},"PeriodicalIF":1.0,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11773507/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143068974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating the Construct Validity of Instructional Manipulation Checks as Measures of Careless Responding to Surveys
Mark C Ramsey, Nathan A Bowling, Preston S Menke
Pub Date: 2024-11-01 | Epub Date: 2024-09-20 | DOI: 10.1177/01466216241284293
Careless responding measures are important for several purposes, whether for screening out careless responders or for research that treats careless responding as a substantive variable. One such approach to assessing carelessness in surveys is the instructional manipulation check. Despite its apparent popularity, little is known about the construct validity of instructional manipulation checks as measures of careless responding. Initial results are inconclusive, and no study has thoroughly evaluated the validity of the instructional manipulation check as a measure of careless responding. Across two samples (N = 762), we evaluated the construct validity of the instructional manipulation check within a nomological network. We found that the instructional manipulation check converged poorly with other measures of careless responding, weakly predicted participants' inability to recognize study content, and did not display incremental validity over existing measures of careless responding. Additional analyses revealed that instructional manipulation checks performed poorly compared with single scores from other careless responding measures, and that screening data with alternative measures of careless responding produced gains in data quality similar to or greater than those from instructional manipulation checks. Based on the results of our studies, we do not recommend using instructional manipulation checks to assess or screen for careless responding to surveys.
{"title":"Evaluating the Construct Validity of Instructional Manipulation Checks as Measures of Careless Responding to Surveys.","authors":"Mark C Ramsey, Nathan A Bowling, Preston S Menke","doi":"10.1177/01466216241284293","DOIUrl":"10.1177/01466216241284293","url":null,"abstract":"<p><p>Careless responding measures are important for several purposes, whether it's screening for careless responding or for research centered on careless responding as a substantive variable. One such approach for assessing carelessness in surveys is the use of an instructional manipulation check. Despite its apparent popularity, little is known about the construct validity of instructional manipulation checks as measures of careless responding. Initial results are inconclusive, and no study has thoroughly evaluated the validity of the instructional manipulation check as a measure of careless responding. Across 2 samples (<i>N</i> = 762), we evaluated the construct validity of the instructional manipulation check under a nomological network. We found that the instructional manipulation check converged poorly with other measures of careless responding, weakly predicted participant inability to recognize study content, and did not display incremental validity over existing measures of careless responding. Additional analyses revealed that instructional manipulation checks performed poorly compared to single scores of other alternative careless responding measures and that screening data with alternative measures of careless responding produced greater or similar gains in data quality to instructional manipulation checks. Based on the results of our studies, we do not recommend using instructional manipulation checks to assess or screen for careless responding to surveys.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"48 7-8","pages":"341-356"},"PeriodicalIF":1.2,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11501094/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142510499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Estimating Test-Retest Reliability in the Presence of Self-Selection Bias and Learning/Practice Effects
William C M Belzak, J R Lockwood
Pub Date: 2024-11-01 | Epub Date: 2024-09-17 | DOI: 10.1177/01466216241284585
Test-retest reliability is often estimated using naturally occurring data from test repeaters. In settings such as admissions testing, test takers choose if and when to retake an assessment. This self-selection can bias estimates of test-retest reliability, both because individuals who choose to retest are typically unrepresentative of the broader testing population and because differences among test takers in learning or practice effects may increase with the time between test administrations. We develop a set of methods (sample weighting, polynomial regression, and Bayesian model averaging) for estimating test-retest reliability from observational data in ways that mitigate these sources of bias. We demonstrate the value of these methods for reducing bias and improving the precision of estimated reliability using empirical and simulated data, both based on more than 40,000 repeaters of a high-stakes English language proficiency test. Finally, these methods generalize to settings in which only a single, error-prone measurement is taken repeatedly over time and where self-selection and/or changes to the underlying construct may be at play.
{"title":"Estimating Test-Retest Reliability in the Presence of Self-Selection Bias and Learning/Practice Effects.","authors":"William C M Belzak, J R Lockwood","doi":"10.1177/01466216241284585","DOIUrl":"10.1177/01466216241284585","url":null,"abstract":"<p><p>Test-retest reliability is often estimated using naturally occurring data from test repeaters. In settings such as admissions testing, test takers choose if and when to retake an assessment. This self-selection can bias estimates of test-retest reliability because individuals who choose to retest are typically unrepresentative of the broader testing population and because differences among test takers in learning or practice effects may increase with time between test administrations. We develop a set of methods for estimating test-retest reliability from observational data that can mitigate these sources of bias, which include sample weighting, polynomial regression, and Bayesian model averaging. We demonstrate the value of using these methods for reducing bias and improving precision of estimated reliability using empirical and simulated data, both of which are based on more than 40,000 repeaters of a high-stakes English language proficiency test. Finally, these methods generalize to settings in which only a single, error-prone measurement is taken repeatedly over time and where self-selection and/or changes to the underlying construct may be at play.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"48 7-8","pages":"323-340"},"PeriodicalIF":1.2,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11528726/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142569674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Mark-Recapture Approach to Estimating Item Pool Compromise
Richard A Feinberg
Pub Date: 2024-11-01 | Epub Date: 2024-09-13 | DOI: 10.1177/01466216241284410
Testing organizations routinely investigate whether secure exam material has been compromised and is consequently invalid for scoring and for inclusion on future assessments. Beyond identifying individual compromised items, knowing the degree to which a form is compromised can inform decisions about whether the form should no longer be administered, or whether an item pool is compromised to such an extent that serious action on a broad scale must be taken to ensure the validity of score interpretations. Previous research on estimating the population of compromised items is sparse; however, population estimation is a long-studied problem in ecological research. In this note, we exemplify the utility of the mark-recapture technique for estimating the population of compromised items, first through a brief demonstration that introduces the fundamental concepts and then through a more realistic scenario that illustrates applicability to large-scale testing programs. An effective use of this technique would be to track changes in the estimated population longitudinally to inform operational test security strategies. Many variations on mark-recapture exist, and interpretation of the estimated population depends on several factors. Thus, this note is only meant to introduce mark-recapture as a useful tool for evaluating a testing organization's compromise mitigation procedures.
{"title":"A Mark-Recapture Approach to Estimating Item Pool Compromise.","authors":"Richard A Feinberg","doi":"10.1177/01466216241284410","DOIUrl":"10.1177/01466216241284410","url":null,"abstract":"<p><p>Testing organizations routinely investigate if secure exam material has been compromised and is consequently invalid for scoring and inclusion on future assessments. Beyond identifying individual compromised items, knowing the degree to which a form is compromised can inform decisions on whether the form can no longer be administered or when an item pool is compromised to such an extent that serious action on a broad scale must be taken to ensure the validity of score interpretations. Previous research on estimating the population of item compromise is sparse; however, this is a more generally long-studied problem in ecological research. In this note, we exemplify the utility of the mark-recapture technique to estimate the population of compromised items, first through a brief demonstration to introduce the fundamental concepts and then a more realistic scenario to illustrate applicability to large-scale testing programs. An effective use of this technique would be to longitudinally track changes in the estimated population to inform operational test security strategies. Many variations on mark-recapture exist and interpretation of the estimated population depends on several factors. Thus, this note is only meant to introduce the concept of mark-recapture as a useful application to evaluate a testing organization's compromise mitigation procedures.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"48 7-8","pages":"357-363"},"PeriodicalIF":1.2,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11528777/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142569673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effect of Differential Item Functioning on Computer Adaptive Testing Under Different Conditions
Merve Sahin Kursad, Seher Yalcin
Pub Date: 2024-11-01 | Epub Date: 2024-09-17 | DOI: 10.1177/01466216241284295
This study provides an overview of the effect of differential item functioning (DIF) on measurement precision, the test information function (TIF), and test effectiveness in computer adaptive tests (CATs). Simulated data were generated and analyzed in RStudio. In the data generation process, item pool size, DIF type, DIF percentage, the CAT item selection method, and the test termination rules were treated as manipulated conditions. Sample size, the ability parameter distribution, the item response theory (IRT) model, DIF size, the ability estimation method, the test starting rule, and the item usage frequency method were held as fixed CAT conditions. To examine the effect of DIF, measurement precision, the TIF, and test effectiveness were calculated. Results show that DIF has negative effects on measurement precision, the TIF, and test effectiveness. In particular, the percentage of DIF items and the DIF type have statistically significant effects on measurement precision.
{"title":"Effect of Differential Item Functioning on Computer Adaptive Testing Under Different Conditions.","authors":"Merve Sahin Kursad, Seher Yalcin","doi":"10.1177/01466216241284295","DOIUrl":"10.1177/01466216241284295","url":null,"abstract":"<p><p>This study provides an overview of the effect of differential item functioning (DIF) on measurement precision, test information function (TIF), and test effectiveness in computer adaptive tests (CATs). Simulated data for the study was produced and analyzed with the Rstudio. During the data generation process, item pool size, DIF type, DIF percentage, item selection method for CAT, and the test termination rules were considered changed conditions. Sample size and ability parameter distribution, Item Response Theory (IRT) model, DIF size, ability estimation method, test starting rule, and item usage frequency method regarding CAT conditions were considered fixed conditions. To examine the effect of DIF, measurement precision, TIF and test effectiveness were calculated. Results show DIF has negative effects on measurement precision, TIF, and test effectiveness. In particular, statistically significant effects of the percentage DIF items and DIF type are observed on measurement precision.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"48 7-8","pages":"303-322"},"PeriodicalIF":1.2,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11501093/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142510498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Item Response Modeling of Clinical Instruments With Filter Questions: Disentangling Symptom Presence and Severity
Brooke E Magnus
Pub Date: 2024-09-01 | Epub Date: 2024-06-17 | DOI: 10.1177/01466216241261709
Clinical instruments that use a filter/follow-up response format often produce data with excess zeros, especially when administered to nonclinical samples. When the unidimensional graded response model (GRM) is then fit to these data, parameter estimates and scale scores tend to suggest that the instrument measures individual differences only among individuals with severe levels of psychopathology. In such scenarios, alternative item response models that explicitly account for excess zeros may be more appropriate. The multivariate hurdle graded response model (MH-GRM), which has been previously proposed for handling zero-inflated questionnaire data, includes two latent variables: susceptibility, which underlies responses to the filter question, and severity, which underlies responses to the follow-up question. Using both simulated and empirical data, the current research shows that, compared to unidimensional GRMs, the MH-GRM is better able to capture individual differences across a wider range of psychopathology, and that when unidimensional GRMs are fit to data from questionnaires that include filter questions, individual differences at the lower end of the severity continuum largely go unmeasured. Practical implications are discussed.
{"title":"Item Response Modeling of Clinical Instruments With Filter Questions: Disentangling Symptom Presence and Severity.","authors":"Brooke E Magnus","doi":"10.1177/01466216241261709","DOIUrl":"10.1177/01466216241261709","url":null,"abstract":"<p><p>Clinical instruments that use a filter/follow-up response format often produce data with excess zeros, especially when administered to nonclinical samples. When the unidimensional graded response model (GRM) is then fit to these data, parameter estimates and scale scores tend to suggest that the instrument measures individual differences only among individuals with severe levels of the psychopathology. In such scenarios, alternative item response models that explicitly account for excess zeros may be more appropriate. The multivariate hurdle graded response model (MH-GRM), which has been previously proposed for handling zero-inflated questionnaire data, includes two latent variables: susceptibility, which underlies responses to the filter question, and severity, which underlies responses to the follow-up question. Using both simulated and empirical data, the current research shows that compared to unidimensional GRMs, the MH-GRM is better able to capture individual differences across a wider range of psychopathology, and that when unidimensional GRMs are fit to data from questionnaires that include filter questions, individual differences at the lower end of the severity continuum largely go unmeasured. Practical implications are discussed.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"48 6","pages":"235-256"},"PeriodicalIF":1.2,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11331747/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142009739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Auxiliary Item Information in the Item Parameter Estimation of a Graded Response Model for a Small to Medium Sample Size: Empirical Versus Hierarchical Bayes Estimation
Matthew Naveiras, Sun-Joo Cho
Pub Date: 2023-11-03 | DOI: 10.1177/01466216231209758
Marginal maximum likelihood estimation (MMLE) is commonly used for item response theory item parameter estimation. However, sufficiently large sample sizes are not always possible when studying rare populations. In this paper, empirical Bayes and hierarchical Bayes are presented as alternatives to MMLE in small sample sizes, using auxiliary item information to estimate the item parameters of a graded response model with higher accuracy. Empirical Bayes and hierarchical Bayes methods are compared with MMLE to determine under what conditions these Bayes methods can outperform MMLE, and to determine if hierarchical Bayes can act as an acceptable alternative to MMLE in conditions where MMLE is unable to converge. In addition, empirical Bayes and hierarchical Bayes methods are compared to show how hierarchical Bayes can result in estimates of posterior variance with greater accuracy than empirical Bayes by acknowledging the uncertainty of item parameter estimates. The proposed methods were evaluated via a simulation study. Simulation results showed that hierarchical Bayes methods can be acceptable alternatives to MMLE under various testing conditions, and we provide a guideline to indicate which methods would be recommended in different research situations. R functions are provided to implement these proposed methods.
{"title":"Using Auxiliary Item Information in the Item Parameter Estimation of a Graded Response Model for a Small to Medium Sample Size: Empirical Versus Hierarchical Bayes Estimation","authors":"Matthew Naveiras, Sun-Joo Cho","doi":"10.1177/01466216231209758","DOIUrl":"https://doi.org/10.1177/01466216231209758","url":null,"abstract":"Marginal maximum likelihood estimation (MMLE) is commonly used for item response theory item parameter estimation. However, sufficiently large sample sizes are not always possible when studying rare populations. In this paper, empirical Bayes and hierarchical Bayes are presented as alternatives to MMLE in small sample sizes, using auxiliary item information to estimate the item parameters of a graded response model with higher accuracy. Empirical Bayes and hierarchical Bayes methods are compared with MMLE to determine under what conditions these Bayes methods can outperform MMLE, and to determine if hierarchical Bayes can act as an acceptable alternative to MMLE in conditions where MMLE is unable to converge. In addition, empirical Bayes and hierarchical Bayes methods are compared to show how hierarchical Bayes can result in estimates of posterior variance with greater accuracy than empirical Bayes by acknowledging the uncertainty of item parameter estimates. The proposed methods were evaluated via a simulation study. Simulation results showed that hierarchical Bayes methods can be acceptable alternatives to MMLE under various testing conditions, and we provide a guideline to indicate which methods would be recommended in different research situations. R functions are provided to implement these proposed methods.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"21 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135819514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Bayesian Random Weights Linear Logistic Test Model for Within-Test Practice Effects
José H. Lozano, Javier Revuelta
Pub Date: 2023-11-01 | DOI: 10.1177/01466216231209752
The present paper introduces a random weights linear logistic test model for measuring individual differences in operation-specific practice effects within a single administration of a test. The proposed model extends the linear logistic test model of learning developed by Spada (1977) by treating the practice effects as random effects that vary across examinees. A Bayesian framework was used for model estimation and evaluation. A simulation study was conducted to examine the behavior of the model in combination with the Bayesian procedures. The results demonstrated the good performance of the estimation and evaluation methods. Additionally, an empirical study was conducted to illustrate the applicability of the model to real data. The model was applied to a sample of responses from a logical ability test, providing evidence of individual differences in operation-specific practice effects.
{"title":"A Bayesian Random Weights Linear Logistic Test Model for Within-Test Practice Effects","authors":"José H. Lozano, Javier Revuelta","doi":"10.1177/01466216231209752","DOIUrl":"https://doi.org/10.1177/01466216231209752","url":null,"abstract":"The present paper introduces a random weights linear logistic test model for the measurement of individual differences in operation-specific practice effects within a single administration of a test. The proposed model is an extension of the linear logistic test model of learning developed by Spada (1977) in which the practice effects are considered random effects varying across examinees. A Bayesian framework was used for model estimation and evaluation. A simulation study was conducted to examine the behavior of the model in combination with the Bayesian procedures. The results demonstrated the good performance of the estimation and evaluation methods. Additionally, an empirical study was conducted to illustrate the applicability of the model to real data. The model was applied to a sample of responses from a logical ability test providing evidence of individual differences in operation-specific practice effects.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"193 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135371926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}