Pub Date : 2023-10-01Epub Date: 2022-10-18DOI: 10.1177/00131644221129577
Sanaz Nazari, Walter L Leite, A Corinne Huggins-Manley
Social desirability bias (SDB) has been a major concern in educational and psychological assessments when measuring latent variables because it has the potential to introduce measurement error and bias in assessments. Person-fit indices can detect bias in the form of misfitted response vectors. The objective of this study was to compare the performance of 14 person-fit indices to identify SDB in simulated responses. The area under the curve (AUC) of receiver operating characteristic (ROC) curve analysis was computed to evaluate the predictive power of these statistics. The findings showed that the agreement statistic outperformed all other person-fit indices, while the disagreement statistic , dependability statistic , and the number of Guttman errors also demonstrated high AUCs to detect SDB. Recommendations for practitioners to use these fit indices are provided.
{"title":"A Comparison of Person-Fit Indices to Detect Social Desirability Bias.","authors":"Sanaz Nazari, Walter L Leite, A Corinne Huggins-Manley","doi":"10.1177/00131644221129577","DOIUrl":"10.1177/00131644221129577","url":null,"abstract":"<p><p>Social desirability bias (SDB) has been a major concern in educational and psychological assessments when measuring latent variables because it has the potential to introduce measurement error and bias in assessments. Person-fit indices can detect bias in the form of misfitted response vectors. The objective of this study was to compare the performance of 14 person-fit indices to identify SDB in simulated responses. The area under the curve (AUC) of receiver operating characteristic (ROC) curve analysis was computed to evaluate the predictive power of these statistics. The findings showed that the agreement statistic <math><mrow><mo>(</mo><mi>A</mi><mo>)</mo></mrow></math> outperformed all other person-fit indices, while the disagreement statistic <math><mrow><mo>(</mo><mi>D</mi><mo>)</mo></mrow></math>, dependability statistic <math><mrow><mo>(</mo><mi>E</mi><mo>)</mo></mrow></math>, and the number of Guttman errors <math><mrow><mo>(</mo><mi>G</mi><mo>)</mo></mrow></math> also demonstrated high AUCs to detect SDB. Recommendations for practitioners to use these fit indices are provided.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"907-928"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470160/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10208755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-10-01Epub Date: 2022-08-12DOI: 10.1177/00131644221116292
Stefanie A Wind
Rating scale analysis techniques provide researchers with practical tools for examining the degree to which ordinal rating scales (e.g., Likert-type scales or performance assessment rating scales) function in psychometrically useful ways. When rating scales function as expected, researchers can interpret ratings in the intended direction (i.e., lower ratings mean "less" of a construct than higher ratings), distinguish between categories in the scale (i.e., each category reflects a unique level of the construct), and compare ratings across elements of the measurement instrument, such as individual items. Although researchers have used these techniques in a variety of contexts, studies are limited that systematically explore their sensitivity to problematic rating scale characteristics (i.e., "rating scale malfunctioning"). I used a real data analysis and a simulation study to systematically explore the sensitivity of rating scale analysis techniques based on two popular polytomous item response theory (IRT) models: the partial credit model (PCM) and the generalized partial credit model (GPCM). Overall, results indicated that both models provide valuable information about rating scale threshold ordering and precision that can help researchers understand how their rating scales are functioning and identify areas for further investigation or revision. However, there were some differences between models in their sensitivity to rating scale malfunctioning in certain conditions. Implications for research and practice are discussed.
{"title":"Detecting Rating Scale Malfunctioning With the Partial Credit Model and Generalized Partial Credit Model.","authors":"Stefanie A Wind","doi":"10.1177/00131644221116292","DOIUrl":"10.1177/00131644221116292","url":null,"abstract":"<p><p>Rating scale analysis techniques provide researchers with practical tools for examining the degree to which ordinal rating scales (e.g., Likert-type scales or performance assessment rating scales) function in psychometrically useful ways. When rating scales function as expected, researchers can interpret ratings in the intended direction (i.e., lower ratings mean \"less\" of a construct than higher ratings), distinguish between categories in the scale (i.e., each category reflects a unique level of the construct), and compare ratings across elements of the measurement instrument, such as individual items. Although researchers have used these techniques in a variety of contexts, studies are limited that systematically explore their sensitivity to problematic rating scale characteristics (i.e., \"rating scale malfunctioning\"). I used a real data analysis and a simulation study to systematically explore the sensitivity of rating scale analysis techniques based on two popular polytomous item response theory (IRT) models: the partial credit model (PCM) and the generalized partial credit model (GPCM). Overall, results indicated that both models provide valuable information about rating scale threshold ordering and precision that can help researchers understand how their rating scales are functioning and identify areas for further investigation or revision. However, there were some differences between models in their sensitivity to rating scale malfunctioning in certain conditions. Implications for research and practice are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"953-983"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470161/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10506045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-10-01Epub Date: 2022-10-27DOI: 10.1177/00131644221130482
Georgios Sideridis, Ioannis Tsaousis, Hanan Ghamdi
The purpose of the present study was to provide the means to evaluate the "interval-scaling" assumption that governs the use of parametric statistics and continuous data estimators in self-report instruments that utilize Likert-type scaling. Using simulated and real data, the methodology to test for this important assumption is evaluated using the popular software Mplus 8.8. Evidence on meeting the assumption is provided using the Wald test and the equidistant index. It is suggested that routine evaluations of self-report instruments engage the present methodology so that the most appropriate estimator will be implemented when testing the construct validity of self-report instruments.
{"title":"Equidistant Response Options on Likert-Type Instruments: Testing the Interval Scaling Assumption Using Mplus.","authors":"Georgios Sideridis, Ioannis Tsaousis, Hanan Ghamdi","doi":"10.1177/00131644221130482","DOIUrl":"10.1177/00131644221130482","url":null,"abstract":"<p><p>The purpose of the present study was to provide the means to evaluate the \"interval-scaling\" assumption that governs the use of parametric statistics and continuous data estimators in self-report instruments that utilize Likert-type scaling. Using simulated and real data, the methodology to test for this important assumption is evaluated using the popular software Mplus 8.8. Evidence on meeting the assumption is provided using the Wald test and the equidistant index. It is suggested that routine evaluations of self-report instruments engage the present methodology so that the most appropriate estimator will be implemented when testing the construct validity of self-report instruments.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"885-906"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470166/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10357822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-10-01Epub Date: 2022-11-12DOI: 10.1177/00131644221132335
Séverin Lions, Pablo Dartnell, Gabriela Toledo, María Inés Godoy, Nora Córdova, Daniela Jiménez, Julie Lemarié
Even though the impact of the position of response options on answers to multiple-choice items has been investigated for decades, it remains debated. Research on this topic is inconclusive, perhaps because too few studies have obtained experimental data from large-sized samples in a real-world context and have manipulated the position of both correct response and distractors. Since multiple-choice tests' outcomes can be strikingly consequential and option position effects constitute a potential source of measurement error, these effects should be clarified. In this study, two experiments in which the position of correct response and distractors was carefully manipulated were performed within a Chilean national high-stakes standardized test, responded by 195,715 examinees. Results show small but clear and systematic effects of options position on examinees' responses in both experiments. They consistently indicate that a five-option item is slightly easier when the correct response is in A rather than E and when the most attractive distractor is after and far away from the correct response. They clarify and extend previous findings, showing that the appeal of all options is influenced by position. The existence and nature of a potential interference phenomenon between the options' processing are discussed, and implications for test development are considered.
{"title":"Position of Correct Option and Distractors Impacts Responses to Multiple-Choice Items: Evidence From a National Test.","authors":"Séverin Lions, Pablo Dartnell, Gabriela Toledo, María Inés Godoy, Nora Córdova, Daniela Jiménez, Julie Lemarié","doi":"10.1177/00131644221132335","DOIUrl":"10.1177/00131644221132335","url":null,"abstract":"<p><p>Even though the impact of the position of response options on answers to multiple-choice items has been investigated for decades, it remains debated. Research on this topic is inconclusive, perhaps because too few studies have obtained experimental data from large-sized samples in a real-world context and have manipulated the position of both correct response and distractors. Since multiple-choice tests' outcomes can be strikingly consequential and option position effects constitute a potential source of measurement error, these effects should be clarified. In this study, two experiments in which the position of correct response and distractors was carefully manipulated were performed within a Chilean national high-stakes standardized test, responded by 195,715 examinees. Results show small but clear and systematic effects of options position on examinees' responses in both experiments. They consistently indicate that a five-option item is slightly easier when the correct response is in A rather than E and when the most attractive distractor is after and far away from the correct response. They clarify and extend previous findings, showing that the appeal of all options is influenced by position. The existence and nature of a potential interference phenomenon between the options' processing are discussed, and implications for test development are considered.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"861-884"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470158/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10306861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-10-01Epub Date: 2022-07-21DOI: 10.1177/00131644221111993
W Holmes Finch
Psychometricians have devoted much research and attention to categorical item responses, leading to the development and widespread use of item response theory for the estimation of model parameters and identification of items that do not perform in the same way for examinees from different population subgroups (e.g., differential item functioning [DIF]). With the increasing use of computer-based measurement, use of items with a continuous response modality is becoming more common. Models for use with these items have been developed and refined in recent years, but less attention has been devoted to investigating DIF for these continuous response models (CRMs). Therefore, the purpose of this simulation study was to compare the performance of three potential methods for assessing DIF for CRMs, including regression, the MIMIC model, and factor invariance testing. Study results revealed that the MIMIC model provided a combination of Type I error control and relatively high power for detecting DIF. Implications of these findings are discussed.
{"title":"The Impact and Detection of Uniform Differential Item Functioning for Continuous Item Response Models.","authors":"W Holmes Finch","doi":"10.1177/00131644221111993","DOIUrl":"10.1177/00131644221111993","url":null,"abstract":"<p><p>Psychometricians have devoted much research and attention to categorical item responses, leading to the development and widespread use of item response theory for the estimation of model parameters and identification of items that do not perform in the same way for examinees from different population subgroups (e.g., differential item functioning [DIF]). With the increasing use of computer-based measurement, use of items with a continuous response modality is becoming more common. Models for use with these items have been developed and refined in recent years, but less attention has been devoted to investigating DIF for these continuous response models (CRMs). Therefore, the purpose of this simulation study was to compare the performance of three potential methods for assessing DIF for CRMs, including regression, the MIMIC model, and factor invariance testing. Study results revealed that the MIMIC model provided a combination of Type I error control and relatively high power for detecting DIF. Implications of these findings are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"929-952"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470162/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10506042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-10-01Epub Date: 2022-11-16DOI: 10.1177/00131644221136142
Kaiwen Man, Jeffrey R Harring
Preknowledge cheating jeopardizes the validity of inferences based on test results. Many methods have been developed to detect preknowledge cheating by jointly analyzing item responses and response times. Gaze fixations, an essential eye-tracker measure, can be utilized to help detect aberrant testing behavior with improved accuracy beyond using product and process data types in isolation. As such, this study proposes a mixture hierarchical model that integrates item responses, response times, and visual fixation counts collected from an eye-tracker (a) to detect aberrant test takers who have different levels of preknowledge and (b) to account for nuances in behavioral patterns between normally-behaved and aberrant examinees. A Bayesian approach to estimating model parameters is carried out via an MCMC algorithm. Finally, the proposed model is applied to experimental data to illustrate how the model can be used to identify test takers having preknowledge on the test items.
{"title":"Detecting Preknowledge Cheating via Innovative Measures: A Mixture Hierarchical Model for Jointly Modeling Item Responses, Response Times, and Visual Fixation Counts.","authors":"Kaiwen Man, Jeffrey R Harring","doi":"10.1177/00131644221136142","DOIUrl":"10.1177/00131644221136142","url":null,"abstract":"<p><p>Preknowledge cheating jeopardizes the validity of inferences based on test results. Many methods have been developed to detect preknowledge cheating by jointly analyzing item responses and response times. Gaze fixations, an essential eye-tracker measure, can be utilized to help detect aberrant testing behavior with improved accuracy beyond using product and process data types in isolation. As such, this study proposes a mixture hierarchical model that integrates item responses, response times, and visual fixation counts collected from an eye-tracker (a) to detect aberrant test takers who have different levels of preknowledge and (b) to account for nuances in behavioral patterns between normally-behaved and aberrant examinees. A Bayesian approach to estimating model parameters is carried out via an MCMC algorithm. Finally, the proposed model is applied to experimental data to illustrate how the model can be used to identify test takers having preknowledge on the test items.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"1059-1080"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470163/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10525106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The part of responses that is absent in the nonequivalent groups with anchor test (NEAT) design can be managed to a planned missing scenario. In the context of small sample sizes, we present a machine learning (ML)-based imputation technique called chaining random forests (CRF) to perform equating tasks within the NEAT design. Specifically, seven CRF-based imputation equating methods are proposed based on different data augmentation methods. The equating performance of the proposed methods is examined through a simulation study. Five factors are considered: (a) test length (20, 30, 40, 50), (b) sample size per test form (50 versus 100), (c) ratio of common/anchor items (0.2 versus 0.3), and (d) equivalent versus nonequivalent groups taking the two forms (no mean difference versus a mean difference of 0.5), and (e) three different types of anchors (random, easy, and hard), resulting in 96 conditions. In addition, five traditional equating methods, (1) Tucker method; (2) Levine observed score method; (3) equipercentile equating method; (4) circle-arc method; and (5) concurrent calibration based on Rasch model, were also considered, plus seven CRF-based imputation equating methods for a total of 12 methods in this study. The findings suggest that benefiting from the advantages of ML techniques, CRF-based methods that incorporate the equating result of the Tucker method, such as IMP_total_Tucker, IMP_pair_Tucker, and IMP_Tucker_cirlce methods, can yield more robust and trustable estimates for the "missingness" in an equating task and therefore result in more accurate equated scores than other counterparts in short-length tests with small samples.
{"title":"The NEAT Equating Via Chaining Random Forests in the Context of Small Sample Sizes: A Machine-Learning Method.","authors":"Zhehan Jiang, Yuting Han, Lingling Xu, Dexin Shi, Ren Liu, Jinying Ouyang, Fen Cai","doi":"10.1177/00131644221120899","DOIUrl":"10.1177/00131644221120899","url":null,"abstract":"<p><p>The part of responses that is absent in the nonequivalent groups with anchor test (NEAT) design can be managed to a planned missing scenario. In the context of small sample sizes, we present a machine learning (ML)-based imputation technique called chaining random forests (CRF) to perform equating tasks within the NEAT design. Specifically, seven CRF-based imputation equating methods are proposed based on different data augmentation methods. The equating performance of the proposed methods is examined through a simulation study. Five factors are considered: (a) test length (20, 30, 40, 50), (b) sample size per test form (50 versus 100), (c) ratio of common/anchor items (0.2 versus 0.3), and (d) equivalent versus nonequivalent groups taking the two forms (no mean difference versus a mean difference of 0.5), and (e) three different types of anchors (random, easy, and hard), resulting in 96 conditions. In addition, five traditional equating methods, (1) Tucker method; (2) Levine observed score method; (3) equipercentile equating method; (4) circle-arc method; and (5) concurrent calibration based on Rasch model, were also considered, plus seven CRF-based imputation equating methods for a total of 12 methods in this study. The findings suggest that benefiting from the advantages of ML techniques, CRF-based methods that incorporate the equating result of the Tucker method, such as IMP_total_Tucker, IMP_pair_Tucker, and IMP_Tucker_cirlce methods, can yield more robust and trustable estimates for the \"missingness\" in an equating task and therefore result in more accurate equated scores than other counterparts in short-length tests with small samples.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"984-1006"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470159/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10357823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-10-01Epub Date: 2022-10-15DOI: 10.1177/00131644221128341
Ivy Liu, Thomas Suesse, Samuel Harvey, Peter Yongqi Gu, Daniel Fernández, John Randal
The Mantel-Haenszel estimator is one of the most popular techniques for measuring differential item functioning (DIF). A generalization of this estimator is applied to the context of DIF to compare items by taking the covariance of odds ratio estimators between dependent items into account. Unlike the Item Response Theory, the method does not rely on the local item independence assumption which is likely to be violated when one item provides clues about the answer of another item. Furthermore, we use these (co)variance estimators to construct a hypothesis test to assess DIF for multiple items simultaneously. A simulation study is presented to assess the performance of several tests. Finally, the use of these DIF tests is illustrated via application to two real data sets.
{"title":"Generalized Mantel-Haenszel Estimators for Simultaneous Differential Item Functioning Tests.","authors":"Ivy Liu, Thomas Suesse, Samuel Harvey, Peter Yongqi Gu, Daniel Fernández, John Randal","doi":"10.1177/00131644221128341","DOIUrl":"10.1177/00131644221128341","url":null,"abstract":"<p><p>The Mantel-Haenszel estimator is one of the most popular techniques for measuring differential item functioning (DIF). A generalization of this estimator is applied to the context of DIF to compare items by taking the covariance of odds ratio estimators between dependent items into account. Unlike the Item Response Theory, the method does not rely on the local item independence assumption which is likely to be violated when one item provides clues about the answer of another item. Furthermore, we use these (co)variance estimators to construct a hypothesis test to assess DIF for multiple items simultaneously. A simulation study is presented to assess the performance of several tests. Finally, the use of these DIF tests is illustrated via application to two real data sets.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"1007-1032"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470165/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10506044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-10-01Epub Date: 2022-11-04DOI: 10.1177/00131644221132723
Jochen Ranger, Nico Schmidt, Anett Wolgast
Recent approaches to the detection of cheaters in tests employ detectors from the field of machine learning. Detectors based on supervised learning algorithms achieve high accuracy but require labeled data sets with identified cheaters for training. Labeled data sets are usually not available at an early stage of the assessment period. In this article, we discuss the approach of adapting a detector that was trained previously with a labeled training data set to a new unlabeled data set. The training and the new data set may contain data from different tests. The adaptation of detectors to new data or tasks is denominated as transfer learning in the field of machine learning. We first discuss the conditions under which a detector of cheating can be transferred. We then investigate whether the conditions are met in a real data set. We finally evaluate the benefits of transferring a detector of cheating. We find that a transferred detector has higher accuracy than an unsupervised detector of cheating. A naive transfer that consists of a simple reuse of the detector increases the accuracy considerably. A transfer via a self-labeling (SETRED) algorithm increases the accuracy slightly more than the naive transfer. The findings suggest that the detection of cheating might be improved by using existing detectors of cheating at an early stage of an assessment period.
{"title":"Detecting Cheating in Large-Scale Assessment: The Transfer of Detectors to New Tests.","authors":"Jochen Ranger, Nico Schmidt, Anett Wolgast","doi":"10.1177/00131644221132723","DOIUrl":"10.1177/00131644221132723","url":null,"abstract":"<p><p>Recent approaches to the detection of cheaters in tests employ detectors from the field of machine learning. Detectors based on supervised learning algorithms achieve high accuracy but require labeled data sets with identified cheaters for training. Labeled data sets are usually not available at an early stage of the assessment period. In this article, we discuss the approach of adapting a detector that was trained previously with a labeled training data set to a new unlabeled data set. The training and the new data set may contain data from different tests. The adaptation of detectors to new data or tasks is denominated as transfer learning in the field of machine learning. We first discuss the conditions under which a detector of cheating can be transferred. We then investigate whether the conditions are met in a real data set. We finally evaluate the benefits of transferring a detector of cheating. We find that a transferred detector has higher accuracy than an unsupervised detector of cheating. A naive transfer that consists of a simple reuse of the detector increases the accuracy considerably. A transfer via a self-labeling (SETRED) algorithm increases the accuracy slightly more than the naive transfer. The findings suggest that the detection of cheating might be improved by using existing detectors of cheating at an early stage of an assessment period.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"1033-1058"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470164/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10525104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-09-19DOI: 10.1177/00131644231193625
Kaiwen Man
In various fields, including college admission, medical board certifications, and military recruitment, high-stakes decisions are frequently made based on scores obtained from large-scale assessments. These decisions necessitate precise and reliable scores that enable valid inferences to be drawn about test-takers. However, the ability of such tests to provide reliable, accurate inference on a test-taker’s performance could be jeopardized by aberrant test-taking practices, for instance, practicing real items prior to the test. As a result, it is crucial for administrators of such assessments to develop strategies that detect potential aberrant test-takers after data collection. The aim of this study is to explore the implementation of machine learning methods in combination with multimodal data fusion strategies that integrate bio-information technology, such as eye-tracking, and psychometric measures, including response times and item responses, to detect aberrant test-taking behaviors in technology-assisted remote testing settings.
{"title":"Multimodal Data Fusion to Detect Preknowledge Test-Taking Behavior Using Machine Learning","authors":"Kaiwen Man","doi":"10.1177/00131644231193625","DOIUrl":"https://doi.org/10.1177/00131644231193625","url":null,"abstract":"In various fields, including college admission, medical board certifications, and military recruitment, high-stakes decisions are frequently made based on scores obtained from large-scale assessments. These decisions necessitate precise and reliable scores that enable valid inferences to be drawn about test-takers. However, the ability of such tests to provide reliable, accurate inference on a test-taker’s performance could be jeopardized by aberrant test-taking practices, for instance, practicing real items prior to the test. As a result, it is crucial for administrators of such assessments to develop strategies that detect potential aberrant test-takers after data collection. The aim of this study is to explore the implementation of machine learning methods in combination with multimodal data fusion strategies that integrate bio-information technology, such as eye-tracking, and psychometric measures, including response times and item responses, to detect aberrant test-taking behaviors in technology-assisted remote testing settings.","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135014578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}