Examining provision and sufficiency of testing accommodations for English learners
Pub Date: 2021-02-06 | DOI: 10.1080/15305058.2021.1884872
S. Roschmann, S. Witmer, Martin A. Volker
Abstract Accommodations are commonly provided to address language-related barriers students may experience during testing, yet research on the validity of scores from accommodated test administrations remains somewhat inconclusive. The current study investigated item response patterns to understand whether accommodations, as used in practice among English learners (ELs) in the United States, allow for comparable measurement between ELs and non-ELs. Results indicated that although ELs and non-ELs differed significantly in overall test scores, only minimal measurement concerns emerged: very few items displayed moderate or large differential item functioning (DIF), and no tests showed small, medium, or large differential test functioning. The current study adds to the existing literature on measurement comparability and accommodation research for ELs; implications for practice are provided.
{"title":"Examining provision and sufficiency of testing accommodations for English learners","authors":"S. Roschmann, S. Witmer, Martin A. Volker","doi":"10.1080/15305058.2021.1884872","DOIUrl":"https://doi.org/10.1080/15305058.2021.1884872","url":null,"abstract":"Abstract Accommodations are commonly provided to address language-related barriers students may experience during testing. Research on the validity of scores from accommodated test administrations remains somewhat inconclusive. The current study investigated item response patterns to understand whether accommodations, as used in practice among English learners (ELs) in the United States, allow for comparable measurement between ELs and non-ELs. Results indicated that although significant differences are evident in overall test scores for ELs and non-ELs, only minimal measurement concerns were evident. Very few items displayed moderate or large differential item functioning (DIF); no tests showed small, medium, or large differential test functioning. The current study adds to existing literature on measurement comparability and accommodation research on ELs; implications for practice are provided.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"21 1","pages":"32 - 55"},"PeriodicalIF":1.7,"publicationDate":"2021-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2021.1884872","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46681048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Goal orientation in job search: Psychometric characteristics and construct validation across job search contexts
Pub Date: 2021-02-03 | DOI: 10.1080/15305058.2021.1884871
Emmanuel Affum-Osei, H. Mensah, S. K. Forkuoh, Eric Adom Asante
Abstract The purpose of this study was to examine the psychometric properties of the goal orientation (GO) scale across job search contexts to facilitate its use in large and varied search settings. A sample of 720 job seekers in Ghana, comprising job losers and new labor-market entrants, completed the survey. Confirmatory factor analysis supported the three-factor theoretical structure (learning goal, performance-prove goal, and performance-avoid goal orientations) in both the new-entrant and job-loser samples. Invariance tests indicated measurement equivalence across job search contexts and genders. Furthermore, the GO dimensions correlated differently with several cognitive self-regulation criterion variables (employment commitment, self-control, learning from failure, and strategy awareness), providing evidence of convergent and discriminant validity. Overall, the study provides additional support for using the job search GO measure across different job search contexts.
{"title":"Goal orientation in job search: Psychometric characteristics and construct validation across job search contexts","authors":"Emmanuel Affum-Osei, H. Mensah, S. K. Forkuoh, Eric Adom Asante","doi":"10.1080/15305058.2021.1884871","DOIUrl":"https://doi.org/10.1080/15305058.2021.1884871","url":null,"abstract":"Abstract The purpose of this study was to examine the psychometric properties of the goal orientation (GO) scale across job search contexts to facilitate its use in large and varied search settings. A sample of 720 job losers and new entrants’ job seekers in Ghana completed the survey. Confirmatory factor analysis supported the three-factor theoretical structure (Learning goal, Performance-prove goal, and Performance-avoid goal orientations) for both new entrants’ and job losers’ samples. Results of the invariance test reached measurement equivalence across job search contexts and genders. Furthermore, GO dimensions correlated differently with some cognitive self-regulation criterion variables (employment commitment, self-control, learning from failure, and strategy awareness) thus, providing evidence of convergent and discriminant validity. Overall, the study provides additional support for the job search GO measure for use across different job search contexts.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"21 1","pages":"1 - 31"},"PeriodicalIF":1.7,"publicationDate":"2021-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2021.1884871","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43471412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Survey mode and data quality: Careless responding across three modes in cross-cultural contexts
Pub Date: 2020-12-01 | DOI: 10.1080/15305058.2021.2019747
Zoe Magraw‐Mickelson, Harry Wang, M. Gollwitzer
Abstract Much psychological research depends on participants’ diligence in filling out materials such as surveys. However, not all participants are motivated to respond attentively, which leads to unintended issues with data quality, known as careless responding. Our question is: how do different modes of data collection—paper/pencil, computer/web-based, and smartphone—affect participants’ diligence vs. “careless responding” tendencies and, thus, data quality? Results from prior studies suggest that different data collection modes produce a comparable prevalence of careless responding tendencies. However, as technology develops and data are collected with increasingly diversified populations, this question needs to be readdressed and taken further. The present research examined the effect of survey mode on careless responding in a repeated-measures design with data from three different samples. First, in a sample of working adults from China, we found that participants were slightly more careless when completing computer/web-based survey materials than in paper/pencil mode. Next, in a German student sample, participants were slightly more careless when completing the paper/pencil mode compared to the smartphone mode. Finally, in a sample of Chinese-speaking students, we found no difference between modes. Overall, in a meta-analysis of the findings, we found minimal difference between modes across cultures. Theoretical and practical implications are discussed.
{"title":"Survey mode and data quality: Careless responding across three modes in cross-cultural contexts","authors":"Zoe Magraw‐Mickelson, Harry Wang, M. Gollwitzer","doi":"10.1080/15305058.2021.2019747","DOIUrl":"https://doi.org/10.1080/15305058.2021.2019747","url":null,"abstract":"Abstract Much psychological research depends on participants’ diligence in filling out materials such as surveys. However, not all participants are motivated to respond attentively, which leads to unintended issues with data quality, known as careless responding. Our question is: how do different modes of data collection—paper/pencil, computer/web-based, and smartphone—affect participants’ diligence vs. “careless responding” tendencies and, thus, data quality? Results from prior studies suggest that different data collection modes produce a comparable prevalence of careless responding tendencies. However, as technology develops and data are collected with increasingly diversified populations, this question needs to be readdressed and taken further. The present research examined the effect of survey mode on careless responding in a repeated-measures design with data from three different samples. First, in a sample of working adults from China, we found that participants were slightly more careless when completing computer/web-based survey materials than in paper/pencil mode. Next, in a German student sample, participants were slightly more careless when completing the paper/pencil mode compared to the smartphone mode. Finally, in a sample of Chinese-speaking students, we found no difference between modes. Overall, in a meta-analysis of the findings, we found minimal difference between modes across cultures. Theoretical and practical implications are discussed.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"22 1","pages":"121 - 153"},"PeriodicalIF":1.7,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45224199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cognitive diagnosis models and automated test assembly: an approach incorporating response times
Pub Date: 2020-10-23 | DOI: 10.1080/15305058.2020.1828427
M. Finkelman, J. de la Torre, Jeremy Karp
Abstract Cognitive diagnosis models (CDMs) have been studied as a means of providing detailed diagnostic information about the skills that have been mastered, and the skills that have not, by examinees. Prior research has examined the use of automated test assembly (ATA) alongside CDMs; however, no previous study has investigated how to perform ATA when a CDM is employed and the total amount of time taken by the test must be controlled. The purpose of the current research was to develop an ATA procedure to select tests that are highly informative while simultaneously satisfying constraints on key parameters related to the total-time distribution. In a simulation study, the procedure successfully selected tests that met these dual goals.
{"title":"Cognitive diagnosis models and automated test assembly: an approach incorporating response times","authors":"M. Finkelman, J. de la Torre, Jeremy Karp","doi":"10.1080/15305058.2020.1828427","DOIUrl":"https://doi.org/10.1080/15305058.2020.1828427","url":null,"abstract":"Abstract Cognitive diagnosis models (CDMs) have been studied as a means of providing detailed diagnostic information about the skills that have been mastered, and the skills that have not, by examinees. Prior research has examined the use of automated test assembly (ATA) alongside CDMs; however, no previous study has investigated how to perform ATA when a CDM is employed and the total amount of time taken by the test must be controlled. The purpose of the current research was to develop an ATA procedure to select tests that are highly informative while simultaneously satisfying constraints on key parameters related to the total-time distribution. In a simulation study, the procedure successfully selected tests that met these dual goals.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 1","pages":"299 - 320"},"PeriodicalIF":1.7,"publicationDate":"2020-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2020.1828427","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44001960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coaching β in admission test performance: a study of group differences
Pub Date: 2020-07-31 | DOI: 10.1080/15305058.2020.1786833
Anely Ramírez, Mladen Koljatic, Mónica Silva
Abstract The study addresses the association between coaching practices and university admission test performance in Chile. Estimates of coaching effects are reported for test-takers from the private and public school systems. Our results indicate that coaching is associated with variations in test scores, and the estimated magnitude of the coaching effect appears to vary by subject area, type of coaching strategy, and type of high school attended.
{"title":"Coaching β in admission test performance: a study of group differences","authors":"Anely Ramírez, Mladen Koljatic, Mónica Silva","doi":"10.1080/15305058.2020.1786833","DOIUrl":"https://doi.org/10.1080/15305058.2020.1786833","url":null,"abstract":"Abstract The study addresses the association between coaching practices and university admission test performance in Chile. Estimates of coaching effects are reported for test-takers from the private and public school systems. Our results indicate that coaching is associated with variations in test scores. The estimated magnitude of coaching appears to vary by subject area, type of coaching strategy and type of high school attended.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 1","pages":"253 - 273"},"PeriodicalIF":1.7,"publicationDate":"2020-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2020.1786833","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48759901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Examining the simultaneous change in emotions during a test: relations with expended effort and test performance
Pub Date: 2020-07-24 | DOI: 10.1080/15305058.2020.1786834
S. Finney, B. Perkins, Paulius Satkus
Abstract Using a sample of 497 college students, we measured test-taking emotions (anger, worry, pride, enjoyment) after the first third, second third, and last third of a low-stakes cognitive test of sociocultural knowledge. We examined the simultaneous change in emotions and whether change in emotions predicted subsequent test-taking effort and test performance. Latent growth models indicated that, on average, enjoyment and anger increased, whereas pride and worry decreased during the test. There was significant variability in individual change about these averages. Positive correlations were observed between change in worry and anger and change in pride and enjoyment. Structural equation models indicated that all initial emotions and gains in pride during the test influenced subsequent effort, whereas initial worry, anger and enjoyment, change in pride and enjoyment, and effort influenced test scores. The findings emphasize the importance of assessing change in emotions and the mediation mechanism of effort when modeling test performance.
{"title":"Examining the simultaneous change in emotions during a test: relations with expended effort and test performance","authors":"S. Finney, B. Perkins, Paulius Satkus","doi":"10.1080/15305058.2020.1786834","DOIUrl":"https://doi.org/10.1080/15305058.2020.1786834","url":null,"abstract":"Abstract Using a sample of 497 college students, we measured test-taking emotions (anger, worry, pride, enjoyment) after the first third, second third, and last third of a low-stakes cognitive test of sociocultural knowledge. We examined the simultaneous change in emotions and whether change in emotions predicted subsequent test-taking effort and test performance. Latent growth models indicated that, on average, enjoyment and anger increased, whereas pride and worry decreased during the test. There was significant variability in individual change about these averages. Positive correlations were observed between change in worry and anger and change in pride and enjoyment. Structural equation models indicated that all initial emotions and gains in pride during the test influenced subsequent effort, whereas initial worry, anger and enjoyment, change in pride and enjoyment, and effort influenced test scores. The findings emphasize the importance of assessing change in emotions and the mediation mechanism of effort when modeling test performance.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 1","pages":"274 - 298"},"PeriodicalIF":1.7,"publicationDate":"2020-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2020.1786834","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42495913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Identifying Misfitting Achievement Estimates in Performance Assessments: An Illustration Using Rasch and Mokken Scale Analyses
Pub Date: 2020-07-02 | DOI: 10.1080/15305058.2019.1673758
A. Walker, Stefanie A. Wind
Researchers apply individual person fit analyses as a procedure for checking model-data fit for individual test-takers. When a test-taker's responses misfit, the inferences drawn from their test score about what they know and can do may not be accurate. One problem in applying individual person fit procedures in practice is the question of how much misfit it takes to make a test score an untrustworthy estimate of achievement. In this paper, we argue that if a person's responses generally follow a monotonic pattern, the resulting test score is "good enough" to be interpreted and used. We present an approach that applies statistical procedures from the Rasch and Mokken measurement perspectives to examine individual person fit against this good-enough criterion in real data from a performance assessment. We discuss how these perspectives may facilitate thinking about applying individual person fit procedures in practice.
{"title":"Identifying Misfitting Achievement Estimates in Performance Assessments: An Illustration Using Rasch and Mokken Scale Analyses","authors":"A. Walker, Stefanie A. Wind","doi":"10.1080/15305058.2019.1673758","DOIUrl":"https://doi.org/10.1080/15305058.2019.1673758","url":null,"abstract":"Researchers apply individual person fit analyses as a procedure for checking model-data fit for individual test-takers. When a test-taker misfits, it means that the inferences from their test score regarding what they know and can do may not be accurate. One problem in applying individual person fit procedures in practice is the question of how much misfit it takes to make the test score an untrustworthy estimate of achievement. In this paper, we argue that if a person’s responses generally follow a monotonic pattern, the resulting test score is “good enough” to be interpreted and used. We present an approach that applies statistical procedures from the Rasch and Mokken measurement perspectives to examine individual person fit based on this good enough criterion in real data from a performance assessment. We discuss how these perspectives may facilitate thinking about applying individual person fit procedures in practice.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 1","pages":"231 - 251"},"PeriodicalIF":1.7,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2019.1673758","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49272081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Investigating Technology-Enhanced Item Formats Using Cognitive and Item Response Theory Approaches
Pub Date: 2020-04-02 | DOI: 10.1080/15305058.2019.1648270
J. Moon, S. Sinharay, M. Keehner, Irvin R. Katz
The current study examined the relationship between test-taker cognition and psychometric item properties in multiple-selection multiple-choice and grid items. In a study with content-equivalent mathematics items in alternative item formats, adult participants’ tendency to respond to an item was affected by the presence of a grid and variations of answer options. The results of an item response theory analysis were consistent with the hypothesized cognitive processes in alternative item formats. The findings suggest that seemingly subtle variations of item design could substantially affect test-taker cognition and psychometric outcomes, emphasizing the need for investigating item format effects at a fine-grained level.
{"title":"Investigating Technology-Enhanced Item Formats Using Cognitive and Item Response Theory Approaches","authors":"J. Moon, S. Sinharay, M. Keehner, Irvin R. Katz","doi":"10.1080/15305058.2019.1648270","DOIUrl":"https://doi.org/10.1080/15305058.2019.1648270","url":null,"abstract":"The current study examined the relationship between test-taker cognition and psychometric item properties in multiple-selection multiple-choice and grid items. In a study with content-equivalent mathematics items in alternative item formats, adult participants’ tendency to respond to an item was affected by the presence of a grid and variations of answer options. The results of an item response theory analysis were consistent with the hypothesized cognitive processes in alternative item formats. The findings suggest that seemingly subtle variations of item design could substantially affect test-taker cognition and psychometric outcomes, emphasizing the need for investigating item format effects at a fine-grained level.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 1","pages":"122 - 145"},"PeriodicalIF":1.7,"publicationDate":"2020-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2019.1648270","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46485223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stopping Rules for Computer Adaptive Testing When Item Banks Have Nonuniform Information
Pub Date: 2020-04-02 | DOI: 10.1080/15305058.2019.1635604
S. Morris, Mike Bass, Elizabeth Howard, R. Neapolitan
The standard error (SE) stopping rule, which terminates a computer adaptive test (CAT) when the SE is less than a threshold, is effective when there are informative questions for all trait levels. However, in domains such as patient-reported outcomes, the items in a bank might all target one end of the trait continuum (e.g., negative symptoms), and the bank may lack depth for many individuals. In such cases, the predicted standard error reduction (PSER) stopping rule will stop the CAT even if the SE threshold has not been reached and can avoid administering excessive questions that provide little additional information. By tuning the parameters of the PSER algorithm, a practitioner can specify a desired tradeoff between accuracy and efficiency. Using simulated data for the Patient-Reported Outcomes Measurement Information System Anxiety and Physical Function banks, we demonstrate that these parameters can substantially impact CAT performance. When the parameters were optimally tuned, the PSER stopping rule was found to outperform the SE stopping rule overall, particularly for individuals not targeted by the bank, and presented roughly the same number of items across the trait continuum. Therefore, the PSER stopping rule provides an effective method for balancing the precision and efficiency of a CAT.
{"title":"Stopping Rules for Computer Adaptive Testing When Item Banks Have Nonuniform Information","authors":"S. Morris, Mike Bass, Elizabeth Howard, R. Neapolitan","doi":"10.1080/15305058.2019.1635604","DOIUrl":"https://doi.org/10.1080/15305058.2019.1635604","url":null,"abstract":"The standard error (SE) stopping rule, which terminates a computer adaptive test (CAT) when the SE is less than a threshold, is effective when there are informative questions for all trait levels. However, in domains such as patient-reported outcomes, the items in a bank might all target one end of the trait continuum (e.g., negative symptoms), and the bank may lack depth for many individuals. In such cases, the predicted standard error reduction (PSER) stopping rule will stop the CAT even if the SE threshold has not been reached and can avoid administering excessive questions that provide little additional information. By tuning the parameters of the PSER algorithm, a practitioner can specify a desired tradeoff between accuracy and efficiency. Using simulated data for the Patient-Reported Outcomes Measurement Information System Anxiety and Physical Function banks, we demonstrate that these parameters can substantially impact CAT performance. When the parameters were optimally tuned, the PSER stopping rule was found to outperform the SE stopping rule overall, particularly for individuals not targeted by the bank, and presented roughly the same number of items across the trait continuum. Therefore, the PSER stopping rule provides an effective method for balancing the precision and efficiency of a CAT.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 1","pages":"146 - 168"},"PeriodicalIF":1.7,"publicationDate":"2020-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2019.1635604","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43767801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
True or False? Keying Direction and Acquiescence Influence the Validity of Socio-Emotional Skills Items in Predicting High School Achievement
Pub Date: 2020-04-02 | DOI: 10.1080/15305058.2019.1673398
Ricardo Primi, Filip De Fruyt, Daniel Santos, Stephen Antonoplis, O. John
What type of items, keyed positively or negatively, makes social-emotional skill or personality scales more valid? The present study examines the criterion validities of true- and false-keyed items, before and after correction for acquiescence. The sample included 12,987 children and adolescents (ages 11–18, attending grades 6–12) from 425 schools in the state of São Paulo, Brazil. They answered a computerized 162-item questionnaire measuring 18 facets grouped into five broad domains of social-emotional skills: Open-Mindedness (O), Conscientious Self-Management (C), Engaging with Others (E), Amity (A), and Negative-Emotion Regulation (N). All facet scales were fully balanced (3 true-keyed and 3 false-keyed items per facet). Criterion validity coefficients of scales composed of only true-keyed items were compared with those of scales composed of only false-keyed items; the criterion measure was a standardized achievement test of language and math ability. We found that coefficients were almost twice as large for scales of false-keyed items as for scales of true-keyed items. After correcting for acquiescence, the coefficients became more similar, indicating that acquiescence suppresses the criterion validity of unbalanced scales composed of true-keyed items. We conclude that balanced scales with pairs of true- and false-keyed items yield better measures in terms of internal structure and predictive validity.
{"title":"True or False? Keying Direction and Acquiescence Influence the Validity of Socio-Emotional Skills Items in Predicting High School Achievement","authors":"Ricardo Primi, Filip De Fruyt, Daniel Santos, Stephen Antonoplis, O. John","doi":"10.1080/15305058.2019.1673398","DOIUrl":"https://doi.org/10.1080/15305058.2019.1673398","url":null,"abstract":"What type of items, keyed positively or negatively, makes social-emotional skill or personality scales more valid? The present study examines the different criterion validities of true- and false-keyed items, before and after correction for acquiescence. The sample included 12,987 children and adolescents from 425 schools of the State of São Paulo Brazil (ages 11–18 attending grades 6–12). They answered a computerized 162-item questionnaire measuring 18 facets grouped into five broad domains of social-emotional skills, i.e.: Open-mindedness (O), Conscientious Self-Management (C), Engaging with others (E), Amity (A), and Negative-Emotion Regulation (N). All facet scales were fully balanced (3 true-keyed and 3 false-keyed items per facet). Criterion validity coefficients of scales composed of only true-keyed items versus only false-keyed items were compared. The criterion measure was a standardized achievement test of language and math ability. We found that coefficients were almost as twice as big for false-keyed items’ scales than for true-keyed items’ scales. After correcting for acquiescence coefficients became more similar. Acquiescence suppresses the criterion validity of unbalanced scales composed of true-keyed items. We conclude that balanced scales with pairs of true and false keyed items make a better scale in terms of internal structural and predictive validity.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"20 1","pages":"121 - 97"},"PeriodicalIF":1.7,"publicationDate":"2020-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2019.1673398","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49361168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}