The 10-item Emotion Regulation Questionnaire (ERQ) was developed to measure individual differences in the tendency to use two common emotion regulation strategies: cognitive reappraisal and suppression. The current study examined the psychometric properties of the ERQ in a heterogeneous sample of 713 community residents (64.9% female) using the polytomous Rasch model. The results showed that the 10-item ERQ was multidimensional, supporting the two distinct factors. The reappraisal and suppression subscales were each found to be unidimensional and to fit the Rasch model. No evidence of local dependence was observed, and the five response categories functioned as intended. Differential item functioning (DIF) was assessed across sub-samples defined by gender, self-reported symptoms of mental illness, regular meditation practice, and age group. No evidence emerged of items functioning differently across any of these groups. Using Rasch measure scores, a number of meaningful group differences in person location emerged: less use of reappraisal was reported by younger adults, non-meditators, and those reporting symptoms of mental illness. Non-meditators also reported greater use of suppression than regular meditators; no other age, gender, or symptom-group differences emerged on suppression.
{"title":"A Rasch Model Analysis of the Emotion Regulation Questionnaire.","authors":"Michael J Ireland, Hong Eng Goh, Ida Marais","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>The 10-item Emotion Regulation Questionnaire (ERQ) was developed to measure individual differences in the tendency to use two common emotion regulation strategies: cognitive reappraisal and suppression. The current study examined the psychometric properties of the ERQ in a heterogeneous mixed sample of 713 (64.9% female) community residents using the polytomous Rasch model. The results showed that the 10-item ERQ was multidimensional and supported the two distinct factors. The reappraisal and suppression subscales were both found to be unidimensional and fit the Rasch model. No evidence of local dependence was observed. The five response categories also functioned as intended. Differential item functioning (DIF) was assessed across sub-samples defined by gender, self-report experiencing symptoms of mental illness, regular meditation practice, and age groupings. No evidence emerged of items functioning differently across any of these groups. Using Rasch measure scores, a number of meaningful group differences in person location emerged. Less use of reappraisal was reported by younger adults, non-meditators, and those reporting experiencing symptoms of mental illness. Non-meditators also reported greater use of suppression compared with regular meditators; no other age group, gender, or symptomatic group differences emerged on suppression.</p>","PeriodicalId":73608,"journal":{"name":"Journal of applied measurement","volume":"19 3","pages":"258-270"},"PeriodicalIF":0.0,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36451691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In linked-chain equating, equating errors may accumulate and cause scale drift. This simulation study extends the investigation of scale drift in linked-chain equating to mixed-format tests. Specifically, it examines the impact of the equating method and the characteristics of the anchor test and equating chain on equating errors and scale drift in IRT true-score equating. To evaluate equating results, a new method is used to derive true linking coefficients. The results indicate that the characteristic curve methods produce more accurate and reliable equating results than the moment methods. Although using more anchor items or an anchor test configuration with more IRT parameters can lower the variability of equating results, neither helps control equating bias. Additionally, scale drift increases when an equating chain grows longer or poorly calibrated test forms are added to the chain. The role of calibration precision in evaluating equating results is highlighted.
{"title":"Equating Errors and Scale Drift in Linked-Chain IRT Equating with Mixed-Format Tests.","authors":"Bo Hu","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>In linked-chain equating, equating errors may accumulate and cause scale drift. This simulation study extends the investigation on scale drift in linked-chain equating to mixed-format test. Specifically, the impact of equating method and the characteristics of anchor test and equating chain on equating errors and scale drift in IRT true score equating is examined. To evaluate equating results, a new method is used to derive true linking coefficients. The results indicate that the characteristic curve methods produce more accurate and reliable equating results than the moment methods. Although using more anchor items or an anchor test configuration with more IRT parameters can lower the variability of equating results, neither of them help control equating bias. Additionally, scale drift increases when an equating chain runs longer or poorly calibrated test forms are added to the chain. The role of calibration precision in evaluating equating results is highlighted.</p>","PeriodicalId":73608,"journal":{"name":"Journal of applied measurement","volume":"19 1","pages":"41-58"},"PeriodicalIF":0.0,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35932759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The purpose of the present study was to evaluate analytical means of detecting academic cheating in an experimental setting with real test takers. The omega index was evaluated against a gold criterion of academic cheating: a discrepant score between two administrations. Participants were 164 elementary school students who were administered a mathematics exam followed by an equivalent mock exam, the first under strict and the second under relaxed invigilation. Discrepant scores were defined as a change of more than 7 responses in either direction (correct or incorrect), based on what was expected due to chance. Results indicated that the omega index captured more than 39% of the cases exceeding the plus-or-minus-7 discrepancy criterion. It is concluded that response similarity analysis may be an important tool for detecting academic cheating.
{"title":"Validation of Response Similarity Analysis for the Detection of Academic Cheating: An Experimental Study.","authors":"Georgios D Sideridis, Cengiz Zopluoglu","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>The purpose of the present study was to evaluate various analytical means to detect academic cheating in an experimental setting. The omega index was compared and contrasted given a gold criterion of academic cheating which entailed a discrepant score between two administrations using an experimental study with real test takers. Participants were 164 elementary school students who were administered a mathematics exam followed by an equivalent mock exam under conditions of strict and relaxed, invigilation, respectively. Discrepant scores were defined as exceeding 7 responses in any direction (correct or incorrect), based on what was expected due to chance. Results indicated that the omega index was successful in capturing more than 39% of the cases who exceeded the conventional plus or minus 7 discrepancy criteria. It is concluded that the response similarity analysis may be an important tool in detecting academic cheating.</p>","PeriodicalId":73608,"journal":{"name":"Journal of applied measurement","volume":"19 1","pages":"59-75"},"PeriodicalIF":0.0,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35932760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fluency may be considered a conjoint measure of work-product quality and speed. It is especially useful in educational and medical settings for evaluating expertise and/or competence. In this paper, didactic exams were used to model fluency. Binned propensity matching on question difficulty and time intensity was used to define a 'load' variable and to construct fluency (sum correct / elapsed response time). The analysis yielded response surfaces representing speed-accuracy tradeoffs. Person-by-load fluency matrices behaved well in Rasch analysis and warranted the definition of a person fluency variable ('skill'). A path model with skill and load as mediators substantially described the fluency data; the indirect paths through skill and load dominated direct variable effects. This is supportive evidence that skill and load have stand-alone merit. Therefore, the constructs of skill, load, and fluency could provide psychometrically defensible descriptors when used in appropriate contexts.
{"title":"Person-Level Analysis of the Effect of Cognitive Loading by Question Difficulty and Question Time Intensity on Didactic Examination Fluency (Speed-Accuracy Tradeoff).","authors":"James J Thompson","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Fluency may be considered as a conjoint measure of work product quality and speed. It is especially useful in educational and medical settings to evaluate expertise and/or competence. In this paper, didactic exams were used to model fluency. Binned propensity matching with question difficulty and time intensity was used to define a 'load' variable and construct fluency (sum correct/ elapsed response time). Response surfaces as speed-accuracy tradeoffs resulted from the analysis. Person by load fluency matrices behaved well in Rasch analysis and warranted the definition of a person fluency variable ('skill'). A path model with skill and load as mediators substantially described the fluency data. The indirect paths through skill and load dominated direct variable effects. This is supportive evidence that skill and load have stand-alone merit. Therefore, it appears that the constructs of skill, load, and fluency could provide psychometrically defensible descriptors when utilized in appropriate contexts.</p>","PeriodicalId":73608,"journal":{"name":"Journal of applied measurement","volume":"19 3","pages":"229-242"},"PeriodicalIF":0.0,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36451136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study aimed to advance the Scientific Multi-Text Reading Comprehension Assessment (SMTRCA) by developing a rubric consisting of four subscales: information retrieval, information generalization, information interpretation, and information integration. The assessment tool included 11 closed-ended and 8 open-ended items, plus the rubric. Two texts presenting opposing views in the dispute over whether to continue construction of the Fourth Nuclear Power Plant in Taiwan were developed; 1535 students in grades 5-9 read the two texts in counterbalanced order and answered the test items. First, Cronbach's alpha values exceeded .90, indicating very good intra-rater consistency, and the Kendall coefficient of concordance for inter-rater reliability was larger than .80, denoting a consistent scoring pattern between raters. Second, many-facet Rasch measurement analysis showed significant differences in rater severity, and both severe and lenient raters could distinguish high- from low-ability students effectively. A comparison of the rating scale model and the partial credit model indicated that each rater had a unique rating scale structure, meaning that scoring involves human interpretation and evaluation, making machine-like consistency difficult to reach; this is in line with expectations of typical human judgment processes. Third, the Cronbach's alpha coefficient for the full assessment was above .85, denoting high internal consistency for the SMTRCA. Finally, confirmatory factor analysis showed acceptable goodness of fit for the SMTRCA. These results suggest that the SMTRCA is a useful tool for measuring multi-text reading comprehension abilities.
{"title":"Developing and Validating a Scientific Multi-Text Reading Comprehension Assessment: In the Text Case of the Dispute of whether to Continue the Fourth Nuclear Power Plant Construction in Taiwan.","authors":"Lin Hsiao-Hui, Yuh-Tsuen Tzeng","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>This study aimed to advance the Scientific Multi-Text Reading Comprehension Assessment (SMTRCA) by developing a rubric which consisted of 4 subscales: information retrieval, information generalization, information interpretation, and information integration. The assessment tool included 11 close-ended and 8 open-ended items and its rubric. Two texts describing opposing views of the dispute of whether to continue the Fourth Nuclear Power Plant construction in Taiwan were developed and 1535 grade 5-9 students read these two texts in a counterbalanced order and answered the test items. First, the results showed that the Cronbach's values were more than .9, indicating very good intra-rater consistency. The Kendall coefficient of concordance of the inter-rater reliability was larger than .8, denoting a consistent scoring pattern between raters. Second, the analysis of many-facet Rasch measurement showed that there were significant difference in rater severity, and both severe and lenient raters could distinguish high versus low-ability students effectively. The comparison of the rating scale model and the partial credit model indicated that each rater had a unique rating scale structure, meaning that the rating procedures involve human interpretation and evaluation during the scoring processes so that it is difficult to reach a machine-like consistency level. However, this is in line with expectations of typical human judgment processes. Third, the Cronbach's coefficient of the full assessment were above .85, denoting that the SMTRCA has high internal-consistency. Finally, confirmatory factory analysis showed that there was an acceptable goodness-of-fit among the SMTRCA. These results suggest that the SMTRCA was a useful tool for measuring multi-text reading comprehension abilities.</p>","PeriodicalId":73608,"journal":{"name":"Journal of applied measurement","volume":"19 3","pages":"320-337"},"PeriodicalIF":0.0,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36451142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social perspective-taking (SPT), which involves the ability to infer others' intentions, is a consequential social cognitive process. The purpose of this study was to evaluate the psychometric properties of a web-based social perspective-taking assessment (SELweb SPT) designed for children in kindergarten through third grade. Data were collected from two separate samples of children, of 3224 and 4419 children respectively, and calibrated using the dichotomous Rasch model (Rasch, 1960). Differential item and test functioning were also evaluated across gender and ethnicity groups. Across both samples, we found evidence of consistent item fit, a unidimensional item structure, and adequate item targeting overall; however, poor targeting at the extremes of the ability range suggests that more items are needed to distinguish low- and high-ability respondents. Analyses of DIF found some significant item-level DIF across gender, but no DIF across ethnicity. Analyses of person measure calibrations with and without DIF items evidenced negligible differential test functioning (DTF) across gender and ethnicity groups in both samples.
{"title":"Psychometric Properties and Differential Item Functioning of a Web-Based Assessment of Children's Social Perspective-Taking.","authors":"Beyza Aksu Dunya, Clark McKown, Everett V Smith","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Social perspective-taking (SPT), which involves the ability infer others' intentions, is a consequential social cognitive process. The purpose of this study is to evaluate the psychometric properties of a web-based social perspective-taking (SELweb SPT) assessment designed for children in kindergarten through third grade. Data were collected from two separate samples of children. The first sample included 3224 children and the second sample included 4419 children. Data were calibrated using Rasch dichotomous model (Rasch, 1960). Differential item and test functioning were also evaluated across gender and ethnicity groups. Across both samples, we found: evidence of consistent item fit; unidimensional item structure; and adequate item targeting. Poor item targeting at high and low ability levels suggests that more items are needed to distinguish low and high ability respondents. Analyses of DIF found some significant item-level DIF across gender, but no DIF across ethnicity. The analyses of person measure calibrations with and without DIF items evidenced negligible differential test functioning (DTF) across gender and ethnicity groups in both samples.</p>","PeriodicalId":73608,"journal":{"name":"Journal of applied measurement","volume":"19 1","pages":"93-105"},"PeriodicalIF":0.0,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35932762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article describes the development and calibration of items from the 1997 to 2006 Tertiary Entrance Exams (TEE) in Chemistry, conducted by the Curriculum Council of Western Australia, for the purpose of establishing a Chemistry item bank. Only items that met the strict Rasch measurement criterion of ordered thresholds were included; item residuals and chi-square conformity were likewise scrutinized. Further, specialist experts in chemistry were employed to ascertain the qualitative properties of the items, particularly the item wording, so as to provide accurate item descriptors. An item bank of 174 items was created. This item bank may now be used by teachers in their classrooms to develop class assessments in Chemistry and/or for classroom diagnostic purposes.
{"title":"Development and Calibration of Chemistry Items to Create an Item Bank, using the Rasch Measurement Model.","authors":"Joseph N Njiru, Joseph T Romanoski","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>This article describes the development and calibration of items from the 1997 to 2006 Tertiary Entrance Exams (TEE) in Chemistry conducted by the Curriculum Council of Western Australia for the purposes of establishing a Chemistry item bank. Only items that met the strict Rasch measurement criterion of ordered thresholds were included. Item Residuals and Chi-square conformity of the items were likewise scrutinized. Further, specialist experts in chemistry were employed to ascertain the qualitative properties of the items, particularly the item wording, so as to provide accurate item descriptors. An item bank of 174 items was created. This item bank may now be accurately used by teachers in their classrooms for the purposes of developing class assessments in Chemistry and/or for classroom diagnostic purposes.</p>","PeriodicalId":73608,"journal":{"name":"Journal of applied measurement","volume":"19 2","pages":"192-200"},"PeriodicalIF":0.0,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36215372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Imputation has become common practice owing to the availability of easy-to-use algorithms and software. This study aims to determine whether different imputation strategies are robust to the extent and type of missingness, local item dependencies (LID), differential item functioning (DIF), and misfit when conducting a Rasch analysis. Four samples were simulated, representing a sample with good metric properties, a sample with LID, a sample with DIF, and a sample with both LID and DIF. Missing values were generated in increasing proportions and were either missing at random or missing completely at random. Four imputation techniques were applied before Rasch analysis, and the deviation of the results and the quality of fit were compared. Imputation strategies performed well with less than 15% missingness, but the analysis that left the missing values in place performed best in recovering statistical estimates. The best strategy when doing a Rasch analysis is therefore to analyze the data with the missing values; if imputation is necessary for some reason, we recommend using the expectation-maximization (EM) algorithm.
{"title":"The Impact of Missing Values and Single Imputation upon Rasch Analysis Outcomes: A Simulation Study.","authors":"Carolina Saskia Fellinghauer, Birgit Prodinger, Alan Tennant","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Imputation becomes common practice through availability of easy-to-use algorithms and software. This study aims to determine if different imputation strategies are robust to the extent and type of missingness, local item dependencies (LID), differential item functioning (DIF), and misfit when doing a Rasch analysis. Four samples were simulated and represented a sample with good metric properties, a sample with LID, a sample with DIF, and a sample with LID and DIF. Missing values were generated with increasing proportion and were either missing at random or completely at random. Four imputation techniques were applied before Rasch analysis and deviation of the results and the quality of fit compared. Imputation strategies showed good performance with less than 15% of missingness. The analysis with missing values performed best in recovering statistical estimates. The best strategy, when doing a Rasch analysis, is the analysis with missing values. If for some reason imputation is necessary, we recommend using the expectation-maximization algorithm.</p>","PeriodicalId":73608,"journal":{"name":"Journal of applied measurement","volume":"19 1","pages":"1-25"},"PeriodicalIF":0.0,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35932758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aligning scales in vertical equating carries a number of challenges for practitioners in contexts such as large-scale testing. This paper examines the impact of high and low item discrimination on the results of vertical equating when the Rasch model is applied. A first simulation study shows that differing levels of discrimination introduce systematic error into estimates. A second simulation study shows that, for the purpose of vertical equating, items with high or low discrimination yield information about translation constants that contains systematic error. The impact of differential item discrimination on vertical equating is examined and then illustrated with a real data set from a large-scale testing program with vertical links between grade 3 and grade 5 numeracy tests. Implications of the results for practitioners conducting vertical equating with the Rasch model are identified, including for monitoring progress over time. Implications for other item response models are also discussed.
{"title":"The Impact of Levels of Discrimination on Vertical Equating in the Rasch Model.","authors":"Stephen N Humphrey","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Aligning scales in vertical equating carries a number of challenges for practitioners in contexts such as large-scale testing. This paper examines the impact of high and low discrimination on the results of vertical equating when the Rasch model is applied. A simulation study is used to show that different levels of discrimination introduce systematic error into estimates. A second simulation study shows that for the purpose of vertical equating, items with high or low discrimination contain information about translation constants that contains systematic error. The impact of differential item discrimination on vertical equating is examined and subsequently illustrated in terms of a real data set from a large-scale testing program, with vertical links between grade 3 and 5 numeracy tests. Implications of the results for practitioners conducting vertical equating with the Rasch model are identified, including monitoring progress over time. Implications for other item response models are also discussed.</p>","PeriodicalId":73608,"journal":{"name":"Journal of applied measurement","volume":"19 3","pages":"216-228"},"PeriodicalIF":0.0,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36451686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Establishing the internal validity of psychometric instruments is an important research priority, and is especially vital for instruments used to collect data that guide public policy decisions. The Warwick-Edinburgh Mental Well-Being Scale (WEMWBS) is a well-established and widely used instrument for assessing individual differences in well-being. The current analyses were motivated by concerns that mental well-being items referring to interpersonal relationships (Items 9 and 12) may operate differently for those in a relationship than for those not in a relationship. To assess this, the present study used item characteristic curves (ICCs) and ANOVA of residuals to scrutinize differential item functioning (DIF) in the 14 WEMWBS items by participant relationship status (n with partner = 261, n without partner = 210). Items 5, 9, and 12 showed evidence of DIF that impacted group mean differences. The DIF for Item 5 ("energy to spare") was unexpected, but a plausible explanation is discussed. For participants at the same level of mental well-being, those in a relationship scored higher on Items 9 and 12 than those not in a relationship, suggesting these items are sensitive to variance associated with relationship status rather than well-being. Implications and future research directions are discussed.
{"title":"The Impact of Differential Item Functioning on the Warwick-Edinburgh Mental Well-Being Scale.","authors":"Hong Eng Goh, Ida Marais, Michael Ireland","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Establishing the internal validity of psychometric instruments is an important research priority, and is especially vital for instruments that are used to collect data to guide public policy decisions. The Warwick-Edinburgh Mental Well-Being Scale (WEMWBS) is a well-established and widely-used instrument for assessing individual differences in well-being. The current analyses were motivated by concerns that metal wellbeing items that refer to interpersonal relationships (Items 9 and 12) may operate differently for those in a relationship compared to those not in a relationship. To assess this, the present study used item characteristic curves (ICC) and ANOVA of residuals to scrutinize the differential item functioning (DIF) of the 14 WEMWBS items for participant relationship status (n with partner = 261, n without partner = 210). Items 5, 9, and 12 showed evidence of DIF which impacted group mean differences. Item 5 (\"energy to spare\") was unexpected, however plausible explanation is discussed. For participants at the same level of mental wellbeing, those in a relationship scored higher on items 9 and 12 than those not in a relationship. This suggests these items are sensitive to non-wellbeing related variance associated with relationship status. Implications and future research directions are discussed.</p>","PeriodicalId":73608,"journal":{"name":"Journal of applied measurement","volume":"19 2","pages":"162-172"},"PeriodicalIF":0.0,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36216477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}