Pub Date: 2019-01-02 | DOI: 10.1080/15305058.2018.1464010
A. Mousavi, Ying Cui, Todd Rogers
This simulation study evaluates four different methods of setting cutoff values for person fit assessment: (a) using fixed cutoff values taken either from theoretical distributions of person fit statistics or chosen arbitrarily by researchers in the literature; (b) using a specific percentile rank of the empirical sampling distribution of person fit statistics computed from simulated fitting responses; (c) using a bootstrap method to estimate cutoff values from the empirical sampling distribution of person fit statistics computed from simulated fitting responses; and (d) using a p-value method to identify misfitting responses conditional on ability level. Snijders' (2001) statistic was chosen as an index with a known theoretical distribution, and van der Flier's (1982) U3 and Sijtsma's (1986) HT coefficient as indices with unknown theoretical distributions. According to the simulation results, different methods of setting cutoff values tend to produce different levels of Type I error and detection rates, indicating that it is critical to select an appropriate method for setting cutoff values in person fit research.
{"title":"An Examination of Different Methods of Setting Cutoff Values in Person Fit Research","authors":"A. Mousavi, Ying Cui, Todd Rogers","doi":"10.1080/15305058.2018.1464010","DOIUrl":"https://doi.org/10.1080/15305058.2018.1464010","url":null,"abstract":"This simulation study evaluates four different methods of setting cutoff values for person fit assessment, including (a) using fixed cutoff values either from theoretical distributions of person fit statistics, or arbitrarily chosen by the researchers in the literature; (b) using the specific percentile rank of empirical sampling distribution of person fit statistics from simulated fitting responses; (c) using bootstrap method to estimate cutoff values of empirical sampling distribution of person fit statistics from simulated fitting responses; and (d) using the p-value methods to identify misfitting responses conditional on ability levels. The Snijders' (2001), as an index with known theoretical distribution, van der Flier's U3 (1982) and Sijtsma's HT coefficient (1986), as indices with unknown theoretical distribution, were chosen. According to the simulation results, different methods of setting cutoff values tend to produce different levels of Type I error and detection rates, indicating it is critical to select an appropriate method for setting cutoff values in person fit research.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"19 1","pages":"1 - 22"},"PeriodicalIF":1.7,"publicationDate":"2019-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1464010","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48532510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-12-13 | DOI: 10.1080/15305058.2018.1530239
Kyung Yong Kim, Euijin Lim, Won‐Chan Lee
For passage-based tests, items that belong to a common passage often violate the local independence assumption of unidimensional item response theory (UIRT). In this case, ignoring local item dependence (LID) and estimating item parameters using a UIRT model could be problematic because doing so might result in inaccurate parameter estimates, which, in turn, could impact the results of equating. Under the random groups design, the main purpose of this article was to compare the relative performance of the three-parameter logistic (3PL), graded response (GR), bifactor, and testlet models on equating passage-based tests when various degrees of LID were present due to passage. Simulation results showed that the testlet model produced the most accurate equating results, followed by the bifactor model. The 3PL model worked as well as the bifactor and testlet models when the degree of LID was low but returned less accurate equating results than the two multidimensional models as the degree of LID increased. Among the four models, the polytomous GR model provided the least accurate equating results.
{"title":"A Comparison of the Relative Performance of Four IRT Models on Equating Passage-Based Tests","authors":"Kyung Yong Kim, Euijin Lim, Won‐Chan Lee","doi":"10.1080/15305058.2018.1530239","DOIUrl":"https://doi.org/10.1080/15305058.2018.1530239","url":null,"abstract":"For passage-based tests, items that belong to a common passage often violate the local independence assumption of unidimensional item response theory (UIRT). In this case, ignoring local item dependence (LID) and estimating item parameters using a UIRT model could be problematic because doing so might result in inaccurate parameter estimates, which, in turn, could impact the results of equating. Under the random groups design, the main purpose of this article was to compare the relative performance of the three-parameter logistic (3PL), graded response (GR), bifactor, and testlet models on equating passage-based tests when various degrees of LID were present due to passage. Simulation results showed that the testlet model produced the most accurate equating results, followed by the bifactor model. The 3PL model worked as well as the bifactor and testlet models when the degree of LID was low but returned less accurate equating results than the two multidimensional models as the degree of LID increased. Among the four models, the polytomous GR model provided the least accurate equating results.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"19 1","pages":"248 - 269"},"PeriodicalIF":1.7,"publicationDate":"2018-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1530239","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46453114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-10-02 | DOI: 10.1080/15305058.2017.1396466
S. Finney, Aaron J. Myers, C. Mathers
Assessment specialists expend a great deal of energy to promote valid inferences from test scores gathered in low-stakes testing contexts. Given the indirect effect of perceived test importance on test performance via examinee effort, assessment practitioners have manipulated test instructions with the goal of increasing perceived test importance. Importantly, no studies have investigated the impact of test instructions on this indirect effect. In the current study, students were randomly assigned to one of three test instruction conditions intended to increase test relevance while keeping the test low-stakes to examinees. Test instructions did not impact average perceived test importance, examinee effort, or test performance. Furthermore, the indirect relationship between importance and performance via effort was not moderated by instructions. Thus, the effect of perceived test importance on test scores via expended effort appears consistent across different messages regarding the personal relevance of the test to examinees. The main implication for testing practice is that the effect of instructions may be negligible when reflective of authentic low-stakes test score use. Future studies should focus on uncovering instructions that increase the value of performance to the examinee yet remain truthful regarding score use.
{"title":"Test Instructions Do Not Moderate the Indirect Effect of Perceived Test Importance on Test Performance in Low-Stakes Testing Contexts","authors":"S. Finney, Aaron J. Myers, C. Mathers","doi":"10.1080/15305058.2017.1396466","DOIUrl":"https://doi.org/10.1080/15305058.2017.1396466","url":null,"abstract":"Assessment specialists expend a great deal of energy to promote valid inferences from test scores gathered in low-stakes testing contexts. Given the indirect effect of perceived test importance on test performance via examinee effort, assessment practitioners have manipulated test instructions with the goal of increasing perceived test importance. Importantly, no studies have investigated the impact of test instructions on this indirect effect. In the current study, students were randomly assigned to one of three test instruction conditions intended to increase test relevance while keeping the test low-stakes to examinees. Test instructions did not impact average perceived test importance, examinee effort, or test performance. Furthermore, the indirect relationship between importance and performance via effort was not moderated by instructions. Thus, the effect of perceived test importance on test scores via expended effort appears consistent across different messages regarding the personal relevance of the test to examinees. The main implication for testing practice is that the effect of instructions may be negligible when reflective of authentic low-stakes test score use. Future studies should focus on uncovering instructions that increase the value of performance to the examinee yet remain truthful regarding score use.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"18 1","pages":"297 - 322"},"PeriodicalIF":1.7,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2017.1396466","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49293024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-09-20 | DOI: 10.1080/15305058.2018.1497636
Amanda M Marcotte, Francis Rick, C. Wells
Reading comprehension plays an important role in achievement across all academic domains. The purpose of this study is to describe the sentence verification technique (SVT; Royer, Hastings, & Hook, 1979) as an alternative method of assessing reading comprehension that can be used with a variety of texts and across diverse populations and educational contexts. Additionally, this study makes a unique contribution to the extant literature on the SVT through an investigation of the precision of the instrument across proficiency levels. Data were gathered from a sample of 464 fourth-grade students from the Northeast region of the United States. Reliability was estimated using one-, two-, three-, and four-passage test forms. Two or three passages provided sufficient reliability. The conditional reliability analyses revealed that SVT test scores were reliable for readers with average to below-average proficiency but did not provide reliable information for students who were very poor or very strong readers.
{"title":"Investigating the Reliability of the Sentence Verification Technique","authors":"Amanda M Marcotte, Francis Rick, C. Wells","doi":"10.1080/15305058.2018.1497636","DOIUrl":"https://doi.org/10.1080/15305058.2018.1497636","url":null,"abstract":"Reading comprehension plays an important role in achievement for all academic domains. The purpose of this study is to describe the sentence verification technique (SVT) (Royer, Hastings, & Hook, 1979) as an alternative method of assessing reading comprehension, which can be used with a variety of texts and across diverse populations and educational contexts. Additionally, this study adds a unique contribution to the extant literature on the SVT through an investigation of the precision of the instrument across proficiency levels. Data were gathered from a sample of 464 fourth-grade students from the Northeast region of the United States. Reliability was estimated using one, two, three, and four passage test forms. Two or three passages provided sufficient reliability. The conditional reliability analyses revealed that the SVT test scores were reliable for readers with average to below average proficiency, but did not provide reliable information for students who were very poor or strong readers.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"19 1","pages":"74 - 95"},"PeriodicalIF":1.7,"publicationDate":"2018-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1497636","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45868181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-09-14 | DOI: 10.1080/15305058.2018.1481852
HyeSun Lee, K. Geisinger
The purpose of the current study was to examine the impact of item parameter drift (IPD) occurring in context questionnaires from an international large-scale assessment and to determine the most appropriate way to address IPD. Focusing on the context of psychometric and educational research in which scores from context questionnaires composed of polytomous items are used to classify examinees, the current research investigated the impact of IPD on the estimation of questionnaire scores and on classification accuracy with five manipulated factors: the length of the questionnaire, the proportion of items exhibiting IPD, the direction and magnitude of IPD, and three decisions about how to handle IPD. The results indicated that the impact of IPD occurring in a short context questionnaire on the accuracy of score estimation and examinee classification was substantial. Classification accuracy decreased considerably, especially at the lowest and highest categories of a trait. Contrary to recommendations in the educational testing literature, the current study demonstrated that keeping items exhibiting IPD and removing them only for the transformation was appropriate when IPD occurred in relatively short context questionnaires. Using 2011 TIMSS data from Iran, an applied example demonstrated how the provided guidance can be applied in making appropriate decisions about IPD.
{"title":"Item Parameter Drift in Context Questionnaires from International Large-Scale Assessments","authors":"HyeSun Lee, K. Geisinger","doi":"10.1080/15305058.2018.1481852","DOIUrl":"https://doi.org/10.1080/15305058.2018.1481852","url":null,"abstract":"The purpose of the current study was to examine the impact of item parameter drift (IPD) occurring in context questionnaires from an international large-scale assessment and determine the most appropriate way to address IPD. Focusing on the context of psychometric and educational research where scores from context questionnaires composed of polytomous items were employed for the classification of examinees, the current research investigated the impacts of IPD on the estimation of questionnaire scores and classification accuracy with five manipulated factors: the length of a questionnaire, the proportion of items exhibiting IPD, the direction and magnitude of IPD, and three decisions about IPD. The results indicated that the impact of IPD occurring in a short context questionnaire on the accuracy of score estimation and classification of examinees was substantial. The accuracy in classification considerably decreased especially at the lowest and highest categories of a trait. Unlike the recommendation from literature in educational testing, the current study demonstrated that keeping items exhibiting IPD and removing them only for transformation were appropriate when IPD occurred in relatively short context questionnaires. Using 2011 TIMSS data from Iran, an applied example demonstrated the application of provided guidance in making appropriate decisions about IPD.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"19 1","pages":"23 - 51"},"PeriodicalIF":1.7,"publicationDate":"2018-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1481852","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42801965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-09-14 | DOI: 10.1080/15305058.2018.1486316
Stephen D. Holmes, M. Meadows, I. Stockford, Qingping He
The relationship between the expected and actual difficulty of items on six mathematics question papers designed for 16-year-olds in England was investigated through paired comparison by experts and through testing with students. A variant of the Rasch model was applied to the comparison data to establish a scale of expected difficulty. In testing, the papers were taken by 2,933 students using an equivalent-groups design, allowing the actual difficulty of the items to be placed on the same measurement scale. The expected difficulty derived using the comparative judgement approach and the actual difficulty derived from the test data were reasonably strongly correlated. This suggests that comparative judgement may be an effective way to investigate the comparability of difficulty of examinations. The approach could potentially be used as a proxy for pretesting high-stakes tests in situations where pretesting is not feasible for reasons of security or other risks.
{"title":"Investigating the Comparability of Examination Difficulty Using Comparative Judgement and Rasch Modelling","authors":"Stephen D. Holmes, M. Meadows, I. Stockford, Qingping He","doi":"10.1080/15305058.2018.1486316","DOIUrl":"https://doi.org/10.1080/15305058.2018.1486316","url":null,"abstract":"The relationship of expected and actual difficulty of items on six mathematics question papers designed for 16-year olds in England was investigated through paired comparison using experts and testing with students. A variant of the Rasch model was applied to the comparison data to establish a scale of expected difficulty. In testing, the papers were taken by 2933 students using an equivalent-groups design, allowing the actual difficulty of the items to be placed on the same measurement scale. It was found that the expected difficulty derived using the comparative judgement approach and the actual difficulty derived from the test data was reasonably strongly correlated. This suggests that comparative judgement may be an effective way to investigate the comparability of difficulty of examinations. The approach could potentially be used as a proxy for pretesting high-stakes tests in situations where pretesting is not feasible due to reasons of security or other risks.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"18 1","pages":"366 - 391"},"PeriodicalIF":1.7,"publicationDate":"2018-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1486316","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45405533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-09-14 | DOI: 10.1080/15305058.2018.1481853
Adam E. Wyse
An important piece of validity evidence to support the use of credentialing exams comes from performing a job analysis of the profession. One common job analysis method is the task inventory method, where people working in the field are surveyed using rating scales about the tasks thought necessary to safely and competently perform the job. This article describes how mixture Rasch models can be used to analyze these data, and how results from these analyses can help to identify whether different groups of people may be responding to job tasks differently. Three examples from different credentialing programs illustrate scenarios that can be found when applying mixture Rasch models to job analysis data. Discussion of what these results may imply for the development of credentialing exams and other analyses of job analysis data is provided.
{"title":"Analyzing Job Analysis Data Using Mixture Rasch Models","authors":"Adam E. Wyse","doi":"10.1080/15305058.2018.1481853","DOIUrl":"https://doi.org/10.1080/15305058.2018.1481853","url":null,"abstract":"An important piece of validity evidence to support the use of credentialing exams comes from performing a job analysis of the profession. One common job analysis method is the task inventory method, where people working in the field are surveyed using rating scales about the tasks thought necessary to safely and competently perform the job. This article describes how mixture Rasch models can be used to analyze these data, and how results from these analyses can help to identify whether different groups of people may be responding to job tasks differently. Three examples from different credentialing programs illustrate scenarios that can be found when applying mixture Rasch models to job analysis data. Discussion of what these results may imply for the development of credentialing exams and other analyses of job analysis data is provided.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"19 1","pages":"52 - 73"},"PeriodicalIF":1.7,"publicationDate":"2018-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1481853","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47874147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-07-03 | DOI: 10.1080/15305058.2017.1396465
Dongbo Tu, Chanjin Zheng, Yan Cai, Xuliang Gao, Daxun Wang
Pursuing the line of difference models in IRT (Thissen & Steinberg, 1986), this article proposes a new cognitive diagnostic model for graded/polytomous data based on the deterministic inputs, noisy "and" gate (DINA) model (Haertel, 1989; Junker & Sijtsma, 2001), named the DINA model for graded data (DINA-GD). We investigated the performance of a fully Bayesian estimation of the proposed model. In the simulation, the classification accuracy and item parameter recovery of the DINA-GD model were investigated. The results indicated that the proposed model had acceptable attribute classification rates for examinees and acceptable item parameter recovery. In addition, a real-data example illustrates the application of the new model to graded data or polytomously scored items.
{"title":"A Polytomous Model of Cognitive Diagnostic Assessment for Graded Data","authors":"Dongbo Tu, Chanjin Zheng, Yan Cai, Xuliang Gao, Daxun Wang","doi":"10.1080/15305058.2017.1396465","DOIUrl":"https://doi.org/10.1080/15305058.2017.1396465","url":null,"abstract":"Pursuing the line of the difference models in IRT (Thissen & Steinberg, 1986), this article proposed a new cognitive diagnostic model for graded/polytomous data based on the deterministic input, noisy, and gate (Haertel, 1989; Junker & Sijtsma, 2001), which is named the DINA model for graded data (DINA-GD). We investigated the performance of a full Bayesian estimation of the proposed model. In the simulation, the classification accuracy and item recovery for the DINA-GD model were investigated. The results indicated that the proposed model had acceptable examinees' correct attribute classification rate and item parameter recovery. In addition, a real-data example was used to illustrate the application of this new model with the graded data or polytomously scored items.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"18 1","pages":"231 - 252"},"PeriodicalIF":1.7,"publicationDate":"2018-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2017.1396465","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49274990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-07-03 | DOI: 10.1080/15305058.2017.1396464
Amery Wu, Michelle Y. Chen, J. Stone
This article investigates how test-takers change their strategies to handle increased test difficulty. An adult sample reported their test-taking strategies immediately after completing the tasks in a reading test. Data were analyzed using structural equation modeling specifying a measurement-invariant, ability-moderated latent transition analysis in Mplus (Muthén & Asparouhov, 2011). It was found that almost half of the test-takers (47%) changed their strategies when encountering increased task difficulty. The changes were characterized by augmenting comprehending-meaning strategies with score-maximizing and test-wiseness strategies. Moreover, test-takers' ability was the driving influence that facilitated and/or buffered the changes. The test outcomes, when reviewed in light of adjusted test-taking strategies, demonstrated a form of process-based validity evidence.
{"title":"Investigating How Test-Takers Change Their Strategies to Handle Difficulty in Taking a Reading Comprehension Test: Implications for Score Validation","authors":"Amery Wu, Michelle Y. Chen, J. Stone","doi":"10.1080/15305058.2017.1396464","DOIUrl":"https://doi.org/10.1080/15305058.2017.1396464","url":null,"abstract":"This article investigates how test-takers change their strategies to handle increased test difficulty. An adult sample reported their test-taking strategies immediately after completing the tasks in a reading test. Data were analyzed using structural equation modeling specifying a measurement-invariant, ability-moderated, latent transition analysis in Mplus (Muthén & Asparouhov, 2011). It was found that almost half of the test-takers (47%) changed their strategies when encountering increased task-difficulty. The changes were characterized by augmenting comprehending-meaning strategies with score-maximizing and test-wiseness strategies. Moreover, test-takers' ability was the driving influence that facilitated and/or buffered the changes. The test outcomes, when reviewed in light of adjusted test-taking strategies, demonstrated a form of process-based validity evidence.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"18 1","pages":"253 - 275"},"PeriodicalIF":1.7,"publicationDate":"2018-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2017.1396464","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45131489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-07-03 | DOI: 10.1080/15305058.2017.1407767
Patriann Smith, P. Frazier, Jaehoon Lee, R. Chang
Previous research has primarily addressed the effects of language on the Program for International Student Assessment (PISA) mathematics and science assessments. More recent research has focused on the effects of language on PISA reading comprehension and literacy assessments for student populations in specific Organization for Economic Cooperation and Development (OECD) and non-OECD countries. Recognizing calls to highlight the impact of language on students' PISA reading performance across countries, the purpose of this study was to examine the effect of home languages versus test languages on PISA reading literacy across OECD and non-OECD economies, while considering other factors. The results of ordinary least squares (OLS) regression showed that about half of the economies demonstrated a positive and significant effect of students' language status on their reading performance. This finding is consistent with observations in the parallel analysis of PISA 2009 data, suggesting that students' performance on the reading literacy assessment was higher when they were tested in their home language. Our findings highlight the importance of context, the need for new approaches to test translation, and the potential similarities in language status for youth from OECD and non-OECD countries, all of which have implications for interpreting their PISA reading literacy assessments.
{"title":"Incongruence Between Native and Test Administration Languages: Towards Equal Opportunity in International Literacy Assessment","authors":"Patriann Smith, P. Frazier, Jaehoon Lee, R. Chang","doi":"10.1080/15305058.2017.1407767","DOIUrl":"https://doi.org/10.1080/15305058.2017.1407767","url":null,"abstract":"Previous research has primarily addressed the effects of language on the Program for International Student Assessment (PISA) mathematics and science assessments. More recent research has focused on the effects of language on PISA reading comprehension and literacy assessments on student populations in specific Organization for Economic Cooperation and Development (OECD) and non-OECD countries. Recognizing calls to highlight the impact of language on student PISA reading performance across countries, the purpose of this study was to examine the effect of home languages versus test languages on PISA reading literacy across OECD and non-OECD economies, while considering other factors. The results of Ordinary Least Squares regression showed that about half of the economies demonstrated a positive and significant effect of students' language status on their reading performance. This finding is consistent with observations in the parallel analysis of PISA 2009 data, suggesting that students' performance on reading literacy assessment was higher when they were tested in their home language. Our findings highlight the importance of the role of context, the need for new approaches to test translation, and the potential similarities in language status for youth from OECD and non-OECD countries that have implications for interpreting their PISA reading literacy assessments.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":"18 1","pages":"276 - 296"},"PeriodicalIF":1.7,"publicationDate":"2018-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2017.1407767","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41346188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}