Using dictation to measure language proficiency: A Rasch analysis
Paul Leeming & Aeric Wong. Studies in Language Assessment, 2016. https://doi.org/10.58379/mbsw8958

Groups are used widely in the language classroom, and, particularly in classes with a wide range of English proficiency, teachers may want to construct balanced groups based on the proficiency of individual students. To construct such groups, teachers need a reliable measure that effectively differentiates between levels of proficiency, yet in some contexts information about student proficiency is not available. This paper reports on the use of an in-house dictation test to measure the English proficiency of students at a Japanese university. Rasch analysis was used to determine the degree to which the dictation differentiated across the range of proficiencies in the classes and to assess the reliability of the test. Correlation with scores from the TOEIC and SLEP tests was used to confirm that the dictation measures English proficiency. Results show that dictation is a simple, cheap, and effective means of assessing the relative proficiency of students in this context and can be used to construct balanced groups.
An evaluation of an online rater training program for the speaking and writing sub-tests of the Aptis test
U. Knoch, J. Fairbairn & A. Huisman. Studies in Language Assessment, 2016. https://doi.org/10.58379/xdyp1068

Many large-scale proficiency assessments that use human raters as part of their scoring procedures struggle to offer regular face-to-face rater training workshops for new raters in different locations around the world. A number of testing agencies have therefore introduced online rater training systems in order to reach raters in a larger number of locations and from different contexts; potential raters also have more flexibility to complete the training in their own time and at their own pace. This paper describes the collaborative evaluation of a new online rater training module developed for a large-scale international language assessment. The longitudinal evaluation focussed on two key points in the development process of the new program. The first, involving scrutiny of the online program, took place when the site was close to completion; the second, an empirical evaluation, followed the training of the first trial cohort of raters. The main purpose of this paper is to detail some of the complexities of completing such an evaluation within the operational demands of rolling out a new system and to comment on the advantages of the collaborative nature of such a project.
Maintaining the connection between test and context: A language test for university admission
J. Pill. Studies in Language Assessment, 2016. https://doi.org/10.58379/upin8160

This paper reflects on a review of an existing English language examination used for admission to an English-medium university in a non-English-dominant context. Studying how well an established test sits in its present context may highlight environmental changes causing gaps and points of friction; such an evaluation therefore provides a baseline understanding from which to move forward. From the 1960s to the 1980s, experts developed an examination for applicants to the American University of Beirut (AUB) that was similar to the Test of English as a Foreign Language (TOEFL) of that time. The AUB English Entrance Examination has remained relatively unchanged since then. Concern about its effectiveness prompted a recent review, providing an opportunity to study the consequences of employing a test not fully adapted to its current use. The review found differences in what is, and was, viewed as appropriate test format and content, and in definitions of language proficiency. It also noted unwarranted assumptions about the comparability of results from different tests. Current language practices at the university, in the region, and in the globalized workplace where graduates subsequently seek employment differ from those assumed when the test was first developed. This indicates the need for test revision and, for example, the potential benefit of developing an institutional language policy.
Using evaluation to promote change in language teacher practice
R. Erlam. Studies in Language Assessment, 2016. https://doi.org/10.58379/wxpg2438

Recent literature in teacher education has argued for a shift away from the development of teacher cognitions as a goal of teacher education towards the development of core practices that would make a difference to students' lives in the classroom (Ball & Forzani, 2009; Kubanyiova & Feryok, 2015; Zeichner, 2012). Hiebert and Morris (2012) propose that these key practices be embedded into instructional contexts and preserved as lesson plans and common assessments. This paper focuses on the evaluation tools developed for an in-service professional development programme for language teachers, the Teacher Professional Development Languages (TPDL) programme (http://www.tpdl.ac.nz/). TPDL is a year-long programme for teachers of foreign languages in New Zealand schools. Programme participants are visited by TPDL In-School support facilitators four times during the year. The facilitators observe their teaching practice and then use two key documents, the 'Evidence of Principles and Strategies (EPS) portfolio' and the 'Progress Standards', to assist teachers to evaluate their practice against key criteria. As the year progresses, teachers are increasingly encouraged to take ownership and control of the use of these tools, so that by Visit 4 the evaluation is conducted as a self-assessment. This paper evaluates these tools and considers evidence for their validity. Data are presented from the case study of one teacher to further demonstrate how the tools are used and to document evidence of any change in teaching practice.
Interaction in a paired oral assessment: Revisiting the effect of proficiency
Young-A Son. Studies in Language Assessment, 2016. https://doi.org/10.58379/lzzz5040

Paired oral assessments have gained increasing popularity as a method of assessing speaking skills (East, 2015; Galaczi, 2014). Several advantages have been associated with this method, including practicality and authenticity (Taylor, 2003). Nevertheless, concerns have also been raised about the interlocutor effect in paired speaking tests, particularly with regard to the interlocutor's oral proficiency (e.g., Norton, 2005). The present study reports on an approximate replication of Davis (2009), who examined the effect of interlocutor proficiency on paired speaking assessments. The current study compared the oral performance of 24 university students in two pairing conditions: once paired with a partner of the same proficiency level and once with a partner of a different proficiency level. The results partially confirmed Davis's (2009) findings. There were only minimal differences in test-takers' scores between the two conditions, and a multi-facet Rasch analysis confirmed that the pairing conditions were equivalent in difficulty. There were, however, observable differences in the quantity of talk depending on the interlocutor's proficiency. Unlike Davis (2009), this study found that low-proficiency test-takers produced fewer words when paired with high-proficiency partners. Even though the number of words test-takers produced was affected by their partner's proficiency, their performance scores remained constant.
{"title":"S. Gollin-Kies, D. R. Hall and S. H. Moore. Language for Specific Purposes","authors":"A. Koschade","doi":"10.58379/erbg3448","DOIUrl":"https://doi.org/10.58379/erbg3448","url":null,"abstract":"<jats:p>n/a</jats:p>","PeriodicalId":29650,"journal":{"name":"Studies in Language Assessment","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82764723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Authentic interaction and examiner accommodation in the IELTS speaking test: A discussion
A. Filipi. Studies in Language Assessment, 2015. https://doi.org/10.58379/kdbk9824

Speakers naturally adjust their speech in interaction with others and will use accommodative strategies if their co-speaker is having difficulty understanding. The same adjustments have also been found in examiner accommodation in second language speaking tests (Cafarella, 1997; Ross, 1992). In the training of examiners for the IELTS speaking test, there is an attempt to control the degree of examiner accommodation in the interests of consistency: examiners are explicitly instructed to avoid the use of response tokens and, in the face of repair, to repeat a question only once without rephrasing it (Seedhouse & Egbert, 2006). This deliberate attempt to remove aspects of what is deemed to be authentic interactional behaviour runs counter to what speakers do 'in the wild', as the growing body of research in conversation analysis shows (see, for example, Hutchby & Wooffitt, 2008). We believe that it is timely to discuss the issue of examiner accommodation within a language-testing context against a backdrop of what is now known about naturally occurring interaction. We initiate such a discussion by reviewing the scholarly literature on interaction, on the IELTS speaking test, and on examiner accommodation.
DIF investigations across groups of gender and academic background in a large-scale high-stakes language test
Xia-li Song, Liying Cheng & D. Klinger. Studies in Language Assessment, 2015. https://doi.org/10.58379/rshg8366

High-stakes pre-entry language testing is the predominant tool used to measure test takers' proficiency for admission purposes in higher education in China. Given the important role of these tests, there is heated discussion about how to ensure test fairness for different groups of test takers. This study examined the fairness of the Graduate School Entrance English Examination (GSEEE), which is used to decide whether over one million test takers can enter master's programs in China. Using SIBTEST and content analysis, the study investigated differential item functioning (DIF) and the presence of potential bias on the GSEEE with respect to gender and academic background groups. The results showed that a large percentage of GSEEE items did not provide reliable results to distinguish good and poor performers. A number of items and item bundles displayed DIF and differential bundle functioning (DBF), and three test reviewers identified factors such as motivation and learning styles that potentially contributed to group performance differences. However, consistent evidence was not found to suggest that these flagged items and texts exhibited bias. While systematic bias may not have been detected, the results revealed poor test reliability, and the study highlights an urgent need to improve test quality and clarify the purpose of the test. DIF issues may be revisited once test quality has been improved.
{"title":"Ducasse, A. M. Interaction in paired oral proficiency assessment in Spanish. Language Testing and Evaluation Series","authors":"C. Inoue","doi":"10.58379/edcu5184","DOIUrl":"https://doi.org/10.58379/edcu5184","url":null,"abstract":"<jats:p>n/a</jats:p>","PeriodicalId":29650,"journal":{"name":"Studies in Language Assessment","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82659753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accuplacer Companion in a foreign language context: An argument-based validation of both test score meaning and impact
Robert C. Johnson & A. Riazi. Studies in Language Assessment, 2015. https://doi.org/10.58379/vavb1448

Use of a single, standardised instrument to make high-stakes decisions about test-takers is pervasive in higher education around the world, including in English as a foreign language (EFL) contexts. Contrary to longstanding best practice, however, few test users endeavour to meaningfully validate the instrument(s) they use for their specific context and purposes. This study reports efforts to validate a standardised placement test used in a US-accredited higher education institution in the Pacific to exempt, exclude, or place students within its Developmental English Program. A hybrid of two validation structures – Kane's (1992, 1994) interpretive model and Bachman's (2005) and Bachman and Palmer's (2010) assessment use argument – and a broad range of types and sources of evidence were used to ensure a balanced focus on both test score interpretation and test utilisation. Outcomes establish serious doubt as to the validity of the instrument for the local context. Moreover, the results provide valuable insights regarding the dangers of not evaluating the validity of an assessment for the local context, the relative strengths and weaknesses of standardised tests used for placement, and the value of argument-based validation.