Pub Date: 2025-10-14, DOI: 10.1016/j.rmal.2025.100272
Shawn Hemelstrand, Tomohiro Inoue
It is common in the language sciences to dichotomize continuous data before fitting models. However, statisticians and methodologists have warned against this practice for years, and many in the language sciences seem unaware of the problem. Because the language science literature lacks modern, robust, and open data simulations on this issue, this article provides an empirical investigation of the practice. Across three different simulations, our analysis shows that dichotomization almost universally increases standard errors and consequently leads to inaccurate tests of statistical significance. Furthermore, effect sizes such as R² are often diminished by the reduction of available information in the data. We conclude by providing suggestions and considerations for future empirical studies.
{"title":"Stop splitting hairs: The problems with dichotomizing continuous data in language research","authors":"Shawn Hemelstrand , Tomohiro Inoue","doi":"10.1016/j.rmal.2025.100272","DOIUrl":"10.1016/j.rmal.2025.100272","url":null,"abstract":"<div><div>It is common in the language sciences to dichotomize continuous data in order to fit models to data. However, several statisticians and methodologists have warned against this practice for years. Many in the language sciences seem unaware of this problem. Because of the lack of modern, robust, and open data simulations related to this issue in the language science literature, this article provides an empirical investigation of this practice. Across three different simulations, our analysis shows that dichotomization almost universally increases the standard errors, and consequently leads to inaccuracy of tests of statistical significance. Furthermore, effect sizes like <span><math><msup><mi>R</mi><mn>2</mn></msup></math></span> are often diminished by the reduction of available information in the data. We conclude by providing suggestions and considerations for future empirical studies.</div></div>","PeriodicalId":101075,"journal":{"name":"Research Methods in Applied Linguistics","volume":"4 3","pages":"Article 100272"},"PeriodicalIF":0.0,"publicationDate":"2025-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145319782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-10, DOI: 10.1016/j.rmal.2025.100265
Xian Zhang
This study demonstrates how to use confirmatory bifactor analysis (CbFA) and omega-family indices to evaluate the dimensionality of the Foreign Language Classroom Anxiety Scale (FLCAS). The FLCAS is generally regarded as highly reliable, a view supported by the high Cronbach's alpha values often reported in the literature. However, because the FLCAS has often been shown to be multidimensional, it remains unclear whether the scale can measure a general construct in the presence of multidimensionality. Confirmatory bifactor modeling can be used to assess whether an instrument measures a single general psychological construct while accounting for multidimensionality. The model posits a general factor that explains the shared variance across all items, along with specific factors that capture the unique variance within subsets of items. With CbFA, the dimensionality of a factor structure can then be examined closely with statistics such as construct replicability, explained common variance, and omega-family indices (e.g., Reise, 2012). In this demonstration, I show that a smaller subset of FLCAS items effectively measures the general FLA construct, with the general factor explaining the largest portion of the model-wise variance. I then present recommendations for determining when aggregated scores from a reduced item set can reliably represent the general construct while preserving essential psychometric properties. Finally, I discuss key considerations for applying and interpreting CbFA in foreign language research.
{"title":"Applying the confirmatory bifactor modeling and omega-family indices: The case of the foreign language classroom anxiety scale","authors":"Xian Zhang","doi":"10.1016/j.rmal.2025.100265","DOIUrl":"10.1016/j.rmal.2025.100265","url":null,"abstract":"<div><div>This study demonstrates how to use confirmatory bifactor analysis (CbFA) and omega-family indices to evaluate the dimensionality of the Foreign Language Classroom Anxiety Scale (FLCAS). The FLCAS is generally regarded to have high reliability, which is supported by high Cronbach's alpha values often observed in the literature. However, as the FLCAS has often been shown to have multidimensional constructs, it remains unclear if the scale can measure a general construct in the presence of multidimensionality. Confirmatory bifactor modeling can be used to assess whether an instrument can measure a single general psychological construct while accounting for multidimensionality. The model posits a general factor that explains the shared variance across all items, along with specific factors that capture the unique variance within subsets of items. With CbFA, the dimensionality of a factor structure can then be closely examined with statistics such as construct replicability, explained common variance, and omega family indices (e.g., Reise, 2012). In this demonstration, I will show that a smaller subset of FLCAS items effectively measures the general FLA construct, with the general factor explaining the largest portion of the model-wise variance. I will then present recommendations for determining when aggregated scores from a reduced item set can reliably represent the general construct while preserving essential psychometric properties. Finally, I will discuss key considerations for applying and interpreting CbFA in foreign language research.</div></div>","PeriodicalId":101075,"journal":{"name":"Research Methods in Applied Linguistics","volume":"4 3","pages":"Article 100265"},"PeriodicalIF":0.0,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145264664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-09, DOI: 10.1016/j.rmal.2025.100269
Wilson Cheong Hin Hong
Despite five decades of research into error gravity (EG) in writing, the field remains characterized by contradictory findings and limited practical applications. This methodological review critically examines the methods used across this fragmented research landscape and their impact on findings. Through a two-phase approach—a selective review of studies from 1970–2015 (n = 21) and a PRISMA review of 2016–2025 research (n = 16)—four key constructs that have shaped the field are identified: reader perceptions, comprehension, awareness/sensitivity, and processing effort. The analyses reveal significant methodological limitations that have hindered progress, including over-reliance on subjective assessments, inconsistent error categorization, limited participant representativeness, and a lack of theoretical grounding. The research trajectory shows a shift in focus from comprehension in early studies to reader preferences in the 2000s, followed by renewed interest in communication effectiveness and the emergence of processing effort as a novel yet critical construct, despite the persistent dominance of subjective ratings. Recommendations for future research are proposed, including potential theories to frame studies, the adoption of direct and validated methods, and a shift in focus to non-teacher participant populations. Only with genuine methodological advancement can this strand of research meaningfully inform L2/FL pedagogy and curriculum, providing evidence-based guidance for prioritizing certain L2 issues in instruction.
{"title":"Revisiting the impact of errors in L2/FL writing: A methodological review of five decades of research","authors":"Wilson Cheong Hin Hong","doi":"10.1016/j.rmal.2025.100269","DOIUrl":"10.1016/j.rmal.2025.100269","url":null,"abstract":"<div><div>Despite five decades of research into error gravity (EG) in writing, the field remains characterized by contradictory findings and limited practical applications. This methodology review critically examines the methods and their impact on findings across this fragmented research landscape. Through a two-phase approach—a selective review of studies from 1970–2015 (<em>n</em> = 21) and a PRISMA review of 2016–2025 research (<em>n</em> = 16)—four key constructs that have shaped the field are identified: reader perceptions, comprehension, awareness/sensitivity, and processing effort. Analyses reveal significant methodological limitations that have hindered progress, including over-reliance on subjective assessments, inconsistent error categorization, limited participant representativeness, and lack of theoretical grounding. The research trajectory shows a shifting focus from comprehension in early studies to reader preferences in the 2000s, followed by renewed interest in communication effectiveness and the emergence of processing effort as a novel yet critical construct, despite the persistent dominance of subjective ratings. Recommendations for future research are proposed, including the potential theories to frame studies, adoption of direct and validated methods and shifting focus to non-teacher participant populations. Only with genuine methodological advancements can this strand of study meaningfully inform L2/FL pedagogy and curriculum, providing evidence-based guidance for prioritizing certain L2 issues in instructions.</div></div>","PeriodicalId":101075,"journal":{"name":"Research Methods in Applied Linguistics","volume":"4 3","pages":"Article 100269"},"PeriodicalIF":0.0,"publicationDate":"2025-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145264724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-08, DOI: 10.1016/j.rmal.2025.100274
Yanlu Zhong, Simon Todd, Nicole Xu, Laurel Brehm
Psycholinguistic research has traditionally relied on human ratings for stimulus norming, but whether large language models (LLMs) can reliably replace human ratings remains uncertain. This study compares human participants and three LLMs—one proprietary model (ChatGPT-4o) and two open-source models (LLaMA-3.3-70B and DeepSeek-V3)—with respect to their statistical knowledge of English binomials. For each binomial, we obtained ratings of frequency, dispersion, forward association strength, and backward association strength from 34 human participants and from 30 output samples per LLM. We examined rating-to-corpus consistency (consistency_rating2corpus), the sensitivity of statistical ratings to corpus data, and the influence of other psycholinguistic factors on ratings. All LLMs' statistical knowledge broadly mirrored that of humans. Ratings from both groups were sensitive to corpus data but not fully consistent with it. Frequency showed the highest consistency_rating2corpus, whereas dispersion showed the lowest consistency_rating2corpus and the weakest sensitivity. LLM ratings were also influenced by word-level cues. Nonetheless, LLM ratings showed greater consistency_rating2corpus, heightened sensitivity, and stronger reliance on other psycholinguistic cues than human ratings. Overall, while LLMs' performance generally aligned with that of humans, their internal statistical representations differed significantly from human cognition. The three LLMs also showed variation in their rating behavior. Thus, although multi-LLM ratings can aid pilot studies in psycholinguistics, they should not replace human ratings in formal experiments.
{"title":"Evaluating LLMs as proxies for humans in psycholinguistic ratings: A comparison of statistical knowledge","authors":"Yanlu Zhong, Simon Todd, Nicole Xu, Laurel Brehm","doi":"10.1016/j.rmal.2025.100274","DOIUrl":"10.1016/j.rmal.2025.100274","url":null,"abstract":"<div><div>Psycholinguistic research has traditionally relied on human ratings for stimulus norming, but whether large language models (LLMs) can reliably replace human ratings remains uncertain. This study compares human participants and three LLMs—one proprietary model (ChatGPT-4o) and two open-source models (LLaMA-3.3–70B and DeepSeek-V3)—with respect to their statistical knowledge of English binomials. For each binomial, we obtained ratings of frequency, dispersion, forward association strength, and backward association strength from 34 human participants and from 30 output samples per LLM. We examined rating-to-corpus consistency (consistency<sub>rating2corpus</sub>), the sensitivity of statistical ratings to corpus data, and the influence of other psycholinguistic factors on ratings. All LLMs’ statistical knowledge broadly mirrored that of humans. Ratings from both groups were sensitive to corpus data but not fully consistent with it. Frequency showed the highest consistency<sub>rating2corpus</sub>, whereas dispersion showed the lowest consistency<sub>rating2corpus</sub> and the weakest sensitivity. LLM ratings were also influenced by word-level cues. Nonetheless, LLM ratings showed greater consistency<sub>rating2corpus</sub>, heightened sensitivity, and stronger reliance on other psycholinguistic cues than human ratings. Overall, while LLMs’ performance generally aligned with that of humans, their internal statistical representations differed significantly from human cognition. The three LLMs also showed variation in their rating behavior. Thus, although multi-LLM ratings can aid pilot studies in psycholinguistics, they should not replace human ratings in formal experiments.</div></div>","PeriodicalId":101075,"journal":{"name":"Research Methods in Applied Linguistics","volume":"4 3","pages":"Article 100274"},"PeriodicalIF":0.0,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145264725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-08, DOI: 10.1016/j.rmal.2025.100273
Akbar A. Jahanbakhsh, Zahra Banitalebi, Jenifer Larson-Hall, Aya Shiiba
A growing body of research highlights the importance of robust methodologies in ensuring the validity and reliability of research findings; yet concerns remain regarding the quality of quantitative studies within second language (L2) research. To address this gap, this study systematically analyzed a corpus of quantitative articles to assess the extent to which they adhere to established methodological best practices. The corpus comprises 791 interventionist quantitative articles published over 12 years in 8 journals selected for their high impact factor and relevance to key areas within applied linguistics. A detailed coding scheme was developed to evaluate the articles across several crucial methodological dimensions, including sampling and design issues, types of statistical analyses, the statistical assumptions that need to be checked, reporting practices, and visual presentation of data. The findings revealed that while improvements were evident in some areas, such as design-related issues and reporting practices, attention to sampling issues such as power analysis, to data sharing, and to the use of data-rich, accountable visuals remains scarce, with no significant improvement over time. These results highlight areas where greater methodological rigor is needed to enhance the credibility and generalizability of quantitative research in applied linguistics. The study promotes best practices in research design and reporting, informs the development of guidelines for future research, and fosters a more critical and reflective approach to interpreting quantitative findings in the field.
{"title":"Methodological rigor in quantitative L2 research: A focus on interventionist experimental studies","authors":"Akbar A. Jahanbakhsh , Zahra Banitalebi , Jenifer Larson-Hall , Aya Shiiba","doi":"10.1016/j.rmal.2025.100273","DOIUrl":"10.1016/j.rmal.2025.100273","url":null,"abstract":"<div><div>An increasing bulk of research highlights the importance of robust methodologies in ensuring the validity and reliability of research findings; yet, concerns remain regarding the quality of quantitative studies within second language (L2) research. To address the gap, this study aimed to systematically analyze a corpus of quantitative articles to assess the extent to which they adhere to established methodological best practices. The corpus comprises 791 interventionist quantitative articles published over 12 years in 8 journals selected based on their high impact factor and relevance to key areas within applied linguistics. A detailed coding scheme was developed to evaluate the articles across several crucial methodological dimensions, including sampling and design issues, types of statistical analyses, the necessary statistical assumptions to be checked, reporting practices, and visual presentation of data. The findings revealed that while improvements were evident in some areas, such as design-related issues and reporting practices, there is still a lack/shortage of attention to sampling issues like power analysis, practicing data sharing, and using data-rich/accountable visuals, with no significant improvement over time. These results highlighted areas where improvements in methodological rigor are needed to enhance the credibility and generalizability of quantitative research in applied linguistics. The study promotes best practices in research design and reporting, informs the development of guidelines for future research, and fosters a more critical and reflective approach to interpreting quantitative findings in the field.</div></div>","PeriodicalId":101075,"journal":{"name":"Research Methods in Applied Linguistics","volume":"4 3","pages":"Article 100273"},"PeriodicalIF":0.0,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145264686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-30, DOI: 10.1016/j.rmal.2025.100271
Xiuming Wang, Shan Chen, Yuanzhao Ding
Pitch doubling is a pitch detection phenomenon in which an algorithm incorrectly identifies the frequency of a note as either double or half of its actual value, representing one of the major pitfalls for pitch detection accuracy. To review the literature on pitch doubling, this study searched the Web of Science Core Collection and systematically filtered relevant studies. Using VOSviewer bibliometric visualization, the research examined trends based on keywords, institutions, and countries or regions in the pitch doubling research field. Drawing on seminal contributions, the paper describes the underlying causes of pitch doubling (e.g., harmonic interference) and reviews existing mitigation methods (e.g., improved pitch detection algorithms). The weaknesses of current approaches are identified, and conclusions are provided to inform the development of more effective solutions to pitch doubling.
{"title":"Bibliographic analysis in solving pitch doubling issues","authors":"Xiuming Wang , Shan Chen , Yuanzhao Ding","doi":"10.1016/j.rmal.2025.100271","DOIUrl":"10.1016/j.rmal.2025.100271","url":null,"abstract":"<div><div>Pitch doubling is a pitch detection phenomenon in which an algorithm incorrectly identifies the frequency of a note as either double or half of its actual value, representing one of the major pitfalls for pitch detection accuracy. To review the literature on pitch doubling, this study searched the Web of Science Core Collection and systematically filtered relevant studies. Using VOSviewer bibliometric visualization, the research examined trends based on keywords, institutions, and countries or regions in the pitch doubling research field. Drawing on seminal contributions, the paper describes the underlying causes of pitch doubling (e.g., harmonic interference) and reviews existing mitigation methods (e.g., improved pitch detection algorithms). The weaknesses of current approaches are identified, and conclusions are provided to inform the development of more effective solutions to pitch doubling.</div></div>","PeriodicalId":101075,"journal":{"name":"Research Methods in Applied Linguistics","volume":"4 3","pages":"Article 100271"},"PeriodicalIF":0.0,"publicationDate":"2025-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145219397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-30, DOI: 10.1016/j.rmal.2025.100270
Larry Xethakis, Michael Rupp, Oliver Edwards, Mark Howarth, Toshikazu Kawagoe
The study of positive emotions and their influence on language learning has gained considerable attention recently, with foreign language enjoyment (FLE) being one of the most-studied emotions. The Short-form Foreign Language Enjoyment Scale (S-FLES) is a popular measure of enjoyment; however, it has yet to be validated for use in the Japanese context. This study aimed to address this gap by comparing hierarchical and bifactor confirmatory factor analysis (CFA) models with analogous models employing the innovative technique of exploratory structural equation modeling (ESEM). Responses from 536 undergraduate EFL learners were analyzed, with results indicating that the fit of the ESEM models was superior to that of the CFA models. The bifactor ESEM was chosen as the most suitable model of the S-FLES on the basis of its better convergent validity, divergent validity, and reliability, as well as its measurement quality. Invariance testing supported the bifactor model's configural invariance, as well as its partial metric and scalar invariance across gender. The relationship between the bifactor model and social-behavioral engagement was evaluated as a measure of the S-FLES's concurrent validity. The model exhibited a very strong degree of predictive power, with the general factor accounting for the greatest share of variance in social-behavioral engagement. The bifactor model of the S-FLES was shown to be a valid and reliable measure of FLE among Japanese undergraduate EFL learners, providing further support for the use of ESEM in evaluating positive psychological instruments.
{"title":"Validating the Japanese version of the short-form foreign language enjoyment scale","authors":"Larry Xethakis , Michael Rupp , Oliver Edwards , Mark Howarth , Toshikazu Kawagoe","doi":"10.1016/j.rmal.2025.100270","DOIUrl":"10.1016/j.rmal.2025.100270","url":null,"abstract":"<div><div>The study of positive emotions and their influence on language learning has gained considerable attention recently, with foreign language enjoyment being one of the most-studied emotions. The Short-form Foreign Language Enjoyment Scale (S-FLES) is a popular measure of enjoyment, however, this measure has yet to be validated for use in the Japanese context. This study aimed to address this gap by comparing hierarchical and bifactor confirmatory factor analysis (CFA) models, with analogous models employing the innovative technique of exploratory structural equation modeling (ESEM). Responses from 536 undergraduate EFL learners were used in the analysis of the models, with results indicating that the fit of the ESEM models were superior to that of the CFA models. The bifactor ESEM was chosen as the most suitable model of the S-FLES on the basis of its better convergent validity, divergent validity, and reliability, as well as its measurement quality. Invariance testing supported the bifactor model’s configural invariance, as well as its partial metric and scalar invariance across gender. The relationship between the bifactor model and social-behavioral engagement was evaluated as a measure of the S-FLES’s concurrent validity. The model exhibited a very strong degree of predictive power, with the general factor accounting for the greatest degree of variance in social-behavioral engagement. The bifactor model of the S-FLES was shown to be a valid and reliable measure of FLE among Japanese undergraduate EFL learners, providing further support to the use of ESEM in evaluating positive psychological instruments.</div></div>","PeriodicalId":101075,"journal":{"name":"Research Methods in Applied Linguistics","volume":"4 3","pages":"Article 100270"},"PeriodicalIF":0.0,"publicationDate":"2025-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145219402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-30, DOI: 10.1016/j.rmal.2025.100266
Wan Yee Winsy Lai, Paul Kim, Ju Seong Lee
As Generative AI (GenAI) technologies advance rapidly, educational settings face an urgent need for targeted interventions to cultivate learners’ critical, higher-order inquiry skills, so they can effectively navigate, assess, and apply AI-generated content. The urgency of this imperative is magnified for EFL learners in test-driven educational contexts that foster passive learning behaviors, discourage questioning, and inhibit critical thinking. To address these issues, we developed an AI-powered tool designed to evaluate questions based on Bloom’s Taxonomy, a six-level framework of cognitive processes, ranging from basic recall questions (Level 1) to advanced questions that trigger creative and evaluative thinking (Level 5). In study 1, the reliability of the tool was confirmed through multiple inter-rater tests with strong agreement. In study 2, we implemented an intervention program that integrated Bloom’s Taxonomy, targeted readings, group discussions, and sharing to enhance inquiry skills among EFL undergraduate students. Four statistical analyses in SPSS 29.0—including ICC for inter-rater reliability, Pearson correlation, and regression—were conducted to validate the AI-powered inquiry evaluation tool. Across 174 questions, students’ average inquiry level improved from 3.3 to 4.1 (on a five-level scale), showing a significant 0.8-level increase and meaningful enhancement in question quality. The study provides solid evidence of the reliability and validity of the AI-powered inquiry evaluation tool as an objective, real-time method that enhances the efficiency, consistency, and scalability of assessments, offering valuable guidance for EFL practitioners, curriculum designers, researchers, educators, and institutions in integrating evidence-based, inquiry-driven tools into EFL programs.
{"title":"Designing and validating an AI-supported tool for enhancing critical inquiry in EFL education","authors":"Wan Yee Winsy Lai , Paul Kim , Ju Seong Lee","doi":"10.1016/j.rmal.2025.100266","DOIUrl":"10.1016/j.rmal.2025.100266","url":null,"abstract":"<div><div>As Generative AI (GenAI) technologies advance rapidly, educational settings face an urgent need for targeted interventions to cultivate learners’ critical, higher-order inquiry skills, so they can effectively navigate, assess, and apply AI-generated content. The urgency of this imperative is magnified for EFL learners in test-driven educational contexts that foster passive learning behaviors, discourage questioning, and inhibit critical thinking. To address these issues, we developed an AI-powered tool designed to evaluate questions based on Bloom’s Taxonomy, a six-level framework of cognitive processes, ranging from basic recall questions (Level 1) to advanced questions that trigger creative and evaluative thinking (Level 5). In study 1, the reliability of the tool was confirmed through multiple inter-rater tests with strong agreement. In study 2, we implemented an intervention program that integrated Bloom’s Taxonomy, targeted readings, group discussions, and sharing to enhance inquiry skills among EFL undergraduate students. Four statistical analyses in SPSS 29.0—including ICC for inter-rater reliability, Pearson correlation, and regression—were conducted to validate the AI-powered inquiry evaluation tool. Across 174 questions, students’ average inquiry level improved from 3.3 to 4.1 (on a five-level scale), showing a significant 0.8-level increase and meaningful enhancement in question quality. The study provides solid evidence of the reliability and validity of the AI-powered inquiry evaluation tool as an objective, real-time method that enhances the efficiency, consistency, and scalability of assessments, offering valuable guidance for EFL practitioners, curriculum designers, researchers, educators, and institutions in integrating evidence-based, inquiry-driven tools into EFL programs.</div></div>","PeriodicalId":101075,"journal":{"name":"Research Methods in Applied Linguistics","volume":"4 3","pages":"Article 100266"},"PeriodicalIF":0.0,"publicationDate":"2025-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145219400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-29, DOI: 10.1016/j.rmal.2025.100264
Mathias Russnes
This article investigates the inter-rater reliability of established methods of categorising semantic prosody. Semantic prosody, a concept associated with corpus linguistics, describes the tendency of seemingly neutral items to occur in particular evaluative contexts. Previous research on semantic prosody has relied heavily on manual analysis of relatively small samples, which has raised questions about the stability of the established methods of categorisation. Furthermore, there is a lack of consensus regarding how such categorisations should be operationalised. Traditionally, semantic prosody has often been viewed in binary terms, distinguishing between positive and negative prosodies. However, this restricted system has also received criticism, and some researchers have adopted a more comprehensive (or fine-grained) categorisation, more closely connected to a unit's semantic preference. This paper evaluates the inter-analyst consistency of these systems through two experimental studies in which four researchers independently analyse the same set of random concordance lines for the items habit and views from the BNC2014, applying both methods of categorisation. The results indicate that a binary distinction between positive and negative offers higher inter-analyst consistency than a more fine-grained categorisation. In addition, the more comprehensive system was found to obscure the border between semantic preference and semantic prosody. However, because neither system achieved satisfactory inter-rater agreement, both studies highlight the need for more objective methods of analysing and categorising semantic prosody.
{"title":"Semantic prosody, categorisation and inter-rater reliability","authors":"Mathias Russnes","doi":"10.1016/j.rmal.2025.100264","DOIUrl":"10.1016/j.rmal.2025.100264","url":null,"abstract":"<div><div>This article investigates the inter-rater reliability of established methods of categorising semantic prosody. Semantic prosody is a concept associated with corpus linguistics, which describes the tendency of seemingly neutral items to occur in particular evaluative contexts. In previous research on semantic prosody, there has been a heavy reliance on manual analysis of smaller samples, and because of this, questions have been raised about the stability of the established methods for categorisation. Furthermore, there is also a lack of consensus regarding how such categorisations should be operationalised. Traditionally, it has often been viewed in binary terms, distinguishing between <em>positive</em> and <em>negative</em> prosodies. However, this restricted system has also received criticism, and certain researchers have adopted a more comprehensive (or fine-grained) categorisation, more connected to a unit’s semantic preference. This paper aims to evaluate the inter-analyst consistency of these systems through two experimental studies, in which four researchers independently analyse the same set of random concordance lines of the items <em>habit</em> and <em>views</em> from BNC2014, applying both methods of categorisation. The results indicate that a binary distinction between <em>positive</em> and <em>negative</em> offers a higher inter-analyst consistency than a more fine-grained categorisation. Additionally, this more comprehensive system was also found to obscure the borders between semantic preference and semantic prosody. However, because neither system achieved satisfactory inter-rater agreement, both studies highlight the need for more objective methods of analysing and categorising semantic prosody.</div></div>","PeriodicalId":101075,"journal":{"name":"Research Methods in Applied Linguistics","volume":"4 3","pages":"Article 100264"},"PeriodicalIF":0.0,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145219401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-26, DOI: 10.1016/j.rmal.2025.100268
Wen Xin, Lei Jiang
Metadiscourse has predominantly been studied in monologic written academic discourse, especially research articles and student writing, largely due to the influence of two widely adopted models of metadiscourse developed by Hyland (2005) and Ädel (2006). In this article, we illustrate several conceptual and methodological challenges involved in applying these two influential models to the genre of written feedback, a non-traditional, dialogic academic genre that both depends on and responds to another genre (student writing). We conclude by proposing potential pathways for addressing these conceptual and methodological challenges; these pathways may also be applicable to other written dialogic genres.
{"title":"From monologic to dialogic: Conceptual and methodological issues in metadiscourse studies","authors":"Wen Xin, Lei Jiang","doi":"10.1016/j.rmal.2025.100268","DOIUrl":"10.1016/j.rmal.2025.100268","url":null,"abstract":"<div><div>Metadiscourse has predominantly been studied in monologic written academic discourse, especially in research articles and student writing, largely due to the influence of two widely adopted models of metadiscourse developed by Hyland (2005) and Ädel (2006). In this article, we illustrate several conceptual and methodological challenges involved in implementing the two influential models into the genre of written feedback, a non-traditional, dialogic academic genre that both depends on and responds to another genre (student writing). We conclude by proposing potential pathways for addressing these conceptual and methodological challenges. The pathways may also be applicable to other written dialogic genres.</div></div>","PeriodicalId":101075,"journal":{"name":"Research Methods in Applied Linguistics","volume":"4 3","pages":"Article 100268"},"PeriodicalIF":0.0,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145157422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}