Pub Date: 2025-12-01, DOI: 10.1016/j.acorp.2025.100167
Xurui Ling, Xingbing Liu
Adolescent romance in China navigates a complex sociocultural landscape, marked by a discursive tension between evolving developmental understanding and deeply entrenched anxieties. This study employs a corpus-assisted critical discourse analysis of 1,740 WeChat articles and Sina Weibo posts (2020–2024) to investigate contemporary representations of adolescent romance. Findings reveal a ‘negotiated space’ where the increasing normalization of adolescent romance directly contends with historical apprehensions. A notable shift is observed from 20th-century criticism-dominated narratives towards a contemporary discourse featuring emergent supportive voices that reframe romance as integral to development. While critical views persist, their focus has transformed from broad moral condemnations to specific concerns about academic impact and adolescents’ psycho-emotional maturity. These insights are valuable for educators, policymakers, and researchers aiming to understand contemporary China’s social dynamics and youth development, offering a perspective on how digital platforms shape and reflect societal attitudes toward adolescent romance.
Title: “Too young to love”: A corpus-assisted critical discourse analysis of adolescent romance on Chinese social media
Journal: Applied Corpus Linguistics, vol. 5, issue 3, Article 100167
Pub Date: 2025-12-01, DOI: 10.1016/j.acorp.2025.100165
Eniko Csomay, Reka R. Jablonkai, Hui Sun
In this ACORP special issue, contributors explore the rapidly evolving intersection of corpus linguistics and generative artificial intelligence (GenAI) in language learning, teaching, and research. As GenAI tools gain prominence alongside established corpus-based approaches, new pedagogical opportunities and challenges emerge for learners, teachers, and researchers alike. The articles in this issue collectively examine how corpora and GenAI can be integrated to enhance language analysis, genre awareness, writing development, and instructional design. Together, they offer critical insights into how these complementary technologies can inform data-driven learning (DDL), promote critical digital literacies, and reshape the future of language education and applied linguistics research.
Title: Corpora and AI for inductive learning: Theory and practice
Journal: Applied Corpus Linguistics, vol. 5, issue 3, Article 100165
Pub Date: 2025-11-26, DOI: 10.1016/j.acorp.2025.100172
Meng Huat Chau
This article revisits key insights from corpus linguistics, such as units of meaning and pattern grammar, in dialogue with cognitive linguistic understandings of form–meaning pairings, showing how meaning arises from patterned, contextualized, and emergent use rather than from isolated words. Foregrounding the language learner as a fully legitimate meaning maker alongside expert and other language users, it advances communicative meaningfulness as an ecological model grounded in relational resonance rather than formal accuracy or communicative effectiveness. Drawing on longitudinal corpus evidence from school students, the article demonstrates how learners rework patterned resources to express stance, negotiate values, and enact situated identities, revealing their languaging as meaning-in-motion. It further articulates a transversal–pluriversal turn in applied linguistics: transversal in its crossings of disciplinary, cultural, and linguistic boundaries; pluriversal in its affirmation of diverse epistemologies and ways of knowing. The article concludes that learners’ meaning making contributes to a living relational ecology of communication, positioning the study of corpora, learner languaging, and language as a whole as co-created, evolving, and interrelated resources. Such an orientation not only guides more inclusive, humane, and epistemically diverse practices in corpus linguistics and applied linguistics; it also, importantly, deepens and expands our shared human capacity for understanding and connection.
Title: On corpus linguistics, the search for meaning, and a transversal–pluriversal turn in celebrating learner languaging
Journal: Applied Corpus Linguistics, vol. 6, issue 1, Article 100172
Pub Date: 2025-11-20, DOI: 10.1016/j.acorp.2025.100170
Tiffany Tsz-Yin Pang
Corpus tools have proven effective for supporting inductive language learning by enabling learners to observe multiple examples, form hypotheses, and verify the hypotheses based on additional examples. However, when applied to Chinese as a Second Language (CSL), these tools encounter limitations that disrupt the observe-hypothesize-verify process. Sketch Engine, for example, misanalyzes Chinese word boundaries, topicalized objects, and ba-constructions, and provides inaccurate observational data that undermines the effectiveness of inductive learning. This paper proposes integrating Large Language Models (LLMs) with corpus tools to address the limitations. Using Sketch Engine and Claude Opus 4 as exemplars, I demonstrate how LLMs serve three pedagogical functions: (1) error detection to identify misanalyzed features in corpus outputs, (2) guided pattern discovery to help learners recognize linguistic regularities across examples, and (3) hypothesis verification to confirm/refine learners’ observations. Through analysis of specific Chinese features, I show how LLM integration maintains the discovery processes while ensuring accurate linguistic input for the learners. The proposed corpus-LLM integration represents an advancement in leveraging AI for language pedagogy. The paper concludes with future research directions for optimizing this integration in CSL acquisition, and emphasizes the need to balance technological innovation with pedagogical principles.
Title: Leveraging large language models to supplement corpus-based inductive learning of Chinese as a second language
Journal: Applied Corpus Linguistics, vol. 6, issue 1, Article 100170
Pub Date: 2025-11-19, DOI: 10.1016/j.acorp.2025.100168
Agnieszka Leńko-Szymańska
The integration of corpora and generative artificial intelligence (GenAI) in language teacher education presents both opportunities and challenges. While corpus-based approaches have long been promoted for data-driven learning (DDL), their adoption remains limited due to their complexity and time demands. In contrast, GenAI tools offer immediate, user-friendly access to linguistic data, yet raise concerns about authenticity and reliability. This study compares pre-service teachers’ use of corpora and GenAI in pedagogically oriented language analysis, lesson planning, and materials development. Conducted within a graduate-level course, the study examines student teachers’ approaches to corpus-based and AI-based lesson design, focusing on their ability to retrieve and analyse linguistic data, plan lessons, create learning materials, and reflect on the effectiveness of these tools. Findings indicate the considerable potential of both corpora and GenAI for supporting data-informed, inductive approaches to language learning and teaching. Yet, the results also reveal that while pre-service teachers demonstrated operational proficiency in using both tools, they struggled to extract meaningful linguistic insights and integrate their findings into cohesive pedagogical frameworks. The study highlights the need for targeted training to develop teachers’ analytical and pedagogical skills in working with both types of resources. Ultimately, it argues that rather than replacing corpora, GenAI should complement data-driven learning, reinforcing the importance of linguistic accuracy and pedagogical soundness in technology-enhanced language teaching.
Title: Not so fast? A comparative study of pre-service teachers’ lesson design using corpora and generative artificial intelligence
Journal: Applied Corpus Linguistics, vol. 6, issue 1, Article 100168
Pub Date: 2025-11-19, DOI: 10.1016/j.acorp.2025.100171
Yejin Jung, Kathy MinHye Kim
Technological innovations can greatly enhance second language (L2) pragmatics instruction by providing learners with more natural and authentic communication opportunities. As Generative Artificial Intelligence (GenAI) tools become increasingly integrated into L2 teaching, questions arise as to whether they provide pedagogically appropriate input and how they can be used for inductive instruction (e.g., Data-driven Learning). To advance meaningful instructional approaches to Korean honorifics, understanding the nature of input is key; particularly, what exemplars of honorifics are available through GenAI and spoken corpora and how L2 learners perceive and evaluate different honorific forms. In response to these inquiries, we analyzed patterns of subject-verb honorific agreement in outputs from ChatGPT 4.0 and the NIKL Korean Dialogue Summarization Corpus (Study 1), and conducted an acceptability judgment test of four subject-verb honorific (mis)match forms (Study 2). We found that ChatGPT predominantly favored a subject-verb matched form, whereas corpus data reflected the highly complex, context-dependent use and variations of honorifics. L1 judgments aligned more closely with the corpus results, reflecting sensitivity to nuanced (mis)match forms, whereas L2 judgments closely mirrored ChatGPT’s patterns, lacking sensitivity beyond the matched forms. These results underscore the challenges associated with Korean honorification for both learners and educators, highlighting the need for more refined inductive teaching.
Title: A comparative analysis of AI-generated texts, corpus data, and speaker judgments: Subject honorification patterns in Korean
Journal: Applied Corpus Linguistics, vol. 6, issue 1, Article 100171
Pub Date: 2025-11-16, DOI: 10.1016/j.acorp.2025.100166
Satoshi Yamagata, Gareth Carrol, Crayton Walker
Collocational knowledge is a critical component of second language (L2) learning. However, L2 learners often rely on first language (L1) translations, leading to the production of deviant collocations. To address this issue, this study investigates the pedagogical potential of teaching collocations through multiple CORE meanings (capitalised), in contrast to approaches that rely on a single core meaning of verbal nodes. Multiple CORE meanings are characterised not only by their typical nominal collocates, but also by other aspects of how they typically pattern. While previous accounts have tended to treat high-frequency verbal nodes as polysemous, we argue that many verbal nodes are better understood as examples of homonymy, which carries several semantically distinct CORE meanings (i.e., ‘draw’ meaning ‘to pull or move something’, ‘to divide something into two’, or ‘to make a picture’), and that this might offer a more logical way for learners to discover and learn collocational patterns. We first identified CORE meanings for six high-frequency verbal nodes through corpus-based analysis, and then tested their pedagogical potential with 240 EFL high school learners. Learners were taught verb-noun collocations using either a CORE meaning-based discovery approach or conventional L1 translations, and they completed a pre-test and two post-tests assessing productive recall and collocability judgement. Results showed that CORE meaning-based instruction enhanced productive recall, though the advantage did not extend to collocability judgement. These findings suggest that presenting learners with multiple CORE meanings can be a promising way to strengthen L2 collocational competence, although further refinement in instructional design is warranted.
Title: Exploring the potential of multiple CORE meanings in learning L2 verb-noun collocations: A corpus-based discovery learning approach
Journal: Applied Corpus Linguistics, vol. 6, issue 1, Article 100166
Pub Date: 2025-11-12, DOI: 10.1016/j.acorp.2025.100164
Laurence Anthony
Concordancing is a central method in corpus research. It also plays an important role in the data-driven learning (DDL) classroom, where learners use Key-Word-In-Context (KWIC) analysis to discover and implicitly learn lexical and grammatical patterns in a specific language domain. Effective concordancing requires users to craft precise single- or multi-word queries that capture the language features of interest, but these queries can quickly become complex and error-prone. The method also relies on users selecting, ordering, and grouping results in order to interpret them in a meaningful way, which can also be a significant challenge.
This paper proposes using word and sentence embedding models from the field of AI to facilitate both the querying of corpora and the interpretation of concordance results. First, the paper explains that embeddings are numerical representations of language that capture semantic and contextual information in a high-dimensional vector space. Next, the paper reports on three experiments using pre-trained embedding models (BERT, word2vec) that examine synonym searches, concordance grouping and ordering, and language variation analysis across two general English corpora. The results show that embeddings allow for ‘fuzzy’, nuanced, context-aware searches of corpus data without the need for meticulous query crafting, and enable the grouping and ordering of results in novel, interesting, and useful ways.
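As a rough illustration of the idea described above, and not the paper's own implementation, the following Python sketch shows how an embedding-based ‘fuzzy’ search could feed a KWIC concordance. The vectors in `TOY_EMBEDDINGS` are hand-made, hypothetical 3-dimensional values chosen only for this example; the function names `cosine`, `similar_words`, and `kwic` and the `threshold` parameter are likewise assumptions. A real setup would load vectors from a pre-trained model such as word2vec or BERT.

```python
import math

# Hypothetical, hand-made 3-dimensional vectors for illustration only.
# Pre-trained models learn vectors with hundreds of dimensions.
TOY_EMBEDDINGS = {
    "big": [0.9, 0.1, 0.0],
    "large": [0.8, 0.2, 0.1],
    "huge": [0.95, 0.05, 0.05],
    "small": [-0.8, 0.2, 0.1],
    "river": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_words(query, embeddings, threshold=0.9):
    """Return (word, score) pairs at or above the threshold, most similar first."""
    q = embeddings[query]
    scored = [(w, cosine(q, vec)) for w, vec in embeddings.items() if w != query]
    kept = [(w, s) for w, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def kwic(tokens, targets, window=3):
    """Key-Word-In-Context lines as (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() in targets:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Expand the query "big" to near neighbours, then concordance all of them.
neighbours = [w for w, _ in similar_words("big", TOY_EMBEDDINGS)]
tokens = "The huge dam holds back a large river near a small town".split()
for left, kw, right in kwic(tokens, {"big", *neighbours}, window=2):
    print(f"{left:>10} [{kw}] {right}")
```

The design point is that the user types a single query word and the similarity threshold, rather than a hand-crafted multi-word pattern, decides which related forms are retrieved. The same similarity scores could in principle be used to order or group the resulting concordance lines.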
Title: Concordancing with AI: Applications of word and sentence embeddings
Journal: Applied Corpus Linguistics, vol. 5, issue 3, Article 100164
Pub Date: 2025-11-11, DOI: 10.1016/j.acorp.2025.100163
Jiantao Zou, Xuri Tang
The emerging triangulation approach in corpus-based critical discourse analysis—supra-lexical discursive component extraction in particular—faces the challenge of bridging macro-level analytical constructs (such as topics) with micro-level discursive realizations. This paper addresses this macro-micro divide in discerning sinsign topic shifts by proposing a framework that introduces unsupervised keyword extraction and word-embedding-based keyword clustering for topic shift identification and synthesizes collocation networks, sentiment analysis, and concordance reading to triangulate statistical topic shift patterns with fine-grained discursive realizations. The case study of UK HIV news discourse with the proposed framework identifies three diachronic shifts: the change from protection to prevention in HIV policy, destigmatization, and increasing focus on life quality of people living with HIV, all validated through macro-micro triangulation.
Title: Discerning diachronic sinsign topic shifts: A case study of UK HIV news
Journal: Applied Corpus Linguistics, vol. 5, issue 3, Article 100163
Pub Date : 2025-11-10 DOI: 10.1016/j.acorp.2025.100162
John Blake, Maxim Mozgovoy
This paper introduces the Feature Visualizer, an open-access AI-powered tool designed to raise genre awareness among novice academic writers through inductive learning, a process that includes approaches such as discovery learning. The tool houses an annotated corpus of scientific research articles written by computer science majors and allows learners to explore authentic texts using on-demand visualizations and multimodal explanations. By engaging with the corpus, learners identify recurring language patterns and rhetorical structures at macro, meso, and micro levels, facilitating the bottom-up discovery of genre conventions. A longitudinal study with Japanese undergraduate computer science majors showed that the tool enhanced learners’ awareness of academic writing conventions and genre features. Focus group interviews further confirmed the usability and pedagogical value of the Feature Visualizer. We conclude by discussing practical applications for genre-based writing instruction informed by inductive learning principles.
Title: Raising genre awareness through visualizing language features. Authors: John Blake, Maxim Mozgovoy. Journal: Applied Corpus Linguistics, vol. 5, no. 3, Article 100162. Publication date: 2025-11-10. DOI: 10.1016/j.acorp.2025.100162