Semantic prosody of Slovene adverb–verb collocations: introducing the top-down approach
P. Jurko. Corpora, 2022-04-01. doi:10.3366/cor.2022.0234

This paper presents a corpus-driven Sinclairian analysis of five high-frequency Slovene verbs covering the lexical paradigm ‘to express orally’ in combination with their premodifying adverbs of manner. One of the main goals of the paper is to establish how frequent the phenomenon of semantic prosody actually is among high-frequency lexical items (here, adv–v pairs). A methodology designed to answer this question is proposed, featuring a top-down approach (i.e., examining items in order of decreasing frequency of occurrence). It involves setting the widest possible parameters for identifying so-called ‘extended units of meaning’ and their semantic prosody amongst the most frequent lexical patterns in a language. A total of twenty-six adv–v pairs have been examined. Results indicate a strong correlation between the frequency of multi-word lexical items and their tendency to develop semantic prosodies: high-frequency collocations are thus more likely to have semantic prosodies than their lower-frequency counterparts. Overall, the results also corroborate the tendency of semantic prosody to be found mainly with negative meanings and, to a lesser extent, with neutral meanings, while no positive semantic prosody was identified in this study.
On the status of statistical reporting versus linguistic description in corpus linguistics: a ten-year perspective
Tove Larsson, Jesse Egbert, D. Biber. Corpora, 2022-04-01. doi:10.3366/cor.2022.0238

This study investigates (i) whether there has been a shift towards an increased statistical focus in corpus linguistic research articles and, if so, (ii) whether this has had any repercussions for the attention paid to linguistic description. We investigate this through an analysis of the relative focus on statistical reporting versus linguistic description in the way results are reported and discussed in research articles published in four major corpus linguistics journals in 2009 and 2019. The results display a marked change: in 2009, a clear majority of the articles exhibit a preference for linguistic description over statistical reporting; in 2019, the exact opposite is true. The number of different statistical techniques employed has also gone up. Whilst the increased statistical focus may reflect increased methodological sophistication, our results show that it has come at a cost: a diminished focus on linguistic description, evident, for example, in fewer text excerpts and linguistic examples, which appears to be symptomatic of an increasing distance from the language that is the object of study. We discuss these shifts and suggest some ways of employing sophisticated statistical techniques without sacrificing a focus on language.
Review: Egbert and Baker (eds). 2020. Using Corpus Methods to Triangulate Linguistic Analysis
Xiaoli Fu. Corpora, 2021-11-01. doi:10.3366/cor.2021.0230

Previous research on methodological triangulation, such as Baker and Egbert (2016), has mainly focussed on triangulation within corpus linguistics (CL). This timely volume presents triangulation between corpus linguistic methods and other linguistic methodologies through nine empirical studies in discourse analysis, applied linguistics and psycholinguistics. The volume consists of an introduction, nine chapters grouped into three sections, and a ‘Synthesis and Conclusion’. In the Introduction, the editors briefly introduce CL and methodological triangulation. A brief review of previous literature on triangulation between CL and other linguistic methods in discourse analysis, applied linguistics and psycholinguistics is then presented, and the Introduction ends with a sequential overview of the nine studies in the volume. Part I (Chapters 2 to 4) falls within the area of discourse analysis. In Chapter 2, Erin Schnur and Eniko Csomay employ manual/automatic segmentation and qualitative/quantitative analysis to analyse text structure in a corpus of twenty-four academic lectures. The first approach involves manual segmentation using Mechanical Turk (MT) and qualitative coding of the 1,056 segments identified, based on eight functions; the analysis here focusses on the distribution of segment functions in the texts. In the second approach, 769 Vocabulary-Based Discourse Units are automatically identified with TextTiler and then subjected to quantitative analysis, identifying four text-types of segments with similar linguistic features. Thus, the second case study focusses on the distribution of linguistic patterns in text structure to illustrate the association between language variation and pedagogical purpose. In Chapter 3, Tony McEnery, Helen Baker and Carmen Dayrell rely on an historical newspaper corpus to explore the reality of droughts in nineteenth-century Britain. To control the potential errors in the digitised […]
Pinning down the gap: gender and the online representation of professional tennis players
A. Yip. Corpora, 2021-11-01. doi:10.3366/cor.2021.0227

Sport is a powerful social institution in which hegemonic masculinity is constantly constructed and naturalised through the positioning of physicality and athleticism alongside maleness. Female athletes continue to be subordinated by means of under-representation and trivialising gender discourses. So far, the extensive discussion of gendered language in sports media has primarily focussed on identifying manifestations of gender bias in traditional news media; there has been little endeavour to explore the language of online media and tournament organisers. This study addresses that gap by comparing online gender representations of tennis players during the Wimbledon Championships 2018 on five online news websites and the tournament website. It also contributes to the existing literature by providing corpus evidence of gender bias in sports media. The corpus consists of 1,622 articles (1,076,475 tokens). Findings from frequency, collocation and concordance analysis indicate that, despite some instances of gender-neutral representation, female players are prone to gender marking and gender-bland sexism on all websites. I argue that the challenges women face relate to the tension between femininity and athleticism, and to the misguided belief that women need to, but can never, eliminate the muscle gap.
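The study above reports findings from frequency, collocation and concordance analysis but, naturally, does not show the mechanics in the abstract. As an illustration of the concordance step only, here is a minimal keyword-in-context (KWIC) sketch; the sample sentence, node word and window width are invented, not taken from the study:

```python
import re

def kwic(text, node, width=40):
    """Return keyword-in-context lines for a node word (case-insensitive,
    whole-word match), with the node aligned in a fixed-width column."""
    lines = []
    for m in re.finditer(r'\b' + re.escape(node) + r'\b', text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} {m.group(0)} {right}")
    return lines

# Invented sample text for demonstration
sample = ("The female player fought back bravely. Critics said the player "
          "lacked power, but the player proved them wrong.")
for line in kwic(sample, "player", width=25):
    print(line)
```

Real concordancers add sentence-boundary handling, tokenisation and sorting by left or right co-text, but the aligned-column output above is the core of the format analysts read.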
Separatism: a cross-linguistic corpus-assisted study of word-meaning development in a time of conflict
Tatyana Karpenko-Seccombe. Corpora, 2021-11-01. doi:10.3366/cor.2021.0228

This paper considers the role of historical context in initiating shifts in word meaning. The study focusses on two words – the translation equivalents separatist and separatism – in the discourses of Russian and Ukrainian parliamentary debates before and during the Russian–Ukrainian conflict which emerged at the beginning of 2014. The paper employs a cross-linguistic corpus-assisted discourse analysis to investigate the way the wider socio-political context affects word usage and meaning. To allow a comparison of discourses around separatism between the two parliaments, four corpora were compiled covering the debates in both parliaments before and during the conflict. Keywords, collocations and n-grams were studied and compared, followed by qualitative analysis of concordance lines, co-text and the larger context in which these words occurred. The results show how the originally close meanings of the translation equivalents began to diverge and manifest noticeable changes in their connotative, affective and, to an extent, denotative meanings at a time of conflict, in line with the dominant ideologies of the parliaments as well as the political affiliations of individuals.
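The keyword comparison mentioned in the abstract is typically computed with a keyness statistic; a common choice in corpus-assisted discourse studies is Dunning's log-likelihood (G²). The sketch below shows the two-corpus case; the frequencies for a word like separatist are invented for illustration and are not figures from the study:

```python
import math

def log_likelihood(a, b, c, d):
    """Dunning's G2 keyness statistic.
    a, b = observed frequencies of the word in corpus 1 and corpus 2;
    c, d = total token counts of corpus 1 and corpus 2."""
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    g2 = 0.0
    if a:
        g2 += a * math.log(a / e1)
    if b:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Invented counts: 150 hits per 1m tokens in the conflict-period corpus
# vs. 30 per 1m tokens in the pre-conflict corpus
print(round(log_likelihood(150, 30, 1_000_000, 1_000_000), 2))
```

A G² above 3.84 is conventionally taken as significant at p < 0.05 (one degree of freedom), so a word over-represented in one period relative to the other surfaces as a keyword for that period.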
Exploring and categorising the Arabic copula and auxiliary kāna through enhanced part-of-speech tagging
A. Hardie, Wesam M. A. Ibrahim. Corpora, 2021-11-01. doi:10.3366/cor.2021.0225

Arabic syntax has yet to be studied in detail from a corpus-based perspective. The Arabic copula kāna (‘be’) also functions as an auxiliary, creating periphrastic tense–aspect constructions, but the literature on these functions is far from exhaustive. To analyse kāna within the one-million-word Corpus of Contemporary Arabic, part-of-speech tagging (using novel, targeted enhancements to a previously described program that improves the accessibility for linguistic analysis of the output of Habash et al.’s [2012] mada disambiguator for the Buckwalter Arabic morphological analyser) is applied to disambiguate copula and auxiliary at a high rate of accuracy. Concordances of both are extracted, and 10 percent samples (499 instances of copula kāna and 387 of auxiliary kāna) are analysed manually to identify surface-level grammatical patterns and meanings. This raw analysis is then systematised according to the main parameters of variation of the more general patterns; special descriptions are developed for specific, apparently fixed-form expressions (including two phraseologies which afford the expression of verbal and adjectival modality). Overall, we uncover substantial new detail not mentioned in existing grammars (e.g., the quantitative predominance of the past imperfect construction over other uses of auxiliary kāna). There is notable potential for these corpus-based findings to inform and enhance not only grammatical descriptions but also the pedagogy of Arabic as a first or second/foreign language.
Expanding lindsei to spoken learner English from several L1s across cefr levels
Lan-fen Huang, Tomáš Gráf. Corpora, 2021-08-19. doi:10.3366/cor.2021.0220

Learner corpus studies typically investigate the language of second-language learners with a different first language (L1) or with proficiency levels inferred from external criteria (e.g., the Louvain International Database of Spoken English Interlanguage, lindsei; Gilquin et al., 2010). This paper reports the process of expanding the original Czech (Gráf, 2017) and Taiwanese (Huang, 2014) sub-corpora (predominantly at B2 and C1; Huang et al., 2018) with samples from learners of other L1s across cefr levels. In addition to sixty interviews conducted by the German, Finnish and Norwegian lindsei teams, another eighty-three interviews with university students in Taiwan and Finland were held. The data collection and transcription procedures were adapted from the lindsei guidelines to ensure comparability. Each fourteen-minute interview was anonymised using Audacity, then orthographically transcribed and aligned by means of exmaralda. The levels of speaking proficiency in the supplemented data were assessed by two expert raters. The expanded learner corpus, containing 243 interviews, will be of considerable value for studying the development of learner English.
An algorithm to identify periods of establishment and obsolescence of linguistic items in a diachronic corpus
Evandro Cunha, S. Wichmann. Corpora, 2021-08-19. doi:10.3366/cor.2021.0218

When exploring diachronic corpora, it is often beneficial for linguists to pinpoint not only the first or the last attestation dates of certain linguistic items, but also the moments in which they become more strongly established in the corpus or, conversely, the moments in which they, despite still being part of the language, become obsolete. In this paper, we propose an algorithm to assist the identification of such periods based on the frequency of items in a corpus. Our simple and generalisable algorithm can be used for the investigation of any linguistic item in any corpus which is divided into time-frames. We also demonstrate the applicability of our method using lexical data from the Corpus of Historical American English (coha), providing case studies on the statistics and characteristics of words that appear in or disappear from this corpus in different periods.
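The abstract above does not reproduce the algorithm itself, so the following is only a minimal sketch of one plausible frequency-threshold approach to the same problem: an item counts as established once its relative frequency stays above a threshold for k consecutive time-frames, and as obsolete once it stays below that threshold for k frames after having been attested. The function names, thresholds and data are all invented for illustration:

```python
def establishment_period(freqs, threshold=1e-6, k=3):
    """Index of the first time-frame opening a run of at least k
    consecutive frames with relative frequency >= threshold,
    or None if the item never becomes established."""
    run = 0
    for i, f in enumerate(freqs):
        run = run + 1 if f >= threshold else 0
        if run == k:
            return i - k + 1
    return None

def obsolescence_period(freqs, threshold=1e-6, k=3):
    """Mirror image: index of the first frame opening a run of k
    consecutive frames below threshold, after the item has been
    attested at least once; None if the item never fades out."""
    seen = False
    run = 0
    for i, f in enumerate(freqs):
        if f >= threshold:
            seen = True
            run = 0
        else:
            run += 1
            if seen and run == k:
                return i - k + 1
    return None

# Invented relative frequencies per decade: the item rises, then disappears
series = [0, 0, 2e-6, 3e-6, 5e-6, 4e-6, 0, 0, 0, 0]
print(establishment_period(series))
print(obsolescence_period(series))
```

For the series above, establishment is flagged at frame 2 (the start of the sustained run) and obsolescence at frame 6 (the start of the final gap). The consecutive-run requirement keeps a single noisy frame from triggering either boundary, which is the practical difficulty the paper's first/last-attestation contrast points at.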