David Trye, Andreea S. Calude, T. Keegan, Julia R. Falconer
Networks are being used to model an increasingly diverse range of real-world phenomena. This paper introduces an exploratory approach to studying loanwords in relation to one another, using networks of co-occurrence. While traditional studies treat individual loanwords as discrete items, we show that insights can be gained by focusing on the various loanwords that co-occur within each text in a corpus, especially when leveraging the notion of a hypergraph. Our research involves a case-study of New Zealand English (NZE), which borrows Indigenous Māori words on a large scale. We use a topic-constrained corpus to show that: (i) Māori loanword types tend not to occur by themselves in a text; (ii) infrequent loanwords are nearly always accompanied by frequent loanwords; and (iii) it is not uncommon for texts to contain a mixture of listed and unlisted loanwords, suggesting that NZE is still riding a wave of borrowing importation from Māori.
Title: When loanwords are not lone words; Journal: International Journal of Corpus Linguistics; Pub Date: 2023-01-09; DOI: 10.1075/ijcl.21124.try
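The hypergraph view described in the abstract above can be illustrated with a minimal sketch, in which each text is treated as one hyperedge joining all the loanword types it contains. The toy corpus, the text identifiers and every variable name below are hypothetical and are not taken from the paper; the sketch only shows the kind of counts such a representation makes easy to compute.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus (not from the paper): text id -> loanword types in that text.
texts = {
    "t1": {"whānau", "iwi", "mana"},
    "t2": {"iwi", "mana"},
    "t3": {"kai"},
    "t4": {"whānau", "iwi", "kaitiaki"},
}

# Each text is one hyperedge joining all loanword types it contains.
hyperedges = {tid: frozenset(words) for tid, words in texts.items()}

# Document frequency: the number of hyperedges each loanword type belongs to.
doc_freq = Counter(word for edge in hyperedges.values() for word in edge)

# Pairwise projection of the hypergraph: how many texts each pair of loanwords shares.
pair_counts = Counter(
    pair
    for edge in hyperedges.values()
    for pair in combinations(sorted(edge), 2)
)

# Share of texts in which a loanword type occurs by itself.
solo_share = sum(len(edge) == 1 for edge in hyperedges.values()) / len(hyperedges)
```

Keeping the full hyperedges, rather than only the pairwise projection, preserves the information about which loanwords appear together in the same text as a group.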
Pub Date: 2023-01-01; Epub Date: 2022-04-29; DOI: 10.1075/ijcl.21019.col
Luke Collins, Vaclav Brezina, Zsófia Demjén, Elena Semino, Angela Woods
Triangulating corpus linguistic approaches with other (linguistic and non-linguistic) approaches enhances "both the rigour of corpus linguistics and its incorporation into all kinds of research" (McEnery & Hardie, 2012:227). Our study investigates an important area of mental health research: the experiences of those who hear voices that others cannot hear, and particularly the ways in which those voices are described as person-like. We apply corpus methods to augment the findings of a qualitative approach to 40 interviews with voice-hearers, whereby each interview was coded as involving 'minimal' or 'complex' personification of voices. Our analysis provides linguistic evidence in support of the qualitative coding of the interviews, but also goes beyond a binary approach by revealing different types and degrees of personification of voices, based on how they are referred to and described by voice-hearers. We relate these findings to concepts that inform therapeutic interventions in clinical psychology.
Title: Corpus linguistics and clinical psychology: Investigating personification in first-person accounts of voice-hearing; Journal: International Journal of Corpus Linguistics; Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7614468/pdf/
Title: Review of Le Bruyn & Paquot (2021): Learner Corpus Research Meets Second Language Acquisition; Authors: Li Nguyen; Journal: International Journal of Corpus Linguistics; Pub Date: 2022-12-02; DOI: 10.1075/ijcl.00051.ngu
The article investigates the two main corpus indicators of word commonness, frequency and dispersion, through a cross-validation analysis of frequency and four dispersion measures (‘Range’, ‘Chi-squared’, ‘Deviation of Proportions’ and ‘Juilland’s D’). The approach estimates how well each of these measures predicts the distribution of corpus items in an extracted language sample. Based on a dataset of 273 Norwegian compounds, the results show that Deviation of Proportions in particular is a robust measure of dispersion that can be used in conjunction with frequency to substantiate assertions of word commonness based on corpus data. In addition, dispersion measures reflect not only what sort of distribution the frequency statistic is generated from, but also how reliably the frequency estimate in the corpus sample represents frequency in the language variety from which the corpus is sampled.
Title: Assessing word commonness; Authors: Mikkel Ekeland Paulsen; Journal: International Journal of Corpus Linguistics; Pub Date: 2022-11-25; DOI: 10.1075/ijcl.21037.eke
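For readers unfamiliar with the four dispersion measures named in the abstract above, the following sketch computes them for a single word from its per-part frequencies, using their commonly cited textbook definitions. It is an illustration under stated assumptions, not the article's own procedure: the function name `dispersion` and the toy numbers are hypothetical, Juilland's D is computed here on per-part relative frequencies, and the cross-validation step described in the abstract is not reproduced.

```python
import math

def dispersion(freqs, part_sizes):
    """Return Range, chi-squared, DP and Juilland's D for one word.

    freqs[i] is the word's frequency in corpus part i;
    part_sizes[i] is the number of tokens in corpus part i.
    """
    n = len(freqs)
    total = sum(freqs)
    corpus_size = sum(part_sizes)
    expected_share = [size / corpus_size for size in part_sizes]
    observed_share = [freq / total for freq in freqs]

    # Range: number of corpus parts in which the word occurs at all.
    word_range = sum(freq > 0 for freq in freqs)

    # Chi-squared: observed vs. expected frequency in each part.
    chi2 = sum(
        (freq - total * share) ** 2 / (total * share)
        for freq, share in zip(freqs, expected_share)
    )

    # Deviation of Proportions: half the summed absolute differences
    # between observed and expected shares (0 = perfectly even).
    dp = 0.5 * sum(abs(o - e) for o, e in zip(observed_share, expected_share))

    # Juilland's D: 1 - V / sqrt(n - 1), where V is the coefficient of
    # variation (population sd / mean) of the per-part relative frequencies.
    rates = [freq / size for freq, size in zip(freqs, part_sizes)]
    mean = sum(rates) / n
    sd = math.sqrt(sum((rate - mean) ** 2 for rate in rates) / n)
    juilland_d = 1 - (sd / mean) / math.sqrt(n - 1)

    return {"range": word_range, "chi2": chi2, "dp": dp, "juilland_d": juilland_d}

# Hypothetical toy inputs: an evenly spread word and a word confined to one part.
even = dispersion([5, 5, 5, 5], [100, 100, 100, 100])
clumped = dispersion([20, 0, 0, 0], [100, 100, 100, 100])
```

Note that the measures run in opposite directions: a DP near 0 and a Juilland's D near 1 both indicate an even distribution across corpus parts.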
The sense of smell has been relatively neglected in Western research. It is not regarded as particularly useful compared to the perceived importance of senses like sight, sound, and touch. Correspondingly, English speakers are ill-equipped to describe qualities of smells, instead invoking entities that share similar olfactory qualities, e.g. “like roses”. This raises the question: which odours do English speakers frequently refer to, and which terms describe them? This corpus-driven study looks at nouns in olfactory contexts, and the conceptual domains they fall into. Results show that speakers invoke different smells according to context: when talking about a smell they perceive, when describing a smell, or in a description of another smell, which demonstrates the differential communicative functions of smells. Further analysis shows that smells that are described are more variable than those used as descriptors and that, according to psychometric norming data, smells used as descriptors are more emotional.
Title: Things we smell and things they smell like; Authors: Thomas Poulton; Journal: International Journal of Corpus Linguistics; Pub Date: 2022-11-25; DOI: 10.1075/ijcl.21028.pou
P. Crosthwaite, Sulistya Ningrum, M. Schweinberger
This paper uses a bibliometric analysis to map the field of Corpus Linguistics (CL) research in arts and humanities over the last 20 years, tracking changes in popular CL research topics, outlets, highly cited authors, and geographical origins based on the metadata of 5,829 CL-related articles from 429 Scopus-indexed journals. Results reveal an increase in corpus-assisted discourse studies, lexical bundles and academic writing, alongside newer topics including multilingualism and social media. CL studies span 193 languages/dialects with a significant rise in Chinese, Russian, Spanish, and Italian CL research over the past decade. Clusters of highly cited CL researchers are identified spanning (inter)disciplinary research areas. An increase of CL researchers in China, Poland, South Korea, Japan, and elsewhere is evidence of the now global reach of CL research. These findings mirror diachronic socio-cultural developments in applied linguistics and society more generally and provide insights into what CL research might come next.
Title: Research trends in corpus linguistics; Journal: International Journal of Corpus Linguistics; Pub Date: 2022-11-14; DOI: 10.1075/ijcl.21072.cro
Automated tools for syntactic complexity measurement are increasingly used for analyzing various kinds of second language corpora, even though these tools were originally developed and tested for texts produced by advanced learners. This study investigates the reliability of automated complexity measurement for beginner and lower-intermediate L2 English data by comparing manual and automated analyses of a corpus of 80 texts written by Dutch-speaking learners. Our quantitative and qualitative analyses reveal that the reliability of automated complexity measurement is substantially affected by learner errors, parser errors, and Tregex pattern undergeneration. We also demonstrate the importance of aligning the definitions of analytical units between the computational tool and human annotators. In order to enhance the reliability of automated analyses, it is recommended that certain modifications are made to the system, and non-advanced L2 English data are preprocessed prior to automated analyses.
Title: A comparison of automated and manual analyses of syntactic complexity in L2 English writing; Authors: Quang Hồng Châu, Bram Bulté; Journal: International Journal of Corpus Linguistics; Pub Date: 2022-10-17; DOI: 10.1075/ijcl.20181.cha
The theories and methods in corpus linguistics (CL) have had an impact on numerous areas in applied linguistics. However, the interface between CL and multimodal speech-gesture studies remains underexplored. One fundamental question is whether it is possible, and even appropriate, to apply the theories and paradigms established based on textual data to multimodal data. To explore this, we examine how CL can assist in investigating lexico-grammatical patterns of speech co-occurring with a recurrent gesture (i.e. the circular gesture). Sinclair’s (1996) unit of meaning model is used to describe the co-gestural speech patterns. The study draws on a subset of the Nottingham Multimodal Corpus, in which 570 instances of circular gestures and their co-occurring speech are identified and analysed. We argue that Sinclair’s unit of meaning model can be extended to include speech-gesture patterns, and that those descriptions enable a more nuanced understanding of meaning in context.
Title: Towards a corpus-based description of speech-gesture units of meaning; Authors: Yaoyao Chen, S. Adolphs; Journal: International Journal of Corpus Linguistics; Pub Date: 2022-10-13; DOI: 10.1075/ijcl.20174.che
D. Stifter, Fangzhe Qiu, M. Aquino-López, Bernhard Bauer, E. Lash, Nora White
This article introduces Corpus PalaeoHibernicum (CorPH), a corpus currently consisting of 78 texts in Early Irish (c. 7th–10th cent.) created by the ERC-funded Chronologicon Hibernicum (ChronHib) project by bringing together pre-existing lexical and syntactic databases and adding further crucial texts from the period. In addition to annotation for POS, morphological and syntactic information, a further layer of annotation has been developed for CorPH: ‘Variation Tagging’, i.e. a tagset that numerically encodes synchronic language variation during the Early Irish period, thus allowing for much improved research on the chronological variation among the material. Another new pillar of studying linguistic variation is Bayesian Language Variation Analysis (BLaVA), which addresses the challenge that “not-so-big data” poses to statistical corpus methods. Instead of reflecting feature frequencies, BLaVA models language variation as probabilities of variation.
Title: Strategies in tracing linguistic variation in a corpus of Old Irish texts (CorPH); Journal: International Journal of Corpus Linguistics; Pub Date: 2022-09-20; DOI: 10.1075/ijcl.22018.sti
The ways in which politicians have discussed who, what, and where was considered “uncivilized” across the past two centuries give an insight into how speakers in a position of authority classified and constructed the world around them, and how those in power in Britain see the country and themselves. This article uses the Hansard Corpus 1803–2003 of speeches in the UK Parliament alongside data from the Historical Thesaurus of English to analyse diachronic variation in usage of words for persons, places and practices considered uncivil. It proposes new methods and offers quantitative data to describe the period’s shift in political attitudes towards not just the so-called “uncivil” but also the country as a whole.
Title: “In barbarous times and in uncivilized countries”; Authors: Marc Alexander, Andrew Struan; Journal: International Journal of Corpus Linguistics; Pub Date: 2022-09-09; DOI: 10.1075/ijcl.22016.ale