Abstract Classification trees and random forests offer a number of attractive features to corpus data analysts. However, the way in which these models are typically reported – a decision tree and/or set of variable importance scores – offers insufficient information if interest centers on the (form of) relationship between (multiple) predictors and the outcome. This paper develops predictive margins as an interpretative approach to ensemble techniques such as random forests. These are model summaries in the form of adjusted predictions, which provide a clearer picture of patterns in the data and allow us to query a model on potential nonlinear associations and interactions among predictor variables. The present paper outlines the general strategy for forming predictive margins and addresses methodological issues from an explicitly (corpus) linguistic perspective. For illustration, we use data on the English genitive alternation and provide an R package and code for their implementation.
{"title":"Seeing the wood for the trees: predictive margins for random forests","authors":"Lukas Sönning, Jason Grafmiller","doi":"10.1515/cllt-2022-0083","DOIUrl":"https://doi.org/10.1515/cllt-2022-0083","url":null,"abstract":"Abstract Classification trees and random forests offer a number of attractive features to corpus data analysts. However, the way in which these models are typically reported – a decision tree and/or set of variable importance scores – offers insufficient information if interest centers on the (form of) relationship between (multiple) predictors and the outcome. This paper develops predictive margins as an interpretative approach to ensemble techniques such as random forests. These are model summaries in the form of adjusted predictions, which provide a clearer picture of patterns in the data and allow us to query a model on potential nonlinear associations and interactions among predictor variables. The present paper outlines the general strategy for forming predictive margins and addresses methodological issues from an explicitly (corpus) linguistic perspective. For illustration, we use data on the English genitive alternation and provide an R package and code for their implementation.","PeriodicalId":45605,"journal":{"name":"Corpus Linguistics and Linguistic Theory","volume":"0 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41334909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract Nepali is typologically rare in terms of nominal classification systems, as it is one of the few languages of the world having simultaneously two gender systems (human/non-human, masculine/feminine) and one numeral classifier system (distinguishing features such as human, round-shaped objects, and long objects among others). Such a rare co-occurrence of different nominal classification systems is highly relevant for investigating linguistic complexity, as languages generally do not have several systems of the same type fulfilling the same functions. However, no corpus-based quantitative analyses have been conducted on the productive use of nominal classification systems in Nepali. The current paper aims at filling this gap by providing a token-based study from the Nepali National Corpus (∼20 million words). Our preliminary results show that there is in fact little formal overlap between the classifier and the gender systems.
{"title":"A corpus-based quantitative study of numeral classifiers in Nepali","authors":"Krishna Prasad Parajuli, Marc Allassonnière-Tang","doi":"10.1515/cllt-2022-0064","DOIUrl":"https://doi.org/10.1515/cllt-2022-0064","url":null,"abstract":"Abstract Nepali is typologically rare in terms of nominal classification systems, as it is one of the few languages of the world having simultaneously two gender systems (human/non-human, masculine/feminine) and one numeral classifier system (distinguishing features such as human, round-shaped objects, and long objects among others). Such a rare co-occurrence of different nominal classification systems is highly relevant for investigating linguistic complexity, as languages generally do not have several systems of the same type fulfilling the same functions. However, no corpus-based quantitative analyses have been conducted on the productive use of nominal classification systems in Nepali. The current paper aims at filling this gap by providing a token-based study from the Nepali National Corpus (∼20 million words). Our preliminary results show that there is in fact little formal overlap between the classifier and the gender systems.","PeriodicalId":45605,"journal":{"name":"Corpus Linguistics and Linguistic Theory","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43975397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract English verbs can combine with an object-like (or Objoid) element consisting of a possessive and a superlative. These Superlative Objoids do not add a participant to the event but function like manner adverbs (they work their hardest, i.e. they work extremely hard). This paper is the first to use diachronic evidence from a corpus of Late Modern American English to trace the recent history of Superlative Objoid Constructions (SOC). In particular, it aims to assess whether the construction has become entrenched to the extent that it can give rise to analogical extension. Secondly, the evidence is used to model, within the framework of Construction Grammar, the horizontal and vertical links between the SOC and its (potential) relatives in the constructional network of transitivity changing constructions.
{"title":"They worked their hardest on the construction’s history: Superlative Objoid Constructions in Late Modern American English","authors":"Tamara Bouso, M. Hundt","doi":"10.1515/cllt-2022-0088","DOIUrl":"https://doi.org/10.1515/cllt-2022-0088","url":null,"abstract":"Abstract English verbs can combine with an object-like (or Objoid) element consisting of a possessive and a superlative. These Superlative Objoids do not add a participant to the event but function like manner adverbs (they work their hardest, i.e. they work extremely hard). This paper is the first to use diachronic evidence from a corpus of Late Modern American English to trace the recent history of Superlative Objoid Constructions (SOC). In particular, it aims to assess whether the construction has become entrenched to the extent that it can give rise to analogical extension. Secondly, the evidence is used to model, within the framework of Construction Grammar, the horizontal and vertical links between the SOC and its (potential) relatives in the constructional network of transitivity changing constructions.","PeriodicalId":45605,"journal":{"name":"Corpus Linguistics and Linguistic Theory","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43926545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-02-01 DOI: 10.1515/cllt-2023-frontmatter1
{"title":"Frontmatter","authors":"","doi":"10.1515/cllt-2023-frontmatter1","DOIUrl":"https://doi.org/10.1515/cllt-2023-frontmatter1","url":null,"abstract":"","PeriodicalId":45605,"journal":{"name":"Corpus Linguistics and Linguistic Theory","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136178354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract For centuries, investigations of disputed authorship have shown that people have unique styles of writing. Given sufficient data, it is generally possible to distinguish between the writings of a small group of authors, for example, through the multivariate analysis of the relative frequencies of common function words. There is, however, no accepted explanation for why this type of stylometric analysis is successful. Authorship analysts often argue that authors write in subtly different dialects, but the analysis of individual words is not licensed by standard theories of sociolinguistic variation. Alternatively, stylometric analysis is consistent with standard theories of register variation. In this paper, I argue that stylometric methods work because authors write in subtly different registers. To support this claim, I present the results of parallel stylometric and multidimensional register analyses of a corpus of newspaper articles written by two columnists. I demonstrate that both analyses not only distinguish between these authors but identify the same underlying patterns of linguistic variation. I therefore propose that register variation, as opposed to dialect variation, provides a basis for explaining these differences and for explaining stylometric analyses of authorship more generally.
{"title":"Register variation explains stylometric authorship analysis","authors":"J. Grieve","doi":"10.1515/cllt-2022-0040","DOIUrl":"https://doi.org/10.1515/cllt-2022-0040","url":null,"abstract":"Abstract For centuries, investigations of disputed authorship have shown that people have unique styles of writing. Given sufficient data, it is generally possible to distinguish between the writings of a small group of authors, for example, through the multivariate analysis of the relative frequencies of common function words. There is, however, no accepted explanation for why this type of stylometric analysis is successful. Authorship analysts often argue that authors write in subtly different dialects, but the analysis of individual words is not licensed by standard theories of sociolinguistic variation. Alternatively, stylometric analysis is consistent with standard theories of register variation. In this paper, I argue that stylometric methods work because authors write in subtly different registers. To support this claim, I present the results of parallel stylometric and multidimensional register analyses of a corpus of newspaper articles written by two columnists. I demonstrate that both analyses not only distinguish between these authors but identify the same underlying patterns of linguistic variation. I therefore propose that register variation, as opposed to dialect variation, provides a basis for explaining these differences and for explaining stylometric analyses of authorship more generally.","PeriodicalId":45605,"journal":{"name":"Corpus Linguistics and Linguistic Theory","volume":"38 1","pages":"47 - 77"},"PeriodicalIF":1.6,"publicationDate":"2023-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41269648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One way to resolve the actuation problem of metaphorical language change is to provide a statistical profile of metaphorical constructions and generative rules with antecedent conditions. Based on arguments from the view of language as complex systems and the dynamic view of metaphor, this paper argues that metaphorical language change qualifies as a Self-Organized Criticality state and the linguistic expressions of a metaphor can be profiled as a fractal with spatio-temporal correlations. Synchronously, these metaphorical expressions self-organize into a self-similar, scale-invariant fractal that follows a power-law distribution; temporally, long-range interdependence constrains the self-organization process by way of transformation rules that are intrinsic to a language system. This argument is verified in the paper with statistical analyses of twelve randomly selected Chinese verb metaphors in a large-scale diachronic corpus.
{"title":"Metaphorical language change is Self-Organized Criticality","authors":"Xuri Tang, Huifang Ye","doi":"10.1515/cllt-2022-0016","DOIUrl":"https://doi.org/10.1515/cllt-2022-0016","url":null,"abstract":"One way to resolve the actuation problem of metaphorical language change is to provide a statistical profile of metaphorical constructions and generative rules with antecedent conditions. Based on arguments from the view of language as complex systems and the dynamic view of metaphor, this paper argues that metaphorical language change qualifies as a Self-Organized Criticality state and the linguistic expressions of a metaphor can be profiled as a fractal with spatio-temporal correlations. Synchronously, these metaphorical expressions self-organize into a self-similar, scale-invariant fractal that follows a power-law distribution; temporally, long range interdependence constrains the self-organization process by the way of transformation rules that are intrinsic of a language system. This argument is verified in the paper with statistical analyses of twelve randomly selected Chinese verb metaphors in a large-scale diachronic corpus.","PeriodicalId":45605,"journal":{"name":"Corpus Linguistics and Linguistic Theory","volume":"10 4","pages":""},"PeriodicalIF":1.6,"publicationDate":"2022-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138513488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Register variation and corpus linguistics: empirical findings and emerging theories. Special issue introduction of Corpus Linguistics and Linguistic Theory in honor of Douglas Biber","authors":"Jesse Egbert, Bethany Gray, Tove Larsson","doi":"10.1515/cllt-2022-0093","DOIUrl":"https://doi.org/10.1515/cllt-2022-0093","url":null,"abstract":"","PeriodicalId":45605,"journal":{"name":"Corpus Linguistics and Linguistic Theory","volume":"19 1","pages":"1 - 5"},"PeriodicalIF":1.6,"publicationDate":"2022-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42403128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract Several studies have shown that there is considerable cross-genre variation as regards what linguistic units tend to be coordinated by "and". While literate, expository writing favors coordination of phrasal units such as noun phrases, coordinated units are more often clausal (e.g., main or subordinate clauses) in speech-related texts. This difference has been attested in studies that focus exclusively on coordination as well as in macro-level studies of co-variation among a large number of linguistic features. However, this register differentiation has increased over time: studies of Early and Late Modern English point to less pronounced differences among registers than those attested in the present-day language. This study fills a gap in research by considering data on coordination by "and" from the middle of the 20th century, a period that does not belong fully to either Late Modern or Present-Day English, and the late 20th and early 21st century, and thus ties diachronic and synchronic research on register variation in coordination together. We also examine language from films and television in order to complement historical findings for speech-related language with data on registers that arose in the 20th century.
{"title":"Clausal and phrasal coordination in recent American English","authors":"Merja Kytö, Erik Smitterberg","doi":"10.1515/cllt-2022-0035","DOIUrl":"https://doi.org/10.1515/cllt-2022-0035","url":null,"abstract":"Abstract Several studies have shown that there is considerable cross-genre variation as regards what linguistic units tend to be coordinated by and. While literate, expository writing favors coordination of phrasal units such as noun phrases, coordinated units are more often clausal (e.g., main or subordinate clauses) in speech-related texts. This difference has been attested in studies that focus exclusively on coordination as well as in macro-level studies of co-variation among a large number of linguistic features. However, this register differentiation has increased over time: studies of Early and Late Modern English point to less pronounced differences among registers than those attested in the present-day language. This study fills a gap in research by considering data on coordination by and from the middle of the 20th century, a period that does not belong fully to either Late Modern or Present-Day English, and the late 20th and early 21st century, and thus ties diachronic and synchronic research on register variation in coordination together. We also examine language from films and television in order to complement historical findings for speech-related language with data on registers that arose in the 20th century.","PeriodicalId":45605,"journal":{"name":"Corpus Linguistics and Linguistic Theory","volume":"19 1","pages":"23 - 46"},"PeriodicalIF":1.6,"publicationDate":"2022-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42014549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract This article provides an overview of Douglas Biber’s work on register and his central role in establishing register as both an empirical focus and a theoretical construct in corpus linguistics. I identify four general phases of his work. Each has a slightly different emphasis, but each also advances intertwined threads of research that lead to an increased understanding of register variation. Biber’s work has made major contributions to distinct areas within the study of registers, from cross-linguistic speech-writing differences to English grammar, but he has advanced the field especially by integrating the findings from different areas. He has offered conceptualizations of register that account for findings from multiple areas of study, and he continues to refine the conceptualization as he engages in new lines of inquiry today.
{"title":"Register in corpus linguistics: the role and legacy of Douglas Biber","authors":"Susan Conrad","doi":"10.1515/cllt-2022-0032","DOIUrl":"https://doi.org/10.1515/cllt-2022-0032","url":null,"abstract":"Abstract This article provides an overview of Douglas Biber’s work on register and his central role in establishing register as both an empirical focus and a theoretical construct in corpus linguistics. I identify four general phases of his work. Each has a slightly different emphasis, but each also advances intertwined threads of research that lead to an increased understanding of register variation. Biber’s work has made major contributions to distinct areas within the study of registers, from cross-linguistic speech-writing differences to English grammar, but he has advanced the field especially by integrating the findings from different areas. He has offered conceptualizations of register that account for findings from multiple areas of study, and he continues to refine the conceptualization as he engages in new lines of inquiry today.","PeriodicalId":45605,"journal":{"name":"Corpus Linguistics and Linguistic Theory","volume":"19 1","pages":"7 - 21"},"PeriodicalIF":1.6,"publicationDate":"2022-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48474675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-11-19 DOI: 10.48550/arXiv.2211.10709
Xuri Tang, Huifang Ye
Abstract One way to resolve the actuation problem of metaphorical language change is to provide a statistical profile of metaphorical constructions and generative rules with antecedent conditions. Based on arguments from the view of language as complex systems and the dynamic view of metaphor, this paper argues that metaphorical language change qualifies as a Self-Organized Criticality state and the linguistic expressions of a metaphor can be profiled as a fractal with spatio-temporal correlations. Synchronously, these metaphorical expressions self-organize into a self-similar, scale-invariant fractal that follows a power-law distribution; temporally, long-range interdependence constrains the self-organization process by way of transformation rules that are intrinsic to a language system. This argument is verified in the paper with statistical analyses of twelve randomly selected Chinese verb metaphors in a large-scale diachronic corpus.
{"title":"Metaphorical language change is Self-Organized Criticality","authors":"Xuri Tang, Huifang Ye","doi":"10.48550/arXiv.2211.10709","DOIUrl":"https://doi.org/10.48550/arXiv.2211.10709","url":null,"abstract":"Abstract One way to resolve the actuation problem of metaphorical language change is to provide a statistical profile of metaphorical constructions and generative rules with antecedent conditions. Based on arguments from the view of language as complex systems and the dynamic view of metaphor, this paper argues that metaphorical language change qualifies as a Self-Organized Criticality state and the linguistic expressions of a metaphor can be profiled as a fractal with spatio-temporal correlations. Synchronously, these metaphorical expressions self-organize into a self-similar, scale-invariant fractal that follows a power-law distribution; temporally, long range interdependence constrains the self-organization process by the way of transformation rules that are intrinsic of a language system. This argument is verified in the paper with statistical analyses of twelve randomly selected Chinese verb metaphors in a large-scale diachronic corpus.","PeriodicalId":45605,"journal":{"name":"Corpus Linguistics and Linguistic Theory","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2022-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46217476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}