This paper introduces the Electronic Repository of Greater Poland Oaths, eROThA (1386–1446), a digitisation project of a diplomatic edition of mediaeval land court oaths recorded in Latin and Old Polish, resulting in a small, lightly tagged specialised bilingual corpus. We present the background, aims, design and methodology of the project. We also discuss the problems and limitations entrenched in turning a printed diplomatic edition into a machine-readable diplomatic edition equipped with a new interpretative layer that is sensitive to the switches between Latin and Old Polish. In addition to the automatic annotation of code-switched items on the basis of typographic characteristics of the printed edition, flexible coding of recurrent language and discourse boundary phenomena has been introduced manually to account for linguistically ambiguous or neutral forms. The project offers a fully multilingual corpus, as well as customised Polish-only and Latin-only datasets, and enables filtered metadata searches in the online front-end. Overall, the report presents a methodology for constructing multilingual corpora in the context of legal cultures in medieval Central Europe that may be extrapolated to datasets originating in other periods and regions.
{"title":"Multilingualism in Greater Poland court records (1386–1448): tagging discourse boundaries and code-switching","authors":"M. Włodarczyk, J. Kopaczyk, M. Kozák","doi":"10.3366/cor.2020.0200","DOIUrl":"https://doi.org/10.3366/cor.2020.0200","url":null,"abstract":"This paper introduces the Electronic Repository of Greater Poland Oaths, eROThA (1386–1446), a digitisation project of a diplomatic edition of mediaeval land court oaths recorded in Latin and Old Polish, resulting in a small, lightly tagged specialised bilingual corpus. We present the background, aims, design and methodology of the project. We also discuss the problems and limitations entrenched in turning a printed diplomatic edition into a machine-readable diplomatic edition equipped with a new interpretative layer that is sensitive to the switches between Latin and Old Polish. In addition to the automatic annotation of code-switched items on the basis of typographic characteristics of the printed edition, flexible coding of recurrent language and discourse boundary phenomena has been introduced manually to account for linguistically ambiguous or neutral forms. The project offers a fully multilingual corpus, as well as customised Polish-only and Latin-only datasets, and enables filtered metadata searches in the online front-end. Overall, the report presents a methodology for constructing multilingual corpora in the context of legal cultures in medieval Central Europe that may be extrapolated to datasets originating in other periods and regions.","PeriodicalId":44933,"journal":{"name":"Corpora","volume":" ","pages":""},"PeriodicalIF":0.5,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44457108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The goal of this study is to detect the historical distribution of representations of the United States and Brazil formed around the use of the nationality adjectives American and Brazilian. To ach...
{"title":"A historical characterisation of American and Brazilian cultures based on lexical representations","authors":"Tony Berber Sardinha","doi":"10.3366/cor.2020.0194","DOIUrl":"https://doi.org/10.3366/cor.2020.0194","url":null,"abstract":"The goal of this study is to detect the historical distribution of representations of the United States and Brazil formed around the use of the nationality adjectives American and Brazilian. To ach...","PeriodicalId":44933,"journal":{"name":"Corpora","volume":"15 1","pages":"183-212"},"PeriodicalIF":0.5,"publicationDate":"2020-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49089984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Corpora are usually not only made up of words, sentences and plain texts; they usually also have metadata, background information and structural features which can be used to filter searches or pro...
{"title":"Calculating and displaying key labels: the texts, sections, authors and neighbourhoods where words and collocations are likely to be prominent","authors":"Stephen Jeaco","doi":"10.3366/cor.2020.0193","DOIUrl":"https://doi.org/10.3366/cor.2020.0193","url":null,"abstract":"Corpora are usually not only made up of words, sentences and plain texts; they usually also have metadata, background information and structural features which can be used to filter searches or pro...","PeriodicalId":44933,"journal":{"name":"Corpora","volume":"15 1","pages":"169-182"},"PeriodicalIF":0.5,"publicationDate":"2020-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46657574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Review: McIntyre and Walker. 2019. Corpus Stylistics","authors":"Jordan Smith","doi":"10.3366/cor.2020.0196","DOIUrl":"https://doi.org/10.3366/cor.2020.0196","url":null,"abstract":"","PeriodicalId":44933,"journal":{"name":"Corpora","volume":"15 1","pages":"243-246"},"PeriodicalIF":0.5,"publicationDate":"2020-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46573938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yukiko Ohashi, N. Katagiri, K. Oka, Michiko Hanada
This paper reports on two research results: (1) designing an English for Specific Purposes (ESP) corpus architecture complete with annotations structured by regular expressions; and (2) a case stud...
{"title":"ESP corpus design: compilation of the Veterinary Nursing Medical Chart Corpus and the Veterinary Nursing Wordlist","authors":"Yukiko Ohashi, N. Katagiri, K. Oka, Michiko Hanada","doi":"10.3366/cor.2020.0191","DOIUrl":"https://doi.org/10.3366/cor.2020.0191","url":null,"abstract":"This paper reports on two research results: (1) designing an English for Specific Purposes (ESP) corpus architecture complete with annotations structured by regular expressions; and (2) a case stud...","PeriodicalId":44933,"journal":{"name":"Corpora","volume":"15 1","pages":"125-140"},"PeriodicalIF":0.5,"publicationDate":"2020-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47066325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Adolphs, Dawn Knight, Catherine Smith, Dominic T. Price
Spoken corpora have traditionally been assembled through careful recording and transcription of discourse events, a process which is both labour intensive and often restrictive in terms of breadth of recording contexts available. To overcome these potential challenges in spoken corpus compilation, we explore the use of crowdsourcing of language samples that are reported by participants. We investigate the level of precision and recall of the ‘crowd’ when it comes to reporting language they have heard in certain contexts, alongside the use of a crowdsourcing toolkit to facilitate this task. As a focussing device for the selection of reported language samples, we draw on the use of formulaic phrases as an area that has received considerable attention by corpus linguists and applied linguists over the years. We argue that while studying reported language usage instead of actual language-in-use is problematic for several reasons, many of which have been highlighted in the literature on Discourse Completion Tasks (Schauer and Adolphs, 2006), our suggested approach presents several advantages and opportunities for spoken corpus linguistics.
{"title":"Crowdsourcing formulaic phrases: towards a new type of spoken corpus","authors":"S. Adolphs, Dawn Knight, Catherine Smith, Dominic T. Price","doi":"10.3366/COR.2020.0192","DOIUrl":"https://doi.org/10.3366/COR.2020.0192","url":null,"abstract":"Spoken corpora have traditionally been assembled through careful recording and transcription of discourse events, a process which is both labour intensive and often restrictive in terms of breadth of recording contexts available. To overcome these potential challenges in spoken corpus compilation, we explore the use of crowdsourcing of language samples that are reported by participants. We investigate the level of precision and recall of the ‘crowd’ when it comes to reporting language they have heard in certain contexts, alongside the use of a crowdsourcing toolkit to facilitate this task. As a focussing device for the selection of reported language samples, we draw on the use of formulaic phrases as an area that has received considerable attention by corpus linguists and applied linguists over the years. We argue that while studying reported language usage instead of actual language-in-use is problematic for several reasons, many of which have been highlighted in the literature on Discourse Completion Tasks (Schauer and Adolphs, 2006), our suggested approach presents several advantages and opportunities for spoken corpus linguistics.","PeriodicalId":44933,"journal":{"name":"Corpora","volume":" ","pages":""},"PeriodicalIF":0.5,"publicationDate":"2020-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49051057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study explores the alternation between the mandative subjunctive and its modal alternative with should across native and non-native Englishes. Methodologically, we try to improve on existing standards by investigating over 3,300 occurrences of the alternation from the Corpus of Web-based Global English, annotated for a range of linguistic factors and analysed with a forest of conditional inference trees; we also exemplify a new strategy for the use of random or conditional inference forests in corpus-based alternation studies. We obtain a forest with significant prediction accuracies and a good C-score and discuss the strongest predictors of the subjunctive versus should alternation across Englishes. Contrasting with existing research, our multi-factorial results suggest that: (i) in British English the mandative subjunctive may not be dying out as much as we thought; and (ii) individual suasive verbs influence speakers' use of the two variants more than their variety of English.
{"title":"Mandative subjunctive versus should in world Englishes: a new take on an old alternation","authors":"Sandra C. Deshors, S. Gries","doi":"10.3366/cor.2020.0195","DOIUrl":"https://doi.org/10.3366/cor.2020.0195","url":null,"abstract":"This study explores the alternation between the mandative subjunctive and its modal alternative with should across native and non-native Englishes. Methodologically, we try to improve on existing standards by investigating over 3,300 occurrences of the alternation from the Corpus of Web-based Global English, annotated for a range of linguistic factors and analysed with a forest of conditional inference trees; we also exemplify a new strategy for the use of random or conditional inference forests in corpus-based alternation studies. We obtain a forest with significant prediction accuracies and a good C-score and discuss the strongest predictors of the subjunctive versus should alternation across Englishes. Contrasting with existing research, our multi-factorial results suggest that: (i) in British English the mandative subjunctive may not be dying out as much as we thought; and (ii) individual suasive verbs influence speakers' use of the two variants more than their variety of English.","PeriodicalId":44933,"journal":{"name":"Corpora","volume":" ","pages":""},"PeriodicalIF":0.5,"publicationDate":"2020-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48311872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}