Replication as a means of assessing corpus representativeness and the generalizability of specialized word lists

Applied Corpus Linguistics (IF 2.1), Volume 2, Issue 3, Article 100027 | Pub Date: 2022-12-01 | Epub Date: 2022-08-10 | DOI: 10.1016/j.acorp.2022.100027

Don Miller


Citations: 1

Abstract

Considerable energy has gone into designing lists of words that are salient in discourse domains of varying breadth. Over the past two decades, most efforts in designing and validating corpus-based frequency lists have focused on three areas: corpus compilation, item selection criteria, and coverage-based demonstrations of list robustness. As a result, modern corpora are now often much larger and better balanced; the application of additional dispersion statistics allows for better targeting of items with desired distributions; and contemporary lexical frequency lists are proving increasingly efficient, providing ever higher coverage of target texts or achieving such coverage with fewer words. Despite these important advances, relatively little attention has been paid to word list reliability, that is, the extent to which lists can be generalized to the wider discourse domain represented by the corpora on which they are based. This study begins to address this gap. Through two word list development case studies (one for Environmental Science and one for Applied Linguistics), it demonstrates that adding iterative reliability analysis, carried out by methodological replication with corpora of increasing size and comparison of the items on the resulting lists, can be used to: 1) inform corpus design beyond what Biber (1991) terms "situational" parameters, allowing us to see whether corpora are adequately representative of lexical distributions in target discourse domains; and 2) provide valuable insight into the degree of generalizability of the word lists we have developed.
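The iterative reliability analysis described in the abstract can be illustrated with a small sketch. This is not the author's procedure: the selection criteria (`min_freq`, `min_range`), the use of nested subcorpora, the Jaccard overlap measure, and the toy documents are all assumptions made for illustration. The idea is only that the same list-building method is re-run on corpora of increasing size and that the resulting lists are compared; overlap that levels off near 1.0 would suggest the corpus has become large enough for the list to stabilize.

```python
from collections import Counter


def build_word_list(docs, min_freq=2, min_range=2):
    """Return the words that meet a frequency threshold and a range threshold.

    docs is a list of token lists; both thresholds are hypothetical criteria.
    """
    freq = Counter()       # total occurrences of each word
    doc_range = Counter()  # number of documents each word appears in
    for tokens in docs:
        freq.update(tokens)
        doc_range.update(set(tokens))
    return {w for w in freq if freq[w] >= min_freq and doc_range[w] >= min_range}


def overlap(list_a, list_b):
    """Jaccard overlap between two word lists (1.0 means identical lists)."""
    if not list_a and not list_b:
        return 1.0
    return len(list_a & list_b) / len(list_a | list_b)


def replication_curve(docs, sizes, **criteria):
    """Rebuild the list on nested subcorpora of increasing size and
    compare each list with the one built from the next larger subcorpus."""
    lists = [build_word_list(docs[:n], **criteria) for n in sizes]
    return [overlap(a, b) for a, b in zip(lists, lists[1:])]


# Toy corpus: four tiny tokenized "documents" (invented data).
docs = [
    "soil carbon flux soil".split(),
    "carbon flux model soil".split(),
    "soil model carbon data".split(),
    "carbon soil data flux".split(),
]
curve = replication_curve(docs, sizes=[2, 3, 4])
print(curve)
```

In practice the replications might draw independent samples rather than nested ones, and the comparison might track which specific items enter or leave the list; the sketch only shows the overall shape of the loop.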


Source journal