复制作为一种评估语料库代表性和专业词表可泛化性的方法

Don Miller
{"title":"复制作为一种评估语料库代表性和专业词表可泛化性的方法","authors":"Don Miller","doi":"10.1016/j.acorp.2022.100027","DOIUrl":null,"url":null,"abstract":"<div><p>Considerable energy has gone into designing lists of words that are salient in discourse domains of varying breadth. Over the past two decades, most efforts in designing and validating corpus-based frequency lists have focused on three areas: corpus compilation, item selection criteria, and coverage-based demonstrations of list robustness. As a result, modern corpora are now often much larger and better balanced; the application of additional dispersion statistics allows for better targeting of items with desired distributions; and contemporary lexical frequency lists are proving increasingly efficient, providing ever higher coverage of target texts or achieving such coverage with fewer words. However, despite these important advances, relatively minimal attention has been paid to word list reliability—the extent to which lists can be generalized to the wider discourse domain that has been represented by the corpora upon which they are based. This study begins to address this gap, demonstrating via two word list development case studies (one for Environmental Science and one for Applied Linguistics) that adding iterative reliability analysis—via methodological replication with corpora of increasing size and comparison of items on resulting lists—can be used to: 1) inform corpus design beyond what Biber (1991) terms “situational” parameters, allowing us to see whether corpora are adequately representative of lexical distributions in target discourse domains; and 2) provide valuable insight into the degree of generalizability of word lists we have developed.</p></div>","PeriodicalId":72254,"journal":{"name":"Applied Corpus Linguistics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2666799122000120/pdfft?md5=99bdd61e7345f961aa3e0dbbbda0d186&pid=1-s2.0-S2666799122000120-main.pdf","citationCount":"1","resultStr":"{\"title\":\"Replication as a means of assessing corpus representativeness and the generalizability of specialized word lists\",\"authors\":\"Don Miller\",\"doi\":\"10.1016/j.acorp.2022.100027\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Considerable energy has gone into designing lists of words that are salient in discourse domains of varying breadth. Over the past two decades, most efforts in designing and validating corpus-based frequency lists have focused on three areas: corpus compilation, item selection criteria, and coverage-based demonstrations of list robustness. As a result, modern corpora are now often much larger and better balanced; the application of additional dispersion statistics allows for better targeting of items with desired distributions; and contemporary lexical frequency lists are proving increasingly efficient, providing ever higher coverage of target texts or achieving such coverage with fewer words. However, despite these important advances, relatively minimal attention has been paid to word list reliability—the extent to which lists can be generalized to the wider discourse domain that has been represented by the corpora upon which they are based. This study begins to address this gap, demonstrating via two word list development case studies (one for Environmental Science and one for Applied Linguistics) that adding iterative reliability analysis—via methodological replication with corpora of increasing size and comparison of items on resulting lists—can be used to: 1) inform corpus design beyond what Biber (1991) terms “situational” parameters, allowing us to see whether corpora are adequately representative of lexical distributions in target discourse domains; and 2) provide valuable insight into the degree of generalizability of word lists we have developed.</p></div>\",\"PeriodicalId\":72254,\"journal\":{\"name\":\"Applied Corpus Linguistics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2666799122000120/pdfft?md5=99bdd61e7345f961aa3e0dbbbda0d186&pid=1-s2.0-S2666799122000120-main.pdf\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Corpus Linguistics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666799122000120\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Corpus Linguistics","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666799122000120","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

摘要

相当多的精力投入到设计在不同宽度的话语域中突出的单词列表上。在过去的二十年中,设计和验证基于语料库的频率列表的大部分工作集中在三个方面:语料库编译、项目选择标准和基于覆盖的列表鲁棒性演示。因此,现代语料库现在往往更大,更平衡;应用额外的分散统计数据可以更好地定位具有期望分布的项目;当代词汇频率表的效率越来越高,可以提供更高的目标文本覆盖范围,或者用更少的单词实现这样的覆盖范围。然而,尽管取得了这些重要的进展,人们对词表可靠性的关注相对较少,即词表在多大程度上可以被推广到更广泛的话语领域,即它们所基于的语料库所代表的话语领域。本研究开始解决这一差距,通过两个单词列表开发案例研究(一个用于环境科学,一个用于应用语言学)证明,通过增加语料库规模的方法学复制和结果列表上项目的比较,增加迭代可靠性分析可以用于:1)告知语料库设计超越Biber(1991)所说的“情境”参数,使我们能够看到语料库是否充分代表了目标话语域的词汇分布;2)对我们所开发的词表的泛化程度提供有价值的见解。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Replication as a means of assessing corpus representativeness and the generalizability of specialized word lists

Considerable energy has gone into designing lists of words that are salient in discourse domains of varying breadth. Over the past two decades, most efforts in designing and validating corpus-based frequency lists have focused on three areas: corpus compilation, item selection criteria, and coverage-based demonstrations of list robustness. As a result, modern corpora are now often much larger and better balanced; the application of additional dispersion statistics allows for better targeting of items with desired distributions; and contemporary lexical frequency lists are proving increasingly efficient, providing ever higher coverage of target texts or achieving such coverage with fewer words. However, despite these important advances, relatively minimal attention has been paid to word list reliability—the extent to which lists can be generalized to the wider discourse domain that has been represented by the corpora upon which they are based. This study begins to address this gap, demonstrating via two word list development case studies (one for Environmental Science and one for Applied Linguistics) that adding iterative reliability analysis—via methodological replication with corpora of increasing size and comparison of items on resulting lists—can be used to: 1) inform corpus design beyond what Biber (1991) terms “situational” parameters, allowing us to see whether corpora are adequately representative of lexical distributions in target discourse domains; and 2) provide valuable insight into the degree of generalizability of word lists we have developed.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Applied Corpus Linguistics
Applied Corpus Linguistics Linguistics and Language
CiteScore
1.30
自引率
0.00%
发文量
0
审稿时长
70 days
期刊最新文献
Breach of pacta sunt servanda: A corpus-assisted analysis of newspaper discourse on the AUKUS agreement Identifying ChatGPT-generated texts in EFL students’ writing: Through comparative analysis of linguistic fingerprints English podcasts for schoolchildren and their vocabulary demands Capturing chronological variation in L2 speech through lexical measurements and regression analysis Investigating spoken classroom interactions in linguistically heterogeneous learning groups – An interdisciplinary approach to process video-based data in second language acquisition classrooms
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1