Capturing Fine-Grained Regional Differences in Language Use through Voting Precinct Embeddings

IF 3.7; Zone 2, Computer Science; Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Computational Linguistics; Pub Date: 2023-06-13; DOI: 10.1162/coli_a_00487
Alex Rosenfeld, L. Hinrichs
Citation count: 0

Abstract

Linguistic variation across a region of interest can be captured by partitioning the region into areas and using social media data to train embeddings that represent language use in those areas. Recent work has focused on larger areas, such as cities or counties, to ensure that enough social media data is available in each area, but larger areas have a limited ability to reveal fine-grained distinctions, such as intracity differences in language use. We demonstrate that it is possible to embed smaller areas, which can provide higher-resolution analyses of language variation. We embed voting precincts, which are tiny, evenly sized political divisions for the administration of elections. The issue with modeling language use in small areas is that the data becomes extremely sparse, with many areas having scant social media data. We propose a novel embedding approach that alternates training with smoothing, which mitigates these sparsity issues. We focus on linguistic variation across Texas, as it is relatively understudied. We develop two novel quantitative evaluations that measure how well the embeddings can be used to capture linguistic variation. The first evaluation measures how well a model can map a dialect given terms specific to that dialect. The second evaluation measures how well a model can map preferences among lexical variants. These evaluations show how embedding models could be used directly by sociolinguists and measure how much sociolinguistic information is contained within the embeddings. We complement this second evaluation with a methodology for using embeddings as a kind of genetic code: we identify “genes” that correspond to a sociological variable and connect those “genes” to a linguistic phenomenon, thereby connecting sociological phenomena to linguistic ones. Finally, we explore approaches for inferring isoglosses using embeddings.
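
The abstract does not spell out the training objective or the smoothing operator, so the following is only a toy illustration of the general idea of alternating a training update with a smoothing pass over neighboring areas. Everything in it is an assumption made for illustration: the precinct names, the adjacency map, the `targets` signal that stands in for a real training objective, and the parameters `lr` and `alpha`. It is not the authors' model.

```python
from typing import Dict, List

Vector = List[float]


def train_step(emb: Dict[str, Vector], targets: Dict[str, Vector],
               lr: float = 0.5) -> None:
    """Toy 'training' update: pull each area that has data toward its target.

    `targets` is a hypothetical stand-in for whatever signal a real objective
    would extract from an area's posts. Areas without data are absent from
    `targets` and receive no update here.
    """
    for area, target in targets.items():
        emb[area] = [v + lr * (t - v) for v, t in zip(emb[area], target)]


def smooth_step(emb: Dict[str, Vector], neighbors: Dict[str, List[str]],
                alpha: float = 0.5) -> Dict[str, Vector]:
    """Blend each area's vector with the mean of its neighbors' vectors."""
    smoothed: Dict[str, Vector] = {}
    for area, vec in emb.items():
        nbrs = neighbors.get(area, [])
        if not nbrs:
            smoothed[area] = list(vec)
            continue
        mean = [sum(emb[n][i] for n in nbrs) / len(nbrs)
                for i in range(len(vec))]
        smoothed[area] = [(1 - alpha) * v + alpha * m
                          for v, m in zip(vec, mean)]
    return smoothed


def fit(neighbors: Dict[str, List[str]], targets: Dict[str, Vector],
        dim: int = 2, rounds: int = 10) -> Dict[str, Vector]:
    """Alternate a training update and a smoothing pass for a few rounds."""
    emb = {area: [0.0] * dim for area in neighbors}
    for _ in range(rounds):
        train_step(emb, targets)
        emb = smooth_step(emb, neighbors)
    return emb


# Three hypothetical precincts in a row; "B" has no social media data.
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
targets = {"A": [1.0, 0.0], "C": [0.0, 1.0]}
embeddings = fit(neighbors, targets)
```

In this toy setup, precinct "B" never receives a training update, yet it ends up with a non-zero vector that sits between its two neighbors, which is the intuition behind using smoothing to compensate for sparse areas.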
Source journal: Computational Linguistics
Category: Engineering and Technology, Computer Science: Interdisciplinary Applications
CiteScore: 15.80
Self-citation rate: 0.00%
Articles published: 45
Review time: >12 weeks
Journal description: Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.