Division and the Digital Language Divide: A Critical Perspective on Natural Language Processing Resources for the South and North Korean Languages

Korean Studies (Asian Studies). IF: 0.3. Pub Date: 2023-01-01. DOI: 10.1353/ks.2023.a908624
Benoit Berthelier

Abstract

The digital world is marked by large asymmetries in the volume of content available in different languages. As a direct corollary, this inequality also exists, in amplified form, in the number of resources (labeled and unlabeled datasets, pretrained models, academic research) available for the computational analysis of these languages, generally called natural language processing (NLP). The NLP literature divides languages into high- and low-resource languages. Thanks to early private and public investment in the field, Korean is generally considered a high-resource language. Yet the good fortunes of Korean in the age of machine learning obscure the divided state of the language: surveys of available resources and research focus solely on the standard language of South Korea, making it the sole representative of an otherwise diverse linguistic family that includes the Northern standard language as well as regional and diasporic dialects. This paper shows that the resources developed for the South Korean language do not necessarily transfer to the North Korean language. However, it also argues that this does not make North Korean a low-resource language. On the one hand, South Korean resources can be augmented with North Korean data to achieve better performance. On the other, North Korean has more resources than commonly assumed. Retracing the long history of NLP research in North Korea, the paper shows that a large number of datasets and studies exist for the North Korean language, even if they are not easily available. The paper concludes by exploring the possibility of "unified" language models and underscoring the need for active NLP research collaboration across the Korean peninsula.