Automatic construction of a dictionary of variant forms of Chinese characters

Chinese Language and Discourse (IF 0.3, Q4, Linguistics). Pub Date: 2022-06-09. DOI: 10.1075/cld.21037.shi
X. Shi
Citations: 0

Abstract

Many Chinese characters have more than one written form, owing to the complex ways in which characters were created and the long history of the script's evolution. Most existing Chinese dictionaries list these variant forms but do not explain systematically why a specific character is a variant of another; they cite only a few older key sources, many of which are themselves dictionaries of various forms. In this article we present a new theory and practice for determining whether one Chinese character is a variant of another, and show how a dictionary of variant characters can be derived automatically, using artificial intelligence techniques, from a corpus of ancient Chinese texts totaling 2.3 billion characters. Results show that, among more than 74,000 identified instances of variant character groups, more than 20,000 are new instances found by our algorithm. We have compiled all the instances into a dictionary, which we call the Dictionary of Chinese Variant Words (異體字詞典, Yiti Zi Cidian). The key insight of our theory is to find synonymous words with variant characters. The dictionary has been online for several years, and anyone can freely access and edit it, much as they would Wikipedia.
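The abstract does not describe the algorithm itself, so the following is only a rough illustration of the stated insight (synonymous words that contain variant characters), not the author's method. It is a minimal Python sketch that assumes a hypothetical upstream step has already produced pairs of synonymous words; the function names, the `min_support` threshold, and the toy word pairs are all assumptions introduced here for illustration.

```python
from collections import defaultdict


def differing_chars(word_a, word_b):
    """Return (char_a, char_b) if the words have equal length and differ
    in exactly one position; otherwise return None."""
    if len(word_a) != len(word_b):
        return None
    diffs = [(a, b) for a, b in zip(word_a, word_b) if a != b]
    return diffs[0] if len(diffs) == 1 else None


def candidate_variants(synonym_pairs, min_support=2):
    """Collect character pairs that are the only difference between
    synonymous words, keeping those backed by at least `min_support`
    distinct word pairs (a hypothetical threshold)."""
    support = defaultdict(set)
    for word_a, word_b in synonym_pairs:
        pair = differing_chars(word_a, word_b)
        if pair is None:
            continue
        # frozenset makes the character pair order-independent.
        support[frozenset(pair)].add(frozenset((word_a, word_b)))
    return {chars: words for chars, words in support.items()
            if len(words) >= min_support}


# Toy input, assumed to come from some synonym-detection step.
pairs = [
    ("群臣", "羣臣"),    # differ only in 群 / 羣
    ("群書", "羣書"),    # second word pair supporting 群 / 羣
    ("山峰", "山峯"),    # only one supporting pair for 峰 / 峯
    ("天下", "四海"),    # differ in two positions, so ignored
]

result = candidate_variants(pairs, min_support=2)
for chars, words in result.items():
    print(sorted(chars), "supported by", len(words), "word pairs")
```

Requiring several independent word pairs before accepting a character pair is one simple way to reduce false positives; a real system working over a 2.3-billion-character corpus would need a far more careful notion of synonymy than this toy input assumes.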