Machine learning and data analysis for word segmentation of classical Chinese poems: illustrations with Tang and Song examples

Digital Scholarship in the Humanities (IF 0.7; Zone 3, Literature; JCR category: HUMANITIES, MULTIDISCIPLINARY). Pub Date: 2023-10-20. DOI: 10.1093/llc/fqad073
Chao-Lin Liu, Wei-Ting Chang, Chang-Ting Chu, Ti-Yong Zheng
Citations: 0

Abstract

Words are essential to understanding classical Chinese poems. We report a collection of 32,399 classical Chinese poems annotated with word boundaries. Statistics on the annotated poems support several heuristics that researchers of Chinese literature discuss in the literature, including the patterns of lines and a practice for parallel structures (對仗). The annotators were affiliated with two universities, so that they could annotate the poems as independently as possible. An inter-rater agreement study indicates that the annotators agreed on the identified words 93 per cent of the time and agreed perfectly on the segmentation of a poem 42 per cent of the time. We applied unsupervised classification methods to annotate the poems in several different settings and evaluated the results against the human annotations. Under favorable conditions, the classifier identified about 88 per cent of the words and segmented poems perfectly 22 per cent of the time.
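The abstract reports two kinds of figures: agreement on individual words and perfect agreement on whole poems. The sketch below shows one plausible way such figures could be computed. It is an illustration only: the abstract does not give the paper's exact metric definitions, so the span-matching rule, the function names, and the sample segmentations here are all assumptions.

```python
def word_spans(words):
    """Turn a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def word_agreement(seg_a, seg_b):
    """Fraction of words in seg_a whose exact span also occurs in seg_b.

    Assumed definition; the paper may count agreement differently.
    """
    spans_a, spans_b = word_spans(seg_a), word_spans(seg_b)
    return len(spans_a & spans_b) / len(spans_a)


def perfect_match_rate(poems_a, poems_b):
    """Fraction of poems segmented identically by both annotators.

    Each poem is a list of lines; each line is a list of words.
    """
    same = sum(1 for pa, pb in zip(poems_a, poems_b) if pa == pb)
    return same / len(poems_a)


# Hypothetical segmentations of one five-character line.
line_a = ["床前", "明月", "光"]
line_b = ["床前", "明", "月光"]

# Only the span of "床前" is shared, so 1 of the 3 words in line_a matches.
print(word_agreement(line_a, line_b))

# Two one-line poems; the annotators agree fully on the first only.
print(perfect_match_rate([[line_a], [line_a]], [[line_a], [line_b]]))
```

Comparing character spans instead of word strings keeps a repeated word at two different positions from being counted as a match by accident.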
Source journal
CiteScore: 1.80
Self-citation rate: 25.00%
Articles published: 78
About the journal: DSH, or Digital Scholarship in the Humanities, is an international, peer-reviewed journal that publishes original contributions on all aspects of digital scholarship in the Humanities including, but not limited to, the field currently called the Digital Humanities. Long and short papers report on theoretical, methodological, experimental, and applied research and include results of research projects, descriptions and evaluations of tools, techniques, and methodologies, and reports on work in progress. DSH also publishes reviews of books and resources. Digital Scholarship in the Humanities was previously known as Literary and Linguistic Computing.