Aligning Parallel Bilingual Corpora Statistically with Punctuation Criteria

Thomas C. Chuang, K. Yeh
Journal: Int. J. Comput. Linguistics Chin. Lang. Process.
Publication date: 2005-03-01
DOI: 10.30019/IJCLCLP.200503.0005 (https://doi.org/10.30019/IJCLCLP.200503.0005)
Citations: 21

Abstract

We present a new approach to aligning sentences in bilingual parallel corpora based on punctuation, especially for English and Chinese. Although the length-based approach produces high accuracy rates of sentence alignment for clean parallel corpora written in two Western languages, such as French-English or German-English, it does not work as well for parallel corpora that are noisy or written in two disparate languages, such as Chinese-English. It is possible to use cognates on top of the length-based approach to increase the alignment accuracy. However, cognates do not exist between two disparate languages, which limits the applicability of the cognate-based approach. In this paper, we examine the feasibility of exploiting the statistically ordered matching of punctuation marks in two languages to achieve high-accuracy sentence alignment. We have experimented with an implementation of the proposed method on parallel corpora, the Chinese-English Sinorama Magazine Corpus and Scientific American Magazine articles, with satisfactory results. Compared with the length-based method, the proposed method exhibits better precision rates in our experimental results. Highly promising improvement was observed when both the punctuation-based and length-based methods were adopted within a common statistical framework. We also demonstrate that the method can be applied to other language pairs, such as English-Japanese, with minimal additional effort.
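
To make the idea of ordered punctuation matching concrete, the following is a minimal sketch and not the authors' statistical model. It assumes a small, hypothetical mapping from full-width Chinese punctuation to ASCII marks (`ZH_TO_EN`) and uses the standard library's `difflib.SequenceMatcher` as a stand-in for a learned matching probability. The function names `punctuation_sequence` and `punctuation_match_score` are illustrative only.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from Chinese (full-width) punctuation to ASCII marks.
# This table is an illustrative assumption, not the paper's inventory.
ZH_TO_EN = {
    "，": ",", "。": ".", "？": "?", "！": "!", "；": ";", "：": ":",
    "、": ",", "（": "(", "）": ")", "“": '"', "”": '"',
}
EN_PUNCT = set(",.?!;:()\"")


def punctuation_sequence(text: str) -> str:
    """Return the punctuation marks of `text` in order, normalized to ASCII."""
    marks = []
    for ch in text:
        ch = ZH_TO_EN.get(ch, ch)  # normalize full-width marks if known
        if ch in EN_PUNCT:
            marks.append(ch)
    return "".join(marks)


def punctuation_match_score(src: str, tgt: str) -> float:
    """Score in [0, 1] for how well the two ordered punctuation sequences agree.

    Uses difflib's ordered matching blocks as a rough proxy; a real system
    would estimate match probabilities from aligned training data.
    """
    a, b = punctuation_sequence(src), punctuation_sequence(tgt)
    if not a and not b:
        return 1.0  # nothing to compare; treat as uninformative agreement
    return SequenceMatcher(None, a, b, autojunk=False).ratio()
```

In a combined setup like the one the abstract describes, a score of this kind would be one factor alongside a length-based score when ranking candidate sentence pairs; how the two are weighted is left open here.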