Aligning bilingual corpus: Especially for language pairs from different families

Information Sciences - Applications Pub Date : 1995-09-01 Epub Date: 2010-11-17 DOI:10.1016/1069-0115(95)90013-6

Kuang-Hua Chen, Hsin-Hsi Chen

Volume 4, Issue 2, Pages 57-81

Citations: 9

Abstract



Rather than using a length-based or translation-based criterion to align bilingual texts, this paper proposes a part-of-speech-based (POS-based) criterion. The premise is that bilingual texts share the same concepts, ideas, entities, and events, and that these are usually expressed by a few critical POSes. Thus, the numbers of critical POSes in the two languages of a bead are close. This criterion has two advantages: it behaves uniformly across different language families, and it is simpler than a translation-based criterion. Divide-and-conquer, dynamic programming, and simulated annealing techniques are used to implement the POS-based alignment algorithm. Under the order constraint of alignment, the paper introduces a performance evaluation method for calculating precision and recall. The concept of incremental beads measures the degree of matching between the real bead sequence and the computed bead sequence. Two issues are considered in the experiments. First, the test texts are in languages from different families, i.e., Chinese (an oriental language) and English (an occidental language). Second, they are drawn from diverse registers, such as Sinorama Magazine and an IBM User Manual. The experimental results show that the simulated annealing approach performs very well. In aligning texts from Sinorama Magazine using paragraph markers, the recall is 94.4% and the precision is 94.9%. Without paragraph markers, the recall is 96.7% and the precision is 97.2% for aligning the IBM User Manual.
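The idea of matching critical-POS counts under an order constraint can be illustrated with a small dynamic-programming sketch. This is not the authors' algorithm: the abstract does not give the cost function, so the absolute difference of counts, the set of bead shapes, and the penalty values in `BEAD_PENALTY` are all assumptions made for illustration, and `align` is a hypothetical name. Each sentence is assumed to be already reduced to its number of critical POS tags.

```python
from typing import List, Tuple

# Hypothetical bead shapes (source sentences, target sentences) with
# made-up penalties that favour plain 1-1 matches. Not from the paper.
BEAD_PENALTY = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}

# A bead is ((src_start, src_end), (tgt_start, tgt_end)), end exclusive.
Bead = Tuple[Tuple[int, int], Tuple[int, int]]


def align(src_counts: List[int], tgt_counts: List[int]) -> List[Bead]:
    """Order-preserving alignment that minimises the total mismatch
    between critical-POS counts on the two sides of each bead."""
    n, m = len(src_counts), len(tgt_counts)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable prefix pair by one bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                mismatch = abs(sum(src_counts[i:ni]) - sum(tgt_counts[j:nj]))
                candidate = cost[i][j] + mismatch + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Backward pass: recover the bead sequence from the back-pointers.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    beads.reverse()
    return beads


# Three source sentences with 3, 2, and 4 critical POSes; two target
# sentences with 3 and 6. The second and third source sentences are
# grouped against the second target sentence.
print(align([3, 2, 4], [3, 6]))
```

With these assumed costs, the counts 2 + 4 on the source side match the count 6 on the target side exactly, so a 2-1 bead is cheaper than forcing a 1-1 match and leaving one sentence unaligned. The paper reports its best results with simulated annealing rather than dynamic programming; the sketch only illustrates the count-matching criterion.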
