Aligning bilingual corpus: Especially for language pairs from different families
Kuang-Hua Chen, Hsin-Hsi Chen
Information Sciences - Applications, Volume 4, Issue 2, Pages 57-81, 1995-09-01
DOI: 10.1016/1069-0115(95)90013-6
https://www.sciencedirect.com/science/article/pii/1069011595900136
Rather than using a length-based or translation-based criterion to align bilingual texts, this paper proposes a part-of-speech-based (POS-based) criterion. The premise is that the two sides of a bilingual text share the same concepts, ideas, entities, and events, and that these are usually expressed by a few critical POSes. The numbers of critical POSes on the two language sides of a bead should therefore be close. This criterion has two advantages: it behaves uniformly across different language families, and it is simpler than a translation-based criterion. Divide-and-conquer, dynamic programming, and simulated annealing techniques are used to implement the POS-based alignment algorithm. Under the order constraint of alignment, the paper introduces a performance evaluation method for calculating precision and recall. The concept of incremental beads is used to measure the degree of matching between the real bead sequence and the computed bead sequence. Two issues are considered in the experiments. First, the test texts are in languages from different families, i.e., Chinese (an oriental language) and English (an occidental language). Second, they are drawn from diverse registers, such as Sinorama Magazine and an IBM User Manual. The experimental results show that the simulated annealing approach performs very well. In aligning texts from Sinorama Magazine with paragraph markers, the recall is 94.4% and the precision is 94.9%. In aligning the IBM User Manual without paragraph markers, the recall is 96.7% and the precision is 97.2%.
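To make the POS-based criterion concrete, the following is a minimal sketch of a dynamic-programming aligner that prefers beads whose two sides contain similar numbers of critical POSes. It is an illustration under stated assumptions, not the authors' implementation: the set of critical tags (NOUN, VERB), the allowed bead types, the penalty values, and all function names are hypothetical, and the scoring function uses plain exact-match precision and recall rather than the incremental-bead measure described in the abstract. The simulated annealing variant that the abstract reports as performing best is not shown.

```python
from typing import List, Sequence, Tuple

# A bead is ((src_start, src_end), (tgt_start, tgt_end)) with half-open sentence ranges.
Bead = Tuple[Tuple[int, int], Tuple[int, int]]

# Assumption: which tags count as "critical" is not given in the abstract.
CRITICAL_TAGS = frozenset({"NOUN", "VERB"})

# Assumption: allowed bead shapes (source sentences, target sentences) and
# penalties that discourage non-1:1 beads. These values are illustrative only.
BEAD_PENALTY = {(1, 1): 0.0, (2, 1): 1.0, (1, 2): 1.0, (1, 0): 3.0, (0, 1): 3.0}


def critical_count(tags: Sequence[str]) -> int:
    """Count the critical POS tags in one tagged sentence."""
    return sum(1 for tag in tags if tag in CRITICAL_TAGS)


def align(src_counts: List[int], tgt_counts: List[int]) -> List[Bead]:
    """Order-preserving alignment minimising the difference in critical-POS counts."""
    n, m = len(src_counts), len(tgt_counts)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Bead cost: gap between critical-POS totals on the two sides.
                diff = abs(sum(src_counts[i:ni]) - sum(tgt_counts[j:nj]))
                candidate = cost[i][j] + diff + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Trace back from the end of both texts to recover the bead sequence.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    beads.reverse()
    return beads


def precision_recall(computed: List[Bead], reference: List[Bead]) -> Tuple[float, float]:
    """Exact-match scoring: a computed bead is correct only if it equals a reference bead."""
    correct = len(set(computed) & set(reference))
    precision = correct / len(computed) if computed else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall
```

As a usage example, calling `align([3, 5, 2, 4], [3, 7, 4])` on these made-up per-sentence counts pairs the first sentences one-to-one, merges the second and third source sentences (5 + 2 critical POSes) against the second target sentence (7), and pairs the last sentences one-to-one. Per-sentence counts could be obtained by applying `critical_count` to the output of any POS tagger for each language.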