Bilingual Segmenter for Statistical Machine Translation

Chung-Chi Huang, Wei-Teh Chen, Jason J. S. Chang
{"title":"统计机器翻译的双语分词器","authors":"Chung-Chi Huang, Wei-Teh Chen, Jason J. S. Chang","doi":"10.1109/ISUC.2008.10","DOIUrl":null,"url":null,"abstract":"We propose a bilingually-motivated segmenting framework for Chinese which has no clear delimiter for word boundaries. It involves producing Chinese tokens in line with word-based languages¿ words using a bilingual segmenting algorithm, provided with bitexts, and deriving a probabilistic tokenizing model based on previously annotated Chinese sentences. In the bilingual segmenting algorithm, we first convert the search for segmentation into a sequential tagging problem, allowing for a polynomial-time dynamic programming solution, and incorporate a control to balance mono- and bi-lingual information in tailoring Chinese sentences. Experiments show that our framework, applied as a pre-tokenization component, significantly outperforms existing segmenters in translation quality, suggesting our methodology supports better segmentation for bilingual NLP applications involving isolated languages such as Chinese.","PeriodicalId":339811,"journal":{"name":"2008 Second International Symposium on Universal Communication","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Bilingual Segmenter for Statistical Machine Translation\",\"authors\":\"Chung-Chi Huang, Wei-Teh Chen, Jason J. S. Chang\",\"doi\":\"10.1109/ISUC.2008.10\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We propose a bilingually-motivated segmenting framework for Chinese which has no clear delimiter for word boundaries. It involves producing Chinese tokens in line with word-based languages¿ words using a bilingual segmenting algorithm, provided with bitexts, and deriving a probabilistic tokenizing model based on previously annotated Chinese sentences. In the bilingual segmenting algorithm, we first convert the search for segmentation into a sequential tagging problem, allowing for a polynomial-time dynamic programming solution, and incorporate a control to balance mono- and bi-lingual information in tailoring Chinese sentences. 
Experiments show that our framework, applied as a pre-tokenization component, significantly outperforms existing segmenters in translation quality, suggesting our methodology supports better segmentation for bilingual NLP applications involving isolated languages such as Chinese.\",\"PeriodicalId\":339811,\"journal\":{\"name\":\"2008 Second International Symposium on Universal Communication\",\"volume\":\"15 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2008-12-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2008 Second International Symposium on Universal Communication\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISUC.2008.10\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 Second International Symposium on Universal Communication","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISUC.2008.10","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

We propose a bilingually-motivated segmentation framework for Chinese, a language with no explicit delimiters for word boundaries. Given bitexts, it produces Chinese tokens aligned with the words of word-based languages using a bilingual segmenting algorithm, and derives a probabilistic tokenizing model from previously annotated Chinese sentences. In the bilingual segmenting algorithm, we first recast the search for a segmentation as a sequential tagging problem, allowing a polynomial-time dynamic programming solution, and incorporate a control to balance monolingual and bilingual information when tailoring Chinese sentences. Experiments show that our framework, applied as a pre-tokenization component, significantly outperforms existing segmenters in translation quality, suggesting that our methodology yields better segmentation for bilingual NLP applications involving isolating languages such as Chinese.
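
The abstract gives only a high-level view of the algorithm. As a rough illustration of how segmentation search can become a polynomial-time dynamic program, the minimal Python sketch below scores each candidate word by interpolating a monolingual and a bilingual log-score, with a weight lam playing the role of the paper's mono/bilingual balancing control. Everything here is hypothetical: the score tables, the UNK fallback, and the word-lattice formulation (equivalent in spirit to, but not identical to, the character-tagging formulation the paper describes) are illustrative placeholders, not the authors' actual model.

```python
import math

# Hypothetical scoring tables for illustration only; in the paper such
# scores would come from bitexts and annotated Chinese sentences.
MONO_LOGPROB = {"我们": -2.0, "提出": -2.5, "我": -3.0, "们": -5.0,
                "提": -5.5, "出": -5.0}
BI_LOGPROB   = {"我们": -1.5, "提出": -2.0, "我": -4.0, "们": -6.0,
                "提": -6.0, "出": -5.5}
UNK = -10.0          # fallback log-score for unseen candidate words
MAX_WORD_LEN = 4     # bounding word length keeps the DP polynomial

def score(word: str, lam: float) -> float:
    """Interpolate mono- and bilingual log-scores; lam balances the two."""
    mono = MONO_LOGPROB.get(word, UNK)
    bi = BI_LOGPROB.get(word, UNK)
    return lam * mono + (1.0 - lam) * bi

def segment(sent: str, lam: float = 0.5) -> list[str]:
    """Viterbi-style DP: best[i] = best score of segmenting sent[:i]."""
    n = len(sent)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)   # back[i] = start index of last word ending at i
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            cand = best[j] + score(sent[j:i], lam)
            if cand > best[i]:
                best[i], back[i] = cand, j
    # Recover the word sequence from the backpointers.
    words, i = [], n
    while i > 0:
        words.append(sent[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("我们提出"))   # e.g. ['我们', '提出']
```

Bounding candidate words at MAX_WORD_LEN characters makes the search O(n x MAX_WORD_LEN), which is the polynomial-time property the abstract refers to; sweeping lam between 0 and 1 shifts the segmenter between purely bilingual and purely monolingual evidence.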