Bilingual Segmenter for Statistical Machine Translation

Chung-Chi Huang, Wei-Teh Chen, Jason J. S. Chang
{"title":"统计机器翻译的双语分词器","authors":"Chung-Chi Huang, Wei-Teh Chen, Jason J. S. Chang","doi":"10.1109/ISUC.2008.10","DOIUrl":null,"url":null,"abstract":"We propose a bilingually-motivated segmenting framework for Chinese which has no clear delimiter for word boundaries. It involves producing Chinese tokens in line with word-based languages¿ words using a bilingual segmenting algorithm, provided with bitexts, and deriving a probabilistic tokenizing model based on previously annotated Chinese sentences. In the bilingual segmenting algorithm, we first convert the search for segmentation into a sequential tagging problem, allowing for a polynomial-time dynamic programming solution, and incorporate a control to balance mono- and bi-lingual information in tailoring Chinese sentences. Experiments show that our framework, applied as a pre-tokenization component, significantly outperforms existing segmenters in translation quality, suggesting our methodology supports better segmentation for bilingual NLP applications involving isolated languages such as Chinese.","PeriodicalId":339811,"journal":{"name":"2008 Second International Symposium on Universal Communication","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Bilingual Segmenter for Statistical Machine Translation\",\"authors\":\"Chung-Chi Huang, Wei-Teh Chen, Jason J. S. Chang\",\"doi\":\"10.1109/ISUC.2008.10\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We propose a bilingually-motivated segmenting framework for Chinese which has no clear delimiter for word boundaries. It involves producing Chinese tokens in line with word-based languages¿ words using a bilingual segmenting algorithm, provided with bitexts, and deriving a probabilistic tokenizing model based on previously annotated Chinese sentences. In the bilingual segmenting algorithm, we first convert the search for segmentation into a sequential tagging problem, allowing for a polynomial-time dynamic programming solution, and incorporate a control to balance mono- and bi-lingual information in tailoring Chinese sentences. 
Experiments show that our framework, applied as a pre-tokenization component, significantly outperforms existing segmenters in translation quality, suggesting our methodology supports better segmentation for bilingual NLP applications involving isolated languages such as Chinese.\",\"PeriodicalId\":339811,\"journal\":{\"name\":\"2008 Second International Symposium on Universal Communication\",\"volume\":\"15 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2008-12-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2008 Second International Symposium on Universal Communication\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISUC.2008.10\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 Second International Symposium on Universal Communication","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISUC.2008.10","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

We propose a bilingually-motivated segmentation framework for Chinese, a language with no explicit delimiters for word boundaries. Given bitexts, it produces Chinese tokens aligned with the words of word-based languages using a bilingual segmenting algorithm, and derives a probabilistic tokenizing model from previously annotated Chinese sentences. In the bilingual segmenting algorithm, we first recast the search for a segmentation as a sequential tagging problem, allowing a polynomial-time dynamic programming solution, and incorporate a control to balance monolingual and bilingual information when tailoring Chinese sentences. Experiments show that our framework, applied as a pre-tokenization component, significantly outperforms existing segmenters in translation quality, suggesting that our methodology yields better segmentation for bilingual NLP applications involving isolating languages such as Chinese.
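
The abstract gives only a high-level view of the algorithm. As a rough illustration of how segmentation search can become a polynomial-time dynamic program, the minimal Python sketch below scores each candidate word by interpolating a monolingual and a bilingual log-score, with a weight lam playing the role of the paper's mono/bilingual balancing control. Everything here is hypothetical: the score tables, the UNK fallback, and the word-lattice formulation (equivalent in spirit to, but not identical to, the character-tagging formulation the paper describes) are illustrative placeholders, not the authors' actual model.

```python
import math

# Hypothetical scoring tables for illustration only; in the paper such
# scores would come from bitexts and annotated Chinese sentences.
MONO_LOGPROB = {"我们": -2.0, "提出": -2.5, "我": -3.0, "们": -5.0,
                "提": -5.5, "出": -5.0}
BI_LOGPROB   = {"我们": -1.5, "提出": -2.0, "我": -4.0, "们": -6.0,
                "提": -6.0, "出": -5.5}
UNK = -10.0          # fallback log-score for unseen candidate words
MAX_WORD_LEN = 4     # bounding word length keeps the DP polynomial

def score(word: str, lam: float) -> float:
    """Interpolate mono- and bilingual log-scores; lam balances the two."""
    mono = MONO_LOGPROB.get(word, UNK)
    bi = BI_LOGPROB.get(word, UNK)
    return lam * mono + (1.0 - lam) * bi

def segment(sent: str, lam: float = 0.5) -> list[str]:
    """Viterbi-style DP: best[i] = best score of segmenting sent[:i]."""
    n = len(sent)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)   # back[i] = start index of last word ending at i
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            cand = best[j] + score(sent[j:i], lam)
            if cand > best[i]:
                best[i], back[i] = cand, j
    # Recover the word sequence from the backpointers.
    words, i = [], n
    while i > 0:
        words.append(sent[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("我们提出"))   # e.g. ['我们', '提出']
```

Bounding candidate words at MAX_WORD_LEN characters makes the search O(n x MAX_WORD_LEN), which is the polynomial-time property the abstract refers to; sweeping lam between 0 and 1 shifts the segmenter between purely bilingual and purely monolingual evidence.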