{"title":"Bilingual Segmenter for Statistical Machine Translation","authors":"Chung-Chi Huang, Wei-Teh Chen, Jason J. S. Chang","doi":"10.1109/ISUC.2008.10","DOIUrl":null,"url":null,"abstract":"We propose a bilingually-motivated segmenting framework for Chinese which has no clear delimiter for word boundaries. It involves producing Chinese tokens in line with word-based languages¿ words using a bilingual segmenting algorithm, provided with bitexts, and deriving a probabilistic tokenizing model based on previously annotated Chinese sentences. In the bilingual segmenting algorithm, we first convert the search for segmentation into a sequential tagging problem, allowing for a polynomial-time dynamic programming solution, and incorporate a control to balance mono- and bi-lingual information in tailoring Chinese sentences. Experiments show that our framework, applied as a pre-tokenization component, significantly outperforms existing segmenters in translation quality, suggesting our methodology supports better segmentation for bilingual NLP applications involving isolated languages such as Chinese.","PeriodicalId":339811,"journal":{"name":"2008 Second International Symposium on Universal Communication","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 Second International Symposium on Universal Communication","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISUC.2008.10","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
We propose a bilingually-motivated segmenting framework for Chinese which has no clear delimiter for word boundaries. It involves producing Chinese tokens in line with word-based languages¿ words using a bilingual segmenting algorithm, provided with bitexts, and deriving a probabilistic tokenizing model based on previously annotated Chinese sentences. In the bilingual segmenting algorithm, we first convert the search for segmentation into a sequential tagging problem, allowing for a polynomial-time dynamic programming solution, and incorporate a control to balance mono- and bi-lingual information in tailoring Chinese sentences. Experiments show that our framework, applied as a pre-tokenization component, significantly outperforms existing segmenters in translation quality, suggesting our methodology supports better segmentation for bilingual NLP applications involving isolated languages such as Chinese.