Noise-aware Character Alignment for Extracting Transliteration Fragments

Katsuhito Sudoh, Shinsuke Mori, M. Nagata
Journal of Information Processing (JCR Q4, Computer Science), vol. 21, pp. 1107-1131. Journal Article. Pub Date: 2014-09-16. DOI: 10.5715/JNLP.21.1107
Citations: 1

Abstract

This paper proposes a novel noise-aware character alignment method for automatically extracting transliteration fragments in phrase pairs that are extracted from parallel corpora. The proposed method extends a many-to-many Bayesian character alignment method by distinguishing transliteration (signal) parts from non-transliteration (noise) parts. The model can be trained efficiently by a state-based blocked Gibbs sampling algorithm with signal and noise states. The proposed method bootstraps statistical machine transliteration using the extracted transliteration fragments to train transliteration models. In experiments using Japanese-English patent data, the proposed method was able to extract transliteration fragments with much less noise than an IBM-model-based baseline, and achieved better transliteration performance than sample-wise extraction in transliteration bootstrapping.
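To make the signal/noise distinction concrete, here is a minimal sketch of the final extraction step only. It assumes an alignment has already been produced and that each aligned unit carries a state label. The `AlignedUnit` tuple, the state names `"signal"` and `"noise"`, the function `extract_fragments`, and the example data are all hypothetical choices for illustration. They are not taken from the paper, and the Bayesian model and the blocked Gibbs sampler are not shown.

```python
from typing import List, Tuple

# Hypothetical representation: (source_chars, target_chars, state),
# where state is "signal" (transliteration) or "noise" (non-transliteration).
AlignedUnit = Tuple[str, str, str]


def extract_fragments(units: List[AlignedUnit]) -> List[Tuple[str, str]]:
    """Join each maximal run of consecutive signal units into one
    (source, target) transliteration fragment; noise units are dropped."""
    fragments: List[Tuple[str, str]] = []
    src_buf: List[str] = []
    tgt_buf: List[str] = []
    for src, tgt, state in units:
        if state == "signal":
            src_buf.append(src)
            tgt_buf.append(tgt)
        elif src_buf or tgt_buf:
            # A noise unit closes the current signal run.
            fragments.append(("".join(src_buf), "".join(tgt_buf)))
            src_buf, tgt_buf = [], []
    if src_buf or tgt_buf:
        # Flush a signal run that reaches the end of the phrase pair.
        fragments.append(("".join(src_buf), "".join(tgt_buf)))
    return fragments


# Invented example: a phrase pair with noise on both sides of a transliteration.
units = [
    ("の", "", "noise"),
    ("デー", "da", "signal"),
    ("タ", "ta", "signal"),
    ("", "s", "noise"),
]
fragments = extract_fragments(units)
```

In this toy input the two signal units are many-to-many character alignments, and they are merged into a single fragment pair while the surrounding noise units are discarded. Fragments of this kind are what the abstract describes as training material for the transliteration models.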