This paper proposes a method for inducing translation boundaries that serve as soft constraints in decoding based on Bracketing Transduction Grammar (BTG). In previous research, translation boundaries were extracted from left-most synchronous trees generated by a deterministic algorithm. Here, they are extracted from induced synchronous trees, which are statistically optimal and more balanced than left-most synchronous trees. Experiments show that the induced translation boundaries are more consistent than those extracted from left-most synchronous trees and yield significantly better performance than a strong baseline.
{"title":"Optimal Translation Boundaries for BTG-Based Decoding","authors":"Xiangyu Duan, Min Zhang","doi":"10.1109/IALP.2011.73","DOIUrl":"https://doi.org/10.1109/IALP.2011.73","url":null,"abstract":"This paper proposes a method for inducing translation boundaries as soft constraints for Bracketing Transduction Grammar based (BTG-based) decoding. Translation boundaries used in previous research are extracted from left-most synchronous trees generated by a deterministic algorithm. Translation boundaries in this research are extracted from induced synchronous trees, which are statistically optimal and more balanced than the left-most synchronous trees. Experiments show that induced translation boundaries are more consistent than those extracted from left-most synchronous trees, resulting in significantly better performances over the strong baseline.","PeriodicalId":297167,"journal":{"name":"2011 International Conference on Asian Language Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127657113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes a character-level system combination strategy for English-to-Chinese spoken language translation. For languages such as Chinese, in which word boundaries are not orthographically marked, word segmentation, which splits a sentence into a sequence of words, is often required for Natural Language Processing tasks. In this paper we evaluate the impact of segmentation of spoken data on the performance of system combination, and show that using inappropriate segmentation in system combination can result in performance inferior to that of single systems. We further demonstrate that using characters as the basic translation unit in system combination on the IWSLT ASR translation task leads to significant gains in translation quality in terms of BLEU and NIST scores.
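The character-level unit described above can be illustrated with a minimal sketch. It assumes that Chinese characters are detected by the basic CJK Unified Ideographs range and that non-Chinese runs (Latin words, digits) stay whole; the function names `is_cjk` and `to_char_units` are hypothetical and are not taken from the paper.

```python
from typing import List


def is_cjk(ch: str) -> bool:
    """Return True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def to_char_units(sentence: str) -> List[str]:
    """Split Chinese text into single characters; keep other runs whole."""
    units: List[str] = []
    buffer = ""  # collects a run of non-CJK, non-space characters
    for ch in sentence:
        if is_cjk(ch) or ch.isspace():
            if buffer:  # a Latin/digit run ends here
                units.append(buffer)
                buffer = ""
            if is_cjk(ch):
                units.append(ch)
        else:
            buffer += ch
    if buffer:
        units.append(buffer)
    return units


# Two different word segmentations of the same sentence collapse to the
# same character sequence, so hypotheses from differently segmented
# systems become directly comparable.
print(to_char_units("我 喜欢 IWSLT") == to_char_units("我喜欢 IWSLT"))
```

Under this assumption, the choice of word segmenter no longer affects the units that the combination step sees.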
{"title":"Character-Level System Combination: An Empirical Study for English-to-Chinese Spoken Language Translation","authors":"Jinhua Du","doi":"10.1109/IALP.2011.47","DOIUrl":"https://doi.org/10.1109/IALP.2011.47","url":null,"abstract":"This paper proposes a character-level system combination strategy for English -- Chinese spoken language translation. For languages like Chinese that the word boundaries are not orthographically marked, word segmentation which segments a Chinese sentence into a sequence of words, is often required for many Natural Language Processing tasks. In this paper we evaluate the impact of segmentation (spoken data) on the performance of system combination, and show that using inappropriate segmentation in system combination can result in inferior performance compared to single systems. We further demonstrate that using characters as basic translation unit in system combination on IWSLT ASR translation task leads to significant gains in translation quality in terms of BLEU and NIST scores.","PeriodicalId":297167,"journal":{"name":"2011 International Conference on Asian Language Processing","volume":"178 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115303455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the growth of exchange activities among the four cross-strait regions, correct conversion between Traditional Chinese (TC) and Simplified Chinese (SC) has become increasingly important. Numerous one-to-many mappings and differences in term usage make conversion from SC to TC the more difficult direction. This paper proposes a novel simplified-traditional Chinese character conversion model based on log-linear models, in which features such as language models and lexical semantic consistency weights are integrated. The lexical semantic consistency weights are estimated using cross-language word-based semantic spaces. Experimental results show that the proposed model achieves better performance.
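The log-linear combination of features mentioned above can be sketched as follows. In a log-linear model each candidate is scored by a weighted sum of feature values, and the normalizing constant can be ignored when only the best candidate is needed. The feature names, weights, and feature values below are toy assumptions for illustration, not figures from the paper.

```python
from typing import Dict


def log_linear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values: sum_i lambda_i * h_i(candidate)."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates: Dict[str, Dict[str, float]],
                   weights: Dict[str, float]) -> str:
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: log_linear_score(candidates[c], weights))


# Hypothetical weights for a language-model feature and a semantic
# consistency feature (both on a log scale).
weights = {"lm": 1.0, "semantic": 0.5}

# The simplified character 发 maps to both 發 and 髮 in Traditional Chinese,
# so the simplified word 头发 ("hair") has two candidate conversions.
candidates = {
    "頭髮": {"lm": -2.0, "semantic": -0.5},
    "頭發": {"lm": -6.0, "semantic": -3.0},
}

print(best_candidate(candidates, weights))
```

With these toy values the first candidate scores -2.25 and the second -7.5, so the conversion with 髮 is selected.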
{"title":"A Simplified-Traditional Chinese Character Conversion Model Based on Log-Linear Models","authors":"Yidong Chen, X. Shi, Changle Zhou","doi":"10.1109/IALP.2011.15","DOIUrl":"https://doi.org/10.1109/IALP.2011.15","url":null,"abstract":"With the growth of exchange activities between four regions of cross strait, the problem to correctly convert between Traditional Chinese (TC) and Simplified Chinese (SC) become more and more important. Numerous one-to-many mappings and term usage differences make it more difficult to convert from SC to TC. This paper proposed a novel simplified-traditional Chinese character conversion model based on log-linear models, in which features such as language models and lexical semantic consistency weighs are integrated. When estimating lexical semantic consistency weighs, cross-language word-based semantic spaces were used. Experiments were conducted and the results show that the proposed model achieve better performance.","PeriodicalId":297167,"journal":{"name":"2011 International Conference on Asian Language Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121699211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper aims to construct a Chinese-Tibetan dictionary of multi-word equivalent pairs for a Chinese-Tibetan computer-aided translation system. Since Tibetan is a morphologically rich language, we propose a two-phase framework to extract multi-word equivalent pairs automatically. The first phase extracts Chinese Multi-word Units (MWUs): we propose the CBEM model, which partitions a Chinese sentence into MWUs using two measures, collocation and binding degree. The second phase obtains Tibetan translations of the extracted Chinese MWUs: we propose the TSIM model, which focuses on extracting 1-to-n bilingual MWUs. Preliminary experimental results show that the mixed method combining the CBEM and TSIM models is effective.
{"title":"Automatic Acquisition of Chinese-Tibetan Multi-word Equivalent Pair from Bilingual Corpora","authors":"Minghua Nuo, Huidan Liu, Long-Long Ma, Jian Wu, Zhiming Ding","doi":"10.1109/IALP.2011.33","DOIUrl":"https://doi.org/10.1109/IALP.2011.33","url":null,"abstract":"This paper aims to construct Chinese-Tibetan multi-word equivalent pair dictionary for Chinese-Tibetan computer-aided translation system. Since Tibetan is a morphologically rich language, we propose two-phase framework to automatically extract multi-word equivalent pairs. First, extract Chinese Multi-word Units (MWUs). In this phase, we propose CBEM model to partition a Chinese sentence into MWUs using two measures of collocation and binding degree. Second, get Tibetan translations of the extracted Chinese MWUs. In the second phase, we propose TSIM model to focus on extracting 1-to-n bilingual MWUs. Preliminary experimental results show that the mixed method combining CBEM model with TSIM model is effective.","PeriodicalId":297167,"journal":{"name":"2011 International Conference on Asian Language Processing","volume":"51 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129312543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bilingual sentence pairs are a key resource for statistical machine translation. Currently, most sentence-aligned corpora cover English and French or English and German, and there are few specialized sentence-aligned datasets for English and Chinese. Our aim is therefore to create a large-scale, high-precision set of aligned English-Chinese sentences. A length-based method is used to align bilingual paragraphs extracted from CNKI (China National Knowledge Infrastructure), one of the largest academic websites, which contains a huge number of Chinese-English bilingual paragraphs. Our method adapts and combines word-based and hybrid approaches, and the best alignment is chosen by dynamic programming. Experiments on the CNKI dataset show that the presented method achieves satisfactory recall and precision.
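The combination of a length-based cost with dynamic programming described above can be sketched as follows. This is a simplified illustration, not the paper's method: the cost is the absolute difference between scaled source length and target length, and the bead types, their penalties, and the `ratio` parameter are all assumptions introduced here.

```python
from typing import List, Tuple

# Allowed alignment beads (source sentences, target sentences) with
# hypothetical penalties that favour 1-1 matches.
BEADS = {(1, 1): 0.0, (1, 0): 5.0, (0, 1): 5.0, (2, 1): 2.0, (1, 2): 2.0}


def align_by_length(src_lens: List[int], tgt_lens: List[int],
                    ratio: float = 1.0) -> List[Tuple[List[int], List[int]]]:
    """Return the minimum-cost bead sequence as (source ids, target ids)."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s = sum(src_lens[i:ni])
                t = sum(tgt_lens[j:nj])
                c = cost[i][j] + abs(s * ratio - t) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (di, dj)
    # Follow the back-pointers from the end to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads


# Toy sentence lengths: the second source sentence is best matched with
# the second and third target sentences together.
print(align_by_length([10, 20, 15], [10, 12, 8, 15]))
```

In practice `ratio` would be estimated from data, since Chinese and English sentence lengths are not on the same scale; the value 1.0 here is only a placeholder.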
{"title":"The Chinese-English Bilingual Sentence Alignment Based on Length","authors":"Huafu Ding, Li Quan, Haoliang Qi","doi":"10.1109/IALP.2011.70","DOIUrl":"https://doi.org/10.1109/IALP.2011.70","url":null,"abstract":"Bilingual sentence pairs are key resource for statistical machine translation. Currently, most of the sentence alignment corpus is between English and French or English and German. And there is little specialized sentence alignment dataset between English and Chinese. So our aim is to create large-scale, high-precision English-Chinese aligned sentences. Length based method is used to align bilingual paragraphs which were extracted from CNKI (China National Knowledge Infrastructure). CNKI is one of largest academic website, and contains huge Chinese-English bilingual paragraph. Our method adapts and combines some approaches, which are based on words and based on hybrid. At last, we choose the best alignment by dynamic programming. The experiments on CNKI dataset showed that the presented method had satisfactory the recall ratio and the precision ratio.","PeriodicalId":297167,"journal":{"name":"2011 International Conference on Asian Language Processing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126657878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Grammar induction is a machine learning process for learning a grammar from corpora. This paper discusses grammar induction for Indonesian language corpora using a genetic algorithm. The grammar production rules are modeled in the form of chromosomes, and the fitness function counts how many sentences can be parsed. The data used are Indonesian fairy tales such as "Bawang Merah Bawang Putih" and "Malin Kundang". The paper explains in detail each step of the process as applied to natural language grammar problems.
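The fitness function described above, which counts the sentences a candidate grammar can parse, can be sketched as follows. The encoding is an assumption made for illustration: a chromosome is a list of binary rules in Chomsky normal form over part-of-speech tags, sentences are given as tag sequences, and parsing is done with a CYK recognizer. The paper's actual chromosome encoding and parser may differ.

```python
from typing import List, Set, Tuple

# One gene = one binary production (parent, left child, right child).
Rule = Tuple[str, str, str]


def can_parse(tags: List[str], rules: List[Rule], start: str = "S") -> bool:
    """CYK recognition: True if `start` derives the whole tag sequence."""
    n = len(tags)
    if n == 0:
        return False
    # table[i][j] holds the symbols that derive tags[i : i + j + 1].
    table: List[List[Set[str]]] = [[set() for _ in range(n)] for _ in range(n)]
    for i, tag in enumerate(tags):
        table[i][0].add(tag)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for parent, l, r in rules:
                    if l in left and r in right:
                        table[i][span - 1].add(parent)
    return start in table[0][n - 1]


def fitness(rules: List[Rule], corpus: List[List[str]]) -> int:
    """Number of corpus sentences the chromosome's grammar can parse."""
    return sum(can_parse(sentence, rules) for sentence in corpus)


# Toy corpus of tag sequences and two candidate chromosomes.
corpus = [["DET", "N", "V", "DET", "N"], ["V", "DET", "N"], ["DET", "N", "V"]]
chromosome_a = [("S", "NP", "VP"), ("NP", "DET", "N"), ("VP", "V", "NP")]
chromosome_b = chromosome_a + [("S", "NP", "V")]

print(fitness(chromosome_a, corpus), fitness(chromosome_b, corpus))
```

A genetic algorithm would then keep and recombine the chromosomes with higher fitness; here the second chromosome parses one more sentence than the first.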
{"title":"Natural Language Grammar Induction of Indonesian Language Corpora Using Genetic Algorithm","authors":"Ary Hermawan, Gunawan, Joan Santoso","doi":"10.1109/IALP.2011.58","DOIUrl":"https://doi.org/10.1109/IALP.2011.58","url":null,"abstract":"Grammar Induction is a machine learning process for learning grammar from corpora. This paper will discuss the process of grammar induction for Indonesian language corpora using genetic algorithm. The Grammar production rules will be modeled in the form of chromosomes. The fitness function is used to count how many sentences can be parsed. The data used are Indonesian fairy tales stories such as \"Bawang Merah Bawang Putih\" and \"Malin Kundang\". This paper describes the detailed explanations about the steps of each process carried out for natural language grammar problems.","PeriodicalId":297167,"journal":{"name":"2011 International Conference on Asian Language Processing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126516917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents an assessment, by an expert phonetician, of the automatic labeling of Mo Piu, an undocumented, unknown, under-resourced, and unwritten language of North Vietnam. For this task, we used five languages in different combinations in order to identify the best set. Two assessments are presented: first, that of the phonetic events, and second, that of the language sets. After presenting the methods used for automatic labeling and recognition, the paper focuses on the assessment of the phonetic units and of the language sets.
{"title":"Automatic Labeling and Phonetic Assessment for an Unknown Asian Language: The Case of the \"Mo Piu\" North Vietnamese Minority (early results)","authors":"G. Caelen-Haumont, Sam Sethserey, E. Castelli","doi":"10.1109/IALP.2011.81","DOIUrl":"https://doi.org/10.1109/IALP.2011.81","url":null,"abstract":"This paper aims at assessing the automatic labeling of an undocumented, unknown and underresourced unwritten language (Mo Piu) of the North Vietnam, by an expert phonetician. For this task, we chose 5 languages in different combinations in order to highlight the best set. Two assessments will be presented, first, that of the phonetic events, and secondly that of the languages sets. After the presentation of the methods used for the automatic labeling and recognition, the paper will focus on the assessment of the phonetic units and of the languages sets.","PeriodicalId":297167,"journal":{"name":"2011 International Conference on Asian Language Processing","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116801777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Establishing a Contemporary Mongolian word segmentation specification for information processing is of great significance for the standardization of information processing, the compatibility of different systems, the sharing of corpora, grammatical analysis, and POS tagging. This paper studies the framework of Mongolian word segmentation, including its guidelines, formulating principles, styles, scope of segmentation units, basis for establishment, and the structure of the specification, and lays the theoretical foundation for this specification.
{"title":"Theoretical Framework of Mongolian Word Segmentation Specification for Information Processing","authors":"T. Laga, Xiaobing Zhao","doi":"10.1109/IALP.2011.45","DOIUrl":"https://doi.org/10.1109/IALP.2011.45","url":null,"abstract":"The establishment of Contemporary Mongolian word segmentation specification for information processing has a great significance in the standardization of information processing, the compatibleness of different systems, the sharing of corpus, grammatical analysis, and POS tagging. The present paper studies the framework of Mongolian word segmentation including guidelines, formulating principles, styles, scopes of segmentation units, establishment foundation, structure of the specification and so on, and lays the theoretical foundation for this specification.","PeriodicalId":297167,"journal":{"name":"2011 International Conference on Asian Language Processing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128331151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A speech corpus plays a key role in the construction of automatic speech recognition (ASR), text-to-speech (TTS) synthesis, and phone recognition (PR) systems. PR and ASR systems are quite similar in functionality. The difference is that a PR system converts the speech signal to phone text (a phone being the smallest discrete segment of sound in uttered speech), whereas an ASR system converts the speech signal to word text. A speech corpus for a PR system usually consists of a text corpus, recordings corresponding to the text corpus, a phonetic representation of the text corpus, and a pronunciation dictionary. Selecting optimum text with a balanced phone distribution from the available text is an important task in developing a high-quality PR system. In this paper, we describe our text selection technique and discuss the performance of the phone recognition system.
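One common way to approach text selection with a balanced phone distribution is a greedy procedure, sketched below. This is a generic illustration and is not claimed to be the technique used in the paper: the scoring rule, the function name `select_sentences`, and the toy phone sequences are all assumptions.

```python
from collections import Counter
from typing import Dict, List


def select_sentences(phone_seqs: Dict[str, List[str]], k: int) -> List[str]:
    """Greedily pick up to k sentence ids that favour under-covered phones."""
    counts: Counter = Counter()      # phones covered by the selection so far
    remaining = dict(phone_seqs)     # candidate sentences not yet selected
    chosen: List[str] = []

    def gain(sid: str) -> float:
        phones = remaining[sid]
        if not phones:
            return 0.0
        # Each distinct phone contributes more when it is rare in the
        # current selection; dividing by length penalises long sentences.
        return sum(1.0 / (1 + counts[p]) for p in set(phones)) / len(phones)

    for _ in range(min(k, len(remaining))):
        # sorted() makes tie-breaking deterministic (lowest id wins).
        best = max(sorted(remaining), key=gain)
        chosen.append(best)
        counts.update(remaining.pop(best))
    return chosen


# Toy phone sequences, as would be produced by a pronunciation dictionary.
sentences = {
    "s1": ["a", "b", "a"],
    "s2": ["a", "a", "a"],
    "s3": ["c", "d"],
}
print(select_sentences(sentences, 2))
```

With these toy inputs the procedure first picks `s3` (two distinct phones in two positions) and then `s1`, leaving out `s2`, which adds no new phone.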
{"title":"Developing Bengali Speech Corpus for Phone Recognizer Using Optimum Text Selection Technique","authors":"S. Mandal, B. Das, Pabitra Mitra, A. Basu","doi":"10.1109/IALP.2011.16","DOIUrl":"https://doi.org/10.1109/IALP.2011.16","url":null,"abstract":"Speech corpus plays a key role in construction of automatic speech recognition (ASR), text-to-speech (TTS) synthesis and phone recognition (PR) system. PR system and ASR system are quite similar in functionality. The difference between these two is that for PR system the speech signal is converted to phonefootnote{smallest discrete segment of sound in uttered speech} text whereas for ASR system the speech signal is converted to word text. Speech corpus for PR system usually consists of a text corpus, recording data corresponding to the text corpus, phonetic representation of the text corpus and a pronunciation dictionary. Selecting optimum text from available text with balanced phone distribution is an important task for developing high quality PR system. In this paper, we describe our text selection technique and discuss the performance of phone recognition system.","PeriodicalId":297167,"journal":{"name":"2011 International Conference on Asian Language Processing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134491378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The context-based approach is currently a popular way of constructing bilingual lexicons from comparable corpora. Following this line of research, this paper proposes a dependency relationship mapping model and investigates its effect on bilingual lexicon construction. Experiments show that, by simultaneously mapping context words, dependency relationship types, and directions when calculating the similarity between a source-language word and a target-language word, our approach significantly outperforms a state-of-the-art system in bilingual lexicon construction in both the Chinese-English and English-Chinese directions. This demonstrates the effectiveness of our dependency relationship mapping model for bilingual lexicon construction.
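The idea of mapping context words, dependency types, and directions together before measuring similarity can be sketched as follows. The feature representation, the seed dictionary, the relation map, and the use of cosine similarity are assumptions chosen for illustration; the paper's exact model is not reproduced here, and all data below are toy values.

```python
import math
from typing import Dict, Tuple

# A context feature: (context word, dependency type, direction).
Feature = Tuple[str, str, str]
Vector = Dict[Feature, float]


def map_vector(vec: Vector, seed_dict: Dict[str, str],
               rel_map: Dict[str, str]) -> Vector:
    """Project a source-language context vector into the target language."""
    mapped: Vector = {}
    for (word, rel, direction), weight in vec.items():
        # Features whose word or relation cannot be mapped are dropped.
        if word in seed_dict and rel in rel_map:
            key = (seed_dict[word], rel_map[rel], direction)
            mapped[key] = mapped.get(key, 0.0) + weight
    return mapped


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Toy context vector for the Chinese word 苹果 ("apple").
source_vec = {("吃", "dobj", "head"): 2.0, ("红", "amod", "dep"): 1.0}
seed_dict = {"吃": "eat", "红": "red"}            # hypothetical seed lexicon
rel_map = {"dobj": "dobj", "amod": "amod"}        # hypothetical relation map

target_vecs = {
    "apple": {("eat", "dobj", "head"): 2.0, ("red", "amod", "dep"): 1.0},
    "run": {("eat", "nsubj", "dep"): 1.0},
}

mapped = map_vector(source_vec, seed_dict, rel_map)
best = max(target_vecs, key=lambda w: cosine(mapped, target_vecs[w]))
print(best)
```

Because the relation type and direction are part of each feature, the candidate `run` shares the context word `eat` but still receives zero similarity, which is the kind of discrimination a plain bag-of-context-words vector would miss.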
{"title":"Improving Bilingual Lexicon Construction from Chinese-English Comparable Corpora via Dependency Relationship Mapping","authors":"Hua Xu, Dandan Liu, Longhua Qian, Guodong Zhou","doi":"10.1109/IALP.2011.22","DOIUrl":"https://doi.org/10.1109/IALP.2011.22","url":null,"abstract":"Currently context-based approach is a popular approach for constructing bilingual lexicons from comparable corpora. Following this line of research, this paper proposes a dependency relationship mapping model and investigates its effect on bilingual lexicon construction. The experiments show that, by mapping context words, dependency relationship types and directions simultaneously when calculating the similarity between two words in the source and target languages respectively, our approach significantly outperforms a state-of-the-art system in bilingual lexicon construction from either Chinese-English or English-Chinese. This justifies the effectiveness of our dependency relationship mapping model on bilingual lexicon construction.","PeriodicalId":297167,"journal":{"name":"2011 International Conference on Asian Language Processing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127710907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}