An algorithm for the automatic detection of repetitive stuttering in Chinese speech is studied. Based on the features of repetitions in Chinese stuttered speech, improvements over previous research are proposed. First, a multi-span looping forced-alignment decoding network is designed to detect multi-syllable repetitions in Chinese stuttered speech. Second, a branch penalty factor is added to the network to adjust the decoding tendency during recursive search, reducing the errors caused by the complexity of the decoding network. Finally, the detected stutters are re-judged by computing a confidence score, to improve the reliability of the detection result. Experimental results show that, compared with the previous algorithm, the proposed algorithm improves system performance significantly, with a relative reduction of about 18% in the average detection error rate.
{"title":"A Computer-Assist Algorithm to Detect Repetitive Stuttering Automatically","authors":"Junbo Zhang, Bin Dong, Yonghong Yan","doi":"10.1109/IALP.2013.32","DOIUrl":"https://doi.org/10.1109/IALP.2013.32","journal":{"name":"2013 International Conference on Asian Language Processing"},"publicationDate":"2013-08-17"}
Ralph Vincent J. Regalado, Jenina L. Chua, J. L. Co, Thomas James Z. Tiam-Lee
Subjectivity classification determines whether a given document contains subjective information, or identifies which portions of the document are subjective. This research reports a machine learning approach to document-level and sentence-level subjectivity classification of Filipino texts using existing machine learning algorithms such as C4.5, Naïve Bayes, k-Nearest Neighbor, and Support Vector Machine. For document-level classification, the results show that Support Vector Machine gave the best result, with 95.06% accuracy, while for sentence-level classification, Naïve Bayes gave the best result, with 58.75% accuracy.
{"title":"Subjectivity Classification of Filipino Text with Features Based on Term Frequency -- Inverse Document Frequency","authors":"Ralph Vincent J. Regalado, Jenina L. Chua, J. L. Co, Thomas James Z. Tiam-Lee","doi":"10.1109/IALP.2013.40","DOIUrl":"https://doi.org/10.1109/IALP.2013.40","journal":{"name":"2013 International Conference on Asian Language Processing"},"publicationDate":"2013-08-17"}
Pinyin-to-character (P2C) conversion is mainly used to input Chinese characters into a computer. Its main problem is homophones, which are resolved by exploiting contextual information provided by a lexicon and an n-gram language model (LM). Our investigation of state-of-the-art P2C technologies reveals that conventional optimization methods are almost all based on minimizing text perplexity, which is not directly related to P2C performance. We therefore propose a new optimization criterion, the mutual information (MI) between a text corpus and its Pinyin transcription, and use it to perform self-supervised word segmentation, build a lexicon, and estimate an n-gram LM, from which a P2C system is built. We implemented the P2C system using a newspaper corpus. Compared with two baseline systems, one using a handcrafted lexicon and one using a perplexity-optimized lexicon, our system achieved relative error reductions of 19.7% and 10.3%, respectively, on the test corpus. The results show the effectiveness of our proposal.
{"title":"Using Mutual Information Criterion to Design an Effective Lexicon for Chinese Pinyin-to-Character Conversion","authors":"Wei Li, Jin-Song Zhang, Yanlu Xie, Xiaoyun Wang, M. Nishida, Seiichi Yamamoto","doi":"10.1109/IALP.2013.37","DOIUrl":"https://doi.org/10.1109/IALP.2013.37","journal":{"name":"2013 International Conference on Asian Language Processing"},"publicationDate":"2013-08-17"}
This paper presents a language-model-based solution to the multiple-choice test items of CET-4. Using models trained on web-scale English language data, different n-gram orders are examined under a dynamic programming search for the best answers. Experimental results indicate that both the 4-gram and 5-gram models achieve an average precision of 81% on 16 test items.
{"title":"A Tentative Study on Language Model Based Solution to Multiple Choice of CET-4","authors":"Zhihang Fan, Muyun Yang, T. Zhao, Sheng Li","doi":"10.1109/IALP.2013.35","DOIUrl":"https://doi.org/10.1109/IALP.2013.35","journal":{"name":"2013 International Conference on Asian Language Processing"},"publicationDate":"2013-08-17"}
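The idea of answering a multiple-choice item with a language model can be illustrated with a minimal sketch. It is not the authors' system: it assumes a toy bigram model with add-one smoothing instead of web-scale 4-gram or 5-gram models, it scores each option exhaustively rather than with a dynamic programming search, and the corpus, the `___` blank marker, and all function names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    vocab = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(tokens, tokens[1:])
    )

def answer(stem, options, unigrams, bigrams):
    """Fill the blank with each option and keep the highest-scoring one."""
    return max(
        options,
        key=lambda o: log_prob(stem.replace("___", o), unigrams, bigrams),
    )

# Toy training data (hypothetical, for illustration only).
corpus = [
    "she is interested in music",
    "he is interested in art",
    "they are good at math",
]
uni, bi = train_bigram(corpus)
best = answer("she is interested ___ music", ["in", "at", "on"], uni, bi)
print(best)
```

A higher-order model would replace the bigram counts with n-gram counts over a longer history; the selection step stays the same.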
Shengyu Liu, Qingcai Chen, Shanshan Guan, Xiaolong Wang, Huimiao Shi
Microblog, as an online communication platform, is becoming more and more popular. Users generate large volumes of data every day, and this user-generated content contains a lot of useful knowledge, such as practical skills and technical expertise. This paper proposes a cross-data method to mine recipes in Microblog. In the proposed method, text snippets relevant to recipes are first extracted from Baidu Encyclopedia. Second, the extracted snippets are used to train a domain-specific unigram language model. Third, candidate recipes in Microblog are mined based on the unigram language model. Finally, heuristic rules are used to identify real recipes among the candidates. Experimental results show the effectiveness of the proposed method.
{"title":"Mining Recipes in Microblog","authors":"Shengyu Liu, Qingcai Chen, Shanshan Guan, Xiaolong Wang, Huimiao Shi","doi":"10.1109/IALP.2013.13","DOIUrl":"https://doi.org/10.1109/IALP.2013.13","journal":{"name":"2013 International Conference on Asian Language Processing"},"publicationDate":"2013-08-17"}
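The second and third steps of the method above (train a domain-specific unigram model, then use it to find candidates) can be sketched roughly as follows. This is only an illustration under stated assumptions: English whitespace-tokenized text stands in for segmented Chinese, add-one smoothing and the average log-probability score are choices made here rather than details from the paper, and the snippets, posts, and function names are hypothetical.

```python
import math
from collections import Counter

def train_unigram(snippets):
    """Count tokens over the domain snippets (whitespace tokenization assumed)."""
    counts = Counter(tok for s in snippets for tok in s.lower().split())
    return counts, sum(counts.values())

def avg_log_prob(text, counts, total):
    """Average per-token log-probability under an add-one smoothed unigram model."""
    tokens = text.lower().split()
    if not tokens:
        return float("-inf")
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    return sum(
        math.log((counts[t] + 1) / (total + vocab)) for t in tokens
    ) / len(tokens)

# Hypothetical domain snippets and posts, for illustration only.
snippets = [
    "stir fry the beef with garlic and soy sauce",
    "boil the noodles then add soy sauce and garlic",
]
counts, total = train_unigram(snippets)

posts = [
    "traffic was terrible this morning again",
    "add garlic and soy sauce to the beef",
]
# Posts that look most like the domain text come first.
ranked = sorted(posts, key=lambda p: avg_log_prob(p, counts, total), reverse=True)
print(ranked[0])
```

Posts scoring above some threshold would become candidate recipes, to be filtered further by heuristic rules as the abstract describes.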
The paper introduces a dependency annotation effort that aims to fully annotate a Uyghur corpus. It is the first attempt of its kind to develop a large-scale treebank for Uyghur. We provide the motivation for adopting dependency theory as the annotation scheme and argue that dependency grammar is better suited to modeling the various linguistic phenomena in Uyghur. In our solution, syntactic relations are encoded as labeled dependency relations among segments of lexical items and sequences of inflectional groups separated by derivational boundaries. We present the basic annotation scheme, including morphological and syntactic dependency relations. We also show how the scheme handles phenomena such as omissions in copula sentences, punctuation, and coordination.
{"title":"The Annotation Scheme for Uyghur Dependency Treebank","authors":"Samat Mamitimin, Turgun Ibrahim, Marhaba Eli","doi":"10.1109/IALP.2013.56","DOIUrl":"https://doi.org/10.1109/IALP.2013.56","journal":{"name":"2013 International Conference on Asian Language Processing"},"publicationDate":"2013-08-17"}
Keyword matching is a preliminary step in public opinion analysis. Uyghur is an agglutinative language in which suffixes attach to words to express different semantic or syntactic functions in the text. Traditional matching algorithms therefore cannot be applied directly to Uyghur text, because Uyghur words take different surface forms. In this paper, we implement an automaton-based multi-keyword matching algorithm for Uyghur text. The algorithm handles inflectional suffixes and vowel weakening within words by means of reverse suffix automata and vowel-weakening restoration automata. By partitioning the keyword automata according to the first letter of each keyword, a general multi-thread keyword matching approach for Uyghur is also proposed.
{"title":"Multi-thread Multi-keywords Matching Approach for Uyghur Text","authors":"Xinyuan Zhao, Adili Abuliz","doi":"10.1109/IALP.2013.36","DOIUrl":"https://doi.org/10.1109/IALP.2013.36","journal":{"name":"2013 International Conference on Asian Language Processing"},"publicationDate":"2013-08-17"}
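To see why surface forms defeat exact matching, and how undoing suffixes helps, consider the following minimal sketch. It is a much simpler stand-in for the automata described in the abstract: it strips suffixes from a small, illustrative list (Latin-script forms, assumed here) and looks the result up in a keyword set. It does not restore weakened vowels, it is single-threaded, and the suffix list, example sentence, and function names are all hypothetical.

```python
# Illustrative Latin-script suffixes (an assumption, far from a complete inventory).
SUFFIXES = ("ning", "lar", "ler", "ni", "da", "de", "ta", "te")

def stems(word):
    """Yield the word and every form reachable by repeatedly stripping a suffix."""
    seen = {word}
    stack = [word]
    while stack:
        current = stack.pop()
        yield current
        for suffix in SUFFIXES:
            if current.endswith(suffix) and len(current) > len(suffix):
                base = current[: -len(suffix)]
                if base not in seen:
                    seen.add(base)
                    stack.append(base)

def match_keywords(text, keywords):
    """Return (token position, surface form, matched keyword) for each hit."""
    hits = []
    for pos, token in enumerate(text.lower().split()):
        for candidate in stems(token):
            if candidate in keywords:
                hits.append((pos, token, candidate))
                break
    return hits

# "kitablarni" = kitab + lar + ni, "mektepte" = mektep + te (illustrative forms).
hits = match_keywords("men kitablarni mektepte oqudum", {"kitab", "mektep"})
print(hits)
```

Blind stripping can over-match when a stem happens to end in a suffix-like string, which is one reason a real system would constrain which suffix sequences are allowed.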
This research focuses on the problem of Uyghur speech recognition for accented spoken language. When accented speech with pronunciation variation is recognized by a system built for standard spoken language, the recognition rate is not high enough. We propose a speech recognition framework for accented spoken Uyghur: we analyze acoustic characteristics, describe the pronunciation variation phenomena of Uyghur, and create an acoustic model and a multi-pronunciation dictionary. Preliminary experimental results show that the proposed method improves the performance of Uyghur continuous speech recognition.
{"title":"Speech Recognition Research on Uyghur Accent Spoken Language","authors":"Yating Yang, Bo Ma, Xinyu Tang, Osman Turghun","doi":"10.1109/IALP.2013.52","DOIUrl":"https://doi.org/10.1109/IALP.2013.52","journal":{"name":"2013 International Conference on Asian Language Processing"},"publicationDate":"2013-08-17"}
Linfeng Song, Jun Xie, Xing Wang, Yajuan Lü, Qun Liu
Spoken language translation often suffers from missing translations of content words, and thus fails to generate an appropriate translation. In this paper, we propose a novel mutual-information-based method to improve spoken language translation by retrieving the missing translations of content words. We exploit several features that indicate, for each rule, how well its inner content words are translated, so that MT systems can select better translation rules. Experimental results show that our method improves translation performance significantly, by 1.95 to 4.47 BLEU points on different test sets.
{"title":"Rule Refinement for Spoken Language Translation by Retrieving the Missing Translation of Content Words","authors":"Linfeng Song, Jun Xie, Xing Wang, Yajuan Lü, Qun Liu","doi":"10.1109/IALP.2013.23","DOIUrl":"https://doi.org/10.1109/IALP.2013.23","journal":{"name":"2013 International Conference on Asian Language Processing"},"publicationDate":"2013-08-17"}
We describe our work on pronominal resolution in Tamil using Tree CRFs. Pronominal resolution is the task of identifying the referent of a pronominal. In this work we study third-person pronouns in Tamil such as 'avan' (he), 'aval' (she), 'athu' (it), and 'avar' (they). Tamil is a Dravidian language that is morphologically rich and highly agglutinative. Tree CRFs are a machine learning method in which the data is modeled as a graph, with edge weights used for learning. The features for learning are derived from the morphological features of the language. The work is carried out on tourism-domain data from the Web. We obtained 70.8% precision and 66.5% recall. The results are encouraging.
{"title":"Pronominal Resolution in Tamil Using Tree CRFs","authors":"R. Ram, S. L. Devi","doi":"10.1109/IALP.2013.59","DOIUrl":"https://doi.org/10.1109/IALP.2013.59","journal":{"name":"2013 International Conference on Asian Language Processing"},"publicationDate":"2013-08-17"}