Information retrieval models still cannot achieve satisfactory performance after decades of development. One reason is that queries cannot express information needs precisely. Research has shown that query reformulation can improve the performance of retrieval models. In this paper, we propose a query reformulation model that uses a Markov network to represent term relationships, drawing useful information from the corpus to reformulate the query. Experimental results show that our model avoids topic drift and thereby improves retrieval performance.
Jiali Zuo, Mingwen Wang. "A Query Reformulation Model Using Markov Graphic Method." 2011 International Conference on Asian Language Processing, 15 November 2011. doi:10.1109/IALP.2011.62
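The core idea above, mining term relationships from the corpus to add related terms to a query, can be sketched with a simple co-occurrence graph. This is an illustrative stand-in for the paper's Markov network, not the authors' implementation; the function names and toy corpus are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(docs):
    """Count how often two terms appear in the same document."""
    cooc = defaultdict(int)
    for doc in docs:
        terms = set(doc.split())
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1
    return cooc

def expand_query(query, cooc, top_k=2):
    """Add the terms most strongly linked to any query term."""
    scores = defaultdict(int)
    qset = set(query.split())
    for (a, b), n in cooc.items():
        if a in qset and b not in qset:
            scores[b] += n
        elif b in qset and a not in qset:
            scores[a] += n
    # rank candidate expansion terms by link strength, then alphabetically
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return query.split() + ranked[:top_k]
```

A real system would weight edges (e.g. by mutual information) rather than raw counts, which is one way a Markov network over terms refines this naive picture.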
The performance of Chinese Pinyin-to-Character conversion is severely affected when the characteristics of the training and conversion data differ. Because natural language is highly variable and uncertain, it is impossible to build a complete, general language model that suits all tasks. Traditional adaptive MAP models mix task-independent data with task-dependent data using a mixture coefficient, but we can never predict what style of language users will produce or what new domains will appear. This paper presents a statistical error-driven adaptive language modeling approach for a Chinese Pinyin input system. The model is incrementally adapted whenever an error occurs during Pinyin-to-Character conversion, and it significantly improves the conversion rate.
J. Huang, D. Powers. "Error-Driven Adaptive Language Modeling for Chinese Pinyin-to-Character Conversion." 2011 International Conference on Asian Language Processing, 15 November 2011. doi:10.1109/IALP.2011.46
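The error-driven update can be illustrated with a toy character bigram model that boosts the count of a user-corrected character, so the same context prefers the correction next time. This is a hypothetical sketch of the general mechanism, not the paper's actual model; the class name and boost scheme are invented.

```python
from collections import defaultdict

class AdaptiveBigramLM:
    """Toy error-driven adaptive bigram model: when the user corrects a
    misconverted character, the corrected bigram's count is boosted."""

    def __init__(self, boost=2):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.boost = boost

    def train(self, sentence):
        # accumulate bigram counts over adjacent characters
        for prev, cur in zip(sentence, sentence[1:]):
            self.counts[prev][cur] += 1

    def correct(self, prev, wrong, right):
        # error-driven update: reward the user's correction
        self.counts[prev][right] += self.boost

    def predict(self, prev):
        # most likely next character; ties broken alphabetically
        cands = self.counts[prev]
        return max(cands, key=lambda c: (cands[c], c)) if cands else None
```

After one correction, the boosted candidate overtakes the originally preferred one, which is the incremental adaptation the abstract describes.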
Speech visualization can be extended to pronunciation animation for language learners. In this paper, a three-dimensional English articulation database is recorded using a Carstens Electro-Magnetic Articulograph (EMA AG500). An HMM-based visual synthesis method for continuous speech is implemented to recover 3D articulatory information, and the synthesized articulations are compared with the EMA recordings for objective evaluation. Using a data-driven 3D talking head, distinctions between confusable phonemes can be depicted through both external and internal articulatory movements. Experiments demonstrate that the HMM-based synthesis achieves a minimum RMS error of less than 2 mm with limited training data. The synthesized articulatory movements can be used for computer-assisted pronunciation training.
Sheng Li, Lan Wang, En Qi. "The Phoneme-Level Articulator Dynamics for Pronunciation Animation." 2011 International Conference on Asian Language Processing, 15 November 2011. doi:10.1109/IALP.2011.13
Polarity shifting has been a challenge for automatic sentiment classification. In this paper, we create a corpus of polarity-shifted sentences drawn from various kinds of product reviews. In the corpus, both the sentiment words and the shifting trigger words are annotated. Furthermore, we analyze all the polarity-shifted sentences and categorize them into five categories: opinion-itself, holder, target, time, and hypothesis. An experimental study reports the annotation agreement and the distribution of the five categories of polarity shifting.
Xiaoqian Zhang, Shoushan Li, Guodong Zhou, Hongxia Zhao. "Polarity Shifting: Corpus Construction and Analysis." 2011 International Conference on Asian Language Processing, 15 November 2011. doi:10.1109/IALP.2011.27
Summarization is the process of generating a condensed form of a given text document that retains its information and overall meaning. Document summarization approaches are broadly classified into two kinds: extractive and abstractive. In this paper, we perform single-document summarization of Telugu text documents using the extractive approach. Although many document surface features exist, we consider those that extensively cover the original document and generate summaries with little redundancy: sentence position, sentence similarity with the title, sentence centrality, and word frequency. To strengthen these features, we use a corpus of 3000 documents and apply preprocessing steps such as stop-word elimination and stemming to retain the more meaningful words within each sentence. Sentences are ranked by a score that combines all four features with optimum weights, and the optimum weights are learned from human-constructed summaries. The machine-generated summaries are evaluated using the F1 measure, followed by human judgments.
P. Reddy, B. V. Vardhan, A. Govardhan. "Corpus Based Extractive Document Summarization for Indic Script." 2011 International Conference on Asian Language Processing, 15 November 2011. doi:10.1109/IALP.2011.66
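The weighted-feature ranking described above can be sketched minimally using just two of the four features, sentence position and word frequency. The weights, helper names, and toy sentences here are invented for illustration; the paper learns its weights from human-constructed summaries.

```python
def summarize(sentences, weights, top_n=1):
    """Toy extractive summarizer: score each sentence by a weighted
    combination of surface features and keep the top-scoring ones."""
    # corpus-level word frequencies
    freq = {}
    for s in sentences:
        for w in s.lower().split():
            freq[w] = freq.get(w, 0) + 1

    def position_score(i):
        # earlier sentences score higher
        return 1.0 - i / len(sentences)

    def frequency_score(s):
        # mean corpus frequency of the sentence's words
        words = s.lower().split()
        return sum(freq[w] for w in words) / len(words)

    scored = sorted(
        ((weights["position"] * position_score(i)
          + weights["frequency"] * frequency_score(s), i)
         for i, s in enumerate(sentences)),
        key=lambda t: (-t[0], t[1]))
    # keep the top sentences, restored to document order
    chosen = sorted(i for _, i in scored[:top_n])
    return [sentences[i] for i in chosen]
```

Adding the remaining two features (title similarity and centrality) would follow the same pattern: one scoring function per feature, combined by its learned weight.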
In processing modern Uygur corpora, it is necessary to study word-class marking at the word level within modern Uygur language data. Since the classification of morphemes serves word-class marking, this article classifies Uygur morphemes by their functions and lists all their classifications and arrangement rules.
Pu Li, Shuzhen Shi. "A Study of the Classification and Arrangement Rule of Uygur Morphemes for Information Processing." 2011 International Conference on Asian Language Processing, 15 November 2011. doi:10.1109/IALP.2011.50
Natural language processing tasks are increasingly conducted over online content, which poses a particular problem for Arabic applications: online Arabic content is usually written in informal colloquial Arabic, which is ill-structured and lacks linguistic standardization. In this paper, we investigate a preliminary step toward successful NLP processing, the problem of sentence boundary detection. Because informal Arabic lacks basic linguistic rules, we establish a list of commonly used punctuation marks after extensively studying a large amount of informal Arabic text. We then evaluated the use of these punctuation marks as sentence delimiters, which yielded a preliminary accuracy of 70%.
A. Al-Subaihin, Hend Suliman Al-Khalifa, A. Al-Salman. "Sentence Boundary Detection in Colloquial Arabic Text: A Preliminary Result." 2011 International Conference on Asian Language Processing, 15 November 2011. doi:10.1109/IALP.2011.38
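Punctuation-based sentence splitting of this kind can be sketched as a regular-expression split. The delimiter set below is hypothetical, chosen only for the example; the paper derives its own list from a study of informal Arabic text.

```python
import re

# Hypothetical delimiter set for illustration, including the
# Arabic question mark alongside common Latin punctuation.
DELIMS = ".!?\u061f\u2026"

def split_sentences(text):
    """Split on runs of delimiter characters, keeping non-empty spans."""
    parts = re.split(f"[{re.escape(DELIMS)}]+", text)
    return [p.strip() for p in parts if p.strip()]
```

The 70% figure above suggests why a plain split is only a baseline: in informal text, delimiters are often omitted or used for emphasis rather than as sentence boundaries.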
In this paper, we propose an approach to modeling the pronunciation of non-native accented speech for an automatic speech recognition system. The proposed method consists of two phases: phone adaptation and pronunciation generalization. In phone adaptation, we identify the phones used by non-native speakers compared with the standard phones, then remove the mismatch caused by the influence of the mother tongue. In pronunciation generalization, we predict the pronunciations of words by non-native speakers. The results show that the proposed approach reduces the WER from 44.8% to 41.9%.
Basem H. A. Ahmed, T. Tan. "Non-native Accent Pronunciation Modeling in Automatic Speech Recognition." 2011 International Conference on Asian Language Processing, 15 November 2011. doi:10.1109/IALP.2011.65
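WER figures such as those reported above are computed as the word-level edit distance between reference and hypothesis transcripts, normalized by the reference length. The following is a standard implementation of that metric, not code specific to this paper.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```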
This paper introduces a model that describes the dynamic evolution of network information, identifying and analyzing document collections on the same topic at different stages. To characterize the dynamic relationships among evolving content, the paper presents a dynamic multi-document summarization model called the Dynamic Manifold-Ranking Model (DMRM). Experiments were conducted on the Update Task test data from TAC 2008, and the new model's results were compared with results from the TAC 2008 evaluation, demonstrating the model's effectiveness.
Meiling Liu, Honge Ren, Dequan Zheng, T. Zhao. "Research on Multi-document Summarization Model Based on Dynamic Manifold-Ranking." 2011 International Conference on Asian Language Processing, 15 November 2011. doi:10.1109/IALP.2011.55
Personal names are among the most frequently searched items in web search engines, and a person entity is always associated with numerous properties. In this paper, we propose an integrated model for Vietnamese that simultaneously recognizes person entities and extracts the values of a pre-defined set of properties related to each person. We also design a rich feature set using various kinds of knowledge resources and apply the well-known machine learning method of Conditional Random Fields (CRFs) to improve the results. The results show that our method is suitable for Vietnamese, with an average precision of 84%, recall of 82.56%, and F-measure of 83.39%. Moreover, the running time is good, and the results also demonstrate the effectiveness of our feature set.
Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, N. Phan, Quang-Thuy Ha. "An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text." 2011 International Conference on Asian Language Processing, 15 November 2011. doi:10.1109/IALP.2011.37