G. Fu, K. Luke. "A Two-stage Statistical Word Segmentation System for Chinese". Workshop on Chinese Language Processing, 2003. doi:10.3115/1119250.1119273
In this paper we present a two-stage statistical word segmentation system for Chinese based on word bigram and word-formation models. The system was evaluated on the Peking University corpora at the First International Chinese Word Segmentation Bakeoff. We also report and discuss the evaluation results.
Andi Wu. "Learning Verb-Noun Relations to Improve Parsing". Workshop on Chinese Language Processing, 2003. doi:10.3115/1119250.1119267
The verb-noun sequence in Chinese often creates ambiguities in parsing. These ambiguities can usually be resolved if we know in advance whether the verb and the noun tend to be in the verb-object relation or the modifier-head relation. In this paper, we describe a learning procedure whereby such knowledge can be automatically acquired. Using an existing (imperfect) parser with a chart filter and a tree filter, a large corpus, and the log-likelihood-ratio (LLR) algorithm, we were able to acquire verb-noun pairs which typically occur either in verb-object relations or modifier-head relations. The learned pairs are then used in the parsing process for disambiguation. Evaluation shows that the accuracy of the original parser improves significantly with the use of the automatically acquired knowledge.
Shengfen Luo, Maosong Sun. "Two-Character Chinese Word Extraction Based on Hybrid of Internal and Contextual Measures". Workshop on Chinese Language Processing, 2003. doi:10.3115/1119250.1119254
Word extraction is one of the important tasks in text information processing. There are mainly two kinds of statistic-based measures for word extraction: internal measures and contextual measures. This paper discusses both kinds of measures for Chinese word extraction. First, nine widely adopted internal measures are tested and compared on an individual basis. Then various schemes for combining these measures are tried in order to improve performance. Finally, left/right entropy is integrated to assess the effect of contextual measures. A genetic algorithm is used to automatically adjust the combination weights and thresholds. Experiments on two-character Chinese word extraction show promising results: mutual information, the most powerful internal measure, achieves an F-measure of 57.82%, whereas the best combination scheme of internal measures achieves an F-measure of 59.87%. With the integration of the contextual measure, word extraction finally reaches an F-measure of 68.48%.
Tracy Lin, Jason J. S. Chang. "Class Based Sense Definition Model for Word Sense Tagging and Disambiguation". Workshop on Chinese Language Processing, 2003. doi:10.3115/1119250.1119252
We present an unsupervised learning strategy for word sense disambiguation (WSD) that exploits multiple linguistic resources, including a parallel corpus, a bilingual machine-readable dictionary, and a thesaurus. The approach is based on the Class Based Sense Definition Model (CBSDM), which generates the glosses and translations for a class of word senses. The model can be applied to resolve sense ambiguity for words in a parallel corpus. This sense tagging procedure, in effect, produces a semantic bilingual concordance, which can be used to train WSD systems for the two languages involved. Experimental results show that CBSDM trained on the Longman Dictionary of Contemporary English, English-Chinese Edition (LDOCE E-C) and the Longman Lexicon of Contemporary English (LLOCE) is very effective in turning a Chinese-English parallel corpus into sense-tagged data for the development of WSD systems.
Guodong Zhou. "Modeling of Long Distance Context Dependency in Chinese". Workshop on Chinese Language Processing, 2003. doi:10.3115/1119250.1119260
Ngram modeling is simple and has been widely used in language modeling applications. However, it can only capture short-distance context dependency within an N-word window, where the largest practical N for natural language is three, while much of the context dependency in natural language occurs beyond a three-word window. To incorporate this long-distance context dependency, this paper proposes a new MI-Ngram modeling approach. The MI-Ngram model consists of two components: an ngram model and an MI model. The ngram model captures the short-distance context dependency within an N-word window, while the MI model captures the long-distance context dependency between word pairs beyond the N-word window using the concept of mutual information. MI-Ngram modeling is found to perform much better than ngram modeling. Evaluation on the XINHUA news corpus of 29 million words shows that including the best 1,600,000 word pairs decreases the perplexity of the MI-Trigram model by 20 percent compared with the trigram model. Meanwhile, evaluation on Chinese word segmentation shows that about 35 percent of errors can be corrected by using the MI-Trigram model instead of the trigram model.
H. Sun, Dan Jurafsky. "The Effect of Rhythm on Structural Disambiguation in Chinese". Workshop on Chinese Language Processing, 2003. doi:10.3115/1119250.1119256
The length of a constituent (the number of syllables in a word, or the number of words in a phrase), or rhythm, plays an important role in Chinese syntax. This paper systematically surveys the distribution of rhythm across Chinese constructions, using statistical data acquired from a shallow treebank. Based on this survey, we then used rhythm in a practical shallow parsing task, as a statistical feature to augment a PCFG model. Our results show that using the probabilistic rhythm feature significantly improves the performance of our shallow parser.
O. Streiter. "Abductive Explanation-based Learning Improves Parsing Accuracy and Efficiency". Workshop on Chinese Language Processing, 2003. doi:10.3115/1119250.1119265
Natural language parsing has to be accurate and quick. Explanation-based Learning (EBL) is a technique to speed up parsing; accuracy, however, often declines with EBL. This paper shows that this accuracy loss is due not to the EBL framework as such, but to deductive parsing. Abductive EBL allows the deductive closure of the parser to be extended. We present a Chinese parser based on abduction. Experiments show improvements in both accuracy and efficiency.
Tianfang Yao, Wei Ding, G. Erbach. "CHINERS: A Chinese Named Entity Recognition System for the Sports Domain". Workshop on Chinese Language Processing, 2003. doi:10.3115/1119250.1119258
In investigating Chinese named entity (NE) recognition, we are confronted with two principal challenges. One is how to ensure the quality of word segmentation and part-of-speech (POS) tagging, since errors at this stage have an adverse impact on the performance of NE recognition. The other is how to recognize NEs flexibly, reliably and accurately. To cope with these challenges, we propose a system architecture divided into two phases. In the first phase, we reduce as far as possible the word segmentation and POS tagging errors passed on to the second phase; for this purpose, we utilize machine learning techniques to repair such errors. In the second phase, we design Finite State Cascades (FSC), which can be constructed automatically from the recognition rule sets, as a shallow parser for NE recognition. The advantages of FSC are reliability, accuracy and ease of maintenance. Additionally, to recognize special NEs, we work out corresponding strategies to enhance recognition correctness. Experimental evaluation shows that the total average recall and precision for six types of NEs are 83% and 85% respectively, indicating that the system architecture is reasonable and effective.
R. Sproat, Thomas Emerson. "The First International Chinese Word Segmentation Bakeoff". Workshop on Chinese Language Processing, 2003. doi:10.3115/1119250.1119269
This paper presents the results from the ACL-SIGHAN-sponsored First International Chinese Word Segmentation Bakeoff held in 2003 and reported in conjunction with the Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan. We give the motivation for having an international segmentation contest (given that there have been two within-China contests to date) and we report on the results of this first international contest, analyze these results, and make some recommendations for the future.
Nianwen Xue, Libin Shen. "Chinese Word Segmentation as LMR Tagging". Workshop on Chinese Language Processing, 2003. doi:10.3115/1119250.1119278
In this paper we present Chinese word segmentation algorithms based on the so-called LMR tagging. Our LMR taggers are implemented with the Maximum Entropy Markov Model and we then use Transformation-Based Learning to combine the results of the two LMR taggers that scan the input in opposite directions. Our system achieves F-scores of 95.9% and 91.6% on the Academia Sinica corpus and the Hong Kong City University corpus respectively.