
Workshop on Chinese Language Processing: Latest Publications

A Two-stage Statistical Word Segmentation System for Chinese
Pub Date: 2003-07-11 | DOI: 10.3115/1119250.1119273
G. Fu, K. Luke
In this paper we present a two-stage statistical word segmentation system for Chinese based on word bigram and word-formation models. The system was evaluated on the Peking University corpora at the First International Chinese Word Segmentation Bakeoff; we report and discuss the results of this evaluation.
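The abstract gives no implementation detail, so the following is only a minimal sketch of the first stage: a Viterbi search over a word-bigram model. The lexicon, probabilities, and back-off are invented for illustration, and the second-stage word-formation model is omitted.

```python
# A minimal word-bigram Viterbi segmenter over a toy lexicon.
# All names and probabilities here are illustrative, not from the paper.
import math

LEXICON = {"中文": 0.02, "分词": 0.01, "系统": 0.03, "中": 0.05, "文": 0.04}
BIGRAM = {("中文", "分词"): 0.1}  # P(w2 | w1); back off to unigram below


def seg_score(prev, word):
    """log P(word | prev), with a crude unigram back-off."""
    if (prev, word) in BIGRAM:
        return math.log(BIGRAM[(prev, word)])
    return math.log(LEXICON.get(word, 1e-8))


def segment(sent, max_len=4):
    """Viterbi search over all in-lexicon segmentations of sent."""
    n = len(sent)
    best = [(-math.inf, None, None)] * (n + 1)  # (score, back-pointer, word)
    best[0] = (0.0, None, "<s>")
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = sent[j:i]
            if word not in LEXICON:
                continue
            score = best[j][0] + seg_score(best[j][2], word)
            if score > best[i][0]:
                best[i] = (score, j, word)
    words, i = [], n
    while i > 0 and best[i][1] is not None:
        words.append(best[i][2])
        i = best[i][1]
    return list(reversed(words))


print(segment("中文分词系统"))  # ['中文', '分词', '系统']
```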
Citations: 19
Learning Verb-Noun Relations to Improve Parsing
Pub Date: 2003-07-11 | DOI: 10.3115/1119250.1119267
Andi Wu
The verb-noun sequence in Chinese often creates ambiguities in parsing. These ambiguities can usually be resolved if we know in advance whether the verb and the noun tend to be in the verb-object relation or the modifier-head relation. In this paper, we describe a learning procedure whereby such knowledge can be automatically acquired. Using an existing (imperfect) parser with a chart filter and a tree filter, a large corpus, and the log-likelihood-ratio (LLR) algorithm, we were able to acquire verb-noun pairs which typically occur either in verb-object relations or modifier-head relations. The learned pairs are then used in the parsing process for disambiguation. Evaluation shows that the accuracy of the original parser improves significantly with the use of the automatically acquired knowledge.
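The LLR score used to rank candidate pairs is presumably Dunning's log-likelihood-ratio statistic for collocations; a self-contained sketch, with made-up counts, of how a (verb, noun) pair would be scored:

```python
# A sketch of the log-likelihood-ratio (LLR) association score the paper
# uses to rank verb-noun pairs; counts below are invented for illustration.
import math


def _ll(k, n, p):
    """k * log p + (n - k) * log(1 - p), guarding the degenerate cases."""
    eps = 1e-12
    return k * math.log(max(p, eps)) + (n - k) * math.log(max(1 - p, eps))


def llr(c_vn, c_v, c_n, total):
    """-2 log lambda for a (verb, noun) pair.

    c_vn: co-occurrence count, c_v/c_n: marginal counts, total: corpus size.
    """
    p = c_n / total                       # P(noun) under independence
    p1 = c_vn / c_v                       # P(noun | verb)
    p2 = (c_n - c_vn) / (total - c_v)     # P(noun | not verb)
    return 2 * (
        _ll(c_vn, c_v, p1) + _ll(c_n - c_vn, total - c_v, p2)
        - _ll(c_vn, c_v, p) - _ll(c_n - c_vn, total - c_v, p)
    )


# A frequent pair scores far higher than chance co-occurrence would:
print(llr(c_vn=120, c_v=800, c_n=600, total=1_000_000))
```

Pairs with high LLR co-occur much more often than independence predicts, which is why the measure is a common choice for acquiring association lexicons from large corpora.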
Citations: 10
Two-Character Chinese Word Extraction Based on Hybrid of Internal and Contextual Measures
Pub Date: 2003-07-11 | DOI: 10.3115/1119250.1119254
Shengfen Luo, Maosong Sun
Word extraction is one of the important tasks in text information processing. There are mainly two kinds of statistic-based measures for word extraction: internal measures and contextual measures. This paper discusses these two kinds of measures for Chinese word extraction. First, nine widely adopted internal measures are tested and compared on an individual basis. Then various schemes for combining these measures are tried in order to improve performance. Finally, left/right entropy is integrated to assess the effect of contextual measures. A genetic algorithm is explored to automatically adjust the combination weights and thresholds. Experiments on two-character Chinese word extraction show a promising result: the F-measure of mutual information, the most powerful internal measure, is 57.82%, whereas the best combination scheme of internal measures achieves an F-measure of 59.87%. With the integration of the contextual measure, word extraction finally achieves an F-measure of 68.48%.
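As a concrete illustration of the two families of measures, here is a sketch of one internal measure (pointwise mutual information between the two characters of a candidate word) and one contextual measure (left/right branching entropy). All counts are toy values, and the genetic-algorithm weighting is not shown.

```python
# Pointwise mutual information (internal) and branching entropy (contextual)
# for a two-character word candidate; counts are invented for illustration.
import math
from collections import Counter


def pmi(c_xy, c_x, c_y, total):
    """Internal measure: PMI of the two characters of a candidate word."""
    return math.log((c_xy * total) / (c_x * c_y), 2)


def branching_entropy(neighbor_counts: Counter):
    """Contextual measure: entropy of the characters seen on one side of
    the candidate; high entropy on both sides suggests a word boundary."""
    n = sum(neighbor_counts.values())
    return -sum((c / n) * math.log(c / n, 2) for c in neighbor_counts.values())


print(pmi(c_xy=50, c_x=200, c_y=300, total=100_000))   # ~6.4 bits
left = Counter({"的": 10, "在": 8, "了": 7, "是": 5})    # toy left contexts
print(branching_entropy(left))                          # ~2.0 bits
```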
Citations: 37
Class Based Sense Definition Model for Word Sense Tagging and Disambiguation
Pub Date: 2003-07-11 | DOI: 10.3115/1119250.1119252
Tracy Lin, Jason J. S. Chang
We present an unsupervised learning strategy for word sense disambiguation (WSD) that exploits multiple linguistic resources, including a parallel corpus, a bilingual machine-readable dictionary, and a thesaurus. The approach is based on the Class Based Sense Definition Model (CBSDM), which generates the glosses and translations for a class of word senses. The model can be applied to resolve sense ambiguity for words in a parallel corpus. That sense tagging procedure, in effect, produces a semantic bilingual concordance, which can be used to train WSD systems for the two languages involved. Experimental results show that CBSDM trained on the Longman Dictionary of Contemporary English, English-Chinese Edition (LDOCE E-C) and the Longman Lexicon of Contemporary English (LLOCE) is very effective in turning a Chinese-English parallel corpus into sense-tagged data for the development of WSD systems.
Citations: 0
Modeling of Long Distance Context Dependency in Chinese
Pub Date: 2003-07-11 | DOI: 10.3115/1119250.1119260
Guodong Zhou
Ngram modeling is simple and has been widely used in many language modeling applications. However, it can only capture short-distance context dependency within an N-word window, where the largest practical N for natural language is three. Meanwhile, much context dependency in natural language occurs beyond a three-word window. In order to incorporate this kind of long-distance context dependency, this paper proposes a new MI-Ngram modeling approach. The MI-Ngram model consists of two components: an ngram model and an MI model. The ngram model captures the short-distance context dependency within an N-word window, while the MI model captures the long-distance context dependency between word pairs beyond the N-word window by using the concept of mutual information. We find that MI-Ngram modeling performs much better than ngram modeling. Evaluation on the XINHUA news corpus of 29 million words shows that including the best 1,600,000 word pairs decreases the perplexity of the MI-Trigram model by 20 percent compared with the trigram model. Meanwhile, evaluation on Chinese word segmentation shows that about 35 percent of errors can be corrected by the MI-Trigram model compared with the trigram model.
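The abstract only outlines the model, so the sketch below shows just the general shape of the idea: an ordinary ngram log-probability plus a mutual-information term for word pairs that span beyond the ngram window. The tables and scores are invented, and the real model's training and normalization are omitted.

```python
# Adding a long-distance mutual-information term to an ngram log-probability,
# in the spirit of the MI-Ngram model; all tables here are toy stand-ins.
TRIGRAM_LOGP = {("previous", "two", "words"): -2.0}  # stand-in ngram model
MI_PAIRS = {("bank", "loan"): 1.2}  # log [ P(a, b) / (P(a) P(b)) ]


def mi_ngram_logp(history, word, window=10):
    """log P(word | history) = ngram term + MI of word with distant words."""
    logp = TRIGRAM_LOGP.get(tuple(history[-2:]) + (word,), -8.0)  # crude back-off
    for distant in history[:-2][-window:]:  # words beyond the trigram window
        logp += MI_PAIRS.get((distant, word), 0.0)
    return logp


print(mi_ngram_logp(["bank", "said", "the", "previous", "two"], "words"))
print(mi_ngram_logp(["bank", "asked", "for", "a", "new"], "loan"))  # MI pair fires
```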
Citations: 0
The Effect of Rhythm on Structural Disambiguation in Chinese
Pub Date: 2003-07-11 | DOI: 10.3115/1119250.1119256
H. Sun, Dan Jurafsky
The length of a constituent (the number of syllables in a word or the number of words in a phrase), or rhythm, plays an important role in Chinese syntax. This paper systematically surveys the distribution of rhythm in Chinese constructions, using statistical data acquired from a shallow treebank. Based on this survey, we then applied rhythm as a statistical feature to augment a PCFG model in a practical shallow parsing task. Our results show that the probabilistic rhythm feature significantly improves the performance of our shallow parser.
Citations: 11
Abductive Explanation-based Learning Improves Parsing Accuracy and Efficiency
Pub Date: 2003-07-11 | DOI: 10.3115/1119250.1119265
O. Streiter
Natural language parsing has to be accurate and quick. Explanation-based Learning (EBL) is a technique for speeding up parsing. Accuracy, however, often declines with EBL. This paper shows that the accuracy loss is not due to the EBL framework as such, but to deductive parsing. Abductive EBL allows the deductive closure of the parser to be extended. We present a Chinese parser based on abduction. Experiments show improvements in both accuracy and efficiency.
Citations: 0
CHINERS: A Chinese Named Entity Recognition System for the Sports Domain
Pub Date: 2003-07-11 | DOI: 10.3115/1119250.1119258
Tianfang Yao, Wei Ding, G. Erbach
In investigating Chinese named entity (NE) recognition, we are confronted with two principal challenges. One is how to ensure the quality of word segmentation and part-of-speech (POS) tagging, because errors there have an adverse impact on the performance of NE recognition. The other is how to recognize NEs flexibly, reliably and accurately. To cope with these challenges, we propose a system architecture divided into two phases. In the first phase, we reduce as much as possible the word segmentation and POS tagging errors carried into the second phase; to this end, we utilize machine learning techniques to repair such errors. In the second phase, we design Finite State Cascades (FSC), which can be constructed automatically from the recognition rule sets, as a shallow parser for NE recognition. The advantages of FSC are reliability, accuracy and ease of maintenance. Additionally, to recognize special NEs, we work out corresponding strategies to enhance recognition correctness. Experimental evaluation of the system shows that the total average recall and precision for six types of NEs are 83% and 85% respectively. The system architecture is thus reasonable and effective.
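As an illustration of the finite-state-cascade idea (not CHINERS's actual rule sets), each level below is a set of pattern-rewrite rules applied to the output of the previous level, so small constituents are grouped first and larger entities are built on top of them. The patterns, tags, and the POS-tagged input format are invented for this sketch.

```python
# An illustrative two-level finite-state cascade over a POS-tagged string
# such as "上海/NS 队/n 对/p 北京/NS 队/n" (word/TAG pairs, space separated).
import re

# Level 1: mark simple building blocks (a place name followed by 队 "team").
LEVEL1 = [(re.compile(r"(\S+)/NS\s+队/n"), r"<TEAM>\1队</TEAM>")]
# Level 2: combine level-1 groups into a larger entity (a match pairing).
LEVEL2 = [(re.compile(r"<TEAM>(\S+)</TEAM>\s+对/p\s+<TEAM>(\S+)</TEAM>"),
           r"<MATCH>\1 对 \2</MATCH>")]


def run_cascade(text, levels=(LEVEL1, LEVEL2)):
    """Apply each level's rewrite rules to the previous level's output."""
    for level in levels:
        for pattern, repl in level:
            text = pattern.sub(repl, text)
    return text


print(run_cascade("上海/NS 队/n 对/p 北京/NS 队/n"))
# <MATCH>上海队 对 北京队</MATCH>
```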
Citations: 13
The First International Chinese Word Segmentation Bakeoff
Pub Date: 2003-07-11 | DOI: 10.3115/1119250.1119269
R. Sproat, Thomas Emerson
This paper presents the results from the ACL-SIGHAN-sponsored First International Chinese Word Segmentation Bakeoff held in 2003 and reported in conjunction with the Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan. We give the motivation for having an international segmentation contest (given that there have been two within-China contests to date) and we report on the results of this first international contest, analyze these results, and make some recommendations for the future.
Citations: 236
Chinese Word Segmentation as LMR Tagging
Pub Date: 2003-07-11 | DOI: 10.3115/1119250.1119278
Nianwen Xue, Libin Shen
In this paper we present Chinese word segmentation algorithms based on the so-called LMR tagging. Our LMR taggers are implemented with the Maximum Entropy Markov Model and we then use Transformation-Based Learning to combine the results of the two LMR taggers that scan the input in opposite directions. Our system achieves F-scores of 95.9% and 91.6% on the Academia Sinica corpus and the Hong Kong City University corpus respectively.
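Segmentation-as-tagging reduces to mapping between words and per-character position tags. The sketch below uses a common variant of such a tagset (L/M/R plus S for single-character words), which may differ from the paper's exact inventory; the MEMM tagger itself is omitted.

```python
# Converting segmented words into per-character position tags and back.
# Tagset: L = word-initial, M = word-internal, R = word-final,
# S = single-character word (a common variant, not necessarily the paper's).
def words_to_tags(words):
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["L"] + ["M"] * (len(w) - 2) + ["R"])
    return tags


def tags_to_words(chars, tags):
    words, cur = [], ""
    for ch, t in zip(chars, tags):
        cur += ch
        if t in ("S", "R"):        # a word ends here
            words.append(cur)
            cur = ""
    if cur:                         # tolerate a dangling L/M at the end
        words.append(cur)
    return words


tags = words_to_tags(["中文", "分词", "系统"])
print(tags)                                   # ['L', 'R', 'L', 'R', 'L', 'R']
print(tags_to_words("中文分词系统", tags))     # ['中文', '分词', '系统']
```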
Citations: 151