Workshop on Chinese Language Processing最新文献

英文中文

Introduction to CKIP Chinese Word Segmentation System for the First International Chinese Word Segmentation Bakeoff 第一届国际汉语分词大赛CKIP中文分词系统介绍

Workshop on Chinese Language Processing

Pub Date : 2003-07-11 DOI: 10.3115/1119250.1119276

Wei-Yun Ma, Keh-Jiann Chen

In this paper, we roughly described the procedures of our segmentation system, including the methods for resolving segmentation ambiguities and identifying unknown words. The CKIP group of Academia Sinica participated in testing on open and closed tracks of Beijing University (PK) and Hong Kong Cityu (HK). The evaluation results show our system performs very well in either HK open track or HK closed track and just acceptable in PK tracks. Some explanations and analysis are presented in this paper.

在本文中，我们大致描述了我们的分词系统的过程，包括分词歧义的解决方法和未知词的识别方法。中央研究院CKIP小组参与了北京大学(PK)和香港城市大学(HK)的开放和封闭轨道测试。评估结果表明，我们的系统在HK开放赛道和HK封闭赛道上都表现良好，在PK赛道上也可以接受。本文对此作了一些解释和分析。

引用次数: 172

News-Oriented Automatic Chinese Keyword Indexing 面向新闻的中文关键词自动索引

Workshop on Chinese Language Processing

Pub Date : 2003-07-11 DOI: 10.3115/1119250.1119263

Sujian Li, Houfeng Wang, Shiwen Yu, Chengsheng Xin

In our information era, keywords are very useful to information retrieval, text clustering and so on. News is always a domain attracting a large amount of attention. However, the majority of news articles come without keywords, and indexing them manually costs highly. Aiming at news articles' characteristics and the resources available, this paper introduces a simple procedure to index keywords based on the scoring system. In the process of indexing, we make use of some relatively mature linguistic techniques and tools to filter those meaningless candidate items. Furthermore, according to the hierarchical relations of content words, keywords are not restricted to extracting from text. These methods have improved our system a lot. At last experimental results are given and analyzed, showing that the quality of extracted keywords are satisfying.

在信息时代，关键词在信息检索、文本聚类等方面发挥着重要作用。新闻总是一个吸引大量注意力的领域。然而，大多数新闻文章都没有关键字，手动索引它们的成本很高。针对新闻文章的特点和可利用的资源，介绍了一种基于评分系统的关键词索引的简单程序。在标引过程中，我们利用一些相对成熟的语言技术和工具，过滤掉那些没有意义的候选项。此外，根据内容词的层次关系，关键词不局限于从文本中提取。这些方法大大改善了我们的系统。最后给出了实验结果并进行了分析，结果表明所提取的关键词质量令人满意。

引用次数: 9

Single Character Chinese Named Entity Recognition 单字中文命名实体识别

Workshop on Chinese Language Processing

Pub Date : 2003-07-11 DOI: 10.3115/1119250.1119268

Xiao-Dan Zhu, Mu Li, Jianfeng Gao, C. Huang

Single character named entity (SCNE) is a name entity (NE) composed of one Chinese character, such as "[Abstract contained text which could not be captured.]" (zhong1, China) and "[Abstract contained text which could not be captured.]" (e2, Russia). SCNE is very common in written Chinese text. However, due to the lack of in-depth research, SCNE is a major source of errors in named entity recognition (NER). This paper formulates the SCNE recognition within the source-channel model framework. Our experiments show very encouraging results: an F-score of 81.01% for single character location name recognition, and an F-score of 68.02% for single character person name recognition. An alternative view of the SCNE recognition problem is to formulate it as a classification task. We construct two classifiers based on maximum entropy model (ME) and vector space model (VSM), respectively. We compare all proposed approaches, showing that the source-channel model performs the best in most cases.

单字命名实体(Single character named entity，简称SCNE)是由一个中文字符组成的名称实体(name entity，简称NE)，如“[摘要]”，其中包含无法被捕获的文本。(中国，zhong1)和“[摘要包含无法捕获的文本。(2，俄罗斯)。SCNE在书面语中很常见。然而，由于缺乏深入的研究，SCNE是命名实体识别(NER)的一个主要错误来源。本文在信源-信道模型框架下建立了声源识别模型。我们的实验显示了非常令人鼓舞的结果:单字符位置名称识别的f值为81.01%，单字符人名识别的f值为68.02%。SCNE识别问题的另一种观点是将其表述为分类任务。我们分别基于最大熵模型(ME)和向量空间模型(VSM)构建了两个分类器。我们比较了所有提出的方法，表明源信道模型在大多数情况下表现最好。

引用次数: 11

SYSTRAN's Chinese Word Segmentation systeman的中文分词

Workshop on Chinese Language Processing

Pub Date : 2003-07-11 DOI: 10.3115/1119250.1119279

Jin Yang, Jean Senellart, R. Zajac

SYSTRAN's Chinese word segmentation is one important component of its Chinese-English machine translation system. The Chinese word segmentation module uses a rule-based approach, based on a large dictionary and fine-grained linguistic rules. It works on general-purpose texts from different Chinese-speaking regions, with comparable performance. SYSTRAN participated in the four open tracks in the First International Chinese Word Segmentation Bakeoff. This paper gives a general description of the segmentation module, as well as the results and analysis of its performance in the Bakeoff.

SYSTRAN的中文分词是其汉英机器翻译系统的重要组成部分。中文分词模块采用基于规则的方法，基于大型词典和细粒度的语言规则。它适用于来自不同汉语地区的通用文本，性能相当。SYSTRAN参加了首届国际汉语分词大赛的四场公开赛。本文给出了分割模块的总体描述，并对其在Bakeoff中的性能进行了结果和分析。

引用次数: 10

HHMM-based Chinese Lexical Analyzer ICTCLAS 基于hmm的汉语词法分析器ICTCLAS

Workshop on Chinese Language Processing

Pub Date : 2003-07-11 DOI: 10.3115/1119250.1119280

Huaping Zhang, Hongkui Yu, Deyi Xiong, Qun Liu

This document presents the results from Inst. of Computing Tech., CAS in the ACL SIGHAN-sponsored First International Chinese Word Segmentation Bake-off. The authors introduce the unified HHMM-based frame of our Chinese lexical analyzer ICTCLAS and explain the operation of the six tracks. Then provide the evaluation results and give more analysis. Evaluation on ICTCLAS shows that its performance is competitive. Compared with other system, ICTCLAS has ranked top both in CTB and PK closed track. In PK open track, it ranks second position. ICTCLAS BIG5 version was transformed from GB version only in two days; however, it achieved well in two BIG5 closed tracks. Through the first bakeoff, we could learn more about the development in Chinese word segmentation and become more confident on our HHMM-based approach. At the same time, we really find our problems during the evaluation. The bakeoff is interesting and helpful.

本文介绍了中国科学院计算机技术研究所在ACL sighan主办的首届国际汉语分词大赛中的结果。作者介绍了我们的汉语词法分析器ICTCLAS基于hmm的统一框架，并说明了六个轨道的运行情况。然后给出评价结果并进一步分析。对ICTCLAS的评价表明其性能具有一定的竞争力。与其他系统相比，ICTCLAS在CTB和PK封闭式赛道上均名列前茅。在PK公开赛中，排名第二。ICTCLAS BIG5版本仅用2天时间从GB版本转换完成;然而，它在两个BIG5封闭轨道上取得了很好的成绩。通过第一次的测试，我们可以了解更多关于汉语分词的发展，并对我们基于hmm的分词方法更有信心。同时，我们也在评估中真正发现了自己的问题。烘焙比赛很有趣，也很有帮助。

引用次数: 493

Combining Segmenter and Chunker for Chinese Word Segmentation 分词器与分块器相结合的中文分词方法

Workshop on Chinese Language Processing

Pub Date : 2003-07-11 DOI: 10.3115/1119250.1119270

Masayuki Asahara, Chooi-Ling Goh, Xiaojie Wang, Yuji Matsumoto

Our proposed method is to use a Hidden Markov Model-based word segmenter and a Support Vector Machine-based chunker for Chinese word segmentation. Firstly, input sentences are analyzed by the Hidden Markov Model-based word segmenter. The word segmenter produces n-best word candidates together with some class information and confidence measures. Secondly, the extracted words are broken into character units and each character is annotated with the possible word class and the position in the word, which are then used as the features for the chunker. Finally, the Support Vector Machine-based chunker brings character units together into words so as to determine the word boundaries.

我们提出的方法是使用基于隐马尔可夫模型的分词器和基于支持向量机的分词器进行中文分词。首先，使用基于隐马尔可夫模型的分词器对输入句子进行分析。分词器产生n个最佳候选词以及一些类信息和置信度度量。其次，将提取的单词分解为字符单元，并对每个字符进行注释，并标注可能的词类和在单词中的位置，然后将其用作分块器的特征。最后，基于支持向量机的分块器将字符单元组合成单词，从而确定单词边界。

引用次数: 24

Chinese Lexical Analysis Using Hierarchical Hidden Markov Model 基于层次隐马尔科夫模型的汉语词法分析

Workshop on Chinese Language Processing

Pub Date : 2003-07-11 DOI: 10.3115/1119250.1119259

Huaping Zhang, Qun Liu, Xueqi Cheng, H. Zhang, Hongkui Yu

This paper presents a unified approach for Chinese lexical analysis using hierarchical hidden Markov model (HHMM), which aims to incorporate Chinese word segmentation, Part-Of-Speech tagging, disambiguation and unknown words recognition into a whole theoretical frame. A class-based HMM is applied in word segmentation, and in this level unknown words are treated in the same way as common words listed in the lexicon. Unknown words are recognized with reliability in role-based HMM. As for disambiguation, the authors bring forth an n-shortest-path strategy that, in the early stage, reserves top N segmentation results as candidates and covers more ambiguity. Various experiments show that each level in HHMM contributes to lexical analysis. An HHMM-based system ICTCLAS was accomplished. The recent official evaluation indicates that ICTCLAS is one of the best Chinese lexical analyzers. In a word, HHMM is effective to Chinese lexical analysis.

本文提出了一种基于层次隐马尔可夫模型(HHMM)的汉语词法分析方法，该方法将汉语分词、词性标注、消歧和未知词识别整合到一个完整的理论框架中。在分词中应用了基于类的HMM，在这一层中，未知词与词典中列出的常用词一样被处理。在基于角色的HMM中，未知词的识别具有可靠性。在消歧义方面，作者提出了一种N -最短路径策略，在早期阶段，保留前N个分割结果作为候选，并覆盖更多的歧义。各种实验表明，HHMM的每个层次都有助于词法分析。实现了基于hmm的ICTCLAS系统。最近的官方评价表明，ICTCLAS是最好的汉语词汇分析工具之一。总之，hmm对汉语词汇分析是有效的。

引用次数: 139

A Bottom-up Merging Algorithm for Chinese Unknown Word Extraction 中文未知词提取的自底向上合并算法

Workshop on Chinese Language Processing

Pub Date : 2003-07-11 DOI: 10.3115/1119250.1119255

Wei-Yun Ma, Keh-Jiann Chen

Statistical methods for extracting Chinese unknown words usually suffer a problem that superfluous character strings with strong statistical associations are extracted as well. To solve this problem, this paper proposes to use a set of general morphological rules to broaden the coverage and on the other hand, the rules are appended with different linguistic and statistical constraints to increase the precision of the representation. To disambiguate rule applications and reduce the complexity of the rule matching, a bottom-up merging algorithm for extraction is proposed, which merges possible morphemes recursively by consulting above the general rules and dynamically decides which rule should be applied first according to the priorities of the rules. Effects of different priority strategies are compared in our experiment, and experimental results show that the performance of proposed method is very promising.

用统计方法提取汉语未识别词时，往往会提取出具有强统计关联的多余字符串。为了解决这一问题，本文提出使用一套通用的形态学规则来扩大覆盖范围，另一方面，在这些规则中附加不同的语言和统计约束，以提高表征的精度。为了消除规则应用的歧义，降低规则匹配的复杂性，提出了一种自下而上的词素提取合并算法，该算法通过对一般规则的查询，递归地合并可能的词素，并根据规则的优先级动态决定优先应用哪条规则。在实验中比较了不同优先级策略的效果，实验结果表明，本文提出的方法具有良好的性能。

引用次数: 74

Chinese Word Segmentation in MSR-NLP MSR-NLP中的中文分词

Workshop on Chinese Language Processing

Pub Date : 2003-07-11 DOI: 10.3115/1119250.1119277

Andi Wu

Word segmentation in MSR-NLP is an integral part of a sentence analyzer which includes basic segmentation, derivational morphology, named entity recognition, new word identification, word lattice pruning and parsing. The final segmentation is produced from the leaves of parse trees. The output can be customized to meet different segmentation standards through the value combinations of a set of parameters. The system participated in four tracks of the segmentation bakeoff -- PK-open, PK-close, CTB-open and CTB-closed - and ranked #1, #2, #2 and #3 respectively in those tracks. Analysis of the results shows that each component of the system contributed to the scores.

MSR-NLP中的分词是句子分析器的一个重要组成部分，包括基本分词、衍生词法、命名实体识别、新词识别、词格修剪和句法分析。最后的分割是由解析树的叶子产生的。通过一组参数的值组合，可以定制输出，以满足不同的分割标准。该系统参与了PK-open、PK-close、CTB-open和CTB-closed 4个赛道的分段比赛，在这些赛道上分别排名第一、第二、第二和第三。对结果的分析表明，系统的每个组成部分都对分数有贡献。

引用次数: 24

Building a Large Chinese Corpus Annotated with Semantic Dependency 基于语义依赖标注的大型汉语语料库的构建

Workshop on Chinese Language Processing

Pub Date : 2003-07-11 DOI: 10.3115/1119250.1119262

Mingqin Li, Juan-Zi Li, Zhendong Dong, Zuoying Wang, Dajin Lu

At present most of corpora are annotated mainly with syntactic knowledge. In this paper, we attempt to build a large corpus and annotate semantic knowledge with dependency grammar. We believe that words are the basic units of semantics, and the structure and meaning of a sentence consist mainly of a series of semantic dependencies between individual words. A 1,000,000-word-scale corpus annotated with semantic dependency has been built. Compared with syntactic knowledge, semantic knowledge is more difficult to annotate, for ambiguity problem is more serious. In the paper, the strategy to improve consistency is addressed, and congruence is defined to measure the consistency of tagged corpus.. Finally, we will compare our corpus with other well-known corpora.

目前大多数语料库的标注都是以句法知识为主。在本文中，我们尝试建立一个大型语料库，并用依存语法对语义知识进行标注。我们认为词是语义的基本单位，句子的结构和意义主要由单个词之间的一系列语义依赖组成。建立了一个带有语义依赖注释的100万字规模的语料库。与句法知识相比，语义知识的标注难度更大，歧义问题更严重。本文讨论了提高一致性的策略，并定义了一致性来衡量标记语料库的一致性。最后，我们将我们的语料库与其他知名语料库进行比较。

引用次数: 30

首页上一页

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Workshop on Chinese Language Processing

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀