
Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010): Latest Publications

Are we waves or are we particles? A new insight into deep semantics in natural language processing
Svetlana Machova, J. Klecková
This paper brings a conceptually new, empirically based scientific approach to a deeper understanding of human cognition, language acquisition, the modularity of language, and the origin of language itself. The research presents an interactive multilingual associative experiment that attempts to map the Cognitive Semantic Space of the Essential Self (CSSES) and its basic frames in Czech, and collects and compares it with the CSSES of the conceptual language view in Czech, Russian, English, and potentially other languages. We attempt to merge cognitive metaphor theory with psycholinguistics and psychoanalysis, applying associative-experiment methodology to Essential Self metaphors. The research has two main goals: the first is to build an Essential Self multilingual WordNet, serving as a basic lexical resource for Artificial Intelligence that describes the core of human nature. The second is to create a multilingual 3D semantic network.
Cited by: 2
Shui nationality characters stroke shape input method
Hanyue Yang, Xiaorong Chen
The shape of Shui nationality characters is similar to that of oracle bone script and bronze inscriptions (Jinwen). To address the problem of how to encode these hieroglyphic characters, a coding method based on stroke shape is proposed for Shui nationality characters. The shapes of the 467 Shui characters in the Common Shui Script Dictionary are analyzed, and seven basic strokes that make up most Shui characters are extracted. Through statistical comparison, the seven basic strokes are subdivided into 21 stroke shapes. A Shui character is encoded as an ordered sequence of three strokes taken from the corners of the character according to the coding rules. As a result, users who cannot read Shui characters can still input them easily and quickly.
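The corner-stroke coding rule lends itself to a small lookup sketch. The stroke-shape names and code symbols below are hypothetical placeholders (the paper's actual inventory of 21 shapes is not reproduced); only the idea of keying a character by an ordered sequence of three corner strokes is illustrated.

```python
# Hypothetical subset of the paper's 21 stroke shapes, each mapped to a
# placeholder code symbol; the real inventory and rules are in the paper.
STROKE_CODES = {
    "horizontal": "1",
    "vertical": "2",
    "left-falling": "3",
    "right-falling": "4",
    "dot": "5",
    "fold": "6",
    "hook": "7",
}

def encode_shui_character(corner_strokes):
    """Key a character by the ordered sequence of three strokes taken from
    its corners, as the coding rules prescribe."""
    if len(corner_strokes) != 3:
        raise ValueError("the coding rule expects exactly three corner strokes")
    return "".join(STROKE_CODES[s] for s in corner_strokes)

# A character whose corner strokes are horizontal, fold, dot keys as "165".
print(encode_shui_character(["horizontal", "fold", "dot"]))  # -> 165
```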
Cited by: 0
Chinese patent retrieval based on the pragmatic information
Liping Wu, Song Liu, F. Ren
In this paper, we propose a novel information retrieval approach for Chinese patents based on pragmatic information. Patent retrieval is becoming increasingly important: patents are an important resource in all kinds of fields, and effective patent retrieval saves corporations and researchers a great deal of time and money. However, with available methods the precision of patent retrieval results is not very high. Moreover, through analyzing patent documents we found that, beyond their literal meanings, patents carry deeper meanings that can be inferred from them; we call these deeper meanings pragmatic information. We therefore built a patent retrieval system that integrates pragmatic information with classical information retrieval techniques to improve retrieval accuracy. Experiments with the proposed method show that the precision of patent retrieval based on pragmatic information is higher than that of retrieval without it.
Cited by: 1
Part-of-speech tagging for Chinese unknown words in a domain-specific small corpus using morphological and contextual rules
Tao-Hsing Chang, Fu-Yuan Hsu, Chia-Hoang Lee, Hahn-Ming Lee
Many studies have tried to search for useful information on the Internet using meaningful terms or words. The performance of these approaches is often affected by the accuracy of unknown-word extraction and POS tagging, and that accuracy in turn depends on the size of the training corpora and the characteristics of the language. This work proposes and develops a method for tagging the POS of Chinese unknown words in a domain of interest, based on the integration of morphological rules, contextual rules, and a statistics-based method. Experimental results indicate that the proposed method can overcome the difficulties caused by small corpora in oriental languages and can accurately tag unknown words with POS in domain-specific small corpora.
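As a rough illustration of how morphological and contextual rules can combine with a statistical fallback, here is a minimal sketch; the rule tables and tags below are hypothetical stand-ins, not the paper's induced rules.

```python
# Hypothetical rule tables; the paper induces its rules from a domain corpus.
MORPH_RULES = {"性": "Na", "化": "VHC", "者": "Na"}        # final-character cues
CONTEXT_RULES = [("的", None, "Na"), (None, "了", "VH")]   # (prev, next, tag) cues

def guess_pos(word, prev_word=None, next_word=None, fallback="Na"):
    # 1) morphological rule: the final character often signals the category
    tag = MORPH_RULES.get(word[-1])
    if tag:
        return tag
    # 2) contextual rule: neighbouring function words constrain the category
    for prev, nxt, t in CONTEXT_RULES:
        if (prev is None or prev == prev_word) and (nxt is None or nxt == next_word):
            return t
    # 3) statistics-based fallback (a trained model in the paper; a default here)
    return fallback

print(guess_pos("可读性"))                # -> Na, by the morphological rule
print(guess_pos("奔跑", next_word="了"))  # -> VH, by a contextual rule
```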
Cited by: 1
Statistical parsing based on Maximal Noun Phrase pre-processing
Qiaoli Zhou, Yue Gu, Xin Liu, Wenjing Lang, Dongfeng Cai
According to the characteristics of the Chinese language, this paper proposes a statistical parsing method based on Maximal Noun Phrase (MNP) pre-processing, in which MNP parsing is separated from parsing of the full sentence. First, the MNPs in a sentence are identified; next, each MNP is represented by its head, and the sentence is parsed with the MNP heads in place. The original sentence is thus divided into two parts that can be parsed separately: the first part is MNP parsing; the second is parsing of the sentence in which the MNPs are replaced by their head words. Finally, the paper uses Conditional Random Fields (CRFs) as the statistical recognition model at each level of the syntactic parsing process.
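The divide-and-parse idea can be sketched as a small pipeline. The identify_mnp, head_of, and parse callables below are hypothetical stand-ins for the paper's CRF-based components.

```python
def parse_with_mnp_preprocessing(tokens, identify_mnp, head_of, parse):
    """Sketch of the two-part parse: identify MNPs, substitute their heads,
    then parse the skeleton sentence and each MNP separately."""
    # 1) identify Maximal Noun Phrases as (start, end) token spans
    spans = identify_mnp(tokens)
    # 2) replace each MNP by its head word (right-to-left keeps indices valid)
    skeleton = list(tokens)
    for start, end in sorted(spans, reverse=True):
        skeleton[start:end] = [head_of(tokens[start:end])]
    # 3) parse the simplified sentence and the MNPs independently
    skeleton_tree = parse(skeleton)
    mnp_trees = [parse(tokens[start:end]) for start, end in spans]
    return skeleton_tree, mnp_trees
```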
Cited by: 3
A reranking method for syntactic parsing with heterogeneous treebanks
Haibo Ding, Muhua Zhu, Jingbo Zhu
In the field of natural language processing (NLP), multiple corpora with different annotation standards often exist for the same task. In this paper, we take syntactic parsing as a case study and propose a reranking method that can make direct use of disparate treebanks simultaneously, without techniques such as treebank conversion. The method proceeds in three steps: 1) build parsers on the individual treebanks; 2) use the parsers independently to generate n-best lists for each sentence in the test set; 3) rerank the individual n-best lists corresponding to the same sentence using consensus information exchanged among these n-best lists. Experimental results on two open Chinese treebanks show that our method significantly outperforms the baseline system, by 0.84% and 0.53% respectively.
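Step 3 can be sketched directly: each parser's n-best list is reranked by agreement with the candidates the other parsers produced. The similarity function is a hypothetical stand-in for whatever consensus measure (e.g. constituent overlap) is used.

```python
from itertools import chain

def rerank_by_consensus(nbest_lists, similarity):
    """For each parser's n-best list, pick the parse that agrees most with
    the candidates produced by the other parsers for the same sentence."""
    reranked = []
    for i, nbest in enumerate(nbest_lists):
        # pool the candidates from every other treebank-specific parser
        others = list(chain.from_iterable(
            lst for j, lst in enumerate(nbest_lists) if j != i))
        # keep the candidate with the highest total consensus score
        reranked.append(max(
            nbest, key=lambda parse: sum(similarity(parse, q) for q in others)))
    return reranked
```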
Cited by: 0
Flexible English writing support based on negative-positive conversion method
Yasushi Katsura, Kazuyuki Matsumoto, F. Ren
With recent globalization, opportunities to communicate in English have increased in the business field; in particular, it is often necessary to write theses and official documents in English. Because many Japanese speakers are not used to constructing English sentences, writing appropriate English without any support is a great burden. In this study we have developed an English composition support system. The system searches a database for an interlinear translation example to refer to and generates a new sentence by replacing a noun in the example sentence. In this paper, based on the Super-Function technique, we propose a method to convert an affirmative sentence into a negative sentence and vice versa, to realize more flexible and extensive text conversion.
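The affirmative/negative conversion can be hinted at with a toy auxiliary-based toggle; the paper's Super-Function approach pairs whole sentence patterns, which this sketch only approximates.

```python
# Toy polarity toggle: insert or drop "not" after the first auxiliary verb.
AUXILIARIES = {"is", "are", "was", "were", "can", "will", "should", "must"}

def toggle_polarity(sentence):
    words = sentence.split()
    out, done = [], False
    for i, w in enumerate(words):
        if not done and w.lower() in AUXILIARIES:
            if i + 1 < len(words) and words[i + 1] == "not":
                out.append(w)           # negative -> affirmative: drop "not"
                words[i + 1] = ""
            else:
                out.extend([w, "not"])  # affirmative -> negative: insert "not"
            done = True
        elif w:
            out.append(w)
    return " ".join(out)

print(toggle_polarity("This method is flexible"))      # This method is not flexible
print(toggle_polarity("This method is not flexible"))  # This method is flexible
```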
Cited by: 1
Context-based term identification and extraction for ontology construction
Hui-Ngo Goh, Ching Kiu
Ontology construction often requires a domain-specific corpus for conceptualizing the domain knowledge; specifically, an ontology is an association of terms, the relations between terms, and related instances. Identifying a list of significant terms is a vital task in constructing a practical ontology. In this paper, we present a context-based term identification and extraction methodology for ontology construction from text documents. The methodology uses a taxonomy and Wikipedia to support automatic term identification and extraction from structured documents, under the assumption that candidate terms for a topic are often associated with its topic-specific keywords. The taxonomy defines a hierarchical relationship of super-topics and sub-topics, while Wikipedia provides context and background knowledge for the topics defined in the taxonomy to guide term identification and extraction. The experimental results show that the context-based term identification and extraction methodology is viable for defining topic concepts and their sub-concepts when constructing an ontology, and that it can be applied in a small corpus / text-size environment to support ontology construction.
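A minimal sketch of the association assumption: a candidate term is kept when its context shares enough topic-specific keywords (drawn, in the paper, from the topic's Wikipedia article). The scoring function and threshold here are illustrative assumptions, not the paper's measure.

```python
def keyword_overlap(context, topic_keywords):
    """Count how many topic-specific keywords occur in a candidate's context."""
    words = set(context.lower().split())
    return len(words & {k.lower() for k in topic_keywords})

def identify_terms(candidates, topic_keywords, threshold=2):
    # candidates: iterable of (term, surrounding_context) pairs
    return [term for term, context in candidates
            if keyword_overlap(context, topic_keywords) >= threshold]

keywords = ["parsing", "grammar", "treebank"]   # would come from Wikipedia
candidates = [("CRF", "a model often used for parsing with treebank data"),
              ("apple", "a fruit grown in orchards")]
print(identify_terms(candidates, keywords))  # -> ['CRF']
```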
Cited by: 2
A pragmatic model for new Chinese word extraction
Haijun Zhang, Heyan Huang, Chao-Yong Zhu, Shumin Shi
This paper proposes a pragmatic model for repeat-based Chinese New Word Extraction (NWE). It contains two innovations. The first is a formal description of the NWE process, which guides feature selection in theory; on this basis, the Conditional Random Fields (CRF) model is selected as the statistical framework realizing the formal description. The second is an improved algorithm for left (right) entropy that improves the efficiency of NWE; compared with the baseline algorithm, the improved algorithm speeds up the entropy computation remarkably. Overall, experiments show that the proposed model is very effective, with F-scores of 49.72% in the open test and 69.83% in word extraction, an evident improvement over previous similar work.
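Left (right) entropy itself is a standard quantity: the entropy of the character distribution immediately to the left (right) of a candidate string, where high entropy on both sides suggests a free-standing word. A minimal, unoptimized sketch follows; the paper's contribution is a faster algorithm for this computation, which is not reproduced here.

```python
import math
from collections import Counter

def branching_entropy(text, candidate, side="left"):
    """Entropy of the characters adjacent to every occurrence of candidate."""
    neighbours = Counter()
    pos = text.find(candidate)
    while pos != -1:
        i = pos - 1 if side == "left" else pos + len(candidate)
        if 0 <= i < len(text):
            neighbours[text[i]] += 1
        pos = text.find(candidate, pos + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbours.values())

text = "我喜欢苹果,他喜欢香蕉,她喜欢苹果汁"
print(branching_entropy(text, "喜欢", side="left"))   # 我/他/她 -> log2(3)
print(branching_entropy(text, "喜欢", side="right"))  # 苹/香/苹 -> lower
```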
Cited by: 7
Bagging to find better expansion words
Bingqing Wang, Yaqian Zhou, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Supervised learning has been applied to query expansion, training a model to predict the "goodness" or "utility" of an expanded term to the retrieval system. Many features measure the relatedness between an expanded word and the query, and these can be incorporated in supervised learning to select the expanded terms. The training data set is generated automatically by a tricky method; however, this method can be affected by many factors. A severe problem is that the distribution of the features is query-dependent, which has not been discussed in previous work. With different distributions over the features, it is questionable to merge these training instances together and train one single model on the whole data set. In this paper, we first investigate the statistical distribution of the auto-generated training data and show the problems in the training data set. Based on our analysis, we propose using the bagging method to ensemble several regression models in order to get a better supervised model for predicting expansion terms. We conducted experiments on the TREC benchmark test collections. Our analysis of the training data reveals some interesting phenomena about query expansion techniques, and the experimental results show that the bagging approach achieves state-of-the-art retrieval performance on the standard TREC data set.
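A minimal sketch of the bagging step using scikit-learn, under placeholder data: X would hold the term-relatedness features and y the automatically generated utility scores; the base learner and ensemble size are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 5))   # placeholder relatedness features for expansion terms
y = rng.random(200)        # placeholder auto-generated "goodness" scores

# Bagging trains each regressor on a bootstrap sample and averages predictions,
# which damps the query-dependent noise a single model would absorb.
model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=0)
model.fit(X, y)
utility = model.predict(X[:10])  # rank candidate terms; keep the top-scoring ones
```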
Cited by: 1