
Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010): Latest Publications

Event-event relation identification: A CRF based approach
A. Kolya, Asif Ekbal, Sivaji Bandyopadhyay
Temporal information extraction is a popular and interesting research field in Natural Language Processing (NLP). The main tasks involve the identification of event-time, event-document creation time and event-event relations in a text. In this paper, we take up Task C, which involves identifying relations between events in adjacent sentences under the TimeML framework. We use a supervised machine learning technique, namely Conditional Random Fields (CRF). Initially, a baseline system is developed by assigning the most frequent temporal relation in the task's training data. For the CRF, we consider only those features that are already available in the TempEval-2007 training set. Evaluation results on the Task C test set yield precision, recall and F-score values of 55.1%, 55.1% and 55.1%, respectively, under the strict evaluation scheme, and 56.9%, 56.9% and 56.9%, respectively, under the relaxed evaluation scheme. Results also show that the proposed system performs better than the baseline system.
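The baseline mentioned above simply assigns the single most frequent training-set relation to every event pair. A minimal Python sketch of such a majority-class baseline (the relation labels and toy data are illustrative assumptions, not the actual TempEval-2007 distribution):

```python
from collections import Counter

def train_majority_baseline(train_labels):
    """Return the most frequent temporal relation in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

def predict_baseline(event_pairs, majority_label):
    """Assign the majority relation to every event-event pair."""
    return [majority_label for _ in event_pairs]

# Hypothetical TimeML-style relation labels for illustration only.
train_labels = ["BEFORE", "AFTER", "OVERLAP", "BEFORE", "BEFORE"]
majority = train_majority_baseline(train_labels)   # -> "BEFORE"
print(predict_baseline([("e1", "e2"), ("e3", "e4")], majority))
```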
Citations: 8
iSentenizer: An incremental sentence boundary classifier
F. Wong, S. Chao
In this paper, we revisit the topic of sentence boundary detection and propose an incremental approach to tackle the problem. The boundary classifier is revised on the fly to adapt to text from a wide variety of sources and genres. We apply i+Learning, an incremental algorithm, to construct the sentence boundary detection model using different features based on local context. Although the model can be easily trained on any genre of text and on any alphabetic language, we emphasize that the classifier is adaptable to text with domain and topic shifts without retraining the whole model from scratch. Empirical results indicate that the performance of the proposed system is comparable to that of similar systems.
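As an illustration of classifying boundary candidates from local context, the sketch below extracts a few plausible features around each candidate punctuation mark. The feature set is a guess for illustration only, not the authors' i+Learning feature template:

```python
def boundary_features(text, i):
    """Local-context features for a boundary candidate at position i,
    where text[i] is '.', '?' or '!'."""
    left = text[:i].split()
    prev_tok = left[-1] if left else ""
    next_char = text[i + 1] if i + 1 < len(text) else ""
    return {
        "punct": text[i],
        "prev_is_short_abbrev_like": 0 < len(prev_tok.rstrip(".")) <= 2,  # e.g. "Dr"
        "prev_is_capitalized": prev_tok[:1].isupper(),
        "next_is_space_then_upper": next_char == " " and text[i + 2:i + 3].isupper(),
        "next_is_digit": next_char.isdigit(),
    }

text = "Dr. Smith arrived. He sat down."
for i, ch in enumerate(text):
    if ch in ".?!":  # candidate boundaries
        print(i, boundary_features(text, i))
```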
Citations: 9
The impact of parsing accuracy on syntax-based SMT
Haotong Zhang, Huizhen Wang, Tong Xiao, Jingbo Zhu
In statistical machine translation (SMT), syntax-based models generally rely on the syntactic information provided by syntactic parsers in the source language, the target language, or both. However, whether or how parsers impact the performance of syntax-based systems is still an open issue in the MT field. In this paper, we make an attempt to explore answers to this question and empirically investigate the impact of parsing accuracy on MT performance in a state-of-the-art syntax-based system. Our study shows that a syntax-based system is not very sensitive to the parsing accuracy of the parsers used in building MT systems.
Citations: 3
Detecting duplicates with shallow and parser-based methods
Sven Hartrumpf, Tim vor der Brück, Christian Eichhorn
Identifying duplicate texts is important in many areas like plagiarism detection, information retrieval, text summarization, and question answering. Current approaches are mostly surface-oriented (or use only shallow syntactic representations) and see each text only as a token list. In this work, however, we describe a deep, semantically oriented method based on semantic networks which are derived by a syntactico-semantic parser. Semantically identical or similar semantic networks for each sentence of a given base text are efficiently retrieved by using a specialized semantic network index. In order to detect many kinds of paraphrases, the current base semantic network is varied by applying inferences: lexico-semantic relations, relation axioms, and meaning postulates. Some important phenomena occurring in difficult-to-detect duplicates are discussed. The deep approach profits from background knowledge, whose acquisition from corpora like Wikipedia is explained briefly. This deep duplicate recognizer is combined with two shallow duplicate recognizers in order to guarantee high recall for texts which are not fully parsable. The evaluation shows that the combined approach preserves recall and increases precision considerably in comparison to traditional shallow methods. For the evaluation, a standard corpus of German plagiarisms was extended by four diverse components with an emphasis on duplicates (and not just plagiarisms), e.g., news feed articles from different web sources and two translations of the same short story.
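For contrast with the deep, semantic-network method, a shallow, surface-oriented recognizer of the kind combined into the system can be sketched as token-set overlap. The Jaccard measure and the threshold below are illustrative assumptions, not the paper's actual shallow components:

```python
def jaccard(tokens_a, tokens_b):
    """Surface similarity: overlap of the two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def shallow_duplicate(text_a, text_b, threshold=0.8):
    """Flag a pair as a duplicate when the token overlap is high.
    The 0.8 threshold is an arbitrary illustration."""
    return jaccard(text_a.lower().split(), text_b.lower().split()) >= threshold

print(shallow_duplicate("the cat sat on the mat",
                        "the cat sat on the mat today"))  # -> True (5/6 ≈ 0.83)
```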
Citations: 4
A method of mining bilingual resources from Web Based on Maximum Frequent Sequential Pattern
Guiping Zhang, Yang Luo, D. Ji
Bilingual resources are indispensable and vital in NLP fields such as machine translation. A large amount of electronic information is embedded in the Internet, which can serve as a potential source of a large-scale multi-language corpus, so mining a great quantity of genuine bilingual resources from the Web is a potential and feasible approach. This paper proposes a method of mining bilingual resources from the Web based on the Maximum Frequent Sequential Pattern. The method uses a heuristic approach to search for and filter candidate bilingual web pages, then mines patterns using maximum frequent sequential pattern mining, and uses a machine learning method to extend the pattern base and verify bilingual resources according to the Japanese-to-Chinese word proportion. The experimental results indicate that the method can extract bilingual resources efficiently, with a precision rate over 90%.
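The verification step relies on the Japanese-to-Chinese word proportion of a candidate resource. As a crude character-level proxy for that idea (kana occur only in Japanese text, while CJK ideographs are shared by Japanese and Chinese), one might compute a ratio like the sketch below; the paper's actual verification features are assumed to be richer:

```python
def kana_han_ratio(text):
    """Rough character-level proxy for the Japanese/Chinese proportion:
    counts kana (hiragana + katakana) against CJK ideographs."""
    kana = sum(1 for c in text if "\u3040" <= c <= "\u30ff")
    han = sum(1 for c in text if "\u4e00" <= c <= "\u9fff")
    return kana / han if han else float("inf")

print(kana_han_ratio("これは図書館です"))  # Japanese: kana present, ratio > 0
print(kana_han_ratio("这是图书馆"))        # Chinese: no kana, ratio = 0.0
```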
Citations: 0
A novel Chinese-English on translation method using mix-language web pages
Feiliang Ren, Jingbo Zhu, Huizhen Wang
In this paper, we propose a novel Chinese-English organization name translation method with the assistance of mixed-language web resources. First, all the implicit out-of-vocabulary terms in the input Chinese organization name are recognized by a CRF model. Then the input Chinese organization name is translated without considering these recognized out-of-vocabulary terms. Second, we construct efficient queries to find mixed-language web pages that contain both the original input organization name and its correct translation. Finally, a translation identification approach based on similarity matching and limited expansion is proposed to identify the correct translation from the returned web pages. Experimental results show that our method is effective for Chinese organization name translation and can improve its performance significantly.
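The identification step scores candidate strings from the returned pages against a rough translation of the input name. The sketch below uses a stdlib string similarity as a stand-in scoring function; the paper's actual similarity measure and its limited-expansion step are not specified in the abstract:

```python
from difflib import SequenceMatcher

def similarity(candidate, rough_translation):
    """Character-level similarity between a candidate English string
    found on a returned page and a rough translation of the input."""
    return SequenceMatcher(None, candidate.lower(),
                           rough_translation.lower()).ratio()

def best_candidate(candidates, rough_translation):
    """Pick the page string most similar to the rough translation."""
    return max(candidates, key=lambda c: similarity(c, rough_translation))

# Hypothetical candidates scraped from a mixed-language page.
print(best_candidate(["Northeastern University", "Contact us",
                      "Institute of Computing"],
                     "north eastern university"))
```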
Citations: 0
Optimizations for item-based Collaborative Filtering algorithm
Shuang Xia, Yang Zhao, Yong Zhang, Chunxiao Xing, Scott Roepnack, Shihong Huang
Collaborative Filtering (CF) is widely used on the Internet by recommender systems to find items that fit users' interests by exploiting the opinions users express on other items. However, CF algorithms face two challenges: recommendation accuracy and data sparsity. In this paper, we address the accuracy problem with a deviation adjustment approach in item-based CF. Its main idea is to add a constant value to every prediction for each user or each item, so as to correct the uniform error between the predictions and the actual ratings of one user or one item. Our deviation adjustment approach can also be used in other kinds of CF algorithms. For data sparsity, we improve similarity computation by filling some blank ratings with a user's average rating, which helps decrease the sparsity of the data. We run experiments on the MovieLens data set with our optimizations of similarity computation and deviation adjustment. The results show that these methods generate better predictions than the baseline CF algorithm.
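The deviation adjustment can be made concrete: compute each user's mean signed error on the known training ratings and add that constant to all of the user's predictions (the per-item variant averages over items instead). A minimal NumPy sketch under the assumption of dense user-item matrices:

```python
import numpy as np

def adjusted_predictions(ratings, preds):
    """ratings: user x item matrix with np.nan for missing entries.
    preds: baseline item-based CF predictions of the same shape.
    Adds each user's mean signed training error to all of that user's
    predictions (per-user deviation adjustment)."""
    known = ~np.isnan(ratings)
    errors = np.where(known, ratings - preds, 0.0)
    counts = known.sum(axis=1, keepdims=True)
    user_dev = errors.sum(axis=1, keepdims=True) / np.maximum(counts, 1)
    return preds + user_dev

# Toy example: 2 users x 3 items.
ratings = np.array([[4.0, np.nan, 5.0],
                    [2.0, 3.0, np.nan]])
preds = np.array([[3.5, 3.0, 4.5],
                  [2.5, 3.5, 3.0]])
print(adjusted_predictions(ratings, preds))
# User 0 is under-predicted by 0.5 on average, user 1 over-predicted by 0.5,
# so +0.5 and -0.5 are applied to their rows respectively.
```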
Citations: 8
A morphology-based Chinese word segmentation method
Xiaojun Lin, Liang Zhao, Meng Zhang, Xihong Wu
This paper proposes a novel method of Chinese word segmentation that utilizes morphology information. The method introduces morphology into a statistical model to capture structural relationships within words, improving the ability of conventional Conditional Random Fields (CRFs) models to represent structural information. First, a word-segmented Chinese corpus is annotated with morphology tags by a semi-automatic method, and the resulting structure-related tags are integrated into the CRFs model. Second, a joint CRFs model is trained, which generates both morphology tags and word boundaries. Experiments carried out on several SIGHAN Bakeoff corpora show that the morphology information can improve the performance of Chinese word segmentation significantly, especially for the segmentation of out-of-vocabulary words.
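A common way to realize a joint model of this kind is to cross the boundary tags with the morphology tags into single output labels, so that one linear-chain CRF predicts both at once. A sketch of that label encoding (the morphology tag inventory here is invented for illustration; the paper's semi-automatically annotated tags will differ):

```python
def to_joint_labels(boundary_tags, morph_tags):
    """Fuse a word-boundary tag (B = word-initial, I = word-internal)
    with a morphology tag into one joint label per character."""
    return [f"{b}-{m}" for b, m in zip(boundary_tags, morph_tags)]

# "图书馆员" (librarian) = 图书馆 (library) + the agent suffix 员.
boundaries = ["B", "I", "I", "I"]            # one word of four characters
morphs = ["ROOT", "ROOT", "ROOT", "SUFFIX"]  # hypothetical morphology tags
print(to_joint_labels(boundaries, morphs))
# -> ['B-ROOT', 'I-ROOT', 'I-ROOT', 'I-SUFFIX']
```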
Citations: 1
Feature selection for Chinese Text Categorization based on improved particle swarm optimization
Yaohong Jin, Wen Xiong, Cong Wang
Feature selection is an important preprocessing step in Chinese Text Categorization: it reduces the high dimensionality while, unlike feature extraction, keeping the reduced results comprehensible. A novel criterion for coarse feature filtering is proposed, which integrates the strengths of term frequency-inverse document frequency as an intra-class measure and chi-square as an inter-class measure. A new feature selection method for Chinese text categorization based on swarm intelligence is then presented: improved particle swarm optimization performs fine-grained selection on the results of the coarse filtering, and a support vector machine evaluates the candidate feature subsets, with the evaluations serving as the particles' fitness. Experiments on the Fudan University Chinese Text Classification Corpus show that the new filtering criterion yields higher classification accuracy and that the novel feature selection method attains an effective feature reduction ratio for Chinese text categorization.
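The coarse filtering criterion combines TF-IDF as the intra-class measure with chi-square as the inter-class measure. The abstract does not state the combination function, so the sketch below simply multiplies the two scores as one plausible instantiation:

```python
import math

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/class contingency table:
    n11 = in-class docs containing the term, n10 = in-class docs without it,
    n01 = out-of-class docs with the term, n00 = out-of-class docs without it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0

def tfidf(tf, df, n_docs):
    """Term frequency weighted by (smoothed) inverse document frequency."""
    return tf * math.log(n_docs / (1 + df))

def coarse_filter_score(tf, df, n_docs, n11, n10, n01, n00):
    """Illustrative combination of the intra-class (TF-IDF) and
    inter-class (chi-square) measures; the paper's actual
    combination function may differ."""
    return tfidf(tf, df, n_docs) * chi_square(n11, n10, n01, n00)

print(coarse_filter_score(tf=12, df=30, n_docs=1000,
                          n11=25, n10=75, n01=5, n00=895))
```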
Citations: 20
Boosting performance of gene mention tagging system by classifiers ensemble
Lishuang Li, Jing Sun, Degen Huang
To further improve the tagging performance of single classifiers, a classifier ensemble experimental framework is presented for gene mention tagging. In the framework, six classifiers are constructed with four toolkits (CRF++, YamCha, Maximum Entropy (ME) and MALLET) using different training methods and feature sets, and are then combined with a two-layer stacking algorithm. The recognition results of the different classifiers are treated as input feature vectors to be incorporated, yielding a high-performing model. Experiments carried out on the corpus of the BioCreative II GM task show that the classifier ensemble method is effective, and our best combination achieves an F-score of 88.09%, which outperforms most of the top-ranked Bio-NER systems in the BioCreAtIvE II GM challenge.
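The two-layer stacking scheme treats the base classifiers' outputs as input features for a second-layer learner. A minimal scikit-learn sketch, with generic base learners and the iris data standing in for the four named toolkits and the gene mention corpus, which are not reproduced here:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def two_layer_stacking(base_models, X_train, y_train, X_test):
    """Layer 1: each base model contributes one feature column of
    integer-encoded class predictions; cross-validated predictions are
    used on the training side so the meta features do not leak labels.
    Layer 2: a meta classifier combines the columns."""
    meta_train = np.column_stack(
        [cross_val_predict(m, X_train, y_train, cv=5) for m in base_models])
    meta_test = np.column_stack(
        [m.fit(X_train, y_train).predict(X_test) for m in base_models])
    return LogisticRegression().fit(meta_train, y_train).predict(meta_test)

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print(two_layer_stacking([DecisionTreeClassifier(), GaussianNB()],
                         X_tr, y_tr, X_te))
```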
Citations: 5