
Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010): Latest Publications

Improving phrase-based SMT model with Flattened Bilingual Parse Tree
Dakun Zhang, Le Sun, Wenbo Li
Phrase order strongly influences translation quality. However, general phrase-based methods use only source-side information for phrase ordering. We instead propose a bilingual parse structure, the Flattened Bilingual Parse Tree (FBPT), to better describe the inner structure of bilingual sentences and thereby produce better translations. The main idea is to extract phrase pairs with orientation features with the help of the FBPT structure. Such features help maintain better sentence generation during translation. Furthermore, the FBPT structure can be learned automatically from a parallel corpus at low cost, without the need for complex linguistic parsing. Evaluation on the MT08 translation task indicates that a 7% relative improvement in BLEU can be achieved over a distortion-based method (such as Pharaoh).
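The abstract does not spell out how orientation features are attached to phrase pairs, so the following is only a minimal sketch of the generic idea: each extracted phrase pair is labeled by how its source span sits relative to the previously translated source span (monotone, swapped, or other). The span representation, the three-way labeling, and all names are illustrative assumptions, not the FBPT extraction procedure itself.

```python
# Hedged sketch: generic orientation labeling of phrase pairs.

def orientation(prev_src_span, cur_src_span):
    """Label the ordering of the current phrase relative to the previous one."""
    prev_start, prev_end = prev_src_span
    cur_start, cur_end = cur_src_span
    if cur_start >= prev_end:          # current phrase follows the previous one
        return "monotone"
    if cur_end <= prev_start:          # current phrase precedes the previous one
        return "swap"
    return "other"                     # overlapping / discontinuous case

def label_phrase_pairs(phrase_pairs):
    """phrase_pairs: dicts with 'src', 'tgt', 'src_span', 'tgt_span',
    assumed sorted by target-side order."""
    labeled = []
    prev_span = (0, 0)
    for pp in phrase_pairs:
        labeled.append({**pp, "orientation": orientation(prev_span, pp["src_span"])})
        prev_span = pp["src_span"]
    return labeled

pairs = [
    {"src": "wo", "tgt": "I", "src_span": (0, 1), "tgt_span": (0, 1)},
    {"src": "hen xihuan", "tgt": "really like", "src_span": (2, 4), "tgt_span": (1, 3)},
    {"src": "ta", "tgt": "him", "src_span": (1, 2), "tgt_span": (3, 4)},
]
for pp in label_phrase_pairs(pairs):
    print(pp["src"], "->", pp["tgt"], ":", pp["orientation"])
```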
{"title":"Improving phrase-based SMT model with Flattened Bilingual Parse Tree","authors":"Dakun Zhang, Le Sun, Wenbo Li","doi":"10.1109/NLPKE.2010.5587836","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587836","url":null,"abstract":"Phrase orders influence much on translation quality. However, general phrase based methods take only the source side information for phrase orderings. We instead propose a bilingual parse structure, Flattened Bilingual Parse Tree (FBPT), for better describing the inner structure of bilingual sentences and then for better translations. The main idea is to extract phrase pairs with orientation features under the help of FBPT structure. Such features can help maintain better sentence generations during translation. Furthermore, the FBPT structure can be learned automatically from parallel corpus with lower costs without the need of complex linguistic parsing. Evaluations on MT08 translation task indicate that 7% relative improvement on BLEU can be achieved compared to distortion based method (like Pharaoh).","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127236065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Patterns of syntactic trees for parsing Arabic texts
Fériel Ben Fraj Trabelsi, C. Zribi, M. Ahmed
In order to parse Arabic texts, we have chosen to use a machine learning approach that learns from an Arabic Treebank. The knowledge enclosed in this Treebank is structured as patterns of syntactic trees. These patterns are representative models of the syntactic components of the Arabic language. They are not only layered but also both structurally and contextually rich, and they serve as an informational source for guiding the parsing process. Our parser is progressive in that it processes a sentence in a number of stages equal to the number of its words. At each step, the parser assigns to the target word the patterns most likely to represent it in the context where it appears. It then joins the selected patterns with those collected in the previous steps so as to construct the representative syntactic tree(s) of the whole sentence. Preliminary tests have yielded an accuracy of 84.78% and an f-score of 77.52%.
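As a rough illustration of the progressive, pattern-driven loop described above, the sketch below selects the most probable treebank pattern for each word in turn. The pattern table, the English tokens (used only for readability), and the simple append instead of a real tree-joining step are all assumptions; the actual parser conditions pattern choice on richer structural context.

```python
# Hedged sketch: per-word pattern selection in a progressive parse.
from collections import defaultdict

# Hypothetical pattern table learned from a treebank: word -> [(pattern, probability)].
PATTERNS = defaultdict(list, {
    "the":  [("(DT the)", 0.9)],
    "boy":  [("(NP (DT _) (NN boy))", 0.7), ("(NN boy)", 0.3)],
    "runs": [("(S (NP _) (VP (VBZ runs)))", 0.8), ("(VBZ runs)", 0.2)],
})

def best_pattern(word):
    """Pick the most probable pattern for `word` (the real parser also
    conditions this choice on the partial structure built so far)."""
    candidates = PATTERNS.get(word) or [("(UNK %s)" % word, 1.0)]
    return max(candidates, key=lambda c: c[1])[0]

def progressive_parse(sentence):
    selected = []
    for word in sentence.split():            # one stage per word
        selected.append(best_pattern(word))  # the real parser joins the new
                                             # pattern with the previous trees
    return selected

print(progressive_parse("the boy runs"))
```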
{"title":"Patterns of syntactic trees for parsing arabic texts","authors":"Fériel Ben Fraj Trabelsi, C. Zribi, M. Ahmed","doi":"10.1109/NLPKE.2010.5587791","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587791","url":null,"abstract":"In order to parse Arabic texts, we have chosen to use a machine learning approach. It learns from an Arabic Treebank. The knowledge enclosed in this Treebank is structured as patterns of syntactic trees. These patterns are representative models of syntactic components of the Arabic language. They are not only layered but also both structurally and contextually rich. They serve as an informational source for guiding the parsing process. Our parser is progressive given that it proceeds by treating a sentence into a number of stages, equal to the number of its words. At each step, the parser affects the target word with the most likely patterns to represent it in the context where it is put. Then, it joins the selected patterns with those collected in the previous steps so as to construct the representative syntactic tree(s) of the whole sentence. Preliminary tests have yielded to obtain accuracy and f-score which are respectively equal to 84.78% and 77.52%.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125334773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
A novel dependency based word-level reordering model for phrased-based translation
Shui Liu, Sheng Li, T. Zhao, Shiqi Li
Phrase-based statistical MT (SMT) is an important milestone in MT. However, the translation model in phrase-based SMT is structure-free, which limits its reordering capacity to some extent. In order to enhance the reordering capacity of phrase-based SMT, in this paper we propose a head-modifier relation based reordering model, which exploits structured linguistic analysis information in the source language. With a very small reordering model, we significantly improve the performance of phrase-based SMT.
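A minimal sketch of one way a head-modifier relation can feed a word-level ordering feature: for each source word, check whether it stays on the same side of its syntactic head after translation. The one-to-one alignment and the two-way label are simplifying assumptions for illustration, not the paper's exact model.

```python
# Hedged sketch: head-modifier (dependency) orientation labels per source word.

def head_modifier_orientations(heads, alignment):
    """
    heads:     heads[i] = index of the syntactic head of source word i (-1 for root)
    alignment: alignment[i] = target position of source word i (assumed one-to-one)
    returns:   'same-side' / 'crossed' / 'root' label per source word
    """
    labels = []
    for i, h in enumerate(heads):
        if h < 0:
            labels.append("root")
            continue
        src_side = i < h                        # modifier precedes its head in the source?
        tgt_side = alignment[i] < alignment[h]  # ... and in the target?
        labels.append("same-side" if src_side == tgt_side else "crossed")
    return labels

# Toy example: 3-word sentence, word 1 is the root, words 0 and 2 modify it.
print(head_modifier_orientations(heads=[1, -1, 1], alignment=[2, 0, 1]))
```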
{"title":"A novel dependency based word-level reordering model for phrased-based translation","authors":"Shui Liu, Sheng Li, T. Zhao, Shiqi Li","doi":"10.1109/NLPKE.2010.5587829","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587829","url":null,"abstract":"Phrase based statistic MT (SMT) is an important milestone in MT. However, the translation model in the phrase based SMT is structure free which limits its reordering capacity to some extent. In order to enhance the reordering capacity of phrase based SMT, in this paper we propose a head-modifier relation based reordering model, which exploits the way how to utilize the structured linguistic analysis information in source language. Within very small size of reordering model, we enhance the performance of the phrase based SMT significantly.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117142532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Dashboard: An integration and testing platform based on backboard architecture for NLP applications
Pawan Kumar, Arun Kumar Rathaur, R. Ahmad, M. K. Sinha, R. Sangal
The paper presents a software integration, testing and visualization tool, called Dashboard, which is based on a pipe-lined backboard architecture for a family of natural language processing (NLP) applications. The Dashboard helps in testing a module in isolation, facilitating the training and tuning of a module, integrating and testing a set of heterogeneous modules, and building and testing the complete integrated system as well. It is also equipped with a user-friendly visualization tool to build, test, and integrate a system (or a subsystem) and view its component-wise performance as well as its step-wise processing. The Dashboard is being successfully used by a consortium of eleven academic institutions to develop a suite of bi-directional machine translation (MT) systems for nine pairs of Indic languages, and six MT systems have already been deployed on the web. The MT systems are being developed by reusing / re-engineering NLP modules previously developed by different institutions in different programming languages, using Dashboard as the testing and integration tool. The paper also discusses the experiences of developing MT products in consortium mode, using Dashboard as the integrating and testing platform, and its proposed enhancements.
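To make the architectural idea concrete, here is a minimal sketch of the general pipe-lined blackboard pattern: heterogeneous modules run in sequence, each reading annotations left on a shared blackboard by earlier stages and writing its own. This illustrates only the integration concept; it is not Dashboard's actual API, and the module names are placeholders.

```python
# Hedged sketch: a pipeline of modules sharing a blackboard of annotations.
from typing import Callable, Dict, List

Blackboard = Dict[str, object]
Module = Callable[[Blackboard], None]

def tokenizer(bb: Blackboard) -> None:
    bb["tokens"] = bb["text"].split()

def pos_tagger(bb: Blackboard) -> None:
    # Stand-in tagger: mark everything as a noun so the pipeline runs end to end.
    bb["pos"] = [(tok, "NN") for tok in bb["tokens"]]

def run_pipeline(modules: List[Module], text: str) -> Blackboard:
    bb: Blackboard = {"text": text}
    for module in modules:          # each stage could also be tested in isolation
        module(bb)
    return bb

result = run_pipeline([tokenizer, pos_tagger], "Dashboard integrates NLP modules")
print(result["pos"])
```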
{"title":"Dashboard: An integration and testing platform based on backboard architecture for NLP applications","authors":"Pawan Kumar, Arun Kumar Rathaur, R. Ahmad, M. K. Sinha, R. Sangal","doi":"10.1109/NLPKE.2010.5587779","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587779","url":null,"abstract":"The paper presents a software integration, testing and visualization tool, called Dashboard, which is based on pipe-lined backboard architecture for family of natural language processing (NLP) application. The Dashboard helps in testing of a module in isolation, facilitating the training and tuning of a module, integration and testing of a set of heterogeneous modules, and building and testing of complete integrated system as well. It is also equipped with a user-friendly visualization tool to build, test, and integrate a system (or a subsystem) and view its component-wise performance, and step-wise processing as well. The Dashboard is being successfully used by a consortium of eleven academic institutions to develop a suite of bi-directional machine translation (MT) system for nine pairs of Indic languages, and six MT systems have already been deployed on web. The MT systems are being developed by reusing / re-engineering previously developed NLP modules, by different institutions, in different programming languages, using Dashboard as the testing and integration tool. The paper also discusses the experiences of developing MT products in consortium mode, using Dashboard as its integrating and testing platform, and its proposed enhancements.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114139261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Transitivity in semantic relation learning
F. Fallucchi, Fabio Massimo Zanzotto
Text understanding models exploit semantic networks of words as basic components. Automatically enriching and expanding these resources is thus an important challenge for NLP. Existing models for enriching semantic resources based on lexical-syntactic patterns make little use of the structural properties of the target semantic relations. In this paper, we propose a novel approach that includes transitivity in probabilistic models for expanding semantic resources. We include transitivity directly in the formulation of the probabilistic models. Experiments demonstrate that these models are an effective way of exploiting the structural properties of relations when learning semantic networks.
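A minimal sketch of what "including transitivity" can mean for a transitive relation such as hypernymy: if the model believes (a, b) and (b, c), its belief in (a, c) is raised to at least the product of the two. The product rule and the single-pass update are illustrative assumptions, not the paper's exact probabilistic formulation.

```python
# Hedged sketch: a transitivity-based update of relation probabilities.

def apply_transitivity(prob):
    """prob: dict mapping (x, y) pairs to P(x R y); returns an updated copy."""
    updated = dict(prob)
    entities = {x for pair in prob for x in pair}
    for a in entities:
        for b in entities:
            for c in entities:
                p_ab = prob.get((a, b), 0.0)
                p_bc = prob.get((b, c), 0.0)
                if p_ab and p_bc:
                    # Raise belief in the implied pair (a, c), never lower it.
                    updated[(a, c)] = max(updated.get((a, c), 0.0), p_ab * p_bc)
    return updated

scores = {("dog", "mammal"): 0.9, ("mammal", "animal"): 0.95}
print(apply_transitivity(scores).get(("dog", "animal")))  # -> 0.855
```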
{"title":"Transitivity in semantic relation learning","authors":"F. Fallucchi, Fabio Massimo Zanzotto","doi":"10.1109/NLPKE.2010.5587773","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587773","url":null,"abstract":"Text understanding models exploit semantic networks of words as basic components. Automatically enriching and expanding these resources is then an important challenge for NLP. Existing models for enriching semantic resources based on lexical-syntactic patterns make little use of structural properties of target semantic relations. In this paper, we propose a novel approach to include transitivity in probabilistic models for expanding semantic resources. We directly include transitivity in the formulation of probabilistic models. Experiments demonstrate that these models are an effective way for exploiting structural properties of relations in learning semantic networks.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114453951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
An improved method of keywords extraction based on short technology text
Jun Wang, Lei Li, F. Ren
Keywords are critical resources for information management and retrieval and for automatic text classification and clustering. Keyword extraction plays an important role in the process of constructing structured text. Current keyword extraction algorithms have matured in some respects. However, word segmentation errors caused by unknown words have affected the performance of Chinese keyword extraction, particularly in the field of technological text. In order to solve this problem, this paper proposes an improved keyword extraction method based on the relationships among words. Experiments show that the proposed method can effectively correct the errors caused by segmentation and improve the performance of keyword extraction, and it can also be extended to other areas.
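A minimal sketch of the underlying idea of using relationships between adjacent segments to repair segmentation errors before keyword ranking: adjacent segments that co-occur far more often than chance are merged back into a single candidate term. The PMI-style merging criterion, the threshold value, and the toy data are illustrative assumptions, not the paper's exact method.

```python
# Hedged sketch: merge adjacent segments with high pointwise mutual information.
import math
from collections import Counter

def merge_by_cooccurrence(segmented_texts, threshold=1.5):
    unigrams = Counter(w for text in segmented_texts for w in text)
    bigrams = Counter(tuple(text[i:i + 2]) for text in segmented_texts
                      for i in range(len(text) - 1))
    total = sum(unigrams.values())

    def pmi(a, b):
        p_ab = bigrams[(a, b)] / total
        p_a, p_b = unigrams[a] / total, unigrams[b] / total
        return math.log2(p_ab / (p_a * p_b)) if p_ab else float("-inf")

    # Keep adjacent pairs whose association exceeds the threshold as merged terms.
    return {a + b for (a, b) in bigrams if pmi(a, b) > threshold}

texts = [["数据", "挖掘", "技术"], ["数据", "挖掘", "算法"], ["图像", "处理"]]
print(merge_by_cooccurrence(texts))   # prints the merged multi-segment candidates
```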
{"title":"An improved method of keywords extraction based on short technology text","authors":"Jun Wang, Lei Li, F. Ren","doi":"10.1109/NLPKE.2010.5587797","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587797","url":null,"abstract":"Keywords are the critical resources of information management and retrieval, automatic text classification and clustering. The keywords extraction plays an important role in the process of constructing structured text. Current algorithms of keywords extraction have matured in some ways. However the errors of word segmentation which caused by unknown words have been affected the performance of Chinese keywords extraction, particularly in the field of technological text. In order to solve the problem, this paper proposes an improved method of keywords extraction based on the relationship among words. Experiments show that the proposed method can effectively correct the errors caused by segmentation and improve the performance of keywords extraction, and it can also extend to other areas.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"451 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124486207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Graph-based text representation model and its realization
Faguo Zhou, Fan Zhang, Bingru Yang
In this paper, after summarizing several commonly used text representation models, such as the Boolean model, the probability model and the vector space model, and mainly in response to the defects of the vector space model, a word semantic space is proposed. Within this word semantic space, a graph-based text representation model is designed. Some properties of this type of text representation model are given, and the model can describe the semantic constraints among words in a text. At the same time, the model can also overcome defects of the vector space model, such as its loss of word order and of the boundaries between sentences and phrases. Finally, a method for computing text similarity is put forward.
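The abstract does not give the model's exact construction, so the following is only a minimal sketch of the general graph-based idea: words become nodes, directed edges link adjacent words (so word order is preserved, unlike in a bag-of-words vector), and similarity is measured as overlap between edge sets. The Jaccard measure is an assumption, not the paper's similarity computation.

```python
# Hedged sketch: adjacency-edge text graphs and edge-overlap similarity.

def text_to_edges(tokens):
    """Represent a tokenized text as its set of directed adjacency edges."""
    return {(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)}

def graph_similarity(tokens_a, tokens_b):
    ea, eb = text_to_edges(tokens_a), text_to_edges(tokens_b)
    if not ea and not eb:
        return 1.0
    return len(ea & eb) / len(ea | eb)   # Jaccard overlap of the two edge sets

a = "the cat sat on the mat".split()
b = "the mat sat on the cat".split()     # same words, different order
print(graph_similarity(a, a), graph_similarity(a, b))  # 1.0 vs. a lower score
```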
{"title":"Graph-based text representation model and its realization","authors":"Faguo Zhou, Fan Zhang, Bingru Yang","doi":"10.1109/NLPKE.2010.5587861","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587861","url":null,"abstract":"In this paper, on the foundation of summarizing several common used text representation models, such as Boolean model, probability model, vector space model and so on, mainly according to the defects of the vector space model, the word semantic space is proposed. And in the word semantic space, a graph-based text representation model is designed. Some properties of this type of text representation model have been given, and this model can describe the words semantic constraints in the text. At the same time, this model can also solve the defects of vector space model, such as the order or words, the boundary between sentences and phrases, etc. And at last the method of computing the text similarity is put forward.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133626865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
An ontological model for representing computational lexicons: a componential based approach
M. Al-Yahya, Hend Suliman Al-Khalifa, Alia Bahanshal, Iman Alodah, Nawal Al-Helwah
In recent decades the computational linguistics community has developed important and widely used lexical resources. Although they are very popular within the Natural Language Processing (NLP) community, they do not address two important characteristics of language. The first is that the meaning of a word in a language is a collective effort defined by the people who use the language. The second is that language is a dynamic entity (some words change their meaning, others become obsolete, and new words are born). A computational model which aims to represent this real-world entity should be structured in a way that allows for expansion, facilitates collaboration, and provides transparent meaning representation. This paper addresses these two issues and provides a solution based on Semantic Web technologies. The solution is based on an ontological model for representing computational lexicons using the field theory of semantics and componential analysis. The model has been implemented on the "Time" semantic field vocabulary of the Arabic language, and the results of a preliminary evaluation are presented.
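As a rough illustration of componential analysis within a semantic field encoded with Semantic Web technologies, the sketch below records one "Time"-field entry and its semantic components as RDF triples. It assumes the third-party rdflib package, and the namespace, class, property, and component names are hypothetical; they are not the ontology defined in the paper.

```python
# Hedged sketch: a componential lexical entry as RDF (requires `pip install rdflib`).
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

LEX = Namespace("http://example.org/lexicon#")   # hypothetical vocabulary

g = Graph()
g.bind("lex", LEX)

# The "Time" semantic field and one entry decomposed into components.
g.add((LEX.TimeField, RDF.type, LEX.SemanticField))
g.add((LEX.yawm, RDF.type, LEX.LexicalEntry))            # Arabic 'yawm' (day)
g.add((LEX.yawm, RDFS.label, Literal("يوم", lang="ar")))
g.add((LEX.yawm, LEX.belongsToField, LEX.TimeField))
g.add((LEX.yawm, LEX.hasComponent, LEX.TimeUnit))        # componential features
g.add((LEX.yawm, LEX.hasComponent, LEX.Duration24Hours))

print(g.serialize(format="turtle"))
```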
{"title":"An ontological model for representing computational lexicons a componential based approach","authors":"M. Al-Yahya, Hend Suliman Al-Khalifa, Alia Bahanshal, Iman Alodah, Nawal Al-Helwah","doi":"10.1109/NLPKE.2010.5587768","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587768","url":null,"abstract":"In the last decades the computational linguistics community has developed important and widely used lexical resources. Although they are very popular among the Natural Language Processing (NLP) community, they do not address two important characteristics of language. The first is that the meaning of a word in a language is a collective effort defined by the people who use the language. The second is that language is a dynamic entity (some words change their meaning, others become obsolete, new words are born). A computational model which aims to represent this real world entity should be structured in a way that allows for expansion, facilitates collaboration, and provides transparent meaning representation. This paper addresses these two issues and provides a solution based on Semantic Web technologies. The solution is based on an ontological model for representing computational lexicons using the field theory of semantics and componential analysis. The model has been implemented on the “Time” semantic field vocabulary of the Arabic language and the results of a preliminary evaluation are presented.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131570277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Data selection for statistical machine translation
Peng Liu, Yu Zhou, Chengqing Zong
The bilingual language corpus has a great effect on the performance of a statistical machine translation system. More data generally leads to better performance; however, more data also increases the computational load. In this paper, we propose methods to estimate sentence weights and to select the more informative sentences from the training corpus and the development corpus based on these weights. The translation system is built and tuned on the resulting compact corpus. The experimental results show that we can obtain competitive performance with much less data.
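A minimal sketch of weight-based data selection, assuming a coverage-style weight: each training sentence pair is scored by how many of its source n-grams also occur in the development set, and only the highest-weighted pairs are kept. The bigram coverage score and the fixed keep ratio are illustrative assumptions, not the paper's weighting scheme.

```python
# Hedged sketch: score training sentences by dev-set n-gram coverage and keep the top ones.

def ngrams(tokens, n=2):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def sentence_weight(src_tokens, dev_ngrams, n=2):
    grams = ngrams(src_tokens, n)
    return len(grams & dev_ngrams) / len(grams) if grams else 0.0

def select_informative(training_pairs, dev_sentences, keep_ratio=0.5):
    dev_ngrams = set().union(*(ngrams(s.split()) for s in dev_sentences))
    scored = sorted(training_pairs,
                    key=lambda pair: sentence_weight(pair[0].split(), dev_ngrams),
                    reverse=True)
    return scored[: max(1, int(len(scored) * keep_ratio))]

train = [("we propose a new model", "..."), ("the cat sat on the mat", "...")]
dev = ["we propose a novel reordering model"]
print(select_informative(train, dev))   # keeps the pair that overlaps the dev set
```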
{"title":"Data selection for statistical machine translation","authors":"Peng Liu, Yu Zhou, Chengqing Zong","doi":"10.1109/NLPKE.2010.5587827","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587827","url":null,"abstract":"The bilingual language corpus has a great effect on the performance of a statistical machine translation system. More data will lead to better performance. However, more data also increase the computational load. In this paper, we propose methods to estimate the sentence weight and select more informative sentences from the training corpus and the development corpus based on the sentence weight. The translation system is built and tuned on the compact corpus. The experimental results show that we can obtain a competitive performance with much less data.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"392 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115992325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Research on sentiment classification of Blog based on PMI-IR
Xiuting Duan, Tingting He, Le Song
The growth of Blog texts on the internet has brought new challenges to Chinese text classification. Aiming to solve the semantic deficiency problem of traditional methods for Chinese text classification, this paper implements a text classification method that classifies a blog as joy, anger, sadness or fear using a simple unsupervised learning algorithm. The class of a blog text is predicted from the maximum semantic orientation (SO) of the phrases in the blog text that contain adjectives or adverbs. In this paper, the SO of a phrase is calculated as the mutual information between the given phrase and the polar words, and the class of the given blog text is then determined by the maximum mutual information value; a blog text is classified as joy, for example, if the SO of its phrases points to joy. Two different corpora are adopted to test our method: one is the Blog corpus collected by the Monitor and Research Center for National Language Resources, Network Multimedia Sub-branch Center, and the other is the Chinese dataset provided by the COAE2008 task. On both datasets, the method achieves a large improvement over the traditional methods.
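A minimal sketch of PMI-based semantic orientation: the association between a phrase and each emotion class is estimated as PMI(phrase, seed) = log2(p(phrase, seed) / (p(phrase) * p(seed))), and the blog is assigned the class whose seed words yield the highest score. Estimating probabilities from document co-occurrence in a local corpus (rather than search-engine hit counts, as in the original PMI-IR), the crude substring matching, and the tiny seed lists are all simplifying assumptions.

```python
# Hedged sketch: classify a blog by the maximum PMI between its phrases and seed words.
import math

SEEDS = {"joy": ["happy"], "anger": ["angry"], "sadness": ["sad"], "fear": ["afraid"]}

def pmi(phrase, seed, docs):
    """PMI estimated from document co-occurrence (substring matching for simplicity)."""
    n = len(docs)
    p_phrase = sum(phrase in d for d in docs) / n
    p_seed = sum(seed in d for d in docs) / n
    p_joint = sum(phrase in d and seed in d for d in docs) / n
    if not p_joint:
        return float("-inf")
    return math.log2(p_joint / (p_phrase * p_seed))

def classify(phrases, docs):
    scores = {cls: max(pmi(p, s, docs) for p in phrases for s in seeds)
              for cls, seeds in SEEDS.items()}
    return max(scores, key=scores.get)

corpus = [
    "what a wonderful happy day",
    "totally angry about the terrible service",
    "sad and gloomy weather all week",
    "afraid of the dark scary night",
]
print(classify(["wonderful"], corpus))   # -> 'joy'
```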
{"title":"Research on sentiment classification of Blog based on PMI-IR","authors":"Xiuting Duan, Tingting He, Le Song","doi":"10.1109/NLPKE.2010.5587849","DOIUrl":"https://doi.org/10.1109/NLPKE.2010.5587849","url":null,"abstract":"Development of Blog texts information on the internet has brought new challenge to Chinese text classification. Aim to solving the semantics deficiency problem in traditional methods for Chinese text classification, this paper implements a text classification method on classifying a blog as joy, angry, sad or fear using a simple unsupervised learning algorithm. The classification of a blog text is predicted by the max semantic orientation (SO) of the phrases in the blog text that contains adjectives or adverbs. In this paper, the SO of a phrase is calculated as the mutual information between the given phrase and the polar words. Then the SO of the given blog text is determined by the max mutual information value. A blog text is classified as joy if the SO of its phrases is joy. Two different corpora are adopted to test our method, one is the Blog corpus collected by Monitor and Research Center for National Language Resource Network Multimedia Sub-branch Center, and the other is Chinese dataset provided by COAE2008 task. Based on the two datasets, the method respectively achieves a high improvement compared to the traditional methods.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116011509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4