
Latest publications from VS@HLT-NAACL

Distributional Semantic Concept Models for Entity Relation Discovery
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1507
J. Urbain, Glenn Bushee, George Kowalski
We present an ad hoc concept modeling approach using distributional semantic models to identify fine-grained entities and their relations in an online search setting. Concepts are generated from user-defined seed terms, distributional evidence, and a relational model over concept distributions. A dimensional indexing model is used for efficient aggregation of distributional, syntactic, and relational evidence. The proposed semi-supervised model allows concepts to be defined and related at varying levels of granularity and scope. Qualitative evaluations on medical records, intelligence documents, and open domain web data demonstrate the efficacy of our approach.
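To make the idea concrete, here is a minimal sketch (not the authors' dimensional indexing system) of growing a user-defined concept from seed terms via distributional similarity over word embeddings; the toy corpus, seed term, and threshold are illustrative assumptions:

```python
# A minimal sketch: grow a user-defined "concept" from seed terms using
# distributional similarity. The corpus and seeds are toy assumptions.
from gensim.models import Word2Vec

corpus = [
    ["patient", "reported", "chest", "pain", "and", "shortness", "of", "breath"],
    ["the", "patient", "was", "given", "aspirin", "for", "chest", "pain"],
    ["aspirin", "and", "ibuprofen", "are", "common", "pain", "medications"],
    ["shortness", "of", "breath", "may", "indicate", "cardiac", "problems"],
]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1,
                 epochs=200, seed=0, workers=1)

def expand_concept(seeds, topn=5, threshold=0.0):
    """Return the seed terms plus their nearest distributional neighbors."""
    concept = set(seeds)
    for word, score in model.wv.most_similar(positive=seeds, topn=topn):
        if score > threshold:
            concept.add(word)
    return concept

print(expand_concept(["aspirin"]))
```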
Citations: 0
Combining Distributed Vector Representations for Words
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1513
Justin Garten, Kenji Sagae, Volkan Ustun, Morteza Dehghani
Recent interest in distributed vector representations for words has resulted in an increased diversity of approaches, each with strengths and weaknesses. We demonstrate how diverse vector representations may be inexpensively composed into hybrid representations, effectively leveraging strengths of individual components, as evidenced by substantial improvements on a standard word analogy task. We further compare these results over different sizes of training sets and find these advantages are more pronounced when training data is limited. Finally, we explore the relative impacts of the differences in the learning methods themselves and the size of the contexts they access.
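The composition step itself is simple; below is a minimal sketch, with toy random vectors standing in for two real embedding models (e.g., word2vec and GloVe), showing normalize-then-concatenate and an analogy-style query:

```python
# A minimal sketch of the paper's core idea: compose vectors from different
# models into a hybrid representation by normalizing and concatenating them.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "apple"]
model_a = {w: rng.normal(size=100) for w in vocab}  # stand-in for word2vec
model_b = {w: rng.normal(size=50) for w in vocab}   # stand-in for GloVe

def normalize(v):
    return v / np.linalg.norm(v)

def hybrid(word):
    # L2-normalize each component so neither model dominates the
    # concatenation, then stack them into one hybrid vector.
    return np.concatenate([normalize(model_a[word]), normalize(model_b[word])])

# Analogy-style query: king - man + woman, scored by cosine similarity.
# With random toy vectors only the mechanics are meaningful, not the answer.
query = normalize(hybrid("king") - hybrid("man") + hybrid("woman"))
candidates = [w for w in vocab if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: float(query @ normalize(hybrid(w))))
print(best)
```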
Citations: 33
Neural context embeddings for automatic discovery of word senses
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1504
Mikael Kågebäck, Fredrik D. Johansson, Richard Johansson, Devdatt P. Dubhashi
Word sense induction (WSI) is the problem of automatically building an inventory of senses for a set of target words using only a text corpus. We introduce a new method for embedding word instances and their context, for use in WSI. The method, Instance-context embedding (ICE), leverages neural word embeddings, and the correlation statistics they capture, to compute high-quality embeddings of word contexts. In WSI, these context embeddings are clustered to find the word senses present in the text. ICE is based on a novel method for combining word embeddings using continuous Skip-gram that draws on both semantic and temporal aspects of context words. ICE is evaluated both in a new system and in an extension to a previous system for WSI. In both cases, we surpass the previous state-of-the-art on the WSI task of SemEval-2013, which highlights the generality of ICE. Our proposed system achieves a 33% relative improvement.
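A simplified sketch of the clustering step (not the full ICE weighting scheme): each occurrence of an ambiguous target is embedded by averaging its context word vectors, and the occurrences are clustered into induced senses. Vectors and contexts are toy assumptions:

```python
# A simplified WSI sketch: average context word vectors per instance of a
# target word, then cluster; clusters stand in for induced senses.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy pretrained embeddings; in practice these come from skip-gram training.
emb = {w: rng.normal(size=20) for w in
       ["river", "money", "deposit", "water", "loan", "flow"]}

instances = [  # contexts of the ambiguous target "bank"
    ["river", "water", "flow"],
    ["money", "loan", "deposit"],
    ["deposit", "money"],
    ["water", "river"],
]

X = np.array([np.mean([emb[w] for w in ctx], axis=0) for ctx in instances])
senses = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(senses)  # instances with the same label share an induced sense
```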
Citations: 26
Distributed Word Representations Improve NER for e-Commerce
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1522
Mahesh Joshi, Ethan Hart, Mirko Vogel, Jean-David Ruvini
This paper presents a case study of using distributed word representations, word2vec in particular, for improving the performance of Named Entity Recognition in the eCommerce domain. We also demonstrate that distributed word representations trained on a smaller amount of in-domain data are more effective than word vectors trained on a very large amount of out-of-domain data, and that their combination gives the best results.
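A minimal sketch of embeddings-as-NER-features, greatly simplified relative to the paper's system (which adds context windows and sequence-level modeling): each token is represented by its word vector and fed to a linear classifier. Vectors and labels below are illustrative assumptions:

```python
# A minimal sketch: per-token word vectors as features for a linear
# NER classifier. Toy in-domain vectors stand in for word2vec output.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=30) for w in
       ["apple", "iphone", "6s", "case", "red", "leather"]}

train_tokens = ["apple", "iphone", "case", "red", "leather", "6s"]
train_labels = ["B-BRAND", "B-MODEL", "O", "B-COLOR", "B-MATERIAL", "I-MODEL"]

X = np.array([emb[t] for t in train_tokens])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
print(clf.predict([emb["iphone"]]))
```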
Citations: 20
A Vector Space Approach for Aspect Based Sentiment Analysis
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1516
Abdulaziz Alghunaim, Mitra Mohtarami, D. S. Cyphers, James R. Glass
Vector representations for language have been shown to be useful in a number of Natural Language Processing tasks. In this paper, we aim to investigate the effectiveness of word vector representations for the problem of Aspect Based Sentiment Analysis. In particular, we target three sub-tasks, namely aspect term extraction, aspect category detection, and aspect sentiment prediction. We investigate the effectiveness of vector representations over different text data and evaluate the quality of domain-dependent vectors. We utilize vector representations to compute various vector-based features and conduct extensive experiments to demonstrate their effectiveness. Using simple vector-based features, we achieve F1 scores of 79.91% for aspect term extraction and 86.75% for category detection, and an accuracy of 72.39% for aspect sentiment prediction.
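As one illustration of such vector-based features, the sketch below detects an aspect category by cosine similarity between an averaged sentence vector and category prototypes built from seed words; this is a hedged simplification, not the paper's exact feature set, and all vectors and seeds are toy assumptions:

```python
# A minimal sketch of one vector-based feature from this family of methods:
# aspect category detection via cosine similarity to category prototypes.
import numpy as np

rng = np.random.default_rng(1)
words = ["waiter", "rude", "pasta", "delicious", "bill", "expensive"]
emb = {w: rng.normal(size=25) for w in words}

categories = {"service": ["waiter"], "food": ["pasta"], "price": ["bill"]}
prototypes = {c: np.mean([emb[w] for w in seeds], axis=0)
              for c, seeds in categories.items()}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def detect_category(sentence_tokens):
    # Average the vectors of the known words, pick the closest prototype.
    sent = np.mean([emb[w] for w in sentence_tokens if w in emb], axis=0)
    return max(prototypes, key=lambda c: cos(sent, prototypes[c]))

print(detect_category(["waiter", "rude"]))
```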
Citations: 39
Bilingual Word Representations with Monolingual Quality in Mind
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1521
Thang Luong, Hieu Pham, Christopher D. Manning
Recent work on learning bilingual representations tends to be tailored towards achieving good performance on bilingual tasks, most often the crosslingual document classification (CLDC) evaluation, to the detriment of preserving the monolingual clustering structure of word representations. In this work, we propose a joint model that learns word representations from scratch, utilizing both the context co-occurrence information captured by the monolingual component and the meaning-equivalence signals from the bilingual constraint. Specifically, we extend the recently popular skip-gram model to learn high-quality bilingual representations efficiently. Our learned embeddings achieve a new state-of-the-art accuracy of 80.3 for the German-to-English CLDC task and a highly competitive 90.7 for the other classification direction. At the same time, our models outperform the best embeddings from past bilingual representation work by a large margin in monolingual word similarity evaluation.
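Schematically, such a joint objective combines monolingual skip-gram terms for each language with cross-lingual terms driven by the bilingual constraint; the rendering below is a hedged reconstruction of that general form, not the paper's exact notation:

```latex
% Schematic joint objective: weighted monolingual skip-gram losses for the
% source and target languages plus cross-lingual prediction terms.
\[
J(\theta) = \alpha\left(J_{\mathrm{mono}}^{\,src} + J_{\mathrm{mono}}^{\,tgt}\right)
          + \beta\left(J_{\mathrm{bi}}^{\,src \to tgt} + J_{\mathrm{bi}}^{\,tgt \to src}\right)
\]
```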
Citations: 330
A Multi-classifier Approach to support Coreference Resolution in a Vector Space Model
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1503
Ana Zelaia Jauregi, Olatz Arregi Uriarte, B. Sierra
In this paper, a different machine learning approach is presented to deal with the coreference resolution task. This approach consists of a multi-classifier system that classifies mention-pairs in a reduced-dimensional vector space. The vector representation for mention-pairs is generated using a rich set of linguistic features. The SVD technique is used to generate the reduced-dimensional vector space. The approach is applied to the OntoNotes v4.0 Release Corpus, using the column-format files from the CoNLL-2011 coreference resolution shared task. The results obtained show that the reduced-dimensional representation obtained by SVD is well suited to classifying mention-pair vectors. Moreover, the multi-classifier plays an important role in improving the results.
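The reduction step is standard; below is a minimal sketch with toy binary feature vectors (stand-ins for the paper's linguistic features) projected with truncated SVD before classification:

```python
# A minimal sketch of the dimensionality-reduction step: mention-pair
# feature vectors are projected into a lower-dimensional space with SVD.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# 200 toy mention-pairs, 500 binary "linguistic" features each.
X = rng.integers(0, 2, size=(200, 500)).astype(float)

svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)  # each mention-pair is now a 50-dim vector
print(X_reduced.shape)
```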
Citations: 0
Estimating User Location in Social Media with Stacked Denoising Auto-encoders
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1527
Ji Liu, D. Inkpen
Only a small fraction of users disclose their physical locations, which may be valuable and useful in applications such as marketing and security monitoring; in order to detect their locations automatically, many approaches have been proposed that use various types of information, including the tweets posted by the users. It is not easy to infer the original locations from textual data, because text tends to be noisy, particularly in social media. Recently, deep learning techniques have been shown to reduce the error rate of many machine learning tasks, due to their ability to learn meaningful representations of input data. We investigate the potential of building a deep-learning architecture to infer the location of Twitter users based merely on their tweets. We find that stacked denoising auto-encoders are well suited for this task, with results comparable to state-of-the-art models.
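A minimal PyTorch sketch of one denoising auto-encoder layer of the kind stacked here: corrupt the input with masking noise, train the network to reconstruct the clean input, then use the hidden codes as features. Sizes, noise level, and data are toy assumptions:

```python
# A minimal sketch of one denoising auto-encoder (DAE) layer.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = (torch.rand(256, 100) > 0.8).float()  # toy bag-of-words tweet vectors

class DenoisingAE(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Sigmoid())
        self.dec = nn.Sequential(nn.Linear(d_hidden, d_in), nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x))

dae = DenoisingAE(100, 32)
opt = torch.optim.Adam(dae.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(200):
    noisy = X * (torch.rand_like(X) > 0.3).float()  # masking corruption
    opt.zero_grad()
    loss = loss_fn(dae(noisy), X)  # reconstruct the *clean* input
    loss.backward()
    opt.step()

# To stack: train a second DAE on dae.enc(X), then fine-tune end-to-end
# with a softmax layer over location labels.
codes = dae.enc(X).detach()
print(codes.shape)
```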
Citations: 39
Simple Semi-Supervised POS Tagging
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1511
K. Stratos, Michael Collins
We tackle the question: how much supervision is needed to achieve state-of-the-art performance in part-of-speech (POS) tagging, if we leverage lexical representations given by the model of Brown et al. (1992)? It has become a standard practice to use automatically induced “Brown clusters” in place of POS tags. We claim that the underlying sequence model for these clusters is particularly well-suited for capturing POS tags. We empirically demonstrate this claim by drastically reducing supervision in POS tagging with these representations. Using either the bit-string form given by the algorithm of Brown et al. (1992) or the (less well-known) embedding form given by the canonical correlation analysis algorithm of Stratos et al. (2014), we can obtain 93% tagging accuracy with just 400 labeled words and achieve state-of-the-art accuracy (> 97%) with less than 1 percent of the original training data.
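A minimal sketch of the standard trick that underlies this line of work: represent a word by prefixes of its Brown-cluster bit string, so that words sharing coarse clusters share features. The bit strings below are made-up illustrations, not real Brown clusters:

```python
# A minimal sketch of Brown-cluster bit-string prefix features.
brown = {  # hypothetical word -> bit-string assignments
    "cat": "0010110",
    "dog": "0010111",
    "run": "110010",
    "walk": "110011",
}

def prefix_features(word, lengths=(2, 4, 6)):
    """Features from bit-string prefixes at several granularities."""
    bits = brown[word]
    return {f"brown_{k}={bits[:k]}" for k in lengths if len(bits) >= k}

# "cat" and "dog" share all coarse prefixes, so they share features.
print(prefix_features("cat") & prefix_features("dog"))
```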
Citations: 16
Vector Space Models for Scientific Document Summarization
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1525
John M. Conroy, Sashka Davis
In this paper we compare the performance of three approaches for estimating the latent weights of terms for scientific document summarization, given the document and a set of citing documents. The first approach is a term-frequency (TF) vector space method utilizing non-negative matrix factorization (NNMF) for dimensionality reduction. The other two are language modeling approaches for predicting the term distributions of human-generated summaries. The language model we build exploits the key sections of the document and a set of citing sentences derived from auxiliary documents that cite the document of interest. The parameters of the model may be set via minimization of the Jensen-Shannon (JS) divergence. We use the OCCAMS algorithm (Optimal Combinatorial Covering Algorithm for Multi-document Summarization) to select a set of sentences that maximizes the term-coverage score while minimizing redundancy. The results are evaluated with standard ROUGE metrics, and the resulting methods achieve ROUGE scores exceeding those of the average human summarizer.
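A minimal sketch of the fitting criterion: computing the Jensen-Shannon divergence between a candidate term distribution and a target distribution estimated from human summaries. The distributions below are illustrative assumptions:

```python
# A minimal sketch of the JS divergence between two term distributions.
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (symmetric, bounded by 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log(0) is taken as 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

target = [0.4, 0.3, 0.2, 0.1]       # term weights from human summaries
candidate = [0.35, 0.35, 0.2, 0.1]  # weights from a candidate model
print(round(js_divergence(target, candidate), 4))
```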
Citations: 12