
Latest publications in VS@HLT-NAACL

Short Text Clustering via Convolutional Neural Networks
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1509
Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, Hongwei Hao
Short text clustering has become an increasingly important task with the popularity of social media, but it remains challenging due to the sparseness of short text representations. In this paper, we propose Short Text Clustering via Convolutional neural networks (STCC), which benefits clustering by imposing a constraint on the learned features through a self-taught learning framework, without using any external tags/labels. First, we embed the original keyword features into compact binary codes under a locality-preserving constraint. Then, word embeddings are explored and fed into convolutional neural networks to learn deep feature representations, with the output units fitting the pre-trained binary codes during training. After obtaining the learned representations, we cluster them with K-means. Our extensive experimental study on two public short text datasets shows that the deep feature representations learned by our approach achieve significantly better clustering performance than existing features such as term frequency-inverse document frequency, Laplacian eigenvectors, and averaged embeddings.
{"title":"Short Text Clustering via Convolutional Neural Networks","authors":"Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, Hongwei Hao","doi":"10.3115/v1/W15-1509","DOIUrl":"https://doi.org/10.3115/v1/W15-1509","url":null,"abstract":"Short text clustering has become an increasing important task with the popularity of social media, and it is a challenging problem due to its sparseness of text representation. In this paper, we propose a Short Text Clustering via Convolutional neural networks (abbr. to STCC), which is more beneficial for clustering by considering one constraint on learned features through a self-taught learning framework without using any external tags/labels. First, we embed the original keyword features into compact binary codes with a locality-preserving constraint. Then, word embed-dings are explored and fed into convolutional neural networks to learn deep feature representations, with the output units fitting the pre-trained binary code in the training process. After obtaining the learned representations, we use K-means to cluster them. Our extensive experimental study on two public short text datasets shows that the deep feature representation learned by our approach can achieve a significantly better performance than some other existing features, such as term frequency-inverse document frequency, Laplacian eigenvectors and average embedding, for clustering.","PeriodicalId":299646,"journal":{"name":"VS@HLT-NAACL","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129174249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 148
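The STCC pipeline ends with binary coding of features and K-means over the learned representations. The sketch below is only a toy stand-in under stated assumptions: the locality-preserving binary coding is approximated here by a random projection with per-dimension median thresholding, the "deep features" are synthetic blobs rather than CNN outputs, and `binary_codes`/`kmeans` are hypothetical helper names, not the authors' code.

```python
import numpy as np

def binary_codes(features, n_bits, rng):
    """Project features to n_bits dimensions and threshold at the
    per-dimension median -- a crude stand-in for the paper's
    locality-preserving binary coding."""
    proj = features @ rng.normal(size=(features.shape[1], n_bits))
    return (proj > np.median(proj, axis=0)).astype(float)

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's K-means on the learned representations."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # recompute centers (keep the old center if a cluster empties)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
# toy "deep features": two well-separated blobs of short-text vectors
X = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(3, 0.1, (20, 8))])
codes = binary_codes(X, n_bits=16, rng=rng)
labels = kmeans(X, k=2, rng=rng)
```

With well-separated inputs, K-means recovers the two groups regardless of which points seed the centers.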
Word Embeddings vs Word Types for Sequence Labeling: the Curious Case of CV Parsing
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1517
Melanie Tosik, C. Hansen, Gerard Goossen, M. Rotaru
We explore new methods of improving Curriculum Vitae (CV) parsing for German documents by applying recent research on word embeddings in Natural Language Processing (NLP). Our approach integrates the word embeddings as input features for a probabilistic sequence labeling model based on the Conditional Random Field (CRF) framework. The best-performing word embeddings are generated from a large sample of German CVs. The best results on the extraction task are obtained by the model that integrates the word embeddings with a number of hand-crafted features. The improvements are consistent across different sections of the target documents. The effect of the word embeddings is strongest on semi-structured, out-of-sample data.
{"title":"Word Embeddings vs Word Types for Sequence Labeling: the Curious Case of CV Parsing","authors":"Melanie Tosik, C. Hansen, Gerard Goossen, M. Rotaru","doi":"10.3115/v1/W15-1517","DOIUrl":"https://doi.org/10.3115/v1/W15-1517","url":null,"abstract":"We explore new methods of improving Curriculum Vitae (CV) parsing for German documents by applying recent research on the application of word embeddings in Natural Language Processing (NLP). Our approach integrates the word embeddings as input features for a probabilistic sequence labeling model that relies on the Conditional Random Field (CRF) framework. Best-performing word embeddings are generated from a large sample of German CVs. The best results on the extraction task are obtained by the model which integrates the word embeddings together with a number of hand-crafted features. The improvements are consistent throughout different sections of the target documents. The effect of the word embeddings is strongest on semi-structured, out-of-sample data.","PeriodicalId":299646,"journal":{"name":"VS@HLT-NAACL","volume":"410 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126689825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
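The core idea — per-token observation vectors that concatenate word embeddings with hand-crafted indicator features, to be consumed by a CRF — can be sketched as below. Everything here is illustrative: the embedding table is random (the paper trains embeddings on German CVs), the vocabulary and `handcrafted`/`featurize` helpers are hypothetical, and the CRF layer itself is omitted.

```python
import numpy as np

# Toy embedding table; the paper trains its embeddings on a large
# sample of German CVs (random vectors here, for illustration only).
EMB_DIM = 4
rng = np.random.default_rng(1)
vocab = {"Max": 0, "arbeitete": 1, "bei": 2, "Siemens": 3, "<unk>": 4}
emb = rng.normal(size=(len(vocab), EMB_DIM))

def handcrafted(tok):
    """A few indicator features of the kind CRF taggers typically use."""
    return np.array([
        tok[0].isupper(),               # capitalized
        any(c.isdigit() for c in tok),  # contains a digit
        len(tok) > 6,                   # "long" token
    ], dtype=float)

def featurize(tokens):
    """Concatenate embedding and hand-crafted features per token,
    yielding the observation matrix a CRF layer would consume."""
    rows = []
    for tok in tokens:
        e = emb[vocab.get(tok, vocab["<unk>"])]
        rows.append(np.concatenate([e, handcrafted(tok)]))
    return np.stack(rows)

X = featurize(["Max", "arbeitete", "bei", "Siemens"])
```

Each row is one token: 4 embedding dimensions followed by 3 indicator features.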
Morpho-syntactic Regularities in Continuous Word Representations: A multilingual study.
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1518
Garrett Nicolai, Colin Cherry, Grzegorz Kondrak
We replicate the syntactic experiments of Mikolov et al. (2013b) on English, and expand them to include morphologically complex languages. We learn vector representations for Dutch, French, German, and Spanish with the WORD2VEC tool, and investigate to what extent inflectional information is preserved across vectors. We observe that the accuracy of vectors on a set of syntactic analogies is inversely correlated with the morphological complexity of the language.
{"title":"Morpho-syntactic Regularities in Continuous Word Representations: A multilingual study.","authors":"Garrett Nicolai, Colin Cherry, Grzegorz Kondrak","doi":"10.3115/v1/W15-1518","DOIUrl":"https://doi.org/10.3115/v1/W15-1518","url":null,"abstract":"We replicate the syntactic experiments of Mikolov et al. (2013b) on English, and expand them to include morphologically complex languages. We learn vector representations for Dutch, French, German, and Spanish with the WORD2VEC tool, and investigate to what extent inflectional information is preserved across vectors. We observe that the accuracy of vectors on a set of syntactic analogies is inversely correlated with the morphological complexity of the language.","PeriodicalId":299646,"journal":{"name":"VS@HLT-NAACL","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133528913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
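The syntactic-analogy evaluation replicated here follows Mikolov et al.'s vector-offset method: answer "a is to b as c is to ?" by the nearest cosine neighbour of b − a + c, excluding the query words. The sketch below uses hand-built toy vectors in which the past-tense offset is exactly consistent across verbs — real word2vec vectors only approximate this.

```python
import numpy as np

# Toy embedding space with an exact past-tense offset (third dimension).
emb = {
    "walk":    np.array([1.0, 0.0, 0.0]),
    "walked":  np.array([1.0, 0.0, 1.0]),
    "jump":    np.array([0.0, 1.0, 0.0]),
    "jumped":  np.array([0.0, 1.0, 1.0]),
    "jumping": np.array([0.0, 1.0, 2.0]),
}

def solve_analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by the nearest cosine neighbour
    of b - a + c, excluding the three query words."""
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in emb.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cos(candidates[w], target))

answer = solve_analogy("walk", "walked", "jump")
```

For morphologically rich languages the offset is noisier — one inflectional relation fans out over many surface forms, which is consistent with the inverse correlation the paper reports.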
Relation Extraction: Perspective from Convolutional Neural Networks
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1506
Thien Huu Nguyen, R. Grishman
Up to now, relation extraction systems have made extensive use of features generated by linguistic analysis modules. Errors in these features lead to errors in relation detection and classification. In this work, we depart from these traditional approaches and their complicated feature engineering by introducing a convolutional neural network for relation extraction that automatically learns features from sentences and minimizes dependence on external toolkits and resources. Our model takes advantage of multiple filter window sizes and pre-trained word embeddings as an initializer in a non-static architecture to improve performance. We also address the relation extraction problem on an unbalanced corpus. The experimental results show that our system significantly outperforms not only the best baseline systems for relation extraction but also state-of-the-art systems for relation classification.
{"title":"Relation Extraction: Perspective from Convolutional Neural Networks","authors":"Thien Huu Nguyen, R. Grishman","doi":"10.3115/v1/W15-1506","DOIUrl":"https://doi.org/10.3115/v1/W15-1506","url":null,"abstract":"Up to now, relation extraction systems have made extensive use of features generated by linguistic analysis modules. Errors in these features lead to errors of relation detection and classification. In this work, we depart from these traditional approaches with complicated feature engineering by introducing a convolutional neural network for relation extraction that automatically learns features from sentences and minimizes the dependence on external toolkits and resources. Our model takes advantages of multiple window sizes for filters and pre-trained word embeddings as an initializer on a non-static architecture to improve the performance. We emphasize the relation extraction problem with an unbalanced corpus. The experimental results show that our system significantly outperforms not only the best baseline systems for relation extraction but also the state-of-the-art systems for relation classification.","PeriodicalId":299646,"journal":{"name":"VS@HLT-NAACL","volume":"11 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124258157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 464
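The "multiple window sizes" idea amounts to convolving filters of several widths over the token-embedding matrix and max-pooling each one over positions. The sketch below is a minimal numpy version under assumptions of my own: one scalar feature per filter, random filters and embeddings, and arbitrary filter counts — not the paper's trained network.

```python
import numpy as np

def conv_max_pool(sent_emb, filters):
    """Slide each filter (window_size x emb_dim) over the token
    embeddings, then max-pool over positions -- one scalar per filter."""
    n, _ = sent_emb.shape
    feats = []
    for W in filters:
        w = W.shape[0]
        scores = [np.sum(W * sent_emb[i:i + w]) for i in range(n - w + 1)]
        feats.append(max(scores))
    return np.array(feats)

rng = np.random.default_rng(0)
emb_dim, sent_len = 5, 9
sent = rng.normal(size=(sent_len, emb_dim))  # stand-in word embeddings

# Several window sizes, as in the paper; 3 filters per size is arbitrary.
filters = [rng.normal(size=(w, emb_dim)) for w in (2, 3, 4, 5) for _ in range(3)]
sentence_rep = conv_max_pool(sent, filters)
```

The pooled vector (here 4 window sizes × 3 filters = 12 values) is what a softmax classification layer would consume.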
Semantic Information Extraction for Improved Word Embeddings
Pub Date: 2015-06-01 DOI: 10.3115/v1/W15-1523
Jiaqiang Chen, Gerard de Melo
Word embeddings have recently proven useful in a number of different applications that deal with natural language. Such embeddings succinctly reflect semantic similarities between words based on their sentence-internal contexts in large corpora. In this paper, we show that information extraction techniques provide valuable additional evidence of semantic relationships that can be exploited when producing word embeddings. We propose a joint model to train word embeddings both on regular context information and on more explicit semantic extractions. The word vectors obtained from such an augmented joint training show improved results on word similarity tasks, suggesting that they can be useful in applications that involve word meanings.
{"title":"Semantic Information Extraction for Improved Word Embeddings","authors":"Jiaqiang Chen, Gerard de Melo","doi":"10.3115/v1/W15-1523","DOIUrl":"https://doi.org/10.3115/v1/W15-1523","url":null,"abstract":"Word embeddings have recently proven useful in a number of different applications that deal with natural language. Such embeddings succinctly reflect semantic similarities between words based on their sentence-internal contexts in large corpora. In this paper, we show that information extraction techniques provide valuable additional evidence of semantic relationships that can be exploited when producing word embeddings. We propose a joint model to train word embeddings both on regular context information and on more explicit semantic extractions. The word vectors obtained from such an augmented joint training show improved results on word similarity tasks, suggesting that they can be useful in applications that involve word meanings.","PeriodicalId":299646,"journal":{"name":"VS@HLT-NAACL","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129471683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
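A joint objective of this kind adds, on top of the usual context loss, a term that pulls together word vectors linked by an explicit semantic extraction. The sketch below shows only that extra term as a squared-distance penalty optimized by SGD, with made-up words and the context loss omitted entirely — a simplification of mine, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = {w: rng.normal(size=4) for w in ("car", "automobile", "drive")}

def sgd_step(w1, w2, lr=0.1):
    """One gradient step on a squared-distance penalty that pulls an
    extracted pair (w1, w2) together -- the 'explicit semantic
    extraction' term of a joint objective (context term omitted)."""
    grad = 2 * (vecs[w1] - vecs[w2])
    vecs[w1] -= lr * grad
    vecs[w2] += lr * grad

before = np.linalg.norm(vecs["car"] - vecs["automobile"])
for _ in range(10):
    sgd_step("car", "automobile")
after = np.linalg.norm(vecs["car"] - vecs["automobile"])
```

Each step shrinks the pair's difference vector by a constant factor (here 0.6), so extracted synonyms drift together while unrelated words like "drive" are untouched by this term.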