People often make commitments to perform future actions. Detecting commitments made in email (e.g., "I'll send the report by end of day") enables digital assistants to help their users recall promises they have made and assist them in meeting those promises in a timely manner. In this paper, we show that commitments can be reliably extracted from emails when models are trained and evaluated on the same domain (corpus). However, their performance degrades when the evaluation domain differs. This illustrates the domain bias associated with email datasets and a need for more robust and generalizable models for commitment detection. To learn a domain-independent commitment model, we first characterize the differences between domains (email corpora) and then use this characterization to transfer knowledge between them. We investigate the performance of domain adaptation, namely transfer learning, at different granularities: feature-level adaptation and sample-level adaptation. We extend this further using a neural autoencoder trained to learn a domain-independent representation for training samples. We show that transfer learning can help remove domain bias to obtain models with less domain dependence. Overall, our results show that domain differences can have a significant negative impact on the quality of commitment detection models and that transfer learning has enormous potential to address this issue.
{"title":"Domain Adaptation for Commitment Detection in Email","authors":"H. Azarbonyad, Robert Sim, Ryen W. White","doi":"10.1145/3289600.3290984","DOIUrl":"https://doi.org/10.1145/3289600.3290984","url":null,"abstract":"People often make commitments to perform future actions. Detecting commitments made in email (e.g., \"I'll send the report by end of day'') enables digital assistants to help their users recall promises they have made and assist them in meeting those promises in a timely manner. In this paper, we show that commitments can be reliably extracted from emails when models are trained and evaluated on the same domain (corpus). However, their performance degrades when the evaluation domain differs. This illustrates the domain bias associated with email datasets and a need for more robust and generalizable models for commitment detection. To learn a domain-independent commitment model, we first characterize the differences between domains (email corpora) and then use this characterization to transfer knowledge between them. We investigate the performance of domain adaptation, namely transfer learning, at different granularities: feature-level adaptation and sample-level adaptation. We extend this further using a neural autoencoder trained to learn a domain-independent representation for training samples. We show that transfer learning can help remove domain bias to obtain models with less domain dependence. Overall, our results show that domain differences can have a significant negative impact on the quality of commitment detection models and that transfer learning has enormous potential to address this issue.","PeriodicalId":143253,"journal":{"name":"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129252232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the novel problem of evaluating a recommendation policy offline in environments where the reward signal is non-stationary. Non-stationarity appears in many Information Retrieval (IR) applications such as recommendation and advertising, but its effect on off-policy evaluation has not been studied at all. We are the first to address this issue. First, we analyze standard off-policy estimators in non-stationary environments and show both theoretically and experimentally that their bias grows with time. Then, we propose new off-policy estimators with moving averages and show that their bias is independent of time and can be bounded. Furthermore, we provide a method to trade off bias and variance in a principled way to get an off-policy estimator that works well in both non-stationary and stationary environments. We experiment on publicly available recommendation datasets and show that our newly proposed moving average estimators accurately capture changes in non-stationary environments, while standard off-policy estimators fail to do so.
{"title":"When People Change their Mind: Off-Policy Evaluation in Non-stationary Recommendation Environments","authors":"R. Jagerman, I. Markov, M. de Rijke","doi":"10.1145/3289600.3290958","DOIUrl":"https://doi.org/10.1145/3289600.3290958","url":null,"abstract":"We consider the novel problem of evaluating a recommendation policy offline in environments where the reward signal is non-stationary. Non-stationarity appears in many Information Retrieval (IR) applications such as recommendation and advertising, but its effect on off-policy evaluation has not been studied at all. We are the first to address this issue. First, we analyze standard off-policy estimators in non-stationary environments and show both theoretically and experimentally that their bias grows with time. Then, we propose new off-policy estimators with moving averages and show that their bias is independent of time and can be bounded. Furthermore, we provide a method to trade-off bias and variance in a principled way to get an off-policy estimator that works well in both non-stationary and stationary environments. We experiment on publicly available recommendation datasets and show that our newly proposed moving average estimators accurately capture changes in non-stationary environments, while standard off-policy estimators fail to do so.","PeriodicalId":143253,"journal":{"name":"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127763115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-lingual summarization (CLS) aims to create summaries in a target language from a document or document set written in a different (source) language. Cross-lingual summarization can play a critical role in enabling cross-lingual information access for millions of people across the globe who do not speak or understand languages with large representation on the web. It can also make documents originally published in local languages quickly accessible to a large audience that does not understand those local languages. Though cross-lingual summarization has attracted some attention in the last decade, there has been no serious effort to publish rigorous software for this task. In this paper, we provide a design for an end-to-end CLS software package called clstk. Besides implementing a number of methods proposed by CLS researchers over the years, the software integrates multiple components critical for CLS. We hope that this extremely modular tool-kit will help CLS researchers contribute more effectively to the area.
{"title":"clstk: The Cross-Lingual Summarization Tool-Kit","authors":"Nisarg Jhaveri, Manish Gupta, Vasudeva Varma","doi":"10.1145/3289600.3290614","DOIUrl":"https://doi.org/10.1145/3289600.3290614","url":null,"abstract":"Cross-lingual summarization (CLS) aims to create summaries in a target language, from a document or document set given in a different, source language. Cross-lingual summarization can play a critical role in enabling cross-lingual information access for millions of people across the globe who do not speak or understand languages having large representation on the web. It can also make documents originally published in local languages quickly accessible to a large audience which does not understand those local languages. Though cross-lingual summarization has gathered some attention in the last decade, there has been no serious effort to publish rigorous software for this task. In this paper, we provide a design for an end-to-end CLS software called clstk. Besides implementing a number of methods proposed by different CLS researchers over years, the software integrates multiple components critical for CLS. We hope that this extremely modular tool-kit will help CLS researchers to contribute more effectively to the area.","PeriodicalId":143253,"journal":{"name":"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130793073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anomaly detection on attributed networks is concerned with finding nodes whose patterns or behaviors deviate significantly from the majority of reference nodes. It has found success in many real-world applications such as network intrusion detection, opinion spam detection, and system fault diagnosis, to name a few. Despite this empirical success, the vast majority of existing efforts operate in an unsupervised scenario due to the expensive labeling costs of ground-truth anomalies. In fact, in many scenarios a small amount of prior human knowledge about the data is often easy to obtain, and involving it in the learning process has been shown to be effective in advancing many important learning tasks. Additionally, since new types of anomalies may constantly arise over time, especially in an adversarial environment, the interests of the human expert may also change according to the anomaly types detected. This brings further challenges to conventional anomaly detection algorithms, as they are typically applied in a batch setting and are incapable of interacting with the environment. To tackle these issues, in this paper we investigate the problem of anomaly detection on attributed networks in an interactive setting, allowing the system to proactively communicate with the human expert by making a limited number of queries about ground-truth anomalies. Our objective is to maximize the number of true anomalies presented to the human expert before a given budget is used up. Along this line, we formulate the problem within the principled multi-armed bandit framework and develop a novel collaborative contextual bandit algorithm, named GraphUCB. In particular, our algorithm: (1) explicitly models the nodal attributes and node dependencies seamlessly in a joint framework; and (2) handles the exploration-exploitation dilemma when querying anomalies of different types. Extensive experiments on real-world datasets show the improvement of the proposed algorithm over state-of-the-art algorithms.
{"title":"Interactive Anomaly Detection on Attributed Networks","authors":"Kaize Ding, Jundong Li, Huan Liu","doi":"10.1145/3289600.3290964","DOIUrl":"https://doi.org/10.1145/3289600.3290964","url":null,"abstract":"Performing anomaly detection on attributed networks concerns with finding nodes whose patterns or behaviors deviate significantly from the majority of reference nodes. Its success can be easily found in many real-world applications such as network intrusion detection, opinion spam detection and system fault diagnosis, to name a few. Despite their empirical success, a vast majority of existing efforts are overwhelmingly performed in an unsupervised scenario due to the expensive labeling costs of ground truth anomalies. In fact, in many scenarios, a small amount of prior human knowledge of the data is often effortless to obtain, and getting it involved in the learning process has shown to be effective in advancing many important learning tasks. Additionally, since new types of anomalies may constantly arise over time especially in an adversarial environment, the interests of human expert could also change accordingly regarding to the detected anomaly types. It brings further challenges to conventional anomaly detection algorithms as they are often applied in a batch setting and are incapable to interact with the environment. To tackle the above issues, in this paper, we investigate the problem of anomaly detection on attributed networks in an interactive setting by allowing the system to proactively communicate with the human expert in making a limited number of queries about ground truth anomalies. Our objective is to maximize the true anomalies presented to the human expert after a given budget is used up. Along with this line, we formulate the problem through the principled multi-armed bandit framework and develop a novel collaborative contextual bandit algorithm, named GraphUCB. In particular, our developed algorithm: (1) explicitly models the nodal attributes and node dependencies seamlessly in a joint framework; and (2) handles the exploration-exploitation dilemma when querying anomalies of different types. Extensive experiments on real-world datasets show the improvement of the proposed algorithm over the state-of-the-art algorithms.","PeriodicalId":143253,"journal":{"name":"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128992871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Representation learning models map data instances into a low-dimensional vector space, thus facilitating the deployment of subsequent models such as classification and clustering models, or the implementation of downstream applications such as recommendation and anomaly detection. However, the outcome of representation learning is difficult for users to interpret directly, since each dimension of the latent space may not have any specific meaning. Understanding representation learning could be beneficial to many applications. For example, in recommender systems, knowing why a user instance is mapped to a certain position in the latent space may unveil the user's interests and profile. In this paper, we propose an interpretation framework to understand and describe how representation vectors are distributed in the latent space. Specifically, we design a coding scheme to transform representation instances into spatial codes indicating their locations in the latent space. Following that, a multimodal autoencoder is built to generate the description of a representation instance given its spatial codes. The coding scheme can indicate position at different granularities, and the incorporation of the autoencoder makes the framework capable of dealing with different types of data. Several metrics are designed to evaluate interpretation results. Experiments under various application scenarios and with different representation learning models demonstrate the flexibility and effectiveness of the proposed framework.
{"title":"Representation Interpretation with Spatial Encoding and Multimodal Analytics","authors":"Ninghao Liu, Mengnan Du, Xia Hu","doi":"10.1145/3289600.3290960","DOIUrl":"https://doi.org/10.1145/3289600.3290960","url":null,"abstract":"Representation learning models map data instances into a low-dimensional vector space, thus facilitating the deployment of subsequent models such as classification and clustering models, or the implementation of downstream applications such as recommendation and anomaly detection. However, the outcome of representation learning is difficult to be directly understood by users, since each dimension of the latent space may not have any specific meaning. Understanding representation learning could be beneficial to many applications. For example, in recommender systems, knowing why a user instance is mapped to a certain position in the latent space may unveil the user's interests and profile. In this paper, we propose an interpretation framework to understand and describe how representation vectors distribute in the latent space. Specifically, we design a coding scheme to transform representation instances into spatial codes to indicate their locations in the latent space. Following that, a multimodal autoencoder is built for generating the description of a representation instance given its spatial codes. The coding scheme enables indication of position with different granularity. The incorporation of autoencoder makes the framework capable of dealing with different types of data. Several metrics are designed to evaluate interpretation results. Experiments under various application scenarios and different representation learning models are conducted to demonstrate the flexibility and effectiveness of the proposed framework.","PeriodicalId":143253,"journal":{"name":"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134187394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online communities such as Facebook and Twitter are enormously popular and have become an essential part of the daily life of many of their users. Through these platforms, users can discover and create information that others will then consume. In that context, recommending relevant information to users becomes critical for viability. However, recommendation in online communities is a challenging problem: 1) users' interests are dynamic, and 2) users are influenced by their friends. Moreover, the influencers may be context-dependent. That is, different friends may be relied upon for different topics. Modeling both signals is therefore essential for recommendations. We propose a recommender system for online communities based on a dynamic-graph-attention neural network. We model dynamic user behaviors with a recurrent neural network, and context-dependent social influence with a graph-attention neural network, which dynamically infers the influencers based on users' current interests. The whole model can be efficiently fit on large-scale data. Experimental results on several real-world data sets demonstrate the effectiveness of our proposed approach over several competitive baselines including state-of-the-art models.
{"title":"Session-Based Social Recommendation via Dynamic Graph Attention Networks","authors":"Weiping Song, Zhiping Xiao, Yifan Wang, Laurent Charlin, Ming Zhang, Jian Tang","doi":"10.1145/3289600.3290989","DOIUrl":"https://doi.org/10.1145/3289600.3290989","url":null,"abstract":"Online communities such as Facebook and Twitter are enormously popular and have become an essential part of the daily life of many of their users. Through these platforms, users can discover and create information that others will then consume. In that context, recommending relevant information to users becomes critical for viability. However, recommendation in online communities is a challenging problem: 1) users' interests are dynamic, and 2) users are influenced by their friends. Moreover, the influencers may be context-dependent. That is, different friends may be relied upon for different topics. Modeling both signals is therefore essential for recommendations. We propose a recommender system for online communities based on a dynamic-graph-attention neural network. We model dynamic user behaviors with a recurrent neural network, and context-dependent social influence with a graph-attention neural network, which dynamically infers the influencers based on users' current interests. The whole model can be efficiently fit on large-scale data. Experimental results on several real-world data sets demonstrate the effectiveness of our proposed approach over several competitive baselines including state-of-the-art models.","PeriodicalId":143253,"journal":{"name":"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133550304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
For a tourist who wishes to stroll in an unknown city, it is useful to receive recommendations of not just the shortest routes but also routes that are pleasant. This paper demonstrates a system that provides such pleasant-route recommendations. Currently, we focus on routes with plenty of greenery and bright views. The system computes pleasure scores by extracting colors and objects from Google Street View panorama images and re-ranks shortest paths by these scores. The current prototype provides route recommendations for city areas in Tokyo, Kyoto, and San Francisco.
{"title":"Pleasant Route Suggestion based on Color and Object Rates","authors":"Shoko Wakamiya, Panote Siriaraya, Yihong Zhang, Yukiko Kawai, E. Aramaki, A. Jatowt","doi":"10.1145/3289600.3290611","DOIUrl":"https://doi.org/10.1145/3289600.3290611","url":null,"abstract":"For a tourist who wishes to stroll in an unknown city, it is useful to have a recommendation of not just the shortest routes but also routes that are pleasant. This paper demonstrates a system that provides pleasant route recommendation. Currently, we focus on routes that have much green and bright views. The system measures pleasure scores by extracting colors or objects in Google Street View panorama images and re-ranks shortest paths in the order of the computed pleasure scores. The current prototype provides route recommendation for city areas in Tokyo, Kyoto and San Francisco.","PeriodicalId":143253,"journal":{"name":"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114579183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we advance the state of the art in topic modeling by means of a new document representation based on pre-trained word embeddings for non-probabilistic matrix factorization. Specifically, our strategy, called CluWords, exploits the nearest words of a given pre-trained word embedding to generate meta-words capable of enhancing the document representation in terms of both syntactic and semantic information. The novel contributions of our solution include: (i) the introduction of a novel data representation for topic modeling based on syntactic and semantic relationships derived from distances calculated within a pre-trained word embedding space, and (ii) the proposal of a new TF-IDF-based strategy, developed particularly to weight the CluWords. In our extensive experimental evaluation, covering 12 datasets and 8 state-of-the-art baselines, we outperform the baselines in almost all cases (with a few ties), with gains of more than 50% over the best baselines (and up to 80% over some runner-ups). Finally, we show that our method is able to improve document representation for the task of automatic text classification.
{"title":"CluWords: Exploiting Semantic Word Clustering Representation for Enhanced Topic Modeling","authors":"Felipe Viegas, Sérgio D. Canuto, Christian Gomes, Washington Cunha, T. Rosa, Sabir Ribas, L. Rocha, Marcos André Gonçalves","doi":"10.1145/3289600.3291032","DOIUrl":"https://doi.org/10.1145/3289600.3291032","url":null,"abstract":"In this paper, we advance the state-of-the-art in topic modeling by means of a new document representation based on pre-trained word embeddings for non-probabilistic matrix factorization. Specifically, our strategy, called CluWords, exploits the nearest words of a given pre-trained word embedding to generate meta-words capable of enhancing the document representation, in terms of both, syntactic and semantic information. The novel contributions of our solution include: (i)the introduction of a novel data representation for topic modeling based on syntactic and semantic relationships derived from distances calculated within a pre-trained word embedding space and (ii)the proposal of a new TF-IDF-based strategy, particularly developed to weight the CluWords. In our extensive experimentation evaluation, covering 12 datasets and 8 state-of-the-art baselines, we exceed (with a few ties) in almost cases, with gains of more than 50% against the best baselines (achieving up to 80% against some runner-ups). Finally, we show that our method is able to improve document representation for the task of automatic text classification.","PeriodicalId":143253,"journal":{"name":"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122546873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Technologists have a responsibility to develop Data Science and AI methods that satisfy fairness, accountability, transparency, and ethical requirements. This statement has repeatedly been made in recent years and in many quarters, including major newspapers and magazines. The technical community has responded with work in this direction. However, almost all of this work has been directed towards the decision-making algorithm that performs a task such as scoring or classification. This presentation examines the Data Science pipeline, and points out the importance of addressing responsibility in all stages of this pipeline, and not just the decision-making stage. The presentation then outlines some recent research results that have been obtained in that regard.
{"title":"Responsible Data Science","authors":"H. Jagadish","doi":"10.1145/3289600.3291287","DOIUrl":"https://doi.org/10.1145/3289600.3291287","url":null,"abstract":"Technologists have a responsibility to develop Data Science and AI methods that satisfy fairness, accountability, transparency, and ethical requirements. This statement has repeatedly been made in recent years and in many quarters, including major newspapers and magazines. The technical community has responded with work in this direction. However, almost all of this work has been directed towards the decision-making algorithm that performs a task such as scoring or classification. This presentation examines the Data Science pipeline, and points out the importance of addressing responsibility in all stages of this pipeline, and not just the decision-making stage. The presentation then outlines some recent research results that have been obtained in that regard.","PeriodicalId":143253,"journal":{"name":"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127642332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Question answering over knowledge graphs (QA-KG) aims to use facts in the knowledge graph (KG) to answer natural language questions. It helps end users access the substantial and valuable knowledge in the KG more efficiently and more easily, without knowing its data structures. QA-KG is a nontrivial problem since capturing the semantic meaning of natural language is difficult for a machine. Meanwhile, many knowledge graph embedding methods have been proposed. Their key idea is to represent each predicate/entity as a low-dimensional vector such that the relation information in the KG is preserved. The learned vectors can benefit various applications such as KG completion and recommender systems. In this paper, we explore using them to handle the QA-KG problem. However, this remains a challenging task since a predicate can be expressed in different ways in natural language questions. Also, the ambiguity of entity names and partial names makes the number of possible answers large. To bridge the gap, we propose an effective Knowledge Embedding based Question Answering (KEQA) framework. We focus on answering the most common type of questions, i.e., simple questions, in which each question can be answered by the machine straightforwardly if its single head entity and single predicate are correctly identified. To answer a simple question, instead of inferring its head entity and predicate directly, KEQA aims to jointly recover the question's head entity, predicate, and tail entity representations in the KG embedding spaces. Based on a carefully designed joint distance metric, the KG fact closest to the three predicted vectors is returned as the answer. Experiments on a widely adopted benchmark demonstrate that the proposed KEQA outperforms state-of-the-art QA-KG methods.
{"title":"Knowledge Graph Embedding Based Question Answering","authors":"Xiao Huang, Jingyuan Zhang, Dingcheng Li, Ping Li","doi":"10.1145/3289600.3290956","DOIUrl":"https://doi.org/10.1145/3289600.3290956","url":null,"abstract":"Question answering over knowledge graph (QA-KG) aims to use facts in the knowledge graph (KG) to answer natural language questions. It helps end users more efficiently and more easily access the substantial and valuable knowledge in the KG, without knowing its data structures. QA-KG is a nontrivial problem since capturing the semantic meaning of natural language is difficult for a machine. Meanwhile, many knowledge graph embedding methods have been proposed. The key idea is to represent each predicate/entity as a low-dimensional vector, such that the relation information in the KG could be preserved. The learned vectors could benefit various applications such as KG completion and recommender systems. In this paper, we explore to use them to handle the QA-KG problem. However, this remains a challenging task since a predicate could be expressed in different ways in natural language questions. Also, the ambiguity of entity names and partial names makes the number of possible answers large. To bridge the gap, we propose an effective Knowledge Embedding based Question Answering (KEQA) framework. We focus on answering the most common types of questions, i.e., simple questions, in which each question could be answered by the machine straightforwardly if its single head entity and single predicate are correctly identified. To answer a simple question, instead of inferring its head entity and predicate directly, KEQA targets at jointly recovering the question's head entity, predicate, and tail entity representations in the KG embedding spaces. Based on a carefully-designed joint distance metric, the three learned vectors' closest fact in the KG is returned as the answer. Experiments on a widely-adopted benchmark demonstrate that the proposed KEQA outperforms the state-of-the-art QA-KG methods.","PeriodicalId":143253,"journal":{"name":"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining","volume":"181 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129023456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}