In this paper, we propose a novel representation learning framework, namely HIN2Vec, for heterogeneous information networks (HINs). The core of the proposed framework is a neural network model, also called HIN2Vec, designed to capture the rich semantics embedded in HINs by exploiting the different types of relationships among nodes. Given a set of relationships specified in the form of meta-paths in an HIN, HIN2Vec carries out multiple prediction training tasks jointly over the target set of relationships to learn latent vectors of nodes and meta-paths in the HIN. In addition to model design, several issues unique to HIN2Vec, including regularization of meta-path vectors, node type selection in negative sampling, and cycles in random walks, are examined. To validate our ideas, we learn latent vectors of nodes from four large-scale real HIN datasets, including Blogcatalog, Yelp, DBLP and U.S. Patents, and use them as features for multi-label node classification and link prediction on those networks. Empirical results show that HIN2Vec soundly outperforms state-of-the-art representation learning models for network data, including DeepWalk, LINE, node2vec, PTE, HINE and ESim, by 6.6% to 23.8% in micro-$f_1$ for multi-label node classification and by 5% to 70.8% in $MAP$ for link prediction.
{"title":"HIN2Vec: Explore Meta-paths in Heterogeneous Information Networks for Representation Learning","authors":"Tao-yang Fu, Wang-Chien Lee, Zhen Lei","doi":"10.1145/3132847.3132953","DOIUrl":"https://doi.org/10.1145/3132847.3132953","url":null,"abstract":"In this paper, we propose a novel representation learning framework, namely HIN2Vec, for heterogeneous information networks (HINs). The core of the proposed framework is a neural network model, also called HIN2Vec, designed to capture the rich semantics embedded in HINs by exploiting different types of relationships among nodes. Given a set of relationships specified in forms of meta-paths in an HIN, HIN2Vec carries out multiple prediction training tasks jointly based on a target set of relationships to learn latent vectors of nodes and meta-paths in the HIN. In addition to model design, several issues unique to HIN2Vec, including regularization of meta-path vectors, node type selection in negative sampling, and cycles in random walks, are examined. To validate our ideas, we learn latent vectors of nodes using four large-scale real HIN datasets, including Blogcatalog, Yelp, DBLP and U.S. Patents, and use them as features for multi-label node classification and link prediction applications on those networks. Empirical results show that HIN2Vec soundly outperforms the state-of-the-art representation learning models for network data, including DeepWalk, LINE, node2vec, PTE, HINE and ESim, by 6.6% to 23.8% of $micro$-$f_1$ in multi-label node classification and 5% to 70.8% of $MAP$ in link prediction.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88204773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mohamed Reda Bouadjenek, Karin M. Verspoor, J. Zobel
In this paper, we explore automatic biological sequence type classification for records in biological sequence databases. The sequence type attribute provides important information about the nature of the sequence represented in a record, and is often used in search to filter out irrelevant sequences. However, the sequence type attribute is generally a non-mandatory free-text field, and is thus subject to many errors, including typos, mis-assignment, and non-assignment. In GenBank, this problem concerns roughly 18% of records, an alarming number that should worry the biocuration community. To address this problem of automatic sequence type classification, we propose the use of the literature associated with sequence records as an external source of knowledge that can be leveraged for the classification task. We define a set of literature-based features and train a machine learning algorithm to classify a record into one of six primary sequence types. The main intuition behind using the literature for this task is that sequences appear to be discussed differently in scientific articles, depending on their type. The experiments we have conducted on the PubMed Central collection show that the literature is indeed an effective way to address this problem of sequence type classification. Our classification method reached an accuracy of 92.7%, and substantially outperformed the two baseline approaches used for comparison.
{"title":"Learning Biological Sequence Types Using the Literature","authors":"Mohamed Reda Bouadjenek, Karin M. Verspoor, J. Zobel","doi":"10.1145/3132847.3133051","DOIUrl":"https://doi.org/10.1145/3132847.3133051","url":null,"abstract":"We explore in this paper automatic biological sequence type classification for records in biological sequence databases. The sequence type attribute provides important information about the nature of a sequence represented in a record, and is often used in search to filter out irrelevant sequences. However, the sequence type attribute is generally a non-mandatory free-text field, and thus it is subject to many errors including typos, mis-assignment, and non-assignment. In GenBank, this problem concerns roughly 18% of records, an alarming number that should worry the biocuration community. To address this problem of automatic sequence type classification, we propose the use of literature associated to sequence records as an external source of knowledge that can be leveraged for the classification task. We define a set of literature-based features and train a machine learning algorithm to classify a record into one of six primary sequence types. The main intuition behind using the literature for this task is that sequences appear to be discussed differently in scientific articles, depending on their type. The experiments we have conducted on the PubMed Central collection show that the literature is indeed an effective way to address this problem of sequence type classification. Our classification method reached an accuracy of 92.7%, and substantially outperformed two baseline approaches used for comparison.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"137 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86462155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sandro Cavallari, V. Zheng, Hongyun Cai, K. Chang, E. Cambria
In this paper, we study an important yet largely under-explored setting of graph embedding, i.e., embedding communities instead of individual nodes. We find that community embedding is not only useful for community-level applications such as graph visualization, but also beneficial to both community detection and node classification. To learn such an embedding, our insight hinges upon a closed loop among community embedding, community detection and node embedding. On the one hand, node embedding can help improve community detection, which in turn outputs good communities for fitting a better community embedding. On the other hand, community embedding can be used to optimize the node embedding by introducing a community-aware high-order proximity. Guided by this insight, we propose a novel community embedding framework that jointly solves the three tasks. We evaluate the framework on multiple real-world datasets, and show that it improves graph visualization and outperforms state-of-the-art baselines in various application tasks, e.g., community detection and node classification.
{"title":"Learning Community Embedding with Community Detection and Node Embedding on Graphs","authors":"Sandro Cavallari, V. Zheng, Hongyun Cai, K. Chang, E. Cambria","doi":"10.1145/3132847.3132925","DOIUrl":"https://doi.org/10.1145/3132847.3132925","url":null,"abstract":"In this paper, we study an important yet largely under-explored setting of graph embedding, i.e., embedding communities instead of each individual nodes. We find that community embedding is not only useful for community-level applications such as graph visualization, but also beneficial to both community detection and node classification. To learn such embedding, our insight hinges upon a closed loop among community embedding, community detection and node embedding. On the one hand, node embedding can help improve community detection, which outputs good communities for fitting better community embedding. On the other hand, community embedding can be used to optimize the node embedding by introducing a community-aware high-order proximity. Guided by this insight, we propose a novel community embedding framework that jointly solves the three tasks together. We evaluate such a framework on multiple real-world datasets, and show that it improves graph visualization and outperforms state-of-the-art baselines in various application tasks, e.g., community detection and node classification.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86086319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Posts by users on microblogs such as Twitter provide diverse real-time updates on major events. Unfortunately, not all of this information is credible. Previous work on assessing the credibility of information on Twitter has focused on extracting features from the tweets themselves. In this work, we present an interactive framework called iFACT for assessing the credibility of claims from tweets. The proposed framework collects independent evidence from web search results (WSR) and identifies the dependencies between claims. It utilizes features from the search results to determine the probabilities that a claim is credible, not credible or inconclusive. Finally, the dependencies between claims are used to adjust the likelihood estimates of a claim being credible, not credible or inconclusive. iFACT allows users to be engaged in the credibility assessment process by providing feedback as to whether the web search results are relevant to, support or contradict a claim. Experimental results on multiple real-world datasets demonstrate the effectiveness of the WSR features and their ability to generalize to claims from new events. Case studies show the usefulness of claim dependencies and how the proposed approach can explain the credibility assessment process.
{"title":"iFACT: An Interactive Framework to Assess Claims from Tweets","authors":"Wee-Yong Lim, M. Lee, W. Hsu","doi":"10.1145/3132847.3132995","DOIUrl":"https://doi.org/10.1145/3132847.3132995","url":null,"abstract":"Posts by users on microblogs such as Twitter provide diverse real-time updates to major events. Unfortunately, not all the information are credible. Previous works that assess the credibility of information in Twitter have focused on extracting features from the Tweets. In this work, we present an interactive framework called iFACT for assessing the credibility of claims from tweets. The proposed framework collects independent evidence from web search results (WSR) and identify the dependencies between claims. It utilizes features from the search results to determine the probabilities that a claim is credible, not credible or inconclusive. Finally, the dependencies between claims are used to adjust the likelihood estimates of a claim being credible, not credible or inconclusive. iFACT allows users to be engaged in the credibility assessment process by providing feedback as to whether the web search results are relevant, support or contradict a claim. Experiment results on multiple real world datasets demonstrate the effectiveness of WSR features and its ability to generalize to claims of new events. Case studies show the usefulness of claim dependencies and how the proposed approach can give explanation to the credibility assessment process.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79364467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meng Hu, Zhixu Li, Yongxin Shen, An Liu, Guanfeng Liu, Kai Zheng, Lei Zhao
Information Extraction by Text Segmentation (IETS) aims at segmenting text inputs to extract implicit data values contained in them. The state-of-the-art IETS approaches mainly rely on machine learning techniques, either supervised or unsupervised. However, while the supervised approaches require large amounts of labelled training data, the performance of the unsupervised ones can be unstable across data sets. To overcome their weaknesses, this paper introduces CNN-IETS, a novel unsupervised probabilistic approach that takes advantage of pre-existing data and a Convolutional Neural Network (CNN)-based probabilistic classification model. While using the CNN model eases the burden of selecting high-quality features for associating text segments with attributes of a given domain, the pre-existing data, as a domain knowledge base, provides training data with a comprehensive list of features for building the CNN model. Given an input text, we perform an initial segmentation (according to the occurrences of its words in the knowledge base) to generate text segments for CNN classification with probabilities. Then, based on the probabilistic CNN classification results, we find the most probable labelling of the whole input text. As a complement, a bidirectional sequencing model learned on demand from test data is finally deployed to further adjust problematic labelled segments. Our experimental study conducted on several real data collections shows that CNN-IETS improves the extraction quality of state-of-the-art approaches by more than 10%.
{"title":"CNN-IETS: A CNN-based Probabilistic Approach for Information Extraction by Text Segmentation","authors":"Meng Hu, Zhixu Li, Yongxin Shen, An Liu, Guanfeng Liu, Kai Zheng, Lei Zhao","doi":"10.1145/3132847.3132962","DOIUrl":"https://doi.org/10.1145/3132847.3132962","url":null,"abstract":"Information Extraction by Text Segmentation (IETS) aims at segmenting text inputs to extract implicit data values contained in them.The state-of-art IETS approaches mainly rely on machine learning techniques, either supervised or unsupervised.However, while the supervised approaches require a large labelled training data, the performance of the unsupervised ones could be unstable on different data sets.To overcome their weaknesses, this paper introduces CNN-IETS, a novel unsupervised probabilistic approach that takes the advantages of pre-existing data and a Convolution Neural Network (CNN)-based probabilistic classification model. While using the CNN model can ease the burden of selecting high-quality features in associating text segments with attributes of a given domain, the pre-existing data as a domain knowledge base can provide training data with a comprehensive list of features for building the CNN model.Given an input text, we do initial segmentation (according to the occurrences of these words in the knowledge base) to generate text segments for CNN classification with probabilities. Then, based on the probabilistic CNN classification results, we work on finding the most probable labelling way to the whole input text.As a complementary, a bidirectional sequencing model learned on-demand from test data is finally deployed to do further adjustment to some problematic labelled segments.Our experimental study conducted on several real data collections shows that CNN-IETS improves the extraction quality of state-of-art approaches by more than 10%.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"284 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83432220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose a new problem: covering the optimal time window over temporal data. Given a duration constraint d and a set of users, where each user has multiple time intervals, the goal is to find all time windows which (1) are greater than or equal to the duration d, and (2) can be covered by the intervals of as many users as possible. This problem applies to real scenarios where people need to determine the best time to maximize the number of people involved in an activity, e.g., meeting organization and online live video broadcasting. As far as we know, there is no existing algorithm that can solve the problem directly. In this paper, we propose two algorithms to solve it. The first is a baseline algorithm called sliding time window (STW), where we utilize the start and end points of all users' intervals to construct time windows satisfying the duration d, and then calculate the number of users whose intervals can cover each time window. The second method, named TLI, is designed based on the data structures of the Timeline Index in SAP HANA. The TLI algorithm consists of three consecutive phases aimed at improving efficiency: construction of the Timeline Index, calculation of the valid user set, and calculation of the time windows. Within the third phase, we prune the number of time windows by keeping track of the number of users in the current optimal time window, which helps shrink the search space. Through extensive experimental evaluations, we find that the TLI algorithm outperforms STW by two orders of magnitude in terms of query time.
{"title":"Covering the Optimal Time Window Over Temporal Data","authors":"Bin Cao, Chenyu Hou, Jing Fan","doi":"10.1145/3132847.3132935","DOIUrl":"https://doi.org/10.1145/3132847.3132935","url":null,"abstract":"In this paper, we propose a new problem: covering the optimal time window over temporal data. Given a duration constraint d and a set of users where each user has multiple time intervals, the goal is to find all time windows which (1) are greater than or equal to the duration d, and (2) can be covered by the intervals from as many as possible users. This problem can be applied to real scenarios where people need to determine the best time for maximizing the number of people to be involved in an activity, e.g., the meeting organization and the online live video broadcasting. As far as we know, there is no existing algorithm that can solve the problem directly. In this paper, we propose two algorithms to solve the problem, the first one is considered as a baseline algorithm called sliding time window (STW), where we utilize the start and end points of all users - intervals to construct time windows satisfying duration d. And then we calculate the number of users whose intervals can cover the current time window. The second method, named TLI, is designed based on the the data structures from the Timeline Index in SAP HANA. In TLI algorithm, we conduct three consecutive phases to achieve the purpose of efficiency improvement, namely construction of Timeline Index, calculation of valid user set and calculation of time windows. Within the third phase, we prune the number of time windows by keeping track of the number of users in current optimal time window, which can help shrink the search space. Through extensive experimental evaluations, we find TLI algorithm outperforms STW two orders of magnitude in terms of querying time.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"SE-11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84638126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extractive text summarization has been an extensively researched problem in the field of natural language understanding. While conventional approaches rely mostly on manually compiled features to generate the summary, few attempts have been made to develop data-driven systems for extractive summarization. To this end, we present a fully data-driven end-to-end deep network, which we call Hybrid MemNet, for the single-document summarization task. The network learns a continuous unified representation of a document before generating its summary. It jointly captures local and global sentential information along with the notion of summary-worthy sentences. Experimental results on two different corpora confirm that our model shows significant performance gains compared with state-of-the-art baselines.
{"title":"Hybrid MemNet for Extractive Summarization","authors":"A. Singh, Manish Gupta, Vasudeva Varma","doi":"10.1145/3132847.3133127","DOIUrl":"https://doi.org/10.1145/3132847.3133127","url":null,"abstract":"Extractive text summarization has been an extensive research problem in the field of natural language understanding. While the conventional approaches rely mostly on manually compiled features to generate the summary, few attempts have been made in developing data-driven systems for extractive summarization. To this end, we present a fully data-driven end-to-end deep network which we call as Hybrid MemNet for single document summarization task. The network learns the continuous unified representation of a document before generating its summary. It jointly captures local and global sentential information along with the notion of summary worthy sentences. Experimental results on two different corpora confirm that our model shows significant performance gains compared with the state-of-the-art baselines.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"105 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79545684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LinkedIn, as a professional network, serves the career needs of 450 million plus members. The task of a job recommendation system is to find suitable jobs among a corpus of several million jobs and serve them in real time under tight latency constraints. Job search involves finding suitable job listings given a user, query and context. A typical scoring function for both search and recommendations involves evaluating a function that matches various fields in the job description with various fields in the member profile. This in turn translates to evaluating a function with several thousands of features to get the right ranking. In recommendations, evaluating all the jobs in the corpus for all members is not possible given the latency constraints. On the other hand, reducing the candidate set could potentially involve loss of relevant jobs. We present a way to model the underlying complex ranking function via decision trees. The branches within the decision trees are query clauses, and hence the decision trees can be mapped onto real-time queries. We developed an offline framework which evaluates the quality of the decision tree with respect to latency and recall. We tested the approach on job search and recommendations on LinkedIn, and A/B tests show significant improvements in member engagement and latency. Our techniques helped reduce job search latency by over 67% and our recommendations latency by over 55%. Our techniques show a 3.5% improvement in applications from job recommendations, primarily due to reduced timeouts from upstream services. As of writing, the approach powers all of job search and recommendations on LinkedIn.
{"title":"Latency Reduction via Decision Tree Based Query Construction","authors":"Aman Grover, Dhruv Arya, Ganesh Venkataraman","doi":"10.1145/3132847.3132865","DOIUrl":"https://doi.org/10.1145/3132847.3132865","url":null,"abstract":"LinkedIn as a professional network serves the career needs of 450 Million plus members. The task of job recommendation system is to nd the suitable job among a corpus of several million jobs and serve this in real time under tight latency constraints. Job search involves nding suitable job listings given a user, query and context. Typical scoring function for both search and recommendations involves evaluating a function that matches various elds in the job description with various elds in the member pro le. This in turn translates to evaluating a function with several thousands of features to get the right ranking. In recommendations, evaluating all the jobs in the corpus for all members is not possible given the latency constraints. On the other hand, reducing the candidate set could potentially involve loss of relevant jobs. We present a way to model the underlying complex ranking function via decision trees. The branches within the decision trees are query clauses and hence the decision trees can be mapped on to real time queries. We developed an o ine framework which evaluates the quality of the decision tree with respect to latency and recall. We tested the approach on job search and recommendations on LinkedIn and A/B tests show signi cant improvements in member engagement and latency. Our techniques helped reduce job search latency by over 67% and our recommendations latency by over 55%. Our techniques show 3.5% improvement in applications from job recommendations primarily due to reduced timeouts from upstream services. As of writing the approach powers all of job search and recommendations on LinkedIn.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85309641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marco Cristo, Raíza Hanada, A. Carvalho, Fernando Anglada Lores, M. G. Pimentel
Word recognition is a challenging task faced by many applications, especially in very noisy scenarios. This problem is usually seen as the transmission of a word through a noisy channel, such that it is necessary to determine which known word of a lexicon corresponds to the received string. To make this feasible, only a reduced set of candidate words is selected. They are usually chosen if they can be transformed into the input string by applying up to k character edit operations. To rank the candidates, the most effective estimates use domain knowledge about noise sources and error distributions, extracted from real usage data. In scenarios with much noise, however, such estimates, and the index strategies normally required, do not scale well, as they grow exponentially with k and the lexicon size. In this work, we propose very efficient methods for word recognition in very noisy scenarios which support effective edit-based distance algorithms in a Mor-Fraenkel index, searchable using minimum perfect hashing. The method allows the early processing of the most promising candidates, such that fast pruned searches present negligible loss in word ranking quality. We also propose a linear heuristic for estimating edit-based distances which takes advantage of information already provided by the index. Our methods achieve precision similar to a state-of-the-art approach, while being about ten times faster.
{"title":"Fast Word Recognition for Noise channel-based Models in Scenarios with Noise Specific Domain Knowledge","authors":"Marco Cristo, Raíza Hanada, A. Carvalho, Fernando Anglada Lores, M. G. Pimentel","doi":"10.1145/3132847.3133028","DOIUrl":"https://doi.org/10.1145/3132847.3133028","url":null,"abstract":"Word recognition is a challenging task faced by many applications, specially in very noisy scenarios. This problem is usually seen as the transmission of a word through a noisy-channel, such that it is necessary to determine which known word of a lexicon is the received string. To be feasible, just a reduced set of candidate words are selected. They are usually chosen if they can be transformed into the input string by applying up to k character edit operations. To rank the candidates, the most effective estimates use domain knowledge about noise sources and error distributions, extracted from real use data. In scenarios with much noise, however, such estimates, and the index strategies normally required, do not scale well as they grow exponentially with k and the lexicon size. In this work, we propose very efficient methods for word recognition in very noisy scenarios which support effective edit-based distance algorithms in a Mor-Fraenkel index, searchable using a minimum perfect hashing. The method allows the early processing of most promising candidates, such that fast pruned searches present negligible loss in word ranking quality. We also propose a linear heuristic for estimating edit-based distances which take advantage of information already provided by the index. Our methods achieve precision similar to a state-of-the-art approach, being about ten times faster.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90787331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multileaved comparison methods generalize interleaved comparison methods to provide a scalable approach for comparing ranking systems based on regular user interactions. Such methods enable the increasingly rapid research and development of search engines. However, existing multileaved comparison methods that provide reliable outcomes do so by degrading the user experience during evaluation. Conversely, current multileaved comparison methods that maintain the user experience cannot guarantee correctness. Our contribution is two-fold. First, we propose a theoretical framework for systematically comparing multileaved comparison methods using the notions of considerateness, which concerns maintaining the user experience, and fidelity, which concerns reliable correct outcomes. Second, we introduce a novel multileaved comparison method, Pairwise Preference Multileaving (PPM), that performs comparisons based on document-pair preferences, and prove that it is considerate and has fidelity. We show empirically that, compared to previous multileaved comparison methods, PPM is more sensitive to user preferences and scalable with the number of rankers being compared.
{"title":"Sensitive and Scalable Online Evaluation with Theoretical Guarantees","authors":"Harrie Oosterhuis, M. de Rijke","doi":"10.1145/3132847.3132895","DOIUrl":"https://doi.org/10.1145/3132847.3132895","url":null,"abstract":"Multileaved comparison methods generalize interleaved comparison methods to provide a scalable approach for comparing ranking systems based on regular user interactions. Such methods enable the increasingly rapid research and development of search engines. However, existing multileaved comparison methods that provide reliable outcomes do so by degrading the user experience during evaluation. Conversely, current multileaved comparison methods that maintain the user experience cannot guarantee correctness. Our contribution is two-fold. First, we propose a theoretical framework for systematically comparing multileaved comparison methods using the notions of considerateness, which concerns maintaining the user experience, and fidelity, which concerns reliable correct outcomes. Second, we introduce a novel multileaved comparison method, Pairwise Preference Multileaving (PPM), that performs comparisons based on document-pair preferences, and prove that it is considerate and has fidelity. We show empirically that, compared to previous multileaved comparison methods, PPM is more sensitive to user preferences and scalable with the number of rankers being compared.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91270613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}