Proceedings of the 2017 ACM on Conference on Information and Knowledge Management: Latest Articles, Page 5

Distant Meta-Path Similarities for Text-Based Heterogeneous Information Networks

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Pub Date : 2017-11-06 DOI: 10.1145/3132847.3133029

Chenguang Wang, Yangqiu Song, Haoran Li, Yizhou Sun, Ming Zhang, Jiawei Han

Measuring network similarity is a fundamental data mining problem. Mainstream similarity measures mainly leverage structural information about the entities in the network without considering network semantics. In the real world, heterogeneous information networks (HINs) with rich semantics are ubiquitous. However, existing network similarity measures do not generalize well to HINs because they fail to capture HIN semantics. The meta-path has been proposed and demonstrated as an effective way to represent semantics in HINs, and the original meta-path based similarities (e.g., PathSim and KnowSim) have been successful in computing entity proximity in HINs. The intuition is that the more instances of meta-path(s) there are between two entities, the more similar the entities are. Thus the original meta-path similarity applies only to computing the proximity of two neighborhood (connected) entities. In this paper, we propose the distant meta-path similarity, which is able to capture HIN semantics between two distant (isolated) entities and so provide more meaningful entity proximity. The main idea is that even if two entities share no neighborhood entities (i.e., no meta-path instances connect them), the more similar their neighborhood entities are, the more similar the two entities should be. We then find the optimum distant meta-path similarity by exploring the similarity hypothesis space based on different theoretical foundations. We show the state-of-the-art similarity performance of distant meta-path similarity on two text-based HINs and make the datasets publicly available.
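
To make the limitation of connected-only measures concrete, here is a minimal Python sketch of PathSim for a symmetric meta-path, using its standard definition: twice the number of meta-path instances between x and y, divided by the sum of the instance counts from x to itself and from y to itself. The toy author-venue counts and the function names are illustrative assumptions, and the code shows the original measure only, not the distant meta-path similarity proposed in the paper.

```python
def path_count(row_x, row_y):
    """Number of meta-path instances between two entities for a symmetric
    meta-path, given each entity's half-path count vector."""
    return sum(a * b for a, b in zip(row_x, row_y))


def pathsim(row_x, row_y):
    """PathSim: 2 * |x~>y| / (|x~>x| + |y~>y|); 0.0 if neither has any path."""
    denom = path_count(row_x, row_x) + path_count(row_y, row_y)
    if denom == 0:
        return 0.0
    return 2.0 * path_count(row_x, row_y) / denom


# Hypothetical toy data: papers per author in venues [V1, V2, V3, V4].
authors = {
    "alice": [2, 1, 0, 0],
    "bob":   [1, 1, 0, 0],
    "carol": [0, 0, 3, 1],  # shares no venue with alice or bob
}

sim_connected = pathsim(authors["alice"], authors["bob"])    # 6/7
sim_isolated = pathsim(authors["alice"], authors["carol"])   # 0.0
```

In this toy network `sim_isolated` is zero because no Author-Venue-Author instance connects alice and carol, which is exactly the situation the abstract describes as two distant (isolated) entities.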

Citations: 22

A Communication Efficient Parallel DBSCAN Algorithm based on Parameter Server

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Pub Date : 2017-11-06 DOI: 10.1145/3132847.3133112

Xu Hu, Jun Huang, Minghui Qiu

Recent benchmark studies show that MPI-based distributed implementations of DBSCAN, e.g., PDSDBSCAN, outperform other implementations such as those built on Apache Spark. However, the communication cost of MPI-based DBSCAN increases drastically with the number of processors, which makes it inefficient for large-scale problems. In this paper, we propose PS-DBSCAN, a parallel DBSCAN algorithm that combines the disjoint-set data structure with the Parameter Server framework to minimize communication cost. Since data points within the same cluster may be distributed over different workers, resulting in several disjoint sets, merging them incurs large communication costs. In our algorithm, we employ a fast global union approach to merge the disjoint sets and alleviate the communication burden. Experiments on datasets of different scales demonstrate that PS-DBSCAN outperforms PDSDBSCAN with a 2-10 times speedup in communication efficiency. We have released PS-DBSCAN on an algorithm platform called Platform of AI (PAI) in Alibaba Cloud.
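
The merging problem can be illustrated with a minimal single-process Python sketch of a disjoint-set (union-find) structure. The worker layout and the `local_edges` lists are hypothetical, and the sketch only shows why partial clusters held by different workers have to be unioned; it does not reproduce the paper's fast global union or its Parameter Server protocol.

```python
class DisjointSet:
    """Union-find over point ids 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Path halving: point every visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # The smaller root id becomes the cluster representative.
            self.parent[max(ra, rb)] = min(ra, rb)


# Hypothetical input: each worker reports pairs of point ids that it found
# to be density-connected within its own partition.
local_edges = {
    "worker_0": [(0, 1), (1, 2)],
    "worker_1": [(3, 4), (2, 3)],  # (2, 3) links two partial clusters
    "worker_2": [(5, 6)],
}

ds = DisjointSet(7)
for edges in local_edges.values():
    for a, b in edges:
        ds.union(a, b)

labels = [ds.find(i) for i in range(7)]  # [0, 0, 0, 0, 0, 5, 5]
```

In a distributed setting every `union` that crosses workers implies communication, which is the cost the abstract says the global union approach is designed to reduce.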

Citations: 10

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Pub Date : 2017-11-06 DOI: 10.1145/3132847

Ee-Peng Lim, M. Winslett, M. Sanderson, A. Fu, Jimeng Sun, Shane Culpepper, Eric Lo, Joyce Ho, D. Donato, R. Agrawal, Yu Zheng, Carlos Castillo, Aixin Sun, V. Tseng, Chenliang Li

Since 1992, the ACM International Conference on Information and Knowledge Management (CIKM) has brought together leading researchers and developers from the knowledge management, information retrieval, and data management communities to discuss cutting-edge research on advanced knowledge and information systems. We are pleased to present the 26th edition of CIKM on 6-10 November, 2017, at the Pan Pacific Singapore hotel, with the special theme of Smart Cities, Smart Nations.

This year our attendees will enjoy four keynote speakers: Rajeev Rastogi (Amazon), Qiang Yang (HKUST), Rada Mihalcea (Michigan), and K Ananth Krishnan (Tata Consultancy Services). In 6-7 parallel sessions, our program includes presentations of 171 full research papers, 119 short research papers, and 30 demonstrations of new research advances. The program's focus this year can be seen at a glance in the word cloud at right, constructed from the titles of all accepted research papers. Also on offer are eight tutorials on timely research topics, and six collocated workshops on topics ranging from history to transportation, biomedicine to bias.

We are excited about our greatly expanded data analytics competition this year, the CIKM AnalytiCup. During the past nine months, over 1500 teams from all over the world have vied to win over $60,000 in AnalytiCup prizes and travel money by solving real-world analytics problems posed by our corporate sponsors Alibaba/Shenzhen Meteorological Bureau, DataSpark, and Lazada. A fourth competition, a weekend-long hackathon sponsored by DHL, takes place immediately before the conference. The finalists from all four competitions come together on 6 November for a final showdown in front of corporate judges. Solution summaries from finalist teams in the first three competitions can be found in these proceedings.

Also new this year are several other events aimed directly at practitioners. During the main conference, we are offering hands-on tutorials on the hot topics of scalable deep learning and scalable data science. The Case Studies track, intended to highlight the experiences and lessons learned by early adopters, debuts this year with 23 studies of technology adoption in interesting applications. And immediately before the main conference, CIKMconnect brings together students and industry for posters, technical discussions, recruiting events, and networking.

It takes a village to produce a major conference! Our program committee chairs, senior PC and PC members valiantly and gracefully handled a record total number of submissions: 855 full research papers, 419 short research papers, 80 demos, and 103 case studies. Each submission was reviewed by three program committee members, each a recognized expert in the field, and an independent committee selected the full paper awards recipients.

Citations: 25

TATHYA: A Multi-Classifier System for Detecting Check-Worthy Statements in Political Debates

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Pub Date : 2017-11-06 DOI: 10.1145/3132847.3133150

Ayush Patwari, Dan Goldwasser, S. Bagchi

Fact-checking political discussions has become an essential cog in computational journalism. This task encompasses an important sub-task: identifying the set of statements with 'check-worthy' claims. Previous work has treated this as a simple text classification problem, discounting the nuances involved in determining what makes statements check-worthy. We introduce a dataset of political debates from the 2016 US Presidential election campaign, annotated using all major fact-checking media outlets, and show that there is a need to model conversation context, debate dynamics and implicit world knowledge. We design a multi-classifier system, TATHYA, that models latent groupings in data and improves on state-of-the-art systems in detecting check-worthy statements by 19.5% in F1-score on a held-out test set, gaining primarily in recall.

Citations: 65

Privacy of Hidden Profiles: Utility-Preserving Profile Removal in Online Forums

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Pub Date : 2017-11-06 DOI: 10.1145/3132847.3133140

Sedigheh Eslami, Asia J. Biega, Rishiraj Saha Roy, G. Weikum

Users who wish to leave an online forum often do not have the freedom to erase their data completely from the service provider's (SP) system. The primary reason behind this is that analytics on such user data form a core component of many online providers' business models. On the other hand, if the profiles reside in the SP's system in an unchanged form, major privacy violations may occur if the infrastructure is compromised or the SP is acquired by another organization. In this work, we investigate an alternative solution to standard profile removal, where posts of different users are split and merged into synthetic mediator profiles. The goal of our framework is to preserve the SP's data mining utility as far as possible, while minimizing users' privacy risks. We present several mechanisms for assigning user posts to such mediator accounts and show the effectiveness of our framework using data from StackExchange and various health forums.
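
The split-and-merge idea can be sketched in a few lines of Python. The round-robin rule below is a hypothetical baseline chosen only for illustration, and the abstract does not say whether it is among the paper's mechanisms; the post data and the `mediator_N` naming are likewise assumptions.

```python
from collections import defaultdict


def assign_round_robin(posts, num_mediators):
    """Spread posts over synthetic mediator profiles in round-robin order.

    posts: list of (user_id, text) tuples.
    Returns {mediator_id: [text, ...]}; original user ids are dropped.
    """
    mediators = defaultdict(list)
    for i, (_user, text) in enumerate(posts):
        mediators[f"mediator_{i % num_mediators}"].append(text)
    return dict(mediators)


# Hypothetical forum data.
posts = [
    ("u1", "How do I tune a B-tree index?"),
    ("u1", "Follow-up: what about hash indexes?"),
    ("u2", "Migraine triggers I have noticed"),
    ("u2", "Which doctor should I see?"),
    ("u3", "Sorting a linked list in place"),
]

profiles = assign_round_robin(posts, num_mediators=2)
```

Every post survives for analytics, but the two posts of u1 (and of u2) end up in different mediator profiles. A mechanism aimed at preserving utility would presumably group posts more deliberately, for instance by topic, which is the kind of trade-off the abstract refers to.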

Citations: 6

Ontology-based Graph Visualization for Summarized View

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Pub Date : 2017-11-06 DOI: 10.1145/3132847.3133113

Xin Huang, Byron Choi, Jianliang Xu, W. K. Cheung, Yanchun Zhang, Jiming Liu

Data summarization, which presents a small subset of a dataset to users, has been widely applied in numerous applications and systems. Many datasets are coded with hierarchical terminologies, e.g., the International Classification of Diseases-9, Medical Subject Headings, and Gene Ontology. In this paper, we study the problem of selecting a diverse set of k elements to summarize an input dataset with hierarchical terminologies, and visualize the summary in an ontology structure. We propose an efficient greedy algorithm that solves the problem with a (1 - 1/e) ≈ 63% approximation guarantee. Preliminary experimental results on real-world datasets show the effectiveness and efficiency of the proposed algorithm for data summarization.
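
A factor of 1 - 1/e is the classical guarantee for greedily maximizing a monotone submodular objective under a cardinality constraint. The Python sketch below uses plain set coverage as a stand-in objective; that choice, the term names, and the record ids are assumptions for illustration, since the abstract does not spell out the paper's objective.

```python
import math


def greedy_summary(coverage, k):
    """Pick up to k terms, each time adding the term with the largest
    marginal gain in covered records (ties broken by term name)."""
    chosen, covered = [], set()
    for _ in range(k):
        best = max(
            (t for t in sorted(coverage) if t not in chosen),
            key=lambda t: len(coverage[t] - covered),
            default=None,
        )
        if best is None or not coverage[best] - covered:
            break  # nothing left that adds new records
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered


# Hypothetical terminology terms and the record ids each one describes.
coverage = {
    "cardiovascular": {1, 2, 3, 4},
    "hypertension": {1, 2},
    "respiratory": {5, 6},
    "asthma": {5},
    "diabetes": {4, 7},
}

chosen, covered = greedy_summary(coverage, k=2)
bound = 1 - 1 / math.e  # about 0.632
```

With k = 2 the sketch selects `cardiovascular` first (four new records) and then `respiratory` (two new records), skipping `hypertension`, which adds nothing once `cardiovascular` is in the summary.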

Citations: 5

QoS-Aware Scheduling of Heterogeneous Servers for Inference in Deep Neural Networks

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Pub Date : 2017-11-06 DOI: 10.1145/3132847.3133045

Zhou Fang, Tong Yu, O. Mengshoel, Rajesh K. Gupta

Deep neural networks (DNNs) are popular in diverse fields such as computer vision and natural language processing. DNN inference tasks are emerging as a service provided by cloud computing environments. However, cloud-hosted DNN inference faces new challenges in workload scheduling for the best Quality of Service (QoS), due to dependence on batch size, model complexity and resource allocation. This paper represents the QoS metric as a utility function of response delay and inference accuracy. We first propose a simple and effective heuristic approach that keeps response delay low and satisfies the requirement on processing throughput. Then we describe an advanced deep reinforcement learning (RL) approach that learns to schedule from experience. The RL scheduler is trained to maximize QoS, using a set of system statuses as the input to the RL policy model. Our approach performs scheduling actions only when there are free GPUs, thus reducing scheduling overhead compared with common RL schedulers that run at every continuous time step. We evaluate the schedulers on a simulation platform and demonstrate the advantages of RL over heuristics.
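
As a rough illustration of a QoS utility that depends on both response delay and inference accuracy, consider the Python sketch below. The linear-discount utility, the one-second deadline, and the two-model catalogue are all hypothetical; the code is neither the paper's heuristic nor its RL scheduler, and it only mirrors the abstract's point that decisions are made when a GPU is free.

```python
def qos_utility(delay_s, accuracy, deadline_s=1.0):
    """Hypothetical utility: accuracy discounted linearly by response delay,
    dropping to zero once the delay reaches the deadline."""
    return accuracy * max(0.0, 1.0 - delay_s / deadline_s)


def pick_model(models, queue_wait_s, free_gpus):
    """Return the model name with the highest utility, or None when no GPU
    is free (no scheduling action is taken in that case)."""
    if free_gpus == 0:
        return None
    return max(
        models,
        key=lambda name: qos_utility(
            queue_wait_s + models[name]["latency_s"], models[name]["accuracy"]
        ),
    )


# Hypothetical catalogue: the larger model is more accurate but slower.
models = {
    "small": {"latency_s": 0.1, "accuracy": 0.60},
    "large": {"latency_s": 0.2, "accuracy": 0.80},
}

light_load = pick_model(models, queue_wait_s=0.0, free_gpus=1)  # "large"
heavy_load = pick_model(models, queue_wait_s=0.7, free_gpus=1)  # "small"
no_gpu = pick_model(models, queue_wait_s=0.0, free_gpus=0)      # None
```

Under light load the accurate model wins; once requests have already waited 0.7 s, the faster model yields the higher utility. A learned policy would replace the fixed `max` rule with a decision conditioned on the observed system status.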

Citations: 30

Hybrid BiLSTM-Siamese network for FAQ Assistance

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Pub Date : 2017-11-06 DOI: 10.1145/3132847.3132861

Prerna Khurana, P. Agarwal, Gautam M. Shroff, L. Vig, A. Srinivasan

We describe an automated assistant for answering frequently asked questions; our system has been deployed, and is currently answering HR-related queries in two different areas (leave management and health insurance) for a large number of users. The needs of a large global corporation lead us to model a frequently asked question (FAQ) as an equivalence class of actually asked questions, for which there is a common answer (certified as being consistent with the organization's policy). When a new question is posed to our system, it finds the class of the question and responds with the answer for that class. At this point, the system is either correct (gives the correct answer), incorrect (gives a wrong answer), or incomplete (says "I don't know"). We employ a hybrid deep-learning architecture in which a BiLSTM-based classifier is combined with a second BiLSTM-based Siamese network in an iterative manner: questions for which the classifier makes an error during training are used to generate a set of misclassified question-question pairs. These, along with correct pairs, are used to train the Siamese network to drive apart the (hidden) representations of the misclassified pairs. We present experimental results from our deployment showing that our iteratively trained hybrid network: (a) results in better performance than using just a classifier network, or just a Siamese network; (b) performs better than state-of-the-art sentence classifiers in the two areas in which it has been deployed, in terms of both accuracy and precision-recall tradeoff; and (c) also performs well on a benchmark public dataset. We also observe that using question-question pairs in our hybrid network results in marginally better performance than using question-to-answer pairs. Finally, estimates of precision and recall from the deployment of our automated assistant suggest that we can expect the burden on our HR department to drop from answering about 6000 queries a day to about 1000.
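
The pair-generation step can be sketched in plain Python. The keyword rule standing in for the BiLSTM classifier, the per-class exemplar questions, and the 1/0 labels are all assumptions made for illustration; the sketch only shows how classifier errors turn into question-question training pairs, not the networks themselves.

```python
def build_pairs(train, predict, exemplars):
    """Build (question, other_question, label) pairs for a Siamese network.

    label 1: both questions belong to the same FAQ class (correct pair).
    label 0: the classifier confused the two classes (misclassified pair).
    """
    pairs = []
    for question, true_cls in train:
        pred_cls = predict(question)
        pairs.append((question, exemplars[true_cls], 1))
        if pred_cls != true_cls:
            pairs.append((question, exemplars[pred_cls], 0))
    return pairs


def keyword_classifier(question):
    """Hypothetical stand-in for the trained classifier."""
    text = question.lower()
    if any(word in text for word in ("leave", "vacation", "sick")):
        return "leave"
    return "insurance"


# Hypothetical training questions and one exemplar question per FAQ class.
train = [
    ("How many sick days do I get?", "leave"),
    ("Is dental covered during leave?", "insurance"),
    ("Can I carry over vacation?", "leave"),
]
exemplars = {
    "leave": "What is the leave policy?",
    "insurance": "What does the health plan cover?",
}

pairs = build_pairs(train, keyword_classifier, exemplars)
```

Only the second question is misclassified, so it contributes one label-0 pair whose representations the Siamese network would be trained to push apart, alongside three label-1 pairs.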

Citations: 13

Deep Context Modeling for Web Query Entity Disambiguation

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Pub Date : 2017-11-06 DOI: 10.1145/3132847.3132856

Zhen Liao, Xinying Song, Yelong Shen, Saekoo Lee, Jianfeng Gao, Ciya Liao

In this paper, we present a new study of Web query entity disambiguation (QED), which is the task of disambiguating different candidate entities in a knowledge base given their mentions in a query. QED is particularly challenging because queries are often too short to provide the rich contextual information that traditional entity disambiguation methods require. We propose several methods to tackle the problem of QED. First, we explore the use of a deep neural network (DNN) for capturing the character-level textual information in queries. Our DNN approach maps queries and their candidate reference entities to feature vectors in a latent semantic space where the distance between a query and its correct reference entity is minimized. Second, we utilize the Web search result information of queries to help generate large amounts of weakly supervised training data for the DNN model. Third, we propose a two-stage training method to combine large-scale weakly supervised data with a small amount of human-labeled data, which can significantly boost the performance of a DNN model. The effectiveness of our approach is demonstrated in experiments using large-scale real-world datasets.
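
The idea of matching a query to the nearest candidate entity in a vector space can be sketched without any learning. The Python example below replaces the learned DNN mapping with raw character-trigram counts and cosine similarity, which is a deliberate simplification; the entity descriptions and function names are hypothetical.

```python
import math
from collections import Counter


def char_trigrams(text):
    """Bag of character trigrams, with '#' marking the string boundaries."""
    padded = f"#{text.lower()}#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def disambiguate(query, candidates):
    """Return the candidate entity whose description is closest to the query."""
    q = char_trigrams(query)
    return max(candidates, key=lambda name: cosine(q, char_trigrams(candidates[name])))


# Hypothetical knowledge-base entries for the mention "jaguar".
candidates = {
    "Jaguar Cars": "jaguar cars british car maker",
    "Jaguar (animal)": "jaguar large cat of the americas",
}

best = disambiguate("jaguar car price", candidates)
```

Here the extra context word "car" is enough to favor `Jaguar Cars`. A trained model would instead learn the mapping so that correct query-entity pairs end up close together even when they share few surface characters.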

Citations: 14

Robust Heterogeneous Discriminative Analysis for Single Sample Per Person Face Recognition

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Pub Date : 2017-11-06 DOI: 10.1145/3132847.3133096

Meng Pang, Yiu-ming Cheung, Binghui Wang, Risheng Liu

Single sample face recognition is one of the most challenging problems in face recognition (FR), where only a single sample per person (SSPP) is enrolled in the gallery set for training. Although patch-based methods have achieved great success in FR with SSPP, they still have significant limitations. In this work, we propose a new patch-based method, namely Robust Heterogeneous Discriminative Analysis (RHDA), to tackle FR with SSPP. Compared with the existing patch-based methods, RHDA enhances robustness against complex facial variations in two respects. First, we develop a novel Fisher-like criterion, which incorporates two manifold embeddings, to learn heterogeneous discriminative representations of image patches. Specifically, for each patch, the Fisher-like criterion is able to preserve the reconstruction relationship of neighboring patches from the same person, while suppressing neighboring patches from different persons. Second, we present two distance metrics, i.e., patch-to-patch distance and patch-to-manifold distance, and develop a fusion strategy that combines the recognition outputs of these two distance metrics via joint majority voting for identification. Experimental results on the AR and FERET benchmark datasets demonstrate the efficacy of the proposed method.
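The fusion step mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define how the two distances are computed or how ties are resolved, so the sketch assumes each metric has already produced a patches-by-identities distance table, lets every patch cast one vote per metric for its nearest identity, and pools all votes. The names `nearest_identity` and `joint_majority_vote`, the table layout, and the lowest-index tie-break are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def nearest_identity(row: Sequence[float]) -> int:
    """Return the index of the smallest distance in one patch's row."""
    return min(range(len(row)), key=row.__getitem__)


def joint_majority_vote(
    dist_a: Sequence[Sequence[float]],
    dist_b: Sequence[Sequence[float]],
) -> int:
    """Pool per-patch votes from two distance tables and pick the winner.

    dist_a[p][k] and dist_b[p][k] hold the distance from probe patch p to
    gallery identity k under each metric (an assumed layout, not one taken
    from the paper). Ties go to the lowest identity index (also an assumption).
    """
    if len(dist_a) != len(dist_b):
        raise ValueError("both tables must cover the same patches")
    # One vote per patch per metric, for the nearest gallery identity.
    votes = [nearest_identity(r) for r in dist_a]
    votes += [nearest_identity(r) for r in dist_b]
    counts = Counter(votes)
    best = max(counts.values())
    return min(k for k, c in counts.items() if c == best)


# Toy example: 3 probe patches, 3 gallery identities.
patch_to_patch = [[0.2, 0.9, 0.8], [0.7, 0.1, 0.9], [0.3, 0.8, 0.6]]
patch_to_manifold = [[0.5, 0.4, 0.9], [0.6, 0.2, 0.7], [0.9, 0.5, 0.1]]
print(joint_majority_vote(patch_to_patch, patch_to_manifold))  # prints 1
```

In the toy example the pooled votes are 0, 1, 0 from the first table and 1, 1, 2 from the second, so identity 1 wins with three of six votes. Pooling gives both metrics equal weight; a weighted vote would be an equally plausible reading of "fusion", and the abstract does not say which the paper uses.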



Citations: 1