Chenguang Wang, Yangqiu Song, Haoran Li, Yizhou Sun, Ming Zhang, Jiawei Han
Measuring network similarity is a fundamental data mining problem. Mainstream similarity measures mainly leverage structural information about the entities in the network without considering network semantics. In the real world, heterogeneous information networks (HINs) with rich semantics are ubiquitous. However, existing network similarity measures do not generalize well to HINs because they fail to capture HIN semantics. The meta-path has been proposed and demonstrated as an effective way to represent semantics in HINs, and the original meta-path based similarities (e.g., PathSim and KnowSim) have been successful in computing entity proximity in HINs. The intuition is that the more meta-path instances there are between two entities, the more similar the entities are. Thus the original meta-path similarity applies only to computing the proximity of two neighboring (connected) entities. In this paper, we propose the distant meta-path similarity, which is able to capture HIN semantics between two distant (isolated) entities to provide more meaningful entity proximity. The main idea is that even if two entities share no neighborhood entities (i.e., no meta-path instances connect them), the more similar their neighborhood entities are, the more similar the two entities should be. We then find the optimum distant meta-path similarity by exploring the similarity hypothesis space based on different theoretical foundations. We show the state-of-the-art similarity performance of the distant meta-path similarity on two text-based HINs and make the datasets publicly available.
{"title":"Distant Meta-Path Similarities for Text-Based Heterogeneous Information Networks","authors":"Chenguang Wang, Yangqiu Song, Haoran Li, Yizhou Sun, Ming Zhang, Jiawei Han","doi":"10.1145/3132847.3133029","DOIUrl":"https://doi.org/10.1145/3132847.3133029","journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management"},"publicationDate":"2017-11-06"}
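The limitation that motivates the distant meta-path similarity can be seen with PathSim, which the abstract above names. The following is a minimal sketch, assuming a symmetric Document-Term-Document meta-path and a hypothetical toy document-term matrix (`doc_term` is not from the paper). It uses the standard PathSim definition, 2·M[x][y] / (M[x][x] + M[y][y]), where M counts meta-path instances; it does not implement the distant similarity itself.

```python
def commuting_matrix(adjacency):
    """Return M = A * A^T, whose entry (i, j) counts the instances of a
    symmetric meta-path (here Document-Term-Document) between i and j."""
    n = len(adjacency)
    return [
        [sum(a * b for a, b in zip(adjacency[i], adjacency[j])) for j in range(n)]
        for i in range(n)
    ]


def pathsim(m, x, y):
    """PathSim score: 2 * M[x][y] / (M[x][x] + M[y][y]).
    Returns 0.0 when neither entity has any meta-path instance."""
    denom = m[x][x] + m[y][y]
    return 2.0 * m[x][y] / denom if denom else 0.0


# Hypothetical toy network: rows are documents, columns are terms.
doc_term = [
    [1, 1, 0, 0],  # doc 0
    [1, 1, 0, 0],  # doc 1 shares both of its terms with doc 0
    [0, 0, 1, 1],  # doc 2 shares no term with doc 0 or doc 1
]
M = commuting_matrix(doc_term)
print(pathsim(M, 0, 1))  # 1.0
print(pathsim(M, 0, 2))  # 0.0
```

Documents 0 and 2 score zero because no meta-path instance connects them, regardless of how related their terms might be. This is the "distant (isolated)" case the abstract targets.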
Recent benchmark studies show that MPI-based distributed implementations of DBSCAN, e.g., PDSDBSCAN, outperform other implementations such as those based on Apache Spark. However, the communication cost of MPI-based DBSCAN increases drastically with the number of processors, which makes it inefficient for large-scale problems. In this paper, we propose PS-DBSCAN, a parallel DBSCAN algorithm that combines the disjoint-set data structure and the Parameter Server framework to minimize communication cost. Since data points within the same cluster may be distributed over different workers, which results in several disjoint sets, merging them incurs large communication costs. In our algorithm, we employ a fast global union approach to merge the disjoint sets and alleviate the communication burden. Experiments on datasets of different scales demonstrate that PS-DBSCAN outperforms PDSDBSCAN with a 2-10 times speedup in communication efficiency. We have released PS-DBSCAN on an algorithm platform called Platform of AI (PAI) in Alibaba Cloud.
{"title":"A Communication Efficient Parallel DBSCAN Algorithm based on Parameter Server","authors":"Xu Hu, Jun Huang, Minghui Qiu","doi":"10.1145/3132847.3133112","DOIUrl":"https://doi.org/10.1145/3132847.3133112","journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management"},"publicationDate":"2017-11-06"}
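The disjoint-set structure that the PS-DBSCAN abstract above relies on can be sketched in a few lines. This is a single-process illustration only: the `worker_edges` data is hypothetical, and neither the Parameter Server communication nor the paper's fast global union is modeled. It only shows how density-reachable pairs found by separate workers collapse into shared clusters once their sets are unioned.

```python
class DisjointSet:
    """Union-find with path halving and union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        while self.parent[x] != x:
            # Path halving: point x at its grandparent while walking up.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the shallower tree under the deeper one
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1


# Hypothetical pairs of density-reachable points, as found by two workers.
worker_edges = [
    [(0, 1), (1, 2)],  # worker A
    [(2, 3), (4, 5)],  # worker B; point 2 links its set to worker A's
]
ds = DisjointSet(6)
for edges in worker_edges:
    for a, b in edges:
        ds.union(a, b)

clusters = {}
for point in range(6):
    clusters.setdefault(ds.find(point), []).append(point)
print(sorted(clusters.values()))  # [[0, 1, 2, 3], [4, 5]]
```

In a distributed setting each `union` that crosses workers implies a message, which is the cost the abstract says a global union step is meant to reduce.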
Ee-Peng Lim, M. Winslett, M. Sanderson, A. Fu, Jimeng Sun, Shane Culpepper, Eric Lo, Joyce Ho, D. Donato, R. Agrawal, Yu Zheng, Carlos Castillo, Aixin Sun, V. Tseng, Chenliang Li
Since 1992, the ACM International Conference on Information and Knowledge Management (CIKM) has brought together leading researchers and developers from the knowledge management, information retrieval, and data management communities to discuss cutting-edge research on advanced knowledge and information systems. We are pleased to present the 26th edition of CIKM on 6-10 November, 2017, at the Pan Pacific Singapore hotel, with the special theme of Smart Cities, Smart Nations. This year our attendees will enjoy four keynote speakers: Rajeev Rastogi (Amazon), Qiang Yang (HKUST), Rada Mihalcea (Michigan), and K Ananth Krishnan (Tata Consultancy Services). In 6-7 parallel sessions, our program includes presentations of 171 full research papers, 119 short research papers, and 30 demonstrations of new research advances. The program's focus this year can be seen at a glance in the word cloud at right, constructed from the titles of all accepted research papers. Also on offer are eight tutorials on timely research topics, and six collocated workshops on topics ranging from history to transportation, biomedicine to bias. We are excited about our greatly expanded data analytics competition this year, the CIKM AnalytiCup. During the past nine months, over 1500 teams from all over the world have vied to win over $60,000 in AnalytiCup prizes and travel money by solving real-world analytics problems posed by our corporate sponsors Alibaba/Shenzhen Meteorological Bureau, DataSpark, and Lazada. A fourth competition, a weekend-long hackathon sponsored by DHL, takes place immediately before the conference. The finalists from all four competitions come together on 6 November for a final showdown in front of corporate judges. Solution summaries from finalist teams in the first three competitions can be found in these proceedings. Also new this year are several other events aimed directly at practitioners. 
During the main conference, we are offering hands-on tutorials on the hot topics of scalable deep learning and scalable data science. The Case Studies track, intended to highlight the experiences and lessons learned by early adopters, debuts this year with 23 studies of technology adoption in interesting applications. And immediately before the main conference, CIKMconnect brings together students and industry for posters, technical discussions, recruiting events, and networking. It takes a village to produce a major conference! Our program committee chairs, senior PC and PC members valiantly and gracefully handled a record total number of submissions: 855 full research papers, 419 short research papers, 80 demos, and 103 case studies. Each submission was reviewed by three program committee members, each a recognized expert in the field, and an independent committee selected the full paper awards recipients.
{"title":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","authors":"Ee-Peng Lim, M. Winslett, M. Sanderson, A. Fu, Jimeng Sun, Shane Culpepper, Eric Lo, Joyce Ho, D. Donato, R. Agrawal, Yu Zheng, Carlos Castillo, Aixin Sun, V. Tseng, Chenliang Li","doi":"10.1145/3132847","DOIUrl":"https://doi.org/10.1145/3132847","journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management"},"publicationDate":"2017-11-06"}
Fact-checking political discussions has become an essential cog in computational journalism. This task encompasses an important sub-task: identifying the set of statements with 'check-worthy' claims. Previous work has treated this as a simple text classification problem, discounting the nuances involved in determining what makes statements check-worthy. We introduce a dataset of political debates from the 2016 US Presidential election campaign annotated using all major fact-checking media outlets and show that there is a need to model conversation context, debate dynamics and implicit world knowledge. We design a multi-classifier system, TATHYA, that models latent groupings in data and improves on state-of-the-art systems in detecting check-worthy statements by 19.5% in F1-score on a held-out test set, gaining primarily in recall.
{"title":"TATHYA: A Multi-Classifier System for Detecting Check-Worthy Statements in Political Debates","authors":"Ayush Patwari, Dan Goldwasser, S. Bagchi","doi":"10.1145/3132847.3133150","DOIUrl":"https://doi.org/10.1145/3132847.3133150","journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management"},"publicationDate":"2017-11-06"}
Sedigheh Eslami, Asia J. Biega, Rishiraj Saha Roy, G. Weikum
Users who wish to leave an online forum often do not have the freedom to erase their data completely from the service provider's (SP) system. The primary reason behind this is that analytics on such user data form a core component of many online providers' business models. On the other hand, if the profiles reside in the SP's system in an unchanged form, major privacy violations may occur if the infrastructure is compromised or the SP is acquired by another organization. In this work, we investigate an alternative solution to standard profile removal, where posts of different users are split and merged into synthetic mediator profiles. The goal of our framework is to preserve the SP's data mining utility as far as possible, while minimizing users' privacy risks. We present several mechanisms for assigning user posts to such mediator accounts and show the effectiveness of our framework using data from StackExchange and various health forums.
{"title":"Privacy of Hidden Profiles: Utility-Preserving Profile Removal in Online Forums","authors":"Sedigheh Eslami, Asia J. Biega, Rishiraj Saha Roy, G. Weikum","doi":"10.1145/3132847.3133140","DOIUrl":"https://doi.org/10.1145/3132847.3133140","journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management"},"publicationDate":"2017-11-06"}
Xin Huang, Byron Choi, Jianliang Xu, W. K. Cheung, Yanchun Zhang, Jiming Liu
Data summarization, which presents a small subset of a dataset to users, has been widely applied in numerous applications and systems. Many datasets are coded with hierarchical terminologies, e.g., the International Classification of Diseases-9, Medical Subject Headings, and Gene Ontology, to name a few. In this paper, we study the problem of selecting a diverse set of k elements to summarize an input dataset with hierarchical terminologies, and of visualizing the summary in an ontology structure. We propose an efficient greedy algorithm that solves the problem with a (1-1/e) ≈ 63% approximation guarantee. Preliminary experimental results on real-world datasets show the effectiveness and efficiency of the proposed algorithm for data summarization.
{"title":"Ontology-based Graph Visualization for Summarized View","authors":"Xin Huang, Byron Choi, Jianliang Xu, W. K. Cheung, Yanchun Zhang, Jiming Liu","doi":"10.1145/3132847.3133113","DOIUrl":"https://doi.org/10.1145/3132847.3133113","journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management"},"publicationDate":"2017-11-06"}
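The (1-1/e) guarantee quoted in the summarization abstract above is the classic bound for greedily maximizing a monotone submodular function under a cardinality constraint. The sketch below illustrates that greedy pattern with set coverage as the objective. The coverage objective, the `candidates` data, and the function name `greedy_summary` are assumptions made for illustration; the abstract does not state the authors' actual diversity objective.

```python
def greedy_summary(candidates, k):
    """Pick up to k candidate terms, each time taking the term with the
    largest marginal gain in covered records (ties broken by name)."""
    chosen, covered = [], set()
    for _ in range(k):
        best, best_gain = None, 0
        for name in sorted(candidates):
            if name in chosen:
                continue
            gain = len(candidates[name] - covered)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # nothing left adds coverage
            break
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered


# Hypothetical ontology terms mapped to the record ids they describe.
candidates = {
    "cardiovascular": {1, 2, 3, 4},
    "hypertension": {1, 2},
    "respiratory": {5, 6},
    "asthma": {5},
}
chosen, covered = greedy_summary(candidates, 2)
print(chosen)  # ['cardiovascular', 'respiratory']
```

After "cardiovascular" is chosen, "hypertension" adds nothing new, so the second pick goes to "respiratory". This diminishing-returns behavior is what makes the greedy bound apply.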
Deep neural networks (DNNs) are popular in diverse fields such as computer vision and natural language processing. DNN inference tasks are emerging as a service provided by cloud computing environments. However, cloud-hosted DNN inference faces new challenges in scheduling workloads for the best Quality of Service (QoS), due to its dependence on batch size, model complexity and resource allocation. This paper represents the QoS metric as a utility function of response delay and inference accuracy. We first propose a simple and effective heuristic approach that keeps response delay low and satisfies the requirement on processing throughput. Then we describe an advanced deep reinforcement learning (RL) approach that learns to schedule from experience. The RL scheduler is trained to maximize QoS, using a set of system statuses as the input to the RL policy model. Our approach performs scheduling actions only when there are free GPUs, thus reducing scheduling overhead compared with common RL schedulers that run at every time step. We evaluate the schedulers on a simulation platform and demonstrate the advantages of RL over heuristics.
{"title":"QoS-Aware Scheduling of Heterogeneous Servers for Inference in Deep Neural Networks","authors":"Zhou Fang, Tong Yu, O. Mengshoel, Rajesh K. Gupta","doi":"10.1145/3132847.3133045","DOIUrl":"https://doi.org/10.1145/3132847.3133045","journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management"},"publicationDate":"2017-11-06"}
Prerna Khurana, P. Agarwal, Gautam M. Shroff, L. Vig, A. Srinivasan
We describe an automated assistant for answering frequently asked questions; our system has been deployed, and is currently answering HR-related queries in two different areas (leave management and health insurance) for a large number of users. The needs of a large global corporation lead us to model a frequently asked question (FAQ) as an equivalence class of actually asked questions, for which there is a common answer (certified as being consistent with the organization's policy). When a new question is posed to our system, it finds the class of the question and responds with the answer for that class. At this point, the system is either correct (gives the correct answer), incorrect (gives a wrong answer), or incomplete (says "I don't know"). We employ a hybrid deep-learning architecture in which a BiLSTM-based classifier is combined with a second BiLSTM-based Siamese network in an iterative manner: questions for which the classifier makes an error during training are used to generate a set of misclassified question-question pairs. These, along with correct pairs, are used to train the Siamese network to drive apart the (hidden) representations of the misclassified pairs. We present experimental results from our deployment showing that our iteratively trained hybrid network: (a) results in better performance than using just a classifier network or just a Siamese network; (b) performs better than state-of-the-art sentence classifiers in the two areas in which it has been deployed, in terms of both accuracy and precision-recall tradeoff; and (c) also performs well on a benchmark public dataset. We also observe that using question-question pairs in our hybrid network results in marginally better performance than using question-to-answer pairs. Finally, estimates of precision and recall from the deployment of our automated assistant suggest that we can expect the burden on our HR department to drop from answering about 6000 queries a day to about 1000.
{"title":"Hybrid BiLSTM-Siamese network for FAQ Assistance","authors":"Prerna Khurana, P. Agarwal, Gautam M. Shroff, L. Vig, A. Srinivasan","doi":"10.1145/3132847.3132861","DOIUrl":"https://doi.org/10.1145/3132847.3132861","journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management"},"publicationDate":"2017-11-06"}
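The idea of training a Siamese network to "drive apart" the representations of misclassified question pairs, as described in the FAQ-assistant abstract above, is commonly expressed with a contrastive loss. The sketch below assumes that loss; the abstract does not say which objective the authors use. The vectors stand in for BiLSTM hidden representations and are hypothetical, as is the `margin` value.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same_class, margin):
    """Contrastive loss for one pair of representations.
    Same-class pairs are penalized by squared distance (pulled together);
    different-class pairs are penalized only while closer than `margin`
    (pushed apart until they clear the margin)."""
    d = euclidean(u, v)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical 2-d stand-ins for question representations.
q_a = [0.0, 0.0]
q_near = [3.0, 4.0]  # distance 5 from q_a
q_far = [6.0, 8.0]   # distance 10 from q_a

print(contrastive_loss(q_a, q_near, same_class=True, margin=6.0))   # 25.0
print(contrastive_loss(q_a, q_near, same_class=False, margin=6.0))  # 1.0
print(contrastive_loss(q_a, q_far, same_class=False, margin=6.0))   # 0.0
```

A misclassified pair corresponds to the second case: two questions from different classes whose representations sit inside the margin, so the loss is positive until training separates them.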
In this paper, we present a new study of Web query entity disambiguation (QED), which is the task of disambiguating different candidate entities in a knowledge base given their mentions in a query. QED is particularly challenging because queries are often too short to provide the rich contextual information required by traditional entity disambiguation methods. We propose several methods to tackle the problem of QED. First, we explore the use of a deep neural network (DNN) for capturing the character-level textual information in queries. Our DNN approach maps queries and their candidate reference entities to feature vectors in a latent semantic space where the distance between a query and its correct reference entity is minimized. Second, we utilize the Web search result information of queries to help generate large amounts of weakly supervised training data for the DNN model. Third, we propose a two-stage training method to combine large-scale weakly supervised data with a small amount of human-labeled data, which can significantly boost the performance of a DNN model. The effectiveness of our approach is demonstrated in experiments using large-scale real-world datasets.
{"title":"Deep Context Modeling for Web Query Entity Disambiguation","authors":"Zhen Liao, Xinying Song, Yelong Shen, Saekoo Lee, Jianfeng Gao, Ciya Liao","doi":"10.1145/3132847.3132856","DOIUrl":"https://doi.org/10.1145/3132847.3132856","url":null,"PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"95 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76027046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
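As a rough illustration of the ranking step described in the QED abstract above (mapping a query and its candidate entities into a shared vector space and preferring the closest candidate), here is a minimal sketch. It is not the authors' DNN: it substitutes hashed character trigram counts for learned features and cosine similarity for the learned distance. The function names, the vector dimension, and the example strings are all assumptions made for illustration.

```python
import math
import zlib

DIM = 512  # assumed size of the hashed feature space (illustrative only)


def char_trigram_vector(text, dim=DIM):
    """Hash the character trigrams of `text` into a fixed-size count vector."""
    padded = f"#{text.lower()}#"  # boundary markers so short strings still yield trigrams
    vec = [0.0] * dim
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, candidates):
    """Return (similarity, candidate) pairs sorted from most to least similar."""
    q = char_trigram_vector(query)
    scored = [(cosine(q, char_trigram_vector(c)), c) for c in candidates]
    return sorted(scored, reverse=True)
```

Calling `rank_candidates("apple store", ["apple (fruit)", "apple store"])` returns the candidates ordered by similarity to the query. In the approach the abstract describes, the hand-built trigram vectors would be replaced by the feature vectors the trained DNN produces.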
Meng Pang, Yiu-ming Cheung, Binghui Wang, Risheng Liu
Single sample face recognition is one of the most challenging problems in face recognition (FR), where only a single sample per person (SSPP) is enrolled in the gallery set for training. Although patch-based methods have achieved great success in FR with SSPP, they still have significant limitations. In this work, we propose a new patch-based method, namely Robust Heterogeneous Discriminative Analysis (RHDA), to tackle FR with SSPP. Compared with existing patch-based methods, RHDA enhances robustness against complex facial variations in two respects. First, we develop a novel Fisher-like criterion, which incorporates two manifold embeddings, to learn heterogeneous discriminative representations of image patches. Specifically, for each patch, the Fisher-like criterion preserves the reconstruction relationship of neighboring patches from the same person while suppressing neighboring patches from different persons. Second, we present two distance metrics, i.e., patch-to-patch distance and patch-to-manifold distance, and develop a fusion strategy that combines the recognition outputs of these two distance metrics via joint majority voting for identification. Experimental results on the AR and FERET benchmark datasets demonstrate the efficacy of the proposed method.
{"title":"Robust Heterogeneous Discriminative Analysis for Single Sample Per Person Face Recognition","authors":"Meng Pang, Yiu-ming Cheung, Binghui Wang, Risheng Liu","doi":"10.1145/3132847.3133096","DOIUrl":"https://doi.org/10.1145/3132847.3133096","url":null,"PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80010281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
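The fusion step mentioned in the RHDA abstract above (combining the outputs of the patch-to-patch and patch-to-manifold distance metrics via joint majority voting) could look roughly like the following sketch. The abstract does not specify the voting details, so this assumes that every patch contributes one identity vote per metric and that the pooled votes are counted together. The function name and the tie-breaking rule are hypothetical.

```python
from collections import Counter


def joint_majority_vote(patch_to_patch_labels, patch_to_manifold_labels):
    """Pool per-patch identity predictions from both metrics and return the winner.

    Assumption: ties are broken in favor of the label seen first in the pooled
    votes, which is how Counter.most_common orders equal counts.
    """
    votes = list(patch_to_patch_labels) + list(patch_to_manifold_labels)
    if not votes:
        raise ValueError("no votes to fuse")
    return Counter(votes).most_common(1)[0][0]
```

Here each argument is a sequence with one predicted identity per image patch, so a probe face split into three patches would supply three labels per metric and six votes in total.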