Adversarially Learned Anomaly Detection
Houssam Zenati, Manon Romain, Chuan-Sheng Foo, Bruno Lecouat, V. Chandrasekhar
Anomaly detection is a significant and hence well-studied problem. However, developing effective anomaly detection methods for complex, high-dimensional data remains a challenge. As Generative Adversarial Networks (GANs) are able to model the complex, high-dimensional distributions of real-world data, they offer a promising approach to this challenge. In this work, we propose Adversarially Learned Anomaly Detection (ALAD), an anomaly detection method based on bi-directional GANs that derives adversarially learned features for the anomaly detection task. ALAD then uses reconstruction errors based on these features to determine whether a data sample is anomalous. ALAD builds on recent advances that ensure data-space and latent-space cycle consistency and stabilize GAN training, which results in significantly improved anomaly detection performance. ALAD achieves state-of-the-art performance on a range of image and tabular datasets while being several hundred-fold faster at test time than the only published GAN-based method.
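To make the scoring step concrete, here is a minimal sketch of a feature-space reconstruction score of the kind the abstract describes, assuming a trained encoder E, generator G, and a feature map f (e.g., an intermediate discriminator layer). The toy linear stand-ins are hypothetical and not the paper's implementation.

```python
import numpy as np

def alad_score(x, E, G, f):
    """Feature-space reconstruction anomaly score (sketch).
    Larger scores flag more anomalous samples."""
    x_rec = G(E(x))                       # cycle the sample through latent space
    return np.linalg.norm(f(x) - f(x_rec), ord=1)

# Toy usage with random linear stand-ins for the trained networks.
rng = np.random.default_rng(0)
W_e = rng.normal(size=(8, 32))    # encoder weights: data (32) -> latent (8)
W_g = rng.normal(size=(32, 8))    # generator weights: latent -> data
W_f = rng.normal(size=(16, 32))   # "discriminator feature" weights
E = lambda x: W_e @ x
G = lambda z: W_g @ z
f = lambda x: np.tanh(W_f @ x)
print(alad_score(rng.normal(size=32), E, G, f))
```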
{"title":"Adversarially Learned Anomaly Detection","authors":"Houssam Zenati, Manon Romain, Chuan-Sheng Foo, Bruno Lecouat, V. Chandrasekhar","doi":"10.1109/ICDM.2018.00088","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00088","url":null,"abstract":"Anomaly detection is a significant and hence well-studied problem. However, developing effective anomaly detection methods for complex and high-dimensional data remains a challenge. As Generative Adversarial Networks (GANs) are able to model the complex high-dimensional distributions of real-world data, they offer a promising approach to address this challenge. In this work, we propose an anomaly detection method, Adversarially Learned Anomaly Detection (ALAD) based on bi-directional GANs, that derives adversarially learned features for the anomaly detection task. ALAD then uses reconstruction errors based on these adversarially learned features to determine if a data sample is anomalous. ALAD builds on recent advances to ensure data-space and latent-space cycle-consistencies and stabilize GAN training, which results in significantly improved anomaly detection performance. ALAD achieves state-of-the-art performance on a range of image and tabular datasets while being several hundred-fold faster at test time than the only published GAN-based method.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126533126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Collaborative Translational Metric Learning
Chanyoung Park, Donghyun Kim, Xing Xie, Hwanjo Yu
Recently, matrix factorization–based recommendation methods have been criticized for violating the triangle inequality. Although several metric learning–based approaches have been proposed to overcome this issue, existing approaches typically project each user to a single point in the metric space and thus cannot properly model the intensity and heterogeneity of user-item relationships in implicit feedback. In this paper, we propose TransCF to discover such latent user-item relationships embodied in implicit user-item interactions. Inspired by the translation mechanism popularized by knowledge graph embedding, we construct user-item specific translation vectors from the neighborhood information of users and items, and translate each user toward items according to the user's relationships with those items. Our proposed method outperforms several state-of-the-art methods for top-N recommendation on seven real-world datasets by up to 17% in terms of hit ratio. We also conduct extensive qualitative evaluations of the learned translation vectors to ascertain the benefit of adopting the translation mechanism for implicit feedback–based recommendation.
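One plausible reading of the translation mechanism, sketched below: score an interaction by the distance between the translated user and the item, with the translation vector built from neighborhood embeddings. The combination rule (elementwise product of neighborhood means) is an illustrative assumption, not necessarily TransCF's exact formula.

```python
import numpy as np

def transcf_score(u, i, r_ui):
    """Distance between the translated user and the item; smaller = stronger
    preference. u, i, r_ui are embedding vectors of equal dimension."""
    return np.linalg.norm(u + r_ui - i)

def translation_vector(consumed_item_embs, consuming_user_embs):
    # Hypothetical neighborhood rule: combine the mean embedding of the items
    # the user consumed with the mean of the users who consumed the item.
    return consumed_item_embs.mean(axis=0) * consuming_user_embs.mean(axis=0)

rng = np.random.default_rng(1)
u, i = rng.normal(size=16), rng.normal(size=16)
r_ui = translation_vector(rng.normal(size=(5, 16)), rng.normal(size=(7, 16)))
print(transcf_score(u, i, r_ui))
```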
{"title":"Collaborative Translational Metric Learning","authors":"Chanyoung Park, Donghyun Kim, Xing Xie, Hwanjo Yu","doi":"10.1109/ICDM.2018.00052","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00052","url":null,"abstract":"Recently, matrix factorization–based recommendation methods have been criticized for the problem raised by the triangle inequality violation. Although several metric learning–based approaches have been proposed to overcome this issue, existing approaches typically project each user to a single point in the metric space, and thus do not suffice for properly modeling the intensity and the heterogeneity of user-item relationships in implicit feedback. In this paper, we propose TransCF to discover such latent user-item relationships embodied in implicit user-item interactions. Inspired by the translation mechanism popularized by knowledge graph embedding, we construct user-item specific translation vectors by employing the neighborhood information of users and items, and translate each user toward items according to the user's relationships with the items. Our proposed method outperforms several state-of-the-art methods for top-N recommendation on seven real-world data by up to 17% in terms of hit ratio. We also conduct extensive qualitative evaluations on the translation vectors learned by our proposed method to ascertain the benefit of adopting the translation mechanism for implicit feedback-based recommendations.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130460521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DeepAD: A Deep Learning Based Approach to Stroke-Level Abnormality Detection in Handwritten Chinese Character Recognition
Tie-Qiang Wang, Cheng-Lin Liu
Writing abnormality detection is very important in education applications but has received little attention from the community. Considering that abnormally written strokes (writing errors or largely distorted strokes) affect the decision confidence of a classifier, we propose an approach named DeepAD to detect stroke-level abnormalities in handwritten Chinese characters by analyzing the decision process of a deep neural network (DNN). First, to minimize the effect of stroke-width variation in handwritten characters, we propose a skeletonization method based on a fully convolutional network (FCN) with cross detection. Using a convolutional neural network (CNN) for character classification, we evaluate the role of each skeleton pixel by calculating its impact on the classifier's prediction, and detect abnormal strokes by connecting pixels with negative impact. For quantitative evaluation, we build a template-free dataset named SA-CASIA-HW containing 3696 handwritten Chinese characters with various stroke-level abnormalities, spanning 3000+ classes written by 60 individual writers. We validate the usefulness of the proposed DeepAD in comparison with related methods.
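The per-pixel impact analysis can be sketched as a simple occlusion test: erase one skeleton pixel, re-classify, and record the change in confidence. This is a hedged stand-in for the paper's decision-process analysis; the function names and the zero-fill occlusion are assumptions.

```python
import numpy as np

def pixel_impacts(img, skeleton_pixels, predict_proba, true_class):
    """Per-pixel impact on the classifier's confidence (occlusion-style sketch).

    predict_proba : callable mapping an image array to class probabilities.
    A negative impact means confidence in the true class *rises* once the
    pixel is erased, marking it as a candidate part of an abnormal stroke.
    """
    base = predict_proba(img)[true_class]
    impacts = {}
    for (r, c) in skeleton_pixels:
        occluded = img.copy()
        occluded[r, c] = 0.0              # erase the skeleton pixel, re-classify
        impacts[(r, c)] = base - predict_proba(occluded)[true_class]
    return impacts

def abnormal_candidates(impacts):
    # Pixels of negative impact; DeepAD connects these into stroke segments.
    return [p for p, v in impacts.items() if v < 0]
```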
{"title":"DeepAD: A Deep Learning Based Approach to Stroke-Level Abnormality Detection in Handwritten Chinese Character Recognition","authors":"Tie-Qiang Wang, Cheng-Lin Liu","doi":"10.1109/ICDM.2018.00176","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00176","url":null,"abstract":"Writing abnormality detection is very important in education applications, but has received little attention by the community. Considering that abnormally written strokes (writing error or largely distorted stroke) affect the decision confidence of classifier, we propose an approach named DeepAD to detect stroke-level abnormalities in handwritten Chinese characters by analyzing the decision process of deep neural network (DNN). Firstly, to minimize the effect of stroke width variation of handwritten characters, we propose a skeletonization method based on fully convolutional network (FCN) with cross detection. With a convolutional neural network (CNN) for character classification, we evaluate the role of each skeleton pixel by calculating its impact on the prediction of classifier, and detect abnormal strokes by connecting pixels of negative impact. For quantitative evaluation of performance, we build a template-free dataset named SA-CASIA-HW containing 3696 handwritten Chinese characters with various stroke-level abnormalities, and spanning 3000+ different classes written by 60 individual writers. We validate the usefulness of the proposed DeepAD with comparison to related methods.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132696285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate Causal Inference on Discrete Data
Kailash Budhathoki, Jilles Vreeken
Additive Noise Models (ANMs) provide a theoretically sound approach to inferring the most likely causal direction between a pair of random variables given only a sample from their joint distribution. The key assumption is that the effect is a function of the cause, with additive noise that is independent of the cause. In many cases ANMs are identifiable. Their performance, however, hinges on the chosen dependence measure, i.e., the assumption we make about the true distribution. In this paper we propose to use Shannon entropy to measure the dependence within an ANM, which gives us a general approach in which we neither have to assume a true distribution nor have to perform explicit significance tests during optimization. The information-theoretic formulation gives us a general, efficient, identifiable, and, as the experiments show, highly accurate method for causal inference on pairs of discrete variables, achieving (near) 100% accuracy on both synthetic and real data.
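The entropy-based decision rule can be sketched directly: prefer the direction with the smaller total entropy H(cause) + H(noise). Below, the regression function is fit as the conditional mode, a simple heuristic stand-in for the paper's residual-entropy-minimizing search.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Plug-in Shannon entropy (bits) of a discrete sample."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def residual_entropy(cause, effect):
    # Fit f as the conditional mode of the effect given the cause, then
    # measure the entropy of the additive residual effect - f(cause).
    f = {x: Counter(e for c, e in zip(cause, effect) if c == x).most_common(1)[0][0]
         for x in set(cause)}
    residuals = [e - f[c] for c, e in zip(cause, effect)]
    return entropy(residuals)

def infer_direction(X, Y):
    # Prefer the direction with the smaller total entropy H(cause) + H(noise).
    xy = entropy(X) + residual_entropy(X, Y)
    yx = entropy(Y) + residual_entropy(Y, X)
    return "X->Y" if xy < yx else "Y->X"

rng = np.random.default_rng(2)
X = rng.integers(0, 5, size=1000)
Y = (X ** 2 % 7) + rng.integers(0, 2, size=1000)   # Y = f(X) + independent noise
print(infer_direction(list(X), list(Y)))           # expected: X->Y
```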
{"title":"Accurate Causal Inference on Discrete Data","authors":"Kailash Budhathoki, Jilles Vreeken","doi":"10.1109/ICDM.2018.00105","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00105","url":null,"abstract":"Additive Noise Models (ANMs) provide a theoretically sound approach to inferring the most likely causal direction between pairs of random variables given only a sample from their joint distribution. The key assumption is that the effect is a function of the cause, with additive noise that is independent of the cause. In many cases ANMs are identifiable. Their performance, however, hinges on the chosen dependence measure, the assumption we make on the true distribution. In this paper we propose to use Shannon entropy to measure the dependence within an ANM, which gives us a general approach by which we do not have to assume a true distribution, nor have to perform explicit significance tests during optimization. The information-theoretic formulation gives us a general, efficient, identifiable, and, as the experiments show, highly accurate method for causal inference on pairs of discrete variables—achieving (near) 100% accuracy on both synthetic and real data.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131820719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing Very Fast Decision Trees with Local Split-Time Predictions
Viktor Losing, H. Wersing, B. Hammer
An increasing number of industrial areas recognize the opportunities of Big Data, requiring highly efficient algorithms that enable real-time processing to reduce the burden of data storage and maintenance. Decision trees are extremely fast, highly accurate, and easy to use in practice. Merging multiple decision trees into an ensemble yields one of the most powerful machine learning methods. The Very Fast Decision Tree is the state-of-the-art incremental decision tree induction algorithm, capable of learning from massive data streams. It is successful due to its theoretical guarantees based on the Hoeffding bound as well as its competitive performance in terms of classification accuracy and time/space efficiency. In this paper, we increase its efficiency even further by replacing its global splitting scheme, which periodically attempts a split every n_min examples. Instead, we utilize local statistics to predict the split-time, thus avoiding the unnecessary split-attempts that usually dominate the computational cost. Concretely, we use the class distributions of previous split-attempts to approximate the minimum number of examples until the Hoeffding bound is met. This cautious approach yields a low delay by design and at the same time reduces the number of split-attempts. We extensively evaluate our method using common stream-learning benchmarks, also considering non-stationary environments. The experiments confirm a substantially reduced run-time without a loss in classification performance.
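The core idea of predicting the split-time is an inversion of the Hoeffding bound: given an observed gap between the gains of the two best split attributes, solve for the number of examples n at which the bound drops below the gap. The sketch below shows the bound inversion only; the paper's predictor additionally uses the class distributions of previous split-attempts to estimate the future gap.

```python
import math

def predicted_split_time(delta_g, r, delta):
    """Smallest n for which the Hoeffding bound
    eps = sqrt(r^2 * ln(1/delta) / (2n)) falls below the observed
    gain gap delta_g, i.e. the earliest point a split can succeed."""
    if delta_g <= 0:
        return math.inf
    return math.ceil(r ** 2 * math.log(1.0 / delta) / (2.0 * delta_g ** 2))

# e.g. gap of 0.05 in information gain, range r = log2(2) = 1 for binary
# classification, confidence parameter delta = 1e-7
print(predicted_split_time(0.05, 1.0, 1e-7))  # ~3224 examples
```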
{"title":"Enhancing Very Fast Decision Trees with Local Split-Time Predictions","authors":"Viktor Losing, H. Wersing, B. Hammer","doi":"10.1109/ICDM.2018.00044","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00044","url":null,"abstract":"An increasing number of industrial areas recognize the opportunities of Big Data, requiring highly efficient algorithms which enable real-time processing to reduce the burden of data storage and maintenance. Decision trees are extremely fast, highly accurate and easy to use in practice. Merging multiple decision trees to an ensemble leads to one of the most powerful machine learning methods. The Very Fast Decision Tree is the state-of-the-art incremental decision tree induction algorithm, capable of learning from massive data streams. It is successful due to its theoretical guarantees based on the Hoeffding bound as well as its competitive performance in terms of classification accuracy and time / space efficiency. In this paper, we increase the efficiency even further by replacing its global splitting scheme, which periodically tries to split every n_min examples. Instead, we utilize local statistics to predict the split-time, thus, avoiding unnecessary split-attempts, usually dominating the computational cost. Concretely, we use the class distributions of previous split-attempts to approximate the minimum number of examples until the Hoeffding bound is met. This cautious approach yields by design a low delay and reduces the number of split-attempts at the same time. We extensively evaluate our method using common stream-learning benchmarks also considering non-stationary environments. The experiments confirm a substantially reduced run-time without a loss in classification performance.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131109579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing Question Understanding and Representation for Knowledge Base Relation Detection
Zihan Xu, Haitao Zheng, Zuo-You Fu, Wei Wang
Relation detection is a key step in Knowledge Base Question Answering (KBQA), but it is far from solved due to the significant differences between questions and relations. Previous studies usually treat relation detection as a text matching task and mainly focus on reducing the detection error through better representations of KB relations. However, understanding the questions is also important, since they are generally more varied. Moreover, the text-pair representation requires improvement, because KB relations are not always direct counterparts of questions. In this paper, we propose a novel system with enhanced question understanding and representation processes for KB relation detection (QURRD). We design a KBQA-specific slot filling module based on a Bi-LSTM-CRF for question understanding. In addition, with two CNNs for modeling and matching text pairs respectively, QURRD obtains richer question-relation representations for semantic analysis and achieves better performance through learning from multiple tasks. We conduct experiments on both single-relation (SimpleQuestions) and multi-relation (WebQSP) benchmarks. Results show that QURRD is robust to the diversity of questions and outperforms the state-of-the-art system on both tasks.
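The full system pairs a Bi-LSTM-CRF slot-filling module with two CNNs; the PyTorch sketch below shows only the paired-CNN matching idea, with hypothetical dimensions and random tensors standing in for real question/relation embeddings.

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """1-D CNN text encoder with max-over-time pooling (sketch)."""
    def __init__(self, emb_dim=100, n_filters=64, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel, padding=1)

    def forward(self, x):                  # x: (batch, seq_len, emb_dim)
        h = torch.relu(self.conv(x.transpose(1, 2)))
        return h.max(dim=2).values         # (batch, n_filters)

# Two separate encoders, one per side of the text pair, matched by cosine
# similarity over the pooled representations.
q_enc, r_enc = CNNEncoder(), CNNEncoder()
questions = torch.randn(8, 20, 100)        # (batch, tokens, embedding dim)
relations = torch.randn(8, 5, 100)
scores = torch.cosine_similarity(q_enc(questions), r_enc(relations), dim=1)
```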
{"title":"Enhancing Question Understanding and Representation for Knowledge Base Relation Detection","authors":"Zihan Xu, Haitao Zheng, Zuo-You Fu, Wei Wang","doi":"10.1109/ICDM.2018.00186","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00186","url":null,"abstract":"Relation detection is a key step in Knowledge Base Question Answering (KBQA), but far from solved due to the significant differences between questions and relations. Previous studies usually treat relation detection as a text matching task, and mainly focus on reducing the detection error with better representations of KB relations. However, the understanding of questions is also important since they are generally more varied. And the text pair representation requires improvement because KB relations are not always counterparts of questions. In this paper, we propose a novel system with enhanced question understanding and representation processes for KB relation detection (QURRD). We design a KBQA-specific slot filling module based on Bi-LSTM-CRF for question understanding. Besides, with two CNNs for modeling and matching text pairs respectively, QURRD obtains richer question-relation representations for semantic analysis, and achieves better performance through learning from multiple tasks. We conduct experiments on both single-relation (Simple-Questions) and multi-relation (WebQSP) benchmarks. Results show that QURRD is robust against the diversity of questions and outperforms the state-of-the-art system on both tasks.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123786938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Binarized Attributed Network Embedding
Hong Yang, Shirui Pan, Peng Zhang, Ling Chen, Defu Lian, Chengqi Zhang
Attributed network embedding enables joint representation learning of node links and attributes. Existing attributed network embedding models are designed in continuous Euclidean spaces, which often introduces data redundancy and imposes heavy storage and computation costs. To this end, we present a Binarized Attributed Network Embedding model (BANE for short) to learn binary node representations. Specifically, we define a new Weisfeiler-Lehman proximity matrix to capture the data dependence between node links and attributes by aggregating the attribute and link information of neighboring nodes to a given target node in a layer-wise manner. Based on the Weisfeiler-Lehman proximity matrix, we formulate a new Weisfeiler-Lehman matrix factorization learning function under the binary node representation constraint. The learning problem is a mixed-integer optimization, and an efficient cyclic coordinate descent (CCD) algorithm is used as the solver. Node classification and link prediction experiments on real-world datasets show that the proposed BANE model outperforms state-of-the-art network embedding methods.
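A rough sketch of the two stages under stated assumptions: the layer-wise aggregation is modeled as repeated multiplication by a row-normalized adjacency with self-loops, and the CCD-based binary factorization is replaced by a crude sign-of-SVD stand-in. Neither is BANE's exact construction.

```python
import numpy as np

def wl_proximity(A, X, gamma=2):
    """Layer-wise aggregation of attributes over links (sketch).

    A : (n, n) adjacency matrix, X : (n, f) node attributes.
    Each power of the normalized adjacency mixes attributes from one hop
    further away, in the spirit of Weisfeiler-Lehman aggregation.
    """
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    M = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalize
    P = X.copy()
    for _ in range(gamma):
        P = M @ P
    return P

def binarize_embedding(P, dim=16):
    # Crude stand-in for BANE's CCD-based binary factorization: take the
    # sign of the top singular directions of P (requires dim <= min(n, f)).
    U, _, _ = np.linalg.svd(P, full_matrices=False)
    return np.sign(U[:, :dim])

rng = np.random.default_rng(3)
A = (rng.random((30, 30)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T                     # symmetric, no self-loops
X = rng.random((30, 20))
B = binarize_embedding(wl_proximity(A, X), dim=16) # sign-valued embedding
```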
{"title":"Binarized attributed network embedding","authors":"Hong Yang, Shirui Pan, Peng Zhang, Ling Chen, Defu Lian, Chengqi Zhang","doi":"10.1109/ICDM.2018.8626170","DOIUrl":"https://doi.org/10.1109/ICDM.2018.8626170","url":null,"abstract":"Attributed network embedding enables joint representation learning of node links and attributes. Existing attributed network embedding models are designed in continuous Euclidean spaces which often introduce data redundancy and impose challenges to storage and computation costs. To this end, we present a Binarized Attributed Network Embedding model (BANE for short) to learn binary node representation. Specifically, we define a new Weisfeiler-Lehman proximity matrix to capture data dependence between node links and attributes by aggregating the information of node attributes and links from neighboring nodes to a given target node in a layer-wise manner. Based on the Weisfeiler-Lehman proximity matrix, we formulate a new Weisfiler-Lehman matrix factorization learning function under the binary node representation constraint. The learning problem is a mixed integer optimization and an efficient cyclic coordinate descent (CCD) algorithm is used as the solution. Node classification and link prediction experiments on real-world datasets show that the proposed BANE model outperforms the state-of-the-art network embedding methods.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123833450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cost Effective Multi-label Active Learning via Querying Subexamples
Xia Chen, Guoxian Yu, C. Domeniconi, J. Wang, Zhao Li, Z. Zhang
Multi-label active learning addresses the scarcity of labeled examples by querying the most valuable unlabeled examples, or example-label pairs, to achieve better performance at limited query cost. Current multi-label active learning methods require scrutiny of the whole example in order to obtain its annotation. In contrast, one can find positive evidence with respect to a label by examining specific patterns (i.e., subexamples) rather than the whole example, making the annotation process more efficient. Based on this observation, we propose a novel two-stage cost-effective multi-label active learning framework called CMAL. In the first stage, a novel example-label pair selection strategy is introduced. Our strategy leverages the label correlation and label-space sparsity of multi-label examples to select the most uncertain example-label pairs. Specifically, an example's unknown relevant label can be inferred from correlated labels already assigned to the example, thus reducing the uncertainty of the unknown label. In addition, the larger the number of examples relevant to a particular label, the smaller that label's uncertainty. In the second stage, CMAL queries the most plausible positive subexample-label pairs of the selected example-label pairs. Comprehensive experiments on multi-label datasets collected from different domains demonstrate the effectiveness of our approach for cost-effective querying. We also show that leveraging label correlation and label sparsity contributes to saving costs.
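One plausible instantiation of the first-stage pair selection, sketched below: base uncertainty from the classifier's probability, discounted both by evidence from correlated known labels and by label frequency. The specific weighting is an illustrative assumption, not CMAL's actual formula.

```python
import numpy as np

def pair_scores(prob, known, label_corr, label_freq):
    """Uncertainty score for each unannotated label of one example (sketch).

    prob       : (L,) predicted relevance probabilities for the example
    known      : dict {label_index: 0/1} of labels already annotated
    label_corr : (L, L) label correlation matrix in [0, 1]
    label_freq : (L,) fraction of training examples relevant to each label
    """
    L = len(prob)
    scores = np.zeros(L)
    for l in range(L):
        if l in known:
            continue
        u = 1.0 - abs(2.0 * prob[l] - 1.0)        # base uncertainty, max at 0.5
        # correlated labels already known to be relevant lower the uncertainty...
        evidence = sum(label_corr[l, k] for k, v in known.items() if v == 1)
        # ...and so does a large number of relevant examples for the label
        scores[l] = u * (1.0 - min(evidence, 1.0)) * (1.0 - label_freq[l])
    return scores
```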
{"title":"Cost Effective Multi-label Active Learning via Querying Subexamples","authors":"Xia Chen, Guoxian Yu, C. Domeniconi, J. Wang, Zhao Li, Z. Zhang","doi":"10.1109/ICDM.2018.00109","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00109","url":null,"abstract":"Multi-label active learning addresses the scarce labeled example problem by querying the most valuable unlabeled examples, or example-label pairs, to achieve a better performance with limited query cost. Current multi-label active learning methods require the scrutiny of the whole example in order to obtain its annotation. In contrast, one can find positive evidence with respect to a label by examining specific patterns (i.e., subexample), rather than the whole example, thus making the annotation process more efficient. Based on this observation, we propose a novel two-stage cost effective multi-label active learning framework, called CMAL. In the first stage, a novel example-label pair selection strategy is introduced. Our strategy leverages label correlation and label space sparsity of multi-label examples to select the most uncertain example-label pairs. Specifically, the unknown relevant label of an example can be inferred from the correlated labels that are already assigned to the example, thus reducing the uncertainty of the unknown label. In addition, the larger the number of relevant examples of a particular label, the smaller the uncertainty of the label is. In the second stage, CMAL queries the most plausible positive subexample-label pairs of the selected example-label pairs. Comprehensive experiments on multi-label datasets collected from different domains demonstrate the effectiveness of our proposed approach on cost effective queries. We also show that leveraging label correlation and label sparsity contribute to saving costs.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121111940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Rectangle Counting on Massive Networks
Rong Zhu, Zhaonian Zou, Jianzhong Li
The rectangle has been recognized as an essential motif in a large number of real-world networks, and counting the rectangles in a network plays an important role in network analysis. This paper comprehensively studies the rectangle counting problem on large networks. We propose a novel counting paradigm called wedge-centric counting, where a wedge is a simple path consisting of three vertices. Unlike traditional edge-centric counting, wedge-centric counting uses wedges instead of edges as the building blocks of rectangles. Its main advantage is that it does not need to access two-hop neighbors. Based on this paradigm, we develop a collection of rectangle counting algorithms, including an in-memory algorithm with lower time complexity, an external-memory algorithm with optimal I/O complexity, and two randomized algorithms with provable error bounds. Experimental results on a variety of real networks verify the effectiveness and efficiency of the proposed wedge-centric rectangle counting algorithms.
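The wedge-centric idea admits a compact in-memory sketch: enumerate wedges by their center (touching only one-hop neighbors), group them by endpoint pair, and close any two wedges sharing an endpoint pair into a rectangle. This is a plain illustration of the paradigm, not the paper's optimized algorithm.

```python
from collections import defaultdict
from itertools import combinations

def count_rectangles(adj):
    """Count 4-cycles (rectangles) via wedge-centric counting.

    adj : dict mapping each vertex to a set of neighbors (vertices sortable).
    A wedge is a path a-m-b; any two wedges sharing the same endpoint pair
    {a, b} close a rectangle. Each rectangle has two diagonals and is
    therefore counted twice, so the sum is halved.
    """
    wedges = defaultdict(int)
    for m in adj:                                  # m is the wedge center
        for a, b in combinations(sorted(adj[m]), 2):
            wedges[(a, b)] += 1
    total = sum(w * (w - 1) // 2 for w in wedges.values())
    return total // 2

# A single 4-cycle contains exactly one rectangle.
square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(count_rectangles(square))   # 1
```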
{"title":"Fast Rectangle Counting on Massive Networks","authors":"Rong Zhu, Zhaonian Zou, Jianzhong Li","doi":"10.1109/ICDM.2018.00100","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00100","url":null,"abstract":"Rectangle has been recognized as an essential motif in a large number of real-world networks. Counting rectangles in a network plays an important role in network analysis. This paper comprehensively studies the rectangle counting problem on large networks. We propose a novel counting paradigm called the wedge-centric counting, where a wedge is a simple path consisting of three vertices. Unlike the traditional edge-centric counting, the wedge-centric counting uses wedges instead of edges as building blocks of rectangles. The main advantage of the wedge-centric counting is that it does not need to access two-hop neighbors. Based on this paradigm, we develop a collection of rectangle counting algorithms, including an in-memory algorithm with lower time complexity, an external-memory algorithm with the optimal I/O complexity, and two randomized algorithms with provable error bounds. The experimental results on a variety of real networks verify the effectiveness and the efficiency of the proposed wedge-centric rectangle counting algorithms.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125112943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DrugCom: Synergistic Discovery of Drug Combinations Using Tensor Decomposition
Huiyuan Chen, Jing Li
Personalized treatments and targeted therapies are the most promising approaches to treating complex diseases, especially cancer. However, drug resistance is often acquired after treatment. To overcome or reduce drug resistance, treatments using drug combinations have been actively investigated in the literature. Existing methods mainly focus on the chemical properties of drugs for potential combination therapies, without considering relationships among different diseases. They also often ignore the rich knowledge of drugs and diseases that can enhance the prediction of drug combinations. This motivates us to develop a new computational method that can predict beneficial drug combinations. We propose DrugCom, a tensor-based framework for computing drug combinations across different diseases by integrating multiple heterogeneous data sources on drugs and diseases. DrugCom first constructs a primary third-order tensor (i.e., drug × drug × disease) and several similarity matrices from multiple data sources regarding drugs (e.g., chemical structure) and diseases (e.g., disease phenotype). DrugCom then formulates an objective function that simultaneously factorizes the coupled tensor and matrices to reveal the molecular mechanisms of drug synergy. We adopt the alternating direction method of multipliers (ADMM) algorithm to effectively solve the optimization problem. Extensive experimental studies using real-world datasets demonstrate the superior performance of DrugCom.
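A minimal sketch of what such a coupled tensor-matrix objective could look like, assuming a CP-style reconstruction that reuses the drug factor on both drug modes and couples the similarity matrices through the same factors. The paper solves its objective with ADMM; this only illustrates the loss being minimized.

```python
import numpy as np

def drugcom_loss(T, D, P, Sd, Sp, lam=0.1):
    """Coupled tensor-matrix factorization objective (sketch).

    T  : (n, n, m) drug x drug x disease tensor of known synergies
    D  : (n, k) drug factor matrix, P : (m, k) disease factor matrix
    Sd : (n, n) drug similarity, Sp : (m, m) disease similarity
    """
    recon = np.einsum('ik,jk,lk->ijl', D, D, P)      # CP with shared drug factor
    loss = np.sum((T - recon) ** 2)
    loss += lam * np.sum((Sd - D @ D.T) ** 2)        # drug similarity coupling
    loss += lam * np.sum((Sp - P @ P.T) ** 2)        # disease similarity coupling
    return loss

rng = np.random.default_rng(4)
n, m, k = 10, 4, 3
D, P = rng.random((n, k)), rng.random((m, k))
T = np.einsum('ik,jk,lk->ijl', D, D, P)              # noiseless synthetic tensor
print(drugcom_loss(T, D, P, D @ D.T, P @ P.T))       # ~0 at the planted solution
```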
{"title":"DrugCom: Synergistic Discovery of Drug Combinations Using Tensor Decomposition","authors":"Huiyuan Chen, Jing Li","doi":"10.1109/ICDM.2018.00108","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00108","url":null,"abstract":"Personalized treatments and targeted therapies are the most promising approaches to treat complex diseases, especially for cancer. However, drug resistance is often acquired after treatments. To overcome or reduce drug resistance, treatments using drug combinations have been actively investigated in the literature. Existing methods mainly focus on chemical properties of drugs for potential combination therapies without considering relationships among different diseases. Also, they often do not consider the rich knowledge of drugs and diseases, which can enhance the prediction of drug combinations. This motivates us to develop a new computational method that can predict the beneficial drug combinations. We propose DrugCom, a tensor-based framework for computing drug combinations across different diseases by integrating multiple heterogeneous data sources of drugs and diseases. DrugCom first constructs a primary third-order tensor (i.e., drug×drug ×disease) and several similarity matrices from multiple data sources regarding drugs (e.g., chemical structure) and diseases (e.g., disease phenotype). DrugCom then formulates an objective function, which simultaneously factorizes coupled tensor and matrices to reveal the molecular mechanisms of drug synergy. We adopt the alternating direction method of multipliers algorithm to effectively solve the optimization problem. Extensive experimental studies using real-world datasets demonstrate superior performance of DrugCom.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125115448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}