Adversarially Learned Anomaly Detection
Houssam Zenati, Manon Romain, Chuan-Sheng Foo, Bruno Lecouat, V. Chandrasekhar
Anomaly detection is a significant and hence well-studied problem. However, developing effective anomaly detection methods for complex, high-dimensional data remains a challenge. As Generative Adversarial Networks (GANs) are able to model the complex, high-dimensional distributions of real-world data, they offer a promising approach to this challenge. In this work, we propose Adversarially Learned Anomaly Detection (ALAD), an anomaly detection method based on bi-directional GANs that derives adversarially learned features for the anomaly detection task. ALAD then uses reconstruction errors based on these features to determine whether a data sample is anomalous. ALAD builds on recent advances that ensure data-space and latent-space cycle consistency and stabilize GAN training, which results in significantly improved anomaly detection performance. ALAD achieves state-of-the-art performance on a range of image and tabular datasets while being several hundred-fold faster at test time than the only published GAN-based method.
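To make the scoring step concrete, here is a minimal sketch of a feature-space reconstruction score of the kind the abstract describes, assuming a trained encoder E, generator G, and a feature map f (e.g., an intermediate discriminator layer). The toy linear stand-ins are hypothetical and not the paper's implementation.

```python
import numpy as np

def alad_score(x, E, G, f):
    """Feature-space reconstruction anomaly score (sketch).
    Larger scores flag more anomalous samples."""
    x_rec = G(E(x))                       # cycle the sample through latent space
    return np.linalg.norm(f(x) - f(x_rec), ord=1)

# Toy usage with random linear stand-ins for the trained networks.
rng = np.random.default_rng(0)
W_e = rng.normal(size=(8, 32))    # encoder weights: data (32) -> latent (8)
W_g = rng.normal(size=(32, 8))    # generator weights: latent -> data
W_f = rng.normal(size=(16, 32))   # "discriminator feature" weights
E = lambda x: W_e @ x
G = lambda z: W_g @ z
f = lambda x: np.tanh(W_f @ x)
print(alad_score(rng.normal(size=32), E, G, f))
```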
{"title":"Adversarially Learned Anomaly Detection","authors":"Houssam Zenati, Manon Romain, Chuan-Sheng Foo, Bruno Lecouat, V. Chandrasekhar","doi":"10.1109/ICDM.2018.00088","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00088","url":null,"abstract":"Anomaly detection is a significant and hence well-studied problem. However, developing effective anomaly detection methods for complex and high-dimensional data remains a challenge. As Generative Adversarial Networks (GANs) are able to model the complex high-dimensional distributions of real-world data, they offer a promising approach to address this challenge. In this work, we propose an anomaly detection method, Adversarially Learned Anomaly Detection (ALAD) based on bi-directional GANs, that derives adversarially learned features for the anomaly detection task. ALAD then uses reconstruction errors based on these adversarially learned features to determine if a data sample is anomalous. ALAD builds on recent advances to ensure data-space and latent-space cycle-consistencies and stabilize GAN training, which results in significantly improved anomaly detection performance. ALAD achieves state-of-the-art performance on a range of image and tabular datasets while being several hundred-fold faster at test time than the only published GAN-based method.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126533126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Collaborative Translational Metric Learning
Chanyoung Park, Donghyun Kim, Xing Xie, Hwanjo Yu
Recently, matrix factorization–based recommendation methods have been criticized for violating the triangle inequality. Although several metric learning–based approaches have been proposed to overcome this issue, existing approaches typically project each user to a single point in the metric space and thus cannot properly model the intensity and heterogeneity of user-item relationships in implicit feedback. In this paper, we propose TransCF to discover such latent user-item relationships embodied in implicit user-item interactions. Inspired by the translation mechanism popularized by knowledge graph embedding, we construct user-item specific translation vectors from the neighborhood information of users and items, and translate each user toward items according to the user's relationships with those items. Our proposed method outperforms several state-of-the-art methods for top-N recommendation on seven real-world datasets by up to 17% in terms of hit ratio. We also conduct extensive qualitative evaluations of the learned translation vectors to ascertain the benefit of adopting the translation mechanism for implicit feedback–based recommendation.
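One plausible reading of the translation mechanism, sketched below: score an interaction by the distance between the translated user and the item, with the translation vector built from neighborhood embeddings. The combination rule (elementwise product of neighborhood means) is an illustrative assumption, not necessarily TransCF's exact formula.

```python
import numpy as np

def transcf_score(u, i, r_ui):
    """Distance between the translated user and the item; smaller = stronger
    preference. u, i, r_ui are embedding vectors of equal dimension."""
    return np.linalg.norm(u + r_ui - i)

def translation_vector(consumed_item_embs, consuming_user_embs):
    # Hypothetical neighborhood rule: combine the mean embedding of the items
    # the user consumed with the mean of the users who consumed the item.
    return consumed_item_embs.mean(axis=0) * consuming_user_embs.mean(axis=0)

rng = np.random.default_rng(1)
u, i = rng.normal(size=16), rng.normal(size=16)
r_ui = translation_vector(rng.normal(size=(5, 16)), rng.normal(size=(7, 16)))
print(transcf_score(u, i, r_ui))
```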
{"title":"Collaborative Translational Metric Learning","authors":"Chanyoung Park, Donghyun Kim, Xing Xie, Hwanjo Yu","doi":"10.1109/ICDM.2018.00052","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00052","url":null,"abstract":"Recently, matrix factorization–based recommendation methods have been criticized for the problem raised by the triangle inequality violation. Although several metric learning–based approaches have been proposed to overcome this issue, existing approaches typically project each user to a single point in the metric space, and thus do not suffice for properly modeling the intensity and the heterogeneity of user-item relationships in implicit feedback. In this paper, we propose TransCF to discover such latent user-item relationships embodied in implicit user-item interactions. Inspired by the translation mechanism popularized by knowledge graph embedding, we construct user-item specific translation vectors by employing the neighborhood information of users and items, and translate each user toward items according to the user's relationships with the items. Our proposed method outperforms several state-of-the-art methods for top-N recommendation on seven real-world data by up to 17% in terms of hit ratio. We also conduct extensive qualitative evaluations on the translation vectors learned by our proposed method to ascertain the benefit of adopting the translation mechanism for implicit feedback-based recommendations.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130460521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DeepAD: A Deep Learning Based Approach to Stroke-Level Abnormality Detection in Handwritten Chinese Character Recognition
Tie-Qiang Wang, Cheng-Lin Liu
Writing abnormality detection is very important in education applications but has received little attention from the community. Considering that abnormally written strokes (writing errors or largely distorted strokes) affect the decision confidence of a classifier, we propose an approach named DeepAD to detect stroke-level abnormalities in handwritten Chinese characters by analyzing the decision process of a deep neural network (DNN). First, to minimize the effect of stroke-width variation in handwritten characters, we propose a skeletonization method based on a fully convolutional network (FCN) with cross detection. Using a convolutional neural network (CNN) for character classification, we evaluate the role of each skeleton pixel by calculating its impact on the classifier's prediction, and detect abnormal strokes by connecting pixels with negative impact. For quantitative evaluation, we build a template-free dataset named SA-CASIA-HW containing 3696 handwritten Chinese characters with various stroke-level abnormalities, spanning 3000+ classes written by 60 individual writers. We validate the usefulness of the proposed DeepAD in comparison with related methods.
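The per-pixel impact analysis can be sketched as a simple occlusion test: erase one skeleton pixel, re-classify, and record the change in confidence. This is a hedged stand-in for the paper's decision-process analysis; the function names and the zero-fill occlusion are assumptions.

```python
import numpy as np

def pixel_impacts(img, skeleton_pixels, predict_proba, true_class):
    """Per-pixel impact on the classifier's confidence (occlusion-style sketch).

    predict_proba : callable mapping an image array to class probabilities.
    A negative impact means confidence in the true class *rises* once the
    pixel is erased, marking it as a candidate part of an abnormal stroke.
    """
    base = predict_proba(img)[true_class]
    impacts = {}
    for (r, c) in skeleton_pixels:
        occluded = img.copy()
        occluded[r, c] = 0.0              # erase the skeleton pixel, re-classify
        impacts[(r, c)] = base - predict_proba(occluded)[true_class]
    return impacts

def abnormal_candidates(impacts):
    # Pixels of negative impact; DeepAD connects these into stroke segments.
    return [p for p, v in impacts.items() if v < 0]
```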
{"title":"DeepAD: A Deep Learning Based Approach to Stroke-Level Abnormality Detection in Handwritten Chinese Character Recognition","authors":"Tie-Qiang Wang, Cheng-Lin Liu","doi":"10.1109/ICDM.2018.00176","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00176","url":null,"abstract":"Writing abnormality detection is very important in education applications, but has received little attention by the community. Considering that abnormally written strokes (writing error or largely distorted stroke) affect the decision confidence of classifier, we propose an approach named DeepAD to detect stroke-level abnormalities in handwritten Chinese characters by analyzing the decision process of deep neural network (DNN). Firstly, to minimize the effect of stroke width variation of handwritten characters, we propose a skeletonization method based on fully convolutional network (FCN) with cross detection. With a convolutional neural network (CNN) for character classification, we evaluate the role of each skeleton pixel by calculating its impact on the prediction of classifier, and detect abnormal strokes by connecting pixels of negative impact. For quantitative evaluation of performance, we build a template-free dataset named SA-CASIA-HW containing 3696 handwritten Chinese characters with various stroke-level abnormalities, and spanning 3000+ different classes written by 60 individual writers. We validate the usefulness of the proposed DeepAD with comparison to related methods.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132696285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate Causal Inference on Discrete Data
Kailash Budhathoki, Jilles Vreeken
Additive Noise Models (ANMs) provide a theoretically sound approach to inferring the most likely causal direction between a pair of random variables given only a sample from their joint distribution. The key assumption is that the effect is a function of the cause, with additive noise that is independent of the cause. In many cases ANMs are identifiable. Their performance, however, hinges on the chosen dependence measure, i.e., the assumption we make about the true distribution. In this paper we propose to use Shannon entropy to measure the dependence within an ANM, which gives us a general approach in which we neither have to assume a true distribution nor have to perform explicit significance tests during optimization. The information-theoretic formulation gives us a general, efficient, identifiable, and, as the experiments show, highly accurate method for causal inference on pairs of discrete variables, achieving (near) 100% accuracy on both synthetic and real data.
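The entropy-based decision rule can be sketched directly: prefer the direction with the smaller total entropy H(cause) + H(noise). Below, the regression function is fit as the conditional mode, a simple heuristic stand-in for the paper's residual-entropy-minimizing search.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Plug-in Shannon entropy (bits) of a discrete sample."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def residual_entropy(cause, effect):
    # Fit f as the conditional mode of the effect given the cause, then
    # measure the entropy of the additive residual effect - f(cause).
    f = {x: Counter(e for c, e in zip(cause, effect) if c == x).most_common(1)[0][0]
         for x in set(cause)}
    residuals = [e - f[c] for c, e in zip(cause, effect)]
    return entropy(residuals)

def infer_direction(X, Y):
    # Prefer the direction with the smaller total entropy H(cause) + H(noise).
    xy = entropy(X) + residual_entropy(X, Y)
    yx = entropy(Y) + residual_entropy(Y, X)
    return "X->Y" if xy < yx else "Y->X"

rng = np.random.default_rng(2)
X = rng.integers(0, 5, size=1000)
Y = (X ** 2 % 7) + rng.integers(0, 2, size=1000)   # Y = f(X) + independent noise
print(infer_direction(list(X), list(Y)))           # expected: X->Y
```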
{"title":"Accurate Causal Inference on Discrete Data","authors":"Kailash Budhathoki, Jilles Vreeken","doi":"10.1109/ICDM.2018.00105","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00105","url":null,"abstract":"Additive Noise Models (ANMs) provide a theoretically sound approach to inferring the most likely causal direction between pairs of random variables given only a sample from their joint distribution. The key assumption is that the effect is a function of the cause, with additive noise that is independent of the cause. In many cases ANMs are identifiable. Their performance, however, hinges on the chosen dependence measure, the assumption we make on the true distribution. In this paper we propose to use Shannon entropy to measure the dependence within an ANM, which gives us a general approach by which we do not have to assume a true distribution, nor have to perform explicit significance tests during optimization. The information-theoretic formulation gives us a general, efficient, identifiable, and, as the experiments show, highly accurate method for causal inference on pairs of discrete variables—achieving (near) 100% accuracy on both synthetic and real data.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131820719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing Very Fast Decision Trees with Local Split-Time Predictions
Viktor Losing, H. Wersing, B. Hammer
An increasing number of industrial areas recognize the opportunities of Big Data, requiring highly efficient algorithms that enable real-time processing to reduce the burden of data storage and maintenance. Decision trees are extremely fast, highly accurate, and easy to use in practice. Merging multiple decision trees into an ensemble yields one of the most powerful machine learning methods. The Very Fast Decision Tree is the state-of-the-art incremental decision tree induction algorithm, capable of learning from massive data streams. It is successful due to its theoretical guarantees based on the Hoeffding bound as well as its competitive performance in terms of classification accuracy and time/space efficiency. In this paper, we increase its efficiency even further by replacing its global splitting scheme, which periodically attempts a split every n_min examples. Instead, we utilize local statistics to predict the split-time, thus avoiding the unnecessary split-attempts that usually dominate the computational cost. Concretely, we use the class distributions of previous split-attempts to approximate the minimum number of examples until the Hoeffding bound is met. This cautious approach yields a low delay by design and at the same time reduces the number of split-attempts. We extensively evaluate our method using common stream-learning benchmarks, also considering non-stationary environments. The experiments confirm a substantially reduced run-time without a loss in classification performance.
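The core idea of predicting the split-time is an inversion of the Hoeffding bound: given an observed gap between the gains of the two best split attributes, solve for the number of examples n at which the bound drops below the gap. The sketch below shows the bound inversion only; the paper's predictor additionally uses the class distributions of previous split-attempts to estimate the future gap.

```python
import math

def predicted_split_time(delta_g, r, delta):
    """Smallest n for which the Hoeffding bound
    eps = sqrt(r^2 * ln(1/delta) / (2n)) falls below the observed
    gain gap delta_g, i.e. the earliest point a split can succeed."""
    if delta_g <= 0:
        return math.inf
    return math.ceil(r ** 2 * math.log(1.0 / delta) / (2.0 * delta_g ** 2))

# e.g. gap of 0.05 in information gain, range r = log2(2) = 1 for binary
# classification, confidence parameter delta = 1e-7
print(predicted_split_time(0.05, 1.0, 1e-7))  # ~3224 examples
```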
{"title":"Enhancing Very Fast Decision Trees with Local Split-Time Predictions","authors":"Viktor Losing, H. Wersing, B. Hammer","doi":"10.1109/ICDM.2018.00044","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00044","url":null,"abstract":"An increasing number of industrial areas recognize the opportunities of Big Data, requiring highly efficient algorithms which enable real-time processing to reduce the burden of data storage and maintenance. Decision trees are extremely fast, highly accurate and easy to use in practice. Merging multiple decision trees to an ensemble leads to one of the most powerful machine learning methods. The Very Fast Decision Tree is the state-of-the-art incremental decision tree induction algorithm, capable of learning from massive data streams. It is successful due to its theoretical guarantees based on the Hoeffding bound as well as its competitive performance in terms of classification accuracy and time / space efficiency. In this paper, we increase the efficiency even further by replacing its global splitting scheme, which periodically tries to split every n_min examples. Instead, we utilize local statistics to predict the split-time, thus, avoiding unnecessary split-attempts, usually dominating the computational cost. Concretely, we use the class distributions of previous split-attempts to approximate the minimum number of examples until the Hoeffding bound is met. This cautious approach yields by design a low delay and reduces the number of split-attempts at the same time. We extensively evaluate our method using common stream-learning benchmarks also considering non-stationary environments. The experiments confirm a substantially reduced run-time without a loss in classification performance.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131109579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing Question Understanding and Representation for Knowledge Base Relation Detection
Zihan Xu, Haitao Zheng, Zuo-You Fu, Wei Wang
Relation detection is a key step in Knowledge Base Question Answering (KBQA), but it is far from solved due to the significant differences between questions and relations. Previous studies usually treat relation detection as a text matching task and mainly focus on reducing the detection error through better representations of KB relations. However, understanding the questions is also important, since they are generally more varied. Moreover, the text-pair representation requires improvement, because KB relations are not always direct counterparts of questions. In this paper, we propose a novel system with enhanced question understanding and representation processes for KB relation detection (QURRD). We design a KBQA-specific slot filling module based on a Bi-LSTM-CRF for question understanding. In addition, with two CNNs for modeling and matching text pairs respectively, QURRD obtains richer question-relation representations for semantic analysis and achieves better performance through learning from multiple tasks. We conduct experiments on both single-relation (SimpleQuestions) and multi-relation (WebQSP) benchmarks. Results show that QURRD is robust to the diversity of questions and outperforms the state-of-the-art system on both tasks.
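The full system pairs a Bi-LSTM-CRF slot-filling module with two CNNs; the PyTorch sketch below shows only the paired-CNN matching idea, with hypothetical dimensions and random tensors standing in for real question/relation embeddings.

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """1-D CNN text encoder with max-over-time pooling (sketch)."""
    def __init__(self, emb_dim=100, n_filters=64, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel, padding=1)

    def forward(self, x):                  # x: (batch, seq_len, emb_dim)
        h = torch.relu(self.conv(x.transpose(1, 2)))
        return h.max(dim=2).values         # (batch, n_filters)

# Two separate encoders, one per side of the text pair, matched by cosine
# similarity over the pooled representations.
q_enc, r_enc = CNNEncoder(), CNNEncoder()
questions = torch.randn(8, 20, 100)        # (batch, tokens, embedding dim)
relations = torch.randn(8, 5, 100)
scores = torch.cosine_similarity(q_enc(questions), r_enc(relations), dim=1)
```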
{"title":"Enhancing Question Understanding and Representation for Knowledge Base Relation Detection","authors":"Zihan Xu, Haitao Zheng, Zuo-You Fu, Wei Wang","doi":"10.1109/ICDM.2018.00186","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00186","url":null,"abstract":"Relation detection is a key step in Knowledge Base Question Answering (KBQA), but far from solved due to the significant differences between questions and relations. Previous studies usually treat relation detection as a text matching task, and mainly focus on reducing the detection error with better representations of KB relations. However, the understanding of questions is also important since they are generally more varied. And the text pair representation requires improvement because KB relations are not always counterparts of questions. In this paper, we propose a novel system with enhanced question understanding and representation processes for KB relation detection (QURRD). We design a KBQA-specific slot filling module based on Bi-LSTM-CRF for question understanding. Besides, with two CNNs for modeling and matching text pairs respectively, QURRD obtains richer question-relation representations for semantic analysis, and achieves better performance through learning from multiple tasks. We conduct experiments on both single-relation (Simple-Questions) and multi-relation (WebQSP) benchmarks. Results show that QURRD is robust against the diversity of questions and outperforms the state-of-the-art system on both tasks.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123786938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Binarized Attributed Network Embedding
Hong Yang, Shirui Pan, Peng Zhang, Ling Chen, Defu Lian, Chengqi Zhang
Attributed network embedding enables joint representation learning of node links and attributes. Existing attributed network embedding models are designed in continuous Euclidean spaces, which often introduces data redundancy and imposes heavy storage and computation costs. To this end, we present a Binarized Attributed Network Embedding model (BANE for short) to learn binary node representations. Specifically, we define a new Weisfeiler-Lehman proximity matrix to capture the data dependence between node links and attributes by aggregating the attribute and link information of neighboring nodes to a given target node in a layer-wise manner. Based on the Weisfeiler-Lehman proximity matrix, we formulate a new Weisfeiler-Lehman matrix factorization learning function under the binary node representation constraint. The learning problem is a mixed-integer optimization, and an efficient cyclic coordinate descent (CCD) algorithm is used as the solver. Node classification and link prediction experiments on real-world datasets show that the proposed BANE model outperforms state-of-the-art network embedding methods.
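A rough sketch of the two stages under stated assumptions: the layer-wise aggregation is modeled as repeated multiplication by a row-normalized adjacency with self-loops, and the CCD-based binary factorization is replaced by a crude sign-of-SVD stand-in. Neither is BANE's exact construction.

```python
import numpy as np

def wl_proximity(A, X, gamma=2):
    """Layer-wise aggregation of attributes over links (sketch).

    A : (n, n) adjacency matrix, X : (n, f) node attributes.
    Each power of the normalized adjacency mixes attributes from one hop
    further away, in the spirit of Weisfeiler-Lehman aggregation.
    """
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    M = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalize
    P = X.copy()
    for _ in range(gamma):
        P = M @ P
    return P

def binarize_embedding(P, dim=16):
    # Crude stand-in for BANE's CCD-based binary factorization: take the
    # sign of the top singular directions of P (requires dim <= min(n, f)).
    U, _, _ = np.linalg.svd(P, full_matrices=False)
    return np.sign(U[:, :dim])

rng = np.random.default_rng(3)
A = (rng.random((30, 30)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T                     # symmetric, no self-loops
X = rng.random((30, 20))
B = binarize_embedding(wl_proximity(A, X), dim=16) # sign-valued embedding
```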
{"title":"Binarized attributed network embedding","authors":"Hong Yang, Shirui Pan, Peng Zhang, Ling Chen, Defu Lian, Chengqi Zhang","doi":"10.1109/ICDM.2018.8626170","DOIUrl":"https://doi.org/10.1109/ICDM.2018.8626170","url":null,"abstract":"Attributed network embedding enables joint representation learning of node links and attributes. Existing attributed network embedding models are designed in continuous Euclidean spaces which often introduce data redundancy and impose challenges to storage and computation costs. To this end, we present a Binarized Attributed Network Embedding model (BANE for short) to learn binary node representation. Specifically, we define a new Weisfeiler-Lehman proximity matrix to capture data dependence between node links and attributes by aggregating the information of node attributes and links from neighboring nodes to a given target node in a layer-wise manner. Based on the Weisfeiler-Lehman proximity matrix, we formulate a new Weisfiler-Lehman matrix factorization learning function under the binary node representation constraint. The learning problem is a mixed integer optimization and an efficient cyclic coordinate descent (CCD) algorithm is used as the solution. Node classification and link prediction experiments on real-world datasets show that the proposed BANE model outperforms the state-of-the-art network embedding methods.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123833450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cost Effective Multi-label Active Learning via Querying Subexamples
Xia Chen, Guoxian Yu, C. Domeniconi, J. Wang, Zhao Li, Z. Zhang
Multi-label active learning addresses the scarcity of labeled examples by querying the most valuable unlabeled examples, or example-label pairs, to achieve better performance at limited query cost. Current multi-label active learning methods require scrutiny of the whole example in order to obtain its annotation. In contrast, one can find positive evidence with respect to a label by examining specific patterns (i.e., subexamples) rather than the whole example, making the annotation process more efficient. Based on this observation, we propose a novel two-stage cost-effective multi-label active learning framework called CMAL. In the first stage, a novel example-label pair selection strategy is introduced. Our strategy leverages the label correlation and label-space sparsity of multi-label examples to select the most uncertain example-label pairs. Specifically, an example's unknown relevant label can be inferred from correlated labels already assigned to the example, thus reducing the uncertainty of the unknown label. In addition, the larger the number of examples relevant to a particular label, the smaller that label's uncertainty. In the second stage, CMAL queries the most plausible positive subexample-label pairs of the selected example-label pairs. Comprehensive experiments on multi-label datasets collected from different domains demonstrate the effectiveness of our approach for cost-effective querying. We also show that leveraging label correlation and label sparsity contributes to saving costs.
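One plausible instantiation of the first-stage pair selection, sketched below: base uncertainty from the classifier's probability, discounted both by evidence from correlated known labels and by label frequency. The specific weighting is an illustrative assumption, not CMAL's actual formula.

```python
import numpy as np

def pair_scores(prob, known, label_corr, label_freq):
    """Uncertainty score for each unannotated label of one example (sketch).

    prob       : (L,) predicted relevance probabilities for the example
    known      : dict {label_index: 0/1} of labels already annotated
    label_corr : (L, L) label correlation matrix in [0, 1]
    label_freq : (L,) fraction of training examples relevant to each label
    """
    L = len(prob)
    scores = np.zeros(L)
    for l in range(L):
        if l in known:
            continue
        u = 1.0 - abs(2.0 * prob[l] - 1.0)        # base uncertainty, max at 0.5
        # correlated labels already known to be relevant lower the uncertainty...
        evidence = sum(label_corr[l, k] for k, v in known.items() if v == 1)
        # ...and so does a large number of relevant examples for the label
        scores[l] = u * (1.0 - min(evidence, 1.0)) * (1.0 - label_freq[l])
    return scores
```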
{"title":"Cost Effective Multi-label Active Learning via Querying Subexamples","authors":"Xia Chen, Guoxian Yu, C. Domeniconi, J. Wang, Zhao Li, Z. Zhang","doi":"10.1109/ICDM.2018.00109","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00109","url":null,"abstract":"Multi-label active learning addresses the scarce labeled example problem by querying the most valuable unlabeled examples, or example-label pairs, to achieve a better performance with limited query cost. Current multi-label active learning methods require the scrutiny of the whole example in order to obtain its annotation. In contrast, one can find positive evidence with respect to a label by examining specific patterns (i.e., subexample), rather than the whole example, thus making the annotation process more efficient. Based on this observation, we propose a novel two-stage cost effective multi-label active learning framework, called CMAL. In the first stage, a novel example-label pair selection strategy is introduced. Our strategy leverages label correlation and label space sparsity of multi-label examples to select the most uncertain example-label pairs. Specifically, the unknown relevant label of an example can be inferred from the correlated labels that are already assigned to the example, thus reducing the uncertainty of the unknown label. In addition, the larger the number of relevant examples of a particular label, the smaller the uncertainty of the label is. In the second stage, CMAL queries the most plausible positive subexample-label pairs of the selected example-label pairs. Comprehensive experiments on multi-label datasets collected from different domains demonstrate the effectiveness of our proposed approach on cost effective queries. We also show that leveraging label correlation and label sparsity contribute to saving costs.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121111940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Rectangle Counting on Massive Networks
Rong Zhu, Zhaonian Zou, Jianzhong Li
The rectangle has been recognized as an essential motif in a large number of real-world networks, and counting the rectangles in a network plays an important role in network analysis. This paper comprehensively studies the rectangle counting problem on large networks. We propose a novel counting paradigm called wedge-centric counting, where a wedge is a simple path consisting of three vertices. Unlike traditional edge-centric counting, wedge-centric counting uses wedges instead of edges as the building blocks of rectangles. Its main advantage is that it does not need to access two-hop neighbors. Based on this paradigm, we develop a collection of rectangle counting algorithms, including an in-memory algorithm with lower time complexity, an external-memory algorithm with optimal I/O complexity, and two randomized algorithms with provable error bounds. Experimental results on a variety of real networks verify the effectiveness and efficiency of the proposed wedge-centric rectangle counting algorithms.
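The wedge-centric idea admits a compact in-memory sketch: enumerate wedges by their center (touching only one-hop neighbors), group them by endpoint pair, and close any two wedges sharing an endpoint pair into a rectangle. This is a plain illustration of the paradigm, not the paper's optimized algorithm.

```python
from collections import defaultdict
from itertools import combinations

def count_rectangles(adj):
    """Count 4-cycles (rectangles) via wedge-centric counting.

    adj : dict mapping each vertex to a set of neighbors (vertices sortable).
    A wedge is a path a-m-b; any two wedges sharing the same endpoint pair
    {a, b} close a rectangle. Each rectangle has two diagonals and is
    therefore counted twice, so the sum is halved.
    """
    wedges = defaultdict(int)
    for m in adj:                                  # m is the wedge center
        for a, b in combinations(sorted(adj[m]), 2):
            wedges[(a, b)] += 1
    total = sum(w * (w - 1) // 2 for w in wedges.values())
    return total // 2

# A single 4-cycle contains exactly one rectangle.
square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(count_rectangles(square))   # 1
```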
{"title":"Fast Rectangle Counting on Massive Networks","authors":"Rong Zhu, Zhaonian Zou, Jianzhong Li","doi":"10.1109/ICDM.2018.00100","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00100","url":null,"abstract":"Rectangle has been recognized as an essential motif in a large number of real-world networks. Counting rectangles in a network plays an important role in network analysis. This paper comprehensively studies the rectangle counting problem on large networks. We propose a novel counting paradigm called the wedge-centric counting, where a wedge is a simple path consisting of three vertices. Unlike the traditional edge-centric counting, the wedge-centric counting uses wedges instead of edges as building blocks of rectangles. The main advantage of the wedge-centric counting is that it does not need to access two-hop neighbors. Based on this paradigm, we develop a collection of rectangle counting algorithms, including an in-memory algorithm with lower time complexity, an external-memory algorithm with the optimal I/O complexity, and two randomized algorithms with provable error bounds. The experimental results on a variety of real networks verify the effectiveness and the efficiency of the proposed wedge-centric rectangle counting algorithms.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125112943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DrugCom: Synergistic Discovery of Drug Combinations Using Tensor Decomposition
Huiyuan Chen, Jing Li
Personalized treatments and targeted therapies are the most promising approaches to treating complex diseases, especially cancer. However, drug resistance is often acquired after treatment. To overcome or reduce drug resistance, treatments using drug combinations have been actively investigated in the literature. Existing methods mainly focus on the chemical properties of drugs for potential combination therapies, without considering relationships among different diseases. They also often ignore the rich knowledge of drugs and diseases that can enhance the prediction of drug combinations. This motivates us to develop a new computational method that can predict beneficial drug combinations. We propose DrugCom, a tensor-based framework for computing drug combinations across different diseases by integrating multiple heterogeneous data sources on drugs and diseases. DrugCom first constructs a primary third-order tensor (i.e., drug × drug × disease) and several similarity matrices from multiple data sources regarding drugs (e.g., chemical structure) and diseases (e.g., disease phenotype). DrugCom then formulates an objective function that simultaneously factorizes the coupled tensor and matrices to reveal the molecular mechanisms of drug synergy. We adopt the alternating direction method of multipliers (ADMM) algorithm to effectively solve the optimization problem. Extensive experimental studies using real-world datasets demonstrate the superior performance of DrugCom.
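A minimal sketch of what such a coupled tensor-matrix objective could look like, assuming a CP-style reconstruction that reuses the drug factor on both drug modes and couples the similarity matrices through the same factors. The paper solves its objective with ADMM; this only illustrates the loss being minimized.

```python
import numpy as np

def drugcom_loss(T, D, P, Sd, Sp, lam=0.1):
    """Coupled tensor-matrix factorization objective (sketch).

    T  : (n, n, m) drug x drug x disease tensor of known synergies
    D  : (n, k) drug factor matrix, P : (m, k) disease factor matrix
    Sd : (n, n) drug similarity, Sp : (m, m) disease similarity
    """
    recon = np.einsum('ik,jk,lk->ijl', D, D, P)      # CP with shared drug factor
    loss = np.sum((T - recon) ** 2)
    loss += lam * np.sum((Sd - D @ D.T) ** 2)        # drug similarity coupling
    loss += lam * np.sum((Sp - P @ P.T) ** 2)        # disease similarity coupling
    return loss

rng = np.random.default_rng(4)
n, m, k = 10, 4, 3
D, P = rng.random((n, k)), rng.random((m, k))
T = np.einsum('ik,jk,lk->ijl', D, D, P)              # noiseless synthetic tensor
print(drugcom_loss(T, D, P, D @ D.T, P @ P.T))       # ~0 at the planted solution
```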
{"title":"DrugCom: Synergistic Discovery of Drug Combinations Using Tensor Decomposition","authors":"Huiyuan Chen, Jing Li","doi":"10.1109/ICDM.2018.00108","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00108","url":null,"abstract":"Personalized treatments and targeted therapies are the most promising approaches to treat complex diseases, especially for cancer. However, drug resistance is often acquired after treatments. To overcome or reduce drug resistance, treatments using drug combinations have been actively investigated in the literature. Existing methods mainly focus on chemical properties of drugs for potential combination therapies without considering relationships among different diseases. Also, they often do not consider the rich knowledge of drugs and diseases, which can enhance the prediction of drug combinations. This motivates us to develop a new computational method that can predict the beneficial drug combinations. We propose DrugCom, a tensor-based framework for computing drug combinations across different diseases by integrating multiple heterogeneous data sources of drugs and diseases. DrugCom first constructs a primary third-order tensor (i.e., drug×drug ×disease) and several similarity matrices from multiple data sources regarding drugs (e.g., chemical structure) and diseases (e.g., disease phenotype). DrugCom then formulates an objective function, which simultaneously factorizes coupled tensor and matrices to reveal the molecular mechanisms of drug synergy. We adopt the alternating direction method of multipliers algorithm to effectively solve the optimization problem. Extensive experimental studies using real-world datasets demonstrate superior performance of DrugCom.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125115448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}