Accelerating Experimental Design by Incorporating Experimenter Hunches
Cheng Li, Santu Rana, Sunil Gupta, Vu Nguyen, S. Venkatesh, A. Sutti, D. R. Leal, Teo Slezak, Murray Height, M. Mohammed, I. Gibson
ICDM 2018. DOI: 10.1109/ICDM.2018.00041
Experimental design is the process of obtaining a product with a target property via experimentation. Bayesian optimization offers a sample-efficient tool for experimental design when experiments are expensive. Often, expert experimenters have 'hunches' about the behavior of the experimental system, offering the potential to further improve efficiency. In this paper, we consider a per-variable monotonic trend in the underlying property, which results in a unimodal trend in those variables when optimizing for a target value. For example, the sweetness of a candy is monotonic in its sugar content; however, to obtain a target sweetness, the utility of the sugar content becomes a unimodal function, which peaks at the value giving the target sweetness and falls off both ways. We propose a novel method for such problems that achieves two main objectives: a) the monotonicity information is used to the fullest extent possible, whilst ensuring that b) the convergence guarantee remains intact. This is achieved by two-stage Gaussian process modeling, where the first stage uses the monotonicity trend to model the underlying property, and the second stage uses 'virtual' samples, drawn from the first, to model the target-value optimization function. The process is made theoretically consistent by adding an appropriate adjustment factor in the posterior computation, necessitated by the use of the 'virtual' samples. The proposed method is evaluated through both simulations and real-world experimental design problems: a) a new short polymer fiber with a target length, and b) a new three-dimensional porous scaffold with a target porosity. In all scenarios our method demonstrates faster convergence than basic Bayesian optimization, which does not use such 'hunches'.
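As a rough illustration of the two-stage modeling described above, here is a minimal sketch using scikit-learn GPs. The first-stage GP below is a plain model of the property: the monotonicity-informed prior and the paper's posterior adjustment factor are omitted, and the target value, kernels, and UCB acquisition are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
target = 0.7                      # desired property value (assumed)

# Observed experiments: design variable x -> measured property y.
X = rng.uniform(0, 1, size=(8, 1))
y = np.sin(3 * X[:, 0])           # stand-in for the real property

# Stage 1: model the underlying property itself.
gp_property = GaussianProcessRegressor(kernel=RBF(0.2), alpha=1e-6).fit(X, y)

# Draw 'virtual' samples of the property on a dense grid, then turn them
# into utility values that peak where the property hits the target.
X_virtual = np.linspace(0, 1, 50).reshape(-1, 1)
y_virtual = gp_property.sample_y(X_virtual, random_state=1).ravel()
utility = -(y_virtual - target) ** 2

# Stage 2: model the target-value utility from the virtual samples.
gp_utility = GaussianProcessRegressor(kernel=RBF(0.2), alpha=1e-6).fit(X_virtual, utility)

# Pick the next experiment with a simple UCB acquisition on stage 2.
X_cand = np.linspace(0, 1, 200).reshape(-1, 1)
mu, sd = gp_utility.predict(X_cand, return_std=True)
x_next = X_cand[np.argmax(mu + 2.0 * sd)]
print("next experiment at x =", x_next)
```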
{"title":"Accelerating Experimental Design by Incorporating Experimenter Hunches","authors":"Cheng Li, Santu Rana, Sunil Gupta, Vu Nguyen, S. Venkatesh, A. Sutti, D. R. Leal, Teo Slezak, Murray Height, M. Mohammed, I. Gibson","doi":"10.1109/ICDM.2018.00041","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00041","url":null,"abstract":"Experimental design is a process of obtaining a product with target property via experimentation. Bayesian optimization offers a sample-efficient tool for experimental design when experiments are expensive. Often, expert experimenters have 'hunches' about the behavior of the experimental system, offering potentials to further improve the efficiency. In this paper, we consider per-variable monotonic trend in the underlying property that results in a unimodal trend in those variables for a target value optimization. For example, sweetness of a candy is monotonic to the sugar content. However, to obtain a target sweetness, the utility of the sugar content becomes a unimodal function, which peaks at the value giving the target sweetness and falls off both ways. In this paper, we propose a novel method to solve such problems that achieves two main objectives: a) the monotonicity information is used to the fullest extent possible, whilst ensuring that b) the convergence guarantee remains intact. This is achieved by a two-stage Gaussian process modeling, where the first stage uses the monotonicity trend to model the underlying property, and the second stage uses 'virtual' samples, sampled from the first, to model the target value optimization function. The process is made theoretically consistent by adding appropriate adjustment factor in the posterior computation, necessitated because of using the 'virtual' samples. The proposed method is evaluated through both simulations and real world experimental design problems of a) new short polymer fiber with the target length, and b) designing of a new three dimensional porous scaffolding with a target porosity. In all scenarios our method demonstrates faster convergence than the basic Bayesian optimization approach not using such 'hunches'.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"127 16","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114047212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Highly Parallel Sequential Pattern Mining on a Heterogeneous Platform
Yu-Heng Hsieh, Chun-Chieh Chen, Hong-Han Shuai, Ming-Syan Chen
ICDM 2018. DOI: 10.1109/ICDM.2018.00131

Sequential pattern mining can be applied to various fields such as disease prediction and stock analysis. Many algorithms have been proposed for sequential pattern mining, together with acceleration methods. In this paper, we show that a heterogeneous platform with a CPU and a GPU is more suitable for sequential pattern mining than traditional CPU-based approaches, since the support counting process is inherently succinct and repetitive. We therefore propose the PArallel SequenTial pAttern mining algorithm, PASTA, which accelerates sequential pattern mining by combining the merits of CPU and GPU computing. Explicitly, PASTA adopts a vertical bitmap representation of the database to exploit GPU parallelism. In addition, a pipeline strategy ensures that the CPU and GPU of the heterogeneous platform operate concurrently, fully utilizing the computing power of the platform. Furthermore, we develop a swapping scheme that mitigates the limited memory of GPU hardware without degrading performance. Finally, comprehensive experiments compare PASTA against different baselines. The experiments show that PASTA outperforms the state-of-the-art algorithms by orders of magnitude on both real and synthetic datasets.
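A toy CPU-side illustration of why vertical bitmaps make support counting "succinct and repetitive": extending a pattern and counting its support reduce to a bit transform, an AND, and a per-row OR. SPAM-style bitmaps are assumed here; PASTA's GPU kernels, pipelining, and swapping scheme are not reproduced.

```python
import numpy as np

# Vertical bitmaps: item -> bool array over (sequence, position) slots.
# Toy database: 3 sequences, up to 4 itemsets each (padded).
n_seq, max_len = 3, 4
bitmap = {
    "a": np.array([[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 0, 0]], dtype=bool),
    "b": np.array([[0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 1, 0]], dtype=bool),
}

def s_step(prefix_bm):
    """Sequence extension: mark every slot strictly after the first set
    bit of the prefix in each sequence (where an appended itemset may occur)."""
    out = np.zeros_like(prefix_bm)
    for i in range(n_seq):
        hits = np.flatnonzero(prefix_bm[i])
        if hits.size:
            out[i, hits[0] + 1:] = True
    return out

def support(bm):
    """A sequence supports the pattern if any of its slots is set."""
    return int(bm.any(axis=1).sum())

# Support of pattern <a, b>: transform a's bitmap, AND with b's bitmap.
ab = s_step(bitmap["a"]) & bitmap["b"]
print("support(<a,b>) =", support(ab))   # sequences containing a then b
```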
{"title":"Highly Parallel Sequential Pattern Mining on a Heterogeneous Platform","authors":"Yu-Heng Hsieh, Chun-Chieh Chen, Hong-Han Shuai, Ming-Syan Chen","doi":"10.1109/ICDM.2018.00131","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00131","url":null,"abstract":"Sequential pattern mining can be applied to various fields such as disease prediction and stock analysis. Many algorithms have been proposed for sequential pattern mining, together with acceleration methods. In this paper, we show that a heterogeneous platform with CPU and GPU is more suitable for sequential pattern mining than traditional CPU-based approaches since the support counting process is inherently succinct and repetitive. Therefore, we propose the PArallel SequenTial pAttern mining algorithm, referred to as PASTA, to accelerate sequential pattern mining by combining the merits of CPU and GPU computing. Explicitly, PASTA adopts the vertical bitmap representation of database to exploits the GPU parallelism. In addition, a pipeline strategy is proposed to ensure that both CPU and GPU on the heterogeneous platform operate concurrently to fully utilize the computing power of the platform. Furthermore, we develop a swapping scheme to mitigate the limited memory problem of the GPU hardware without decreasing the performance. Finally, comprehensive experiments are conducted to analyze PASTA with different baselines. The experiments show that PASTA outperforms the state-of-the-art algorithms by orders of magnitude on both real and synthetic datasets.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125331477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DeepDiffuse: Predicting the 'Who' and 'When' in Cascades
Mohammad Raihanul Islam, S. Muthiah, B. Adhikari, B. Prakash, Naren Ramakrishnan
ICDM 2018. DOI: 10.1109/ICDM.2018.00134
Cascades are an accepted model for capturing how information diffuses across social network platforms. A large body of research has focused on dissecting the anatomy of such cascades and forecasting their progression. One recurring theme is predicting the next stage(s) of a cascade using pertinent information such as the underlying social network, structural properties of nodes (e.g., degree), and (partial) histories of cascade propagation. However, such granular information is rarely available in practice. In this paper, we study the problem of cascade prediction using only two types of (coarse) information: which node is infected and its corresponding infection time. We first construct several simple baselines for this cascade prediction problem. We then describe the shortcomings of these methods and propose a new solution leveraging recent progress in embeddings and attention models from representation learning. We also perform an exhaustive analysis of our methods on several real-world datasets. Our proposed model outperforms the baselines and several other state-of-the-art methods.
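A minimal sketch of the kind of embedding-plus-attention scoring the paper builds on: attend over the embeddings of already-infected nodes to score candidates for the next infection. Random vectors stand in for learned embeddings, and the actual DeepDiffuse architecture, including its modeling of infection times, is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d = 20, 8
node_emb = rng.normal(size=(n_nodes, d))       # stand-in for learned embeddings

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict_next(cascade, query):
    """Score candidate next-infected nodes by attending over the
    embeddings of already-infected nodes (toy attention)."""
    H = node_emb[cascade]                      # (len(cascade), d)
    attn = softmax(H @ query)                  # attention over the history
    context = attn @ H                         # attended summary, shape (d,)
    scores = node_emb @ context                # similarity to every node
    scores[cascade] = -np.inf                  # already infected
    return int(np.argmax(scores))

cascade = [3, 7, 12]                           # observed infections, in order
query = node_emb[cascade[-1]]                  # condition on the latest node
print("predicted next node:", predict_next(cascade, query))
```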
{"title":"DeepDiffuse: Predicting the 'Who' and 'When' in Cascades","authors":"Mohammad Raihanul Islam, S. Muthiah, B. Adhikari, B. Prakash, Naren Ramakrishnan","doi":"10.1109/ICDM.2018.00134","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00134","url":null,"abstract":"Cascades are an accepted model to capturing how information diffuses across social network platforms. A large body of research has been focused on dissecting the anatomy of such cascades and forecasting their progression. One recurring theme involves predicting the next stage(s) of cascades utilizing pertinent information such as the underlying social network, structural properties of nodes (e.g., degree) and (partial) histories of cascade propagation. However, such type of granular information is rarely available in practice. We study in this paper the problem of cascade prediction utilizing only two types of (coarse) information, viz. which node is infected and its corresponding infection time. We first construct several simple baselines to solve this cascade prediction problem. Then we describe the shortcomings of these methods and propose a new solution leveraging recent progress in embeddings and attention models from representation learning. We also perform an exhaustive analysis of our methods on several real world datasets. Our proposed model outperforms the baselines and several other state-of-the-art methods.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131511676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASTM: An Attentional Segmentation Based Topic Model for Short Texts
Jiamiao Wang, Ling Chen, Lu Qin, Xindong Wu
ICDM 2018. DOI: 10.1109/ICDM.2018.00073

To address the data sparsity problem in short text understanding, various topic models leveraging word embeddings as background knowledge have been developed recently. However, existing models combine auxiliary information and topic modeling in a straightforward way, without considering human reading habits. In contrast, extensive studies have shown that taking human attention into account holds great potential for textual analysis. We therefore propose a novel model, the Attentional Segmentation based Topic Model (ASTM), which integrates word embeddings as supplementary information with an attention mechanism that segments short text documents into fragments of adjacent words receiving similar attention. Each segment is assigned to a topic, and each document can have multiple topics. We evaluate the performance of our model on three real-world short text datasets. The experimental results demonstrate that our model outperforms the state-of-the-art in terms of both topic coherence and text classification.
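A toy illustration of the segmentation rule described above: adjacent words whose attention scores are close get grouped into one fragment. The scores and tolerance below are made-up inputs, not ASTM's learned attention.

```python
def segment_by_attention(words, attn, tol=0.15):
    """Group adjacent words whose attention scores differ by at most
    `tol` into one fragment (toy version of attentional segmentation)."""
    segments, current = [], [0]
    for i in range(1, len(words)):
        if abs(attn[i] - attn[i - 1]) <= tol:
            current.append(i)
        else:
            segments.append([words[j] for j in current])
            current = [i]
    segments.append([words[j] for j in current])
    return segments

words = ["new", "gpu", "chip", "launches", "at", "expo"]
attn  = [0.90,  0.85,  0.80,  0.30,       0.25, 0.28]
print(segment_by_attention(words, attn))
# [['new', 'gpu', 'chip'], ['launches', 'at', 'expo']]
```

Each such fragment would then be assigned to a single topic, which is how the model keeps adjacent, similarly-attended words together.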
{"title":"ASTM: An Attentional Segmentation Based Topic Model for Short Texts","authors":"Jiamiao Wang, Ling Chen, Lu Qin, Xindong Wu","doi":"10.1109/ICDM.2018.00073","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00073","url":null,"abstract":"To address the data sparsity problem in short text understanding, various alternative topic models leveraging word embeddings as background knowledge have been developed recently. However, existing models combine auxiliary information and topic modeling in a straightforward way without considering human reading habits. In contrast, extensive studies have proven that it is full of potential in textual analysis by taking into account human attention. Therefore, we propose a novel model, Attentional Segmentation based Topic Model (ASTM), to integrate both word embeddings as supplementary information and an attention mechanism that segments short text documents into fragments of adjacent words receiving similar attention. Each segment is assigned to a topic and each document can have multiple topics. We evaluate the performance of our model on three real-world short text datasets. The experimental results demonstrate that our model outperforms the state-of-the-art in terms of both topic coherence and text classification.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121267779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Doc2Cube: Allocating Documents to Text Cube Without Labeled Data
Fangbo Tao, Chao Zhang, Xiusi Chen, Meng Jiang, T. Hanratty, Lance M. Kaplan, Jiawei Han
ICDM 2018. DOI: 10.1109/ICDM.2018.00169
The data cube is a cornerstone architecture for multidimensional analysis of structured datasets. It is highly desirable to conduct multidimensional analysis on text corpora with cube structures for various text-intensive applications in healthcare, business intelligence, and social media analysis. However, one bottleneck in constructing a text cube is automatically putting millions of documents into the right cube cells so that quality multidimensional analysis can be conducted afterwards; it is too expensive to allocate documents manually or rely on massively labeled data. We propose Doc2Cube, a method that constructs a text cube from a given corpus in an unsupervised way. Initially, only the label names (e.g., USA, China) of each dimension (e.g., location) are provided, without any labeled data. Doc2Cube leverages label names as weak supervision signals and iteratively performs joint embedding of labels, terms, and documents to uncover their semantic similarities. To generate joint embeddings that are discriminative for cube construction, Doc2Cube learns dimension-tailored document representations by selectively focusing on terms that are highly label-indicative in each dimension. Furthermore, Doc2Cube alleviates label sparsity by propagating information from label names to other terms, enriching the labeled term set. Our experiments on real data demonstrate the superiority of Doc2Cube over existing methods.
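A loose sketch of the weak-supervision idea: with only label names for one dimension, a document goes to the cell whose label embedding best matches a focus-weighted document vector. The embeddings are random stand-ins, and Doc2Cube's iterative joint embedding and label propagation are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["election", "senate", "basketball", "playoffs", "vaccine"]
emb = {w: rng.normal(size=16) for w in vocab}   # stand-in term embeddings

# One cube dimension ("topic"), with label names as the only supervision.
labels = {"politics": ["election"], "sports": ["basketball"]}
label_vec = {l: np.mean([emb[w] for w in ws], axis=0)
             for l, ws in labels.items()}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def assign(doc_terms):
    """Dimension-tailored doc vector: weight each term by how strongly it
    indicates *some* label in this dimension, then pick the closest label."""
    focus = np.array([max(cos(emb[t], v) for v in label_vec.values())
                      for t in doc_terms])
    focus = np.clip(focus, 0, None)              # keep label-indicative terms
    doc_vec = (focus[:, None] * np.array([emb[t] for t in doc_terms])).sum(0)
    return max(label_vec, key=lambda l: cos(doc_vec, label_vec[l]))

print(assign(["senate", "election", "vaccine"]))   # -> 'politics'
```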
{"title":"Doc2Cube: Allocating Documents to Text Cube Without Labeled Data","authors":"Fangbo Tao, Chao Zhang, Xiusi Chen, Meng Jiang, T. Hanratty, Lance M. Kaplan, Jiawei Han","doi":"10.1109/ICDM.2018.00169","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00169","url":null,"abstract":"Data cube is a cornerstone architecture in multidimensional analysis of structured datasets. It is highly desirable to conduct multidimensional analysis on text corpora with cube structures for various text-intensive applications in healthcare, business intelligence, and social media analysis. However, one bottleneck to constructing text cube is to automatically put millions of documents into the right cube cells so that quality multidimensional analysis can be conducted afterwards-it is too expensive to allocate documents manually or rely on massively labeled data. We propose Doc2Cube, a method that constructs a text cube from a given text corpus in an unsupervised way. Initially, only the label names (e.g., USA, China) of each dimension (e.g., location) are provided instead of any labeled data. Doc2Cube leverages label names as weak supervision signals and iteratively performs joint embedding of labels, terms, and documents to uncover their semantic similarities. To generate joint embeddings that are discriminative for cube construction, Doc2Cube learns dimension-tailored document representations by selectively focusing on terms that are highly label-indicative in each dimension. Furthermore, Doc2Cube alleviates label sparsity by propagating the information from label names to other terms and enriching the labeled term set. Our experiments on real data demonstrate the superiority of Doc2Cube over existing methods.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132905842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Similarity-Based Active Learning for Image Classification Under Class Imbalance
Chuanhai Zhang, Wallapak Tavanapong, Gavin Kijkul, J. Wong, P. C. Groen, Jung-Hwan Oh
ICDM 2018. DOI: 10.1109/ICDM.2018.00196
Many image classification tasks (e.g., medical image classification) suffer from severe class imbalance. Convolutional neural networks (CNNs) are currently the state of the art for image classification, but they rely on large training datasets to achieve high performance, and manual labeling is costly and may not even be feasible in the medical domain. In this paper, we propose a novel similarity-based active deep learning framework (SAL) that deals with class imbalance. SAL actively learns a similarity model to recommend unlabeled rare-class samples for experts' manual labeling. Based on similarity ranking, SAL also recommends high-confidence unlabeled common-class samples for automatic pseudo-labeling, without any expert effort. To the best of our knowledge, SAL is the first active deep learning framework that deals with significant class imbalance. Our experiments show that SAL consistently outperforms two other recent active deep learning methods on two challenging datasets. Moreover, SAL obtains nearly the upper-bound classification performance (using all the images in the training dataset) while the domain experts labeled only 5.6% and 7.5% of all images in the Endoscopy dataset and the Caltech-256 dataset, respectively. SAL thus significantly reduces the experts' manual labeling effort while achieving near-optimal classification performance.
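A minimal sketch of the two recommendation rules, with plain cosine similarity over feature vectors standing in for SAL's learned similarity model; the top-k and quantile cutoffs below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
feats_unlabeled = rng.normal(size=(100, 32))   # e.g. CNN features
feats_rare = rng.normal(size=(5, 32))          # labeled rare-class images

def cosine(A, B):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

# Similarity of each unlabeled image to its nearest labeled rare image.
sim = cosine(feats_unlabeled, feats_rare).max(axis=1)

k = 10
to_expert = np.argsort(-sim)[:k]                # likely rare: ask the expert
pseudo_common = np.where(sim < np.quantile(sim, 0.2))[0]  # confident common
print("query expert on:", to_expert)
print("auto pseudo-label as common:", len(pseudo_common), "samples")
```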
{"title":"Similarity-Based Active Learning for Image Classification Under Class Imbalance","authors":"Chuanhai Zhang, Wallapak Tavanapong, Gavin Kijkul, J. Wong, P. C. Groen, Jung-Hwan Oh","doi":"10.1109/ICDM.2018.00196","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00196","url":null,"abstract":"Many image classification tasks (e.g., medical image classification) have a severe class imbalance problem. Convolutional neural network (CNN) is currently a state-of-the-art method for image classification. CNN relies on a large training dataset to achieve high classification performance. However, manual labeling is costly and may not even be feasible for medical domain. In this paper, we propose a novel similarity-based active deep learning framework (SAL) that deals with class imbalance. SAL actively learns a similarity model to recommend unlabeled rare class samples for experts' manual labeling. Based on similarity ranking, SAL recommends high confidence unlabeled common class samples for automatic pseudo-labeling without experts' labeling effort. To the best of our knowledge, SAL is the first active deep learning framework that deals with a significant class imbalance. Our experiments show that SAL consistently outperforms two other recent active deep learning methods on two challenging datasets. What's more, SAL obtains nearly the upper bound classification performance (using all the images in the training dataset) while the domain experts labeled only 5.6% and 7.5% of all images in the Endoscopy dataset and the Caltech-256 dataset, respectively. SAL significantly reduces the experts' manual labeling efforts while achieving near optimal classification performance.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133556584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MuVAN: A Multi-view Attention Network for Multivariate Temporal Data
Ye Yuan, Guangxu Xun, Fenglong Ma, Yaqing Wang, Nan Du, Ke-bin Jia, Lu Su, Aidong Zhang
ICDM 2018. DOI: 10.1109/ICDM.2018.00087
Attention networks have recently gained enormous interest in time series data mining. Various attention mechanisms soft-select relevant timestamps from temporal data by assigning learnable attention scores. However, many real-world tasks involve complex multivariate time series that continuously measure a target from multiple views. Different views may provide information whose quality varies over time, and should thus be assigned different attention scores as well. Unfortunately, existing attention-based architectures cannot be directly used to jointly learn attention scores in both the time and view domains, due to the complexity of the data structure. To this end, we propose a novel multi-view attention network, MuVAN, to learn fine-grained attentional representations from multivariate temporal data. MuVAN is a unified deep learning model that jointly calculates two-dimensional attention scores to estimate the quality of the information contributed by each view at each timestamp. By constructing a hybrid focus procedure, we bring more diversity to the attention, in order to fully utilize the multi-view information. To evaluate the performance of our model, we carry out experiments on three real-world benchmark datasets. Experimental results show that MuVAN outperforms state-of-the-art deep representation approaches on different real-world tasks. Analytical results from a case study demonstrate that MuVAN can discover discriminative and meaningful attention scores across views over time, improving the feature representation of multivariate temporal data.
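A numpy sketch of two-dimensional attention in the spirit described above: one score per (view, timestamp) pair, a softmax over both domains jointly, and a weighted sum into a single representation. The scoring vector is a random stand-in for what MuVAN learns, and the hybrid focus procedure is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, d = 3, 10, 6                       # views, timestamps, feature dim
H = rng.normal(size=(V, T, d))           # hidden states per view and time
w = rng.normal(size=d)                   # toy scoring vector (learned in MuVAN)

scores = H @ w                           # (V, T): one score per view/time
attn = np.exp(scores - scores.max())
attn /= attn.sum()                       # joint softmax over both domains

context = (attn[..., None] * H).sum(axis=(0, 1))   # attended representation
print(attn.shape, context.shape)         # (3, 10) (6,)
```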
{"title":"MuVAN: A Multi-view Attention Network for Multivariate Temporal Data","authors":"Ye Yuan, Guangxu Xun, Fenglong Ma, Yaqing Wang, Nan Du, Ke-bin Jia, Lu Su, Aidong Zhang","doi":"10.1109/ICDM.2018.00087","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00087","url":null,"abstract":"Recent advances in attention networks have gained enormous interest in time series data mining. Various attention mechanisms are proposed to soft-select relevant timestamps from temporal data by assigning learnable attention scores. However, many real-world tasks involve complex multivariate time series that continuously measure target from multiple views. Different views may provide information of different levels of quality varied over time, and thus should be assigned with different attention scores as well. Unfortunately, the existing attention-based architectures cannot be directly used to jointly learn the attention scores in both time and view domains, due to the data structure complexity. Towards this end, we propose a novel multi-view attention network, namely MuVAN, to learn fine-grained attentional representations from multivariate temporal data. MuVAN is a unified deep learning model that can jointly calculate the two-dimensional attention scores to estimate the quality of information contributed by each view within different timestamps. By constructing a hybrid focus procedure, we are able to bring more diversity to attention, in order to fully utilize the multi-view information. To evaluate the performance of our model, we carry out experiments on three real-world benchmark datasets. Experimental results show that the proposed MuVAN model outperforms the state-of-the-art deep representation approaches in different real-world tasks. Analytical results through a case study demonstrate that MuVAN can discover discriminative and meaningful attention scores across views over time, which improves the feature representation of multivariate temporal data.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132232287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust Distributed Anomaly Detection Using Optimal Weighted One-Class Random Forests
Yu-Lin Tsou, Hong-Min Chu, Cong Li, Shao-Wen Yang
ICDM 2018. DOI: 10.1109/ICDM.2018.00171

Wireless sensor networks (WSNs) have been widely deployed in applications such as agricultural and industrial monitoring, owing to their ease of deployment. Their low-cost nature makes WSNs particularly vulnerable to changes in extrinsic factors, i.e., the environment, and in intrinsic factors, i.e., hardware or software failures. Such problems can often be uncovered by detecting unexpected behaviors (anomalies) of devices. However, anomaly detection in WSNs faces the following challenges: (1) limited computation and connectivity, (2) the dynamicity of the environment and network topology, and (3) the need to take real-time actions in response to anomalies. In this paper, we propose a novel framework using optimal weighted one-class random forests for unsupervised anomaly detection that addresses these challenges in WSNs. Extensive experiments show that our framework is not only feasible but also outperforms state-of-the-art unsupervised methods in terms of both detection accuracy and resource utilization.
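A rough sketch of weighted one-class ensembling, using scikit-learn's IsolationForest on random feature subsets as a stand-in for the paper's one-class random forests; the weighting heuristic below (agreement on held-out normal data) is a crude proxy, not the paper's optimal weighting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))        # normal sensor readings
held_out = rng.normal(size=(100, 4))     # more normal data, used for weighting

# One detector per random feature subset (toy stand-in for the forest).
subsets = [rng.choice(4, size=2, replace=False) for _ in range(5)]
models = [IsolationForest(random_state=i).fit(train[:, s])
          for i, s in enumerate(subsets)]

# Weight each detector by how confidently it scores held-out normal data
# (higher mean decision_function means "more normal").
raw = np.array([m.decision_function(held_out[:, s]).mean()
                for m, s in zip(models, subsets)])
weights = np.clip(raw, 1e-6, None)
weights /= weights.sum()

def anomaly_score(x):
    """Weighted vote: negated decision_function, so higher = more anomalous."""
    return -sum(w * m.decision_function(x[None, s])[0]
                for w, m, s in zip(weights, models, subsets))

print(anomaly_score(np.array([8.0, 8.0, 8.0, 8.0])))   # far-off reading
```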
{"title":"Robust Distributed Anomaly Detection Using Optimal Weighted One-Class Random Forests","authors":"Yu-Lin Tsou, Hong-Min Chu, Cong Li, Shao-Wen Yang","doi":"10.1109/ICDM.2018.00171","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00171","url":null,"abstract":"Wireless sensor networks (WSNs) have been widely deployed in various applications, e.g., agricultural monitoring and industrial monitoring, for their ease-of-deployment. The low-cost nature makes WSNs particularly vulnerable to changes of extrinsic factors, i.e., the environment, or changes of intrinsic factors, i.e., hardware or software failures. The problem can, often times, be uncovered via detecting unexpected behaviors (anomalies) of devices. However, anomaly detection in WSNs is subject to the following challenges: (1) the limited computation and connectivity, (2) the dynamicity of the environment and network topology, and (3) the need of taking real-time actions in response to anomalies. In this paper, we propose a novel framework using optimal weighted one-class random forests for unsupervised anomaly detection to address the aforementioned challenges in WSNs. The ample experiments showed that our framework not only is feasible but also outperforms the state-of-the-art unsupervised methods in terms of both detection accuracy and resource utilization.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115239289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatial Contextualization for Closed Itemset Mining
Altobelli B. Mantuan, L. Fernandes
ICDM 2018. DOI: 10.1109/ICDM.2018.00155

We present the Spatial Contextualization for Closed Itemset Mining (SCIM) algorithm, an approach that builds a space for the target database in which relevant itemsets can be retrieved based on the relative spatial locations of their items. Our algorithm uses dual scaling to map the items of the database to a multidimensional space called the Solution Space. The representation of the database in the Solution Space assists in the interpretation and definition of overlapping clusters of related items. Therefore, instead of a minimum support threshold, a distance threshold is defined with respect to the reference and maximum distances computed per cluster during the mapping procedure. Closed itemsets are efficiently retrieved by a new procedure that uses an FP-Tree, a CFI-Tree, and the proposed spatial contextualization. Experiments show that the mean all-confidence of itemsets retrieved by our technique outperforms that of state-of-the-art algorithms. Additionally, we use the Minimum Description Length (MDL) metric to verify how descriptive the collections of mined patterns are.
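A sketch of the mapping step only: dual-scaling (correspondence-analysis style) item coordinates computed from a toy transaction-by-item incidence matrix via an SVD, so that co-occurring items land close together. The clustering, distance thresholds, and FP-Tree/CFI-Tree retrieval of SCIM are omitted, and this CA-style computation is our assumption of the scaling used.

```python
import numpy as np

# Binary transaction-by-item incidence matrix (toy database):
# items A, B co-occur in the first two transactions; C, D in the last two.
F = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

# Scale by row and column totals, then SVD the residual matrix.
P = F / F.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Item coordinates in a 2-D "Solution Space": related items land close.
item_coords = (Vt[:2].T * sv[:2]) / np.sqrt(c)[:, None]
for item, xy in zip("ABCD", item_coords):
    print(item, np.round(xy, 3))
```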
{"title":"Spatial Contextualization for Closed Itemset Mining","authors":"Altobelli B. Mantuan, L. Fernandes","doi":"10.1109/ICDM.2018.00155","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00155","url":null,"abstract":"We present the Spatial Contextualization for Closed Itemset Mining (SCIM) algorithm, an approach that builds a space for the target database in such a way that relevant itemsets can be retrieved regarding the relative spatial location of their items. Our algorithm uses Dual Scaling to map the items of the database to a multidimensional space called Solution Space. The representation of the database in the Solution Space assists in the interpretation and definition of overlapping clusters of related items. Therefore, instead of using the minimum support threshold, a distance threshold is defined concerning the reference and the maximum distances computed per cluster during the mapping procedure. Closed itemsets are efficiently retrieved by a new procedure that uses an FP-Tree, a CFI-Tree and the proposed spatial contextualization. Experiments show that the mean all-confidence measure of itemsets retrieved by our technique outperforms results from state-of-the-art algorithms. Additionally, we use the Minimum Description Length (MDL) metric to verify how descriptive are the collections of mined patterns.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123703943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Entire Regularization Path for Sparse Nonnegative Interaction Model
Mirai Takayanagi, Yasuo Tabei, Hiroto Saigo
ICDM 2018. DOI: 10.1109/ICDM.2018.00168

Building a sparse combinatorial model under a non-negativity constraint is essential for solving real-world problems, such as in biology, where the target response is often formulated as an additive linear combination of feature variables. This paper presents a solution that combines itemset mining with non-negative least squares. However, once modern regularization is incorporated, a naive solution must solve an expensive enumeration problem repeatedly, once for every regularization parameter. We therefore devise a regularization path tracking algorithm in which combinatorial features are searched and included in the solution set one by one. Our contribution is a proposal of novel bounds specifically designed for the feature search problem. On synthetic datasets, the proposed method is demonstrated to run orders of magnitude faster than a naive counterpart that does not employ tree pruning. We also show empirically that non-negativity constraints yield far fewer active features than LASSO, leading to significant speed-ups in pattern search. In experiments on an HIV-1 drug resistance dataset, the proposed method successfully models the rapidly increasing drug resistance triggered by the accumulation of mutations in HIV-1 genetic sequences. We also demonstrate the effectiveness of non-negativity constraints in suppressing false positive features, resulting in a model with fewer features and thereby improved interpretability.
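A minimal sketch of the base model only: non-negative least squares over explicitly enumerated pairwise interaction (itemset) features, via scipy. The paper's path tracking and pruning bounds, which avoid this explicit enumeration, are not shown; the synthetic response below is an assumption for illustration.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import nnls

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 5)).astype(float)   # binary features (e.g. mutations)
# Response: additive effects of item 0 and the itemset {1, 2}, plus noise.
y = 1.0 * X[:, 0] + 2.0 * X[:, 1] * X[:, 2] + rng.normal(0, 0.05, 60)

# Enumerate itemset features up to order 2 (the paper searches this
# space implicitly, with tree pruning, instead of enumerating it).
feats, names = [X], [(i,) for i in range(5)]
for i, j in combinations(range(5), 2):
    feats.append((X[:, i] * X[:, j])[:, None])
    names.append((i, j))
A = np.hstack(feats)

coef, _ = nnls(A, y)                     # non-negative least squares fit
for name, w in zip(names, coef):
    if w > 0.05:
        print(name, round(w, 2))         # non-negativity keeps the model sparse
```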
{"title":"Entire Regularization Path for Sparse Nonnegative Interaction Model","authors":"Mirai Takayanagi, Yasuo Tabei, Hiroto Saigo","doi":"10.1109/ICDM.2018.00168","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00168","url":null,"abstract":"Building sparse combinatorial model with non-negative constraint is essential in solving real-world problems such as in biology, in which the target response is often formulated by additive linear combination of features variables. This paper presents a solution to this problem by combining itemset mining with non-negative least squares. However, once incorporation of modern regularization is considered, then a naive solution requires to solve expensive enumeration problem many times for every regularization parameter. In this paper, we devise a regularization path tracking algorithm such that combinatorial feature is searched and included one by one to the solution set. Our contribution is a proposal of novel bounds specifically designed for the feature search problem. In synthetic dataset, the proposed method is demonstrated to run orders of magnitudes faster than a naive counterpart which does not employ tree pruning. We also empirically show that non-negativity constraints can reduce the number of active features much less than that of LASSO, leading to significant speed-ups in pattern search. In experiments using HIV-1 drug resistance dataset, the proposed method could successfully model the rapidly increasing drug resistance triggered by accumulation of mutations in HIV-1 genetic sequences. We also demonstrate the effectiveness of non-negativity constraints in suppressing false positive features, resulting in a model with smaller number of features and thereby improved interpretability.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"26 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113980420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}