Predicted Edit Distance Based Clustering of Gene Sequences
S. Pramanik, A. T. Islam, S. Sural
DOI: 10.1109/ICDM.2018.00160
Effective mining of the huge volumes of DNA and RNA fragments generated by next-generation sequencing (NGS) technologies is facilitated by efficient tools that partition these sequence fragments (reads) based on their level of similarity under edit distance. However, computing the edit distance between all pairs of sequence fragments when clustering such huge data sets is a significant performance bottleneck. In this paper we propose clustering based on a predicted edit distance, which significantly lowers clustering time. Existing clustering methods for sequence fragments, such as the k-mer-based VSEARCH and the locality-sensitive-hashing-based LSH-Div, achieve much lower clustering time but at the cost of significantly lower cluster quality. We show, through extensive performance analysis, that clustering based on the predicted edit distance produces clusters that are more than 99% accurate while being an order of magnitude faster than clustering based on the actual edit distance.
Bug Localization via Supervised Topic Modeling
Yaojing Wang, Yuan Yao, Hanghang Tong, Xuan Huo, Min Li, F. Xu, Jian Lu
DOI: 10.1109/ICDM.2018.00076
Bug tracking systems, which help to track reported software bugs, are widely used in software development and maintenance. In these systems, identifying the source files relevant to a given bug report among a large number of candidates is a time-consuming and labor-intensive task for developers. To tackle this problem, information retrieval methods have been widely used to capture either the textual similarities or the semantic similarities between bug reports and source files. However, these two types of similarity are usually considered separately, and historical bug fixes are largely ignored by existing methods. In this paper, we propose a supervised topic modeling method (STMLOCATOR) for automatically locating the source files relevant to a given bug report. The proposed model is built upon three key observations. First, supervised modeling can effectively make use of existing fixing histories. Second, certain words in bug reports tend to appear multiple times in their relevant source files. Third, longer source files tend to have more bugs. By integrating these three observations, STMLOCATOR utilizes historical fixes in a supervised way and learns both the textual and the semantic similarities between bug reports and source files. We further consider the special case of bug reports that contain stack traces and propose a variant of STMLOCATOR tailored to such reports. Experimental evaluations on three real data sets demonstrate that STMLOCATOR achieves up to 23.6% improvement in prediction accuracy over its best competitors and scales linearly with the size of the data. Moreover, the proposed variant further improves STMLOCATOR by up to 76.2% on bug reports with stack traces.
{"title":"Bug Localization via Supervised Topic Modeling","authors":"Yaojing Wang, Yuan Yao, Hanghang Tong, Xuan Huo, Min Li, F. Xu, Jian Lu","doi":"10.1109/ICDM.2018.00076","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00076","url":null,"abstract":"Bug tracking systems, which help to track the reported software bugs, have been widely used in software development and maintenance. In these systems, recognizing relevant source files among a large number of source files for a given bug report is a time-consuming and labor-intensive task for software developers. To tackle this problem, information retrieval methods have been widely used to capture either the textual similarities or the semantic similarities between bug reports and source files. However, these two types of similarities are usually considered separately and the historical bug fixings are largely ignored by the existing methods. In this paper, we propose a supervised topic modeling method (STMLOCATOR) for automatically locating the relevant source files for a given bug report. In particular, the proposed model is built upon three key observations. First, supervised modeling can effectively make use of the existing fixing histories. Second, certain words in bug reports tend to appear multiple times in their relevant source files. Third, longer source files tend to have more bugs. By integrating the above three observations, the proposed STMLOCATOR utilizes historical fixings in a supervised way and learns both the textual similarities and semantic similarities between bug reports and source files. We further consider a special type of bug reports with stack-traces in bug reports, and propose a variant of STMLOCATOR to tailor for such bug reports. Experimental evaluations on three real data sets demonstrate that the proposed STMLOCATOR can achieve up to 23.6% improvement in terms of prediction accuracy over its best competitors, and scales linearly with the size of the data. Moreover, the proposed variant further improves STMLOCATOR by up to 76.2% on those bug reports with stack-traces.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116941655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Tucker Factorization for Large-Scale Tensor Completion
Dongha Lee, Jaehyung Lee, Hwanjo Yu
DOI: 10.1109/ICDM.2018.00142
Tensor completion is the task of completing multi-aspect data represented as a tensor by accurately predicting its missing entries. It is mainly solved by tensor factorization methods, and among them Tucker factorization has attracted considerable interest due to its ability to learn latent factors and even their interactions. Although several Tucker methods have been developed to reduce memory and computational complexity, the state-of-the-art method still 1) generates redundant computations and 2) cannot factorize a large tensor that exceeds the size of memory. This paper proposes FTcom, a fast and scalable Tucker factorization method for tensor completion. FTcom performs element-wise updates of the factor matrices based on coordinate descent and adopts a novel caching algorithm that stores frequently required intermediate data. It also uses a tensor file for disk-based data processing, loading only a small part of the tensor into memory at a time. Experimental results show that FTcom is much faster and more scalable than all competitors. It significantly shortens the training time of Tucker factorization, especially on real-world tensors, and it can factorize a billion-scale tensor that is larger than the memory capacity of a single machine.
A Variable-Order Regime Switching Model to Identify Significant Patterns in Financial Markets
Philippe Chatigny, Rongbo Chen, Jean-Marc Patenaude, Shengrui Wang
DOI: 10.1109/ICDM.2018.00106
The identification and prediction of complex behaviors in time series are fundamental problems in financial data analysis. Autoregressive (AR) models and regime-switching (RS) models have been used successfully to study the behavior of financial time series. However, conventional RS models evaluate regimes using a fixed-order Markov chain, and the underlying patterns in the data are not considered in their design. In this paper, we propose a novel RS model that identifies and predicts regimes based on a weighted conditional probability distribution (WCPD) framework capable of discovering and exploiting significant underlying patterns in time series. Experimental results on stock market data covering 200 stocks suggest that the structures underlying financial market behavior exhibit different dynamics and can be leveraged to define regimes with superior prediction capability compared to traditional models.
{"title":"A Variable-Order Regime Switching Model to Identify Significant Patterns in Financial Markets","authors":"Philippe Chatigny, Rongbo Chen, Jean-Marc Patenaude, Shengrui Wang","doi":"10.1109/ICDM.2018.00106","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00106","url":null,"abstract":"The identification and prediction of complex behaviors in time series are fundamental problems of interest in the field of financial data analysis. Autoregressive (AR) model and Regime switching (RS) models have been used successfully to study the behaviors of financial time series. However, conventional RS models evaluate regimes by using a fixed-order Markov chain and underlying patterns in the data are not considered in their design. In this paper, we propose a novel RS model to identify and predict regimes based on a weighted conditional probability distribution (WCPD) framework capable of discovering and exploiting the significant underlying patterns in time series. Experimental results on stock market data, with 200 stocks, suggest that the structures underlying the financial market behaviors exhibit different dynamics and can be leveraged to better define regimes with superior prediction capabilities than traditional models.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116805521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Spatio-Temporal Correlations with Multiple 3D Convolutional Neural Networks for Citywide Vehicle Flow Prediction
Cen Chen, Kenli Li, S. Teo, Guizi Chen, Xiaofeng Zou, Xulei Yang, R. Vijay, Jiashi Feng, Zeng Zeng
DOI: 10.1109/ICDM.2018.00107
Predicting vehicle flows is of great importance to traffic management and public safety in smart cities, and it is very challenging because flows are affected by many complex factors, such as spatio-temporal dependencies and external factors (e.g., holidays, events, and weather). Recently, deep learning has shown remarkable performance on traditionally challenging tasks, such as image classification, due to its powerful feature learning capabilities. Some works have utilized LSTMs to connect the high-level layers of 2D convolutional neural networks (CNNs) to learn spatio-temporal features and have shown better performance than many classical methods in traffic prediction. However, these works only build temporal connections on the high-level features at the top layer, leaving the spatio-temporal correlations in the low-level layers not fully exploited. In this paper, we propose to apply 3D CNNs to learn spatio-temporal correlation features jointly, from low-level to high-level layers, for traffic data. We also design an end-to-end structure, named MST3D, especially for vehicle flow prediction. MST3D learns spatial and multiple temporal dependencies jointly through multiple 3D CNNs, combines the learned features with external factors, and dynamically assigns different weights to different branches. To the best of our knowledge, it is the first framework that utilizes 3D CNNs for traffic prediction. Experiments on two vehicle flow datasets, from Beijing and New York City, demonstrate that the proposed framework, MST3D, outperforms state-of-the-art methods.
{"title":"Exploiting Spatio-Temporal Correlations with Multiple 3D Convolutional Neural Networks for Citywide Vehicle Flow Prediction","authors":"Cen Chen, Kenli Li, S. Teo, Guizi Chen, Xiaofeng Zou, Xulei Yang, R. Vijay, Jiashi Feng, Zeng Zeng","doi":"10.1109/ICDM.2018.00107","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00107","url":null,"abstract":"Predicting vehicle flows is of great importance to traffic management and public safety in smart cities, and very challenging as it is affected by many complex factors, such as spatio-temporal dependencies with external factors (e.g., holidays, events and weather). Recently, deep learning has shown remarkable performance on traditional challenging tasks, such as image classification, due to its powerful feature learning capabilities. Some works have utilized LSTMs to connect the high-level layers of 2D convolutional neural networks (CNNs) to learn the spatio-temporal features, and have shown better performance as compared to many classical methods in traffic prediction. However, these works only build temporal connections on the high-level features at the top layer while leaving the spatio-temporal correlations in the low-level layers not fully exploited. In this paper, we propose to apply 3D CNNs to learn the spatio-temporal correlation features jointly from lowlevel to high-level layers for traffic data. We also design an end-to-end structure, named as MST3D, especially for vehicle flow prediction. MST3D can learn spatial and multiple temporal dependencies jointly by multiple 3D CNNs, combine the learned features with external factors and assign different weights to different branches dynamically. To the best of our knowledge, it is the first framework that utilizes 3D CNNs for traffic prediction. Experiments on two vehicle flow datasets Beijing and New York City have demonstrated that the proposed framework, MST3D, outperforms the state-of-the-art methods.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116937870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EDLT: Enabling Deep Learning for Generic Data Classification
Huimei Han, Xingquan Zhu, Ying Li
DOI: 10.1109/ICDM.2018.00030
This paper proposes to enable deep learning for generic machine learning tasks. Our goal is to allow deep learning to be applied to data that are already represented in instance-feature tabular format, for better classification accuracy. Because deep learning relies on spatial/temporal correlation to learn new feature representations, our theme is to convert each instance of the original dataset into a synthetic matrix format to take full advantage of the feature learning power of deep learning methods. To maximize the correlation within the matrix, we use 0/1 optimization to reorder features such that those with strong correlations are adjacent to each other. By using a two-dimensional feature reordering, we are able to create a synthetic matrix, as an image, to represent each instance. Because the synthetic image preserves the original feature values and data correlation, existing deep learning algorithms, such as convolutional neural networks (CNNs), can be applied to learn effective features for classification. Our experiments on 20 generic datasets, using a CNN as the deep learning classifier, confirm that enabling deep learning on generic datasets yields clear performance gains compared to conventional machine learning methods. In addition, the proposed method consistently outperforms simple baselines that apply a CNN directly to the generic datasets. As a result, our research allows deep learning to be broadly applied to generic datasets for learning and classification (algorithm source code is available at http://github.com/hhmzwc/EDLT).
Deep Heterogeneous Autoencoders for Collaborative Filtering
Tianyu Li, Yukun Ma, Jiu Xu, B. Stenger, Chen Liu, Yu Hirate
DOI: 10.1109/ICDM.2018.00153
This paper leverages heterogeneous auxiliary information to address the data sparsity problem of recommender systems. We propose a model that learns a shared feature space from heterogeneous data, such as item descriptions, product tags, and online purchase history, to obtain better predictions. Our model consists of autoencoders not only for numerical and categorical data but also for sequential data, which enables it to capture user tastes, item characteristics, and the recent dynamics of user preference. We learn the autoencoder architecture for each data source independently in order to better model its statistical properties. Our evaluation on two MovieLens datasets and an e-commerce dataset shows that mean average precision and recall improve over state-of-the-art methods.
Heterogeneous Embedding Propagation for Large-Scale E-Commerce User Alignment
V. Zheng, M. Sha, Yuchen Li, Hongxia Yang, Yuan Fang, Zhenjie Zhang, K. Tan, K. Chang
DOI: 10.1109/ICDM.2018.00198
We study the important problem of user alignment in e-commerce: predicting whether two online user identities that access an e-commerce site from different devices belong to the same real-world person. As input, we have a set of user activity logs from Taobao and some labeled user identity linkages. User activity logs can be modeled as a heterogeneous interaction graph (HIG), and the user alignment task can then be formulated as a semi-supervised HIG embedding problem. HIG embedding is challenging for two reasons: its heterogeneous nature and the presence of edge features. To address these challenges, we propose a novel Heterogeneous Embedding Propagation (HEP) model. The core idea is to iteratively reconstruct a node's embedding from its heterogeneous neighbors in a weighted manner and, meanwhile, propagate its embedding updates from the reconstruction loss and/or classification loss to its neighbors. We conduct extensive experiments on large-scale datasets from Taobao, demonstrating that HEP significantly outperforms state-of-the-art baselines, often by more than 10% in F-score.
{"title":"Heterogeneous Embedding Propagation for Large-Scale E-Commerce User Alignment","authors":"V. Zheng, M. Sha, Yuchen Li, Hongxia Yang, Yuan Fang, Zhenjie Zhang, K. Tan, K. Chang","doi":"10.1109/ICDM.2018.00198","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00198","url":null,"abstract":"We study the important problem of user alignment in e-commerce: to predict whether two online user identities that access an e-commerce site from different devices belong to one real-world person. As input, we have a set of user activity logs from Taobao and some labeled user identity linkages. User activity logs can be modeled using a heterogeneous interaction graph (HIG), and subsequently the user alignment task can be formulated as a semi-supervised HIG embedding problem. HIG embedding is challenging for two reasons: its heterogeneous nature and the presence of edge features. To address the challenges, we propose a novel Heterogeneous Embedding Propagation (HEP) model. The core idea is to iteratively reconstruct a node's embedding from its heterogeneous neighbors in a weighted manner, and meanwhile propagate its embedding updates from reconstruction loss and/or classification loss to its neighbors. We conduct extensive experiments on large-scale datasets from Taobao, demonstrating that HEP significantly outperforms state-of-the-art baselines often by more than 10% in F-scores.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115647257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust Regression via Online Feature Selection Under Adversarial Data Corruption
Xuchao Zhang, Shuo Lei, Liang Zhao, Arnold P. Boedihardjo, Chang-Tien Lu
DOI: 10.1109/ICDM.2018.00199
The presence of data corruption in user-generated streaming data, such as social media, motivates a new fundamental problem: learning reliable regression coefficients when features are not entirely accessible at one time. Until now, several important challenges could not be handled concurrently: 1) estimating from corrupted data when only partial features are accessible; 2) online feature selection when the data contain adversarial corruption; and 3) scaling to massive datasets. This paper proposes a novel RObust regression algorithm via Online Feature Selection (RoOFS) that addresses all of these challenges concurrently. Specifically, the algorithm iteratively updates the regression coefficients and the uncorrupted set via a robust online feature substitution method. Extensive empirical experiments on both synthetic and real-world data sets demonstrate that our new method is superior to existing methods in recovering both the selected features and the regression coefficients, with very competitive efficiency.
{"title":"Robust Regression via Online Feature Selection Under Adversarial Data Corruption","authors":"Xuchao Zhang, Shuo Lei, Liang Zhao, Arnold P. Boedihardjo, Chang-Tien Lu","doi":"10.1109/ICDM.2018.00199","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00199","url":null,"abstract":"The presence of data corruption in user-generated streaming data, such as social media, motivates a new fundamental problem that learns reliable regression coefficient when features are not accessible entirely at one time. Until now, several important challenges still cannot be handled concurrently: 1) corrupted data estimation when only partial features are accessible; 2) online feature selection when data contains adversarial corruption; and 3) scaling to a massive dataset. This paper proposes a novel RObust regression algorithm via Online Feature Selection (RoOFS) that concurrently addresses all the above challenges. Specifically, the algorithm iteratively updates the regression coefficients and the uncorrupted set via a robust online feature substitution method. Extensive empirical experiments in both synthetic and real-world data sets demonstrated that the effectiveness of our new method is superior to that of existing methods in the recovery of both feature selection and regression coefficients, with very competitive efficiency.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122886229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diagnosis Prediction via Medical Context Attention Networks Using Deep Generative Modeling
Wonsung Lee, Sungrae Park, Weonyoung Joo, Il-Chul Moon
DOI: 10.1109/ICDM.2018.00143
Predicting the clinical outcomes of patients from historical electronic health records (EHRs) is a fundamental research area in medical informatics. Although EHRs contain various records for each patient, existing work has mainly dealt with diagnosis codes by employing recurrent neural networks (RNNs) with a simple attention mechanism. This type of sequence modeling often ignores the heterogeneity of EHRs: it only considers historical diagnoses and does not incorporate patient demographics, which provide clinically essential context, into the sequence modeling. To address this issue, we investigate an attention mechanism tailored to medical context for predicting a future diagnosis. We propose a medical context attention (MCA)-based RNN that is composed of an attention-based RNN and a conditional deep generative model. The novel attention mechanism utilizes individual patient information derived from a conditional variational autoencoder (CVAE). The CVAE models the conditional distribution of patient embeddings given demographics, providing a measure of the patient's phenotypic difference due to illness. Experimental results show the effectiveness of the proposed model.
{"title":"Diagnosis Prediction via Medical Context Attention Networks Using Deep Generative Modeling","authors":"Wonsung Lee, Sungrae Park, Weonyoung Joo, Il-Chul Moon","doi":"10.1109/ICDM.2018.00143","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00143","url":null,"abstract":"Predicting the clinical outcome of patients from the historical electronic health records (EHRs) is a fundamental research area in medical informatics. Although EHRs contain various records associated with each patient, the existing work mainly dealt with the diagnosis codes by employing recurrent neural networks (RNNs) with a simple attention mechanism. This type of sequence modeling often ignores the heterogeneity of EHRs. In other words, it only considers historical diagnoses and does not incorporate patient demographics, which correspond to clinically essential context, into the sequence modeling. To address the issue, we aim at investigating the use of an attention mechanism that is tailored to medical context to predict a future diagnosis. We propose a medical context attention (MCA)-based RNN that is composed of an attention-based RNN and a conditional deep generative model. The novel attention mechanism utilizes the derived individual patient information from conditional variational autoencoders (CVAEs). The CVAE models a conditional distribution of patient embeddings and his/her demographics to provide the measurement of patient's phenotypic difference due to illness. Experimental results showed the effectiveness of the proposed model.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128842479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}