Predicted Edit Distance Based Clustering of Gene Sequences
S. Pramanik, A. T. Islam, S. Sural
DOI: 10.1109/ICDM.2018.00160
Effective mining of the huge volumes of DNA and RNA fragments generated by next-generation sequencing (NGS) technologies is facilitated by efficient tools that partition these sequence fragments (reads) based on their level of similarity under edit distance. However, computing the edit distance between all pairs of sequence fragments when clustering such huge data sets is a significant performance bottleneck. In this paper we propose clustering based on a predicted edit distance, which significantly lowers clustering time. Existing clustering methods for sequence fragments, such as the k-mer-based VSEARCH and the locality-sensitive-hashing-based LSH-Div, achieve much lower clustering time but at the cost of significantly lower cluster quality. We show, through extensive performance analysis, that clustering based on the predicted edit distance produces clusters that are more than 99% accurate while being an order of magnitude faster than clustering based on the actual edit distance.
Bug Localization via Supervised Topic Modeling
Yaojing Wang, Yuan Yao, Hanghang Tong, Xuan Huo, Min Li, F. Xu, Jian Lu
DOI: 10.1109/ICDM.2018.00076
Bug tracking systems, which help to track reported software bugs, are widely used in software development and maintenance. In these systems, identifying the source files relevant to a given bug report among a large number of candidates is a time-consuming and labor-intensive task for developers. To tackle this problem, information retrieval methods have been widely used to capture either the textual similarities or the semantic similarities between bug reports and source files. However, these two types of similarity are usually considered separately, and historical bug fixes are largely ignored by existing methods. In this paper, we propose a supervised topic modeling method (STMLOCATOR) for automatically locating the source files relevant to a given bug report. The proposed model is built upon three key observations. First, supervised modeling can effectively make use of existing fixing histories. Second, certain words in bug reports tend to appear multiple times in their relevant source files. Third, longer source files tend to have more bugs. By integrating these three observations, STMLOCATOR utilizes historical fixes in a supervised way and learns both the textual and the semantic similarities between bug reports and source files. We further consider the special case of bug reports that contain stack traces and propose a variant of STMLOCATOR tailored to such reports. Experimental evaluations on three real data sets demonstrate that STMLOCATOR achieves up to 23.6% improvement in prediction accuracy over its best competitors and scales linearly with the size of the data. Moreover, the proposed variant further improves STMLOCATOR by up to 76.2% on bug reports with stack traces.
{"title":"Bug Localization via Supervised Topic Modeling","authors":"Yaojing Wang, Yuan Yao, Hanghang Tong, Xuan Huo, Min Li, F. Xu, Jian Lu","doi":"10.1109/ICDM.2018.00076","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00076","url":null,"abstract":"Bug tracking systems, which help to track the reported software bugs, have been widely used in software development and maintenance. In these systems, recognizing relevant source files among a large number of source files for a given bug report is a time-consuming and labor-intensive task for software developers. To tackle this problem, information retrieval methods have been widely used to capture either the textual similarities or the semantic similarities between bug reports and source files. However, these two types of similarities are usually considered separately and the historical bug fixings are largely ignored by the existing methods. In this paper, we propose a supervised topic modeling method (STMLOCATOR) for automatically locating the relevant source files for a given bug report. In particular, the proposed model is built upon three key observations. First, supervised modeling can effectively make use of the existing fixing histories. Second, certain words in bug reports tend to appear multiple times in their relevant source files. Third, longer source files tend to have more bugs. By integrating the above three observations, the proposed STMLOCATOR utilizes historical fixings in a supervised way and learns both the textual similarities and semantic similarities between bug reports and source files. We further consider a special type of bug reports with stack-traces in bug reports, and propose a variant of STMLOCATOR to tailor for such bug reports. Experimental evaluations on three real data sets demonstrate that the proposed STMLOCATOR can achieve up to 23.6% improvement in terms of prediction accuracy over its best competitors, and scales linearly with the size of the data. Moreover, the proposed variant further improves STMLOCATOR by up to 76.2% on those bug reports with stack-traces.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116941655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Tucker Factorization for Large-Scale Tensor Completion
Dongha Lee, Jaehyung Lee, Hwanjo Yu
DOI: 10.1109/ICDM.2018.00142
Tensor completion is the task of completing multi-aspect data represented as a tensor by accurately predicting its missing entries. It is mainly solved by tensor factorization methods, and among them Tucker factorization has attracted considerable interest due to its ability to learn latent factors and even their interactions. Although several Tucker methods have been developed to reduce memory and computational complexity, the state-of-the-art method still 1) generates redundant computations and 2) cannot factorize a large tensor that exceeds the size of memory. This paper proposes FTcom, a fast and scalable Tucker factorization method for tensor completion. FTcom performs element-wise updates of the factor matrices based on coordinate descent and adopts a novel caching algorithm that stores frequently required intermediate data. It also uses a tensor file for disk-based data processing, loading only a small part of the tensor into memory at a time. Experimental results show that FTcom is much faster and more scalable than all competitors. It significantly shortens the training time of Tucker factorization, especially on real-world tensors, and it can factorize a billion-scale tensor that is larger than the memory capacity of a single machine.
A Variable-Order Regime Switching Model to Identify Significant Patterns in Financial Markets
Philippe Chatigny, Rongbo Chen, Jean-Marc Patenaude, Shengrui Wang
DOI: 10.1109/ICDM.2018.00106
The identification and prediction of complex behaviors in time series are fundamental problems in financial data analysis. Autoregressive (AR) models and regime-switching (RS) models have been used successfully to study the behavior of financial time series. However, conventional RS models evaluate regimes using a fixed-order Markov chain, and the underlying patterns in the data are not considered in their design. In this paper, we propose a novel RS model that identifies and predicts regimes based on a weighted conditional probability distribution (WCPD) framework capable of discovering and exploiting significant underlying patterns in time series. Experimental results on stock market data covering 200 stocks suggest that the structures underlying financial market behavior exhibit different dynamics and can be leveraged to define regimes with superior prediction capability compared to traditional models.
{"title":"A Variable-Order Regime Switching Model to Identify Significant Patterns in Financial Markets","authors":"Philippe Chatigny, Rongbo Chen, Jean-Marc Patenaude, Shengrui Wang","doi":"10.1109/ICDM.2018.00106","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00106","url":null,"abstract":"The identification and prediction of complex behaviors in time series are fundamental problems of interest in the field of financial data analysis. Autoregressive (AR) model and Regime switching (RS) models have been used successfully to study the behaviors of financial time series. However, conventional RS models evaluate regimes by using a fixed-order Markov chain and underlying patterns in the data are not considered in their design. In this paper, we propose a novel RS model to identify and predict regimes based on a weighted conditional probability distribution (WCPD) framework capable of discovering and exploiting the significant underlying patterns in time series. Experimental results on stock market data, with 200 stocks, suggest that the structures underlying the financial market behaviors exhibit different dynamics and can be leveraged to better define regimes with superior prediction capabilities than traditional models.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116805521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Spatio-Temporal Correlations with Multiple 3D Convolutional Neural Networks for Citywide Vehicle Flow Prediction
Cen Chen, Kenli Li, S. Teo, Guizi Chen, Xiaofeng Zou, Xulei Yang, R. Vijay, Jiashi Feng, Zeng Zeng
DOI: 10.1109/ICDM.2018.00107
Predicting vehicle flows is of great importance to traffic management and public safety in smart cities, and it is very challenging because flows are affected by many complex factors, such as spatio-temporal dependencies and external factors (e.g., holidays, events, and weather). Recently, deep learning has shown remarkable performance on traditionally challenging tasks, such as image classification, due to its powerful feature learning capabilities. Some works have utilized LSTMs to connect the high-level layers of 2D convolutional neural networks (CNNs) to learn spatio-temporal features and have shown better performance than many classical methods in traffic prediction. However, these works only build temporal connections on the high-level features at the top layer, leaving the spatio-temporal correlations in the low-level layers not fully exploited. In this paper, we propose to apply 3D CNNs to learn spatio-temporal correlation features jointly, from low-level to high-level layers, for traffic data. We also design an end-to-end structure, named MST3D, especially for vehicle flow prediction. MST3D learns spatial and multiple temporal dependencies jointly through multiple 3D CNNs, combines the learned features with external factors, and dynamically assigns different weights to different branches. To the best of our knowledge, it is the first framework that utilizes 3D CNNs for traffic prediction. Experiments on two vehicle flow datasets, from Beijing and New York City, demonstrate that the proposed framework, MST3D, outperforms state-of-the-art methods.
{"title":"Exploiting Spatio-Temporal Correlations with Multiple 3D Convolutional Neural Networks for Citywide Vehicle Flow Prediction","authors":"Cen Chen, Kenli Li, S. Teo, Guizi Chen, Xiaofeng Zou, Xulei Yang, R. Vijay, Jiashi Feng, Zeng Zeng","doi":"10.1109/ICDM.2018.00107","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00107","url":null,"abstract":"Predicting vehicle flows is of great importance to traffic management and public safety in smart cities, and very challenging as it is affected by many complex factors, such as spatio-temporal dependencies with external factors (e.g., holidays, events and weather). Recently, deep learning has shown remarkable performance on traditional challenging tasks, such as image classification, due to its powerful feature learning capabilities. Some works have utilized LSTMs to connect the high-level layers of 2D convolutional neural networks (CNNs) to learn the spatio-temporal features, and have shown better performance as compared to many classical methods in traffic prediction. However, these works only build temporal connections on the high-level features at the top layer while leaving the spatio-temporal correlations in the low-level layers not fully exploited. In this paper, we propose to apply 3D CNNs to learn the spatio-temporal correlation features jointly from lowlevel to high-level layers for traffic data. We also design an end-to-end structure, named as MST3D, especially for vehicle flow prediction. MST3D can learn spatial and multiple temporal dependencies jointly by multiple 3D CNNs, combine the learned features with external factors and assign different weights to different branches dynamically. To the best of our knowledge, it is the first framework that utilizes 3D CNNs for traffic prediction. Experiments on two vehicle flow datasets Beijing and New York City have demonstrated that the proposed framework, MST3D, outperforms the state-of-the-art methods.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116937870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EDLT: Enabling Deep Learning for Generic Data Classification
Huimei Han, Xingquan Zhu, Ying Li
DOI: 10.1109/ICDM.2018.00030
This paper proposes to enable deep learning for generic machine learning tasks. Our goal is to allow deep learning to be applied to data that are already represented in instance-feature tabular format, for better classification accuracy. Because deep learning relies on spatial/temporal correlation to learn new feature representations, our theme is to convert each instance of the original dataset into a synthetic matrix format to take full advantage of the feature learning power of deep learning methods. To maximize the correlation within the matrix, we use 0/1 optimization to reorder features such that those with strong correlations are adjacent to each other. By using a two-dimensional feature reordering, we are able to create a synthetic matrix, as an image, to represent each instance. Because the synthetic image preserves the original feature values and data correlation, existing deep learning algorithms, such as convolutional neural networks (CNNs), can be applied to learn effective features for classification. Our experiments on 20 generic datasets, using a CNN as the deep learning classifier, confirm that enabling deep learning on generic datasets yields clear performance gains compared to conventional machine learning methods. In addition, the proposed method consistently outperforms simple baselines that apply a CNN directly to the generic datasets. As a result, our research allows deep learning to be broadly applied to generic datasets for learning and classification (algorithm source code is available at http://github.com/hhmzwc/EDLT).
Deep Heterogeneous Autoencoders for Collaborative Filtering
Tianyu Li, Yukun Ma, Jiu Xu, B. Stenger, Chen Liu, Yu Hirate
DOI: 10.1109/ICDM.2018.00153
This paper leverages heterogeneous auxiliary information to address the data sparsity problem of recommender systems. We propose a model that learns a shared feature space from heterogeneous data, such as item descriptions, product tags, and online purchase history, to obtain better predictions. Our model consists of autoencoders not only for numerical and categorical data but also for sequential data, which enables it to capture user tastes, item characteristics, and the recent dynamics of user preference. We learn the autoencoder architecture for each data source independently in order to better model its statistical properties. Our evaluation on two MovieLens datasets and an e-commerce dataset shows that mean average precision and recall improve over state-of-the-art methods.
Heterogeneous Embedding Propagation for Large-Scale E-Commerce User Alignment
V. Zheng, M. Sha, Yuchen Li, Hongxia Yang, Yuan Fang, Zhenjie Zhang, K. Tan, K. Chang
DOI: 10.1109/ICDM.2018.00198
We study the important problem of user alignment in e-commerce: predicting whether two online user identities that access an e-commerce site from different devices belong to the same real-world person. As input, we have a set of user activity logs from Taobao and some labeled user identity linkages. User activity logs can be modeled as a heterogeneous interaction graph (HIG), and the user alignment task can then be formulated as a semi-supervised HIG embedding problem. HIG embedding is challenging for two reasons: its heterogeneous nature and the presence of edge features. To address these challenges, we propose a novel Heterogeneous Embedding Propagation (HEP) model. The core idea is to iteratively reconstruct a node's embedding from its heterogeneous neighbors in a weighted manner and, meanwhile, propagate its embedding updates from the reconstruction loss and/or classification loss to its neighbors. We conduct extensive experiments on large-scale datasets from Taobao, demonstrating that HEP significantly outperforms state-of-the-art baselines, often by more than 10% in F-score.
{"title":"Heterogeneous Embedding Propagation for Large-Scale E-Commerce User Alignment","authors":"V. Zheng, M. Sha, Yuchen Li, Hongxia Yang, Yuan Fang, Zhenjie Zhang, K. Tan, K. Chang","doi":"10.1109/ICDM.2018.00198","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00198","url":null,"abstract":"We study the important problem of user alignment in e-commerce: to predict whether two online user identities that access an e-commerce site from different devices belong to one real-world person. As input, we have a set of user activity logs from Taobao and some labeled user identity linkages. User activity logs can be modeled using a heterogeneous interaction graph (HIG), and subsequently the user alignment task can be formulated as a semi-supervised HIG embedding problem. HIG embedding is challenging for two reasons: its heterogeneous nature and the presence of edge features. To address the challenges, we propose a novel Heterogeneous Embedding Propagation (HEP) model. The core idea is to iteratively reconstruct a node's embedding from its heterogeneous neighbors in a weighted manner, and meanwhile propagate its embedding updates from reconstruction loss and/or classification loss to its neighbors. We conduct extensive experiments on large-scale datasets from Taobao, demonstrating that HEP significantly outperforms state-of-the-art baselines often by more than 10% in F-scores.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115647257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust Regression via Online Feature Selection Under Adversarial Data Corruption
Xuchao Zhang, Shuo Lei, Liang Zhao, Arnold P. Boedihardjo, Chang-Tien Lu
DOI: 10.1109/ICDM.2018.00199
The presence of data corruption in user-generated streaming data, such as social media, motivates a new fundamental problem: learning reliable regression coefficients when features are not entirely accessible at one time. Until now, several important challenges could not be handled concurrently: 1) estimating from corrupted data when only partial features are accessible; 2) online feature selection when the data contain adversarial corruption; and 3) scaling to massive datasets. This paper proposes a novel RObust regression algorithm via Online Feature Selection (RoOFS) that addresses all of these challenges concurrently. Specifically, the algorithm iteratively updates the regression coefficients and the uncorrupted set via a robust online feature substitution method. Extensive empirical experiments on both synthetic and real-world data sets demonstrate that our new method is superior to existing methods in recovering both the selected features and the regression coefficients, with very competitive efficiency.
{"title":"Robust Regression via Online Feature Selection Under Adversarial Data Corruption","authors":"Xuchao Zhang, Shuo Lei, Liang Zhao, Arnold P. Boedihardjo, Chang-Tien Lu","doi":"10.1109/ICDM.2018.00199","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00199","url":null,"abstract":"The presence of data corruption in user-generated streaming data, such as social media, motivates a new fundamental problem that learns reliable regression coefficient when features are not accessible entirely at one time. Until now, several important challenges still cannot be handled concurrently: 1) corrupted data estimation when only partial features are accessible; 2) online feature selection when data contains adversarial corruption; and 3) scaling to a massive dataset. This paper proposes a novel RObust regression algorithm via Online Feature Selection (RoOFS) that concurrently addresses all the above challenges. Specifically, the algorithm iteratively updates the regression coefficients and the uncorrupted set via a robust online feature substitution method. Extensive empirical experiments in both synthetic and real-world data sets demonstrated that the effectiveness of our new method is superior to that of existing methods in the recovery of both feature selection and regression coefficients, with very competitive efficiency.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122886229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diagnosis Prediction via Medical Context Attention Networks Using Deep Generative Modeling
Wonsung Lee, Sungrae Park, Weonyoung Joo, Il-Chul Moon
DOI: 10.1109/ICDM.2018.00143
Predicting the clinical outcomes of patients from historical electronic health records (EHRs) is a fundamental research area in medical informatics. Although EHRs contain various records for each patient, existing work has mainly dealt with diagnosis codes by employing recurrent neural networks (RNNs) with a simple attention mechanism. This type of sequence modeling often ignores the heterogeneity of EHRs: it only considers historical diagnoses and does not incorporate patient demographics, which provide clinically essential context, into the sequence modeling. To address this issue, we investigate an attention mechanism tailored to medical context for predicting a future diagnosis. We propose a medical context attention (MCA)-based RNN that is composed of an attention-based RNN and a conditional deep generative model. The novel attention mechanism utilizes individual patient information derived from a conditional variational autoencoder (CVAE). The CVAE models the conditional distribution of patient embeddings given demographics, providing a measure of the patient's phenotypic difference due to illness. Experimental results show the effectiveness of the proposed model.
{"title":"Diagnosis Prediction via Medical Context Attention Networks Using Deep Generative Modeling","authors":"Wonsung Lee, Sungrae Park, Weonyoung Joo, Il-Chul Moon","doi":"10.1109/ICDM.2018.00143","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00143","url":null,"abstract":"Predicting the clinical outcome of patients from the historical electronic health records (EHRs) is a fundamental research area in medical informatics. Although EHRs contain various records associated with each patient, the existing work mainly dealt with the diagnosis codes by employing recurrent neural networks (RNNs) with a simple attention mechanism. This type of sequence modeling often ignores the heterogeneity of EHRs. In other words, it only considers historical diagnoses and does not incorporate patient demographics, which correspond to clinically essential context, into the sequence modeling. To address the issue, we aim at investigating the use of an attention mechanism that is tailored to medical context to predict a future diagnosis. We propose a medical context attention (MCA)-based RNN that is composed of an attention-based RNN and a conditional deep generative model. The novel attention mechanism utilizes the derived individual patient information from conditional variational autoencoders (CVAEs). The CVAE models a conditional distribution of patient embeddings and his/her demographics to provide the measurement of patient's phenotypic difference due to illness. Experimental results showed the effectiveness of the proposed model.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128842479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}