For many spatio-temporal applications, building regression models that can reproduce the true data distribution is often as important as building models with high prediction accuracy. For example, knowing the future distribution of daily temperature and precipitation can help scientists determine their long-term trends and assess their potential impact on human and natural systems. As conventional methods are designed to minimize residual errors, the shape of their predicted distribution may not be consistent with their actual distribution. To overcome this challenge, this paper presents a novel, distribution-preserving multi-task learning framework for multi-location prediction of spatio-temporal data. The framework employs a non-parametric density estimation approach with L2-distance to measure the divergence between the predicted and true distribution of the data. Experimental results using climate data from more than 1500 weather stations in the United States show that the proposed framework reduces the distribution error for more than 78% of the stations without degrading the prediction accuracy significantly.
{"title":"Distribution Preserving Multi-task Regression for Spatio-Temporal Data","authors":"Xi Liu, P. Tan, Zubin Abraham, L. Luo, P. Hatami","doi":"10.1109/ICDM.2018.00148","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00148","url":null,"abstract":"For many spatio-temporal applications, building regression models that can reproduce the true data distribution is often as important as building models with high prediction accuracy. For example, knowing the future distribution of daily temperature and precipitation can help scientists determine their long-term trends and assess their potential impact on human and natural systems. As conventional methods are designed to minimize residual errors, the shape of their predicted distribution may not be consistent with their actual distribution. To overcome this challenge, this paper presents a novel, distribution-preserving multi-task learning framework for multi-location prediction of spatio-temporal data. The framework employs a non-parametric density estimation approach with L2-distance to measure the divergence between the predicted and true distribution of the data. Experimental results using climate data from more than 1500 weather stations in the United States show that the proposed framework reduces the distribution error for more than 78% of the stations without degrading the prediction accuracy significantly.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125131877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In real-world recommendation scenarios, there are two common phenomena: 1) users only provide ratings but there is no review comment. As a result, the historical transaction data available for recommender system are usually unbalanced and sparse; 2) Users' opinions can be better grasped in their reviews than ratings. This indicates that there is always a bias between ratings and reviews. Therefore, it is important that users' ratings and reviews should be mutually reinforced to grasp the users' true opinions. To this end, in this paper, we develop an opinion mining model based on convolutional neural networks for enhancing recommendation (NeuO). Specifically, we exploit a two-step training neural networks, which utilize both reviews and ratings to grasp users' true opinions in unbalanced data. Moreover, we propose a Sentiment Classification scoring method (SC), which employs dual attention vectors to predict the users' sentiment scores of their reviews. A combination function is designed to use the results of SC and user-item rating matrix to catch the opinion bias. Finally, a Multilayer perceptron based Matrix Factorization (MMF) method is proposed to make recommendations with the enhanced user-item matrix. Extensive experiments on real-world data demonstrate that our approach can achieve a superior performance over state-of-the-art baselines on real-world datasets.
{"title":"Exploiting the Sentimental Bias between Ratings and Reviews for Enhancing Recommendation","authors":"Yuanbo Xu, Yongjian Yang, Jiayu Han, E. Wang, Fuzhen Zhuang, Hui Xiong","doi":"10.1109/ICDM.2018.00185","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00185","url":null,"abstract":"In real-world recommendation scenarios, there are two common phenomena: 1) users only provide ratings but there is no review comment. As a result, the historical transaction data available for recommender system are usually unbalanced and sparse; 2) Users' opinions can be better grasped in their reviews than ratings. This indicates that there is always a bias between ratings and reviews. Therefore, it is important that users' ratings and reviews should be mutually reinforced to grasp the users' true opinions. To this end, in this paper, we develop an opinion mining model based on convolutional neural networks for enhancing recommendation (NeuO). Specifically, we exploit a two-step training neural networks, which utilize both reviews and ratings to grasp users' true opinions in unbalanced data. Moreover, we propose a Sentiment Classification scoring method (SC), which employs dual attention vectors to predict the users' sentiment scores of their reviews. A combination function is designed to use the results of SC and user-item rating matrix to catch the opinion bias. Finally, a Multilayer perceptron based Matrix Factorization (MMF) method is proposed to make recommendations with the enhanced user-item matrix. Extensive experiments on real-world data demonstrate that our approach can achieve a superior performance over state-of-the-art baselines on real-world datasets.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"10 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125933450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The extraction of featured snippet can be considered as the problem of Question Answering (QA). This paper presents a featured snippet extraction system by employing a technique of machine reading comprehension (MRC). Specifically, we first analyze the characteristics of questions with different types and their corresponding answers. Then, we classify a given question into various types, which is incorporated as key features in the subsequent model configuration. Based on that, we present a model to extract the candidate passages from recalled documents in a MRC fashion. Next, a novel MRC model with multiple stages of attention is proposed to extract answers from the selected passages. Last, in the answer re-ranking stage, we design a question type-adaptive model to produce the final answer. The experimental results on two open-domain QA Datasets clearly validate the effectiveness of our system and models in featured snippet extraction.
{"title":"A Machine Reading Comprehension-Based Approach for Featured Snippet Extraction","authors":"Chen Zhang, Xuanyu Zhang, Hao Wang","doi":"10.1109/ICDM.2018.00195","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00195","url":null,"abstract":"The extraction of featured snippet can be considered as the problem of Question Answering (QA). This paper presents a featured snippet extraction system by employing a technique of machine reading comprehension (MRC). Specifically, we first analyze the characteristics of questions with different types and their corresponding answers. Then, we classify a given question into various types, which is incorporated as key features in the subsequent model configuration. Based on that, we present a model to extract the candidate passages from recalled documents in a MRC fashion. Next, a novel MRC model with multiple stages of attention is proposed to extract answers from the selected passages. Last, in the answer re-ranking stage, we design a question type-adaptive model to produce the final answer. The experimental results on two open-domain QA Datasets clearly validate the effectiveness of our system and models in featured snippet extraction.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129958925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Urban traffic data consists of observations like number and speed of cars or other vehicles at certain locations as measured by deployed sensors. These numbers can be interpreted as traffic flow which in turn relates to the capacity of streets and the demand of the traffic system. City planners are interested in studying the impact of various conditions on the traffic flow, leading to unusual patterns, i.e., outliers. Existing approaches to outlier detection in urban traffic data take into account only individual flow values (i.e., an individual observation). This can be interesting for real time detection of sudden changes. Here, we face a different scenario: The city planners want to learn from historical data, how special circumstances (e.g., events or festivals) relate to unusual patterns in the traffic flow, in order to support improved planing of both, events and the layout of the traffic system. Therefore, we propose to consider the sequence of traffic flow values observed within some time interval. Such flow sequences can be modeled as probability distributions of flows. We adapt an established outlier detection method, the local outlier factor (LOF), to handling flow distributions rather than individual observations. We apply the outlier detection online to extend the database with new flow distributions that are considered inliers. For the validation we consider a special case of our framework for comparison with state-of-the-art outlier detection on flows. In addition, a real case study on urban traffic flow data showcases that our method finds meaningful outliers in the traffic flow data.
{"title":"Outlier Detection in Urban Traffic Flow Distributions","authors":"Y. Djenouri, A. Zimek, Marco Chiarandini","doi":"10.1109/ICDM.2018.00114","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00114","url":null,"abstract":"Urban traffic data consists of observations like number and speed of cars or other vehicles at certain locations as measured by deployed sensors. These numbers can be interpreted as traffic flow which in turn relates to the capacity of streets and the demand of the traffic system. City planners are interested in studying the impact of various conditions on the traffic flow, leading to unusual patterns, i.e., outliers. Existing approaches to outlier detection in urban traffic data take into account only individual flow values (i.e., an individual observation). This can be interesting for real time detection of sudden changes. Here, we face a different scenario: The city planners want to learn from historical data, how special circumstances (e.g., events or festivals) relate to unusual patterns in the traffic flow, in order to support improved planing of both, events and the layout of the traffic system. Therefore, we propose to consider the sequence of traffic flow values observed within some time interval. Such flow sequences can be modeled as probability distributions of flows. We adapt an established outlier detection method, the local outlier factor (LOF), to handling flow distributions rather than individual observations. We apply the outlier detection online to extend the database with new flow distributions that are considered inliers. For the validation we consider a special case of our framework for comparison with state-of-the-art outlier detection on flows. In addition, a real case study on urban traffic flow data showcases that our method finds meaningful outliers in the traffic flow data.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129531981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Group gender is essential in understanding social interaction and group dynamics. With the increasing privacy concerns of studying face-to-face communication in natural settings, many participants are not open to raw audio recording. Existing voice-based gender identification methods rely on acoustic characteristics caused by physiological differences and phonetic differences. However, these methods might become ineffective with privacy-sensitive audio for two main reasons. First, compared to raw audio, privacy-sensitive audio contains significantly fewer acoustic features. Moreover, natural settings generate various uncertainties in the audio data. In this paper, we make the first attempt to identify group gender using privacy-sensitive audio. Instead of extracting acoustic features from privacy-sensitive audio, we focus on conversational features including turn-taking behaviors and interruption patterns. However, conversational behaviors are unstable in gender identification as human behaviors are affected by many factors like emotion and environment. We utilize ensemble feature selection and a two-stage classification to improve the effectiveness and robustness of our approach. Ensemble feature selection could reduce the risk of choosing an unstable subset of features by aggregating the outputs of multiple feature selectors. In the first stage, we infer the gender composition (mixed-gender or same-gender) of a group which is used as an additional input feature for identifying group gender in the second stage. The estimated gender composition significantly improves the performance as it could partially account for the dynamics in conversational behaviors. According to the experimental evaluation of 100 people in 273 meetings, the proposed method outperforms baseline approaches and achieves an F1-score of 0.77 using linear SVM.
{"title":"GINA: Group Gender Identification Using Privacy-Sensitive Audio Data","authors":"Jiaxing Shen, Oren Lederman, Jiannong Cao, Florian Berg, Shaojie Tang, A. Pentland","doi":"10.1109/ICDM.2018.00061","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00061","url":null,"abstract":"Group gender is essential in understanding social interaction and group dynamics. With the increasing privacy concerns of studying face-to-face communication in natural settings, many participants are not open to raw audio recording. Existing voice-based gender identification methods rely on acoustic characteristics caused by physiological differences and phonetic differences. However, these methods might become ineffective with privacy-sensitive audio for two main reasons. First, compared to raw audio, privacy-sensitive audio contains significantly fewer acoustic features. Moreover, natural settings generate various uncertainties in the audio data. In this paper, we make the first attempt to identify group gender using privacy-sensitive audio. Instead of extracting acoustic features from privacy-sensitive audio, we focus on conversational features including turn-taking behaviors and interruption patterns. However, conversational behaviors are unstable in gender identification as human behaviors are affected by many factors like emotion and environment. We utilize ensemble feature selection and a two-stage classification to improve the effectiveness and robustness of our approach. Ensemble feature selection could reduce the risk of choosing an unstable subset of features by aggregating the outputs of multiple feature selectors. In the first stage, we infer the gender composition (mixed-gender or same-gender) of a group which is used as an additional input feature for identifying group gender in the second stage. The estimated gender composition significantly improves the performance as it could partially account for the dynamics in conversational behaviors. According to the experimental evaluation of 100 people in 273 meetings, the proposed method outperforms baseline approaches and achieves an F1-score of 0.77 using linear SVM.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130562966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kai Shu, Suhang Wang, Thai Le, Dongwon Lee, Huan Liu
Clickbaits are catchy social posts or sensational headlines that attempt to lure readers to click. Clickbaits are pervasive on social media and can have significant negative impacts on both users and media ecosystems. For example, users may be misled to receive inaccurate information or fall into click-jacking attacks. Similarly, media platforms could lose readers' trust and revenues due to the prevalence of clickbaits. To computationally detect such clickbaits on social media using a supervised learning framework, one of the major obstacles is the lack of large-scale labeled training data, due to the high cost of labeling. With the recent advancements of deep generative models, to address this challenge, we propose to generate synthetic headlines with specific styles and explore their utilities to help improve clickbait detection. In particular, we propose to generate stylized headlines from original documents with style transfer. Furthermore, as it is non-trivial to generate stylized headlines due to several challenges such as the discrete nature of texts and the requirements of preserving semantic meaning of document while achieving style transfer, we propose a novel solution, named as Stylized Headline Generation (SHG), that can not only generate readable and realistic headlines to enlarge original training data, but also help improve the classification capacity of supervised learning. The experimental results on real-world datasets demonstrate the effectiveness of SHG in generating high-quality and high-utility headlines for clickbait detection.
{"title":"Deep Headline Generation for Clickbait Detection","authors":"Kai Shu, Suhang Wang, Thai Le, Dongwon Lee, Huan Liu","doi":"10.1109/ICDM.2018.00062","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00062","url":null,"abstract":"Clickbaits are catchy social posts or sensational headlines that attempt to lure readers to click. Clickbaits are pervasive on social media and can have significant negative impacts on both users and media ecosystems. For example, users may be misled to receive inaccurate information or fall into click-jacking attacks. Similarly, media platforms could lose readers' trust and revenues due to the prevalence of clickbaits. To computationally detect such clickbaits on social media using a supervised learning framework, one of the major obstacles is the lack of large-scale labeled training data, due to the high cost of labeling. With the recent advancements of deep generative models, to address this challenge, we propose to generate synthetic headlines with specific styles and explore their utilities to help improve clickbait detection. In particular, we propose to generate stylized headlines from original documents with style transfer. Furthermore, as it is non-trivial to generate stylized headlines due to several challenges such as the discrete nature of texts and the requirements of preserving semantic meaning of document while achieving style transfer, we propose a novel solution, named as Stylized Headline Generation (SHG), that can not only generate readable and realistic headlines to enlarge original training data, but also help improve the classification capacity of supervised learning. The experimental results on real-world datasets demonstrate the effectiveness of SHG in generating high-quality and high-utility headlines for clickbait detection.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116533765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yun Sing Koh, David Tse Jung Huang, C. Pearce, G. Dobbie
The reasons for concept drift in a data stream can vary widely, from deterioration of a machine to a change in peoples' buying patterns. In order to effectively detect concept drifts, most predictive stream mining systems contain a drift detector that monitors and signals concept drifts. However, few of these systems are designed to find drifts in transactional datasets, which have unlabelled data. Transactional datasets describe events, such as orders or payments, which are traditionally analysed using association rules. In this paper, we propose a novel drift detection technique, ProChange, that has two parts. The first part is a drift detector, VR-Change, that finds both real and virtual drifts in unlabelled transactional data streams using the Hellinger distance. The second part is a drift predictor, which models the volatility of drifts using a probabilistic network to predict the location of future drifts. Using the predictor, we can dynamically adapt the confidence threshold, enabling VR-Change to be more sensitive around potential future drift points. We evaluated the performance of ProChange by comparing it against traditional detectors showing that it detects both real and virtual drifts effectively and efficiently in terms of accuracy.
{"title":"Volatility Drift Prediction for Transactional Data Streams","authors":"Yun Sing Koh, David Tse Jung Huang, C. Pearce, G. Dobbie","doi":"10.1109/ICDM.2018.00140","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00140","url":null,"abstract":"The reasons for concept drift in a data stream can vary widely, from deterioration of a machine to a change in peoples' buying patterns. In order to effectively detect concept drifts, most predictive stream mining systems contain a drift detector that monitors and signals concept drifts. However, few of these systems are designed to find drifts in transactional datasets, which have unlabelled data. Transactional datasets describe events, such as orders or payments, which are traditionally analysed using association rules. In this paper, we propose a novel drift detection technique, ProChange, that has two parts. The first part is a drift detector, VR-Change, that finds both real and virtual drifts in unlabelled transactional data streams using the Hellinger distance. The second part is a drift predictor, which models the volatility of drifts using a probabilistic network to predict the location of future drifts. Using the predictor, we can dynamically adapt the confidence threshold, enabling VR-Change to be more sensitive around potential future drift points. We evaluated the performance of ProChange by comparing it against traditional detectors showing that it detects both real and virtual drifts effectively and efficiently in terms of accuracy.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129049232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tianyu Kang, Kourosh Zarringhalam, M. Kuijjer, Ping Chen, John Quackenbush, W. Ding
This paper presents a new algorithm, Reinforced and Informed Network-based Clustering(RINC), for finding unknown groups of similar data objects in sparse and largely non-overlapping feature space where a network structure among features can be observed. Sparse and non-overlapping unlabeled data become increasingly common and available especially in text mining and biomedical data mining. RINC inserts a domain informed model into a modelless neural network. In particular, our approach integrates physically meaningful feature dependencies into the neural network architecture and soft computational constraint. Our learning algorithm efficiently clusters sparse data through integrated smoothing and sparse auto-encoder learning. The informed design requires fewer samples for training and at least part of the model becomes explainable. The architecture of the reinforced network layers smooths sparse data over the network dependency in the feature space. Most importantly, through back-propagation, the weights of the reinforced smoothing layers are simultaneously constrained by the remaining sparse auto-encoder layers that set the target values to be equal to the raw inputs. Empirical results demonstrate that RINC achieves improved accuracy and renders physically meaningful clustering results.
本文提出了一种新的算法——基于增强和知情网络的聚类算法(reinforcement and Informed network -based Clustering, ring),用于在稀疏且基本上不重叠的特征空间中寻找相似数据对象的未知组,在这些特征空间中可以观察到特征之间的网络结构。稀疏和非重叠的未标记数据在文本挖掘和生物医学数据挖掘中越来越普遍和可用。ringc将一个领域知情模型插入到一个无模型神经网络中。特别是,我们的方法将物理上有意义的特征依赖关系集成到神经网络架构和软计算约束中。我们的学习算法通过融合平滑和稀疏自编码器学习来有效地聚类稀疏数据。知情设计需要更少的样本进行训练,并且至少部分模型变得可以解释。增强网络层的体系结构平滑了特征空间中网络依赖的稀疏数据。最重要的是,通过反向传播,增强平滑层的权重同时受到剩余稀疏自编码器层的约束,这些层将目标值设置为等于原始输入。实证结果表明,ringc在提高准确率的同时,呈现出物理上有意义的聚类结果。
{"title":"Clustering on Sparse Data in Non-overlapping Feature Space with Applications to Cancer Subtyping","authors":"Tianyu Kang, Kourosh Zarringhalam, M. Kuijjer, Ping Chen, John Quackenbush, W. Ding","doi":"10.1109/ICDM.2018.00138","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00138","url":null,"abstract":"This paper presents a new algorithm, Reinforced and Informed Network-based Clustering(RINC), for finding unknown groups of similar data objects in sparse and largely non-overlapping feature space where a network structure among features can be observed. Sparse and non-overlapping unlabeled data become increasingly common and available especially in text mining and biomedical data mining. RINC inserts a domain informed model into a modelless neural network. In particular, our approach integrates physically meaningful feature dependencies into the neural network architecture and soft computational constraint. Our learning algorithm efficiently clusters sparse data through integrated smoothing and sparse auto-encoder learning. The informed design requires fewer samples for training and at least part of the model becomes explainable. The architecture of the reinforced network layers smooths sparse data over the network dependency in the feature space. Most importantly, through back-propagation, the weights of the reinforced smoothing layers are simultaneously constrained by the remaining sparse auto-encoder layers that set the target values to be equal to the raw inputs. Empirical results demonstrate that RINC achieves improved accuracy and renders physically meaningful clustering results.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132650420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Word embeddings are finding their increasing application in a variety of biomedical Natural Language Processing (bioNLP) tasks, ranging from drug discovery to automated disease diagnosis. While these word embeddings in their entirety have shown meaningful syntactic and semantic regularities, however, the meaning of individual dimensions remains elusive. This becomes problematic both in general and particularly in sensitive domains such as bio-medicine, wherein, the interpretability of results is crucial to its widespread adoption. To address this issue, in this study, we aim to improve the interpretability of pre-trained word embeddings generated from a text corpora, and in doing so provide a systematic approach to formalize the problem. More specifically, we exploit the rich categorical knowledge present in the biomedical domain, and propose to learn a transformation matrix that transforms the input embeddings to a new space where they are both interpretable and retain their original expressive features. Experiments conducted on the largest available biomedical corpus suggests that the model is capable of performing interpretability that resembles closely to the human-level intuition.
{"title":"Interpretable Word Embeddings for Medical Domain","authors":"Kishlay Jha, Yaqing Wang, Guangxu Xun, Aidong Zhang","doi":"10.1109/ICDM.2018.00135","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00135","url":null,"abstract":"Word embeddings are finding their increasing application in a variety of biomedical Natural Language Processing (bioNLP) tasks, ranging from drug discovery to automated disease diagnosis. While these word embeddings in their entirety have shown meaningful syntactic and semantic regularities, however, the meaning of individual dimensions remains elusive. This becomes problematic both in general and particularly in sensitive domains such as bio-medicine, wherein, the interpretability of results is crucial to its widespread adoption. To address this issue, in this study, we aim to improve the interpretability of pre-trained word embeddings generated from a text corpora, and in doing so provide a systematic approach to formalize the problem. More specifically, we exploit the rich categorical knowledge present in the biomedical domain, and propose to learn a transformation matrix that transforms the input embeddings to a new space where they are both interpretable and retain their original expressive features. Experiments conducted on the largest available biomedical corpus suggests that the model is capable of performing interpretability that resembles closely to the human-level intuition.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122244329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qianqian Wang, Zhengming Ding, Zhiqiang Tao, Quanxue Gao, Y. Fu
Multi-view clustering, as one of the most important methods to analyze multi-view data, has been widely used in many real-world applications. Most existing multi-view clustering methods perform well on the assumption that each sample appears in all views. Nevertheless, in real-world application, each view may well face the problem of the missing data due to noise, or malfunction. In this paper, a new consistent generative adversarial network is proposed for partial multi-view clustering. We learn a common low-dimensional representation, which can both generate the missing view data and capture a better common structure from partial multi-view data for clustering. Different from the most existing methods, we use the common representation encoded by one view to generate the missing data of the corresponding view by generative adversarial networks, then we use the encoder and clustering networks. This is intuitive and meaningful because encoding common representation and generating the missing data in our model will promote mutually. Experimental results on three different multi-view databases illustrate the superiority of the proposed method.
{"title":"Partial Multi-view Clustering via Consistent GAN","authors":"Qianqian Wang, Zhengming Ding, Zhiqiang Tao, Quanxue Gao, Y. Fu","doi":"10.1109/ICDM.2018.00174","DOIUrl":"https://doi.org/10.1109/ICDM.2018.00174","url":null,"abstract":"Multi-view clustering, as one of the most important methods to analyze multi-view data, has been widely used in many real-world applications. Most existing multi-view clustering methods perform well on the assumption that each sample appears in all views. Nevertheless, in real-world application, each view may well face the problem of the missing data due to noise, or malfunction. In this paper, a new consistent generative adversarial network is proposed for partial multi-view clustering. We learn a common low-dimensional representation, which can both generate the missing view data and capture a better common structure from partial multi-view data for clustering. Different from the most existing methods, we use the common representation encoded by one view to generate the missing data of the corresponding view by generative adversarial networks, then we use the encoder and clustering networks. This is intuitive and meaningful because encoding common representation and generating the missing data in our model will promote mutually. Experimental results on three different multi-view databases illustrate the superiority of the proposed method.","PeriodicalId":286444,"journal":{"name":"2018 IEEE International Conference on Data Mining (ICDM)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134362813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}