2018 IEEE International Conference on Data Mining (ICDM): Latest Papers, Page 5

Distribution Preserving Multi-task Regression for Spatio-Temporal Data

2018 IEEE International Conference on Data Mining (ICDM)

Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00148

Xi Liu, P. Tan, Zubin Abraham, L. Luo, P. Hatami

For many spatio-temporal applications, building regression models that can reproduce the true data distribution is often as important as building models with high prediction accuracy. For example, knowing the future distribution of daily temperature and precipitation can help scientists determine their long-term trends and assess their potential impact on human and natural systems. As conventional methods are designed to minimize residual errors, the shape of their predicted distribution may not be consistent with their actual distribution. To overcome this challenge, this paper presents a novel, distribution-preserving multi-task learning framework for multi-location prediction of spatio-temporal data. The framework employs a non-parametric density estimation approach with L2-distance to measure the divergence between the predicted and true distribution of the data. Experimental results using climate data from more than 1500 weather stations in the United States show that the proposed framework reduces the distribution error for more than 78% of the stations without degrading the prediction accuracy significantly.
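
The divergence measure mentioned in the abstract can be illustrated with a small sketch. It assumes a Gaussian kernel with a fixed bandwidth `h` (an assumption; the abstract only says the density estimate is non-parametric). Under that assumption the squared L2 distance between two kernel density estimates has a closed form, because the integral of the product of two Gaussian kernels is a Gaussian evaluated at the difference of their centers. The function names and sample values are hypothetical and not taken from the paper.

```python
import math


def gaussian(d, var):
    """Density of a zero-mean Gaussian with variance `var`, evaluated at d."""
    return math.exp(-d * d / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def kde_inner_product(a, b, h):
    """Integral of the product of two Gaussian KDEs built on samples a and b.

    Each pair of kernels (bandwidth h) integrates to a Gaussian with
    variance 2*h**2 evaluated at the difference of the two sample points.
    """
    var = 2.0 * h * h
    total = sum(gaussian(x - y, var) for x in a for y in b)
    return total / (len(a) * len(b))


def kde_l2_distance_sq(pred, true, h=1.0):
    """Squared L2 distance between the KDEs of predicted and observed values."""
    return (kde_inner_product(pred, pred, h)
            - 2.0 * kde_inner_product(pred, true, h)
            + kde_inner_product(true, true, h))


observed = [0.0, 1.0, 2.0, 3.0, 4.0]
shrunk = [1.5, 1.8, 2.0, 2.2, 2.5]  # same mean, but pulled toward it

print(kde_l2_distance_sq(observed, observed))  # identical samples
print(kde_l2_distance_sq(shrunk, observed))    # narrower predicted distribution
```

A term of this kind could be added to an ordinary residual loss so that predictions with the right mean but the wrong spread are penalized; how the paper combines the two terms is not stated in the abstract.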

Citations: 4

Exploiting the Sentimental Bias between Ratings and Reviews for Enhancing Recommendation

2018 IEEE International Conference on Data Mining (ICDM)

Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00185

Yuanbo Xu, Yongjian Yang, Jiayu Han, E. Wang, Fuzhen Zhuang, Hui Xiong

In real-world recommendation scenarios, there are two common phenomena: 1) Users often provide only ratings, without review comments. As a result, the historical transaction data available to recommender systems are usually unbalanced and sparse; 2) Users' opinions can be better grasped from their reviews than from their ratings. This indicates that there is always a bias between ratings and reviews. Therefore, it is important that users' ratings and reviews reinforce each other in order to grasp the users' true opinions. To this end, in this paper, we develop an opinion mining model based on convolutional neural networks for enhancing recommendation (NeuO). Specifically, we exploit a neural network trained in two steps, which utilizes both reviews and ratings to grasp users' true opinions in unbalanced data. Moreover, we propose a Sentiment Classification scoring method (SC), which employs dual attention vectors to predict the sentiment scores of users' reviews. A combination function is designed to use the results of SC and the user-item rating matrix to capture the opinion bias. Finally, a Multilayer-perceptron-based Matrix Factorization (MMF) method is proposed to make recommendations with the enhanced user-item matrix. Extensive experiments on real-world datasets demonstrate that our approach outperforms state-of-the-art baselines.

Citations: 14

A Machine Reading Comprehension-Based Approach for Featured Snippet Extraction

2018 IEEE International Conference on Data Mining (ICDM)

Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00195

Chen Zhang, Xuanyu Zhang, Hao Wang

The extraction of featured snippets can be considered a Question Answering (QA) problem. This paper presents a featured snippet extraction system that employs machine reading comprehension (MRC). Specifically, we first analyze the characteristics of questions of different types and their corresponding answers. Then, we classify a given question into one of several types, which are incorporated as key features in the subsequent model configuration. Based on that, we present a model to extract the candidate passages from recalled documents in an MRC fashion. Next, a novel MRC model with multiple stages of attention is proposed to extract answers from the selected passages. Last, in the answer re-ranking stage, we design a question type-adaptive model to produce the final answer. The experimental results on two open-domain QA datasets clearly validate the effectiveness of our system and models in featured snippet extraction.

Citations: 6

Outlier Detection in Urban Traffic Flow Distributions

2018 IEEE International Conference on Data Mining (ICDM)

Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00114

Y. Djenouri, A. Zimek, Marco Chiarandini

Urban traffic data consists of observations such as the number and speed of cars or other vehicles at certain locations, as measured by deployed sensors. These numbers can be interpreted as traffic flow, which in turn relates to the capacity of streets and the demand on the traffic system. City planners are interested in studying the impact of various conditions on the traffic flow that lead to unusual patterns, i.e., outliers. Existing approaches to outlier detection in urban traffic data take into account only individual flow values (i.e., an individual observation). This can be interesting for real-time detection of sudden changes. Here, we face a different scenario: city planners want to learn from historical data how special circumstances (e.g., events or festivals) relate to unusual patterns in the traffic flow, in order to support improved planning of both events and the layout of the traffic system. Therefore, we propose to consider the sequence of traffic flow values observed within some time interval. Such flow sequences can be modeled as probability distributions of flows. We adapt an established outlier detection method, the local outlier factor (LOF), to handle flow distributions rather than individual observations. We apply the outlier detection online to extend the database with new flow distributions that are considered inliers. For validation, we consider a special case of our framework for comparison with state-of-the-art outlier detection on flows. In addition, a real case study on urban traffic flow data shows that our method finds meaningful outliers in the traffic flow data.
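
The idea of scoring whole flow distributions with LOF can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each time interval is summarized as a normalized histogram of flow values, uses total variation as the distance between histograms (the abstract does not name the distance), and takes exactly `k` neighbors per item instead of handling distance ties as full LOF does. All names and values are hypothetical.

```python
def total_variation(p, q):
    """Total variation distance between two histograms over the same bins."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def lof_scores(items, dist, k=2):
    """Local outlier factor of every item under an arbitrary distance."""
    n = len(items)
    d = [[dist(items[i], items[j]) for j in range(n)] for i in range(n)]

    # k nearest neighbours of each item (excluding itself) and its k-distance.
    neighbours, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])
        neighbours.append(order[:k])
        k_distance.append(d[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = sum(max(k_distance[j], d[i][j]) for j in neighbours[i]) / k
        lrd.append(1.0 / max(reach, 1e-12))  # guard against duplicate items

    # LOF: average density of the neighbours relative to the item's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical flow histograms (4 bins each); the last one is unusual.
flows = [
    [0.10, 0.40, 0.40, 0.10],
    [0.12, 0.38, 0.40, 0.10],
    [0.10, 0.42, 0.38, 0.10],
    [0.11, 0.39, 0.41, 0.09],
    [0.09, 0.41, 0.39, 0.11],
    [0.70, 0.10, 0.10, 0.10],
]
scores = lof_scores(flows, total_variation, k=2)
for histogram, score in zip(flows, scores):
    print(histogram, round(score, 2))
```

Scores close to 1 indicate a distribution whose neighborhood is about as dense as its neighbors' neighborhoods; scores well above 1 mark candidates for outliers.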

Citations: 26

GINA: Group Gender Identification Using Privacy-Sensitive Audio Data

2018 IEEE International Conference on Data Mining (ICDM)

Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00061

Jiaxing Shen, Oren Lederman, Jiannong Cao, Florian Berg, Shaojie Tang, A. Pentland

Group gender is essential in understanding social interaction and group dynamics. With the increasing privacy concerns of studying face-to-face communication in natural settings, many participants are not open to raw audio recording. Existing voice-based gender identification methods rely on acoustic characteristics caused by physiological differences and phonetic differences. However, these methods might become ineffective with privacy-sensitive audio for two main reasons. First, compared to raw audio, privacy-sensitive audio contains significantly fewer acoustic features. Moreover, natural settings generate various uncertainties in the audio data. In this paper, we make the first attempt to identify group gender using privacy-sensitive audio. Instead of extracting acoustic features from privacy-sensitive audio, we focus on conversational features including turn-taking behaviors and interruption patterns. However, conversational behaviors are unstable in gender identification as human behaviors are affected by many factors like emotion and environment. We utilize ensemble feature selection and a two-stage classification to improve the effectiveness and robustness of our approach. Ensemble feature selection could reduce the risk of choosing an unstable subset of features by aggregating the outputs of multiple feature selectors. In the first stage, we infer the gender composition (mixed-gender or same-gender) of a group which is used as an additional input feature for identifying group gender in the second stage. The estimated gender composition significantly improves the performance as it could partially account for the dynamics in conversational behaviors. According to the experimental evaluation of 100 people in 273 meetings, the proposed method outperforms baseline approaches and achieves an F1-score of 0.77 using linear SVM.
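
Ensemble feature selection, as described in the abstract, aggregates the outputs of several selectors so that no single unstable ranking decides the result. The sketch below assumes each selector returns a full ranking of the same features and aggregates by mean rank; the aggregation rule, the feature names, and the rankings are all hypothetical, since the abstract does not specify them.

```python
def aggregate_rankings(rankings, top_k):
    """Pick the top_k features by mean rank across several selectors.

    Each ranking lists the same features from most to least important.
    Ties in mean rank are broken alphabetically to keep the result stable.
    """
    features = rankings[0]
    mean_rank = {
        f: sum(r.index(f) for r in rankings) / len(rankings) for f in features
    }
    return sorted(features, key=lambda f: (mean_rank[f], f))[:top_k]


# Hypothetical rankings from three different feature selectors.
rankings = [
    ["turns", "interruptions", "pauses", "overlap"],
    ["interruptions", "turns", "overlap", "pauses"],
    ["turns", "pauses", "interruptions", "overlap"],
]
selected = aggregate_rankings(rankings, top_k=2)
print(selected)
```

A feature that one selector ranks highly by chance is unlikely to survive the averaging unless the other selectors agree, which is the stability argument the abstract makes.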

Citations: 10

Deep Headline Generation for Clickbait Detection

2018 IEEE International Conference on Data Mining (ICDM)

Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00062

Kai Shu, Suhang Wang, Thai Le, Dongwon Lee, Huan Liu

Clickbaits are catchy social posts or sensational headlines that attempt to lure readers to click. Clickbaits are pervasive on social media and can have significant negative impacts on both users and media ecosystems. For example, users may be misled into receiving inaccurate information or fall victim to click-jacking attacks. Similarly, media platforms could lose readers' trust and revenue due to the prevalence of clickbaits. One of the major obstacles to detecting such clickbaits on social media with a supervised learning framework is the lack of large-scale labeled training data, due to the high cost of labeling. To address this challenge, and building on recent advances in deep generative models, we propose to generate synthetic headlines with specific styles and explore their utility in improving clickbait detection. In particular, we propose to generate stylized headlines from original documents with style transfer. Generating stylized headlines is non-trivial because of several challenges, such as the discrete nature of text and the requirement to preserve the semantic meaning of the document while achieving style transfer. We therefore propose a novel solution, named Stylized Headline Generation (SHG), that can not only generate readable and realistic headlines to enlarge the original training data, but also help improve the classification capacity of supervised learning. The experimental results on real-world datasets demonstrate the effectiveness of SHG in generating high-quality and high-utility headlines for clickbait detection.

Citations: 49

Volatility Drift Prediction for Transactional Data Streams

2018 IEEE International Conference on Data Mining (ICDM)

Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00140

Yun Sing Koh, David Tse Jung Huang, C. Pearce, G. Dobbie

The reasons for concept drift in a data stream can vary widely, from deterioration of a machine to a change in people's buying patterns. In order to effectively detect concept drifts, most predictive stream mining systems contain a drift detector that monitors and signals concept drifts. However, few of these systems are designed to find drifts in transactional datasets, which have unlabelled data. Transactional datasets describe events, such as orders or payments, which are traditionally analysed using association rules. In this paper, we propose a novel drift detection technique, ProChange, that has two parts. The first part is a drift detector, VR-Change, that finds both real and virtual drifts in unlabelled transactional data streams using the Hellinger distance. The second part is a drift predictor, which models the volatility of drifts using a probabilistic network to predict the location of future drifts. Using the predictor, we can dynamically adapt the confidence threshold, enabling VR-Change to be more sensitive around potential future drift points. We evaluated the performance of ProChange by comparing it against traditional detectors, showing that it detects both real and virtual drifts effectively and efficiently in terms of accuracy.
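
The Hellinger distance named in the abstract can be computed between the item distributions of two windows of transactions. The sketch below is only an illustration of that distance: the windowing, the use of single-item frequencies, and the fixed threshold are assumptions, and the function names are hypothetical rather than part of VR-Change.

```python
import math
from collections import Counter


def item_distribution(transactions):
    """Relative frequency of each item across a window of transactions."""
    counts = Counter(item for t in transactions for item in t)
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range 0 to 1)."""
    items = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(i, 0.0)) - math.sqrt(q.get(i, 0.0))) ** 2 for i in items
    )
    return math.sqrt(squared / 2.0)


# Hypothetical reference and current windows of an unlabelled stream.
reference = [{"bread", "milk"}, {"bread", "eggs"}, {"milk", "eggs"}]
current = [{"bread", "milk"}, {"beer", "chips"}, {"beer", "milk"}]

THRESHOLD = 0.3  # hypothetical fixed threshold
distance = hellinger(item_distribution(reference), item_distribution(current))
print(distance, distance > THRESHOLD)
```

In the approach the abstract describes, the threshold would not stay fixed: the drift predictor adapts it so that the detector is more sensitive near points where a drift is expected.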

Citations: 6

Clustering on Sparse Data in Non-overlapping Feature Space with Applications to Cancer Subtyping

2018 IEEE International Conference on Data Mining (ICDM)

Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00138

Tianyu Kang, Kourosh Zarringhalam, M. Kuijjer, Ping Chen, John Quackenbush, W. Ding

This paper presents a new algorithm, Reinforced and Informed Network-based Clustering (RINC), for finding unknown groups of similar data objects in a sparse and largely non-overlapping feature space where a network structure among features can be observed. Sparse and non-overlapping unlabeled data are becoming increasingly common and available, especially in text mining and biomedical data mining. RINC inserts a domain-informed model into a modelless neural network. In particular, our approach integrates physically meaningful feature dependencies into the neural network architecture and soft computational constraint. Our learning algorithm efficiently clusters sparse data through integrated smoothing and sparse auto-encoder learning. The informed design requires fewer samples for training, and at least part of the model becomes explainable. The architecture of the reinforced network layers smooths sparse data over the network dependency in the feature space. Most importantly, through back-propagation, the weights of the reinforced smoothing layers are simultaneously constrained by the remaining sparse auto-encoder layers, which set the target values to be equal to the raw inputs. Empirical results demonstrate that RINC achieves improved accuracy and renders physically meaningful clustering results.

Citations: 2

Interpretable Word Embeddings for Medical Domain

2018 IEEE International Conference on Data Mining (ICDM)

Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00135

Kishlay Jha, Yaqing Wang, Guangxu Xun, Aidong Zhang

Word embeddings are finding increasing application in a variety of biomedical Natural Language Processing (bioNLP) tasks, ranging from drug discovery to automated disease diagnosis. While these word embeddings in their entirety have shown meaningful syntactic and semantic regularities, the meaning of individual dimensions remains elusive. This becomes problematic both in general and particularly in sensitive domains such as bio-medicine, wherein the interpretability of results is crucial to their widespread adoption. To address this issue, in this study, we aim to improve the interpretability of pre-trained word embeddings generated from a text corpus, and in doing so provide a systematic approach to formalize the problem. More specifically, we exploit the rich categorical knowledge present in the biomedical domain, and propose to learn a transformation matrix that transforms the input embeddings to a new space where they are both interpretable and retain their original expressive features. Experiments conducted on the largest available biomedical corpus suggest that the model is capable of achieving interpretability that closely resembles human-level intuition.



Citations: 22

Partial Multi-view Clustering via Consistent GAN

2018 IEEE International Conference on Data Mining (ICDM)

Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00174

Qianqian Wang, Zhengming Ding, Zhiqiang Tao, Quanxue Gao, Y. Fu

Multi-view clustering, one of the most important methods for analyzing multi-view data, has been widely used in many real-world applications. Most existing multi-view clustering methods perform well under the assumption that each sample appears in all views. In real-world applications, however, any view may suffer from missing data due to noise or malfunction. In this paper, a new consistent generative adversarial network is proposed for partial multi-view clustering. We learn a common low-dimensional representation, which can both generate the missing view data and capture a better common structure from partial multi-view data for clustering. Unlike most existing methods, we use the common representation encoded by one view to generate the missing data of the corresponding view with generative adversarial networks, and then we use the encoder and clustering networks. This is intuitive and meaningful because encoding the common representation and generating the missing data in our model promote each other. Experimental results on three different multi-view databases illustrate the superiority of the proposed method.
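The partial multi-view setting, in which some samples lack one view and that view is generated from the other, can be made concrete with a deliberately simplified sketch. It is not the paper's model: a least-squares linear map fitted by gradient descent stands in for the generative adversarial network, and the function names (`fit_linear_map`, `complete_views`), the data, and the restriction to a missing second view are all assumptions made for illustration.

```python
# Illustrative sketch only: a linear least-squares map replaces the GAN
# described in the abstract. All names and data below are hypothetical.

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def fit_linear_map(V1, V2, lr=0.05, steps=500):
    """Fit M by gradient descent so that V1 x M approximates V2 (least squares)."""
    d1, d2 = len(V1[0]), len(V2[0])
    M = [[0.0] * d2 for _ in range(d1)]
    V1t = [list(col) for col in zip(*V1)]
    for _ in range(steps):
        P = matmul(V1, M)
        R = [[P[i][j] - V2[i][j] for j in range(d2)] for i in range(len(V1))]
        G = matmul(V1t, R)  # half of the gradient of the squared error
        M = [[M[i][j] - lr * 2.0 * G[i][j] for j in range(d2)] for i in range(d1)]
    return M

def complete_views(samples, M):
    """Fill a missing second view (None) using the first view and the map M."""
    filled = []
    for v1, v2 in samples:
        if v2 is None:
            v2 = matmul([v1], M)[0]
        filled.append((v1, v2))
    return filled

# Made-up partial two-view data: the last sample has no second view.
samples = [
    ([1.0, 0.0], [2.0, 1.0]),
    ([0.0, 1.0], [0.0, -1.0]),
    ([1.0, 1.0], [2.0, 0.0]),
    ([2.0, 1.0], [4.0, 1.0]),
    ([3.0, 2.0], None),
]

# Fit only on samples observed in both views, then complete the rest.
paired = [(v1, v2) for v1, v2 in samples if v2 is not None]
M = fit_linear_map([v1 for v1, _ in paired], [v2 for _, v2 in paired])
completed = complete_views(samples, M)
print(completed[-1])
```

Once every sample has both views, any standard clustering routine could be run on the completed data. The abstract describes encoding and generation as mutually reinforcing, whereas this sketch performs completion as a separate preprocessing step, which is a further simplification.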



Citations: 81