
2018 IEEE International Conference on Data Mining (ICDM): Latest Papers

Clustering on Sparse Data in Non-overlapping Feature Space with Applications to Cancer Subtyping
Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00138
Tianyu Kang, Kourosh Zarringhalam, M. Kuijjer, Ping Chen, John Quackenbush, W. Ding
This paper presents a new algorithm, Reinforced and Informed Network-based Clustering (RINC), for finding unknown groups of similar data objects in a sparse and largely non-overlapping feature space where a network structure among features can be observed. Sparse and non-overlapping unlabeled data are becoming increasingly common and available, especially in text mining and biomedical data mining. RINC inserts a domain-informed model into a modelless neural network. In particular, our approach integrates physically meaningful feature dependencies into the neural network architecture and soft computational constraint. Our learning algorithm efficiently clusters sparse data through integrated smoothing and sparse auto-encoder learning. The informed design requires fewer samples for training, and at least part of the model becomes explainable. The architecture of the reinforced network layers smooths sparse data over the network dependency in the feature space. Most importantly, through back-propagation, the weights of the reinforced smoothing layers are simultaneously constrained by the remaining sparse auto-encoder layers, which set the target values equal to the raw inputs. Empirical results demonstrate that RINC achieves improved accuracy and renders physically meaningful clustering results.
Citations: 2
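The smoothing idea in the RINC abstract above (spreading sparse feature values over a known feature network) can be illustrated with a single propagation step. This is only a minimal sketch under assumptions the abstract does not state: a fixed neighbor-averaging rule with a mixing weight `alpha`, rather than the learned smoothing layers constrained through back-propagation that the paper describes. The names `smooth_over_network` and `alpha` and the toy network are hypothetical.

```python
def smooth_over_network(x, neighbors, alpha=0.5):
    """One propagation step: mix each feature with the mean of its neighbors.

    x         -- list of feature values for one sample (mostly zeros)
    neighbors -- dict mapping feature index -> list of neighbor indices
    alpha     -- hypothetical mixing weight (0 keeps x, 1 uses neighbors only)
    """
    smoothed = []
    for i, value in enumerate(x):
        nbrs = neighbors.get(i, [])
        if nbrs:
            nbr_mean = sum(x[j] for j in nbrs) / len(nbrs)
        else:
            nbr_mean = value  # isolated feature: nothing to borrow from
        smoothed.append((1 - alpha) * value + alpha * nbr_mean)
    return smoothed


def dot(u, v):
    """Plain dot product, used here as a crude similarity score."""
    return sum(a * b for a, b in zip(u, v))


# Two samples whose nonzero features (0 and 1) do not overlap, but are
# adjacent in the toy feature network 0 - 1 - 2.
network = {0: [1], 1: [0, 2], 2: [1]}
sample_a = [1.0, 0.0, 0.0]
sample_b = [0.0, 1.0, 0.0]

smoothed_a = smooth_over_network(sample_a, network)
smoothed_b = smooth_over_network(sample_b, network)

similarity_before = dot(sample_a, sample_b)
similarity_after = dot(smoothed_a, smoothed_b)
```

Before smoothing the two samples share no nonzero feature, so their similarity is zero; after one step they overlap, which is what makes clustering on such data feasible in this toy setting.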
Interpretable Word Embeddings for Medical Domain
Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00135
Kishlay Jha, Yaqing Wang, Guangxu Xun, Aidong Zhang
Word embeddings are finding increasing application in a variety of biomedical Natural Language Processing (bioNLP) tasks, ranging from drug discovery to automated disease diagnosis. While these word embeddings in their entirety have shown meaningful syntactic and semantic regularities, the meaning of individual dimensions remains elusive. This is problematic in general and particularly in sensitive domains such as biomedicine, where the interpretability of results is crucial to widespread adoption. To address this issue, in this study we aim to improve the interpretability of pre-trained word embeddings generated from a text corpus, and in doing so provide a systematic approach to formalizing the problem. More specifically, we exploit the rich categorical knowledge present in the biomedical domain and propose to learn a transformation matrix that transforms the input embeddings to a new space where they are both interpretable and retain their original expressive features. Experiments conducted on the largest available biomedical corpus suggest that the model produces interpretations that closely resemble human-level intuition.
Citations: 22
Robust Distributed Anomaly Detection Using Optimal Weighted One-Class Random Forests
Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00171
Yu-Lin Tsou, Hong-Min Chu, Cong Li, Shao-Wen Yang
Wireless sensor networks (WSNs) have been widely deployed in various applications, e.g., agricultural monitoring and industrial monitoring, for their ease of deployment. Their low-cost nature makes WSNs particularly vulnerable to changes in extrinsic factors, i.e., the environment, or changes in intrinsic factors, i.e., hardware or software failures. The problem can often be uncovered by detecting unexpected behaviors (anomalies) of devices. However, anomaly detection in WSNs is subject to the following challenges: (1) limited computation and connectivity, (2) the dynamicity of the environment and network topology, and (3) the need to take real-time actions in response to anomalies. In this paper, we propose a novel framework using optimal weighted one-class random forests for unsupervised anomaly detection to address the aforementioned challenges in WSNs. Extensive experiments show that our framework is not only feasible but also outperforms state-of-the-art unsupervised methods in terms of both detection accuracy and resource utilization.
Citations: 10
Deep Headline Generation for Clickbait Detection
Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00062
Kai Shu, Suhang Wang, Thai Le, Dongwon Lee, Huan Liu
Clickbaits are catchy social posts or sensational headlines that attempt to lure readers to click. Clickbaits are pervasive on social media and can have significant negative impacts on both users and media ecosystems. For example, users may be misled into receiving inaccurate information or fall victim to click-jacking attacks. Similarly, media platforms could lose readers' trust and revenues due to the prevalence of clickbaits. In computationally detecting such clickbaits on social media with a supervised learning framework, one of the major obstacles is the lack of large-scale labeled training data, due to the high cost of labeling. To address this challenge, and building on recent advances in deep generative models, we propose to generate synthetic headlines with specific styles and explore their utility in improving clickbait detection. In particular, we propose to generate stylized headlines from original documents with style transfer. Generating stylized headlines is non-trivial because of several challenges, such as the discrete nature of text and the requirement to preserve the semantic meaning of the document while achieving style transfer. We therefore propose a novel solution, named Stylized Headline Generation (SHG), that can not only generate readable and realistic headlines to enlarge the original training data, but also help improve the classification capacity of supervised learning. The experimental results on real-world datasets demonstrate the effectiveness of SHG in generating high-quality and high-utility headlines for clickbait detection.
Citations: 49
Spatial Contextualization for Closed Itemset Mining
Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00155
Altobelli B. Mantuan, L. Fernandes
We present the Spatial Contextualization for Closed Itemset Mining (SCIM) algorithm, an approach that builds a space for the target database in such a way that relevant itemsets can be retrieved according to the relative spatial location of their items. Our algorithm uses Dual Scaling to map the items of the database to a multidimensional space called Solution Space. The representation of the database in the Solution Space assists in the interpretation and definition of overlapping clusters of related items. Therefore, instead of using the minimum support threshold, a distance threshold is defined concerning the reference and the maximum distances computed per cluster during the mapping procedure. Closed itemsets are efficiently retrieved by a new procedure that uses an FP-Tree, a CFI-Tree and the proposed spatial contextualization. Experiments show that the mean all-confidence measure of itemsets retrieved by our technique outperforms results from state-of-the-art algorithms. Additionally, we use the Minimum Description Length (MDL) metric to verify how descriptive the collections of mined patterns are.
Citations: 1
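Two notions from the SCIM abstract above, closed itemsets and the all-confidence measure, have standard definitions that a brute-force sketch can make concrete. The code below is not the SCIM procedure (which relies on Dual Scaling, an FP-Tree and a CFI-Tree); it only checks the definitions directly on a toy database. All function names and the data are hypothetical.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def all_confidence(itemset, transactions):
    """supp(X) divided by the largest single-item support within X."""
    max_item_support = max(support([item], transactions) for item in itemset)
    return support(itemset, transactions) / max_item_support


def is_closed(itemset, transactions):
    """True if adding any other item strictly lowers the support."""
    items = set(itemset)
    base = support(items, transactions)
    all_items = set().union(*transactions)
    return all(support(items | {extra}, transactions) < base
               for extra in all_items - items)


# Toy database (hypothetical data, not from the paper).
db = [
    {"a", "b", "c", "d"},
    {"a", "b", "d"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]

ac_ab = all_confidence({"a", "b"}, db)   # supp(ab) = 3, supp(a) = supp(b) = 4
closed_ab = is_closed({"a", "b"}, db)    # {a, b, d} has the same support
closed_abd = is_closed({"a", "b", "d"}, db)
```

Here `{a, b}` is not closed because `d` occurs in every transaction that contains both `a` and `b`, so the closed itemset covering those transactions is `{a, b, d}`.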
Distribution Preserving Multi-task Regression for Spatio-Temporal Data
Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00148
Xi Liu, P. Tan, Zubin Abraham, L. Luo, P. Hatami
For many spatio-temporal applications, building regression models that can reproduce the true data distribution is often as important as building models with high prediction accuracy. For example, knowing the future distribution of daily temperature and precipitation can help scientists determine their long-term trends and assess their potential impact on human and natural systems. As conventional methods are designed to minimize residual errors, the shape of their predicted distribution may not be consistent with their actual distribution. To overcome this challenge, this paper presents a novel, distribution-preserving multi-task learning framework for multi-location prediction of spatio-temporal data. The framework employs a non-parametric density estimation approach with L2-distance to measure the divergence between the predicted and true distribution of the data. Experimental results using climate data from more than 1500 weather stations in the United States show that the proposed framework reduces the distribution error for more than 78% of the stations without degrading the prediction accuracy significantly.
Citations: 4
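The divergence measure mentioned in the abstract above (non-parametric density estimates compared with an L2 distance) can be sketched as follows. This is an illustration under assumptions, not the paper's formulation: it uses a Gaussian kernel, a hand-picked bandwidth, and midpoint-rule integration of the squared difference, and it leaves out the multi-task regression objective entirely. The function names and the toy values are hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density


def squared_l2_distance(p, q, lo, hi, steps=2000):
    """Midpoint-rule approximation of the integral of (p - q)^2 on [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * width
        total += (p(x) - q(x)) ** 2
    return total * width


# Toy values (hypothetical): observations and two sets of predictions.
observed = [0.0, 1.0, 2.0, 3.0, 4.0]
shrunk = [1.5, 1.8, 2.0, 2.2, 2.5]   # predictions collapsed toward the mean
spread = [0.2, 0.9, 2.1, 3.0, 3.8]   # predictions that keep the spread

h = 0.5  # assumed bandwidth
p_obs = gaussian_kde(observed, h)
d_self = squared_l2_distance(p_obs, p_obs, -5.0, 9.0)
d_shrunk = squared_l2_distance(p_obs, gaussian_kde(shrunk, h), -5.0, 9.0)
d_spread = squared_l2_distance(p_obs, gaussian_kde(spread, h), -5.0, 9.0)
```

Predictions that collapse toward the mean give a narrower density than the observations and therefore a larger distance than predictions that preserve the spread, which is the kind of mismatch a distribution-preserving penalty is meant to expose.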
Volatility Drift Prediction for Transactional Data Streams
Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00140
Yun Sing Koh, David Tse Jung Huang, C. Pearce, G. Dobbie
The reasons for concept drift in a data stream can vary widely, from deterioration of a machine to a change in people's buying patterns. In order to effectively detect concept drifts, most predictive stream mining systems contain a drift detector that monitors and signals concept drifts. However, few of these systems are designed to find drifts in transactional datasets, which have unlabelled data. Transactional datasets describe events, such as orders or payments, which are traditionally analysed using association rules. In this paper, we propose a novel drift detection technique, ProChange, that has two parts. The first part is a drift detector, VR-Change, that finds both real and virtual drifts in unlabelled transactional data streams using the Hellinger distance. The second part is a drift predictor, which models the volatility of drifts using a probabilistic network to predict the location of future drifts. Using the predictor, we can dynamically adapt the confidence threshold, enabling VR-Change to be more sensitive around potential future drift points. We evaluated the performance of ProChange by comparing it against traditional detectors, showing that it detects both real and virtual drifts effectively and efficiently in terms of accuracy.
Citations: 6
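The abstract above names the Hellinger distance as the quantity VR-Change uses to find drifts. The sketch below shows only that distance, applied to item-frequency distributions of two transaction windows. How VR-Change actually builds its distributions, and how the threshold is adapted by the predictor, is not given in the abstract, so the windowing, the fixed `threshold`, and all names here are assumptions for illustration.

```python
import math
from collections import Counter


def item_distribution(transactions):
    """Relative frequency of each item across a window of transactions."""
    counts = Counter(item for t in transactions for item in t)
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    0 means identical distributions, 1 means disjoint supports.
    """
    items = set(p) | set(q)
    s = sum((math.sqrt(p.get(i, 0.0)) - math.sqrt(q.get(i, 0.0))) ** 2
            for i in items)
    return math.sqrt(s / 2.0)


# Toy windows of transactions (hypothetical data).
reference = [{"milk", "bread"}, {"milk"}, {"bread", "eggs"}]
similar = [{"milk", "bread"}, {"bread"}, {"milk", "eggs"}]
shifted = [{"tea", "jam"}, {"tea"}, {"jam"}]

threshold = 0.3  # assumed fixed value; the paper adapts it dynamically

d_similar = hellinger(item_distribution(reference), item_distribution(similar))
d_shifted = hellinger(item_distribution(reference), item_distribution(shifted))
drift_flags = [d > threshold for d in (d_similar, d_shifted)]
```

The second window contains the same items in the same proportions as the reference, so its distance is zero, while the third shares no items with it and reaches the maximum distance of 1.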
Outlier Detection in Urban Traffic Flow Distributions
Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00114
Y. Djenouri, A. Zimek, Marco Chiarandini
Urban traffic data consists of observations such as the number and speed of cars or other vehicles at certain locations, as measured by deployed sensors. These numbers can be interpreted as traffic flow, which in turn relates to the capacity of streets and the demand on the traffic system. City planners are interested in studying the impact of various conditions on the traffic flow that lead to unusual patterns, i.e., outliers. Existing approaches to outlier detection in urban traffic data take into account only individual flow values (i.e., an individual observation). This can be interesting for real-time detection of sudden changes. Here, we face a different scenario: the city planners want to learn from historical data how special circumstances (e.g., events or festivals) relate to unusual patterns in the traffic flow, in order to support improved planning of both events and the layout of the traffic system. Therefore, we propose to consider the sequence of traffic flow values observed within some time interval. Such flow sequences can be modeled as probability distributions of flows. We adapt an established outlier detection method, the local outlier factor (LOF), to handle flow distributions rather than individual observations. We apply the outlier detection online to extend the database with new flow distributions that are considered inliers. For the validation, we consider a special case of our framework for comparison with state-of-the-art outlier detection on flows. In addition, a real case study on urban traffic flow data shows that our method finds meaningful outliers in the traffic flow data.
Citations: 26
GINA: Group Gender Identification Using Privacy-Sensitive Audio Data
Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00061
Jiaxing Shen, Oren Lederman, Jiannong Cao, Florian Berg, Shaojie Tang, A. Pentland
Group gender is essential in understanding social interaction and group dynamics. With the increasing privacy concerns of studying face-to-face communication in natural settings, many participants are not open to raw audio recording. Existing voice-based gender identification methods rely on acoustic characteristics caused by physiological differences and phonetic differences. However, these methods might become ineffective with privacy-sensitive audio for two main reasons. First, compared to raw audio, privacy-sensitive audio contains significantly fewer acoustic features. Moreover, natural settings generate various uncertainties in the audio data. In this paper, we make the first attempt to identify group gender using privacy-sensitive audio. Instead of extracting acoustic features from privacy-sensitive audio, we focus on conversational features including turn-taking behaviors and interruption patterns. However, conversational behaviors are unstable in gender identification as human behaviors are affected by many factors like emotion and environment. We utilize ensemble feature selection and a two-stage classification to improve the effectiveness and robustness of our approach. Ensemble feature selection could reduce the risk of choosing an unstable subset of features by aggregating the outputs of multiple feature selectors. In the first stage, we infer the gender composition (mixed-gender or same-gender) of a group which is used as an additional input feature for identifying group gender in the second stage. The estimated gender composition significantly improves the performance as it could partially account for the dynamics in conversational behaviors. According to the experimental evaluation of 100 people in 273 meetings, the proposed method outperforms baseline approaches and achieves an F1-score of 0.77 using linear SVM.
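The ensemble feature selection step described in the abstract above can be illustrated with a minimal sketch. The abstract does not say which feature selectors or which aggregation rule GINA uses, so everything here is an assumption made for illustration: the two scoring functions (`mean_gap_scores`, `correlation_scores`), the mean-rank aggregation in `ensemble_select`, and the toy data are all hypothetical and are not the authors' method.

```python
from statistics import mean


def mean_gap_scores(X, y):
    """Hypothetical selector A: absolute gap between the two class means."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return scores


def correlation_scores(X, y):
    """Hypothetical selector B: absolute Pearson correlation with the label."""
    y_bar = mean(y)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_bar = mean(col)
        cov = sum((c - c_bar) * (t - y_bar) for c, t in zip(col, y))
        norm = (sum((c - c_bar) ** 2 for c in col)
                * sum((t - y_bar) ** 2 for t in y)) ** 0.5
        scores.append(abs(cov) / norm if norm else 0.0)
    return scores


def to_ranks(scores):
    """Convert higher-is-better scores to rank positions (0 = best)."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    ranks = [0] * len(scores)
    for position, j in enumerate(order):
        ranks[j] = position
    return ranks


def ensemble_select(X, y, selectors, k):
    """Aggregate the selectors by mean rank (an assumed rule) and keep k features."""
    n_features = len(X[0])
    rank_lists = [to_ranks(selector(X, y)) for selector in selectors]
    mean_rank = [mean(r[j] for r in rank_lists) for j in range(n_features)]
    best = sorted(range(n_features), key=lambda j: mean_rank[j])[:k]
    return sorted(best)


# Toy data (made up): feature 0 separates the classes strongly,
# feature 2 moderately, feature 1 not at all.
X = [[5.0, 1.0, 2.0], [6.0, 0.0, 3.0], [5.5, 1.0, 2.5],
     [1.0, 1.0, 1.0], [0.5, 0.0, 2.0], [1.5, 1.0, 1.5]]
y = [1, 1, 1, 0, 0, 0]

selected = ensemble_select(X, y, [mean_gap_scores, correlation_scores], k=2)
print(selected)
```

Aggregating ranks rather than raw scores keeps selectors with different score scales comparable, which is one common way to reduce the chance that a single unstable selector decides the feature subset.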
Citations: 10
A Machine Reading Comprehension-Based Approach for Featured Snippet Extraction
Pub Date : 2018-11-01 DOI: 10.1109/ICDM.2018.00195
Chen Zhang, Xuanyu Zhang, Hao Wang
The extraction of featured snippets can be framed as a Question Answering (QA) problem. This paper presents a featured snippet extraction system that employs machine reading comprehension (MRC). Specifically, we first analyze the characteristics of questions of different types and their corresponding answers. Then, we classify a given question into one of several types, and the type is incorporated as a key feature in the subsequent model configuration. Based on that, we present a model to extract candidate passages from recalled documents in an MRC fashion. Next, a novel MRC model with multiple stages of attention is proposed to extract answers from the selected passages. Last, in the answer re-ranking stage, we design a question type-adaptive model to produce the final answer. The experimental results on two open-domain QA datasets clearly validate the effectiveness of our system and models in featured snippet extraction.
Citations: 6