2020 International Conference on Data Mining Workshops (ICDMW): Latest Papers

Mining Heterogeneous Associations from Pediatric Cancer Data by Relational Concept Analysis
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00085
Mickael Wajnberg, Petko Valtchev, A. Massé, A. Benmoussa, M. Krajinovic, C. Laverdière, E. Levy, D. Sinnett, V. Marcil
To gain an in-depth understanding of human diseases, biologists typically mine patient data for relevant patterns. Clinical datasets are often unlabeled and involve features, a.k.a. markers, split into classes w.r.t. biological functions, whereby target patterns might well mix both levels. As such heterogeneous patterns are beyond the reach of current analytical tools, dedicated miners, e.g. for association rules, need to be devised. Contemporary multi-relational (MR) association miners, while capable of mixing object types, are rather limited in rule shape (atomic conclusions) and ignore feature composition. Our own approach builds upon an MR extension of concept analysis, further enhanced with flexible propositionalisation operators and dedicated MR modeling of patient data. The resulting MR association miner was validated on a pediatric oncology dataset.
Citations: 1
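The abstract above contrasts rules with atomic conclusions against richer rule shapes. As a minimal sketch of the two standard association-rule measures (support and confidence) applied to a rule whose conclusion holds more than one item, consider the following. The patient records and marker names (`m1`, `m2`, `m3`) are hypothetical toy data, and the code is plain single-table rule scoring, not the authors' multi-relational miner.

```python
from typing import FrozenSet, List

# Hypothetical toy records: each patient is the set of markers observed.
patients: List[FrozenSet[str]] = [
    frozenset({"m1", "m2", "m3"}),
    frozenset({"m1", "m2"}),
    frozenset({"m1", "m3"}),
    frozenset({"m2", "m3"}),
]


def support(itemset: FrozenSet[str], records: List[FrozenSet[str]]) -> float:
    """Fraction of records that contain every item of `itemset`."""
    return sum(1 for r in records if itemset <= r) / len(records)


def confidence(premise: FrozenSet[str], conclusion: FrozenSet[str],
               records: List[FrozenSet[str]]) -> float:
    """support(premise ∪ conclusion) divided by support(premise)."""
    base = support(premise, records)
    if base == 0:
        raise ValueError("premise never occurs in the records")
    return support(premise | conclusion, records) / base


rule_premise = frozenset({"m1"})
rule_conclusion = frozenset({"m2", "m3"})  # conclusion with two items
rule_support = support(rule_premise | rule_conclusion, patients)
rule_confidence = confidence(rule_premise, rule_conclusion, patients)
```

Here the rule `m1 -> {m2, m3}` is supported by one of four records, and one of the three records containing `m1` also contains both concluded markers.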
Mining Heterogeneous Data for Formulation Design
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00084
Krati Saxena, Ashwini Patil, Sagar Sunkle, V. Kulkarni
Formulated products such as cosmetics, personal care and pharmaceutical products, and industrial products such as paints and coatings, are a multi-billion dollar industry. In most of these industries, experts design new formulations based on their knowledge and basic searches of online and offline resources. Reference data for formulation design comes in several formats and from multiple sources with diverse representations. We present an approach to mine the heterogeneous data for formulation design, with case studies from the cosmetics and steel coating industries. Our contribution is threefold. First, we show data extraction and mining techniques for multi-source and multi-modal text data. Second, we describe how we store and retrieve the data in graph databases. Lastly, we demonstrate the use of extracted and stored data in a simple recommendation system based on data search techniques that aid experts in synthesizing new formulation designs.
Citations: 2
HybridGNN-SR: Combining Unsupervised and Supervised Graph Learning for Session-based Recommendation
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00028
Kai Deng, Jiajin Huang, Jin Qin
Session-based recommendation aims to predict the next item that a user may visit in the current session. By constructing a session graph, Graph Neural Networks (GNNs) are employed to capture the connectivity among items in the session graph for recommendation. Existing session-based recommendation methods with GNNs usually formulate the recommendation problem as a classification problem, and then use a specific uniform loss to learn session graph representations. Such supervised learning methods only consider the classification loss, which is insufficient to capture the node features from graph-structured data. As unsupervised graph learning methods emphasize the graph structure, this paper proposes the HybridGNN-SR model, which combines unsupervised and supervised graph learning to represent the item transition pattern in a session from a graph perspective. Specifically, in the unsupervised part, we propose to combine a Variational Graph Auto-Encoder (VGAE) with Mutual Information to represent nodes in a session graph; in the supervised part, we employ a routing algorithm to extract higher conceptual features of a session for recommendation, which takes dependencies among items in the session into consideration. Through extensive experiments on three public datasets, we demonstrate that HybridGNN-SR outperforms a number of state-of-the-art methods on session-based recommendation by integrating the strengths of the unsupervised and supervised graph learning methods.
Citations: 1
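The session graph mentioned in the abstract above is commonly built by turning each pair of consecutive clicks into a directed edge. The sketch below shows that construction only, assuming edge weights are simple transition counts; the function name `build_session_graph` and the item ids are hypothetical, and the abstract does not specify how HybridGNN-SR weights or normalizes its edges.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_session_graph(session: List[str]) -> Dict[Tuple[str, str], int]:
    """Count directed transitions between consecutive items of one session."""
    # zip(session, session[1:]) yields (current_item, next_item) pairs.
    return dict(Counter(zip(session, session[1:])))


# Hypothetical click sequence within a single session.
clicks = ["a", "b", "c", "b", "c"]
graph = build_session_graph(clicks)
```

Repeated transitions such as `b -> c` accumulate weight, which is one simple way to keep revisit information that a plain sequence model would see only as order.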
Exploring the Use of Data at Multiple Granularity Levels in Machine Learning-Based Stock Trading
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00053
Jacopo Fior, Luca Cagliero
In the last decade the Artificial Intelligence and Data Science communities have paid increasing attention to the problem of forecasting stock market movements. The abundance of stock-related data, including price series, news articles, financial reports, and social content, has fostered the use of Machine Learning techniques to drive quantitative stock trading. In this field, a huge body of work has been devoted to identifying the most predictive features and selecting the best-performing algorithms. However, since algorithm performance is heavily affected by the granularity of the analyzed time series as well as by the amount of historical data used to train the ML models, identifying the most appropriate time granularity and ML pipeline can be challenging. This paper studies the relationship between the granularity of time series data and ML performance. It also compares the performance of established ML pipelines in order to evaluate the pros and cons of periodically retraining the ML models. Furthermore, it takes a step further towards the integration of ML into real trading systems by studying how to conveniently set up the most established trading system characteristics. The results provide preliminary empirical evidence on how to profitably trade U.S. NASDAQ-100 stocks and leave room for further investigations.
Citations: 1
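Changing the granularity of a price series, as studied in the abstract above, usually means merging fine-grained bars into coarser ones. The following is a minimal sketch under the assumption that the data are open/high/low/close bars; the `Bar` type, the `coarsen` helper, and the prices are hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple


class Bar(NamedTuple):
    open: float
    high: float
    low: float
    close: float


def coarsen(bars: List[Bar], factor: int) -> List[Bar]:
    """Merge every `factor` consecutive bars into one coarser bar."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    merged = []
    for start in range(0, len(bars), factor):
        chunk = bars[start:start + factor]  # the last chunk may be shorter
        merged.append(Bar(
            open=chunk[0].open,               # first open of the chunk
            high=max(b.high for b in chunk),  # highest high
            low=min(b.low for b in chunk),    # lowest low
            close=chunk[-1].close,            # last close of the chunk
        ))
    return merged


# Hypothetical fine-grained bars, merged pairwise.
fine = [Bar(10, 12, 9, 11), Bar(11, 13, 10, 12),
        Bar(12, 12, 8, 9), Bar(9, 10, 9, 10)]
coarse = coarsen(fine, 2)
```

Running the same model on `fine` and on `coarse` is the kind of comparison the granularity question calls for; how to handle a trailing partial chunk is a design choice left open here.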
Sentiment is an Attitude not a Feeling
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00010
N. Alsadhan, D. Skillicorn
Sentiment analysis attempts to measure the strength of the relationship between a person and an object, sometimes a concrete object such as a product and sometimes an abstract object such as a brand. There is considerable confusion about the form of this relationship: it is typically assumed to be a feeling (and so connected to emotions and moods). Here we argue, and demonstrate, that the relationship is better modelled as a cognitive one, and so connected to attitudes. We demonstrate that the more a lexicon avoids mood and emotion words, the greater its prediction accuracy for reviews of Amazon products.
Citations: 0
Using Unlabeled Data for US Supreme Court Case Classification
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00116
George Sanchez
The Supreme Court Database provided by Washington University (in St. Louis) School of Law is an essential legal research tool. The Supreme Court Database is organized and categorized into Issue Areas to make it easy for legal researchers to find on-point cases for an area of law. This paper used a semi-supervised learning approach to automatically categorize the Supreme Court's opinions into Issue Areas. An inductive cluster-then-label approach was used, employing a nonmetric space in a fast Hierarchical Navigable Small World graph index containing USE (Universal Sentence Encoder) embeddings. After obtaining the labels from the semi-supervised approach, we evaluated several classification approaches on the data, achieving the following weighted-average F1 scores: SVM with Max Norm Features 0.75, RNN 0.78, and BERT 0.68.
Citations: 0
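The scores reported in the abstract above are weighted-average F1 values, which average the per-class F1 scores with each class weighted by its number of true instances. A minimal sketch of that metric follows; the labels `A` and `B` and the predictions are hypothetical toy data, not results from the paper.

```python
from collections import Counter
from typing import List


def weighted_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Average per-class F1, weighting each class by its count in y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        total += count * f1
    return total / len(y_true)


# Hypothetical labels for four documents.
truth = ["A", "A", "A", "B"]
preds = ["A", "A", "B", "B"]
score = weighted_f1(truth, preds)
```

In this toy case class `A` has F1 = 0.8 with weight 3 and class `B` has F1 = 2/3 with weight 1, so frequent classes dominate the average.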
Interpreting Deep Neural Networks through Prototype Factorization
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00068
Subhajit Das, Panpan Xu, Zeng Dai, A. Endert, Liu Ren
Typical deep neural networks (DNNs) are complex black-box models and their decision making process can be difficult to comprehend even for experienced machine learning practitioners. Therefore their use could be limited in mission-critical scenarios despite state-of-the-art performance on many challenging ML tasks. Through this work, we empower users to interpret DNNs with a post-hoc analysis protocol. We propose ProtoFac, an explainable matrix factorization technique that decomposes the latent representations at any selected layer in a pre-trained DNN as a collection of weighted prototypes, which are a small number of exemplars extracted from the original data (e.g. image patches, shapelets). Using the factorized weights and prototypes we build a surrogate model for interpretation by replacing the corresponding layer in the neural network. We identify a number of desired properties of ProtoFac, including authenticity, interpretability, and simplicity, and propose the optimization objective and training procedure accordingly. The method is model-agnostic and can be applied to DNNs with varying architectures. It goes beyond per-sample feature-based explanation by providing prototypes as a condensed set of evidence used by the model for decision making. We applied ProtoFac to interpret pretrained DNNs for a variety of ML tasks including time series classification on electrocardiograms, and image classification. The results show that ProtoFac is able to extract meaningful prototypes to explain the models' decisions while truthfully reflecting the models' operation. We also evaluated human interpretability through Amazon Mechanical Turk (MTurk), showing that ProtoFac is able to produce interpretable and user-friendly explanations.
Citations: 8
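The core idea in the abstract above, expressing a latent representation as a weighted collection of prototypes, can be illustrated with a small least-squares fit. This is only a sketch under strong assumptions: the prototypes are fixed in advance, the weights are constrained to be non-negative, and the fit uses projected gradient descent. The abstract does not state ProtoFac's actual objective or training procedure, and `fit_weights` is a hypothetical helper.

```python
from typing import List


def fit_weights(latent: List[float], prototypes: List[List[float]],
                steps: int = 200, lr: float = 0.1) -> List[float]:
    """Non-negative w minimising ||latent - sum_k w[k] * prototypes[k]||^2."""
    w = [0.0] * len(prototypes)
    for _ in range(steps):
        # Reconstruction from the current weights, and its residual.
        recon = [sum(w[k] * p[j] for k, p in enumerate(prototypes))
                 for j in range(len(latent))]
        resid = [r - h for r, h in zip(recon, latent)]
        # Gradient step per weight, then clip at zero (assumed constraint).
        for k, p in enumerate(prototypes):
            grad = 2.0 * sum(e * pj for e, pj in zip(resid, p))
            w[k] = max(0.0, w[k] - lr * grad)
    return w


# Hypothetical 2-D latent vector and two prototype vectors.
protos = [[1.0, 0.0], [0.0, 1.0]]
weights = fit_weights([2.0, 0.0], protos)
```

The resulting weights say how strongly each prototype contributes to the representation, which is the quantity a prototype-based explanation would show to a user. The fixed step size suits these unit-length toy prototypes and would need tuning for other scales.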
Understanding the Personality of Contributors to Information Cascades in Social Media in Response to the COVID-19 Pandemic
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00016
Diana Nurbakova, Liana Ermakova, Irina Ovchinnikova
Social media have become a major source of health information for lay people. They have the power to influence the public's adoption of health policies and to determine the response to the current COVID-19 pandemic. The aim of this paper is to enhance understanding of the personality characteristics of users who spread information about controversial COVID-19 medical treatments on Twitter.
Citations: 3
Deep Contextualized Word Embedding for Text-based Online User Profiling to Detect Social Bots on Twitter
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00071
Maryam Heidari, James H. Jones, Özlem Uzuner
Social media platforms can expose influential trends in many aspects of everyday life. However, the trends they represent can be contaminated by disinformation. Social bots are one of the significant sources of disinformation in social media. Social bots can pose serious cyber threats to society and public opinion. This research aims to develop machine learning models to detect bots based on the user's profile extracted from a Tweet's text. Online user profiles show the user's personal information, such as age, gender, education, and personality. In this work, the user's profile is constructed based on the user's online posts. This work's main contribution is three-fold. First, we aim to improve bot detection through machine learning models based on the user's personal information generated from the user's online comments. The similarity of personal information when comparing two online posts makes it difficult to differentiate a bot from a human user. However, in this research, we leverage personal information similarity between two online posts as an advantage for the new bot detection model. The proposed model for bot detection creates user profiles based on personal information such as age, personality, gender, and education from the user's online posts, and introduces a machine learning model to detect social bots with high prediction accuracy based on personal information. Second, we create a new public data set that shows the user's profile for more than 6900 Twitter accounts in the Cresci 2017 [1] data set. All user profiles are extracted from the users' online posts on Twitter. Third, for the first time, this paper uses a deep contextualized word embedding model, ELMo [2], for a social media bot detection task.
Citations: 59
Synthetic Data by Principal Component Analysis
Pub Date : 2020-11-01 DOI: 10.1109/ICDMW51313.2020.00023
Natsuki Sano
In statistical disclosure control, releasing synthetic data makes individual records difficult to identify, since synthetic values differ from the original values. We propose two methods of generating synthetic data using principal component analysis: orthogonal transformation (linear method) and sandglass-type neural networks (nonlinear method). While the typical approach of generating synthetic data by multiple imputation requires common variables between population and survey data, our proposed method can generate synthetic data without common variables. Additionally, the linear method can explicitly evaluate information loss as the ratio of discarded eigenvalues. We generate synthetic data with the proposed method for decathlon data and evaluate four information loss measures: our proposed information loss measure, the mean absolute error for each record, the mean absolute error of the mean of each variable, and the mean absolute error of the covariance between variables. We find that information loss in the linear method is less than that in the nonlinear method.
Citations: 1
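
The abstract of "Synthetic Data by Principal Component Analysis" says the linear method can evaluate information loss as the ratio of discarded eigenvalues. The listing gives no formula, so the sketch below is only one plausible reading: the sum of the eigenvalues of the discarded principal components divided by the sum of all eigenvalues. The function name `discarded_eigenvalue_ratio`, the parameter `n_kept`, and the example eigenvalues are illustrative assumptions, not taken from the paper.

```python
def discarded_eigenvalue_ratio(eigenvalues, n_kept):
    """Share of total variance carried by the discarded principal components.

    Assumed definition (not confirmed by the paper):
    sum of discarded eigenvalues / sum of all eigenvalues.
    """
    if not 0 <= n_kept <= len(eigenvalues):
        raise ValueError("n_kept must be between 0 and the number of eigenvalues")
    # Sort in descending order so the kept components are the largest ones.
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    return sum(ordered[n_kept:]) / total


# Hypothetical covariance-matrix eigenvalues, chosen only for illustration.
example_eigenvalues = [4.0, 3.0, 2.0, 1.0]
loss = discarded_eigenvalue_ratio(example_eigenvalues, n_kept=2)
print(loss)
```

Under this reading, keeping every component gives a loss of 0 and keeping none gives a loss of 1, so the measure can be compared across datasets with different numbers of variables.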