Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00085
Mickael Wajnberg, Petko Valtchev, A. Massé, A. Benmoussa, M. Krajinovic, C. Laverdière, E. Levy, D. Sinnett, V. Marcil
To gain an in-depth understanding of human diseases, biologists typically mine patient data for relevant patterns. Clinical datasets are often unlabeled and involve features, a.k.a. markers, split into classes w.r.t. biological functions, whereby target patterns might well mix both levels. As such heterogeneous patterns are beyond the reach of current analytical tools, dedicated miners, e.g. for association rules, need to be devised. Contemporary multi-relational (MR) association miners, while capable of mixing object types, are rather limited in rule shape (atomic conclusions) and ignore feature composition. Our own approach builds upon an MR extension of concept analysis, further enhanced with flexible propositionalisation operators and a dedicated MR modeling of patient data. The resulting MR association miner was validated on a pediatric oncology dataset.
Title: Mining Heterogeneous Associations from Pediatric Cancer Data by Relational Concept Analysis
Published in: 2020 International Conference on Data Mining Workshops (ICDMW)
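The paper's relational-concept-analysis miner is beyond a short snippet, but a naive single-table association-rule baseline illustrates the support/confidence mechanics it generalizes; the marker names and thresholds below are illustrative only.

```python
from itertools import combinations

def mine_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Naive support/confidence association-rule miner over a single
    transaction table (a classical baseline, not the paper's method)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        # fraction of transactions that contain the whole itemset
        return sum(itemset <= t for t in transactions) / n

    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            s = support(set(combo))
            if s < min_support:
                continue
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    lhs = set(lhs)
                    rhs = set(combo) - lhs
                    conf = s / support(lhs)
                    if conf >= min_confidence:
                        rules.append((lhs, rhs, s, conf))
    return rules

# Hypothetical patient "transactions" of abnormal markers
tx = [{"markerA", "markerB"}, {"markerA", "markerB", "markerC"},
      {"markerA", "markerB"}, {"markerC"}]
rules = mine_rules(tx)  # e.g. {markerA} -> {markerB} with confidence 1.0
```

The MR setting of the paper additionally mixes object types and feature classes, which a flat transaction table like this cannot express.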
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00084
Krati Saxena, Ashwini Patil, Sagar Sunkle, V. Kulkarni
Formulated products such as cosmetics, personal care, and pharmaceutical products, as well as industrial products such as paints and coatings, form a multi-billion dollar industry. In most of these industries, experts design new formulations based on their knowledge and basic searches of online and offline resources. Reference data for formulation design comes in several formats and from multiple sources with diverse representations. We present an approach to mining heterogeneous data for formulation design, with case studies from the cosmetics and steel coating industries. Our contribution is threefold. First, we show data extraction and mining techniques for multi-source and multi-modal text data. Second, we describe how we store and retrieve the data in graph databases. Lastly, we demonstrate the use of the extracted and stored data in a simple recommendation system, based on data search techniques, that aids experts in synthesizing new formulation designs.
Title: Mining Heterogeneous Data for Formulation Design
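A minimal in-memory sketch of the graph storage-and-retrieval idea: formulations linked to ingredients, with recommendation by ingredient overlap. The authors use a real graph database; the class, names, and toy formulations here are hypothetical.

```python
from collections import defaultdict

class FormulationGraph:
    """Toy stand-in for the paper's graph-database layer: a bipartite
    graph of formulations and ingredients with overlap-based search."""

    def __init__(self):
        self.ingredients_of = defaultdict(set)   # formulation -> ingredients
        self.used_in = defaultdict(set)          # ingredient -> formulations

    def add(self, formulation, ingredients):
        for ing in ingredients:
            self.ingredients_of[formulation].add(ing)
            self.used_in[ing].add(formulation)

    def recommend(self, ingredients, top_k=3):
        # rank stored formulations by how many query ingredients they share
        scores = defaultdict(int)
        for ing in ingredients:
            for f in self.used_in[ing]:
                scores[f] += 1
        return sorted(scores, key=lambda f: -scores[f])[:top_k]

g = FormulationGraph()
g.add("moisturizer_v1", {"water", "glycerin", "cetyl_alcohol"})
g.add("cleanser_v2", {"water", "sls", "citric_acid"})
best = g.recommend({"glycerin", "water"})  # moisturizer_v1 shares 2 ingredients
```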
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00028
Kai Deng, Jiajin Huang, Jin Qin
Session-based recommendation aims to predict the next item that a user may visit in the current session. By constructing a session graph, Graph Neural Networks (GNNs) are employed to capture the connectivity among items in the session graph for recommendation. Existing session-based recommendation methods with GNNs usually formulate recommendation as a classification problem and then use a single uniform loss to learn session graph representations. Such supervised learning methods only consider the classification loss, which is insufficient to capture node features from graph-structured data. As unsupervised graph learning methods emphasize the graph structure, this paper proposes the HybridGNN-SR model, which combines unsupervised and supervised graph learning to represent the item transition pattern in a session from a graph perspective. Specifically, in the unsupervised part, we propose to combine a Variational Graph Auto-Encoder (VGAE) with Mutual Information to represent the nodes in a session graph; in the supervised part, we employ a routing algorithm to extract higher-level conceptual features of a session for recommendation, taking dependencies among items in the session into consideration. Through extensive experiments on three public datasets, we demonstrate that HybridGNN-SR outperforms a number of state-of-the-art methods on session-based recommendation by integrating the strengths of unsupervised and supervised graph learning.
Title: HybridGNN-SR: Combining Unsupervised and Supervised Graph Learning for Session-based Recommendation
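The session-graph construction that GNN-based recommenders like this rely on can be sketched as follows; the outgoing-degree normalization is a common convention, assumed here rather than taken from the paper.

```python
import numpy as np

def session_graph(session):
    """Build the directed item-transition graph of a single session;
    rows are normalized by outgoing degree (a common convention)."""
    items = sorted(set(session))
    idx = {item: i for i, item in enumerate(items)}
    A = np.zeros((len(items), len(items)))
    for a, b in zip(session, session[1:]):
        A[idx[a], idx[b]] += 1.0          # one edge per observed transition
    out_deg = A.sum(axis=1, keepdims=True)
    A = np.divide(A, out_deg, out=np.zeros_like(A), where=out_deg > 0)
    return items, A

# A session that revisits item v2: edges v1->v2, v2->v3, v3->v2
items, A = session_graph(["v1", "v2", "v3", "v2"])
```

A GNN then propagates item embeddings over `A`; the paper's contribution lies in the losses and routing applied on top of such a graph, not in this preprocessing step.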
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00053
Jacopo Fior, Luca Cagliero
In the last decade, the Artificial Intelligence and Data Science communities have paid increasing attention to the problem of forecasting stock market movements. The abundance of stock-related data, including price series, news articles, financial reports, and social content, has driven the use of Machine Learning techniques in quantitative stock trading. In this field, a huge body of work has been devoted to identifying the most predictive features and selecting the best-performing algorithms. However, since algorithm performance is heavily affected by the granularity of the analyzed time series as well as by the amount of historical data used to train the ML models, identifying the most appropriate time granularity and ML pipeline can be challenging. This paper studies the relationship between the granularity of time series data and ML performance. It also compares the performance of established ML pipelines in order to evaluate the pros and cons of periodically retraining the ML models. Furthermore, it takes a step towards the integration of ML into real trading systems by studying how best to configure the most established trading system characteristics. The results provide preliminary empirical evidence on how to profitably trade U.S. NASDAQ-100 stocks and leave room for further investigations.
Title: Exploring the Use of Data at Multiple Granularity Levels in Machine Learning-Based Stock Trading
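A toy illustration of the granularity question: aggregating a fine-grained close-price series into coarser bars before feeding a model. The paper works with full market data; the timestamps and prices here are made up, and only closing prices are kept.

```python
from collections import OrderedDict

def resample_close(minute_bars, step):
    """Aggregate a minute-level close-price series into coarser bars
    (last price seen in each bucket), a toy stand-in for the
    granularity comparison the paper performs with full OHLC data."""
    out = OrderedDict()
    for t, price in minute_bars:
        out[t // step] = price  # later prices in the bucket overwrite earlier ones
    return list(out.values())

# Hypothetical minute-level closes; timestamps are minutes from the open
bars = [(0, 100.0), (1, 101.0), (2, 99.5), (3, 102.0), (4, 101.5), (5, 103.0)]
coarse = resample_close(bars, step=3)  # one bar per 3 minutes
```

Training the same pipeline on `bars` versus `coarse` is exactly the kind of comparison the study runs at scale.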
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00010
N. Alsadhan, D. Skillicorn
Sentiment analysis attempts to measure the strength of the relationship between a person and an object, sometimes a concrete object such as a product and sometimes an abstract object such as a brand. There is considerable confusion about the form of this relationship: it is typically assumed to be a feeling, and so connected to emotions and moods. Here we argue, and demonstrate, that the relationship is better modelled as a cognitive one, and so connected to attitudes. We demonstrate that the more a lexicon avoids mood and emotion words, the greater its prediction accuracy for reviews of Amazon products.
Title: Sentiment is an Attitude not a Feeling
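The lexicon manipulation the experiment rests on can be sketched directly: score reviews with and without mood/emotion entries. The lexicon below is a hypothetical toy, not the one used in the paper.

```python
def lexicon_score(text, lexicon, emotion_words=frozenset()):
    """Average polarity of a review under a word-score lexicon,
    optionally dropping mood/emotion entries -- the manipulation
    whose effect on accuracy the paper measures."""
    words = text.lower().split()
    active = {w: s for w, s in lexicon.items() if w not in emotion_words}
    hits = [active[w] for w in words if w in active]
    return sum(hits) / len(hits) if hits else 0.0

# Illustrative lexicon: the last two entries are emotion/mood words
lexicon = {"reliable": 1.0, "durable": 1.0, "broken": -1.0,
           "happy": 1.0, "angry": -1.0}
review = "reliable but broken hinge made me angry"
full = lexicon_score(review, lexicon)
attitude_only = lexicon_score(review, lexicon, emotion_words={"happy", "angry"})
```

Comparing predictions from `full` and `attitude_only` scores against star ratings is the style of evaluation the paper reports for Amazon reviews.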
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00116
George Sanchez
The Supreme Court Database provided by Washington University (in St. Louis) School of Law is an essential legal research tool. The Supreme Court Database is organized and categorized into Issue Areas to make it easy for legal researchers to find on-point cases for an area of law. This paper used a semi-supervised learning approach to automatically categorize the Supreme Court's opinions into Issue Areas. An inductive cluster-then-label approach was used, employing a fast Hierarchical Navigable Small World graph index over USE (Universal Sentence Encoder) embeddings in a non-metric space. After obtaining labels from the semi-supervised approach, we evaluated several classification approaches on the resulting data, achieving weighted-average F1-scores of 0.75 for an SVM with max-norm features, 0.78 for an RNN, and 0.68 for BERT.
Title: Using Unlabeled Data for US Supreme Court Case Classification
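A brute-force stand-in for the cluster-then-label step: assign each unlabeled opinion the Issue Area of its nearest labeled seed under cosine similarity. The paper uses an HNSW index over 512-dimensional USE embeddings; the 2-D vectors and labels here are illustrative only.

```python
import numpy as np

def propagate_labels(embeddings, seed_idx, seed_labels):
    """Label each embedding with its nearest labeled seed under cosine
    similarity -- a brute-force stand-in for an approximate-nearest-
    neighbor (HNSW) index over sentence embeddings."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X[seed_idx].T              # cosine similarity to each seed
    return [seed_labels[j] for j in sims.argmax(axis=1)]

# Tiny hypothetical "embeddings"; real USE vectors are 512-D
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = propagate_labels(emb, seed_idx=[0, 2],
                          seed_labels=["CivilRights", "Economics"])
```

The labels produced this way then become training data for the downstream SVM/RNN/BERT classifiers the paper compares.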
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00068
Subhajit Das, Panpan Xu, Zeng Dai, A. Endert, Liu Ren
Typical deep neural networks (DNNs) are complex black-box models, and their decision-making process can be difficult to comprehend even for experienced machine learning practitioners. Therefore, their use could be limited in mission-critical scenarios despite state-of-the-art performance on many challenging ML tasks. Through this work, we empower users to interpret DNNs with a post-hoc analysis protocol. We propose ProtoFac, an explainable matrix factorization technique that decomposes the latent representations at any selected layer of a pre-trained DNN into a collection of weighted prototypes, which are a small number of exemplars extracted from the original data (e.g. image patches, shapelets). Using the factorized weights and prototypes, we build a surrogate model for interpretation by replacing the corresponding layer in the neural network. We identify a number of desired properties of ProtoFac, including authenticity, interpretability, and simplicity, and propose the optimization objective and training procedure accordingly. The method is model-agnostic and can be applied to DNNs with varying architectures. It goes beyond per-sample feature-based explanation by providing prototypes as a condensed set of evidence used by the model for decision making. We applied ProtoFac to interpret pretrained DNNs for a variety of ML tasks, including time series classification on electrocardiograms and image classification. The results show that ProtoFac is able to extract meaningful prototypes to explain the models' decisions while truthfully reflecting the models' operation. We also evaluated human interpretability through Amazon Mechanical Turk (MTurk), showing that ProtoFac is able to produce interpretable and user-friendly explanations.
Title: Interpreting Deep Neural Networks through Prototype Factorization
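The core factorization idea, reduced to a hedged sketch: approximate a layer's activation matrix as weighted combinations of rows that are themselves real samples (the "authenticity" property), with prototype selection assumed given and weights fit by plain least squares rather than the paper's training procedure.

```python
import numpy as np

def prototype_factorize(X, proto_idx):
    """Approximate latent activations X (samples x dims) as weighted
    combinations of actual rows of X serving as prototypes. A least-
    squares sketch of the factorization idea, not ProtoFac itself."""
    P = X[proto_idx]                          # prototypes are real exemplars
    W, *_ = np.linalg.lstsq(P.T, X.T, rcond=None)
    return W.T, P                             # X is approximated by W @ P

# Toy activations: the third sample is a blend of the first two
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
W, P = prototype_factorize(X, proto_idx=[0, 1])
recon_err = np.abs(W @ P - X).max()
```

Inspecting each row of `W` shows which exemplars the surrogate leans on for that sample, which is the form of explanation the paper evaluates.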
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00016
Diana Nurbakova, Liana Ermakova, Irina Ovchinnikova
Social media have become a major source of health information for lay people. They have the power to influence the public's adoption of health policies and to shape the response to the current COVID-19 pandemic. The aim of this paper is to enhance understanding of the personality characteristics of users who spread information about controversial COVID-19 medical treatments on Twitter.
Title: Understanding the Personality of Contributors to Information Cascades in Social Media in Response to the COVID-19 Pandemic
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00071
Maryam Heidari, James H. Jones, Özlem Uzuner
Social media platforms can expose influential trends in many aspects of everyday life. However, the trends they represent can be contaminated by disinformation. Social bots are one of the significant sources of disinformation in social media and can pose serious cyber threats to society and public opinion. This research aims to develop machine learning models that detect bots based on user profiles extracted from tweet text. Online user profiles reflect the user's personal information, such as age, gender, education, and personality. In this work, the user's profile is constructed from the user's online posts. This work's main contribution is threefold. First, we aim to improve bot detection through machine learning models based on the personal information conveyed by a user's online comments. The similarity of personal information across two online posts makes it difficult to differentiate a bot from a human user; in this research, however, we leverage that similarity as an advantage for the new bot detection model. The proposed model creates user profiles based on personal information such as age, personality, gender, and education from users' online posts, and introduces a machine learning model that detects social bots with high prediction accuracy from this personal information. Second, we create a new public data set that provides user profiles for more than 6,900 Twitter accounts in the Cresci 2017 [1] data set. All user profiles are extracted from the users' online posts on Twitter. Third, for the first time, this paper uses a deep contextualized word embedding model, ELMo [2], for a social media bot detection task.
Title: Deep Contextualized Word Embedding for Text-based Online User Profiling to Detect Social Bots on Twitter
Pub Date: 2020-11-01 | DOI: 10.1109/ICDMW51313.2020.00023
Natsuki Sano
In statistical disclosure control, releasing synthetic data makes it difficult to identify individual records, since the values in synthetic data differ from those in the original data. We propose two methods of generating synthetic data using principal component analysis: orthogonal transformation (a linear method) and sandglass-type neural networks (a nonlinear method). While the typical multiple-imputation approach to generating synthetic data requires common variables between population and survey data, our proposed method can generate synthetic data without common variables. Additionally, the linear method can explicitly evaluate information loss as the ratio of discarded eigenvalues. We generate synthetic data by the proposed methods for decathlon data and evaluate four information loss measures: our proposed measure, the mean absolute error for each record, the mean absolute error of the mean of each variable, and the mean absolute error of the covariance between variables. We find that the information loss of the linear method is smaller than that of the nonlinear method.
Title: Synthetic Data by Principal Component Analysis
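The linear method, as described in the abstract, can be sketched directly: project centered records onto the top principal components, map back, and report the discarded-eigenvalue ratio as information loss. The data below is a random stand-in, not the decathlon data.

```python
import numpy as np

def pca_synthesize(X, keep):
    """Linear method from the abstract: project centered data onto the
    top `keep` principal components and map back, so reconstructed
    records differ from the originals; information loss is the ratio
    of discarded eigenvalues."""
    mean = X.mean(axis=0)
    Xc = X - mean
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1]            # sort eigenvalues descending
    vals, vecs = vals[order], vecs[:, order]
    V = vecs[:, :keep]
    synth = Xc @ V @ V.T + mean               # orthogonal transform and back
    loss = vals[keep:].sum() / vals.sum()     # discarded-eigenvalue ratio
    return synth, loss

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # stand-in for survey records
synth, loss = pca_synthesize(X, keep=2)
```

Keeping all components reproduces the data exactly with zero loss, which is why the method's privacy protection hinges on discarding components.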