Yegor Tkachenko, Mykel J. Kochenderfer, Krzysztof Kluza
Optimization of control policies for corporate customer relationship management (CRM) systems can boost customer satisfaction, reduce attrition, and increase expected lifetime value of the customer base. However, evaluation of these policies is often complicated. Policies can be evaluated with real-life marketing interactions, but such evaluation can be prohibitively expensive and time consuming. Customer simulators learned from data are an inexpensive alternative suitable for rapid campaign tests. We summarize the literature on the evaluation of direct marketing policies through simulation and propose a decomposition of the problem into distinct tasks: (a) generation of the initial client database snapshot and (b) propagation of clients through time in response to company actions. We present open-source simulators trained and validated on two direct marketing data sets of varying size and complexity.
{"title":"Customer Simulation for Direct Marketing Experiments","authors":"Yegor Tkachenko, Mykel J. Kochenderfer, Krzysztof Kluza","doi":"10.1109/DSAA.2016.59","DOIUrl":"https://doi.org/10.1109/DSAA.2016.59","url":null,"abstract":"Optimization of control policies for corporate customer relationship management (CRM) systems can boost customer satisfaction, reduce attrition, and increase expected lifetime value of the customer base. However, evaluation of these policies is often complicated. Policies can be evaluated with real-life marketing interactions, but such evaluation can be prohibitively expensive and time consuming. Customer simulators learned from data are an inexpensive alternative suitable for rapid campaign tests. We summarize the literature on the evaluation of direct marketing policies through simulation and propose a decomposition of the problem into distinct tasks: (a) generation of the initial client database snapshot and (b) propagation of clients through time in response to company actions. We present open-source simulators trained and validated on two direct marketing data sets of varying size and complexity.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"271 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131552204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Classification with high dimensional variables is a popular goal in many modern statistical studies. Fisher's linear discriminant analysis (LDA) is a common and effective tool for classifying entities into existing groups. It is well known that classification using Fisher's discriminant for high dimensional data can be as bad as random guessing, because the many noise features inflate the misclassification rate. Recently, it has been acknowledged that complex biological mechanisms arise through multiple features working together, even though individually these features may contribute to noise accumulation in the data. In view of this, it is important to perform classification with discriminant vectors that use a subset of important variables, while also exploiting prior biological relationships among features. We tackle this problem in this article and propose methods that incorporate variable selection into the classification problem in order to identify important biomarkers. Furthermore, we incorporate into the LDA problem prior information on the relationships among variables, encoded as undirected graphs, to identify functionally meaningful biomarkers. We compare our methods to existing sparse LDA approaches via simulation studies and real data analysis.
{"title":"Sparse Linear Discriminant Analysis in Structured Covariates Space","authors":"S. Safo, Q. Long","doi":"10.1002/sam.11376","DOIUrl":"https://doi.org/10.1002/sam.11376","url":null,"abstract":"Classification with high dimensional variables is a popular goal in many modern statistical studies. Fisher's linear discriminant analysis (LDA) is a common and effective tool for classifying entities into existing groups. It is well known that classification using Fisher's discriminant for high dimensional data is as bad as random guessing due to the many noise features that increases misclassification rate. Recently, it is being acknowledged that complex biological mechanisms occur through multiple features working together, though individually these features may contribute to noise accumulation in the data. In view of these, it is important to perform classification with discriminant vectors that use a subset of important variables, while also utilizing prior biological relationships among features. We tackle this problem in this article and propose methods that incorporate variable selection into the classification problem, for the identification of important biomarkers. Furthermore, we incorporate into the LDA problem prior information on the relationships among variables using undirected graphs in order to identify functionally meaningful biomarkers. We compare our methods to existing sparse LDA approaches via simulation studies and real data analysis.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121807498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. F. Sánchez-Rada, C. Iglesias, Ignacio Corcuera, Óscar Araque
Sentiment and emotion analysis technologies have quickly gained momentum in industry and academia. This popularity has spawned a myriad of services and tools. In the absence of common interfaces and models, each of these services imposes its own interface and representation model. This heterogeneity makes it costly to integrate different services, evaluate them, or switch between them. This work aims to remedy that heterogeneity by providing an extensible framework and an API aligned with the NIF service specification, together with a reference implementation, a first step towards successful and cost-effective adoption. The specific contributions of this paper are: (i) the Senpy framework, (ii) an architecture for the framework that follows a plug-in approach, (iii) a reference open source implementation of the architecture, and (iv) the use and validation of the framework and architecture in a European big data sentiment analysis project. Our aim is to foster the development of a new generation of emotion-aware services by isolating the development of new algorithms from the representation of results and the deployment of services.
{"title":"Senpy: A Pragmatic Linked Sentiment Analysis Framework","authors":"J. F. Sánchez-Rada, C. Iglesias, Ignacio Corcuera, Óscar Araque","doi":"10.1109/DSAA.2016.79","DOIUrl":"https://doi.org/10.1109/DSAA.2016.79","url":null,"abstract":"Sentiment and emotion analysis technologies have quickly gained momentum in industry and academia. This popularity has spawned a myriad of service and tools. Due to the lack of common interfaces and models, each of these services imposes specific interfaces and representation models. Heterogeneity makes it costly to integrate different services, evaluate them or switch between them. This work aims to remedy heterogeneity by providing an extensible framework and an API aligned with the NIF service specification. It also includes a reference implementation, a first step towards a successful and cost-effective adoption. The specific contributions in this paper are: (i) the Senpy framework, (ii) an architecture for the framework that follows a plug-in approach, (iii) a reference open source implementation of the architecture, (iv) the use and validation of the framework and architecture in a big data sentiment analysis European project. Our aim is to foster the development of a new generation of emotion aware services by isolating the development of new algorithms from the representation of results and the deployment of services.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122000326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A cloud platform website, offering a catalog of services, operates under a freemium or free trial business model, aggressively marketing to customers who have previously visited. In such a cloud platform or service business, accurate identification of high-profile customers is central to the success of the business. However, existing approaches face several limitations because of the following challenges: (1) heavy customer traffic flows, (2) noise in user behaviors, (3) a lack of collaboration across stakeholders, (4) class-imbalanced customer data (few paying customers vs. many freemium or trial customers), and (5) unpredictable business environments. In this paper, we propose a data-driven iterative sales lead prediction framework for everything as a service (XaaS) in the cloud, including cloud platforms and software. In this framework, we collaborate with multiple business stakeholders through the BizDevOps process to extract business insights. From these insights, we compute service usage scores using our RFDL (Recency, Frequency, Duration, and Lifetime) analysis and predict sales leads from the usage scores in a supervised manner. Our framework adapts to a continuously changing environment by iterating the whole process, maintains its sales lead prediction performance, and shares the prediction results effectively with the sales and marketing teams. A three-month pilot implementation of the framework led to more than 300 paying customers and more than $200K in additional revenue. We expect our scalable, iterative sales lead prediction approach to be widely applicable to online or cloud business domains with a constant flux of customer traffic.
{"title":"Data-Driven Sales Leads Prediction for Everything-as-a-Service in the Cloud","authors":"Chul Sung, Bo Zhang, Chunhui Y. Higgins, Y. Choe","doi":"10.1109/DSAA.2016.83","DOIUrl":"https://doi.org/10.1109/DSAA.2016.83","url":null,"abstract":"A cloud platform website, offering a catalog of services, operates under a freemium business model or a free trial business model, aggressively marketing to customers who have previously visited. In such a cloud platform or service business, accurate identification of high profile customers is central to the success for the business. However, there are several limitations of existing approaches because of the following challenges: (1) heavy customer traffic flows, (2) the noise in user behaviors, (3) a lack of collaboration across stakeholders, (4) class imbalanced customer data (few paying customers vs. high numbers of freemium or trial customers), and (5) unpredictable business environments. In this paper, we propose a data-driven iterative sales lead prediction framework for cloud everything as a service (XaaS), including a cloud platform or software. In this framework, from the BizDevOps process we collaborate to extract business insights from multiple business stakeholders. From these business insights, we calculate service usage scores using our RFDL (Recency, Frequency, Duration, and Lifetime) analysis and estimate sales lead prediction based on the usage scores in a supervised manner. Our framework adapts to a continuously changing environment through iterations of the whole process, maintains its performance of sales lead prediction, and finally shares the prediction results to the sales or marketing team effectively. A three-month pilot implementation of the framework led to more than 300 paying customers and more than $200K increase in revenue. We expect our scalable, iterative sales lead prediction approach to be widely applicable to online or cloud business domains where there is a constant flux of customer traffic.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126832973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents two strategies for speeding up the alternating direction method of multipliers (ADMM) on distributed data. In the first method, inspired by stochastic gradient descent, each machine uses only a subset of its data during the first few iterations, which speeds up those iterations. A key result is a proof that despite this approximation, our method enjoys the same convergence rate, in terms of the number of iterations, as standard ADMM, and hence is faster overall. The second method also follows the idea of sampling a subset of the data to update the model before each round of communication. It converts the objective to an approximate dual form and performs ADMM on the dual. The method turns out to be a distributed variant of the recently proposed SDCA-ADMM. Yet, compared with a straightforward distributed implementation of SDCA-ADMM, the proposed method enjoys less frequent communication between machines, better memory usage, and lighter computational demand. Experiments demonstrate the effectiveness of both strategies.
{"title":"Efficient Sampling-Based ADMM for Distributed Data","authors":"Jun-Kun Wang, Shou-de Lin","doi":"10.1109/DSAA.2016.41","DOIUrl":"https://doi.org/10.1109/DSAA.2016.41","url":null,"abstract":"This paper presents two strategies to speed up the alternating direction method of multipliers (ADMM) for distributed data. In the first method, inspired by stochastic gradient descent, each machine uses only a subset of its data at the first few iterations, speeding up those iterations. A key result is in proving that despite this approximation, our method enjoys the same convergence rate in terms of the number of iterations as the standard ADMM, and hence is faster overall. The second method also follows the idea of sampling a subset of the data to update the model before the communication of each round. It converts an objective to the approximated dual form and performs ADMM on the dual. The method turns out to be a distributed variant of the recently proposed SDCA-ADMM. Yet, compared to the straightforward distributed implementation of SDCA-ADMM, the proposed method enjoys less frequent communication between machines, better memory usage, and lighter computational demand. Experiments demonstrate the effectiveness of our two strategies.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115241083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we investigate the role of mentions in tweet propagation. We propose a novel tweet propagation model, SIR_MF, based on a multiplex network framework, which allows us to analyze the effect of mentioning on the final retweet count. The building blocks of this model are supported by a comprehensive study of multiple real datasets, and simulations of the model show close agreement with the empirically observed tweet popularity. Our studies and experiments also reveal that follower count, retweet rate, and profile similarity are important factors in gaining tweet popularity, and they allow us to better understand the impact of mention strategies on the retweet count. Interestingly, we analytically identify a critical retweet rate that regulates the role of mentions in tweet popularity. Finally, our data-driven simulations demonstrate that the proposed mention recommendation heuristic, "Easy-Mention", outperforms the benchmark "Whom-To-Mention" algorithm.
{"title":"On the Role of Mentions on Tweet Virality","authors":"Soumajit Pramanik, Qinna Wang, Maximilien Danisch, Sumanth Bandi, Anand Kumar, Jean-Loup Guillaume, Bivas Mitra","doi":"10.1109/DSAA.2016.28","DOIUrl":"https://doi.org/10.1109/DSAA.2016.28","url":null,"abstract":"In this paper, we investigate the role of mentions on tweet propagation. We propose a novel tweet propagation model SIR_MF based on a multiplex network framework, that allows to analyze the effects of mentioning on final retweet count. The basic bricks of this model are supported by a comprehensive study of multiple real datasets and simulations of the model show a nice agreement with the empirically observed tweet popularity. Studies and experiments also reveal that follower count, retweet rate & profile similarity are important factors in gaining tweet popularity and allow to better understand the impact of the mention strategies on the retweet count. Interestingly, we analytically identify a critical retweet rate regulating the role of mention on the tweet popularity. Finally, our data driven simulation demonstrates that the proposed mention recommendation heuristic \"Easy-Mention\" outperforms the benchmark \"Whom-To-Mention\" algorithm.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116282213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose a novel change detection method for temporal networks. In typical change detection algorithms, change scores are generated from an observed time series, and when a score reaches a threshold, an alert is raised to declare a change. Our method aggregates these change scores and alerts based on network centralities. Many types of changes in a network can be discovered from changes to the network structure, so nodes and links should be monitored in order to recognize changes. However, it is difficult to focus on the appropriate nodes and links when there is little information about the dataset. Network centralities such as PageRank measure the importance of nodes in a network according to certain criteria. It is therefore natural to apply network centralities to improve the accuracy of change detection methods. Our analysis reveals how and when network centralities work well for change detection. Based on this understanding, we propose an aggregation algorithm that emphasizes the appropriate network centralities. Our evaluation of the proposed aggregation algorithm showed highly accurate predictions on an artificial dataset and two real datasets. Our method extends the field of change detection in temporal networks by exploiting network centralities.
{"title":"Temporal Network Change Detection Using Network Centralities","authors":"Yoshitaro Yonamoto, K. Morino, K. Yamanishi","doi":"10.1109/DSAA.2016.13","DOIUrl":"https://doi.org/10.1109/DSAA.2016.13","url":null,"abstract":"In this paper, we propose a novel change detection method for temporal networks. In usual change detection algorithms, change scores are generated from an observed time series. When this change score reaches a threshold, an alert is raised to declare the change. Our method aggregates these change scores and alerts based on network centralities. Many types of changes in a network can be discovered from changes to the network structure. Thus, nodes and links should be monitored in order to recognize changes. However, it is difficult to focus on the appropriate nodes and links when there is little information regarding the dataset. Network centrality such as PageRank measures the importance of nodes in a network based on certain criteria. Therefore, it is natural to apply network centralities in order to improve the accuracy of change detection methods. Our analysis reveals how and when network centrality works well in terms of change detection. Based on this understanding, we propose an aggregating algorithm that emphasizes the appropriate network centralities. Our evaluation of the proposed aggregation algorithm showed highly accurate predictions for an artificial dataset and two real datasets. Our method contributes to extending the field of change detection in temporal networks by utilizing network centralities.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"303 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116329489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jingxiang Chen, Tao Wang, Ralph Abbey, J. Pingenot
Decision tree algorithms are very popular in the field of data mining. This paper proposes a distributed decision tree algorithm and shows examples of its implementation on big data platforms. The major contribution of this paper is the novel KS-Tree algorithm, which builds a decision tree in a distributed environment. KS-Tree is applied to several real-world data mining problems and compared with state-of-the-art decision tree techniques implemented in R and Apache Spark. The results show that KS-Tree achieves better results, especially on large data sets. Furthermore, we demonstrate that KS-Tree can be applied to various data mining tasks, such as variable selection.
{"title":"A Distributed Decision Tree Algorithm and Its Implementation on Big Data Platforms","authors":"Jingxiang Chen, Tao Wang, Ralph Abbey, J. Pingenot","doi":"10.1109/DSAA.2016.64","DOIUrl":"https://doi.org/10.1109/DSAA.2016.64","url":null,"abstract":"Decision tree algorithms are very popular in the field of data mining. This paper proposes a distributed decision tree algorithm and shows examples of its implementation on big data platforms. The major contribution of this paper is the novel KS-Tree algorithm which builds a decision tree in a distributed environment. KS-Tree is applied to some real world data mining problems and compared with state-of-the-art decision tree techniques that are implemented in R and Apache Spark. The results show that KS-Tree can achieve better results, especially with large data sets. Furthermore, we demonstrate that KS-Tree can be applied to various data mining tasks, such as variable selection.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128271426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile or cellular phones can record various types of context data related to a user's phone call activities. In this paper, we present an approach to discovering individualized behavior rules for mobile users from their phone call records, based on the temporal context in which a user accepts, rejects, or misses a call. One determinant of an individual's phone behavior is the set of activities undertaken at various times of the day and on different days of the week; in many cases, such behavior follows temporal patterns. Currently, researchers modeling user behavior with temporal context statically segment time into arbitrary categories (e.g., morning, evening) or fixed periods (e.g., one hour). However, such time categorization does not necessarily map to the patterns of individual user activity and subsequent behavior. Therefore, we propose a behavior-oriented time segmentation (BOTS) technique that dynamically identifies diverse time segments for an individual user's behaviors based on the phone call records. Experiments on real datasets show that our proposed technique better captures a user's dominant call response behavior at various times of the day and week, enabling more appropriate rules to be created for the automated handling of incoming calls in an intelligent call interruption management system.
{"title":"Behavior-Oriented Time Segmentation for Mining Individualized Rules of Mobile Phone Users","authors":"Iqbal H. Sarker, A. Colman, M. A. Kabir, Jun Han","doi":"10.1109/DSAA.2016.60","DOIUrl":"https://doi.org/10.1109/DSAA.2016.60","url":null,"abstract":"Mobile or cellular phones can record various types of context data related to a user's phone call activities. In this paper, we present an approach to discovering individualized behavior rules for mobile users from their phone call records, based on the temporal context in which a user accepts, rejects or misses a call. One of the determinants of an individual's phone behavior is the various activities undertaken at various times of a day and days of the week. In many cases, such behavior will follow temporal patterns. Currently, researchers modeling user behavior using temporal context statically segment time into arbitrary categories (e.g., morning, evening) or periods (e.g., 1 hour). However, such time categorization does not necessarily map to the patterns of individual user activity and subsequent behavior. Therefore, we propose a behavior-oriented time segmentation (BOTS) technique that dynamically identifies diverse time segments for an individual user's behaviors based on the phone call records. Experiments on real datasets show that our proposed technique better captures the user's dominant call response behavior at various times of the day and week, thereby enabling more appropriate rules to be created for the purpose of automated handling of incoming calls, in an intelligent call interruption management system.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122848600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern automobiles have been proven vulnerable to hacking by security researchers. By exploiting vulnerabilities in a car's external interfaces, such as Wi-Fi, Bluetooth, and physical connections, attackers can access the car's controller area network (CAN) bus. On the CAN bus, commands can be sent to control the car, for example cutting the brakes or stopping the engine. While securing the car's interfaces to the outside world is an important part of mitigating this threat, the last line of defence is detecting malicious behaviour on the CAN bus. We propose an anomaly detector based on a long short-term memory (LSTM) neural network to detect CAN bus attacks. The detector works by learning to predict the next data word originating from each sender on the bus; highly surprising bits in the actual next word are flagged as anomalies. We evaluate the detector by synthesizing anomalies from modified CAN bus data, designed to mimic attacks reported in the literature. We show that the detector can detect the synthesized anomalies with low false alarm rates. Additionally, the granularity of the bit predictions can give forensic investigators clues as to the nature of flagged anomalies.
{"title":"Anomaly Detection in Automobile Control Network Data with Long Short-Term Memory Networks","authors":"Adrian Taylor, Sylvain P. Leblanc, N. Japkowicz","doi":"10.1109/DSAA.2016.20","DOIUrl":"https://doi.org/10.1109/DSAA.2016.20","url":null,"abstract":"Modern automobiles have been proven vulnerable to hacking by security researchers. By exploiting vulnerabilities in the car's external interfaces, such as wifi, bluetooth, and physical connections, they can access a car's controller area network (CAN) bus. On the CAN bus, commands can be sent to control the car, for example cutting the brakes or stopping the engine. While securing the car's interfaces to the outside world is an important part of mitigating this threat, the last line of defence is detecting malicious behaviour on the CAN bus. We propose an anomaly detector based on a Long Short-Term Memory neural network to detect CAN bus attacks. The detector works by learning to predict the next data word originating from each sender on the bus. Highly surprising bits in the actual next word are flagged as anomalies. We evaluate the detector by synthesizing anomalies with modified CAN bus data. The synthesized anomalies are designed to mimic attacks reported in the literature. We show that the detector can detect anomalies we synthesized with low false alarm rates. Additionally, the granularity of the bit predictions can provide forensic investigators clues as to the nature of flagged anomalies.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128077029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}