Online social platforms are constantly under attack by bad actors. These bad actors often leverage resources (e.g. IPs, devices) under their control to attack the platform by targeting various vulnerable endpoints (e.g. account authentication, sybil account creation, friending), which may process millions to billions of events every day. As malicious behaviors grow in scale and variety, and new endpoints and corresponding events are utilized and processed every day, the development of fast, extensible and schema-agnostic anomaly detection approaches that enable standardized protocols for different classes of events is critical. This is a notable challenge given that practitioners often have neither the time nor the means to custom-build anomaly detection services for each new event class and type. Moreover, labeled data is rarely available in such diverse settings, making unsupervised methods appealing. In this work, we study unsupervised, schema-agnostic characterization and detection of resource usage anomalies in social event logs. We propose an efficient algorithmic approach to this end, and evaluate it with promising results on several log datasets of different event classes. Specifically, our contributions include a) formulation: a novel articulation of the schema-agnostic anomaly detection problem for event logs, b) approach: we propose FARE (Finding Anomalous Resources and Events), which integrates online resource anomaly detection and offline event culpability identification components, and c) efficacy: demonstrated accuracy (100% precision@250 on two industrial datasets from the Snapchat platform, with ~50% of anomalies previously uncaught by state-of-the-art production defenses), robustness (high precision/recall over suitable synthetic attacks and parameter choices) and scalability (near-linear in the number of events).
{"title":"FARE: Schema-Agnostic Anomaly Detection in Social Event Logs","authors":"Neil Shah","doi":"10.1109/DSAA.2019.00049","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00049","url":null,"abstract":"Online social platforms are constantly under attack by bad actors. These bad actors often leverage resources (e.g. IPs, devices) under their control to attack the platform by targeting various, vulnerable endpoints (e.g. account authentication, sybil account creation, friending) which may process millions to billions of events every day. As the scale and multifacetedness of malicious behaviors grows, and new endpoints and corresponding events are utilized and processed every day, the development of fast, extensible and schema-agnostic anomaly detection approaches to enable standardized protocols for different classes of events is critical. This is a notable challenge given that practitioners often have neither time nor means to custom-build anomaly detection services for each new event class and type. Moreover, labeled data is rarely available in such diverse settings, making unsupervised methods appealing. In this work, we study unsupervised, schema-agnostic characterization and detection of resource usage anomalies in social event logs. We propose an efficient algorithmic approach to this end, and evaluate it with promising results on several log datasets of different event classes. Specifically, our contributions include a) formulation: a novel articulation of the schema-agnostic anomaly detection problem for event logs, b) approach: we propose FARE (Finding Anomalous Resources and Events), which integrates online resource anomaly detection and offline event culpability identification components, and c) efficacy: demonstrated accuracy (100% precision@250 on two industrial datasets from the Snapchat platform, with ~50% anomalies previously uncaught by state-of-the-art production defenses), robustness (high precision/recall over suitable synthetic attacks and parameter choices) and scalability (near-linear in the number of events).","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121790725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data stream mining is among the most contemporary branches of machine learning. The potentially infinite sources give us many opportunities and at the same time pose new challenges. To properly handle streaming data, we need to improve our well-established methods so they can work with dynamic data under strict constraints. Supervised streaming machine learning algorithms require a certain number of labeled instances in order to stay up-to-date. Since high budgets dedicated for this purpose are usually infeasible, we have to limit the supervision as much as we can. One possible approach is to trigger labeling only if a change is explicitly indicated by a detector. While there are several supervised algorithms dedicated to this purpose, the more practical unsupervised ones still lack proper attention. In this paper, we propose a novel unsupervised ensemble drift detector that recognizes local changes in feature subspaces (EDFS) without additional supervision, using specialized committees of incremental Kolmogorov-Smirnov tests. We combine it with an adaptive classifier and update the classifier only if the drift detector signals a change. Our experiments show that the framework is able to efficiently adapt to various concept drifts and to outperform other unsupervised algorithms.
{"title":"Unsupervised Drift Detector Ensembles for Data Stream Mining","authors":"Lukasz Korycki, B. Krawczyk","doi":"10.1109/DSAA.2019.00047","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00047","url":null,"abstract":"Data stream mining is among the most contemporary branches of machine learning. The potentially infinite sources give us many opportunities and at the same time pose new challenges. To properly handle streaming data we need to improve our well-established methods, so they can work with dynamic data and under strict constraints. Supervised streaming machine learning algorithms require a certain number of labeled instances in order to stay up-to-date. Since high budgets dedicated for this purpose are usually infeasible, we have to limit the supervision as much as we can. One possible approach is to trigger labeling, only if a change is explicitly indicated by a detector. While there are several supervised algorithms dedicated for this purpose, the more practical unsupervised ones are still lacking a proper attention. In this paper, we propose a novel unsupervised ensemble drift detector that recognizes local changes in feature subspaces (EDFS) without additional supervision, using specialized committees of incremental Kolmogorov-Smirnov tests. We combine it with an adaptive classifier and update it, only if the drift detector signalizes a change. Conducted experiments show that our framework is able to efficiently adapt to various concept drifts and outperform other unsupervised algorithms.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"2018 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121536689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frank Madrid, Shailendra Singh, Q. Chesnais, K. Mauck, Eamonn J. Keogh
In domains as diverse as entomology and sports medicine, analysts are routinely required to label large amounts of time series data. In a few rare cases, this can be done automatically with a classification algorithm. In many domains, however, complex, noisy, and polymorphic data can defeat state-of-the-art classifiers, yet easily yield to human inspection and annotation. This is especially true if the human can access auxiliary information and previous annotations. This labeling task can be a significant bottleneck in scientific progress. For example, an entomology or sports physiology lab may produce several days' worth of time series each day. In this work, we introduce an algorithm that greatly reduces the human effort required. Our interactive algorithm groups subsequences and invites the user to label a group's prototype, brushing the label to all members of the group. Thus, our task reduces to optimizing the grouping(s) to allow our system to ask the fewest questions of the user. As we shall show, on diverse domains, we can reduce the human effort by at least an order of magnitude, with no decrease in accuracy.
{"title":"Matrix Profile XVI: Efficient and Effective Labeling of Massive Time Series Archives","authors":"Frank Madrid, Shailendra Singh, Q. Chesnais, K. Mauck, Eamonn J. Keogh","doi":"10.1109/DSAA.2019.00061","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00061","url":null,"abstract":"In domains as diverse as entomology and sports medicine, analysts are routinely required to label large amounts of time series data. In a few rare cases, this can be done automatically with a classification algorithm. In many domains however, complex, noisy, and polymorphic data can defeat state-of-the-art classifiers, yet easily yield to human inspection and annotation. This is especially true if the human can access auxiliary information and previous annotations. This labeling task can be a significant bottleneck in scientific progress. For example, an entomology or sports physiology lab may produce several days worth of time series each day. In this work, we introduce an algorithm that greatly reduces the human effort required. Our interactive algorithm groups subsequences and invites the user to label a group's prototype, brushing the label to all members of the group. Thus, our task reduces to optimizing the grouping(s), to allow our system to ask the fewest questions of the user. As we shall show, on diverse domains, we can reduce the human effort by at least an order of magnitude, with no decrease in accuracy.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130039738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shahrooz Abghari, V. Boeva, Jens P. Brage, C. Johansson, Håkan Grahn, Niklas Lavesson
We propose a higher order mining (HOM) approach for modelling, monitoring and analyzing district heating (DH) substations' operational behaviour and performance. HOM is concerned with mining over patterns rather than primary or raw data. The proposed approach uses a combination of data analysis techniques such as sequential pattern mining, clustering analysis, consensus clustering and minimum spanning trees (MST). Initially, a substation's operational behaviour is modelled by extracting weekly patterns and performing clustering analysis. The substation's performance is then monitored by assessing its modelled behaviour over every two consecutive weeks. If a significant difference is observed, further analysis is performed by integrating the built models into a consensus clustering and applying an MST to identify deviating behaviours. The results of the study show that our method is robust in detecting deviating and sub-optimal behaviours of DH substations. In addition, the proposed method can help domain experts interpret and understand the substations' behaviour and performance through the accompanying data analysis and visualization techniques.
{"title":"Higher Order Mining for Monitoring District Heating Substations","authors":"Shahrooz Abghari, V. Boeva, Jens P. Brage, C. Johansson, Håkan Grahn, Niklas Lavesson","doi":"10.1109/DSAA.2019.00053","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00053","url":null,"abstract":"We propose a higher order mining (HOM) approach for modelling, monitoring and analyzing district heating (DH) substations' operational behaviour and performance. HOM is concerned with mining over patterns rather than primary or raw data. The proposed approach uses a combination of different data analysis techniques such as sequential pattern mining, clustering analysis, consensus clustering and minimum spanning tree (MST). Initially, a substation's operational behaviour is modeled by extracting weekly patterns and performing clustering analysis. The substation's performance is monitored by assessing its modeled behaviour for every two consecutive weeks. In case some significant difference is observed, further analysis is performed by integrating the built models into a consensus clustering and applying an MST for identifying deviating behaviours. The results of the study show that our method is robust for detecting deviating and sub-optimal behaviours of DH substations. In addition, the proposed method can facilitate domain experts in the interpretation and understanding of the substations' behaviour and performance by providing different data analysis and visualization techniques.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129497437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Farhadi, David Chen, R. McCoy, Christopher G. Scott, J. Miller, C. Vachon, Che Ngufor
Efforts to improve early identification of aggressive, high-grade breast cancers, which pose the greatest risk to patient health if not detected early, are hindered by the rarity of these events. To address this problem, we proposed an accurate and efficient deep transfer learning method to handle the imbalanced data problem that is prominent in breast cancer data. In contrast to existing approaches based primarily on large image databases, we focused on structured data, which has not been commonly used for deep transfer learning. We used a number of publicly available breast cancer data sets to generate a "pre-trained" model and transfer learned concepts to predict high-grade malignant tumors in patients diagnosed with breast cancer at Mayo Clinic. We compared our results with state-of-the-art techniques for addressing the problem of imbalanced learning and confirmed the superiority of the proposed method. To further demonstrate the ability of the proposed method to handle different degrees of class imbalance, a series of experiments were performed on publicly available breast cancer data under simulated class-imbalance settings. Based on the experimental results, we concluded that the proposed deep transfer learning on structured data can be used as an efficient method to handle imbalanced class problems in clinical research.
{"title":"Breast Cancer Classification using Deep Transfer Learning on Structured Healthcare Data","authors":"A. Farhadi, David Chen, R. McCoy, Christopher G. Scott, J. Miller, C. Vachon, Che Ngufor","doi":"10.1109/DSAA.2019.00043","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00043","url":null,"abstract":"Efforts to improve early identification of aggressive high grade breast cancers, which pose the greatest risk to patient health if not detected early, are hindered by the rarity of these events. To address this problem, we proposed an accurate and efficient deep transfer learning method to handle the imbalanced data problem that is prominent in breast cancer data. In contrast to existing approaches based primarily on large image databases, we focused on structured data, which has not been commonly used for deep transfer learning. We used a number of publicly available breast cancer data sets to generate a \"pre-trained\" model and transfer learned concepts to predict high grade malignant tumors in patients diagnosed with breast cancer at Mayo Clinic. We compared our results with state-of-the-art techniques for addressing the problem of imbalanced learning and confirmed the superiority of the proposed method. To further demonstrate the ability of the proposed method to handle different degrees of class imbalance, a series of experiments were performed on publicly available breast cancer data under simulated class imbalanced settings. Based on the experimental results, we concluded that the proposed deep transfer learning on structured data can be used as an efficient method to handle imbalanced class problems in clinical research.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128938801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
More than 16 million Americans served in World War II. Of these service members, over 400,000 were killed in action during the war. Today, more than 72,000 service members remain unaccounted for from World War II. The United States continues to diligently locate, recover, and identify missing personnel from World War II and other past conflicts to provide the fullest possible accounting. Importantly, this work provides closure and resolution to numerous US families. To fulfill this mission, massive amounts of information must be integrated from historical records, genealogy records, anthropological data, archeological data, odontology data, and DNA. These disparate data sources are produced and maintained by multiple agencies, with different data governance rules and different internal structuring of service member information. Previously, a manual approach was used to Extract, Transform, and Load (ETL) records from these different data sources, which created the potential for human error. In addition, a large number of person-hours were required to synthesize this data on a biweekly basis. To address this issue, we implemented (i) a regex decision tree to translate genealogical relationships into DNA type availability and (ii) a machine learning approach for record linkage between disparate data sources. This application is currently in production; it greatly reduces the person-hours needed and has a very low error rate for record translation and integration.
{"title":"Machine Learning for Efficient Integration of Record Systems for Missing US Service Members","authors":"Julia D. Warnke-Sommer, Franklin E. Damann","doi":"10.1109/DSAA.2019.00071","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00071","url":null,"abstract":"More than 16 million Americans served in World War II. Of these service members, over 400,000 were killed in action during the war. Today, more than 72,000 service members remain unaccounted for from World War II. The United States continues to diligently locate, recover, and identify missing personnel from World War II and other past conflicts to provide the fullest possible accounting. This work importantly provides closure and resolution to numerous US families. To fulfill this mission, massive amounts of information must be integrated from historical records, genealogy records, anthropological data, archeological data, odontology data, and DNA. These disparate data sources are produced and maintained by multiple agencies, with different data governance rules and different internal structuring of service member information. Previously, a manual approach had been undertaken to Extract, Transform, Load (ETL) records from these different data sources, which creates the potential for introduced human error. In addition, a large number of person-hours were required to synthesize this data on a biweekly basis. To address this issue, we implemented (i) a regex decision tree to translate genealogical relationships into DNA type availability and (ii) a machine learning approach for record-linkage between disparate data sources. This application is currently in production and greatly reduces person-hours needed and has a very low error rate for record translation and integration.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116763360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pattern mining is an important task of data mining and involves the extraction of interesting associations from large databases. Typically, pattern mining is carried out on huge databases, which tend to be updated repeatedly. Consequently, as a given database is updated, some of the patterns discovered may become invalid, while some new patterns may emerge. This has motivated significant research efforts in the area of incremental mining. The goal of incremental mining is to efficiently and incrementally mine patterns when a database is updated, as opposed to mining all of the patterns from scratch from the complete database. Accordingly, research efforts are being made to develop incremental pattern mining algorithms for extracting different kinds of patterns such as frequent patterns, sequential patterns and utility patterns. However, none of the existing works addresses incremental mining in the context of coverage patterns, which have important applications in areas such as banner advertising, search engine advertising and graph mining. In this regard, the main contributions of this work are three-fold. First, we introduce the problem of incremental mining in the context of coverage patterns. Second, we propose the IncCMine algorithm for efficiently extracting the knowledge of coverage patterns when an incremental database is added to the existing database. Third, we performed extensive experiments using two real-world click-stream datasets and one synthetic dataset. The results of our performance evaluation demonstrate that the proposed IncCMine algorithm significantly outperforms the existing CMine algorithm.
{"title":"An Incremental Technique for Mining Coverage Patterns in Large Databases","authors":"Akhil Ralla, P. Reddy, Anirban Mondal","doi":"10.1109/DSAA.2019.00036","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00036","url":null,"abstract":"Pattern mining is an important task of data mining and involves the extraction of interesting associations from large databases. Typically, pattern mining is carried out from huge databases, which tend to get updated several times. Consequently, as a given database is updated, some of the patterns discovered may become invalid, while some new patterns may emerge. This has motivated significant research efforts in the area of Incremental Mining. The goal of incremental mining is to efficiently and incrementally mine patterns when a database is updated as opposed to mining all of the patterns from scratch from the complete database. Incidentally, research efforts are being made to develop incremental pattern mining algorithms for extracting different kinds of patterns such as frequent patterns, sequential patterns and utility patterns. However, none of the existing works addresses incremental mining in the context of coverage patterns, which has important applications in areas such as banner advertising, search engine advertising and graph mining. In this regard, the main contributions of this work are three-fold. First, we introduce the problem of incremental mining in the context of coverage patterns. Second, we propose the IncCMine algorithm for efficiently extracting the knowledge of coverage patterns when incremental database is added to the existing database. Third, we performed extensive experiments using two real-world click stream datasets and one synthetic dataset. The results of our performance evaluation demonstrate that our proposed IncCMine algorithm indeed improves the performance significantly w.r.t. the existing CMine algorithm.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117317499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
W. Klement, S. Gilbert, D. Maziak, A. Seely, F. Shamji, S. Sundaresan, P. Villeneuve, N. Japkowicz
After lung surgery, a chest tube and a pump are used to manage air leaks and fluid drainage from the chest. The decision to remove or maintain the chest tube is based on drainage data collected from a digital pump that continuously monitors the patient. We construct a classifier to support this clinical decision-making process by identifying patients who may suffer adverse, extended air leaks early on. Intuitively, this problem can be modelled as a time series fitted to monitoring data. However, we present a solution using a simple classifier constructed from data collected in a specific time frame (36-48 hours) after surgery. We hypothesize that after surgery, patients struggle to attain a stable (favourable or adverse) status, which prevails after a period of discrepancies and inconsistencies in the data. The solution we propose is to identify the time frame in which the majority of patients achieve their state of stability. Advantages of this approach include better classification performance with a lower burden of data collection during patient treatment. The paper presents chest tube management as a classification task performed in a sliding window over time during patient monitoring. Our results show that reliable predictions can be achieved in the time window we identify, and that our classifier reduces unsafe chest tube removals at the expense of potentially maintaining a few tubes that could have been removed; that is, it ensures that chest tubes that need to be maintained are not removed, at the cost of keeping a few in place unnecessarily.
{"title":"Chest Tube Management After Lung Resection Surgery using a Classifier","authors":"W. Klement, S. Gilbert, D. Maziak, A. Seely, F. Shamji, S. Sundaresan, P. Villeneuve, N. Japkowicz","doi":"10.1109/DSAA.2019.00058","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00058","url":null,"abstract":"After lung surgery, a chest tube and a pump are used to manage air leaks and fluid drainage from the chest. The decision to remove or maintain the chest tube is based on drainage data collected from a digital pump that continuously monitors the patient. We construct a classifier to support this clinical decision-making process by identifying patients who may suffer adverse, extended air leaks early on. Intuitively, this problem can be modelled as a time-series fitted to monitoring data. However, we present a solution using a simple classifier constructed from data collected in a specific time frame (36- 48 hours) after surgery. We hypothesize that after surgery, patients struggle to attain a stable (favourable or adverse) status which prevails after a period of discrepancies and inconsistencies in the data. A solutions, we propose, is to identify this time frame when the majority of patients achieve their states of stability. Advantages of this approach include better classification performance with a lower burden of data collection during patient treatment. The paper presents the chest tube management as a classification task performed in a sliding window over time during patient monitoring. Our results show that reliable predictions can be achieved in the time window we identify, and that our classifier reduces unsafe chest tube removals at the expense of potentially maintaining a few that can be removed, i.e., we ensure that chest tubes that need to be maintained are not removed with potentially maintaining a few unnecessarily.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123026811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the Internet of Things (IoT) era, with the growing number of data sources, we face challenges such as the high cost of cloud storage caused by large volumes of data. Sending the entire volume of raw data is not practical if we want to minimize communication time and maintain performance. Thus, it is appropriate to make use of edge computing, i.e., data preprocessing on IoT gateways. In this paper, we propose a data reduction algorithm for a gateway that collects bridge vibration data from G-sensors. The data reduction algorithm is based on a pattern system, which consists of a pattern library and a pattern classifier. The pattern library is generated using the K-means clustering method. The results show that the proposed approach is effective for data reduction and outlier detection in bridge vibration data collection on the IoT gateway.
{"title":"Data Reduction for real-time bridge vibration data on Edge","authors":"Anthony Chen, Fu-Hsuan Liu, Sheng-De Wang","doi":"10.1109/DSAA.2019.00077","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00077","url":null,"abstract":"In the Internet of Things (IoT) era, with the growing number of data sources, we need to face some challenges such as high cost of the cloud storage caused by large amounts of data. To minimize the communication time and enhance the performance, sending the entire large amount of data is not practical. Thus, it is appropriate to make use of edge computing, or data preprocessing on IoT gateways. In this paper, we propose a data reduction algorithm for the gateway of bridge vibration G-sensors. The data reduction algorithm is based on a pattern system, which is comprised of a pattern library and a pattern classifier. The pattern library is generated by using the K-means clustering method. The results show that the proposed approach is effective in data reduction and outlier detection for bridge vibration data collection on the IoT gateway.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128849952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-step prediction of sea surface temperature (SST) is a challenging problem because small errors in short-range forecasts can compound into large errors at longer ranges. In this paper, we propose a hierarchical LSTM framework to improve the accuracy of long-term SST prediction. Our framework alleviates the error accumulation problem in multi-step prediction by leveraging outputs from an ensemble of physically-based dynamical models. Unlike previous methods, which simply take a linear combination of the outputs to produce a single deterministic forecast, our framework learns a nonlinear relationship among the ensemble member forecasts. In addition, its multi-level structure is designed to capture the temporal autocorrelation between forecasts generated for the same lead time as well as those generated for different lead times. Experiments performed using SST data from the tropical Pacific ocean region show that the proposed framework outperforms various baseline methods in more than 70% of the grid cells located in the study region.
{"title":"Hierarchical LSTM Framework for Long-Term Sea Surface Temperature Forecasting","authors":"Xi Liu, T. Wilson, P. Tan, L. Luo","doi":"10.1109/DSAA.2019.00018","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00018","url":null,"abstract":"Multi-step prediction of sea surface temperature (SST) is a challenging problem because small errors in its shortrange forecasts can be compounded to create large errors at longer ranges. In this paper, we propose a hierarchical LSTM framework to improve the accuracy for long-term SST prediction. Our framework alleviates the error accumulation problem in multi-step prediction by leveraging outputs from an ensemble of physically-based dynamical models. Unlike previous methods, which simply take a linear combination of the outputs to produce a single deterministic forecast, our framework learns a nonlinear relationship among the ensemble member forecasts. In addition, its multi-level structure is designed to capture the temporal autocorrelation between forecasts generated for the same lead time as well as those generated for different lead times. Experiments performed using SST data from the tropical Pacific ocean region show that the proposed framework outperforms various baseline methods in more than 70% of the grid cells located in the study region.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122696189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}