HyperSpark: A Data-Intensive Programming Environment for Parallel Metaheuristics
Pub Date: 2019-07-08 | DOI: 10.1109/BigDataCongress.2019.00024
M. Ciavotta, S. Krstic, D. Tamburri, W. Heuvel
Metaheuristics are search procedures used to solve complex, often intractable problems for which other approaches are unsuitable or unable to provide solutions in reasonable time. Although computing power has grown exponentially with the advent of Cloud Computing and Big Data platforms, the field of metaheuristics has not yet taken full advantage of this new potential. In this paper, we address this gap by proposing HyperSpark, an optimization framework for the scalable execution of user-defined, computationally intensive heuristics. We designed HyperSpark as a flexible tool that harnesses the benefits (e.g., scalability by design) and features (e.g., a simple programming model and ad-hoc infrastructure tuning) of state-of-the-art big data technology for the benefit of optimization methods. We elaborate on HyperSpark and assess its validity and generality on a library implementing several metaheuristics for the Permutation Flow-Shop Problem (PFSP). We observe that HyperSpark's results are comparable with the best tools and solutions from the literature. We conclude that our proof-of-concept shows great potential for further research and practical use.
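HyperSpark's own API is not reproduced here; as a minimal sketch of the general pattern the abstract describes, the following PySpark snippet (with an invented toy heuristic and objective) runs many independent instances of a user-defined search in parallel and reduces to the best solution found.

```python
import random
from pyspark.sql import SparkSession

def local_search(seed, n_iters=10_000):
    """Toy user-defined heuristic: random search on a dummy objective."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(n_iters):
        x = [rng.uniform(-5, 5) for _ in range(10)]
        f = sum(v * v for v in x)  # sphere function, to be minimized
        if f < best_f:
            best_x, best_f = x, f
    return best_f, best_x

spark = SparkSession.builder.appName("parallel-metaheuristics").getOrCreate()
# One independent heuristic run per seed; Spark schedules them across the cluster.
seeds = spark.sparkContext.parallelize(range(64), numSlices=64)
best_f, best_x = seeds.map(local_search).min(key=lambda r: r[0])
print("best objective found:", best_f)
spark.stop()
```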
{"title":"HyperSpark: A Data-Intensive Programming Environment for Parallel Metaheuristics","authors":"M. Ciavotta, S. Krstic, D. Tamburri, W. Heuvel","doi":"10.1109/BigDataCongress.2019.00024","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2019.00024","url":null,"abstract":"Metaheuristics are search procedures used to solve complex, often intractable problems for which other approaches are unsuitable or unable to provide solutions in reasonable times. Although computing power has grown exponentially with the onset of Cloud Computing and Big Data platforms, the domain of metaheuristics has not yet taken full advantage of this new potential. In this paper, we address this gap by proposing HyperSpark, an optimization framework for the scalable execution of user-defined, computationally-intensive heuristics. We designed HyperSpark as a flexible tool meant to harness the benefits (e.g., scalability by design) and features (e.g., a simple programming model or ad-hoc infrastructure tuning) of state-of-the-art big data technology for the benefit of optimization methods. We elaborate on HyperSpark and assess its validity and generality on a library implementing several metaheuristics for the Permutation Flow-Shop Problem (PFSP). We observe that HyperSpark results are comparable with the best tools and solutions from the literature. We conclude that our proof-of-concept shows great potential for further research and practical use.","PeriodicalId":335850,"journal":{"name":"2019 IEEE International Congress on Big Data (BigDataCongress)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124868500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Analytics and Predictive Modeling Approaches for the Energy Sector
Pub Date: 2019-07-08 | DOI: 10.1109/BigDataCongress.2019.00020
Roberto Corizzo, Michelangelo Ceci, D. Malerba
This paper describes recent results in the analysis of geo-distributed sensor data generated in the energy sector. The approaches described have their roots in the Big Data Analytics and Predictive Modeling research fields and are based on distributed architectures. They tackle the energy forecasting task for a network of energy production plants, while also taking into consideration the detection and treatment of anomalies in the data. This research is motivated by, and consistent with, the objectives of research projects funded by the European Commission and by many national governments.
{"title":"Big Data Analytics and Predictive Modeling Approaches for the Energy Sector","authors":"Roberto Corizzo, Michelangelo Ceci, D. Malerba","doi":"10.1109/BigDataCongress.2019.00020","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2019.00020","url":null,"abstract":"This paper describes recent results achieved in the analysis of geo-distributed sensor data generated in the context of the energy sector. The approaches described have roots in the Big Data Analytics and Predictive Modeling research fields and are based on distributed architectures. They tackle the energy forecasting task for a network of energy production plants, by also taking into consideration the detection and treatment of anomalies in the data. This research is motivated by and consistent with the objectives of research projects funded by the European Commission and by many national governments.","PeriodicalId":335850,"journal":{"name":"2019 IEEE International Congress on Big Data (BigDataCongress)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131086815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A New Unsupervised Predictive-Model Self-Assessment Approach That SCALEs
Pub Date: 2019-07-08 | DOI: 10.1109/BigDataCongress.2019.00033
F. Ventura, Stefano Proto, D. Apiletti, T. Cerquitelli, S. Panicucci, Elena Baralis, E. Macii, A. Macii
Evaluating the degradation of predictive models over time has always been a difficult task, especially considering that new, unseen data might not fit the training distribution. This is a well-known problem in real-world use cases, where collecting a historical training set covering all possible prediction labels may be very hard, too expensive, or completely unfeasible. To solve this issue, we present a new unsupervised approach to detect and evaluate the degradation of classification and prediction models, based on a scalable variant of the Silhouette index, named Descriptor Silhouette, specifically designed to advance current Big Data state-of-the-art solutions. The newly proposed strategy has been tested and validated over both synthetic and real-world industrial use cases. To this aim, it has been included in a framework named SCALE, where it proved to be efficient, and more effective in assessing the degradation of prediction performance than current state-of-the-art solutions.
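The Descriptor Silhouette index itself is not specified in the abstract; the sketch below uses the classic Silhouette score in the same spirit, as an unsupervised degradation signal: incoming unlabeled batches are assigned to the clustering learned at training time, and the score drops as the data drift away from that distribution. All data and parameters are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(2000, 8))  # historical training data
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)

def degradation_signal(batch: np.ndarray) -> float:
    """Silhouette of a new batch under the training-time clustering."""
    labels = model.predict(batch)
    if len(np.unique(labels)) < 2:
        return -1.0  # degenerate assignment: treat as fully degraded
    return silhouette_score(batch, labels)

baseline = degradation_signal(X_train)
drifted = degradation_signal(rng.normal(3.0, 2.0, size=(500, 8)))  # shifted data
print(f"baseline={baseline:.3f}, drifted={drifted:.3f}")  # expect drifted << baseline
```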
{"title":"A New Unsupervised Predictive-Model Self-Assessment Approach That SCALEs","authors":"F. Ventura, Stefano Proto, D. Apiletti, T. Cerquitelli, S. Panicucci, Elena Baralis, E. Macii, A. Macii","doi":"10.1109/BigDataCongress.2019.00033","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2019.00033","url":null,"abstract":"Evaluating the degradation of predictive models over time has always been a difficult task, also considering that new unseen data might not fit the training distribution. This is a well-known problem in real-world use cases, where collecting the historical training set for all possible prediction labels may be very hard, too expensive or completely unfeasible. To solve this issue, we present a new unsupervised approach to detect and evaluate the degradation of classification and prediction models, based on a scalable variant of the Silhouette index, named Descriptor Silhouette, specifically designed to advance current Big Data state-of-the-art solutions. The newly proposed strategy has been tested and validated over both synthetic and real-world industrial use cases. To this aim, it has been included in a framework named SCALE and resulted to be efficient and more effective in assessing the degradation of prediction performance than current state-of-the-art best solutions.","PeriodicalId":335850,"journal":{"name":"2019 IEEE International Congress on Big Data (BigDataCongress)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133651819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic Resource Shaping for Compute Clusters
Pub Date: 2019-07-08 | DOI: 10.1109/BigDataCongress.2019.00019
Francesco Pace, D. Milios, D. Carra, P. Michiardi
Nowadays, data centers are largely under-utilized because resource allocation is based on reservation mechanisms that ignore actual resource utilization. Indeed, it is common to reserve resources for peak demand, which may occur only for a small portion of the application lifetime. As a consequence, cluster resources often go under-utilized. In this work, we propose a mechanism that improves compute cluster utilization and responsiveness, while preventing application failures due to contention in accessing finite resources such as RAM. Our method monitors resource utilization and employs a data-driven approach to resource demand forecasting, featuring quantification of uncertainty in the predictions. Using the demand forecast and its confidence, our mechanism modulates the cluster resources assigned to running applications, reducing the turnaround time by more than an order of magnitude while keeping application failures under control. Thus, tenants enjoy a responsive system and providers benefit from efficient cluster utilization.
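The paper's forecasting model is not shown in the abstract; the following sketch (names and policy are assumptions) illustrates the core idea of shaping allocations from a probabilistic forecast: provision at an upper confidence bound of predicted demand instead of at the static peak reservation.

```python
from dataclasses import dataclass

@dataclass
class Forecast:
    mean_gb: float  # predicted RAM demand
    std_gb: float   # model's uncertainty about that prediction

def shape_allocation(forecast: Forecast, reserved_gb: float, k: float = 2.0) -> float:
    """Allocate mean + k*std of forecast demand, never above the reservation.

    k tunes the trade-off: higher k means fewer contention-driven failures
    but less reclaimed capacity.
    """
    ucb = forecast.mean_gb + k * forecast.std_gb
    return min(max(ucb, 0.0), reserved_gb)

# A tenant reserved 64 GB but is forecast to need ~20 GB (+/- 4 GB):
print(shape_allocation(Forecast(mean_gb=20.0, std_gb=4.0), reserved_gb=64.0))  # 28.0
```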
{"title":"Dynamic Resource Shaping for Compute Clusters","authors":"Francesco Pace, D. Milios, D. Carra, P. Michiardi","doi":"10.1109/BigDataCongress.2019.00019","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2019.00019","url":null,"abstract":"Nowadays, data-centers are largely under-utilized because resource allocation is based on reservation mechanisms which ignore actual resource utilization. Indeed, it is common to reserve resources for peak demand, which may occur only for a small portion of the application life time. As a consequence, cluster resources often go under-utilized. In this work, we propose a mechanism that improves compute cluster utilization and their responsiveness, while preventing application failures due to contention in accessing finite resources such as RAM. Our method monitors resource utilization and employs a data-driven approach to resource demand forecasting, featuring quantification of uncertainty in the predictions. Using demand forecast and its confidence, our mechanism modulates cluster resources assigned to running applications, and reduces the turnaround time by more than one order of magnitude while keeping application failures under control. Thus, tenants enjoy a responsive system and providers benefit from an efficient cluster utilization.","PeriodicalId":335850,"journal":{"name":"2019 IEEE International Congress on Big Data (BigDataCongress)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124599343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context-Aware Enforcement of Privacy Policies in Edge Computing
Pub Date: 2019-07-08 | DOI: 10.1109/BigDataCongress.2019.00014
Clemens Lachner, T. Rausch, S. Dustdar
Privacy is a fundamental concern for systems dealing with sensitive data. The lack of robust solutions for defining and enforcing privacy measures continues to hinder the general acceptance and adoption of these systems. Edge computing has been recognized as a key enabler for privacy-enhanced applications and has opened new opportunities. In this paper, we propose a novel privacy model based on context-aware edge computing. Our model leverages the context of data to make decisions about how those data need to be processed and managed to achieve privacy. Based on a scenario from the eHealth domain, we show how our generalized model can be used to implement and enact complex domain-specific privacy policies. We illustrate our approach by constructing real-world use cases involving a mobile Electronic Health Record that interacts with, and in, different environments.
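The paper's policy language is not given in the abstract; this sketch, with an invented policy vocabulary, shows the shape of the idea: a policy is a function over the data's context (sensitivity, location, consent) that decides where and how a record may be processed.

```python
from typing import Callable

Context = dict  # e.g. {"sensitivity": "high", "location": "hospital", ...}

def ehealth_policy(ctx: Context) -> str:
    """Return a processing decision for a mobile EHR record (hypothetical rules)."""
    if ctx.get("sensitivity") == "high" and ctx.get("location") != "hospital":
        return "process-on-edge-only"  # the record never leaves the local edge node
    if ctx.get("consent") is not True:
        return "deny"
    return "allow-cloud-processing"

def enforce(ctx: Context, policy: Callable[[Context], str]) -> str:
    # An edge node would route, anonymize, or reject the record per the decision.
    return policy(ctx)

print(enforce({"sensitivity": "high", "location": "ambulance", "consent": True},
              ehealth_policy))  # -> process-on-edge-only
```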
{"title":"Context-Aware Enforcement of Privacy Policies in Edge Computing","authors":"Clemens Lachner, T. Rausch, S. Dustdar","doi":"10.1109/BigDataCongress.2019.00014","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2019.00014","url":null,"abstract":"Privacy is a fundamental concern that confronts systems dealing with sensitive data. The lack of robust solutions for defining and enforcing privacy measures continues to hinder the general acceptance and adoption of these systems. Edge computing has been recognized as a key enabler for privacy enhanced applications, and has opened new opportunities. In this paper, we propose a novel privacy model based on context-aware edge computing. Our model leverages the context of data to make decisions about how these data need to be processed and managed to achieve privacy. Based on a scenario from the eHealth domain, we show how our generalized model can be used to implement and enact complex domain-specific privacy policies. We illustrate our approach by constructing real world use cases involving a mobile Electronic Health Record that interacts with, and in different environments.","PeriodicalId":335850,"journal":{"name":"2019 IEEE International Congress on Big Data (BigDataCongress)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114503263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PREMISES, a Scalable Data-Driven Service to Predict Alarms in Slowly-Degrading Multi-Cycle Industrial Processes
Pub Date: 2019-07-08 | DOI: 10.1109/BigDataCongress.2019.00032
Stefano Proto, F. Ventura, D. Apiletti, T. Cerquitelli, Elena Baralis, E. Macii, A. Macii
In recent years, the number of Industry 4.0-enabled manufacturing sites has been growing continuously, and both the quantity and variety of signals and data collected in plants are increasing at an unprecedented rate. At the same time, the demand for Big Data processing platforms and analytical tools tailored to manufacturing environments has become more and more prominent. Manufacturing companies collect huge amounts of information during the production process through a plethora of sensors and networks. To extract value and actionable knowledge from such precious repositories, suitable data-driven approaches are required. They are expected to improve production processes by reducing maintenance costs, reliably predicting equipment failures, and avoiding quality degradation. To this aim, Machine Learning techniques tailored for predictive-maintenance analysis have been adopted in PREMISES (PREdictive Maintenance service for Industrial procesSES), an innovative framework providing a scalable Big Data service able to predict alarming conditions in slowly-degrading processes characterized by cyclic procedures. PREMISES has been experimentally tested and validated on a real industrial use case, proving efficient and effective in predicting alarms. The framework addresses the main Big Data and industrial requirements: it is built on Apache Spark, a solid and scalable processing framework, and supports deployment in modularized containers, specifically on the Docker technology stack.
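The abstract names Apache Spark but not the model, so this hedged sketch shows one plausible shape of such a service: per-cycle features in a Spark DataFrame, a binary "alarm in the next cycle" label, and a Spark ML classifier. All column names and values are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("premises-sketch").getOrCreate()
# One row per production cycle, with a label marking an upcoming alarm.
df = spark.createDataFrame(
    [(0.91, 35.2, 1203.0, 0), (0.74, 41.8, 1190.5, 1), (0.88, 36.0, 1201.2, 0)],
    ["efficiency", "temperature", "pressure", "alarm_next_cycle"],
)
assembler = VectorAssembler(
    inputCols=["efficiency", "temperature", "pressure"], outputCol="features")
clf = RandomForestClassifier(labelCol="alarm_next_cycle", featuresCol="features")
model = Pipeline(stages=[assembler, clf]).fit(df)
model.transform(df).select("alarm_next_cycle", "prediction").show()
spark.stop()
```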
{"title":"PREMISES, a Scalable Data-Driven Service to Predict Alarms in Slowly-Degrading Multi-Cycle Industrial Processes","authors":"Stefano Proto, F. Ventura, D. Apiletti, T. Cerquitelli, Elena Baralis, E. Macii, A. Macii","doi":"10.1109/BigDataCongress.2019.00032","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2019.00032","url":null,"abstract":"In recent years, the number of industry-4.0-enabled manufacturing sites has been continuously growing, and both the quantity and variety of signals and data collected in plants are increasing at an unprecedented rate. At the same time, the demand of Big Data processing platforms and analytical tools tailored to manufacturing environments has become more and more prominent. Manufacturing companies are collecting huge amounts of information during the production process through a plethora of sensors and networks. To extract value and actionable knowledge from such precious repositories, suitable data-driven approaches are required. They are expected to improve the production processes by reducing maintenance costs, reliably predicting equipment failures, and avoiding quality degradation. To this aim, Machine Learning techniques tailored for predictive maintenance analysis have been adopted in PREMISES (PREdictive Maintenance service for Industrial procesSES), an innovative framework providing a scalable Big Data service able to predict alarming conditions in slowly-degrading processes characterized by cyclic procedures. PREMISES has been experimentally tested and validated on a real industrial use case, resulting efficient and effective in predicting alarms. The framework has been designed to address the main Big Data and industrial requirements, by being developed on a solid and scalable processing framework, Apache Spark, and supporting the deployment on modularized containers, specifically upon the Docker technology stack.","PeriodicalId":335850,"journal":{"name":"2019 IEEE International Congress on Big Data (BigDataCongress)","volume":"9 24","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114085606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobility Prediction with Missing Locations Based on Modified Markov Model for Wireless Users
Pub Date: 2019-07-08 | DOI: 10.1109/BigDataCongress.2019.00031
Junyao Guo, Lu Liu, Sihai Zhang, Jinkang Zhu
Mobility prediction is a topic that has attracted many researchers, and both prediction theory and models have been explored in the existing literature. The entropy metric used to evaluate the mobility predictability of individuals gives theoretical upper and lower bounds on the prediction probability, although the achieved accuracies of users with the same predictability vary. In this work, we investigate the missing-locations phenomenon, in which users visit locations in the testing set that never appear in the training set. We find a major difference in the theoretical bounds between users with and without missing locations, which shows that users without missing locations are easier to predict. After discussing the impact of missing locations on prediction accuracy, we propose a modified Markov chain prediction model to deal with the presence of missing locations. Finally, the correlation between accuracy and predictability can be modeled as a Gaussian distribution; its standard deviation can be modeled as a double Gaussian function with missing locations, and as a third-order polynomial function without them.
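The paper's exact modification is not given in the abstract; this sketch shows the underlying idea: a first-order Markov predictor over visited locations, with a fallback rule for states never seen in training (the "missing locations"). The fallback to the global mode is an assumption for illustration.

```python
from collections import Counter, defaultdict

def train(history):
    transitions = defaultdict(Counter)  # transitions[a][b] = count of a -> b
    for prev, nxt in zip(history, history[1:]):
        transitions[prev][nxt] += 1
    global_freq = Counter(history)      # fallback distribution over all locations
    return transitions, global_freq

def predict(current, transitions, global_freq):
    if current in transitions:          # known state: most likely successor
        return transitions[current].most_common(1)[0][0]
    return global_freq.most_common(1)[0][0]  # missing location: global mode

trace = ["home", "work", "gym", "home", "work", "home", "work"]
T, G = train(trace)
print(predict("work", T, G))  # learned transition out of "work"
print(predict("cafe", T, G))  # "cafe" never seen in training: falls back to "home"
```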
{"title":"Mobility Prediction with Missing Locations Based on Modified Markov Model for Wireless Users","authors":"Junyao Guo, Lu Liu, Sihai Zhang, Jinkang Zhu","doi":"10.1109/BigDataCongress.2019.00031","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2019.00031","url":null,"abstract":"Mobility prediction is an interesting topic attracting many researchers and both prediction theory and models are explored in the existing literature. The entropy metric to evaluate the mobility predictability of individuals gives a theoretical upper bound and lower bound of prediction probability, although the achieved accuracies of users with the same predictability vary. In this work, we investigate the missing locations phenomenon which means the users visit new locations in the testing set. The major difference of theoretical bound between with and without missing locations are found, which shows that users without missing locations are easier to predict. After discussing the impact of missing locations on the prediction accuracy, a modified Markov chain prediction model is proposed to deal with the presence of missing positions. Finally, the correlation between accuracy and predictability can be modeled as the Gaussian distribution and the standard deviation modeled with missing locations can be modeled as double Gaussian function, while that without missing locations can be modeled as the third-order polynomial function.","PeriodicalId":335850,"journal":{"name":"2019 IEEE International Congress on Big Data (BigDataCongress)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126232950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed, Numerically Stable Distance and Covariance Computation with MPI for Extremely Large Datasets
Pub Date: 2019-07-08 | DOI: 10.1109/BigDataCongress.2019.00023
Daniel Peralta, Y. Saeys
The current explosion of data, which is impacting many different areas, is especially noticeable in biomedical research, thanks to the development of new technologies able to capture high-dimensional and high-resolution data at the single-cell scale. Processing such data in an interpretable way often requires the computation of pairwise dissimilarity measures between the multiple features of the data, a task that can be very difficult to tackle when the dataset is large enough, and which is prone to numerical instability. In this paper we propose a distributed framework to efficiently compute dissimilarity matrices on arbitrarily large datasets in a numerically robust way. It implements a combination of the pairwise and two-pass algorithms for computing the variance, in order to maintain the numerical robustness of the former while reducing its overhead. The proposal is parallelizable both across multiple computers and across multiple cores, maximizing performance while maintaining the benefits of memory locality. The proposal is tested on a real use case: a dataset generated from high-content screening images, composed of a billion individual cells and 786 features. The results show linear scalability with respect to the size of the dataset and close-to-linear speedup.
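A minimal sketch of the combination the abstract describes (the paper's MPI code is not shown here): each worker summarizes its chunk with a numerically robust two-pass computation of (count, mean, sum of squared deviations), and summaries are merged associatively with the pairwise update of Chan et al., so the reduction can run across ranks in any order.

```python
from functools import reduce
import numpy as np

def summarize(chunk: np.ndarray):
    """Local two-pass summary of one worker's chunk: (count, mean, M2)."""
    mean = chunk.mean()
    return len(chunk), mean, ((chunk - mean) ** 2).sum()

def merge(a, b):
    """Chan et al. pairwise combination of two summaries; associative."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

# Ill-conditioned data (large mean, small variance) breaks the naive one-pass formula.
data = np.random.default_rng(0).normal(1e6, 1.0, size=100_000)
chunks = np.array_split(data, 8)             # e.g. one chunk per MPI rank
n, mean, m2 = reduce(merge, map(summarize, chunks))
print(mean, m2 / (n - 1))                    # agrees with np.var(data, ddof=1)
```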
{"title":"Distributed, Numerically Stable Distance and Covariance Computation with MPI for Extremely Large Datasets","authors":"Daniel Peralta, Y. Saeys","doi":"10.1109/BigDataCongress.2019.00023","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2019.00023","url":null,"abstract":"The current explosion of data, which is impacting many different areas, is especially noticeable in biomedical research thanks to the development of new technologies that are able to capture high-dimensional and high-resolution data at the single-cell scale. Processing such data in an interpretable way often requires the computation of pairwise dissimilarity measures between the multiple features of the data, a task that can be very difficult to tackle when the dataset is large enough, and which is prone to numerical instability. In this paper we propose a distributed framework to efficiently compute dissimilarity matrices in arbitrarily large datasets in a numerically robust way. It implements a combination of the pairwise and two-pass algorithms for computing the variance, in order to maintain the numerical robustness of the former while reducing its overhead. The proposal is parallelizable both across multiple computers and multiple cores, maximizing the performance while maintaining the benefits of memory locality. The proposal is tested on a real use case: a dataset generated from high-content screening images composed by a billion individual cells and 786 features. The results showed linear scalability with respect to the size of the dataset and close to linear speedup.","PeriodicalId":335850,"journal":{"name":"2019 IEEE International Congress on Big Data (BigDataCongress)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123923737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DLBench: An Experimental Evaluation of Deep Learning Frameworks
Pub Date: 2019-07-08 | DOI: 10.1109/BigDataCongress.2019.00034
Nesma Mahmoud, Youssef Essam, Radwa El Shawi, S. Sakr
Recently, deep learning has become one of the most disruptive trends in the technology world. Deep learning techniques are increasingly achieving significant results in different domains such as speech recognition, image recognition, and natural language processing. There are various reasons behind the increasing popularity of deep learning techniques, including increasing data availability and the increasing availability of powerful hardware, computing resources, and deep learning frameworks. In practice, the growing popularity of deep learning frameworks calls for benchmarking studies that can effectively evaluate their performance characteristics. In this paper, we present an extensive experimental study of six popular deep learning frameworks: TensorFlow, MXNet, PyTorch, Theano, Chainer, and Keras. Our experimental evaluation covers different aspects of the comparison, including accuracy, speed, and resource consumption. Our experiments have been conducted in both CPU and GPU environments using different datasets. We report and analyze the performance characteristics of the studied frameworks, along with a set of insights and important lessons learned from conducting our experiments.
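This is not DLBench's code (which the abstract does not show), but a minimal harness in the same spirit: equivalent training steps from each framework are wrapped behind a common zero-argument callable so that per-step timings are comparable, with warm-up iterations excluded to discard one-time setup and JIT costs.

```python
import time

def benchmark(name, train_step, n_warmup=3, n_steps=20):
    """Time a framework-specific training step behind a uniform interface."""
    for _ in range(n_warmup):       # exclude one-time setup/compilation costs
        train_step()
    start = time.perf_counter()
    for _ in range(n_steps):
        train_step()
    elapsed = time.perf_counter() - start
    print(f"{name}: {1000 * elapsed / n_steps:.2f} ms/step")

# Each framework under test would be wrapped as a step, e.g. (hypothetical):
# benchmark("tensorflow", lambda: tf_model.train_on_batch(x, y))
# benchmark("pytorch",    lambda: torch_train_step(model, x, y))
benchmark("dummy", lambda: sum(i * i for i in range(100_000)))  # runnable stand-in
```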
{"title":"DLBench: An Experimental Evaluation of Deep Learning Frameworks","authors":"Nesma Mahmoud, Youssef Essam, Radwa El Shawi, S. Sakr","doi":"10.1109/BigDataCongress.2019.00034","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2019.00034","url":null,"abstract":"Recently, deep learning has become one of the most disruptive trends in the technology world. Deep learning techniques are increasingly achieving significant results in different domains such as speech recognition, image recognition and natural language processing. In general, there are various reasons behind the increasing popularity of deep learning techniques. These reasons include increasing data availability, the increasing availability of powerful hardware and computing resources in addition to the increasing availability of deep learning frameworks. In practice, the increasing popularity of deep learning frameworks calls for benchmarking studies that can effectively evaluate the performance characteristics of these systems. In this paper, we present an extensive experimental study of six popular deep learning frameworks, namely TensorFlow, MXNet, PyTorch, Theano, Chainer, and Keras. Our experimental evaluation covers different aspects for its comparison including accuracy, speed and resource consumption. Our experiments have been conducted on both CPU and GPU environments and using different datasets. We report and analyze the performance characteristics of the studied frameworks. In addition, we report a set of insights and important lessons that we have learned from conducting our experiments.","PeriodicalId":335850,"journal":{"name":"2019 IEEE International Congress on Big Data (BigDataCongress)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128454432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Re-Computation of Big Data Analytics Processes in the Presence of Changes: Computational Framework, Reference Architecture, and Applications
Pub Date: 2019-07-08 | DOI: 10.1109/BigDataCongress.2019.00017
P. Missier, J. Cala
Insights generated from Big Data through analytics processes are often unstable over time and thus lose their value, as the analysis typically depends on elements that change and evolve dynamically. However, the cost of having to periodically "redo" computationally expensive data analytics is not normally taken into account when assessing the benefits of the outcomes. The ReComp project addresses the problem of efficiently re-computing, all or in part, outcomes from complex analytical processes in response to some of the changes that occur to process dependencies. While such dependencies may include application and system libraries, as well as the deployment environment, ReComp is focused exclusively on changes to reference datasets as well as to the original inputs. Our hypothesis is that an efficient re-computation strategy requires the ability to (i) observe and quantify data changes, (ii) estimate the impact of those changes on a population of prior outcomes, (iii) identify the minimal process fragments that can restore the currency of the impacted outcomes, and (iv) selectively drive their refresh. In this paper we present a generic framework that addresses these requirements, and show how it can be customised to operate on two case studies of very diverse domains, namely genomics and geosciences. We discuss lessons learnt and outline the next steps towards the ReComp vision.
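As an illustrative sketch of requirements (i)-(iv), with a deliberately minimal invented interface (ReComp's actual architecture is far richer): fingerprint the reference datasets to observe changes, keep only the outcomes that depend on what changed, and refresh just those.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Reference datasets and prior outcomes, keyed by invented names.
datasets = {"reference.vcf": b"v1 variants", "inputs.csv": b"patient inputs"}
old_fps = {name: fingerprint(blob) for name, blob in datasets.items()}
outcomes = {"patient-42": ["reference.vcf"], "patient-7": ["inputs.csv"]}

def refresh(outcome_id):
    print("re-computing", outcome_id)  # would re-run only the minimal fragment

datasets["reference.vcf"] = b"v2 variants"  # a reference dataset changes

changed = {n for n, blob in datasets.items()
           if fingerprint(blob) != old_fps[n]}                         # (i) observe
impacted = [oid for oid, deps in outcomes.items() if changed & set(deps)]  # (ii) impact
for oid in impacted:                    # (iii)+(iv): selective, minimal refresh
    refresh(oid)                        # prints: re-computing patient-42
```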
{"title":"Efficient Re-Computation of Big Data Analytics Processes in the Presence of Changes: Computational Framework, Reference Architecture, and Applications","authors":"P. Missier, J. Cala","doi":"10.1109/BigDataCongress.2019.00017","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2019.00017","url":null,"abstract":"Insights generated from Big Data through analytics processes are often unstable over time and thus lose their value, as the analysis typically depends on elements that change and evolve dynamically. However, the cost of having to periodically \"redo\" computationally expensive data analytics is not normally taken into account when assessing the benefits of the outcomes. The ReComp project addresses the problem of efficiently re-computing, all or in part, outcomes from complex analytical processes in response to some of the changes that occur to process dependencies. While such dependencies may include application and system libraries, as well as the deployment environment, ReComp is focused exclusively on changes to reference datasets as well as to the original inputs. Our hypothesis is that an efficient re-computation strategy requires the ability to (i) observe and quantify data changes, (ii) estimate the impact of those changes on a population of prior outcomes, (iii) identify the minimal process fragments that can restore the currency of the impacted outcomes, and (iv) selectively drive their refresh. In this paper we present a generic framework that addresses these requirements, and show how it can be customised to operate on two case studies of very diverse domains, namely genomics and geosciences. We discuss lessons learnt and outline the next steps towards the ReComp vision.","PeriodicalId":335850,"journal":{"name":"2019 IEEE International Congress on Big Data (BigDataCongress)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115972244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}