
2019 IEEE International Congress on Big Data (BigDataCongress): Latest Publications

HyperSpark: A Data-Intensive Programming Environment for Parallel Metaheuristics
Pub Date : 2019-07-08 DOI: 10.1109/BigDataCongress.2019.00024
M. Ciavotta, S. Krstic, D. Tamburri, W. Heuvel
Metaheuristics are search procedures used to solve complex, often intractable problems for which other approaches are unsuitable or unable to provide solutions in reasonable times. Although computing power has grown exponentially with the onset of Cloud Computing and Big Data platforms, the domain of metaheuristics has not yet taken full advantage of this new potential. In this paper, we address this gap by proposing HyperSpark, an optimization framework for the scalable execution of user-defined, computationally-intensive heuristics. We designed HyperSpark as a flexible tool meant to harness the benefits (e.g., scalability by design) and features (e.g., a simple programming model or ad-hoc infrastructure tuning) of state-of-the-art big data technology for the benefit of optimization methods. We elaborate on HyperSpark and assess its validity and generality on a library implementing several metaheuristics for the Permutation Flow-Shop Problem (PFSP). We observe that HyperSpark results are comparable with the best tools and solutions from the literature. We conclude that our proof-of-concept shows great potential for further research and practical use.
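As a rough illustration of the execution model the abstract describes (independent, user-defined heuristics run in parallel, here over a toy Permutation Flow-Shop instance), the sketch below runs seeded local searches concurrently and keeps the best makespan. This is a hypothetical stand-in, not HyperSpark's actual API: plain Python threads replace Spark, and all names are invented.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def makespan(perm, jobs):
    """PFSP objective: completion time of the last job on the last machine.
    jobs[j][k] is the processing time of job j on machine k."""
    m = len(jobs[0])
    finish = [0] * m  # finish[k]: completion time of the previous job on machine k
    for j in perm:
        for k in range(m):
            prev = finish[k - 1] if k > 0 else 0  # job j must leave machine k-1 first
            finish[k] = max(finish[k], prev) + jobs[j][k]
    return finish[-1]

def local_search(jobs, seed, iters=500):
    """One independently seeded heuristic: random restart + pairwise swaps."""
    rng = random.Random(seed)
    best = list(range(len(jobs)))
    rng.shuffle(best)
    best_cost = makespan(best, jobs)
    for _ in range(iters):
        cand = best[:]
        i, j = rng.sample(range(len(cand)), 2)
        cand[i], cand[j] = cand[j], cand[i]
        cost = makespan(cand, jobs)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best_cost, best

def parallel_search(jobs, n_workers=4, iters=500):
    """Run independent searches in parallel and keep the best result found."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(lambda s: local_search(jobs, s, iters), range(n_workers))
    return min(results)
```

Because the searches share nothing until the final reduction, the same shape maps naturally onto a Spark-style cluster, which is the scalability argument the paper makes.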
Citations: 2
Big Data Analytics and Predictive Modeling Approaches for the Energy Sector
Pub Date : 2019-07-08 DOI: 10.1109/BigDataCongress.2019.00020
Roberto Corizzo, Michelangelo Ceci, D. Malerba
This paper describes recent results achieved in the analysis of geo-distributed sensor data generated in the context of the energy sector. The approaches described have roots in the Big Data Analytics and Predictive Modeling research fields and are based on distributed architectures. They tackle the energy forecasting task for a network of energy production plants, by also taking into consideration the detection and treatment of anomalies in the data. This research is motivated by and consistent with the objectives of research projects funded by the European Commission and by many national governments.
Citations: 3
A New Unsupervised Predictive-Model Self-Assessment Approach That SCALEs
Pub Date : 2019-07-08 DOI: 10.1109/BigDataCongress.2019.00033
F. Ventura, Stefano Proto, D. Apiletti, T. Cerquitelli, S. Panicucci, Elena Baralis, E. Macii, A. Macii
Evaluating the degradation of predictive models over time has always been a difficult task, especially because new, unseen data might not fit the training distribution. This is a well-known problem in real-world use cases, where collecting a historical training set for all possible prediction labels may be very hard, too expensive, or completely infeasible. To solve this issue, we present a new unsupervised approach to detect and evaluate the degradation of classification and prediction models, based on a scalable variant of the Silhouette index, named Descriptor Silhouette, specifically designed to advance current Big Data state-of-the-art solutions. The newly proposed strategy has been tested and validated on both synthetic and real-world industrial use cases. To this aim, it has been included in a framework named SCALE and proved to be efficient and more effective in assessing the degradation of prediction performance than current state-of-the-art solutions.
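The abstract does not specify the Descriptor Silhouette formula, so the following is only a hedged sketch of the general idea: score how well incoming data still fits the training clusters, and flag degradation when the score drops. It uses a simple centroid-based silhouette; the threshold and all names are assumptions, not the paper's definitions.

```python
import math

def centroid_silhouette(points, labels, centroids):
    """Centroid-based silhouette: a = distance to the assigned centroid,
    b = distance to the nearest other centroid; score = (b - a) / max(a, b)."""
    scores = []
    for p, lab in zip(points, labels):
        a = math.dist(p, centroids[lab])
        b = min(math.dist(p, c) for k, c in centroids.items() if k != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def degradation_alert(baseline_score, current_score, tolerance=0.2):
    """Unsupervised self-assessment: alert when the fit of new data to the
    training clusters falls well below the score observed at training time."""
    return current_score < baseline_score - tolerance
```

A score near 1 means new points still sit inside the training clusters; negative scores mean they fit a different cluster better, a hint that the model's training distribution no longer holds.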
Citations: 8
Dynamic Resource Shaping for Compute Clusters
Pub Date : 2019-07-08 DOI: 10.1109/BigDataCongress.2019.00019
Francesco Pace, D. Milios, D. Carra, P. Michiardi
Nowadays, data centers are largely under-utilized because resource allocation is based on reservation mechanisms that ignore actual resource utilization. Indeed, it is common to reserve resources for peak demand, which may occur only for a small portion of the application lifetime. As a consequence, cluster resources often go under-utilized. In this work, we propose a mechanism that improves compute cluster utilization and responsiveness, while preventing application failures due to contention in accessing finite resources such as RAM. Our method monitors resource utilization and employs a data-driven approach to resource demand forecasting, featuring quantification of uncertainty in the predictions. Using the demand forecast and its confidence, our mechanism modulates the cluster resources assigned to running applications, and reduces turnaround time by more than one order of magnitude while keeping application failures under control. Thus, tenants enjoy a responsive system and providers benefit from efficient cluster utilization.
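The core idea, allocating a forecast-plus-uncertainty amount instead of a static peak reservation, can be sketched in a few lines. This is a minimal illustration with invented names; the paper's actual forecasting model is more sophisticated than a rolling mean and standard deviation.

```python
import statistics

def forecast_with_uncertainty(history, window=5):
    """Naive demand forecast: rolling mean and spread over the last readings."""
    recent = history[-window:]
    return statistics.fmean(recent), statistics.pstdev(recent)

def shape_allocation(history, reservation, z=2.0, floor=0.0):
    """Allocate mean + z * std (an upper confidence bound on demand) rather
    than the static reservation; never exceed the original reservation."""
    mean, std = forecast_with_uncertainty(history)
    return max(floor, min(reservation, mean + z * std))
```

When demand is stable, the allocation shrinks to roughly the observed usage and the slack is returned to the cluster; when demand is volatile, the uncertainty term keeps headroom and reduces the risk of contention-induced failures.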
Citations: 1
Context-Aware Enforcement of Privacy Policies in Edge Computing
Pub Date : 2019-07-08 DOI: 10.1109/BigDataCongress.2019.00014
Clemens Lachner, T. Rausch, S. Dustdar
Privacy is a fundamental concern that confronts systems dealing with sensitive data. The lack of robust solutions for defining and enforcing privacy measures continues to hinder the general acceptance and adoption of these systems. Edge computing has been recognized as a key enabler for privacy-enhanced applications, and has opened new opportunities. In this paper, we propose a novel privacy model based on context-aware edge computing. Our model leverages the context of data to make decisions about how these data need to be processed and managed to achieve privacy. Based on a scenario from the eHealth domain, we show how our generalized model can be used to implement and enact complex domain-specific privacy policies. We illustrate our approach by constructing real-world use cases involving a mobile Electronic Health Record that interacts with, and in, different environments.
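The notion of context-driven policy enforcement can be made concrete with a tiny rule engine: each policy names the context conditions under which it applies and the action to take, defaulting to the most restrictive action. The rules below are hypothetical examples in the spirit of the paper's eHealth scenario, not the authors' actual policy language.

```python
def evaluate(policies, context):
    """Return the action of the first policy whose conditions all match the
    given context; fall back to the most restrictive action ('deny')."""
    for policy in policies:
        if all(context.get(k) == v for k, v in policy["when"].items()):
            return policy["action"]
    return "deny"

# Hypothetical eHealth rules: full access on-premises, anonymized data
# when the record is accessed from an untrusted edge environment.
POLICIES = [
    {"when": {"role": "physician", "location": "hospital"}, "action": "allow"},
    {"when": {"role": "physician", "location": "remote"}, "action": "anonymize"},
]
```

Because the decision depends only on the context dictionary, the same rule set can be evaluated at whichever edge node currently holds the data.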
Citations: 4
PREMISES, a Scalable Data-Driven Service to Predict Alarms in Slowly-Degrading Multi-Cycle Industrial Processes
Pub Date : 2019-07-08 DOI: 10.1109/BigDataCongress.2019.00032
Stefano Proto, F. Ventura, D. Apiletti, T. Cerquitelli, Elena Baralis, E. Macii, A. Macii
In recent years, the number of industry-4.0-enabled manufacturing sites has been continuously growing, and both the quantity and variety of signals and data collected in plants are increasing at an unprecedented rate. At the same time, the demand for Big Data processing platforms and analytical tools tailored to manufacturing environments has become more and more prominent. Manufacturing companies are collecting huge amounts of information during the production process through a plethora of sensors and networks. To extract value and actionable knowledge from such precious repositories, suitable data-driven approaches are required. They are expected to improve the production processes by reducing maintenance costs, reliably predicting equipment failures, and avoiding quality degradation. To this aim, Machine Learning techniques tailored for predictive maintenance analysis have been adopted in PREMISES (PREdictive Maintenance service for Industrial procesSES), an innovative framework providing a scalable Big Data service able to predict alarming conditions in slowly-degrading processes characterized by cyclic procedures. PREMISES has been experimentally tested and validated on a real industrial use case, proving efficient and effective in predicting alarms. The framework has been designed to address the main Big Data and industrial requirements: it is developed on a solid and scalable processing framework, Apache Spark, and supports deployment in modularized containers, specifically on the Docker technology stack.
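One minimal way to predict an alarm in a slowly-degrading signal, offered purely as a hedged illustration of the problem setting rather than the PREMISES method, is to fit a linear trend over recent cycles and check whether the extrapolation crosses the alarm limit within a given horizon. All names and the fitting choice are assumptions.

```python
def predict_alarm(readings, limit, horizon, window=5):
    """Least-squares linear trend over the last `window` cycle readings;
    alarm if the extrapolated value reaches `limit` within `horizon` cycles."""
    ys = readings[-window:]
    xs = range(len(ys))
    n = len(ys)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    projected = ys[-1] + slope * horizon
    return projected >= limit
```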
Citations: 8
Mobility Prediction with Missing Locations Based on Modified Markov Model for Wireless Users
Pub Date : 2019-07-08 DOI: 10.1109/BigDataCongress.2019.00031
Junyao Guo, Lu Liu, Sihai Zhang, Jinkang Zhu
Mobility prediction is an interesting topic that attracts many researchers, and both prediction theory and models have been explored in the existing literature. The entropy metric used to evaluate the mobility predictability of individuals gives theoretical upper and lower bounds on the prediction probability, although the achieved accuracies of users with the same predictability vary. In this work, we investigate the missing-locations phenomenon, in which users visit new locations in the testing set. A major difference in the theoretical bounds with and without missing locations is found, which shows that users without missing locations are easier to predict. After discussing the impact of missing locations on prediction accuracy, a modified Markov chain prediction model is proposed to deal with the presence of missing locations. Finally, the correlation between accuracy and predictability can be modeled as a Gaussian distribution; the standard deviation modeled with missing locations can be modeled as a double Gaussian function, while that without missing locations can be modeled as a third-order polynomial function.
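A first-order Markov predictor with one possible treatment of missing locations (falling back to the user's globally most frequent location when the current location was never seen in training) can be sketched as follows. The specific fallback is an assumption for illustration; the paper's modification may differ.

```python
from collections import Counter, defaultdict

class FallbackMarkov:
    """First-order Markov mobility predictor with a fallback for locations
    absent from the training trajectory ('missing locations')."""

    def fit(self, trajectory):
        self.trans = defaultdict(Counter)   # current location -> next-location counts
        self.freq = Counter(trajectory)     # global visit frequencies
        for cur, nxt in zip(trajectory, trajectory[1:]):
            self.trans[cur][nxt] += 1
        return self

    def predict(self, current):
        if current in self.trans:
            # Most frequent transition out of the current location.
            return self.trans[current].most_common(1)[0][0]
        # Missing location: fall back to the globally most visited one.
        return self.freq.most_common(1)[0][0]
```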
Citations: 1
Distributed, Numerically Stable Distance and Covariance Computation with MPI for Extremely Large Datasets
Pub Date : 2019-07-08 DOI: 10.1109/BigDataCongress.2019.00023
Daniel Peralta, Y. Saeys
The current explosion of data, which is impacting many different areas, is especially noticeable in biomedical research thanks to the development of new technologies that are able to capture high-dimensional and high-resolution data at the single-cell scale. Processing such data in an interpretable way often requires the computation of pairwise dissimilarity measures between the multiple features of the data, a task that can be very difficult to tackle when the dataset is large enough, and which is prone to numerical instability. In this paper we propose a distributed framework to efficiently compute dissimilarity matrices in arbitrarily large datasets in a numerically robust way. It implements a combination of the pairwise and two-pass algorithms for computing the variance, in order to maintain the numerical robustness of the former while reducing its overhead. The proposal is parallelizable both across multiple computers and across multiple cores, maximizing performance while maintaining the benefits of memory locality. The proposal is tested on a real use case: a dataset generated from high-content screening images, composed of a billion individual cells and 786 features. The results showed linear scalability with respect to the size of the dataset and close-to-linear speedup.
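The combination of two-pass and pairwise variance algorithms mentioned in the abstract can be sketched in a few lines: each chunk is summarized by (count, mean, M2) using a two-pass computation, and partial results are merged with the numerically stable pairwise update. This is a single-machine sketch of the well-known combination formula; the paper distributes the reduction with MPI.

```python
from functools import reduce

def partial_stats(chunk):
    """Two-pass summary of one chunk: (count, mean, sum of squared deviations)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2

def combine(a, b):
    """Pairwise combination of two partial summaries. Numerically robust and
    associative, so it maps directly onto a distributed (MPI-style) reduction."""
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

def variance(chunks):
    """Population variance of the concatenation of all chunks."""
    n, _, m2 = reduce(combine, (partial_stats(c) for c in chunks))
    return m2 / n
```

Because `combine` never subtracts large, nearly equal sums (the failure mode of the naive one-pass formula), precision is preserved even when chunks live on different machines.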
Citations: 2
DLBench: An Experimental Evaluation of Deep Learning Frameworks
Pub Date : 2019-07-08 DOI: 10.1109/BigDataCongress.2019.00034
Nesma Mahmoud, Youssef Essam, Radwa El Shawi, S. Sakr
Recently, deep learning has become one of the most disruptive trends in the technology world. Deep learning techniques are increasingly achieving significant results in different domains such as speech recognition, image recognition and natural language processing. In general, there are various reasons behind the increasing popularity of deep learning techniques. These reasons include increasing data availability and the increasing availability of powerful hardware and computing resources, in addition to the increasing availability of deep learning frameworks. In practice, the growing popularity of deep learning frameworks calls for benchmarking studies that can effectively evaluate the performance characteristics of these systems. In this paper, we present an extensive experimental study of six popular deep learning frameworks, namely TensorFlow, MXNet, PyTorch, Theano, Chainer, and Keras. Our experimental evaluation covers different comparison aspects, including accuracy, speed, and resource consumption. Our experiments have been conducted in both CPU and GPU environments and using different datasets. We report and analyze the performance characteristics of the studied frameworks. In addition, we report a set of insights and important lessons that we have learned from conducting our experiments.
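The shape of such a benchmark, timing the same workload across frameworks and recording resource consumption, can be illustrated with a tiny framework-agnostic harness. This is a hypothetical sketch (the DLBench harness is not described here); `tracemalloc` tracks only Python-heap allocations, a rough stand-in for the full resource profiling a real study would need.

```python
import time
import tracemalloc

def benchmark(name, train_fn, repeats=3):
    """Run `train_fn` several times; report the best wall-clock time and the
    peak Python-heap allocation observed across the runs."""
    times = []
    tracemalloc.start()
    for _ in range(repeats):
        t0 = time.perf_counter()
        train_fn()  # the framework-specific training step goes here
        times.append(time.perf_counter() - t0)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"framework": name, "best_s": min(times), "peak_bytes": peak}
```

Each framework under test supplies its own `train_fn` closure over the same dataset, so the comparison varies only the framework.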
Citations: 9
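The speed and resource-consumption axes of the evaluation above can be approximated with a small framework-agnostic harness: wrap any training or inference callable, discard warm-up runs, and record wall-clock time and peak Python-heap allocation. This is an illustrative, stdlib-only sketch, not code from the DLBench study, which additionally measures accuracy and GPU-side resource usage.

```python
import time
import tracemalloc
from statistics import mean, stdev

def benchmark(fn, *, warmup=1, runs=5):
    """Benchmark a training/inference callable.

    Returns (mean_seconds, stdev_seconds, peak_heap_bytes). Warm-up runs
    are executed but excluded from timing, mirroring common practice in
    framework benchmarks where the first run pays one-off setup costs.
    """
    for _ in range(warmup):
        fn()
    timings, peaks = [], []
    for _ in range(runs):
        tracemalloc.start()
        t0 = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - t0)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) bytes
        tracemalloc.stop()
        peaks.append(peak)
    return mean(timings), (stdev(timings) if runs > 1 else 0.0), max(peaks)
```

For example, `benchmark(lambda: model(batch))` would compare inference latency once each framework's `model` and `batch` (hypothetical names) are constructed; only the callable differs between candidates, so the harness itself stays neutral across frameworks.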
Efficient Re-Computation of Big Data Analytics Processes in the Presence of Changes: Computational Framework, Reference Architecture, and Applications
Pub Date : 2019-07-08 DOI: 10.1109/BigDataCongress.2019.00017
P. Missier, J. Cala
Insights generated from Big Data through analytics processes are often unstable over time and thus lose their value, as the analysis typically depends on elements that change and evolve dynamically. However, the cost of having to periodically "redo" computationally expensive data analytics is not normally taken into account when assessing the benefits of the outcomes. The ReComp project addresses the problem of efficiently re-computing, all or in part, outcomes from complex analytical processes in response to some of the changes that occur to process dependencies. While such dependencies may include application and system libraries, as well as the deployment environment, ReComp is focused exclusively on changes to reference datasets as well as to the original inputs. Our hypothesis is that an efficient re-computation strategy requires the ability to (i) observe and quantify data changes, (ii) estimate the impact of those changes on a population of prior outcomes, (iii) identify the minimal process fragments that can restore the currency of the impacted outcomes, and (iv) selectively drive their refresh. In this paper we present a generic framework that addresses these requirements, and show how it can be customised to operate on two case studies of very diverse domains, namely genomics and geosciences. We discuss lessons learnt and outline the next steps towards the ReComp vision.
{"title":"Efficient Re-Computation of Big Data Analytics Processes in the Presence of Changes: Computational Framework, Reference Architecture, and Applications","authors":"P. Missier, J. Cala","doi":"10.1109/BigDataCongress.2019.00017","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2019.00017","url":null,"abstract":"Insights generated from Big Data through analytics processes are often unstable over time and thus lose their value, as the analysis typically depends on elements that change and evolve dynamically. However, the cost of having to periodically \"redo\" computationally expensive data analytics is not normally taken into account when assessing the benefits of the outcomes. The ReComp project addresses the problem of efficiently re-computing, all or in part, outcomes from complex analytical processes in response to some of the changes that occur to process dependencies. While such dependencies may include application and system libraries, as well as the deployment environment, ReComp is focused exclusively on changes to reference datasets as well as to the original inputs. Our hypothesis is that an efficient re-computation strategy requires the ability to (i) observe and quantify data changes, (ii) estimate the impact of those changes on a population of prior outcomes, (iii) identify the minimal process fragments that can restore the currency of the impacted outcomes, and (iv) selectively drive their refresh. In this paper we present a generic framework that addresses these requirements, and show how it can be customised to operate on two case studies of very diverse domains, namely genomics and geosciences. We discuss lessons learnt and outline the next steps towards the ReComp vision.","PeriodicalId":335850,"journal":{"name":"2019 IEEE International Congress on Big Data (BigDataCongress)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115972244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
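The four abilities in the ReComp hypothesis — (i) observe and quantify data changes, (ii) estimate their impact on prior outcomes, (iii) identify minimal process fragments, (iv) selectively drive their refresh — can be illustrated with a toy digest-based outcome cache. All names below are hypothetical, not ReComp's API; the sketch collapses impact estimation into a per-input digest comparison and treats the whole process as a single fragment, which the real framework does not.

```python
import hashlib

def _digest(data: bytes) -> str:
    """Content digest used to detect changes in an input or reference dataset."""
    return hashlib.sha256(data).hexdigest()

class RecomputeCache:
    """Toy selective re-computation: re-run an outcome only when one of its
    recorded input digests has changed since it was last computed."""

    def __init__(self, process):
        self.process = process   # callable: dict[name -> bytes] -> outcome
        self.outcomes = {}       # outcome key -> (input digests, outcome)

    def compute(self, key, inputs):
        """Return (outcome, recomputed). Steps (i)-(ii) are the digest
        comparison; steps (iii)-(iv) are the conditional re-run."""
        digests = {name: _digest(value) for name, value in inputs.items()}
        cached = self.outcomes.get(key)
        if cached is not None and cached[0] == digests:
            return cached[1], False          # outcome still current, skip refresh
        outcome = self.process(inputs)       # refresh only the impacted outcome
        self.outcomes[key] = (digests, outcome)
        return outcome, True
```

In this sketch a second call with unchanged inputs returns the cached outcome without invoking the process, while a changed reference dataset triggers exactly one re-run for the affected key.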
2019 IEEE International Congress on Big Data (BigDataCongress)