Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00068
Srinivasan Chandrasekharan, C. Gniady
As memory becomes cheaper, its use in computer systems has become more prominent. The growing number of memory modules increases the share of a system's total energy that is consumed by memory. As database systems become more memory-centric and put more pressure on the memory subsystem, managing the energy consumption of main memory is becoming critical. It is therefore important to exploit all memory idle times and the low-power states provided by newer memory architectures by placing memory in low-power modes using application-level cues. While CPU power consumption in database systems has been studied, only limited research has examined the role of memory in database systems with respect to energy management. We propose Query Aware Memory Energy Management (QAMEM), in which the database system uses query information and performance counters to provide application-level cues to the memory controller for switching to lower power states. Our results show that on TPC-H workloads QAMEM saves 25% of total system energy compared to state-of-the-art memory energy management mechanisms.
Title: QAMEM: Query Aware Memory Energy Management
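The abstract stops short of the controller's decision rule. Purely as an illustration (the state names, exit latencies, and power figures below are invented, not QAMEM's), an idle-time cue from the application can be matched against power-state exit latencies like this:

```python
# Illustrative only: pick the deepest memory low-power state whose exit
# latency fits the expected idle window. States, latencies, and power
# numbers are hypothetical, not taken from QAMEM.
STATES = [
    ("active",      2.0,   0.0),   # (name, watts, exit latency in us)
    ("powerdown",   0.5,   5.0),
    ("selfrefresh", 0.1, 200.0),
]

def pick_state(expected_idle_us):
    """Choose the lowest-power state that can wake up within the idle window."""
    best_name, best_power = "active", 2.0
    for name, power, exit_latency_us in STATES:
        if exit_latency_us <= expected_idle_us and power < best_power:
            best_name, best_power = name, power
    return best_name
```

A long predicted idle interval (e.g. between queries) justifies the slow-to-exit self-refresh state, while short gaps only warrant the shallow power-down state.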
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00015
Zhuozhao Li, Haiying Shen, Ankur Sarker
Despite the prevalence of shuffle-heavy jobs in current commercial data-parallel clusters, few previous studies have considered network traffic in the shuffle phase, which involves a large amount of data transfer and may adversely affect cluster performance. In this paper, we propose a network-aware scheduler (NAS) that handles two main challenges associated with the shuffle phase: i) balancing cross-node network load, and ii) avoiding and reducing cross-rack network congestion. NAS consists of three main mechanisms: i) map task scheduling (MTS), ii) congestion-avoidance reduce task scheduling (CA-RTS), and iii) congestion-reduction reduce task scheduling (CR-RTS). MTS constrains the shuffle data on each node when scheduling map tasks to balance cross-node network load. CA-RTS distributes the reduce tasks of each job based on the distribution of its shuffle data among the racks in order to minimize cross-rack traffic. When the network is congested, CR-RTS schedules reduce tasks that generate negligible shuffle traffic to relieve the congestion. We implemented NAS in Hadoop on a cluster. Our trace-driven simulation and real-cluster experiments demonstrate the superior performance of NAS, improving throughput by up to 62%, reducing average job execution time by up to 44%, and reducing cross-rack traffic by up to 40% compared with state-of-the-art schedulers.
Title: A Network-Aware Scheduler in Data-Parallel Clusters for High Performance
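As a minimal sketch of the CA-RTS idea, distributing a job's reduce tasks in proportion to the rack-level distribution of its shuffle data, the following uses largest-remainder rounding; that rounding rule is an assumption of this sketch, not a detail given in the abstract:

```python
def place_reduce_tasks(shuffle_bytes_per_rack, num_reduce_tasks):
    """Distribute a job's reduce tasks across racks in proportion to the
    shuffle data each rack holds, so most shuffle traffic stays rack-local.
    Illustrative sketch of congestion-avoidance reduce task placement."""
    total = sum(shuffle_bytes_per_rack.values())
    # Ideal fractional share per rack, then round via largest remainders.
    shares = {r: num_reduce_tasks * b / total
              for r, b in shuffle_bytes_per_rack.items()}
    alloc = {r: int(s) for r, s in shares.items()}
    leftover = num_reduce_tasks - sum(alloc.values())
    for r in sorted(shares, key=lambda r: shares[r] - alloc[r],
                    reverse=True)[:leftover]:
        alloc[r] += 1
    return alloc
```

A rack holding 60% of a job's shuffle output would receive roughly 60% of its reduce tasks, so that data crosses the rack boundary only for the remainder.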
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00046
Myeonggyun Han, Seongdae Yu, Woongki Baek
With server consolidation, latency-critical and batch workloads are collocated on the same physical servers. The resource manager dynamically allocates hardware resources to the workloads to maximize overall throughput while providing service-level objective (SLO) guarantees for the latency-critical workloads. Because hardware resources are dynamically reallocated across workloads on the same physical server, information leakage channels can be established, making the workloads vulnerable to micro-architectural side-channel attacks. Despite extensive prior work, the efficient design and implementation of a dynamic resource management system that maximizes resource efficiency without compromising SLO and security guarantees remains unexplored. To bridge this gap, this work proposes SDCP, secure and dynamic core and cache partitioning for safe and efficient server consolidation. In line with state-of-the-art dynamic server consolidation techniques, SDCP dynamically allocates hardware resources (i.e., cores and caches) to maximize resource utilization under SLO guarantees. In contrast to existing techniques, however, SDCP dynamically sanitizes the hardware resources to ensure that no micro-architectural side channel is established between different security domains. Our experimental results demonstrate that SDCP provides high resource-sanitization quality, incurs small performance overheads, and achieves high resource efficiency with SLO and security guarantees.
Title: Secure and Dynamic Core and Cache Partitioning for Safe and Efficient Server Consolidation
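The key invariant, never handing a core from one security domain to another without scrubbing its micro-architectural state, can be sketched as follows. The class, its fields, and the software "sanitize" stand-in for cache/predictor flushes are all illustrative, not SDCP's implementation:

```python
# Hypothetical sketch of SDCP-style sanitization on core reassignment.
class CorePartitioner:
    def __init__(self, num_cores):
        self.owner = {c: None for c in range(num_cores)}      # core -> domain
        self.sanitized = {c: True for c in range(num_cores)}  # clean state?
        self.sanitize_log = []                                # cores scrubbed

    def _sanitize(self, core):
        # Stand-in for flushing per-core caches and predictor state.
        self.sanitize_log.append(core)
        self.sanitized[core] = True

    def assign(self, core, domain):
        prev = self.owner[core]
        if prev is not None and prev != domain and not self.sanitized[core]:
            self._sanitize(core)    # never hand dirty state across domains
        self.owner[core] = domain
        self.sanitized[core] = False  # the new owner leaves its own state
```

Reassignments within the same domain skip the scrub, which is where the "small performance overheads" claim would come from in a policy of this shape.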
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00-33
J. Al-Jaroodi, N. Mohamed
The concept of a cache has proven beneficial in many domains. This paper introduces the concept of a distributed cloud cache, which uses small caches available on servers or fog nodes across the Internet to reduce the load on data centers. Dual-direction parallel transfer is used to improve download times of newly released software, games, and movies from the release point to clients.
Title: Distributed Cloud Cache
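The abstract leaves the transfer mechanics open. One hedged reading of "dual-direction parallel transfer" is two sources streaming the same file from opposite ends so that they meet in the middle; the speed-proportional split rule below is my assumption, not the paper's protocol:

```python
def split_dual_direction(file_size, speed_front, speed_back):
    """Split one file between two sources: the first streams bytes from the
    start, the second from the end, sized so both finish at the same time.
    Purely an illustrative model of dual-direction parallel transfer."""
    front = round(file_size * speed_front / (speed_front + speed_back))
    return (0, front), (front, file_size)   # half-open byte ranges
```

With equal-speed sources each serves half the file; a source twice as fast serves two thirds, so neither side idles while the other finishes.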
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00040
Brandon Posey, Christopher Gropp, Boyd Wilson, Boyd McGeachie, S. Padhi, Alexander Herzog, A. Apon
A major limitation on time-to-science can be the lack of available computing resources. Depending on resource capacity, executing an application suite with hundreds of thousands of jobs can take weeks when resources are in high demand. We describe how we dynamically provision a large-scale high-performance computing cluster of more than one million cores using Amazon Web Services (AWS). We discuss the trade-offs, challenges, and solutions associated with creating such a large-scale cluster from commercial cloud resources. We use this cluster to study a parameter-sweep workflow composed of message-passing parallel topic-modeling jobs on multiple datasets. At peak, we achieve a simultaneous core count of 1,119,196 vCPUs across nearly 50,000 instances, and are able to execute almost half a million jobs within two hours using AWS Spot Instances in a single AWS region. Our solutions to these challenges and trade-offs apply broadly to the lifecycle management of similar clusters on other commercial clouds.
Title: Addressing the Challenges of Executing a Massive Computational Cluster in the Cloud
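At this scale no single instance pool can supply the cores, so capacity must be aggregated across many pools (instance type x availability zone). As a hedged sketch only (the pool names, prices, and capacity figures are invented, and this is neither the paper's algorithm nor the AWS API), a vCPU target might be spread cheapest-first:

```python
def plan_spot_fleet(target_vcpus, pools):
    """Spread a vCPU target across spot pools, cheapest $/vCPU first,
    honoring each pool's estimated capacity.
    pools: name -> (vcpus_per_instance, hourly_price, max_instances)."""
    plan, remaining = {}, target_vcpus
    for name, (vcpus, price, cap) in sorted(
            pools.items(), key=lambda kv: kv[1][1] / kv[1][0]):
        if remaining <= 0:
            break
        n = min(cap, -(-remaining // vcpus))   # ceiling division
        plan[name] = n
        remaining -= n * vcpus
    return plan, max(remaining, 0)             # plan + unmet vCPUs
```

A real provisioner would also react to spot interruptions and price changes; this only shows why spreading across pools is what makes a million-core peak reachable.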
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00052
J. Al-Jaroodi, N. Mohamed
A smart city has recently become an aspiration for many cities around the world. These cities are looking to apply the smart city concept to improve sustainability, quality of life for residents, and economic development. The smart city concept depends on employing a wide range of advanced technologies to improve the performance of services and activities such as transportation, energy, healthcare, and education, while at the same time improving the city's resource utilization and creating new business opportunities. One promising technology for supporting such efforts is big data. Effective and intelligent use of the big data accumulated over time in various sectors can offer many advantages for decision making in smart cities. In this paper we identify the different types of decision-making processes involved in smart cities. We then propose a service-oriented architecture to support big data analytics for decision making in smart cities. This architecture allows different technologies, such as fog and cloud computing, to be integrated to support the types of analytics and decision-making operations needed to utilize available big data effectively. It provides functions and capabilities for using big data and offers smart capabilities as services.
As a result, different big data applications will be able to access and use these services for varying purposes within the smart city.

Title: Service-Oriented Architecture for Big Data Analytics in Smart Cities
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00026
Houjun Tang, S. Byna, François Tessier, Teng Wang, Bin Dong, Jingqing Mu, Q. Koziol, Jérome Soumagne, V. Vishwanath, Jialin Liu, R. Warren
Emerging high performance computing (HPC) systems are expected to be deployed with an unprecedented level of complexity due to a deep system memory and storage hierarchy. Efficient and scalable methods for managing and moving data through this hierarchy are critical for scientific applications on exascale systems. Moving toward new paradigms for scalable I/O in the extreme-scale era, we introduce novel object-centric data abstractions and storage mechanisms, named Proactive Data Containers (PDC), that take advantage of the deep storage hierarchy. In this paper, we formulate object-centric PDCs and their mappings onto different levels of the storage hierarchy. PDC adopts a client-server architecture in which a set of servers manages data movement across storage layers. To demonstrate the effectiveness of the proposed PDC system, we measured the performance of benchmarks and I/O kernels from scientific simulation and analysis applications using the PDC programming interface, and compared the results with existing, highly tuned I/O libraries. Using asynchronous I/O along with data and metadata optimizations, PDC demonstrates up to 23× speedup over HDF5 and PLFS when writing and reading data from a plasma physics simulation. PDC achieves performance comparable to HDF5 and PLFS when reading and writing the data of a single timestep at small scale, and outperforms them at scales larger than 10K cores. In contrast to existing storage systems, PDC offers user-space data management with the flexibility to allocate the number of PDC servers depending on the workload.
Title: Toward Scalable and Asynchronous Object-Centric Data Management for HPC
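The abstract does not spell out how objects map onto hierarchy levels. As an illustration only (the tier names and thresholds below are invented, not PDC's policy), an object-centric placement decision might weigh access frequency against object size:

```python
# Illustrative tiering policy: small, hot objects sit high in a deep
# storage hierarchy; large, cold objects sink toward archival storage.
TIERS = ["dram", "burst_buffer", "parallel_fs", "archive"]

def place_object(size_bytes, accesses_per_hour):
    """Pick a storage tier for a data object (hypothetical thresholds)."""
    if accesses_per_hour > 100 and size_bytes < 1 << 30:  # hot and < 1 GiB
        return "dram"
    if accesses_per_hour > 10:
        return "burst_buffer"
    if accesses_per_hour > 0.1:
        return "parallel_fs"
    return "archive"
```

In a PDC-like design this decision would sit in the user-space servers, which can also migrate objects asynchronously as their access patterns change.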
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00098
Spiros Koulouzis, Rahaf Mousa, Andreas Karakannas, C. D. Laat, Zhiming Zhao
Persistent identifiers (PIDs) such as Digital Object Identifiers (DOIs) provide a unique and persistent way to identify and cite digital objects such as publications, media content, and research data. They are widely used by data producers to catalogue and publish digital assets and research data. Nowadays, research infrastructures (RIs) offer services not only for accessing and publishing data objects, but also for processing data on user demand, e.g., via scientific workflows or third-party virtual research environments. However, efficiently retrieving and sharing digital objects in a shared data processing environment requires knowledge of application access patterns as well as the underlying network-level distribution. As the number and size of data objects grow, optimizing data discovery and access among distributed partners on shared infrastructure becomes an important challenge for infrastructure operators seeking to maintain quality of service and user experience.
In this paper, we propose a novel approach that utilizes Information Centric Networking (ICN) to retrieve content based on PIDs while optimizing data access on shared infrastructure.

Title: Information Centric Networking for Sharing and Accessing Digital Objects with Persistent Identifiers on Data Infrastructures
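A hedged sketch of the core idea: instead of a PID resolving to one fixed URL, an ICN-style resolver picks whichever registered replica is cheapest to reach. The registry contents and RTT figures below are invented for illustration:

```python
def resolve_pid(pid, registry, rtt_ms):
    """Resolve a persistent identifier to the replica with the lowest
    measured round-trip time, falling back to any registered copy.
    registry: pid -> list of replica hosts; rtt_ms: host -> latency."""
    replicas = registry.get(pid, [])
    if not replicas:
        raise KeyError(f"unknown PID: {pid}")
    return min(replicas, key=lambda host: rtt_ms.get(host, float("inf")))
```

The identifier stays stable even as replicas appear, move, or disappear; only the registry and the network-cost estimates change.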
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00100
A. Cuzzocrea, E. Damiani
This paper introduces a general framework for supporting data-driven, privacy-preserving big data management in distributed environments, such as emerging Cloud settings. The proposed framework is an alternative to classical approaches in which the privacy of big data is ensured via security-inspired protocols that check several (protocol) layers to achieve the desired privacy; unfortunately, this injects considerable computational overhead into the overall process, which raises its own challenges. Our approach instead recognizes the "pedigree" of suitable summary data representatives computed on top of the target big data repositories, thereby avoiding the computational overhead of protocol checking. We also provide a concrete realization of this framework, the Data-dRIven aggregate-PROvenance privacy-preserving big Multidimensional data (DRIPROM) framework, which specifically targets multidimensional data.
Title: Pedigree-ing Your Big Data: Data-Driven Big Data Privacy in Distributed Environments
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00021
Shigeru Imai, S. Patterson, Carlos A. Varela
Stream processing systems deployed in the cloud need to be elastic to accommodate workload variations over time. Performance models can predict the maximum sustainable throughput (MST) as a function of the number of VMs allocated. We present a scheduling framework that incorporates three statistical techniques to improve the Quality of Service (QoS) of cloud stream processing systems: (i) uncertainty quantification, to account for variance in the MST model; (ii) online learning, to update the MST model as new performance metrics are gathered; and (iii) workload models, to predict input data stream rates under the assumption that regular patterns recur over time. Our framework can be parameterized by a QoS satisfaction target that statistically determines the best performance/cost tradeoff. Our results show that each of the three techniques alone significantly improves QoS, from 52% to 73-81% QoS satisfaction rates on average across eight benchmark applications.
Furthermore, applying all three techniques allows us to reach a 98.62% QoS satisfaction rate at a cost less than twice that of the optimal (in hindsight) VM allocations, and half the cost of allocating VMs for the peak demand in the workload.

Title: Uncertainty-Aware Elastic Virtual Machine Scheduling for Stream Processing Systems
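As an illustrative sketch of uncertainty-aware provisioning (the linear-mean, Gaussian-error MST model and all numbers here are assumptions of this sketch, not the paper's model), one can pick the smallest VM count whose predicted MST covers the input rate with the target probability:

```python
import math

def phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def vms_needed(rate, mst_per_vm, mst_std, qos_target, max_vms=64):
    """Smallest VM count n such that P(MST(n) >= rate) >= qos_target,
    assuming MST(n) ~ Normal(n * mst_per_vm, mst_std) for illustration."""
    for n in range(1, max_vms + 1):
        p_sustain = 1.0 - phi((rate - n * mst_per_vm) / mst_std)
        if p_sustain >= qos_target:
            return n
    return max_vms
```

For example, with a mean MST of 300 tuples/s per VM and a standard deviation of 100, sustaining 1000 tuples/s at a 95% QoS target needs 4 VMs, one more than the 3 that a variance-blind point estimate would allocate; this over-provisioning against model uncertainty is what buys the higher satisfaction rate.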