In this paper, we address the problem caused by the fixed assignment of task slots in Hadoop MapReduce. Manually configuring the optimal number of task slots is infeasible because different workloads have different characteristics. We design and implement an automatic control mechanism that dynamically assigns task slots based on the resource utilization of each TaskTracker node. The assignment takes the lag period into account. The mechanism can improve cluster-wide resource utilization and avoid contention. Experimental results show that our implementation can dynamically adjust the task slot capacity to the optimal setting at runtime. In some cases, such as Word Count, our control mechanism outperforms the current Hadoop running with the optimal task slot configuration found by manual tuning.
Automatic task slots assignment in Hadoop MapReduce. Kun Wang, B. Tan, Juwei Shi, Bo Yang. In Proceedings of the 1st Workshop on Architectures and Systems for Big Data, 2011-10-10. https://doi.org/10.1145/2377978.2377982
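The Hadoop slot-assignment abstract above does not give the control law, so the following is only a minimal sketch of one way a per-node controller could resize task slot capacity from utilization samples while honoring a lag period. It is not the authors' mechanism: `SlotController`, the 0.6/0.9 utilization thresholds, the one-slot step, and modeling the lag period as a cooldown of a few sampling intervals are all hypothetical choices, and a single slot pool stands in for Hadoop's separate map and reduce slots.

```python
from dataclasses import dataclass, field


@dataclass
class SlotController:
    """Hypothetical per-node controller that resizes task slot capacity.

    Illustrative only: thresholds, step size, and lag handling are assumptions,
    not the mechanism from the paper.
    """

    slots: int = 2
    min_slots: int = 1
    max_slots: int = 16
    low_util: float = 0.6   # below this, the node is treated as under-utilized
    high_util: float = 0.9  # above this, the node is treated as contended
    lag_period: int = 3     # sampling intervals to wait after each change
    _cooldown: int = field(default=0, init=False)

    def step(self, utilization: float) -> int:
        """Consume one utilization sample in [0, 1] and return the slot capacity."""
        if self._cooldown > 0:
            # The last change has not yet shown up in the measurements,
            # so hold the current capacity instead of reacting again.
            self._cooldown -= 1
            return self.slots
        if utilization < self.low_util and self.slots < self.max_slots:
            self.slots += 1  # spare capacity: offer one more slot
            self._cooldown = self.lag_period
        elif utilization > self.high_util and self.slots > self.min_slots:
            self.slots -= 1  # contention: stop offering one slot
            self._cooldown = self.lag_period
        return self.slots


if __name__ == "__main__":
    ctrl = SlotController(slots=2, lag_period=2)
    trace = [0.3, 0.3, 0.3, 0.3, 0.95, 0.95, 0.95]
    print([ctrl.step(u) for u in trace])  # [3, 3, 3, 4, 4, 4, 3]
```

Waiting out the lag period keeps this controller from adding slots again before the tasks it already admitted have had time to raise utilization, which avoids the overshoot a purely instantaneous rule would show.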
Alexander B. Alexandrov, Berni Schiefer, John Poelman, Stephan Ewen, Thomas Bodner, V. Markl
The need for efficient data generation for testing and benchmarking newly developed massively parallel data processing systems has increased with the emergence of Big Data problems. As synthetic data model specifications evolve over time, the data generator programs implementing these models have to be adapted continuously, a task that often becomes more tedious as the set of model constraints grows. In this paper we present Myriad, a new parallel data generation toolkit. Data generators created with the toolkit can quickly produce very large datasets in a shared-nothing parallel execution environment while preserving cross-partition dependencies, correlations, and distributions in the generated data. In addition, we report on our efforts towards a benchmark suite for large-scale parallel analysis systems that uses Myriad to generate OLAP-style relational datasets.
Myriad: parallel data generation on shared-nothing architectures. Alexander B. Alexandrov, Berni Schiefer, John Poelman, Stephan Ewen, Thomas Bodner, V. Markl. In Proceedings of the 1st Workshop on Architectures and Systems for Big Data, 2011-10-10. https://doi.org/10.1145/2377978.2377983
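A common way to let shared-nothing partitions generate data independently while keeping cross-partition references consistent is to make every pseudo-random value a pure function of its address (seed, table, row, column), for example by hashing that address. The sketch below illustrates that general technique for the Myriad abstract above; it is not Myriad's API, and the two-table `customer`/`order` schema, the region/currency correlation, and all function names are hypothetical.

```python
import hashlib

N_CUSTOMERS = 1_000
N_ORDERS = 10_000
CURRENCY = {"EU": "EUR", "US": "USD", "APAC": "JPY"}


def rand_unit(seed: int, table: str, row: int, column: str) -> float:
    """Pseudo-random value in [0, 1) determined only by its address.

    Any node can recompute any value without talking to the node that
    generates the row, which is what makes partitions independent.
    """
    key = f"{seed}:{table}:{row}:{column}".encode()
    bits = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
    return (bits >> 11) / (1 << 53)  # keep 53 bits so the result is exactly < 1


def customer(seed: int, cid: int) -> dict:
    # Skewed region distribution: 50% EU, 30% US, 20% APAC.
    u = rand_unit(seed, "customer", cid, "region")
    region = "EU" if u < 0.5 else "US" if u < 0.8 else "APAC"
    return {"id": cid, "region": region}


def order(seed: int, oid: int) -> dict:
    cid = int(rand_unit(seed, "order", oid, "customer") * N_CUSTOMERS)
    # Cross-partition dependency: the referenced customer may belong to another
    # partition, so its region is regenerated locally rather than looked up.
    region = customer(seed, cid)["region"]
    return {"id": oid, "customer": cid, "currency": CURRENCY[region]}


def generate_partition(seed, part, n_parts, n_rows, row_fn):
    """Generate the contiguous row range owned by partition `part`."""
    start = part * n_rows // n_parts
    end = (part + 1) * n_rows // n_parts
    return [row_fn(seed, i) for i in range(start, end)]


if __name__ == "__main__":
    # Four independent partitions produce exactly the same rows as one.
    whole = generate_partition(7, 0, 1, N_ORDERS, order)
    parts = [r for p in range(4) for r in generate_partition(7, p, 4, N_ORDERS, order)]
    print(whole == parts)  # True
```

Because no partition reads another partition's output, the correlation between an order's currency and its customer's region holds globally even though each node only ever generates its own row range.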
Xiaoyan Gu, Rui Hou, Ke Zhang, Lixin Zhang, Weiping Wang
Building energy-efficient systems is critical for big data applications. This paper investigates and compares the energy consumption and execution time of a typical Hadoop-based big data application running on a traditional Xeon-based cluster and on an Atom-based (micro-server) cluster. Our experimental results show that the micro-server platform is more energy-efficient than the Xeon-based platform. They also reveal that data compression and decompression account for a considerable share of the total execution time: 7-11% of the execution time of the map tasks and 37.9-41.2% of the execution time of the reduce tasks. Based on these findings, we demonstrate the need for a heterogeneous architecture for energy-efficient big data processing, one that combines the advantages of micro-server processors and hardware compression/decompression accelerators. In addition, we propose a mechanism that enables the accelerators to perform data compression/decompression more efficiently.
Application-driven energy-efficient architecture explorations for big data. Xiaoyan Gu, Rui Hou, Ke Zhang, Lixin Zhang, Weiping Wang. In Proceedings of the 1st Workshop on Architectures and Systems for Big Data, 2011-10-10. https://doi.org/10.1145/2377978.2377984
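Amdahl's law (applied here as an illustration, not taken from the paper's own analysis) gives a quick bound on what a compression accelerator can buy, using the shares reported in the energy-efficiency abstract above: if a fraction f of a task's time is spent in compression/decompression and an accelerator speeds that part up by a factor s, the whole task speeds up by 1 / ((1 - f) + f / s). The 4x accelerator factor below is a hypothetical value; the infinite case is the upper bound.

```python
def amdahl_speedup(fraction: float, accel: float) -> float:
    """Whole-task speedup when `fraction` of its time runs `accel` times faster."""
    if not 0.0 <= fraction <= 1.0 or accel <= 0:
        raise ValueError("fraction must be in [0, 1] and accel must be positive")
    remaining = (1.0 - fraction) + fraction / accel  # normalized new run time
    return float("inf") if remaining == 0 else 1.0 / remaining


# Compression/decompression share of task time, as reported in the abstract.
SHARES = {"map": (0.07, 0.11), "reduce": (0.379, 0.412)}

if __name__ == "__main__":
    for task, (low, high) in SHARES.items():
        for accel in (4.0, float("inf")):  # 4x is a hypothetical accelerator
            print(f"{task}, accel={accel}: "
                  f"{amdahl_speedup(low, accel):.2f}x to {amdahl_speedup(high, accel):.2f}x")
```

For reduce tasks, even an infinitely fast accelerator caps the speedup at about 1.61x to 1.70x (1/0.621 and 1/0.588); for map tasks the cap is about 1.08x to 1.12x. The model ignores energy and any overlap between compression and other work.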
A. Samih, Ren Wang, C. Maciocco, T. Tai, Yan Solihin
With the rapid development of highly integrated distributed systems (cluster systems), especially those encapsulated within a single platform [28, 9], designers face interesting memory hierarchy design choices aimed at avoiding disk swapping. Disk swapping slows application execution drastically. Leveraging remote free memory through Memory Collaboration has proven cost-effective compared with overprovisioning for peak load requirements. Recent studies propose several ways of accessing under-utilized remote memory in static system configurations, without a detailed exploration of dynamic memory collaboration. Dynamic collaboration is important given the run-time fluctuations in memory usage in clustered systems. In this paper, we propose an Autonomous Collaborative Memory System (ACMS) that manages memory resources dynamically at run time to optimize performance and provides QoS measures for the participating nodes. We implement a prototype of the proposed ACMS, experiment with a wide range of real-world applications, and show up to a 3x speedup compared with a non-collaborative memory system, without perceivable performance impact on the nodes that provide memory. Based on our experiments, we analyze the remote memory access overhead in detail and provide insights for future optimizations.
A collaborative memory system for high-performance and cost-effective clustered architectures. A. Samih, Ren Wang, C. Maciocco, T. Tai, Yan Solihin. In Proceedings of the 1st Workshop on Architectures and Systems for Big Data, 2011-10-10. https://doi.org/10.1145/2377978.2377979
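The ACMS abstract above does not describe its allocation policy. As a hypothetical illustration of dynamic memory collaboration with a simple QoS guard, the sketch below matches nodes whose free memory has fallen below a low watermark with nodes that can lend memory without dropping below a reserve. `Node`, `plan_collaboration`, the watermark parameters, and the greedy largest-first matching are all assumptions, not the paper's design.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    total_mb: int
    used_mb: int

    @property
    def free_mb(self) -> int:
        return self.total_mb - self.used_mb


def plan_collaboration(nodes, low_mb: int, reserve_mb: int) -> dict:
    """Plan one round of remote-memory grants as {(consumer, provider): MB}.

    Hypothetical policy: a node with less than `low_mb` free asks for the
    shortfall; a provider lends only what it has above `reserve_mb`, so its
    own workload keeps headroom (a crude stand-in for a QoS guarantee).
    """
    if reserve_mb < low_mb:
        raise ValueError("reserve_mb must be >= low_mb so no node both lends and borrows")
    demand = {n.name: low_mb - n.free_mb for n in nodes if n.free_mb < low_mb}
    supply = {n.name: n.free_mb - reserve_mb for n in nodes if n.free_mb > reserve_mb}
    grants = {}
    # Serve the neediest consumer first, drawing from the largest lender first.
    for consumer, need in sorted(demand.items(), key=lambda kv: -kv[1]):
        for provider in sorted(supply, key=lambda p: -supply[p]):
            if need == 0:
                break
            amount = min(need, supply[provider])
            if amount > 0:
                grants[(consumer, provider)] = amount
                supply[provider] -= amount
                need -= amount
        # Any demand still unmet here would fall back to local disk swapping.
    return grants


if __name__ == "__main__":
    cluster = [Node("a", 16384, 16000), Node("b", 16384, 4000), Node("c", 16384, 10000)]
    print(plan_collaboration(cluster, low_mb=2048, reserve_mb=4096))  # {('a', 'b'): 1664}
```

Re-running the planner each round with fresh usage numbers is what makes the collaboration dynamic in this sketch: a node can lend memory in one round and borrow it in the next as its workload changes.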
Pietro Cicotti, Jeffrey W. Bennet, Shawn M. Strande, R. Sinkovits, A. Snavely
To meet the growing demand for high performance computing systems that are capable of processing large datasets, the San Diego Supercomputer Center is deploying Gordon. This system was specifically designed for data intensive workloads and uses flash memory to fill the large latency gap in the memory hierarchy between DRAM and hard disk. In preparation for the deployment of Gordon, we evaluated the performance of multiple remote storage technologies and file systems for use with the flash memory. We find that OCFS and XFS are both superior to PVFS at delivering fast random access to flash. In addition, our tests indicate that the Linux SCSI target framework (TGT) can export flash storage devices with minimal overhead and achieve a large fraction of the theoretical peak I/O performance. Despite the difficulties in fairly comparing I/O solutions due to the many differences in file systems and service implementations, we conclude that OCFS on TGT is a viable option for our system as it provides both excellent performance and a user-friendly shared file system interface. In those instances where a parallel file system is not required, XFS on TGT is a better alternative.
Evaluation of I/O technologies on a flash-based I/O sub-system for HPC. Pietro Cicotti, Jeffrey W. Bennet, Shawn M. Strande, R. Sinkovits, A. Snavely. In Proceedings of the 1st Workshop on Architectures and Systems for Big Data, 2011-10-10. https://doi.org/10.1145/2377978.2377980
Current trends in computing and system architecture point towards a need for accelerators such as GPUs to have inherent communication capabilities. We review previous and current software libraries that provide pseudo-communication abilities through direct message passing. We show how these libraries benefit the HPC community but are not forward-looking enough. We explain why MPI should be extended to support these accelerators and provide a roadmap of achievable milestones for completing such an extension, some of which require advances in hardware and device drivers.
Extending MPI to accelerators. Jeff A. Stuart, P. Balaji, John Douglas Owens. In Proceedings of the 1st Workshop on Architectures and Systems for Big Data, 2011-10-10. https://doi.org/10.1145/2377978.2377981
Proceedings of the 1st Workshop on Architectures and Systems for Big Data. https://doi.org/10.1145/2377978