Code 5-6: An Efficient MDS Array Coding Scheme to Accelerate Online RAID Level Migration
Chentao Wu, Xubin He, Jie Li, M. Guo. DOI: 10.1109/ICPP.2015.54

With the rapid growth of data storage, high reliability has become critical in large data centers, where RAID-5 is widely used. However, the disk failure rate increases sharply after a period of use, so concurrent disk failures are not rare and RAID-5 alone is insufficient to provide high reliability. One solution is to convert an existing RAID-5 to a RAID-6 (a form of "RAID level migration") that tolerates more concurrent disk failures via erasure codes, but existing approaches involve complex conversion processes and high transformation cost. To address these challenges, we propose a novel MDS code, called "Code 5-6", which combines a new dedicated parity column with the original RAID-5 layout. Code 5-6 not only accelerates online conversion from RAID-5 to RAID-6 but also retains several optimal properties of MDS codes. Our mathematical analysis shows that, compared to existing MDS codes, Code 5-6 reduces the number of new parities by up to 80%, decreases total I/O operations by up to 48.5%, and speeds up the conversion process by up to 3.38×.

Crowdsourcing Sensing Workloads of Heterogeneous Tasks: A Distributed Fairness-Aware Approach
Wei Sun, Yanmin Zhu, L. Ni, Bo Li. DOI: 10.1109/ICPP.2015.67

Crowdsourced sensing over smartphones presents a new paradigm for collecting sensing data over a vast area for real-time monitoring applications. A monitoring application may require different types of sensing data while operating under a budget constraint. This paper explores the crucial problem of maximizing the aggregate data utility of heterogeneous sensing tasks while maintaining utility-centric fairness across tasks under a budget constraint; in particular, we take the redundancy of sensing data into account. The problem is highly challenging given its unique characteristics, including the intrinsic trade-off between aggregate data utility and fairness and the large number of smartphones. We propose a fairness-aware distributed approach to solving this problem. To overcome its intractability, we decompose it into two subproblems: recruiting smartphones under a budget constraint and allocating the workloads of sensing tasks. For the first subproblem, we propose an efficient greedy algorithm with a constant approximation ratio of two. For the second, we apply dual-based decomposition and design a distributed algorithm for determining the workloads of different tasks on each recruited smartphone. We have implemented our distributed algorithm on a Windows-based server and Android-based smartphones. Extensive simulations demonstrate that our approach achieves high aggregate data utility while maintaining good utility-centric fairness across sensing tasks.

Study on Partitioning Real-World Directed Graphs of Skewed Degree Distribution
Jie Yan, Guangming Tan, Ninghui Sun. DOI: 10.1109/ICPP.2015.37

Distributed computation on directed graphs has become increasingly important in emerging big data analytics. However, partitioning huge real-world graphs, such as social and web networks, is known to be challenging because of their skewed (power-law) degree distributions. In this paper, by investigating two representative k-way balanced edge-cut methods (the LDG streaming heuristic and METIS) on 12 real social and web graphs, we empirically find that both LDG and METIS can partition page-level web graphs with extremely high quality but fail to generate low-cut balanced partitions for social networks and host-level web graphs. Our analysis identifies the global star-motif structures around high-degree vertices as the main obstacle to high-quality partitioning. Based on this empirical study, we propose a new distributed graph model, Agent-Graph, and the Agent+ framework that partitions power-law graphs in the Agent-Graph model. Agent-Graph is a vertex-cut variant in the context of message passing, in which any high-degree vertex can be factored into computational agents placed in remote partitions for message combining and scattering. The Agent+ framework filters out the high-degree vertices to form a residual graph, which is then partitioned with high quality by existing edge-cut methods, and finally refills the high-degree vertices as agents to construct an agent-graph. Experiments show that Agent+ consistently generates high-quality partitions for all tested real-world skewed graphs. In particular, for 64-way partitioning of social networks and host-level web graphs, Agent+ reduces the equivalent edge cut by 27%~79% for LDG and 23%~82% for METIS.

Region-Based May-Happen-in-Parallel Analysis for C Programs
Peng Di, Yulei Sui, Ding Ye, Jingling Xue. DOI: 10.1109/ICPP.2015.98

The C programming language continues to play an essential role in the development of system software. May-Happen-in-Parallel (MHP) analysis is the basis of many other analyses and optimisations for concurrent programs. Existing MHP analyses that work well for programming languages such as X10 are often not effective for C (with Pthreads). This paper presents a new MHP algorithm for C that operates at the granularity of code regions rather than individual statements in a program. A flow-sensitive Happens-Before (HB) analysis is performed to account for the fork-join semantics of Pthreads on an interprocedural, thread-sensitive control flow graph representation of a program, enabling the HB relations among its statements to be discovered. All the statements that share the same HB properties are then grouped into one region. As a result, computing the MHP information for all pairs of statements in a program reduces to inferring the HB relations among its regions. We have implemented our algorithm in LLVM 3.5.0 and evaluated it using 14 programs from the SPLASH-2 and PARSEC benchmark suites. Our preliminary results show that our approach is more precise than two existing MHP analyses yet computationally comparable with the fastest MHP analysis.

Automatic OpenCL Code Generation for Multi-device Heterogeneous Architectures
Pei Li, E. Brunet, François Trahay, C. Parrot, Gaël Thomas, R. Namyst. DOI: 10.1109/ICPP.2015.105

Using multiple accelerators, such as GPUs or Xeon Phis, is attractive for improving the performance of large data-parallel applications and for increasing the size of their workloads. However, writing an application for multiple accelerators remains challenging, because going from a single accelerator to multiple ones requires dealing with potentially non-uniform domain decomposition, inter-accelerator data movement, and dynamic load balancing. Writing such code manually is time consuming and error-prone. In this paper, we propose a new programming tool called STEPOCL, along with a new domain-specific language designed to simplify the development of applications for multiple accelerators. We evaluate both the performance and the usefulness of STEPOCL with three applications and show that: (i) the performance of an application written with STEPOCL scales linearly with the number of accelerators, (ii) the performance of an application written with STEPOCL competes with a handwritten version, (iii) larger workloads can run on multiple devices even when they do not fit in the memory of a single device, and (iv) thanks to STEPOCL, the number of lines of code required to write an application for multiple accelerators is roughly divided by ten.

LPM: Concurrency-Driven Layered Performance Matching
Yuhang Liu, Xian-He Sun. DOI: 10.1109/ICPP.2015.97

Data access has become the preeminent performance bottleneck of computing. In this study, a Layered Performance Matching (LPM) model and its associated algorithm are proposed to match the request and reply speeds at each layer of a memory hierarchy and thereby improve memory performance. The rationale of LPM is that the performance of each layer of a memory hierarchy should, and can, be optimized to closely match the requests of the layer directly above it. The LPM model considers data access concurrency and locality simultaneously. It reveals that increasing the effective overlap between hits and misses of the higher layer alleviates the performance impact of the lower layer. The terms pure miss and pure miss penalty are introduced to measure the effectiveness of such hit-miss overlapping. By distinguishing between (general) misses and pure misses, we make LPM optimization practical and feasible. Our evaluation shows that data stall time can be reduced significantly with an optimized hardware configuration. We also achieve noticeable performance improvement by simply adopting smart LPM scheduling without changing the underlying hardware configuration. Analysis and experimental results show that LPM is feasible and effective. It provides a novel and efficient way to cope with the ever-widening memory wall problem and to optimize memory system design.

Software-Based Lightweight Multithreading to Overlap Memory-Access Latencies of Commodity Processors
Cihang Jiang, Youhui Zhang, Weimin Zheng. DOI: 10.1109/ICPP.2015.71

Emerging service applications operate on vast datasets that are kept in DRAM to minimize latency and improve throughput. A considerable fraction of these applications issue irregular memory references, which cause serious locality problems. This paper presents SLIM, a software-based lightweight multithreading framework that tackles this problem on commodity hardware while keeping the simple style of multithreaded programming. The principle is straightforward: when issuing an irregular memory reference, the current fine-granularity thread uses an asynchronous memory-access primitive and then switches itself out so that other threads can execute, overlapping the long memory latency. Meanwhile, SLIM tries to keep most thread-context contents in the on-chip cache to reduce cache misses. The main challenge therefore lies in improving cache behavior despite the extra instructions needed for context switches and the smaller cache space left for the application. We propose a corresponding performance model to guide the design, which is verified by tests. Moreover, an optimized synchronization mechanism has been designed. For a classic irregular application, extensive tests explore how performance is affected by system configurations, including the aggressiveness of data prefetching and the distribution of tasks among cores and CPUs. Results show that SLIM achieves higher performance than a counterpart using traditional threads across different data scales. Even compared to hand-optimized code, its performance is comparable, and it preserves the simple programming style of high-concurrency applications.

Characterizing Loop-Level Communication Patterns in Shared Memory
Arya Mazaheri, A. Jannesari, Abdolreza Mirzaei, F. Wolf. DOI: 10.1109/ICPP.2015.85

Communication patterns extracted from parallel programs can provide a valuable source of information for parallel pattern detection, application auto-tuning, and runtime workload scheduling on heterogeneous systems. Once identified, such patterns can help find the most promising optimizations. Communication patterns can be detected using different methods, including sandbox simulation, memory profiling, and hardware counter analysis. However, these analyses usually suffer from high runtime and memory overhead, necessitating a trade-off between accuracy and resource consumption. More importantly, none of the existing methods exploits fine-grained communication patterns at the level of individual code regions. In this paper, we present an efficient tool based on the DiscoPoP profiler that characterizes the communication pattern of every hotspot in a shared-memory application. With the aid of static and dynamic code analysis, it produces a nested structure of communication patterns based on the program's loops. By employing an asymmetric signature memory, the runtime overhead is around 225× while the required amount of memory remains fixed. In comparison with other profilers, the proposed method is efficient enough to be used with real-world applications.

Optimizing MapReduce Based on Locality of K-V Pairs and Overlap between Shuffle and Local Reduce
Jianjiang Li, Jie Wu, Xiaolei Yang, Shiqi Zhong. DOI: 10.1109/ICPP.2015.103

At present, MapReduce is the most popular programming model for big data processing. In Hadoop, a typical open-source implementation of MapReduce, execution is divided into map, shuffle, and reduce phases. In the map phase, following the principle of moving computation towards data, the load is basically balanced and network traffic is relatively small. However, the shuffle phase is likely to produce bursts of network communication, and a reduce phase that ignores data skew leads to an imbalanced load and degraded performance. This paper proposes a Locality-Enhanced Load Balance (LELB) algorithm, extends the execution flow of MapReduce to Map, Local reduce, Shuffle, and final Reduce (MLSR), and proposes a corresponding MLSR algorithm. The novel algorithms share the computation of the reduce phase and overlap it with the shuffle phase in order to take full advantage of CPU and I/O resources. Actual test results demonstrate that execution with the LELB and MLSR algorithms outperforms Hadoop by up to 9.2% (for Merge Sort) and 14.4% (for Word Count).

SLA-Based Resource Scheduling for Big Data Analytics as a Service in Cloud Computing Environments
Yali Zhao, R. Calheiros, G. Gange, K. Ramamohanarao, R. Buyya. DOI: 10.1109/ICPP.2015.60

Data analytics plays a significant role in gaining insight from big data to support decision making and problem solving in application domains such as science, engineering, and commerce. Cloud computing is a suitable platform for Big Data Analytic Applications (BDAAs), as it can greatly reduce application cost by elastically provisioning resources based on user requirements in a pay-as-you-go model. BDAAs are typically tailored to specific domains and are usually expensive. Moreover, it is difficult to provision resources for BDAAs with fluctuating resource requirements while keeping resource costs low. As a result, BDAAs are mostly used by large enterprises. It is therefore desirable to have a general Analytics as a Service (AaaS) platform that delivers BDAAs to users in various domains as easy-to-use, consumable services at lower prices. To support such an AaaS platform, our research focuses on efficiently scheduling Cloud resources for BDAAs to satisfy the Quality of Service (QoS) requirements of data analytic requests, namely budget and deadline, while maximizing profit for the AaaS platform. We propose an admission control and resource scheduling algorithm that not only satisfies the QoS requirements of requests as guaranteed in Service Level Agreements (SLAs), but also increases the profit of AaaS providers by offering a cost-effective resource scheduling solution. We present the architecture and models of the AaaS platform and conduct experiments to evaluate the proposed algorithm. Results show its effectiveness in SLA guarantee, profit enhancement, and cost saving.