Nusrat S. Islam, D. Shankar, Xiaoyi Lu, Md. Wasi-ur-Rahman, D. Panda
Hadoop Distributed File System (HDFS) is the underlying storage engine of many Big Data processing frameworks such as Hadoop MapReduce, HBase, Hive, and Spark. Even though HDFS is well known for its scalability and reliability, its requirement for a large amount of local storage space makes HDFS deployment challenging on HPC clusters. Moreover, HPC clusters usually have large installations of parallel file systems such as Lustre. In this study, we propose a novel design to integrate HDFS with Lustre through a high-performance key-value store. We design a burst buffer system using RDMA-based Memcached and present three schemes to integrate HDFS with Lustre through this buffer layer, considering different aspects of I/O, data locality, and fault tolerance. Our proposed schemes can improve performance for Big Data applications on HPC clusters while reducing the local storage requirement. Performance evaluations show that our design can improve the write performance of TestDFSIO by up to 2.6x over HDFS and 1.5x over Lustre. The gain in read throughput is up to 8x. Sort execution time is reduced by up to 28% over Lustre and 19% over HDFS. Our design can also significantly benefit I/O-intensive workloads compared to both HDFS and Lustre.
{"title":"Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-Based Key-Value Store","authors":"Nusrat S. Islam, D. Shankar, Xiaoyi Lu, Md. Wasi-ur-Rahman, D. Panda","doi":"10.1109/ICPP.2015.79","DOIUrl":"https://doi.org/10.1109/ICPP.2015.79","url":null,"abstract":"Hadoop Distributed File System (HDFS) is the underlying storage engine of many Big Data processing frameworks such as Hadoop MapReduce, HBase, Hive, and Spark. Even though HDFS is well-known for its scalability and reliability, the requirement of large amount of local storage space makes HDFS deployment challenging on HPC clusters. Moreover, HPC clusters usually have large installation of parallel file system like Lustre. In this study, we propose a novel design to integrate HDFS with Lustre through a high performance key-value store. We design a burst buffer system using RDMA-based Mem cached and present three schemes to integrate HDFS with Lustre through this buffer layer, considering different aspects of I/O, data-locality, and fault-tolerance. Our proposed schemes can ensure performance improvement for Big Data applications on HPC clusters. At the same time, they lead to reduced local storage requirement. Performance evaluations show that, our design can improve the write performance of Test DFSIO by up to 2.6x over HDFS and 1.5x over Lustre. The gain in read throughput is up to 8x. Sort execution time is reduced by up to 28% over Lustre and 19% over HDFS. 
Our design can also significantly benefit I/O-intensive workloads compared to both HDFS and Lustre.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124966517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many data center applications have deadline requirements, which call for deadline-awareness in network transport. A flow counts as completed only if it finishes within its deadline. Transport protocols in current data centers try to share network resources fairly and are deadline-agnostic. Several recent works try to address the problem by making as many flows meet their deadlines as possible. However, for many data center applications, a task cannot be completed until its last flow finishes, which means that the bandwidth consumed by completed flows is wasted if some flows in the task miss their deadlines. In this paper we design a task-level deadline-aware preemptive flow scheduling scheme (TAPS), which aims to make more tasks meet their deadlines. We leverage software defined networking (SDN) technology and generalize SDN from flow-level awareness to task-level awareness. The scheduling algorithm runs on the SDN controller, which decides whether a flow should be accepted or discarded, pre-allocates the transmission time slices, and computes the routing paths for accepted flows. Extensive flow-level simulations demonstrate that TAPS outperforms Varys, Baraat, PDQ (Preemptive Distributed Quick flow scheduling), D3 (Deadline-Driven Delivery control protocol), and Fair Sharing transport protocols in deadline-sensitive data center environments. A simple implementation on real systems also shows that TAPS makes highly effective use of the network bandwidth in data centers.
{"title":"TAPS: Software Defined Task-Level Deadline-Aware Preemptive Flow Scheduling in Data Centers","authors":"Lili Liu, Dan Li, Jianping Wu","doi":"10.1109/ICPP.2015.75","DOIUrl":"https://doi.org/10.1109/ICPP.2015.75","url":null,"abstract":"Many data center applications have deadline requirements, which pose a requirement of deadline-awareness in network transport. Completing within deadlines is a necessary requirement for flows to be completed. Transport protocols in current data centers try to share the network resources fairly and are deadline-agnostic. Recently several works try to address the problem by making as many flows meet deadlines as possible. However, for many data center applications, a task cannot be completed until the last flow finishes, which indicates the bandwidths consumed by completed flows are wasted if some flows in the task cannot meet deadlines. In this paper we design a task-level deadline-aware preemptive flow scheduling(TAPS), which aims to make more tasks meet deadlines. We leverage software defined networking (SDN) technology and generalize SDN from flow-level awareness to task-level awareness. The scheduling algorithm runs on the SDN controller, which decides whether a flow should be accepted or discarded, pre-allocates the transmission time slices and computes the routing paths for accepted flows. Extensive flow-level simulations demonstrate TAPS outperforms Varys, Bara at, PDQ (Preemptive Distributed Quick flow scheduling), D3 (Deadline-Driven Delivery control protocol) and Fair Sharing transport protocols in deadline sensitive data center environment. 
A simple implementation on real systems also proves that TAPS makes high effective utilization of the network bandwidth in data centers.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"147 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114642562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
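The task-level admission idea in the TAPS abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes a single link of fixed capacity, uses an earliest-deadline-first feasibility check, and ignores routing, preemption details, and time-slice allocation. The names `Flow`, `feasible`, and `admit_tasks`, as well as the units and numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    size: float      # data to send, in megabits (hypothetical unit)
    deadline: float  # seconds from now


def feasible(flows, capacity):
    """Return True if every flow meets its deadline when the flows are
    sent one after another in earliest-deadline-first order on one link."""
    finish = 0.0
    for f in sorted(flows, key=lambda f: f.deadline):
        finish += f.size / capacity
        if finish > f.deadline + 1e-9:
            return False
    return True


def admit_tasks(tasks, capacity):
    """Accept a task only if all of its flows fit alongside the flows
    already accepted; otherwise reject the whole task, so no bandwidth
    is spent on flows of a task that cannot finish in time."""
    accepted_flows = []
    accepted = []
    for name, flows in tasks:
        if feasible(accepted_flows + flows, capacity):
            accepted_flows += flows
            accepted.append(name)
    return accepted


tasks = [
    ("A", [Flow(100, 1.0), Flow(100, 2.0)]),
    ("B", [Flow(100, 1.5), Flow(300, 2.5)]),
    ("C", [Flow(50, 3.0)]),
]
accepted = admit_tasks(tasks, capacity=100.0)  # link capacity in megabits/s
print(accepted)
```

In this toy input, task B is rejected as a whole because one of its flows would miss its deadline, which leaves room for task C.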
Song Wu, Chuxiong Yan, Haibao Chen, Hai Jin, Wenting Guo, Zhen Wang, Deqing Zou
For data centers with limited power supply, restricting the servers' power budget (i.e., the maximal power provided to servers) is an efficient approach to increase the server density (the server quantity per rack), which can effectively improve the cost-effectiveness of the data centers. However, this approach may also affect the performance of applications running on the servers. Hence, the prerequisite for adopting the approach in data centers is to precisely evaluate the application performance degradation caused by restricting the servers' power budget. Unfortunately, existing evaluation methods are inaccurate because they are either improper or coarse-grained, especially for the latency-sensitive applications widely deployed in data centers. In this paper, we analyze why state-of-the-art methods are not appropriate for evaluating the performance degradation of latency-sensitive applications under power restriction, and we propose a new evaluation method that provides a fine-grained way to precisely describe and evaluate such degradation. We verify our proposed method with a real-world application and traces from Tencent's data center with 25328 servers. The experimental results show that our method is much more accurate than the state of the art, and that we can significantly increase datacenter efficiency by saving servers' power budget while keeping the applications' performance degradation within a controllable and acceptable range.
{"title":"Evaluating Latency-Sensitive Applications: Performance Degradation in Datacenters with Restricted Power Budget","authors":"Song Wu, Chuxiong Yan, Haibao Chen, Hai Jin, Wenting Guo, Zhen Wang, Deqing Zou","doi":"10.1109/ICPP.2015.73","DOIUrl":"https://doi.org/10.1109/ICPP.2015.73","url":null,"abstract":"For data centers with limited power supply, restricting the servers' power budget (i.e., The maximal power provided to servers) is an efficient approach to increase the server density (the server quantity per rack), which can effectively improve the cost-effectiveness of the data centers. However, this approach may also affect the performance of applications in servers. Hence, the prerequisite of adopting the approach in data centers is to precisely evaluate the application performance degradation caused by restricting the servers' power budget. Unfortunately, existing evaluation methods are inaccurate because they are either improper or coarse-grained, especially for the latency-sensitive applications widely deployed in data centers. In this paper, we analyze the reasons why state-of-the-art methods are not appropriate for evaluating the performance degradation of latency-sensitive applications in case of power restriction, and we propose a new evaluation method which can provide a fine-grained way to precisely describe and evaluate such degradation. We verify our proposed method by a real-world application and the traces from Ten cent's date enter with 25328 servers. 
The experimental results show that our method is much more accurate compared with the state of the art, and we can significantly increase datacenter efficiency by saving servers' power budget while maintaining the applications' performance degradation within controllable and acceptable range.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116773758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
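The relation between per-server power budget and server density mentioned in the abstract above can be made concrete with a small Python sketch. The rack power and budget values are hypothetical and do not come from the paper, and the sketch assumes power is the only limit on how many servers fit in a rack.

```python
def servers_per_rack(rack_power_w, server_budget_w):
    """Number of servers a rack can host when each server is capped
    at server_budget_w watts and the rack supplies rack_power_w watts."""
    return int(rack_power_w // server_budget_w)


rack_power_w = 6000.0  # hypothetical rack power supply
densities = {
    budget: servers_per_rack(rack_power_w, budget)
    for budget in (300.0, 250.0, 200.0)  # hypothetical per-server budgets
}
for budget, count in densities.items():
    print(f"budget {budget:.0f} W -> {count} servers per rack")
```

Lowering the budget raises the density, which is why the performance degradation caused by the cap has to be evaluated before the budget is chosen.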
Shanfeng Zhang, Q. Ma, Tong Zhu, Kebin Liu, Lan Zhang, Wenbo He, Yunhao Liu
Crowdsensing applications require individuals to share local and personal sensing data with others to produce valuable knowledge and services. Meanwhile, this has raised concerns, especially about location privacy. Users may wish to prevent privacy leaks while publishing as many non-sensitive contexts as possible. Simply suppressing sensitive contexts is vulnerable to adversaries exploiting spatio-temporal correlations in users' behavior. In this work, we present PLP, a crowdsensing scheme which preserves privacy while maximizing the amount of data collected by filtering a user's context stream. PLP leverages a conditional random field to model the spatio-temporal correlations among the contexts, and proposes a speed-up algorithm to learn the weaknesses in the correlations. Even if the adversaries are strong enough to know the filtering system and the weaknesses, PLP can still provably preserve privacy, with little computational cost for online operations. PLP is evaluated and validated over two real-world smartphone context traces of 34 users. The experimental results show that PLP efficiently protects privacy without sacrificing much utility.
{"title":"PLP: Protecting Location Privacy Against Correlation-Analysis Attack in Crowdsensing","authors":"Shanfeng Zhang, Q. Ma, Tong Zhu, Kebin Liu, Lan Zhang, Wenbo He, Yunhao Liu","doi":"10.1109/ICPP.2015.20","DOIUrl":"https://doi.org/10.1109/ICPP.2015.20","url":null,"abstract":"Crowdsensing applications require individuals toshare local and personal sensing data with others to produce valuableknowledge and services. Meanwhile, it has raised concernsespecially for location privacy. Users may wish to prevent privacyleak and publish as many non-sensitive contexts as possible.Simply suppressing sensitive contexts is vulnerable to the adversariesexploiting spatio-temporal correlations in users' behavior.In this work, we present PLP, a crowdsensing scheme whichpreserves privacy while maximizes the amount of data collectionby filtering a user's context stream. PLP leverages a conditionalrandom field to model the spatio-temporal correlations amongthe contexts, and proposes a speed-up algorithm to learn theweaknesses in the correlations. Even if the adversaries are strongenough to know the filtering system and the weaknesses, PLPcan still provably preserves privacy, with little computationalcost for online operations. PLP is evaluated and validated overtwo real-world smartphone context traces of 34 users. 
Theexperimental results show that PLP efficiently protects privacywithout sacrificing much utility.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124725160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haisheng Yu, Keqiu Li, Heng Qi, Wenxin Li, Xiaoyi Tao
Traditional networks are surprisingly fragile and difficult to manage. Software Defined Networking (SDN) has gained significant attention from both academia and industry, as it simplifies network management through centralized configuration. Existing work primarily focuses on networks of limited scope such as data centers and enterprises, which hinders the development of SDN in large-scale network environments. One way of enabling communication between data centers, enterprises, and ISPs in a large-scale network is to establish a standard communication mechanism between these entities. In this paper, we propose Zebra, a framework for enabling communication between different SDN domains. Zebra has two modules: the Heterogeneous Controller Management (HCM) module and the Domain Relationships Management (DRM) module. HCM collects network information from a group of controllers with no interconnection and generates a domain-wide network view. DRM collects network information from other domains to generate a global network view. Moreover, HCM supports different SDN controllers, such as Floodlight and Maestro. To test this framework, we develop a prototype system and present some experimental results.
{"title":"Zebra: An East-West Control Framework for SDN Controllers","authors":"Haisheng Yu, Keqiu Li, Heng Qi, Wenxin Li, Xiaoyi Tao","doi":"10.1109/ICPP.2015.70","DOIUrl":"https://doi.org/10.1109/ICPP.2015.70","url":null,"abstract":"Traditional networks are surprisingly fragile and difficult to manage. Software Defined Networking (SDN) gained significant attention from both academia and industry, as if simplify network management through centralized configuration. Existing work primarily focuses on networks of limited scope such as data-centers and enterprises, which makes the development of SDN hindered when it comes to large-scale network environments. One way of enabling communication between data-centers, enterprises and ISPs in a large-scale network is to establish a standard communication mechanism between these entities. In this paper, we propose Zebra, a framework for enabling communication between different SDN domains. Zebra has two modules: Heterogeneous Controller Management (HCM) module and Domain Relationships Management (DRM) module. HCM collects network information from a group of controllers with no interconnection and generate a domain-wide network view. DRM collects network information from other domains to generate a global-wide network view. Moreover, HCM supports different SDN controllers, such as floodlight, maestro and so on. 
To test this framework, we develop a prototype system, and give some experimental results.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130378489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
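The two-level view construction described in the Zebra abstract can be sketched in Python as follows. The abstract does not say how controllers report topology, so this sketch assumes each controller reports a set of undirected links between named switches. The function `merge_views` and all switch names are hypothetical.

```python
def merge_views(views):
    """Union several link sets into one view, treating each link as
    undirected so that (a, b) and (b, a) count as the same link."""
    merged = set()
    for links in views:
        for a, b in links:
            merged.add(tuple(sorted((a, b))))
    return merged


# HCM-like step: combine the partial views of controllers in one domain.
controller_1 = {("s1", "s2")}
controller_2 = {("s2", "s1"), ("s2", "s3")}
domain_a = merge_views([controller_1, controller_2])

# DRM-like step: combine domain-wide views, including an inter-domain link.
domain_b = {("t1", "t2"), ("s3", "t1")}
global_view = merge_views([domain_a, domain_b])

print(sorted(domain_a))
print(sorted(global_view))
```

A set of normalized link tuples keeps the merge idempotent, so reports that overlap between controllers do not create duplicate links.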
Existing Data Center Network (DCN) architectures are classified into two categories: switch-centric and server-centric architectures. In switch-centric DCNs, routing intelligence is placed on switches, and each server usually uses only one port of the Network Interface Card (NIC) to connect to the network. In server-centric DCNs, switches are only used as crossbars, and routing intelligence is placed on servers, where multiple NIC ports may be used. In this paper, we formally introduce a new category of DCN architectures: the dual-centric DCN architectures, where routing intelligence can be placed on both switches and servers. We propose two typical dual-centric DCN architectures, FSquare and Rectangle, both of which are based on the folded Clos topology. FSquare is a high-performance DCN architecture in which the diameter is small and the bisection bandwidth is large; however, the DCN power consumption per server in FSquare is high. Rectangle significantly reduces the DCN power consumption per server compared to FSquare, at the cost of some performance: Rectangle has a larger diameter and a smaller bisection bandwidth. By investigating FSquare and Rectangle, and by comparing them with existing architectures, we demonstrate that these two novel dual-centric architectures enjoy the advantages of both switch-centric and server-centric designs, have various desirable properties for practical data centers, and provide flexible choices in designing DCN architectures.
{"title":"Dual-centric Data Center Network Architectures","authors":"Dawei Li, Jie Wu, Zhiyong Liu, Fa Zhang","doi":"10.1109/ICPP.2015.77","DOIUrl":"https://doi.org/10.1109/ICPP.2015.77","url":null,"abstract":"Existing Data Center Network (DCN) architectures are classified into two categories: switch-centric and server-centric architectures. In switch-centric DCNs, routing intelligence is placed on switches, each server usually uses only one port of the Network Interface Card (NIC) to connect to the network. In server-centric DCNs, switches are only used as cross-bars, and routing intelligence is placed on servers, where multiple NIC ports may be used. In this paper, we formally introduce a new category of DCN architectures: the dual-centric DCN architectures, where routing intelligence can be placed on both switches and servers. We propose two typical dual-centric DCN architectures: FSquare and Rectangle, both of which are based on the folded Clos topology. FSquare is a high performance DCN architecture, in which the diameter is small and the bisection bandwidth is large, however, the DCN power consumption per server in FSquare is high. Rectangle significantly reduces the DCN power consumption per server, compared to FSquare, at the sacrifice of some performances, thus, Rectangle has a larger diameter and a smaller bisection bandwidth. 
By investigating FSquare and Rectangle, and by comparing them with existing architectures, we demonstrate that, these two novel dual-centric architectures enjoy the advantages of both switch-centric designs and server-centric designs, have various nice properties for practical data centers, and provide flexible choices in designing DCN architectures.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130614283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Feng Wang, Hao Jiang, Ke Zuo, Xing Su, Jingling Xue, Canqun Yang
This paper presents the design and implementation of a highly efficient Double-precision General Matrix Multiplication (DGEMM) based on OpenBLAS for 64-bit ARMv8 eight-core processors. We adopt a theory-guided approach by first developing a performance model for this architecture and then using it to guide our exploration. The key enabler for a highly efficient DGEMM is a highly optimized inner kernel, GEBP, developed in assembly language. We have obtained GEBP by (1) maximizing its compute-to-memory access ratios across all levels of the memory hierarchy in the ARMv8 architecture, with its performance-critical block sizes being determined analytically, and (2) optimizing its computations by exploiting loop unrolling, instruction scheduling, and software-implemented register rotation, and by taking advantage of A64 instructions to support efficient FMA operations, data transfers, and prefetching. We have compared our DGEMM implemented in OpenBLAS with another implemented in ATLAS (also in terms of a highly optimized GEBP in assembly). Our implementation outperforms the one in ATLAS by improving the peak performance (efficiency) of DGEMM from 3.88 Gflops (80.9%) to 4.19 Gflops (87.2%) on one core and from 30.4 Gflops (79.2%) to 32.7 Gflops (85.3%) on eight cores. These results translate into substantial performance (efficiency) improvements of 7.79% on one core and 7.70% on eight cores. In addition, the efficiency of our implementation on one core is very close to the theoretical upper bound of 91.5% obtained from micro-benchmarking. Our parallel implementation achieves good performance and scalability under varying thread counts across the range of matrix sizes evaluated.
{"title":"Design and Implementation of a Highly Efficient DGEMM for 64-Bit ARMv8 Multi-core Processors","authors":"Feng Wang, Hao Jiang, Ke Zuo, Xing Su, Jingling Xue, Canqun Yang","doi":"10.1109/ICPP.2015.29","DOIUrl":"https://doi.org/10.1109/ICPP.2015.29","url":null,"abstract":"This paper presents the design and implementation of a highly efficient Double-precision General Matrix Multiplication (DGEMM) based on Open BLAS for 64-bit ARMv8 eight-core processors. We adopt a theory-guided approach by first developing a performance model for this architecture and then using it to guide our exploration. The key enabler for a highly efficient DGEMM is a highly-optimized inner kernel GEBP developed in assembly language. We have obtained GEBP by (1) maximizing its compute-to-memory access ratios across all levels of the memory hierarchy in the ARMv8 architecture with its performance-critical block sizes being determined analytically, and (2) optimizing its computations through exploiting loop unrolling, instruction scheduling and software-implemented register rotation and taking advantage of A64 instructions to support efficient FMA operations, data transfers and prefetching. We have compared our DGEMM implemented in Open BLAS with another implemented in ATLAS (also in terms of a highly-optimized GEBP in assembly). Our implementation outperforms the one in ALTAS by improving the peak performance (efficiency) of DGEMM from 3.88 Gflops (80.9%) to 4.19 Gflops (87.2%) on one core and from 30.4 Gflops (79.2%) to 32.7 Gflops (85.3%) on eight cores. These results translate into substantial performance (efficiency) improvements by 7.79% on one core and 7.70% on eight cores. In addition, the efficiency of our implementation on one core is very close to the theoretical upper bound 91.5% obtained from micro-benchmarking. 
Our parallel implementation achieves good performance and scalability under varying thread counts across a range of matrix sizes evaluated.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114575252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
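The improvement percentages quoted in the DGEMM abstract can be reproduced from its own figures. The short Python check below uses only numbers from the abstract. The helper `relative_gain` and the "implied peak" value (measured Gflops divided by efficiency) are illustrative additions, not something the abstract states.

```python
# (Gflops, efficiency in %) as reported in the abstract.
results = {
    "one core": {"atlas": (3.88, 80.9), "openblas": (4.19, 87.2)},
    "eight cores": {"atlas": (30.4, 79.2), "openblas": (32.7, 85.3)},
}


def relative_gain(old, new):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0


for label, r in results.items():
    eff_gain = relative_gain(r["atlas"][1], r["openblas"][1])
    # Peak implied by the figures: measured Gflops / efficiency.
    implied_peak = r["openblas"][0] / (r["openblas"][1] / 100.0)
    print(f"{label}: efficiency gain {eff_gain:.2f}%, "
          f"implied peak {implied_peak:.1f} Gflops")
```

The computed efficiency gains round to 7.79% and 7.70%, matching the abstract, which suggests those percentages are ratios of the efficiency figures.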
Jonathan Lejeune, L. Arantes, Julien Sopena, Pierre Sens
Generalized distributed mutual exclusion algorithms allow processes to concurrently access a set of shared resources. However, they must ensure exclusive access to each resource. In order to avoid deadlocks, many of them rely on the strong assumption of prior knowledge about conflicts between processes' requests. Other approaches, which do not require such knowledge, exploit broadcast mechanisms or a global lock, degrading message complexity and synchronization cost. In this paper we propose a new solution for shared resource allocation which reduces the communication between non-conflicting processes without prior knowledge of process conflicts. Performance evaluation results show that our solution improves the resource use rate by a factor of up to 20 compared to a global-lock-based algorithm.
{"title":"Reducing Synchronization Cost in Distributed Multi-resource Allocation Problem","authors":"Jonathan Lejeune, L. Arantes, Julien Sopena, Pierre Sens","doi":"10.1109/ICPP.2015.63","DOIUrl":"https://doi.org/10.1109/ICPP.2015.63","url":null,"abstract":"Generalized distributed mutual exclusion algorithms allow processes to concurrently access a set of shared resources. However, they must ensure an exclusive access to each resource. In order to avoid deadlocks, many of them are based on the strong assumption of a prior knowledge about conflicts between processes' requests. Some other approaches, which do not require such a knowledge, exploit broadcast mechanisms or a global lock, degrading message complexity and synchronization cost. We propose in this paper a new solution for shared resources allocation which reduces the communication between non-conflicting processes without a prior knowledge of processes conflicts. Performance evaluation results show that our solution improves resource use rate by a factor up to 20 compared to a global lock based algorithm.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114723792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wen Xiong, Zhibin Yu, L. Eeckhout, Zhengdong Bei, Fan Zhang, Chengzhong Xu
Data analytics is at the core of the supply chain for both products and services in modern economies and societies. Big data workloads, however, are placing unprecedented demands on computing technologies, calling for a deep understanding and characterization of these emerging workloads. In this paper, we propose ShenZhen Transportation System (SZTS), a novel big data Hadoop benchmark suite comprised of real-life transportation analysis applications with real-life input data sets from Shenzhen in China. SZTS uniquely focuses on a specific, real-life application domain, whereas other existing Hadoop benchmark suites, such as HiBench and CloudRank-D, consist of generic algorithms with synthetic inputs. We perform a cross-layer workload characterization at both the job and microarchitecture levels, revealing unique characteristics of SZTS compared to existing Hadoop benchmarks as well as the general-purpose multi-core PARSEC benchmarks. We also study the sensitivity of workload behavior with respect to input data size, and propose a methodology for identifying representative input data sets.
{"title":"SZTS: A Novel Big Data Transportation System Benchmark Suite","authors":"Wen Xiong, Zhibin Yu, L. Eeckhout, Zhengdong Bei, Fan Zhang, Chengzhong Xu","doi":"10.1109/ICPP.2015.91","DOIUrl":"https://doi.org/10.1109/ICPP.2015.91","url":null,"abstract":"Data analytics is at the core of the supply chain for both products and services in modern economies and societies. Big data workloads however, are placing unprecedented demands on computing technologies, calling for a deep understanding and characterization of these emerging workloads. In this paper, we propose Shen Zhen Transportation System (SZTS), a novel big data Hadoop benchmark suite comprised of real-life transportation analysis applications with real-life input data sets from Shenzhen in China. SZTS uniquely focuses on a specific and real-life application domain whereas other existing Hadoop benchmark suites, such as Hi Bench and Cloud Rank-D, consist of generic algorithms with synthetic inputs. We perform a cross-layer workload characterization at both the job and micro architecture level, revealing unique characteristics of SZTS compared to existing Hadoop benchmarks as well as general-purpose multi-core PARSEC benchmarks. We also study the sensitivity of workload behavior with respect to input data size, and propose a methodology for identifying representative input data sets.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130751256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mengran Fan, Haipeng Jia, Yunquan Zhang, Xiaojing An, Ting Cao
Sharpness is an algorithm used to sharpen images. As image size, resolution, and real-time processing requirements increase, the performance of sharpness needs to improve greatly. The independent per-pixel calculation in sharpness offers a good opportunity to use a GPU to accelerate it substantially. However, one challenge in porting it to a GPU is that sharpness executes in several stages. Each stage has its own characteristics, either with or without data dependencies on other stages. Based on those characteristics, this paper proposes a complete solution to implement and optimize sharpness on a GPU. Our solution includes five major and effective techniques: Data Transfer Optimization, Kernel Fusion, Vectorization for Data Locality, Border Optimization, and Reduction Optimization. Experiments show that, compared to a well-optimized CPU version, our GPU solution reaches a speedup of 10.7x to 69.3x for different image sizes on an AMD FirePro W8000 GPU.
{"title":"Optimizing Image Sharpening Algorithm on GPU","authors":"Mengran Fan, Haipeng Jia, Yunquan Zhang, Xiaojing An, Ting Cao","doi":"10.1109/ICPP.2015.32","DOIUrl":"https://doi.org/10.1109/ICPP.2015.32","url":null,"abstract":"Sharpness is an algorithm used to sharpen images. As the increase of image size, resolution, and the requirements for real-time processing, the performance of sharpness needs to get improved greatly. The independent pixel calculation of sharpness makes a good opportunity to use GPU to largely accelerate the performance. However, to transplant it to GPU, one challenge is that sharpness involves several stages to execute. Each stage has its own characteristics, either with or without data dependency to other stages. Based on those characteristics, this paper proposes a complete solution to implement and optimize sharpness on GPU. Our solution includes five major and effective techniques: Data Transfer Optimization, Kernel Fusion, Vectorization for Data Locality, Border and Reduction Optimization. Experiments show that, compared to a well-optimized CPU version, our GPU solution can reach 10.7~ 69.3 times speedup for different image sizes on an AMD Fire Pro W8000 GPU.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121459937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}