Provenance captured from E-Science experimentation is often large and complex, for instance, from agent-based simulations that have tens of thousands of heterogeneous components interacting over extended time periods. My dissertation studies the use of E-Science provenance at scale. My initial research studied the visualization of large provenance graphs and proposed an abstract representation of provenance that supports useful data mining. Recent work analyzes large provenance data generated from agent-based simulations on a single machine. In continuation, I propose stream processing techniques to support the continuous, real-time analysis of data provenance captured from agent-based simulations on HPC systems, which has unprecedented volume and complexity.
{"title":"Big Data Provenance Analysis and Visualization","authors":"Peng Chen, Beth Plale","doi":"10.1109/CCGrid.2015.85","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.85","url":null,"abstract":"Provenance captured from E-Science experimentation is often large and complex, for instance, from agent-based simulations that have tens of thousands of heterogeneous components interacting over extended time periods. The subject of study of my dissertation is the use of E-Science provenance at scale. My initial research studied the visualization of large provenance graphs and proposed an abstract representation of provenance that supports useful data mining. Recent work involves analyzing large provenance data generated from agent-based simulations on a single machine. In continuation, I propose stream processing techniques to support the continuous and real-time analysis of data provenance, which is captured from agent based simulations on HPC and thus has unprecedented volume and complexity.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"20 1","pages":"797-800"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73678529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
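The proposed continuous analysis is easiest to picture as a windowed computation over a stream of provenance edges. The sketch below is purely illustrative and not the dissertation's actual design; the (producer, artifact) edge format and the "most active producer" query are assumptions:

```python
from collections import Counter, deque

class ProvenanceStream:
    """Toy sliding-window analysis over a stream of provenance edges.

    Each edge is a (producer, artifact) pair; we track which producers
    are most active inside the current window -- a stand-in for the kind
    of continuous, real-time provenance analytics proposed above.
    """

    def __init__(self, window=1000):
        self.edges = deque(maxlen=window)   # auto-evicts the oldest edge
        self.counts = Counter()

    def push(self, producer, artifact):
        if len(self.edges) == self.edges.maxlen:
            evicted_producer, _ = self.edges[0]   # about to be evicted
            self.counts[evicted_producer] -= 1
        self.edges.append((producer, artifact))
        self.counts[producer] += 1

    def top_producers(self, k=3):
        return [p for p, n in self.counts.most_common(k) if n > 0]
```

A real deployment would run many such operators in parallel over partitions of the provenance stream; the point here is only that window-bounded state keeps memory constant regardless of stream length.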
Driving is an integral part of our everyday lives: the average driving time of people globally has increased to 84 minutes every day, a time when people are uniquely vulnerable. A number of studies have shown that mobile crowd sensing in vehicular social networks (VSNs) can be used effectively for many purposes and bring large economic benefits, e.g., safety improvement and traffic management. This paper presents our efforts toward context-aware mobile crowd sensing in VSNs. First, we introduce a novel application-oriented service collaboration (ASCM) model that automatically and efficiently matches multiple users with multiple mobile crowd sensing tasks in VSNs. Then, to handle users' dynamic contexts in VSNs, we propose a context information management model that aims to enable mobile crowd sensing applications to autonomously match appropriate services and information with different users (requesters and participants) in crowdsensing.
{"title":"Towards Context-Aware Mobile Crowdsensing in Vehicular Social Networks","authors":"Xiping Hu, Victor C. M. Leung","doi":"10.1109/CCGrid.2015.155","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.155","url":null,"abstract":"Driving is an integral part of our everyday lives, and the average driving time of people globally is increasing to 84 minutes everyday, which is a time when people are uniquely vulnerable. A number of research works have identified that mobile crowd sensing in vehicular social networks (VSNs) can be effectively used for many purposes and bring huge economic benefits, e.g., safety improvement and traffic management. This paper presents our effort that toward context-aware mobile crowd sensing in VSNs. First, we introduce a novel application-oriented service collaboration (ASCM) model which can automatically match multiple users with multiple mobile crowd sensing tasks in VSNs in an efficient manner. After that, for users' dynamic contexts of VSNs, we proposes a context information management model, that aims to enable the mobile crowd sensing applications to autonomously match appropriate service and information with different users (requesters and participants) in crowdsensing.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"72 1","pages":"749-752"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73689067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
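The abstract does not specify the ASCM matching algorithm. As a purely hypothetical illustration of user-task matching, a greedy matcher that assigns each sensing task to the available user with the best context-fit score might look like this (`score` is an assumed, caller-supplied function, not part of the paper's model):

```python
def match_tasks(users, tasks, score):
    """Greedy one-to-one matching of sensing tasks to users.

    `score(user, task)` is an assumed context-fit function (e.g. based on
    route, time, and vehicle capabilities); each task is assigned to the
    best still-unassigned user. Hypothetical, not the paper's ASCM model.
    """
    available = set(users)
    assignment = {}
    for task in tasks:
        if not available:
            break  # more tasks than users: leave the rest unassigned
        best = max(available, key=lambda u: score(u, task))
        assignment[task] = best
        available.remove(best)
    return assignment
```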
Ping Xiang, Yi Yang, Mike Mantor, Norman Rubin, Huiyang Zhou
Many-core architectures such as graphics processing units (GPUs) rely on thread-level parallelism (TLP) to overcome pipeline hazards. Consequently, each core in a many-core processor employs a relatively simple in-order pipeline with limited capability to exploit instruction-level parallelism (ILP). In this paper, we study the impact of ILP techniques, including data bypassing, scoreboarding, and branch prediction, on throughput-oriented many-core architectures. We show that these ILP techniques significantly reduce the performance dependency on TLP. This is especially useful for applications whose resource usage limits the number of threads the hardware can run concurrently. Furthermore, ILP techniques reduce the demand on the on-chip resources needed to support high TLP. Given the workload-dependent impact of ILP, we propose a heterogeneous GPGPU architecture consisting of both cores designed for high TLP and cores customized with ILP techniques. Our results show that our heterogeneous GPU architecture achieves high throughput as well as high energy and area efficiency compared to homogeneous designs.
{"title":"Revisiting ILP Designs for Throughput-Oriented GPGPU Architecture","authors":"Ping Xiang, Yi Yang, Mike Mantor, Norman Rubin, Huiyang Zhou","doi":"10.1109/CCGRID.2015.14","DOIUrl":"https://doi.org/10.1109/CCGRID.2015.14","url":null,"abstract":"Many-core architectures such as graphics processing units (GPUs) rely on thread-level parallelism (TLP)to overcome pipeline hazards. Consequently, each core in a many-core processor employs a relatively simple in-order pipeline with limited capability to exploit instruction-level parallelism (ILP). In this paper, we study the ILP impact on the throughput-oriented many-core architecture, including data bypassing, score boarding and branch prediction. We show that these ILP techniques significantly reduce the performance dependency on TLP. This is especially useful for applications, whose resource usage limits the hardware to run a high number of threads concurrently. Furthermore, ILP techniques reduce the demand on on-chip resource to support high TLP. Given the workload-dependent impact from ILP, we propose heterogeneous GPGPU architecture, consisting of both the cores designed for high TLP and those customized with ILPtechniques. Our results show that our heterogeneous GPUarchitecture achieves high throughput as well as high energy and area-efficiency compared to homogenous designs.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"20 1","pages":"121-130"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78814355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
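Scoreboarding, one of the ILP techniques studied, can be pictured with a toy hazard check. This is a generic textbook sketch, not the paper's hardware design:

```python
def can_issue(instr, pending_writes):
    """Toy scoreboard check for an in-order issue stage.

    `instr` is (source_registers, destination_register). The instruction
    may issue only if no source or destination register has an outstanding
    write (RAW/WAW hazard). Without such a check, the core must hide the
    hazard by switching threads -- the dependence on TLP that the paper's
    ILP techniques reduce.
    """
    srcs, dst = instr
    if dst in pending_writes:                            # WAW hazard
        return False
    return all(r not in pending_writes for r in srcs)    # RAW hazards
```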
User-level failure mitigation (ULFM) is becoming the front-running solution for process fault tolerance in MPI. While not yet adopted into the MPI standard, it is being used by applications and libraries and is being considered by the MPI Forum for future inclusion into MPI itself. In this paper, we introduce an implementation of ULFM in MPICH, a high-performance and widely portable implementation of the MPI standard. We demonstrate that while still a reference implementation, the runtime cost of the new API calls introduced is relatively low.
{"title":"Lessons Learned Implementing User-Level Failure Mitigation in MPICH","authors":"Wesley Bland, Huiwei Lu, Sangmin Seo, P. Balaji","doi":"10.1109/CCGrid.2015.51","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.51","url":null,"abstract":"User-level failure mitigation (ULFM) is becoming the front-running solution for process fault tolerance in MPI. While not yet adopted into the MPI standard, it is being used by applications and libraries and is being considered by the MPI Forum for future inclusion into MPI itself. In this paper, we introduce an implementation of ULFM in MPICH, a high-performance and widely portable implementation of the MPI standard. We demonstrate that while still a reference implementation, the runtime cost of the new API calls introduced is relatively low.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"15 1","pages":"1123-1126"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78499721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hao Jin, Hong Jiang, Ke Zhou, Ronglei Wei, Dongliang Lei, Ping Huang
Data outsourcing relieves cloud users of the heavy burden of infrastructure management and maintenance. However, handing over control of data to untrusted cloud servers significantly complicates security. Conventional signature verification, widely adopted in cryptographic storage systems, guarantees the integrity only of retrieved data; it does not protect data that are rarely or never accessed. This paper integrates proof-of-storage techniques with support for data dynamics into a cryptographic storage design to provide full integrity for outsourced data. In addition, we provide an instantaneous freshness check for retrieved data to defend against potential replay attacks. We achieve these goals by designing flexible block structures and combining broadcast encryption, key regression, Merkle hash trees, proofs of storage, and fine-grained access control policies to provide a secure storage service for outsourced data. Experimental evaluation of our prototype shows that the cryptographic cost and throughput are reasonable and acceptable.
{"title":"Full Integrity and Freshness for Outsourced Storage","authors":"Hao Jin, Hong Jiang, Ke Zhou, Ronglei Wei, Dongliang Lei, Ping Huang","doi":"10.1109/CCGrid.2015.90","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.90","url":null,"abstract":"Data outsourcing relieves cloud users of the heavy burden of infrastructure management and maintenance. However, the handover of data control to untrusted cloud servers significantly complicates the security issues. Conventional signature verification widely adopted in cryptographic storage system only guarantees the integrity of retrieved data, for those rarely or never accessed data, it does not work. This paper integrates proof of storage technique with data dynamics support into cryptographic storage design to provide full integrity for outsourced data. Besides, we provide instantaneously freshness check for retrieved data to defend against potential replay attacks. We achieve these goals by designing flexible block structures and combining broadcast encryption, key regression, Merkle hash tree, proof of storage and fine-grained access control policies together to provide a secure storage service for outsourced data. Experimental evaluation of our prototype shows that the cryptographic cost and throughput is reasonable and acceptable.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"127 1","pages":"362-371"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76152383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
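Among the building blocks the abstract lists, the Merkle hash tree is the easiest to sketch: the verifier stores only the root hash, yet modifying any outsourced block changes the root and is therefore detectable. A minimal illustration (not the paper's actual block structure or tree layout):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Merkle hash tree root over a list of data blocks.

    Leaves are hashes of the blocks; each internal node hashes the
    concatenation of its two children; an odd level duplicates its
    last node. The single root hash commits to every block.
    """
    level = [_h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:                   # odd level: duplicate last node
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```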
Kashif Nizam Khan, Filip Nyback, Zhonghong Ou, J. Nurminen, T. Niemi, G. Eulisse, P. Elmer, David Abdurachmanov
Energy efficiency has become a primary concern for data centers in recent years. Understanding where energy is spent within software is fundamental to the study of energy efficiency as a whole. In this paper, we take a first step in this direction by building an energy profiling module on top of IgProf, an application profiler developed at CERN for scientific computing workloads. The energy profiling module is based on sampling and obtains energy measurements from the Running Average Power Limit (RAPL) interface present on recent Intel processors. Initial profiling results for a single-threaded program demonstrate the approach's potential, showing a close correlation between the execution time and the energy spent within a function.
{"title":"Energy Profiling Using IgProf","authors":"Kashif Nizam Khan, Filip Nyback, Zhonghong Ou, J. Nurminen, T. Niemi, G. Eulisse, P. Elmer, David Abdurachmanov","doi":"10.1109/CCGrid.2015.118","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.118","url":null,"abstract":"Energy efficiency has become a primary concern for data centers in recent years. Understanding where the energy has been spent within a software is fundamental for energy-efficiency study as a whole. In this paper, we take the first step towards this direction by building an energy profiling module on top of IgProf. IgProf is an application profiler developed at CERN for scientific computing workloads. The energy profiling module is based on sampling and obtains energy measurements from the Running Average Power Limit (RAPL) interface present on the latest Intel processors. The initial profiling results of a single-threaded program demonstrates potential, showing a close correlation between the execution time and the energy spent within a function.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"15 1","pages":"1115-1118"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79339526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
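The sampling approach can be sketched as follows: each sample charges the energy consumed since the previous sample (which in the real module would come from reading RAPL counters) to the function observed at that moment. The sample format below is a simplifying assumption, not IgProf's actual data model:

```python
from collections import defaultdict

def attribute_energy(samples):
    """Sampling-based energy attribution in the spirit of the IgProf module.

    Each sample is (function_on_top_of_stack, joules_since_last_sample);
    the energy delta is charged to the sampled function, so functions that
    run longer (or hotter) accumulate more of the total energy.
    """
    per_function = defaultdict(float)
    for func, joules in samples:
        per_function[func] += joules
    return dict(per_function)
```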
The Intel Initial Many-Core Instructions (IMCI) for Xeon Phi introduce hardware-implemented Gather and Scatter (G/S) instructions that load/store the contents of SIMD registers from/to non-contiguous memory locations. However, they can be a key performance bottleneck on Xeon Phi. Modeling G/S can provide insight into performance on Xeon Phi; however, the existing solution requires a hand-written assembly implementation. We therefore model G/S with hardware performance counters, which can be profiled by tools such as PAPI. We count Address Generation Interlock (AGI) events to estimate the number of G/S instructions, estimate the average latency of a G/S instruction with VPU_DATA_READ, and combine the two to model the total latency of G/S. We applied our model to a 3D 7-point stencil, and the results showed that G/S accounted for nearly 40% of the total kernel time. We also validated the model by implementing a G/S-free version with intrinsics. The contribution of this work is a performance model for G/S built from hardware counters. We believe the model is also generally applicable to CPUs.
{"title":"Modeling Gather and Scatter with Hardware Performance Counters for Xeon Phi","authors":"James Lin, Akira Nukada, S. Matsuoka","doi":"10.1109/CCGrid.2015.59","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.59","url":null,"abstract":"Intel Initial Many-Core Instructions (IMCI) for Xeon Phi introduces hardware-implemented Gather and Scatter (G/S) load/store contents of SIMD registers from/to non-contiguous memory locations. However, they can be one of key performance bottlenecks for Xeon Phi. Modelling G/S can provide insights to the performance on Xeon Phi, however, the existing solution needs a hand-written assembly implementation. Therefore, we modeled G/S with hardware performance counters which can be profiled by the tools like PAPI. We profiled Address Generation Interlock (AGI) events as the number of G/S, estimated the average latency of G/S with VPU_DATA_READ, and combined them to model the total latencies of G/S. We applied our model to the 3D 7-point stencil and the result showed G/S spent nearly 40% of total kernel time. We also validated the model by implementing a G/S- free version with intrinsics. The contribution of the work is a performance model for G/S built with hardware counters. We believe the model can be generally applicable to CPU as well.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"40 1","pages":"713-716"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84557296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
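The counter-based model reduces to simple arithmetic: estimated G/S instruction count (from AGI events) times estimated average per-instruction latency, expressed as a fraction of kernel time. A sketch under those assumptions (the exact counter combination in the paper may differ):

```python
def gs_fraction(agi_count, avg_gs_latency_cycles, kernel_cycles):
    """Fraction of kernel time spent in gather/scatter, per the counter model:
    AGI events approximate the G/S instruction count, the average latency per
    instruction is estimated separately (e.g. via VPU_DATA_READ), and total
    G/S time is their product. A sketch of the model, not its exact formula.
    """
    return (agi_count * avg_gs_latency_cycles) / kernel_cycles
```

For instance, 1000 G/S instructions at roughly 40 cycles each inside a 100,000-cycle kernel would put G/S at 40% of kernel time, in line with the stencil result reported above.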
The Smith-Waterman (SW) algorithm is universally used for database search owing to its high sensitivity. The widespread impact of the algorithm is reflected in the more than 8,000 citations it has received over the past decades. However, the algorithm has prohibitively high time and space complexity, and so poses significant computational challenges. Apache Spark is an increasingly popular fast big-data analytics engine that has been highly successful in implementing large-scale data-intensive applications on commodity hardware. This paper presents SparkSW, the first reported system to implement the SW algorithm on the Apache Spark distributed computing framework, using a handful of off-the-shelf workstations. The scalability and load-balancing efficiency of the system are investigated with a realistic ultra-large database, the state-of-the-art UniRef100. The experimental results indicate that 1) SparkSW balances load adaptively across parallel workloads and scales extremely well as computing resources increase, and 2) SparkSW provides a fast and universal option for highly sensitive biological sequence alignment. The success of SparkSW also shows that the Apache Spark framework offers an efficient way to cope with the ever-increasing sizes of biological sequence databases, especially those generated by second-generation sequencing technologies.
{"title":"SparkSW: Scalable Distributed Computing System for Large-Scale Biological Sequence Alignment","authors":"Guoguang Zhao, Cheng Ling, Donghong Sun","doi":"10.1109/CCGrid.2015.55","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.55","url":null,"abstract":"The Smith-Waterman (SW) algorithm is universally used for a database search owing to its high sensitively. The widespread impact of the algorithm is reflected in over 8000 citations that the algorithm has received in the past decades. However, the algorithm is prohibitively high in terms of time and space complexity, and so poses significant computational challenges. Apache Spark is an increasingly popular fast big data analytics engine, which has been highly successful in implementing large-scale data-intensive applications on commercial hardware. This paper presents the first ever reported system that implements the SW algorithm on Apache Spark based distributed computing framework, with a couple of off-the-shelf workstations, which is named as SparkSW. The scalability and load-balancing efficiency of the system are investigated by realistic ultra-large database from the state-of-the-art UniRef100. The experimental results indicate that 1) SparkSW is load-balancing for parallel adaptive on workloads and scales extremely well with the increases of computing resource, 2) SparkSW provides a fast and universal option high sensitively biological sequence alignments. The success of SparkSW also reveals that Apache Spark framework provides an efficient solution to facilitate coping with ever increasing sizes of biological sequence databases, especially generated by second-generation sequencing technologies.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"22 1","pages":"845-852"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85087796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
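The kernel that SparkSW distributes across Spark workers is the classic Smith-Waterman scoring recurrence. A minimal single-machine version with a linear gap penalty (the paper's exact scoring scheme is not stated; the parameters here are illustrative):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score with a linear gap penalty.

    Keeps only the previous matrix row, since each cell depends on its
    left, upper, and upper-left neighbors; scores are clamped at zero so
    the alignment is local.
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The quadratic-time inner loop over every (i, j) cell is what makes single-machine runs on ultra-large databases such as UniRef100 prohibitive, and what motivates distributing the work.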
Cloud computing has become a dominant computing paradigm, providing elastic, affordable computing resources to end users. Owing to the increased computing power of modern multi-/many-core machines, data centers often co-locate multiple virtual machines (VMs) on one physical machine, resulting in co-tenancy and in resource sharing and competition. Applications or VMs co-located on one physical machine can interfere with each other despite the promise of performance isolation through virtualization. Modeling and predicting co-run interference therefore becomes critical for data center job scheduling and QoS (Quality of Service) assurance. Co-run interference can be characterized by two metrics, sensitivity and pressure: the former denotes how an application's performance is affected by its co-running applications, and the latter measures how it impacts the performance of its co-running applications. This paper shows that sensitivity and pressure are both application- and architecture-dependent. Further, we propose a regression model that predicts an application's sensitivity and pressure across architectures with high accuracy.
{"title":"Modeling Cross-Architecture Co-Tenancy Performance Interference","authors":"Wei Kuang, Laura E. Brown, Zhenlin Wang","doi":"10.1109/CCGrid.2015.152","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.152","url":null,"abstract":"Cloud computing has become a dominant computing paradigm to provide elastic, affordable computing resources to end users. Due to the increased computing power of modern machines powered by multi/many-core computing, data centers often co-locate multiple virtual machines (VMs) into one physical machine, resulting in co-tenancy, and resource sharing and competition. Applications or VMs co-locating in one physical machine can interfere with each other despite of the promise of performance isolation through virtualization. Modelling and predicting co-run interference therefore becomes critical for data center job scheduling and QoS (Quality of Service) assurance. Co-run interference can be categorized into two metrics, sensitivity and pressure, where the former denotes how an application's performance is affected by its co-run applications, and the latter measures how it impacts the performance of its co-run applications. This paper shows that sensitivity and pressure are both application-and architecture dependent. Further, we propose a regression model that predicts an application's sensitivity and pressure across architectures with high accuracy. This regression model enables a data center scheduler to guarantee the QoS of a VM/application when it is scheduled to co-locate with another VMs/applications.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"194 1","pages":"231-240"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85473011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
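The paper's regression model is not specified beyond "regression". As a minimal stand-in, a one-predictor ordinary-least-squares fit mapping a profiled metric on one architecture to predicted sensitivity (or pressure) on another might look like this; the feature choice and single-predictor form are assumptions:

```python
def fit_line(xs, ys):
    """One-predictor ordinary least squares: slope and intercept minimizing
    squared error. A hypothetical stand-in for the paper's cross-architecture
    sensitivity/pressure regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept
```

A scheduler could then evaluate `predict` for a candidate co-location before placing the VM, admitting it only if the predicted interference keeps the QoS target satisfied.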
Shaoliang Peng, Xiangke Liao, Canqun Yang, Yutong Lu, Jie Liu, Yingbo Cui, Heng Wang, Chengkun Wu, Bingqiang Wang
Whole-genome re-sequencing plays a crucial role in biomedical studies. The emergence of genomic big data calls for an enormous amount of computing power. However, current computational methods are inefficient in utilizing available computational resources. In this paper, we address this challenge by optimizing utilization of the fastest supercomputer in the world, the TH-2. TH-2 features a neo-heterogeneous architecture in which each compute node is equipped with 2 Intel Xeon CPUs and 3 Intel Xeon Phi coprocessors. The heterogeneity and the massive amount of data to be processed pose great challenges for deploying the genome analysis software pipeline on TH-2. Runtime profiling shows that SOAP3-dp and SOAPsnp are the most time-consuming components (up to 70% of total runtime) in a typical genome-analysis pipeline. To optimize the whole pipeline, we first devise a number of parallelization and optimization strategies for SOAP3-dp and SOAPsnp, targeting each node to fully utilize the hardware resources provided by both the CPUs and the MICs. We also employ several scaling methods to reduce communication between nodes. We then scaled our method up on TH-2. With 8,192 nodes, the whole analysis took 8.37 hours to process a 300 TB dataset of whole-genome sequences from 2,000 human beings, a task that can take as long as 8 months on a commodity server. The speedup is about 700x.
{"title":"The Challenge of Scaling Genome Big Data Analysis Software on TH-2 Supercomputer","authors":"Shaoliang Peng, Xiangke Liao, Canqun Yang, Yutong Lu, Jie Liu, Yingbo Cui, Heng Wang, Chengkun Wu, Bingqiang Wang","doi":"10.1109/CCGrid.2015.46","DOIUrl":"https://doi.org/10.1109/CCGrid.2015.46","url":null,"abstract":"Whole genome re-sequencing plays a crucial role in biomedical studies. The emergence of genomic big data calls for an enormous amount of computing power. However, current computational methods are inefficient in utilizing available computational resources. In this paper, we address this challenge by optimizing the utilization of the fastest supercomputer in the world - TH-2 supercomputer. TH-2 is featured by its neo-heterogeneous architecture, in which each compute node is equipped with 2 Intel Xeon CPUs and 3 Intel Xeon Phi coprocessors. The heterogeneity and the massive amount of data to be processed pose great challenges for the deployment of the genome analysis software pipeline on TH-2. Runtime profiling shows that SOAP3-dp and SOAPsnp are the most time-consuming components (up to 70% of total runtime) in a typical genome-analyzing pipeline. To optimize the whole pipeline, we first devise a number of parallel and optimization strategies for SOAP3-dp and SOAPsnp, respectively targeting each node to fully utilize all sorts of hardware resources provided both by CPU and MIC. We also employ a few scaling methods to reduce communication between different nodes. We then scaled up our method on TH-2. With 8192 nodes, the whole analyzing procedure took 8.37 hours to finish the analysis of a 300 TB dataset of whole genome sequences from 2,000 human beings, which can take as long as 8 months on a commodity server. The speedup is about 700x.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"8 1","pages":"823-828"},"PeriodicalIF":0.0,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78334165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
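The reported speedup can be sanity-checked with back-of-the-envelope arithmetic, assuming 30-day months for the 8-month baseline:

```python
# Sanity check of the reported ~700x speedup:
# ~8 months on a commodity server vs 8.37 hours on 8192 TH-2 nodes.
baseline_hours = 8 * 30 * 24      # 5760 hours, assuming 30-day months
speedup = baseline_hours / 8.37   # ~688, consistent with "about 700x"
```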