After five decades of sustained progress, Moore's law appears to be reaching its limits. In order to sustain the dramatic improvements to which we have become accustomed, computing will need to transform to Kurzweil's sixth wave of computing. The supercomputing community will likely need to re-think most of its fundamental technologies and tools, spanning innovative materials and devices, circuits, system architectures, programming systems, system software, and applications. We already see evidence of this transition in the move to new architectures that employ heterogeneous processing, non-volatile memory, multimode memory hierarchies, and optical interconnection networks. In this talk, I will recap progress in these areas over the past three decades, discuss current solutions, and contemplate various future technologies that our community will need for continued progress in supercomputing.
{"title":"Preparing for Supercomputing's Sixth Wave","authors":"J. Vetter","doi":"10.1145/2907294.2911994","DOIUrl":"https://doi.org/10.1145/2907294.2911994","url":null,"abstract":"After five decades of sustained progress, Moore's law appears to be reaching its limits. In order to sustain the dramatic improvements to which we have become accustomed, computing will need to transform to Kurzweil's sixth wave of computing. The supercomputing community will likely need to re-think most of its fundamental technologies and tools, spanning innovative materials and devices, circuits, system architectures, programming systems, system software, and applications. We already see evidence of this transition in the move to new architectures that employ heterogeneous processing, non-volatile memory, multimode memory hierarchies, and optical interconnection networks. In this talk, I will recap progress in these areas over the past three decades, discuss current solutions, and contemplate various future technologies that our community will need for continued progress in supercomputing.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"76 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85521054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Geo-distributed cloud storage systems must tame complexity at many levels: providing uniform APIs for storage access, supporting flexible storage policies that meet a wide array of application metrics, handling uncertain network dynamics and access dynamism, and operating across many levels of heterogeneity both within and across data-centers. In this paper, we present an integrated solution called Wiera. Wiera extends our earlier cloud storage system, Tiera, which targets multi-tiered, policy-based storage within a single cloud, to the wide area and to multiple data-centers (even across different providers). Wiera enables the specification of global data management policies built on top of local Tiera policies. Such policies enable the user to optimize for cost, performance, reliability, durability, and consistency, both within and across data-centers, and to express the tradeoffs among them. A key aspect of Wiera is first-class support for dynamism due to changes in network conditions, workloads, and access patterns. Wiera policies can adapt to changes in user workload, poorly performing data tiers, failures, and changes in user metrics (e.g., cost). Wiera allows unmodified applications to reap the benefits of flexible data/storage policies by externalizing the policy specification. As far as we know, Wiera is the first geo-distributed cloud storage system that actively handles dynamism at run-time. We show how Wiera enables a rich specification of dynamic policies using a concise notation and describe the design and implementation of the system. We have implemented a Wiera prototype on multiple cloud environments, AWS and Azure, that illustrates the potential benefits of managing dynamics and of using multiple cloud storage tiers both within and across data-centers.
{"title":"Wiera: Towards Flexible Multi-Tiered Geo-Distributed Cloud Storage Instances","authors":"Kwangsung Oh, A. Chandra, J. Weissman","doi":"10.1145/2907294.2907322","DOIUrl":"https://doi.org/10.1145/2907294.2907322","url":null,"abstract":"Geo-distributed cloud storage systems must tame complexity at many levels: uniform APIs for storage access, supporting flexible storage policies that meet a wide array of application metrics, handling uncertain network dynamics and access dynamism, and operating across many levels of heterogeneity both within and across data-centers. In this paper, we present an integrated solution called Wiera. Wiera extends our earlier cloud storage system, Tiera, that is targeted to multi-tiered policy-based single cloud storage, to the wide-area and multiple data-centers (even across different providers). Wiera enables the specification of global data management policies built on top of local Tiera policies. Such policies enable the user to optimize for cost, performance, reliability, durability, and consistency, both within and across data-centers, and to express their tradeoffs. A key aspect of Wiera is first-class support for dynamism due to network, workload, and access patterns changes. Wiera policies can adapt to changes in user workload, poorly performing data tiers, failures, and changes in user metrics (e.g., cost). Wiera allows unmodified applications to reap the benefits of flexible data/storage policies by externalizing the policy specification. As far as we know, Wiera is the first geo-distributed cloud storage system which handles dynamism actively at run-time. We show how Wiera enables a rich specification of dynamic policies using a concise notation and describe the design and implementation of the system. We have implemented a Wiera prototype on multiple cloud environments, AWS and Azure, that illustrates potential benefits from managing dynamics and in using multiple cloud storage tiers both within and across data-centers.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"74 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75306171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
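The Wiera abstract describes policies that react at run-time to changes in metrics such as latency and cost. The sketch below shows one generic way such an externalized, metric-driven policy could be modelled as condition/action rules. It is not Wiera's notation or API: the `Rule` class, the metric names (`put_latency_ms`, `monthly_cost_usd`), the thresholds, and the action strings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A hypothetical policy rule: when `condition` holds, request `action`."""
    name: str
    condition: Callable[[Dict[str, float]], bool]
    action: str


def evaluate(rules: List[Rule], metrics: Dict[str, float]) -> List[str]:
    """Return the actions of every rule whose condition holds for the metrics."""
    return [rule.action for rule in rules if rule.condition(metrics)]


# Illustrative rules only; thresholds and action names are assumptions.
rules = [
    Rule("slow-tier", lambda m: m["put_latency_ms"] > 200, "switch_primary_tier"),
    Rule("over-budget", lambda m: m["monthly_cost_usd"] > 100, "move_cold_data_to_cheaper_tier"),
]

# A monitoring loop would call evaluate() periodically with fresh measurements.
actions = evaluate(rules, {"put_latency_ms": 350.0, "monthly_cost_usd": 40.0})
```

Keeping the rules as data outside the application is what lets an unmodified application benefit from a policy change: only the rule list and the monitored metrics change.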
Matrix factorization (MF) is used by many popular algorithms such as collaborative filtering. GPUs, with their massive core counts and high memory bandwidth, offer an opportunity to accelerate MF much further when their architectural characteristics are exploited appropriately. This paper presents cuMF, a CUDA-based matrix factorization library that optimizes the alternating least squares (ALS) method to solve very large-scale MF problems. cuMF uses a set of techniques to maximize performance on single and multiple GPUs. These techniques include smart access to sparse data that leverages the GPU memory hierarchy, using data parallelism in conjunction with model parallelism, minimizing the communication overhead among GPUs, and a novel topology-aware parallel reduction scheme. With only a single machine with four Nvidia GPU cards, cuMF can be 6-10 times as fast, and 33-100 times as cost-efficient, compared with state-of-the-art distributed CPU solutions. Moreover, cuMF can solve the largest matrix factorization problem ever reported in the current literature, with impressively good performance.
{"title":"Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs","authors":"Wei Tan, Liangliang Cao, L. Fong","doi":"10.1145/2907294.2907297","DOIUrl":"https://doi.org/10.1145/2907294.2907297","url":null,"abstract":"Matrix factorization (MF) is used by many popular algorithms such as collaborative filtering. GPU with massive cores and high memory bandwidth sheds light on accelerating MF much further when appropriately exploiting its architectural characteristics. This paper presents cuMF, a CUDA-based matrix factorization library that optimizes alternate least square (ALS) method to solve very large-scale MF. CuMF uses a set of techniques to maximize the performance on single and multiple GPUs. These techniques include smart access of sparse data leveraging GPU memory hierarchy, using data parallelism in conjunction with model parallelism, minimizing the communication overhead among GPUs, and a novel topology-aware parallel reduction scheme. With only a single machine with four Nvidia GPU cards, cuMF can be 6-10 times as fast, and 33-100 times as cost-efficient, compared with the state-of-art distributed CPU solutions. Moreover, cuMF can solve the largest matrix factorization problem ever reported in current literature, with impressively good performance.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"53 70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78112394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
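To make the alternating least squares (ALS) method named in the cuMF abstract concrete, here is a minimal CPU sketch in plain Python. It is not cuMF's CUDA implementation and uses none of its optimizations. It assumes the rank-1 special case, where each least-squares subproblem has a closed-form scalar solution. The function names, the regularization weight `lam`, and the toy data are illustrative assumptions.

```python
def als_rank1(ratings, n_users, n_items, lam=1e-6, iters=50):
    """Rank-1 ALS on observed entries given as (user, item, value) triples.

    Minimises sum (r - x[u] * theta[v])**2 + lam * (|x|^2 + |theta|^2)
    by alternately fixing one factor and solving for the other.
    """
    x = [1.0] * n_users       # user factors
    theta = [1.0] * n_items   # item factors
    for _ in range(iters):
        # Fix theta: each user's factor is an independent ridge regression.
        num, den = [0.0] * n_users, [lam] * n_users
        for u, v, r in ratings:
            num[u] += r * theta[v]
            den[u] += theta[v] ** 2
        x = [num[u] / den[u] for u in range(n_users)]
        # Fix x: each item's factor is an independent ridge regression.
        num, den = [0.0] * n_items, [lam] * n_items
        for u, v, r in ratings:
            num[v] += r * x[u]
            den[v] += x[u] ** 2
        theta = [num[v] / den[v] for v in range(n_items)]
    return x, theta


def rmse(ratings, x, theta):
    """Root-mean-square error over the observed entries."""
    se = sum((r - x[u] * theta[v]) ** 2 for u, v, r in ratings)
    return (se / len(ratings)) ** 0.5


# Toy data: an exactly rank-1 matrix a[u] * b[v] with two entries held out.
a, b = [1.0, 2.0, 3.0], [2.0, 1.0, 4.0, 3.0]
held_out = {(0, 3), (2, 0)}
ratings = [(u, v, a[u] * b[v])
           for u in range(3) for v in range(4) if (u, v) not in held_out]
x, theta = als_rank1(ratings, n_users=3, n_items=4)
```

Within each half-step the per-user (or per-item) updates are independent of one another, which is the property that makes ALS attractive for data-parallel hardware.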
{"title":"Interpretation of Chinese Address Information Based on Multi-factor Inference","authors":"Xiaolin Li, Yanhui Duan, Huabing Zhou, Yi Zhang","doi":"10.1109/ISPDC.2016.72","DOIUrl":"https://doi.org/10.1109/ISPDC.2016.72","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"12 1","pages":"420-424"},"PeriodicalIF":0.0,"publicationDate":"2016-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72996168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Continuous Self-Checking Validation Framework on Processor Exceptions","authors":"Jian Tan, Daifeng Li","doi":"10.1109/ISPDC.2016.52","DOIUrl":"https://doi.org/10.1109/ISPDC.2016.52","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"25 1","pages":"314-318"},"PeriodicalIF":0.0,"publicationDate":"2016-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74455236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Real-time streaming and processing of big graphs is a relevant and challenging application to run on a Cloud infrastructure. We have analysed the amount of resources needed to partition large streamed graphs with different distributed architectures. We overcome limitations of the state of the art by proposing a decentralised and scalable model that is more efficient in memory usage, network traffic and number of processing machines. The improvement is achieved by summarising the incoming vertices of the graph and by accessing local information about the already partitioned graph. Classical approaches need all information about the previous vertices. In our system, local information is updated periodically through a feedback scheme. Our experimental results show that current architectures cannot process large-scale streamed graphs due to memory limitations. We show that our architecture reduces the number of machines needed by seven because it accesses local memory instead of distributed memory. The total memory size is also reduced. Finally, our model allows the quality of the partition solution to be adjusted to the desired amount of memory and network traffic.
{"title":"Resource Efficiency to Partition Big Streamed Graphs","authors":"Víctor Medel Gracia, Unai Arronategui Arribalzaga","doi":"10.1109/ISPDC.2015.21","DOIUrl":"https://doi.org/10.1109/ISPDC.2015.21","url":null,"abstract":"Real time streaming and processing of big graphs is a relevant and challenging application to be executed in a Cloud infrastructure. We have analysed the amount of resources needed to partition large streamed graphs with different distributed architectures. We have improved state of the art limitations proposing a decentralised and scalable model which is more efficient in memory usage, network traffic and number of processing machines. The improvement has been achieved summarising incoming vertices of the graph and accessing to local information of the already partitioned graph. Classical approaches need all information about the previous vertices. In our system, local information is updated in a feedback scheme periodically. Our experimental results show that current architectures cannot process large scale streamed graphs due to memory limitations. We have proved that our architecture reduces the number of needed machines by seven because it accesses to local memory instead of a distributed one. The total memory size has been also reduced. Finally, our model allows to adjust the quality of the partition solution to the desired amount of memory and network traffic.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"38 1","pages":"120-129"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78958158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
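The streamed-graph abstract contrasts its model with classical approaches that need all information about previously seen vertices. As a point of reference, the sketch below shows one such classical heuristic, a linear-weighted greedy streaming partitioner. It is a baseline, not the decentralised summarising model the paper proposes. The function names, the capacity parameter, and the toy graph are assumptions made for illustration.

```python
def greedy_stream_partition(stream, k, capacity):
    """Assign each arriving vertex to one of k partitions.

    stream: iterable of (vertex, neighbours) pairs in arrival order.
    A vertex goes to the partition that already holds most of its
    neighbours, discounted by how full that partition is.
    """
    assignment = {}   # global state: the partition of every vertex seen so far
    sizes = [0] * k
    for vertex, neighbours in stream:
        def score(p):
            placed = sum(1 for n in neighbours if assignment.get(n) == p)
            return placed * (1.0 - sizes[p] / capacity)
        # Highest score wins; ties go to the least loaded partition.
        best = max(range(k), key=lambda p: (score(p), -sizes[p]))
        assignment[vertex] = best
        sizes[best] += 1
    return assignment


def edge_cut(stream, assignment):
    """Count edges whose endpoints ended up in different partitions."""
    return sum(1 for v, neighbours in stream for n in neighbours
               if v < n and assignment[v] != assignment[n])


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
stream = [(0, [1, 2]), (1, [0, 2]), (2, [0, 1, 3]),
          (3, [2, 4, 5]), (4, [3, 5]), (5, [3, 4])]
assignment = greedy_stream_partition(stream, k=2, capacity=3)
cut = edge_cut(stream, assignment)   # 1: only the bridge 2-3 is cut
```

The `assignment` dictionary grows with the number of vertices and must be visible to every placement decision. This is the memory and communication cost that the abstract says a summarised, locally updated view is meant to reduce.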
Graphics Processing Units (GPUs) consisting of Streaming Multiprocessors (SMs) achieve high throughput by running a large number of threads and context switching among them to hide execution latencies. The number of thread blocks, and hence the number of threads that can be launched on an SM, depends on the resource usage (e.g., number of registers, amount of shared memory) of the thread blocks. Since threads are allocated to an SM at thread-block granularity, some of the resources may not be used up completely and hence will be wasted. We propose an approach that shares the resources of an SM to utilize the wasted resources by launching more thread blocks. We show the effectiveness of our approach for two resources: register sharing, and scratchpad (shared memory) sharing. We further propose optimizations to hide long execution latencies, thus reducing the number of stall cycles. We implemented our approach in the GPGPU-Sim simulator and experimentally validated it on 19 applications from 4 different benchmark suites: GPGPU-Sim, Rodinia, CUDA-SDK, and Parboil. We observed that applications that underutilize register resources show a maximum improvement of 24% and an average improvement of 11% with register sharing. Similarly, the applications that underutilize scratchpad resources show a maximum improvement of 30% and an average improvement of 12.5% with scratchpad sharing. The remaining applications, which do not waste any resources, perform similarly to the baseline approach.
{"title":"Improving GPU Performance Through Resource Sharing","authors":"Vishwesh Jatala, Jayvant Anantpur, Amey Karkare","doi":"10.1145/2907294.2907298","DOIUrl":"https://doi.org/10.1145/2907294.2907298","url":null,"abstract":"Graphics Processing Units (GPUs) consisting of Streaming Multiprocessors (SMs) achieve high throughput by running a large number of threads and context switching among them to hide execution latencies. The number of thread blocks, and hence the number of threads that can be launched on an SM, depends on the resource usage--e.g. number of registers, amount of shared memory--of the thread blocks. Since the allocation of threads to an SM is at the thread block granularity, some of the resources may not be used up completely and hence will be wasted. We propose an approach that shares the resources of SM to utilize the wasted resources by launching more thread blocks. We show the effectiveness of our approach for two resources: register sharing, and scratchpad (shared memory) sharing. We further propose optimizations to hide long execution latencies, thus reducing the number of stall cycles. We implemented our approach in GPGPU-Sim simulator and experimentally validated it on 19 applications from 4 different benchmark suites: GPGPU-Sim, Rodinia, CUDA-SDK, and Parboil. We observed that applications that underutilize register resource show a maximum improvement of 24% and an average improvement of 11% with register sharing. Similarly, the applications that underutilize scratchpad resource show a maximum improvement of 30% and an average improvement of 12.5% with scratchpad sharing. The remaining applications, which do not waste any resources, perform similar to the baseline approach.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"89 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77406955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
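The waste described in the GPU resource-sharing abstract follows from simple integer arithmetic: an SM can host only a whole number of thread blocks, so whatever is left after the last block that fits goes unused. The sketch below works through one example. The SM capacities and the kernel's per-block usage are illustrative assumptions, not figures from the paper or from any specific GPU.

```python
def resident_blocks(capacity, per_block):
    """Return (blocks per SM, limiting resource, leftover per resource).

    capacity:  total amount of each resource on one SM.
    per_block: amount of each resource one thread block consumes.
    """
    limits = {name: capacity[name] // use
              for name, use in per_block.items() if use > 0}
    blocks = min(limits.values())
    limiter = min(limits, key=limits.get)
    leftover = {name: capacity[name] - blocks * use
                for name, use in per_block.items()}
    return blocks, limiter, leftover


# Assumed SM: 32768 registers, 48 KiB scratchpad, 1536 threads, 8 block slots.
sm = {"registers": 32768, "scratchpad": 49152, "threads": 1536, "block_slots": 8}
# Assumed kernel: 256 threads per block, 36 registers per thread, 4 KiB scratchpad.
kernel = {"registers": 256 * 36, "scratchpad": 4096, "threads": 256, "block_slots": 1}

blocks, limiter, leftover = resident_blocks(sm, kernel)
# blocks == 3, limiter == "registers", leftover["registers"] == 5120
```

In this example registers cap residency at three blocks, and 5120 registers (about 16% of the register file) sit idle because a fourth block would need 9216. Idle capacity of this kind is what the abstract proposes to reclaim by sharing resources between thread blocks.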
E-NEXT is an EU FP6 network of excellence that focuses on Internet protocols and services. This short paper presents an overview of the network's goals, organization, and achievements.
{"title":"E-NEXT: Network of Excellence - Emerging Network Technologies","authors":"D. Grigoras","doi":"10.1109/ISPDC.2005.22","DOIUrl":"https://doi.org/10.1109/ISPDC.2005.22","url":null,"abstract":"E-NEXT is an EU FP6 network of excellence that focuses on Internet protocols and services. This short paper presents an overview of the network's goals, organization and achievements","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"84 1","pages":"9-10"},"PeriodicalIF":0.0,"publicationDate":"2005-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83452622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallelism and Optimization are two disciplines that are used together in numerous applications. Solving complex optimization problems often means facing complex search landscapes, which requires time-consuming operations. Exact and heuristic techniques are used nowadays to solve problems in mathematics, logistics, bioinformatics, telecommunications, and many other relevant fields. For these tasks it is in many cases necessary to use cluster computing, multiprocessors, and even computational grids. In this talk I will address the basic challenges of using parallel tools, software, and hardware to extend existing optimization procedures to work in a parallel environment. I will present some basic optimization algorithms, especially heuristic ones, and discuss the application of parallelism to them. I will also show how new techniques become possible thanks to parallelism, giving birth to a whole new class of algorithms and new research lines.
{"title":"New Challenges in Parallel Optimization","authors":"E. Alba","doi":"10.1109/ISPDC.2005.36","DOIUrl":"https://doi.org/10.1109/ISPDC.2005.36","url":null,"abstract":"Parallelism and Optimization are two disciplines that are used together in numerous applications. Solving complex problems in optimization often means to face complex search landscapes, what needs time-consuming operations. Exact and heuristic techniques are being used nowadays to get solutions to problems in mathematics, logistics, bioinformatics, telecommunications, and many other relevant fields. For these tasks it is mandatory to deal with cluster computing in many cases, multiprocessors, and even with computational grids. In this talk I will address the basic challenges of using parallel tools, software, and hardware for extending existing optimization procedures to work in a parallel environment. I will present some basic optimization algorithms, especially heuristic ones, and discuss the application of parallelism to them. Also, I will show how new techniques become possible due to parallelism, giving birth to a whole new class of algorithms and new research lines.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"10 1","pages":"5"},"PeriodicalIF":0.0,"publicationDate":"2005-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81984892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Grid seems to be everywhere, with announcements appearing almost every day of Grid products, sales, and deployments from major vendors. However, in spite of the popularity of the term, there is often confusion as to what the Grid is and what problems it solves. Is there any "there there" or is it all just marketing hype? In this talk, I will address these questions, describing what the Grid is, what problems it solves, and what technology has been developed to build Grid infrastructure and create Grid applications. I will review the current status of Grid infrastructure and deployment and give examples of where Grid technology is being used not only to perform current tasks better, but to provide fundamentally new capabilities that are not possible otherwise.
{"title":"A New Era in Computing: Moving Services onto Grid","authors":"Ian T Foster","doi":"10.1109/ISPDC.2005.7","DOIUrl":"https://doi.org/10.1109/ISPDC.2005.7","url":null,"abstract":"The Grid seems to be everywhere, with announcements appearing almost every day of Grid products, sales, and deployments from major vendors. However, in spite of the popularity of the term, there is often confusion as to what the Grid is and what problems it solves. Is there any \"there there\" or is it all just marketing hype? In this talk, I will address these questions, describing what the Grid is, what problems it solves, and what technology has been developed to build Grid infrastructure and create Grid applications. I will review the current status of Grid infrastructure and deployment and give examples of where Grid technology is being used not only to perform current tasks better, but to provide fundamentally new capabilities that are not possible otherwise.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"1 1","pages":"3"},"PeriodicalIF":0.0,"publicationDate":"2005-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72863701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}