The continuing decrease in memory capacity per core and the increasing disparity between core count and off-chip memory bandwidth create significant challenges for I/O operations in exascale systems. These challenges require rethinking collective I/O to effectively exploit the correlation among I/O accesses at exascale. In this study we introduce a Memory-Conscious Collective I/O that respects the constraint of limited memory space. It (1) restricts aggregation data traffic within disjoint subgroups, (2) coordinates I/O accesses at both the intra-node and inter-node layers, and (3) determines I/O aggregators at run time, considering the data distribution and memory consumption among processes.
{"title":"Poster: Memory-Conscious Collective I/O for Extreme-Scale HPC Systems","authors":"Yin Lu, Yong Chen, R. Thakur, Zhuang Yu","doi":"10.1145/2491661.2481430","DOIUrl":"https://doi.org/10.1145/2491661.2481430","url":null,"abstract":"The continuing decrease in memory capacity per core and the increasing disparity between core count and off-chip memory bandwidth create significant challenges for I/O operations in exascale systems. The exascale challenges require rethinking collective I/O for the effective exploitation of the correlation among I/O accesses in the exascale system. In this study we introduce a Memory-Conscious Collective I/O considering the constraint of the memory space. 1)Restricts aggregation data traffic within disjointed subgroups 2)Coordinates I/O accesses in intra-node and inter-node layer 3)Determines I/O aggregators at run time considering data distribution and memory consumption among processes.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"27 1","pages":"1362-1362"},"PeriodicalIF":0.0,"publicationDate":"2013-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80244915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael O. Lam, B. Supinski, M. LeGendre, J. Hollingsworth
As scientific computation continues to scale, it is crucial to use floating-point arithmetic processors as efficiently as possible. Lower precision allows streaming architectures to perform more operations per second and can reduce memory bandwidth pressure on all architectures. However, using a precision that is too low for a given algorithm and data set will produce inaccurate results. In this poster, we present a framework that uses binary instrumentation and modification to build mixed-precision configurations of existing binaries that were originally developed to use only double precision. This allows developers to easily experiment with mixed-precision configurations without modifying their source code, and it permits auto-tuning of floating-point precision. We also implemented a simple search algorithm to automatically identify which code regions can use lower precision. We include results for several benchmarks that show both the efficacy and overhead of our tool.
{"title":"Abstract: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation","authors":"Michael O. Lam, B. Supinski, M. LeGendre, J. Hollingsworth","doi":"10.1145/2464996.2465018","DOIUrl":"https://doi.org/10.1145/2464996.2465018","url":null,"abstract":"As scientific computation continues to scale, it is crucial to use floating-point arithmetic processors as efficiently as possible. Lower precision allows streaming architectures to perform more operations per second and can reduce memory bandwidth pressure on all architectures. However, using a precision that is too low for a given algorithm and data set will result in inaccurate results. In this poster, we present a framework that uses binary instrumentation and modification to build mixed-precision configurations of existing binaries that were originally developed to use only double-precision. This allows developers to easily experiment with mixed-precision configurations without modifying their source code, and it permits auto-tuning of floating-point precision. We also implemented a simple search algorithm to automatically identify which code regions can use lower precision. We include results for several benchmarks that show both the efficacy and overhead of our tool.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"154 1","pages":"1423-1423"},"PeriodicalIF":0.0,"publicationDate":"2013-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75958805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this poster, we present a parallel Image-to-Mesh Conversion (I2M) algorithm with quality and fidelity guarantees achieved by dynamic point insertions and removals. Starting directly from an image, it is able to recover the surface and mesh the volume with tetrahedra of good shape. Our tightly-coupled shared-memory parallel speculative execution paradigm employs carefully designed memory and contention managers, load balancing, synchronization, and optimization schemes, while maintaining high single-threaded performance: our single-threaded performance is faster than CGAL, the state-of-the-art sequential I2M software we are aware of. Our meshes also come with theoretical guarantees: the radius-edge ratio is less than 2 and the planar angles of the boundary triangles are more than 30 degrees. The effectiveness of our method is shown on Blacklight, the large cache-coherent NUMA machine of the Pittsburgh Supercomputing Center. We observe more than 74% strong-scaling efficiency and super-linear weak-scaling efficiency for up to 128 cores.
{"title":"High Quality Real-Time Image-to-Mesh Conversion for Finite Element Simulations","authors":"Panagiotis A. Foteinos, N. Chrisochoides","doi":"10.1145/2464996.2465439","DOIUrl":"https://doi.org/10.1145/2464996.2465439","url":null,"abstract":"In this poster, we present a parallel Image-to-Mesh Conversion (I2M) algorithm with quality and fidelity guarantees achieved by dynamic point insertions and removals. Starting directly from an image, it is able to recover the surface and mesh the volume with tetrahedra of good shape. Our tightly-coupled shared-memory parallel speculative execution paradigm employs carefully designed memory and contention managers, load balancing, synchronization and optimizations schemes, while it maintains high single-threaded performance: our single-threaded performance is faster than CGAL, the state of the art sequential I2M software we are aware of. Our meshes come also with theoretical guarantees: the radius-edge is less than 2 and the planar angles of the boundary triangles are more than 30 degrees. The effectiveness of our method is shown on Blacklight, the large cache-coherent NUMA machine of Pittsburgh Supercomputing Center. We observe a more than 74% strong scaling efficiency for up to 128 cores and a super-linear weak scaling efficiency for up to 128 cores.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"12 1","pages":"1552-1553"},"PeriodicalIF":0.0,"publicationDate":"2013-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73265437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-12-03. DOI: 10.1109/CloudCom.2012.6427493
Satoshi Takahashi, H. Nakada, A. Takefusa, T. Kudoh, Maiko Shigeno, Akiko Yoshise
VM (Virtual Machine)-based flexible capacity management is an effective scheme for reducing total power consumption in the data center. However, it raises the following issues: the tradeoff between power saving and user experience, deciding a VM packing within feasible computation time, and avoiding collisions among VM migration processes. To resolve these issues, we propose a matching-based and a greedy-type VM packing algorithm, each of which can decide a suitable VM packing plan in polynomial time. The experiments evaluate not only basic performance but also the feasibility of the algorithms by comparing them with optimization solvers. The feasibility experiment uses supercomputer trace data prepared by the Center for Computational Sciences of the University of Tsukuba. The basic performance experiment shows that the algorithms reduce total power consumption by between 18% and 50%.
{"title":"Abstract: Virtual Machine Packing Algorithms for Lower Power Consumption","authors":"Satoshi Takahashi, H. Nakada, A. Takefusa, T. Kudoh, Maiko Shigeno, Akiko Yoshise","doi":"10.1109/CloudCom.2012.6427493","DOIUrl":"https://doi.org/10.1109/CloudCom.2012.6427493","url":null,"abstract":"VM (Virtual Machine)-based flexible capacity man- agement is an effective scheme to reduce total power consumption in the data center. However, there have been the following issues, tradeoff of power-saving and user experience, decision of VM packing in feasible calculation time and collision avoidance of VM migration processes. In order to resolve these issues, we propose a matching-based and a greedy-type VM packing algorithm, which enables to decide a suitable VM packing plan in polynomial time. The experiments evaluate not only a basic performance, but also a feasibility of the algorithms by comparing with optimization solvers. The feasibility experiment uses a super computer trace data prepared by Center for Computational Sciences of Univer- sity of Tsukuba. The basic performance experiment shows that the algorithms reduce total power consumption by between 18% and 50%.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"34 1","pages":"1517-1518"},"PeriodicalIF":0.0,"publicationDate":"2012-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79439849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-12-01. DOI: 10.1109/SC.Companion.2012.259
Liana Diesendruck, Luigi Marini, R. Kooper, M. Kejriwal, Kenton McHenry
We describe our efforts to provide a form of automated search of handwritten content for digitized document archives. To carry out the search we use a computer vision technique called word spotting. A form of content-based image retrieval, it avoids the still-difficult task of directly recognizing text by allowing a user to search with a query image containing handwritten text and ranking a database of images by how similar their content appears. In order to make this search capability available on an archive, three computationally expensive pre-processing steps are required. We augment this automated portion of the process with a passive crowdsourcing element that mines queries from the system's users in order to improve the results of future queries. We benchmark the proposed framework on 1930s Census data, a collection of roughly 3.6 million forms and 7 billion individual units of information.
{"title":"Abstract: Digitization and Search: A Non-Traditional Use of HPC","authors":"Liana Diesendruck, Luigi Marini, R. Kooper, M. Kejriwal, Kenton McHenry","doi":"10.1109/SC.Companion.2012.259","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.259","url":null,"abstract":"We describe our efforts to provide a form of automated search of handwritten content for digitized document archives. To carry out the search we use a computer vision technique called word spotting. A form of content based image retrieval, it avoids the still difficult task of directly recognizing text by allowing a user to search using a query image containing handwritten text and ranking a database of images in terms of those that contain more similar looking content. In order to make this search capability available on an archive three computationally expensive pre-processing steps are required. We augment this automated portion of the process with a passive crowd sourcing element that mines queries from the systems users in order to then improve the results of future queries. We benchmark the proposed framework on 1930s Census data, a collection of roughly 3.6 million forms and 7 billion individual units of information.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"42 1","pages":"1460-1461"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81538041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-11-10. DOI: 10.1109/SC.Companion.2012.115
K. Moreland, Brad King, Robert Maynard, K. Ma
We are on the threshold of a transformative change in the basic architecture of high-performance computing. The use of accelerator processors, characterized by large core counts, shared but asymmetrical memory, and heavy thread loading, is quickly becoming the norm in high performance computing. These accelerators represent significant challenges in updating our existing base of software. An intrinsic problem with this transition is a fundamental programming shift from message-passing processes to much finer-grained thread scheduling with memory sharing. Another problem is the lack of stability in accelerator implementations; processor and compiler technology is currently changing rapidly. In this paper we describe our approach to addressing these two immediate problems with respect to scientific analysis and visualization algorithms. Our approach to accelerator programming forms the basis of the Dax toolkit, a framework for building data analysis and visualization algorithms applicable to exascale computing.
{"title":"Flexible Analysis Software for Emerging Architectures","authors":"K. Moreland, Brad King, Robert Maynard, K. Ma","doi":"10.1109/SC.Companion.2012.115","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.115","url":null,"abstract":"We are on the threshold of a transformative change in the basic architecture of high-performance computing. The use of accelerator processors, characterized by large core counts, shared but asymmetrical memory, and heavy thread loading, is quickly becoming the norm in high performance computing. These accelerators represent significant challenges in updating our existing base of software. An intrinsic problem with this transition is a fundamental programming shift from message passing processes to much more fine thread scheduling with memory sharing. Another problem is the lack of stability in accelerator implementation; processor and compiler technology is currently changing rapidly. In this paper we describe our approach to address these two immediate problems with respect to scientific analysis and visualization algorithms. Our approach to accelerator programming forms the basis of the Dax toolkit, a framework to build data analysis and visualization algorithms applicable to exascale computing.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"48 1","pages":"821-826"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74202308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-11-10. DOI: 10.1109/SC.Companion.2012.31
Miki Horiuchi, K. Taura
Data I/O has been one of the major bottlenecks in the execution of data-intensive workflow applications. Appropriate task scheduling of a workflow can achieve high I/O throughput by reducing remote data accesses. However, most such task scheduling algorithms require the user to explicitly describe the files accessed by each job, typically through stage-in/stage-out directives in the job description; such annotations are at best tedious and sometimes impossible to write. Thus, a more automated mechanism is necessary. In this paper, we propose a method for predicting the input/output files of each job without user-supplied annotations. It predicts I/O files by collecting file access history in a profiling run prior to the production run. We implemented the proposed method in the workflow system GXP Make and the distributed file system Mogami, and we evaluate our system with two real workflow applications. Our data-aware job scheduler increases the ratio of local file accesses from 50% to 75% in one application and from 23% to 45% in the other. As a result, it reduces the makespan of the two applications by 2.5% and 7.5%, respectively.
{"title":"Acceleration of Data-Intensive Workflow Applications by Using File Access History","authors":"Miki Horiuchi, K. Taura","doi":"10.1109/SC.Companion.2012.31","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.31","url":null,"abstract":"Data I/O has been one of major bottlenecks in the execution of data-intensive workflow applications. Appropriate task scheduling of a workflow can achieve high I/O throughput by reducing remote data accesses. However, most such task scheduling algorithms require the user to explicitly describe files to be accessed by each job, typically by stage-in/stage-out directives in job description, where such annotations are at best tedious and sometime impossible. Thus, a more automated mechanism is necessary. In this paper, we propose a method for predicting input/output files of each job without user-supplied annotations. It predicts I/O files by collecting file access history in a profiling run prior to the production run. We implemented the proposed method in a workflow system GXP Make and a distributed file system Mogami. We evaluate our system with two real workflow applications. Our data-aware job scheduler increases the ratio of local file accesses from 50% to 75% in one application and from 23% to 45% in the other. As a result, it reduces the makespan of the two applications by 2.5% and 7.5%, respectively.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"1 1","pages":"157-165"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75160574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-11-10. DOI: 10.1109/SC.Companion.2012.202
Katherine E. Isaacs, Aaditya G. Landge, T. Gamblin, P. Bremer, Valerio Pascucci, B. Hamann
The growth in size and complexity of scaling applications, and of the systems on which they run, poses challenges in analyzing and improving their overall performance. With metrics coming from thousands or millions of processes, visualization techniques are necessary to make sense of the increasing amount of data. To aid the process of exploration and understanding, we announce the initial release of Boxfish, an extensible tool for manipulating and visualizing data pertaining to application behavior. Combining and visually presenting data and knowledge from multiple domains, such as the application's communication patterns and the hardware's network configuration and routing policies, can yield the insight necessary to discover the underlying causes of observed behavior. Boxfish allows users to query, filter and project data across these domains to create interactive, linked visualizations.
{"title":"Abstract: Exploring Performance Data with Boxfish","authors":"Katherine E. Isaacs, Aaditya G. Landge, T. Gamblin, P. Bremer, Valerio Pascucci, B. Hamann","doi":"10.1109/SC.Companion.2012.202","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.202","url":null,"abstract":"The growth in size and complexity of scaling applications and the systems on which they run pose challenges in analyzing and improving their overall performance. With metrics coming from thousands or millions of processes, visualization techniques are necessary to make sense of the increasing amount of data. To aid the process of exploration and understanding, we announce the initial release of Boxfish, an extensible tool for manipulating and visualizing data pertaining to application behavior. Combining and visually presenting data and knowledge from multiple domains, such as the application's communication patterns and the hardware's network configuration and routing policies, can yield the insight necessary to discover the underlying causes of observed behavior. Boxfish allows users to query, filter and project data across these domains to create interactive, linked visualizations.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"24 1","pages":"1380-1381"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74603563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2012-11-10. DOI: 10.1109/SC.Companion.2012.214
P. DeMar, D. Dykstra, G. Garzoglio, P. Mhashikar, Anupam Rajendran, Wenji Wu
Exascale science translates to big data. In the case of the Large Hadron Collider (LHC), the data is not only immense, it is also globally distributed. Fermilab is host to the LHC Compact Muon Solenoid (CMS) experiment's US Tier-1 Center. It must deal with both scaling and wide-area distribution challenges in processing its CMS data. This poster will describe the ongoing network-related R&D activities at Fermilab as a mosaic of efforts that combine to facilitate big data processing and movement.
{"title":"Abstract: Networking Research Activities at Fermilab for Big Data Analysis","authors":"P. DeMar, D. Dykstra, G. Garzoglio, P. Mhashikar, Anupam Rajendran, Wenji Wu","doi":"10.1109/SC.Companion.2012.214","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.214","url":null,"abstract":"Exascale science translates to big data. In the case of the Large Hadron Collider (LHC), the data is not only immense, it is also globally distributed. Fermilab is host to the LHC Compact Muon Solenoid (CMS) experiment's US Tier-1 Center. It must deal with both scaling and wide-area distribution challenges in processing its CMS data. This poster will describe the ongoing network-related R&D activities at Fermilab as a mosaic of efforts that combine to facilitate big data processing and movement.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"24 1","pages":"1398-1399"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74604088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}