
Latest publications from the 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

A framework for hybrid parallel flow simulations with a trillion cells in complex geometries
Christian Godenschwager, F. Schornbaum, Martin Bauer, H. Köstler, U. Rüde
waLBerla is a massively parallel software framework for simulating complex flows with the lattice Boltzmann method (LBM). Performance and scalability results are presented for SuperMUC, the world's fastest x86-based supercomputer ranked number 6 on the Top500 list, and JUQUEEN, a Blue Gene/Q system ranked as number 5. We reach resolutions with more than one trillion cells and perform up to 1.93 trillion cell updates per second using 1.8 million threads. The design and implementation of waLBerla is driven by a careful analysis of the performance on current petascale supercomputers. Our fully distributed data structures and algorithms allow for efficient, massively parallel simulations on these machines. Elaborate node level optimizations and vectorization using SIMD instructions result in highly optimized compute kernels for the single- and two-relaxation-time LBM. Excellent weak and strong scaling is achieved for a complex vascular geometry of the human coronary tree.
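The single-relaxation-time kernel the abstract refers to can be pictured in miniature. Below is a hypothetical, textbook D2Q9 BGK collision step in C++, a sketch of the arithmetic structure that waLBerla's vectorized kernels optimize, not waLBerla's actual code; the name collide and the plain array layout are illustrative assumptions.

#include <array>

// Textbook D2Q9 lattice: 9 discrete velocities and their weights.
struct D2Q9 {
    static constexpr int Q = 9;
    static constexpr std::array<int, Q>    ex{0, 1, 0, -1, 0, 1, -1, -1, 1};
    static constexpr std::array<int, Q>    ey{0, 0, 1, 0, -1, 1, 1, -1, -1};
    static constexpr std::array<double, Q> w{4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                                             1.0/36, 1.0/36, 1.0/36, 1.0/36};
};

// Single-relaxation-time (BGK) collision for one cell: compute the density
// and velocity moments, then relax each distribution toward equilibrium.
void collide(std::array<double, D2Q9::Q>& f, double tau) {
    double rho = 0.0, ux = 0.0, uy = 0.0;
    for (int i = 0; i < D2Q9::Q; ++i) {
        rho += f[i];
        ux  += f[i] * D2Q9::ex[i];
        uy  += f[i] * D2Q9::ey[i];
    }
    ux /= rho; uy /= rho;
    const double usq = ux * ux + uy * uy;
    for (int i = 0; i < D2Q9::Q; ++i) {
        const double eu  = D2Q9::ex[i] * ux + D2Q9::ey[i] * uy;
        const double feq = D2Q9::w[i] * rho
                         * (1.0 + 3.0 * eu + 4.5 * eu * eu - 1.5 * usq);
        f[i] -= (f[i] - feq) / tau;   // relax toward equilibrium
    }
}

A production kernel would fuse this collision with the streaming step and vectorize across cells with SIMD intrinsics; the sketch only shows the per-cell arithmetic.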
Citations: 95
Using cross-layer adaptations for dynamic data management in large scale coupled scientific workflows
Tong Jin, Fan Zhang, Qian Sun, H. Bui, M. Parashar, Hongfeng Yu, S. Klasky, N. Podhorszki, H. Abbasi
As system scales and application complexity grow, managing and processing simulation data has become a significant challenge. While recent approaches based on data staging and in-situ/in-transit data processing are promising, dynamic data volumes and distributions, such as those occurring in AMR-based simulations, make the efficient use of these techniques challenging. In this paper we propose cross-layer adaptations that address these challenges and respond at runtime to dynamic data management requirements. Specifically we explore (1) adaptations of the spatial resolution at which the data is processed, (2) dynamic placement and scheduling of data processing kernels, and (3) dynamic allocation of in-transit resources. We also exploit coordinated approaches that dynamically combine these adaptations at the different layers. We evaluate the performance of our adaptive cross-layer management approach on the Intrepid IBM-BlueGene/P and Titan Cray-XK7 systems using Chombo-based AMR applications, and demonstrate its effectiveness in improving overall time-to-solution and increasing resource efficiency.
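As a toy illustration of adaptation (1), processing data at a reduced spatial resolution when staging space is tight, the hypothetical helper below picks the coarsest level whose footprint fits a staging budget. The names and the 8x-per-level shrink factor (halving resolution per dimension in 3D) are assumptions for illustration, not the paper's API.

#include <cstddef>

// Pick the smallest coarsening level whose footprint fits the staging
// budget. Each level halves the resolution per dimension, so in 3D the
// data volume shrinks by a factor of 8 per level.
int chooseResolutionLevel(std::size_t bytesAtFullRes,
                          std::size_t stagingBudget,
                          int maxLevel) {
    std::size_t bytes = bytesAtFullRes;
    int level = 0;
    while (bytes > stagingBudget && level < maxLevel) {
        bytes /= 8;
        ++level;
    }
    return level;   // 0 means full resolution already fits
}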
Citations: 31
Design and performance evaluation of NUMA-aware RDMA-based end-to-end data transfer systems
Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, T. Robertazzi
Data-intensive applications place stringent requirements on the performance of both back-end storage systems and front-end network interfaces. However, for ultra high-speed data transfer, for example at 100 Gbps and higher, the effects of multiple bottlenecks along a full end-to-end path have not been resolved efficiently. In this paper, we describe our implementation of end-to-end data transfer software at such high speeds. At the back-end, we construct a storage area network with the iSCSI protocols and utilize efficient RDMA technology. At the front-end, we design network communication software to transfer data in parallel, and utilize NUMA techniques to maximize the performance of multiple network interfaces. We demonstrate that our system can deliver the full 100 Gbps end-to-end data transfer throughput. The software product has been tested rigorously and shown to be applicable to supporting various data-intensive applications that constantly move bulk data within and across data centers.
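A NUMA-aware front-end of this kind typically pins each transfer thread to a core on the NUMA node closest to the NIC it drives, so RDMA buffers and DMA traffic stay node-local. The minimal sketch below uses the standard Linux mechanism (pthread_setaffinity_np); discovering which core sits near which NIC (e.g., via hwloc) is assumed to happen elsewhere.

#include <pthread.h>
#include <sched.h>

// Pin a transfer thread to one core on the NUMA node nearest its NIC so
// that RDMA buffers and DMA traffic stay local. Returns true on success.
bool pinThreadToCore(pthread_t thread, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(cpu_set_t), &set) == 0;
}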
Citations: 22
AUGEM: Automatically generate high performance Dense Linear Algebra kernels on x86 CPUs
Qian Wang, Xianyi Zhang, Yunquan Zhang, Qing Yi
Basic Linear Algebra Subprograms (BLAS) is a fundamental library in scientific computing. In this paper, we present a template-based optimization framework, AUGEM, which can automatically generate fully optimized assembly code for several dense linear algebra (DLA) kernels, such as GEMM, GEMV, AXPY and DOT, on varying multi-core CPUs without requiring any manual intervention from developers. In particular, based on domain-specific knowledge about the algorithms of the DLA kernels, we use a collection of parameterized code templates to formulate a number of commonly occurring instruction sequences within the optimized low-level C code of these DLA kernels. Then, our framework uses a specialized low-level C optimizer to identify instruction sequences that match the pre-defined code templates and thereby translates them into extremely efficient SSE/AVX instructions. The DLA kernels generated by our template-based approach surpass the implementations of the Intel MKL and AMD ACML BLAS libraries on both Intel Sandy Bridge and AMD Piledriver processors.
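To make the template idea concrete, here is a deliberately simplified, hypothetical generator in the same spirit: it stamps out C source for an AXPY kernel with a parameterized unroll factor. AUGEM's real templates describe assembly-level instruction sequences and are lowered to SSE/AVX; this sketch shows only the generate-from-parameters pattern.

#include <sstream>
#include <string>

// Emit C source for an AXPY kernel whose inner loop is unrolled `unroll`
// times; the unroll factor is the template parameter being tuned.
std::string emitAxpyKernel(int unroll) {
    std::ostringstream os;
    os << "void axpy(int n, double a, const double* x, double* y) {\n"
       << "  int i = 0;\n"
       << "  for (; i + " << unroll << " <= n; i += " << unroll << ") {\n";
    for (int u = 0; u < unroll; ++u)
        os << "    y[i+" << u << "] += a * x[i+" << u << "];\n";
    os << "  }\n"
       << "  for (; i < n; ++i) y[i] += a * x[i];  /* remainder loop */\n"
       << "}\n";
    return os.str();
}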
Citations: 200
Channel reservation protocol for over-subscribed channels and destinations
George Michelogiannakis, Nan Jiang, Daniel U. Becker, W. Dally
Channels in system-wide networks tend to be over-subscribed due to the cost of bandwidth and increasing traffic demands. To make matters worse, workloads can overstress specific destinations, creating hotspots. Lossless networks offer attractive advantages compared to lossy networks but suffer from tree saturation. This led to the development of explicit congestion notification (ECN). However, ECN is very sensitive to its configuration parameters and acts only after congestion forms. We propose the channel reservation protocol (CRP) to enable sources to reserve bandwidth in multiple resources in advance of packet transmission with a single request, but without idling resources as circuit switching does. CRP prevents congestion from ever occurring and thus reacts instantly to traffic changes, whereas ECN requires 300,000 cycles to stabilize in our experiments. Furthermore, ECN may not prevent congestion formed by short-lived flows generated by a large combination of source-destination pairs.
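The bookkeeping behind such advance reservations can be sketched in a few lines. In this hypothetical model (the cycle-based ledger and names are assumptions, not CRP's wire protocol), a resource grants each single-request reservation the earliest slot that fits, so the channel is never held idle the way a circuit-switched path would be.

#include <algorithm>

// One reservable resource (a channel or a destination port). It keeps a
// cursor to its next free cycle and schedules granted reservations
// back-to-back, admitting traffic without over-committing the link.
struct ReservableResource {
    long nextFreeCycle = 0;

    // Grant `lengthCycles` of bandwidth to a request arriving at
    // `arrivalCycle`; returns the scheduled start time.
    long reserve(long arrivalCycle, long lengthCycles) {
        long start = std::max(arrivalCycle, nextFreeCycle);
        nextFreeCycle = start + lengthCycles;
        return start;
    }
};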
Citations: 21
Scalable parallel OPTICS data clustering using graph algorithmic techniques
Md. Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, W. Liao, F. Manne, A. Choudhary
OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging as the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and Prim's Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve high parallelism for distributed cluster extraction. Using high-dimensional datasets containing up to a billion floating point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to those given by the classical OPTICS algorithm.
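The disjoint-set structure mentioned in the abstract is standard and easy to sketch. In a hypothetical extraction pass, every MST edge whose reachability distance falls below the chosen threshold is unioned, and the resulting connected components are the clusters; the sequential sketch below shows the data structure itself, while POPTICS's contribution is using it at scale.

#include <vector>

// Disjoint-set (union-find) with path halving. Union every MST edge below
// the reachability threshold; find() then labels each point's cluster.
struct DisjointSet {
    std::vector<int> parent;
    explicit DisjointSet(int n) : parent(n) {
        for (int i = 0; i < n; ++i) parent[i] = i;
    }
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // path halving
            x = parent[x];
        }
        return x;
    }
    void unite(int a, int b) {
        a = find(a);
        b = find(b);
        if (a != b) parent[a] = b;
    }
};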
Citations: 48
An early performance evaluation of a Many Integrated Core architecture based SGI Rackable computing system
S. Saini, Haoqiang Jin, D. Jespersen, Huiyu Feng, M. J. Djomehri, William Arasin, R. Hood, P. Mehrotra, R. Biswas
Intel recently introduced the Xeon Phi coprocessor based on the Many Integrated Core architecture, featuring 60 cores with a peak performance of 1.0 Tflop/s. NASA has deployed a 128-node SGI Rackable system where each node has two Intel Xeon E5-2670 8-core Sandy Bridge processors along with two Xeon Phi 5110P coprocessors. We have conducted an early performance evaluation of the Xeon Phi. We used microbenchmarks to measure the latency and bandwidth of memory and interconnect, I/O rates, and the performance of OpenMP directives and MPI functions. We also used OpenMP and MPI versions of the NAS Parallel Benchmarks along with two production CFD applications to test four programming modes: offload, processor native, coprocessor native and symmetric (processor plus coprocessor). In this paper we present preliminary results based on our performance evaluation of various aspects of a Phi-based system.
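Of the four programming modes, offload is the least familiar today. The hypothetical fragment below shows its shape using the Intel compiler's offload pragma of that era; it requires the Intel toolchain and a Phi card, and the function and variable names are illustrative.

// Offload mode: the host ships the loop and its data to the Xeon Phi,
// which runs it with its own OpenMP threads, then copies results back.
void scaleOnPhi(double* a, int n, double s) {
    #pragma offload target(mic) inout(a : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            a[i] *= s;
    }
}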
Citations: 19
COCA: Online distributed resource management for cost minimization and carbon neutrality in data centers
Shaolei Ren, Yuxiong He
Due to the enormous energy consumption and associated environmental concerns, data centers have been increasingly pressured to reduce long-term net carbon footprint to zero, i.e., carbon neutrality. In this paper, we propose an online algorithm, called COCA (optimizing for COst minimization and CArbon neutrality), for minimizing data center operational cost while satisfying carbon neutrality without long-term future information. Unlike the existing research, COCA enables distributed server-level resource management: each server autonomously adjusts its processing speed and optimally decides the amount of workloads to process. We prove that COCA achieves a close-to-minimum operational cost (incorporating both electricity and delay costs) compared to the optimal algorithm with future information, while bounding the potential violation of carbon neutrality. We also perform trace-based simulation studies to complement the analysis, and the results show that COCA reduces cost by more than 25% (compared to state of the art) while resulting in a smaller carbon footprint.
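A toy version of the per-server decision conveys the flavor of the approach, though the paper's actual algorithm is an online optimization that tracks a carbon budget rather than this grid scan. The hypothetical sketch below trades a cubic dynamic-power term against an M/M/1 queueing-delay term; all names and the cost model are illustrative assumptions.

#include <limits>

// Pick a processing speed for one server: electricity cost grows roughly
// cubically with speed, while queueing delay cost blows up as the speed
// approaches the arrival rate. Scan a grid and keep the cheapest point.
double chooseSpeed(double arrivalRate, double delayWeight, double maxSpeed) {
    double best = maxSpeed;
    double bestCost = std::numeric_limits<double>::infinity();
    for (double s = arrivalRate + 0.01; s <= maxSpeed; s += 0.01) {
        double cost = s * s * s                                       // power
                    + delayWeight * arrivalRate / (s - arrivalRate);  // M/M/1 delay
        if (cost < bestCost) { bestCost = cost; best = s; }
    }
    return best;
}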
Citations: 32
Insights for exascale IO APIs from building a petascale IO API
J. Lofstead, R. Ross
Near the dawn of the petascale era, IO libraries had reached stability in their function and data layout, with only incremental changes being incorporated. The shift in technology, particularly the scale of parallel file systems and the number of compute processes, prompted revisiting best practices for optimal IO performance. Like other efforts such as PLFS, the project that led to ADIOS, the ADaptable IO System, was motivated both by this shift in technology and by the historical requirement, for optimal IO performance, of changing how simulations perform IO depending on the platform. To solve both issues, the ADIOS team, in consultation with other leading IO experts, sought to build a new IO platform based on the assumptions inherent in petascale hardware platforms. This paper helps inform the design of future IO platforms with a discussion of lessons learned as part of the process of designing and building ADIOS.
Citations: 25
A framework for load balancing of Tensor Contraction expressions via dynamic task partitioning
Pai-Wei Lai, Kevin Stock, Samyam Rajbhandari, S. Krishnamoorthy, P. Sadayappan
In this paper, we introduce Dynamic Load-balanced Tensor Contractions (DLTC), a domain-specific library for efficient task-parallel execution of tensor contraction expressions, a class of computation encountered in quantum chemistry and physics. Our framework decomposes each contraction into smaller units of work, represented by an abstraction referred to as iterators. We exploit an extra level of parallelism by having tasks across independent contractions executed concurrently through a dynamic load balancing runtime. We demonstrate improved performance, scalability, and flexibility for the computation of tensor contraction expressions on parallel computers using examples from Coupled Cluster (CC) methods.
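The task decomposition can be pictured with a hypothetical sketch: one contraction C[i,j] += A[i,k] * B[k,j] is cut into output tiles, each of which becomes an independently schedulable task for the dynamic load-balancing runtime. Tile shape and names are illustrative, not DLTC's API.

#include <vector>

// One unit of work: compute a tile-sized block of the contraction output.
struct ContractionTask {
    int i0, j0;   // top-left corner of the output tile
};

// Enumerate the tasks for a C[ni, nj] output in tile-sized steps; a
// dynamic load balancer can hand these out to idle workers in any order.
std::vector<ContractionTask> makeTasks(int ni, int nj, int tile) {
    std::vector<ContractionTask> tasks;
    for (int i = 0; i < ni; i += tile)
        for (int j = 0; j < nj; j += tile)
            tasks.push_back({i, j});
    return tasks;
}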
Citations: 21