
Latest publications from 2012 SC Companion: High Performance Computing, Networking Storage and Analysis

Incremental and Parallel Analytics on Astrophysical Data Streams
Pub Date : 2012-11-10 DOI: 10.1109/SC.Companion.2012.130
D. Mishin, T. Budavári, A. Szalay, Yanif Ahmad
Stream processing methods and online algorithms are increasingly appealing in the scientific and large-scale data management communities due to increasing ingestion rates of scientific instruments, the ability to produce and inspect results interactively, and the simplicity and efficiency of sequential storage access over enormous datasets. This article will showcase our experiences in using off-the-shelf streaming technology to implement incremental and parallel spectral analysis of galaxies from the Sloan Digital Sky Survey (SDSS) to detect a wide variety of galaxy features. The technical focus of the article is on a robust, highly scalable principal components analysis (PCA) algorithm and its use of coordination primitives to realize consistency as part of parallel execution. Our algorithm and framework can be readily used in other domains.
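As a rough illustration of the incremental flavor of such an analysis (a minimal sketch, not the authors' implementation), the code below folds one observation at a time into a running mean and covariance estimate, so principal components can be recomputed at any point in the stream; the class name, the Welford-style rank-one update, and the toy dimensions are all assumptions.

```python
import numpy as np

class IncrementalPCA:
    """Streaming PCA via a running mean and covariance estimate.

    A minimal sketch: each call to update() folds one observation
    (e.g., one galaxy spectrum) into the sufficient statistics, so
    the principal components are available at any point in the stream.
    """

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.scatter = np.zeros((dim, dim))  # unnormalized covariance

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Rank-one (Welford-style) update of the scatter matrix.
        self.scatter += np.outer(delta, x - self.mean)

    def components(self, k):
        # Eigenvectors of the sample covariance, largest variance first.
        evals, evecs = np.linalg.eigh(self.scatter / max(self.n - 1, 1))
        order = np.argsort(evals)[::-1][:k]
        return evals[order], evecs[:, order]

# Usage: feed spectra as they stream in, inspect components at any time.
pca = IncrementalPCA(dim=500)  # reduced for the demo; real spectra are longer
for spectrum in np.random.rand(100, 500):  # stand-in for SDSS spectra
    pca.update(spectrum)
variances, basis = pca.components(k=5)
```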
{"title":"Incremental and Parallel Analytics on Astrophysical Data Streams","authors":"D. Mishin, T. Budavári, A. Szalay, Yanif Ahmad","doi":"10.1109/SC.Companion.2012.130","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.130","url":null,"abstract":"Stream processing methods and online algorithms are increasingly appealing in the scientific and large-scale data management communities due to increasing ingestion rates of scientific instruments, the ability to produce and inspect results interactively, and the simplicity and efficiency of sequential storage access over enormous datasets. This article will showcase our experiences in using off-the-shelf streaming technology to implement incremental and parallel spectral analysis of galaxies from the Sloan Digital Sky Survey (SDSS) to detect a wide variety of galaxy features. The technical focus of the article is on a robust, highly scalable principal components analysis (PCA) algorithm and its use of coordination primitives to realize consistency as part of parallel execution. Our algorithm and framework can be readily used in other domains.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"54 1","pages":"1078-1086"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82369743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Designing a Collaborative Filtering Recommender on the Single Chip Cloud Computer
Pub Date : 2012-11-10 DOI: 10.1109/SC.Companion.2012.118
Aalap Tripathy, Atish Patra, S. Mohan, R. Mahapatra
Fast response requirements for big-data applications on cloud infrastructures continue to grow. At the same time, many cores on chip have become a reality. These developments are set to redefine the infrastructure nodes of future cloud data centers. For this to happen, parallel programming runtimes need to be designed with many-core chips as the target architecture. In this paper, we show that the commonly used MapReduce programming paradigm can be adapted to run on Intel's experimental Single-chip Cloud Computer (SCC) with 48 cores on chip. We demonstrate this using a Collaborative Filtering (CF) recommender system as an application. CF is a widely used information-filtering technique that predicts a user's preference for an unknown item from past ratings. These systems are typically deployed in distributed clusters and operate on large a priori datasets. We address scalability with data partitioning, combining, and sorting algorithms, and maximize data locality to minimize communication cost among the SCC cores. We demonstrate ~2x speedup and ~94% lower energy consumption on benchmark workloads compared to a distributed cluster of single- and multi-processor nodes.
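For intuition, here is a minimal MapReduce-flavored sketch of item co-occurrence counting, the core of many collaborative-filtering recommenders; the phase structure (map, shuffle, reduce) is what a runtime would partition across the SCC's 48 cores. All names and the tiny rating set are illustrative, not the paper's code.

```python
from collections import defaultdict
from itertools import combinations

ratings = [  # (user, item) pairs standing in for past ratings
    ("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u2", "c"),
]

# Map: group items by user (one partition per core in a real runtime).
by_user = defaultdict(list)
for user, item in ratings:
    by_user[user].append(item)

# Shuffle: emit co-occurring item pairs as intermediate keys.
pairs = defaultdict(int)
for items in by_user.values():
    for a, b in combinations(sorted(items), 2):
        pairs[(a, b)] += 1

# Reduce: co-occurrence counts approximate item-item similarity;
# recommend items that co-occur most with what the user already rated.
def recommend(user, top_n=2):
    seen = set(by_user[user])
    scores = defaultdict(int)
    for (a, b), count in pairs.items():
        if a in seen and b not in seen:
            scores[b] += count
        elif b in seen and a not in seen:
            scores[a] += count
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("u1"))  # -> ['c']
```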
{"title":"Designing a Collaborative Filtering Recommender on the Single Chip Cloud Computer","authors":"Aalap Tripathy, Atish Patra, S. Mohan, R. Mahapatra","doi":"10.1109/SC.Companion.2012.118","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.118","url":null,"abstract":"Fast response requirements for big-data applications on cloud infrastructures continues to grow. At the same time, many cores on-chip have now become a reality. These developments are set to redefine infrastructure nodes of cloud data centers in the future. For this to happen, parallel programming runtimes need to be designed for many-cores on chip as the target architecture. In this paper, we show that the commonly used MapReduce programming paradigm can be adapted to run on Intel's experimental single chip cloud computer (SCC) with 48-cores on chip. We demonstrate this using a Collaborative Filtering (CF) recommender system as an application. This is a widely used technique for information filtering to predict user's preference towards an unknown item from their past ratings. These systems are typically deployed in distributed clusters and operate on large apriori datasets. We address scalability with data partitioning, combining and sorting algorithms, maximize data locality to minimize communication cost within the SCC cores. We demonstrate ~2x speedup, ~94% lower energy consumption for benchmark workloads as compared to a distributed cluster of single and multi-processor nodes.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"49 1","pages":"838-847"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82857851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
SAN Optimization for High Performance Storage with RDMA Data Transfer
Pub Date : 2012-11-10 DOI: 10.1109/SC.Companion.2012.15
Jae-Woo Choi, Youngjin Yu, Hyeonsang Eom, H. Yeom, Dongin Shin
Today's server environments consist of many machines that form clusters for distributed computing or storage area networks (SANs) to process and store enormous volumes of data effectively. In such environments, the backend storage is usually the bottleneck of the overall system. Simply replacing the devices with better ones is not enough to exploit their performance benefits; proper optimizations are needed to fully realize their performance gains. In this work, we first applied a high-performance device as backend storage in an existing SAN solution and found that the solution could not exploit the device's low latency and high bandwidth, especially for small random I/O patterns, even over a high-speed network. To address this problem, we propose a new design containing three optimizations: 1) removing software overheads to lower I/O latency; 2) parallelism to utilize the high bandwidth of the device; and 3) a temporal merge mechanism to reduce network overhead. We implemented these as a prototype and found that our solution delivers substantial performance improvements in both latency and bandwidth.
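To make the third optimization concrete, here is a minimal sketch of the temporal-merge idea: briefly buffered small writes that are contiguous in the file are coalesced into a single network request, trading a little latency for far fewer round trips. The function name, the buffering policy, and the toy data are assumptions, not the paper's implementation.

```python
def merge_writes(writes):
    """Coalesce (offset, data) writes that are contiguous in the file."""
    merged = []
    for offset, data in sorted(writes):
        if merged and merged[-1][0] + len(merged[-1][1]) == offset:
            prev_off, prev_data = merged[-1]
            merged[-1] = (prev_off, prev_data + data)  # extend previous write
        else:
            merged.append((offset, data))
    return merged

pending = [(0, b"aaaa"), (8, b"cc"), (4, b"bbbb")]
print(merge_writes(pending))
# -> [(0, b'aaaabbbbcc')]  one RDMA transfer instead of three
```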
{"title":"SAN Optimization for High Performance Storage with RDMA Data Transfer","authors":"Jae-Woo Choi, Youngjin Yu, Hyeonsang Eom, H. Yeom, Dongin Shin","doi":"10.1109/SC.Companion.2012.15","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.15","url":null,"abstract":"Today's server environments consist of many machines constructing clusters for distributed computing system or storage area networks (SAN) for effectively processing or saving enormous data. In these kinds of server environments, backend-storages are usually the bottleneck of the overall system. But it is not enough to simply replace the devices with better ones to exploit their performance benefits. In other words, proper optimizations are needed to fully utilize their performance gains. In this work, we first applied a high performance device as a backend-storage to the existing SAN solution, and found that it could not utilize the low latency and high bandwidth of the device, especially in case of small sized random I/O pattern even though a high speed network was used. To address this problem, we propose a new design that contains three optimizations: 1) removing software overheads to lower I/O latency; 2) parallelism to utilize the high bandwidth of the device; 3) temporal merge mechanism to reduce network overhead. We implemented them as a prototype and found that our solution makes substantial performance improvements in terms of both the latency and bandwidth.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"14 1","pages":"24-29"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82899841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Abstract: Hybrid Breadth First Search Implementation for Hybrid-Core Computers
Pub Date : 2012-11-10 DOI: 10.1109/SC.Companion.2012.184
Kevin R. Wadleigh, John Amelio, K. Collins, G. Edwards
Summary form only given. The Graph500 benchmark is designed to evaluate the suitability of supercomputing systems for graph algorithms, which are increasingly important in HPC. The timed Graph500 kernel, Breadth First Search, exhibits memory access patterns typical of these types of applications, with poor spatial locality and synchronization between multiple streams of execution. The Graph500 benchmark was ported to the Convey HC-2ex and MX-100, hybrid-core computers with an Intel host system and a coprocessor incorporating four reprogrammable Xilinx FPGAs. The computers contain a unique memory system designed to sustain high bandwidth for random memory accesses. The BFS kernel was implemented as a hybrid algorithm with concurrent processing on both the host and coprocessor. The early steps use a top-down algorithm on the host with results copied to coprocessor memory for use in a bottom-up algorithm. The coprocessor uses thousands of threads to traverse the graph. The resulting implementation runs at over 16 billion TEPS.
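The top-down/bottom-up combination described above is the direction-optimizing BFS pattern, sketched minimally below: expand from the frontier while it is small, and let unvisited vertices search for a frontier parent once it grows. The switch threshold and the assumption of an undirected adjacency list are ours, and the sketch is serial where the hybrid-core version uses thousands of threads.

```python
def hybrid_bfs(adj, source, switch_frac=0.05):
    """BFS levels over an undirected adjacency list (dict of lists)."""
    n = len(adj)
    level = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        if len(frontier) < switch_frac * n:
            # Top-down: expand edges out of the frontier.
            nxt = {v for u in frontier for v in adj[u] if v not in level}
        else:
            # Bottom-up: each unvisited vertex looks for a frontier parent.
            nxt = {v for v in adj if v not in level
                   and any(u in frontier for u in adj[v])}
        for v in nxt:
            level[v] = depth
        frontier = nxt
    return level

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(hybrid_bfs(adj, 0))  # -> {0: 0, 1: 1, 2: 1, 3: 2}
```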
{"title":"Abstract: Hybrid Breadth First Search Implementation for Hybrid-Core Computers","authors":"Kevin R. Wadleigh, John Amelio, K. Collins, G. Edwards","doi":"10.1109/SC.Companion.2012.184","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.184","url":null,"abstract":"Summary form only given. The Graph500 benchmark is designed to evaluate the suitability of supercomputing systems for graph algorithms, which are increasingly important in HPC. The timed Graph500 kernel, Breadth First Search, exhibits memory access patterns typical of these types of applications, with poor spatial locality and synchronization between multiple streams of execution. The Graph500 benchmark was ported to the Convey HC-2ex and MX-100, hybrid-core computers with an Intel host system and a coprocessor incorporating four reprogrammable Xilinx FPGAs. The computers contain a unique memory system designed to sustain high bandwidth for random memory accesses. The BFS kernel was implemented as a hybrid algorithm with concurrent processing on both the host and coprocessor. The early steps use a top-down algorithm on the host with results copied to coprocessor memory for use in a bottom-up algorithm. The coprocessor uses thousands of threads to traverse the graph. The resulting implementation runs at over 16 billion TEPS.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"5 1","pages":"1354-1354"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90383297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
The SDAV Software Frameworks for Visualization and Analysis on Next-Generation Multi-Core and Many-Core Architectures
Pub Date : 2012-11-10 DOI: 10.1109/SC.Companion.2012.36
Christopher M. Sewell, J. Meredith, K. Moreland, T. Peterka, David E. DeMarle, Li-Ta Lo, J. Ahrens, Robert Maynard, Berk Geveci
This paper surveys the four software frameworks being developed as part of the visualization pillar of the SDAV (Scalable Data Management, Analysis, and Visualization) Institute, one of the SciDAC (Scientific Discovery through Advanced Computing) Institutes established by the ASCR (Advanced Scientific Computing Research) Program of the U.S. Department of Energy. These frameworks include EAVL (Extreme-scale Analysis and Visualization Library), DAX (Data Analysis at Extreme), DIY (Do It Yourself), and PISTON. The objective of these frameworks is to facilitate the adaptation of visualization and analysis algorithms to take advantage of the available parallelism in emerging multi-core and many-core hardware architectures, in anticipation of the need for such algorithms to be run in-situ with LCF (leadership-class facilities) simulation codes on supercomputers.
{"title":"The SDAV Software Frameworks for Visualization and Analysis on Next-Generation Multi-Core and Many-Core Architectures","authors":"Christopher M. Sewell, J. Meredith, K. Moreland, T. Peterka, David E. DeMarle, Li-Ta Lo, J. Ahrens, Robert Maynard, Berk Geveci","doi":"10.1109/SC.Companion.2012.36","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.36","url":null,"abstract":"This paper surveys the four software frameworks being developed as part of the visualization pillar of the SDAV (Scalable Data Management, Analysis, and Visualization) Institute, one of the SciDAC (Scientific Discovery through Advanced Computing) Institutes established by the ASCR (Advanced Scientific Computing Research) Program of the U.S. Department of Energy. These frameworks include EAVL (Extreme-scale Analysis and Visualization Library), DAX (Data Analysis at Extreme), DIY (Do It Yourself), and PISTON. The objective of these frameworks is to facilitate the adaptation of visualization and analysis algorithms to take advantage of the available parallelism in emerging multi-core and many-core hardware architectures, in anticipation of the need for such algorithms to be run in-situ with LCF (leadership-class facilities) simulation codes on supercomputers.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"78 1","pages":"206-214"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83940541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Poster: High Performance GPU Accelerated TSP Solver
Pub Date : 2012-11-10 DOI: 10.1109/SC.Companion.2012.225
K. Rocki, R. Suda
We present a high-performance GPU-accelerated implementation of the 2-opt local search algorithm for the Traveling Salesman Problem (TSP). GPU usage greatly decreases the time needed to optimize a route; however, it requires a complicated and well-tuned implementation. As the problem size increases, the time spent comparing graph edges grows significantly. We used instances from the TSPLIB library for testing, and our results show that our GPU algorithm decreases the time needed to perform a simple local search operation by approximately 5 to 45 times compared to a parallel CPU implementation using 6 cores. The code has been implemented in both CUDA and OpenCL and tested on NVIDIA and AMD devices. Our experiments show that the optimization algorithm using GPU local search converges on average up to 300 times faster than the sequential CPU version, depending on the problem size. The main contributions of this work are a problem-division scheme exploiting data locality, which allows arbitrarily large problem instances to be solved on a GPU, and the parallel implementation of the algorithm itself.
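For reference, the 2-opt move the GPU evaluates in parallel is sketched below: reversing the tour segment between positions i and j pays off when the two new edges are shorter than the two they replace. The serial sweep, the Euclidean distance function, and the toy instance are our assumptions; the GPU version scores all (i, j) pairs concurrently.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def two_opt_gain(tour, pts, i, j):
    """Change in tour length from reversing tour[i+1 .. j]."""
    a, b = pts[tour[i]], pts[tour[i + 1]]
    c, d = pts[tour[j]], pts[tour[(j + 1) % len(tour)]]
    return (dist(a, c) + dist(b, d)) - (dist(a, b) + dist(c, d))

def two_opt_pass(tour, pts):
    """One serial sweep; a GPU kernel evaluates all (i, j) pairs at once."""
    best, move = 0.0, None
    for i in range(len(tour) - 2):
        for j in range(i + 2, len(tour)):
            gain = two_opt_gain(tour, pts, i, j)
            if gain < best:
                best, move = gain, (i, j)
    if move is not None:
        i, j = move
        tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
    return tour

pts = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(two_opt_pass([0, 1, 2, 3], pts))  # untangles the crossing -> [0, 1, 3, 2]
```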
{"title":"Poster: High Performance GPU Accelerated TSP Solver","authors":"K. Rocki, R. Suda","doi":"10.1109/SC.Companion.2012.225","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.225","url":null,"abstract":"We are presenting a high performance GPU accelerated implementation of 2-opt local search algorithm for the Traveling Salesman Problem (TSP). GPU usage greatly decreases the time needed to optimize the route, however requires a complicated and well tuned implementation. With the increasing problem size, the time spent on comparing the graph edges grows significantly. We used instances from the TSPLIB library for for testing and our results show that by using our GPU algorithm, the time needed to perform a simple local search operation can be decreased approximately 5 to 45 times compared to parallel CPU code implementation using 6 cores. The code has been implemented in CUDA as well as in OpenCL and tested on NVIDIA and AMD devices. The experimental studies have shown that the optimization algorithm using the GPU local search converges from up to 300 times faster on average compared to the sequential CPU version, depending on the problem size. The main contributions of this work are the problem division scheme exploiting data locality which allows to solve arbitrarily big problem instances using GPU and the parallel implementation of the algorithm itself.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"10 1","pages":"1413-1414"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88065052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Application of High Performance Computing to Solvency and Profitability Calculations for Life Assurance Contracts
Pub Date : 2012-11-10 DOI: 10.1109/SC.Companion.2012.140
Mark Tucker, J. M. Bull
In the UK, pension providers are required by law to demonstrate solvency on a regular basis, and the regulations governing how solvency is demonstrated are changing. Historically, it has been sufficient to report solvency using a single 'best estimate' set of assumptions. The new regulations require a Monte Carlo approach to finding a worst-case scenario, which demands computing power beyond the systems currently available in the industry. This paper aims to show that the new regulations could be met by moving away from current actuarial valuation software packages and producing well-performing ab initio code employing a variety of HPC techniques. Using a combination of algorithmic improvements, serial optimisations, and multi-core parallelism, we demonstrate a performance improvement over commercial software of a factor of over 10^5. We show that this brings the Monte Carlo simulations within the bounds of practicality, and we suggest possibilities for further improvements, for example using clusters of GPUs. We also identify other possible use cases for high-performance solvency and profitability calculations.
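A minimal sketch of the Monte Carlo solvency idea follows: value the same book of contracts under many randomly drawn economic scenarios and read off a worst-case (e.g., 99.5th percentile) loss. The one-factor rate model and every parameter below are illustrative assumptions; the point is that scenarios are independent, so the loop parallelizes trivially across cores.

```python
import random

def liability_under_scenario(reserve, annual_payment, years, rng):
    """Discounted cost of paying annual_payment for `years`, minus assets."""
    pv, discount = 0.0, 1.0
    for _ in range(years):
        rate = rng.gauss(0.03, 0.02)  # stochastic short rate (assumed model)
        discount /= 1.0 + rate
        pv += annual_payment * discount
    return pv - reserve

def worst_case(n_scenarios=100_000, quantile=0.995, seed=1):
    rng = random.Random(seed)
    losses = sorted(
        liability_under_scenario(20_000, 1_000, 30, rng)
        for _ in range(n_scenarios)
    )
    return losses[int(quantile * n_scenarios)]

print(f"99.5% worst-case shortfall: {worst_case():.0f}")
```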
{"title":"The Application of High Performance Computing to Solvency and Profitability Calculations for Life Assurance Contracts","authors":"Mark Tucker, J. M. Bull","doi":"10.1109/SC.Companion.2012.140","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.140","url":null,"abstract":"In the UK, pension providers are required by law to demonstrate solvency on a regular basis; the regulations governing how solvency is demonstrated are changing. Historically, it has been sufficient to report solvency using a single `best estimate' set of assumptions. The new regulations require a Monte Carlo approach to finding a worst-case scenario that requires computing power which is outside the systems currently available in the industry. This paper aims to show that the new regulations could be met by moving away from current actuarial valuation software packages and producing well-performing ab initio code, employing a variety of HPC techniques. Using a combination of algorithmic improvements, serial optimisations and multi-core parallelism, we demonstrate a performance improvement over commercial software of a factor of over 105. We show that this brings the Monte Carlo simulations within the bounds of practicality, and we suggest possibilities for further improvements, for example using clusters of GPUs. We also identify other possible use cases for high performance solvency and profitability calculations.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"56 1","pages":"1163-1170"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86829298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Poster: Acceleration of the BLAST Hydro Code on GPU
Pub Date : 2012-11-10 DOI: 10.1109/SC.Companion.2012.172
Tingxing Dong, T. Kolev, R. Rieben, V. Dobrev
The BLAST code implements a high-order numerical algorithm that solves the equations of compressible hydrodynamics using the Finite Element Method in a moving Lagrangian frame. BLAST is coded in C++ and parallelized by MPI. We accelerate the most computationally intensive parts (80%-95%) of BLAST on an NVIDIA GPU with the CUDA programming model. Several 2D and 3D problems were tested and a maximum speedup of 4.3x was delivered. Our results demonstrate the validity and capability of GPU computing.
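As a quick sanity check on those numbers (our arithmetic, not the poster's), Amdahl's law bounds the whole-code speedup when only a fraction p of the runtime is accelerated: overall = 1 / ((1 - p) + p / s) for a kernel speedup s.

```python
def overall_speedup(p, s):
    """Amdahl's law: accelerate fraction p of the runtime by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# With 80%-95% of BLAST offloaded, an illustrative 8x kernel speedup
# gives overall speedups bracketing the reported 4.3x:
for p in (0.80, 0.95):
    print(f"p={p:.2f}: kernel 8x -> overall {overall_speedup(p, 8):.1f}x")
# p=0.80 -> 3.3x, p=0.95 -> 5.9x
```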
{"title":"Poster: Acceleration of the BLAST Hydro Code on GPU","authors":"Tingxing Dong, T. Kolev, R. Rieben, V. Dobrev","doi":"10.1109/SC.Companion.2012.172","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.172","url":null,"abstract":"The BLAST code implements a high-order numerical algorithm that solves the equations of compressible hydrodynamics using the Finite Element Method in a moving Lagrangian frame. BLAST is coded in C++ and parallelized by MPI. We accelerate the most computationally intensive parts (80%-95%) of BLAST on an NVIDIA GPU with the CUDA programming model. Several 2D and 3D problems were tested and a maximum speedup of 4.3x was delivered. Our results demonstrate the validity and capability of GPU computing.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"53 1","pages":"1337-1337"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83570189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
A Case for Optimistic Coordination in HPC Storage Systems
Pub Date : 2012-11-10 DOI: 10.1109/SC.Companion.2012.19
P. Carns, K. Harms, D. Kimpe, R. Ross, J. Wozniak, L. Ward, M. Curry, Ruth Klundt, Geoff Danielson, Cengiz Karakoyunlu, J. Chandy, Bradley Settlemeyer, W. Gropp
High-performance computing (HPC) storage systems rely on access coordination to ensure that concurrent updates do not produce incoherent results. HPC storage systems typically employ pessimistic distributed locking to provide this functionality in cases where applications cannot perform their own coordination. This approach, however, introduces significant performance overhead and complicates fault handling. In this work we evaluate the viability of optimistic conditional storage operations as an alternative to distributed locking in HPC storage systems. We investigate design strategies and compare the two approaches in a prototype object storage system using a parallel read/modify/write benchmark. Our prototype illustrates that conditional operations can be easily integrated into distributed object storage systems and can outperform standard coordination primitives for simple update workloads. Our experiments show that conditional updates can achieve over two orders of magnitude higher performance than pessimistic locking for some parallel read/modify/write workloads.
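The optimistic pattern being evaluated is sketched minimally below: instead of taking a lock for the whole read/modify/write, each writer reads a version, modifies the data, and commits only if the version is unchanged, retrying on conflict. This in-memory versioned object is an illustrative stand-in for a storage target, not the paper's prototype.

```python
import threading

class VersionedObject:
    def __init__(self, value):
        self._lock = threading.Lock()  # guards the commit point only
        self.version, self.value = 0, value

    def read(self):
        return self.version, self.value

    def conditional_write(self, expected_version, new_value):
        """Commit iff nothing was written since we read (compare-and-swap)."""
        with self._lock:
            if self.version != expected_version:
                return False  # conflict: caller retries
            self.version += 1
            self.value = new_value
            return True

def read_modify_write(obj, update):
    while True:  # optimistic retry loop
        version, value = obj.read()
        if obj.conditional_write(version, update(value)):
            return

counter = VersionedObject(0)
threads = [threading.Thread(target=read_modify_write,
                            args=(counter, lambda v: v + 1))
           for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.read())  # -> (8, 8): no update lost despite no per-op lock
```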
{"title":"A Case for Optimistic Coordination in HPC Storage Systems","authors":"P. Carns, K. Harms, D. Kimpe, R. Ross, J. Wozniak, L. Ward, M. Curry, Ruth Klundt, Geoff Danielson, Cengiz Karakoyunlu, J. Chandy, Bradley Settlemeyer, W. Gropp","doi":"10.1109/SC.Companion.2012.19","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.19","url":null,"abstract":"High-performance computing (HPC) storage systems rely on access coordination to ensure that concurrent updates do not produce incoherent results. HPC storage systems typically employ pessimistic distributed locking to provide this functionality in cases where applications cannot perform their own coordination. This approach, however, introduces significant performance overhead and complicates fault handling. In this work we evaluate the viability of optimistic conditional storage operations as an alternative to distributed locking in HPC storage systems. We investigate design strategies and compare the two approaches in a prototype object storage system using a parallel read/modify/write benchmark. Our prototype illustrates that conditional operations can be easily integrated into distributed object storage systems and can outperform standard coordination primitives for simple update workloads. Our experiments show that conditional updates can achieve over two orders of magnitude higher performance than pessimistic locking for some parallel read/modify/write workloads.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"34 1","pages":"48-53"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79485477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Communication avoiding algorithms
Pub Date : 2012-11-10 DOI: 10.1109/SC.Companion.2012.351
J. Demmel
{"title":"Communication avoiding algorithms","authors":"J. Demmel","doi":"10.1109/SC.Companion.2012.351","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.351","url":null,"abstract":"","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"1 1","pages":"1942-2000"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88259438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14