ACM/IEEE SC 2002 Conference (SC'02): Latest Papers

Massive Arrays of Idle Disks For Storage Archives
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10058
Dennis Colarelli, D. Grunwald
The declining cost of commodity disk drives is rapidly changing the economics of deploying large amounts of online or near-line storage. Conventional mass storage systems use either high performance RAID clusters, automated tape libraries or a combination of tape and disk. In this paper, we analyze an alternative design using massive arrays of idle disks, or MAID. We argue that this storage organization provides storage densities matching or exceeding those of tape libraries with performance similar to disk arrays. Moreover, we show that with effective power management of individual drives, this performance can be achieved using a very small power budget. In particular, we show that our power management strategy can result in performance comparable to an always-on RAID system while using 1/15th the power of such a RAID system.
Citations: 410
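The MAID abstract above attributes its power saving to managing individual drives. As a minimal sketch, assuming a fixed idle-timeout spin-down policy (a generic policy, not necessarily the one the authors evaluate) and made-up power figures, the following estimates the energy one drive uses over an access trace. The function name, the wattages, the spin-up cost, and the timeout are all illustrative assumptions.

```python
def drive_energy(access_times, horizon, idle_timeout,
                 active_watts=10.0, standby_watts=1.0, spinup_joules=100.0):
    """Energy (joules) used by one drive under a fixed idle-timeout policy.

    The drive is spinning at time 0, spins down once it has been idle for
    `idle_timeout` seconds, and spins up again (paying `spinup_joules`)
    on the next access. All numbers are hypothetical.
    """
    energy = 0.0
    last = 0.0  # time of the previous access (or start of the trace)
    for t in sorted(access_times) + [horizon]:
        gap = t - last
        if gap > idle_timeout:
            # Spin for the timeout, then sit in standby for the rest of the gap.
            energy += idle_timeout * active_watts
            energy += (gap - idle_timeout) * standby_watts
            if t < horizon:
                energy += spinup_joules  # the access forces a spin-up
        else:
            energy += gap * active_watts
        last = t
    return energy

# One hour, drive never spun down versus a 60-second idle timeout.
always_on = drive_energy([], horizon=3600, idle_timeout=float("inf"))
managed = drive_energy([100, 110, 3000], horizon=3600, idle_timeout=60)
print(always_on, managed)
```

The ratio between the two figures depends entirely on the assumed trace and constants, so it should not be read as reproducing the 1/15th result reported in the abstract.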
Early Evaluation of the IBM p690
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10000
P. Worley, T. Dunigan, M. Fahey, James B. White, Arthur S. Bland
Oak Ridge National Laboratory recently received 27 32-way IBM pSeries 690 SMP nodes. In this paper, we describe our initial evaluation of the p690 architecture, focusing on the performance of benchmarks and applications that are representative of the expected production workload.
Citations: 9
A 26.58 Tflops Global Atmospheric Simulation with the Spectral Transform Method on the Earth Simulator
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10053
S. Shingu, H. Takahara, H. Fuchigami, M. Yamada, Yoshinori Tsuda, W. Ohfuchi, Yuji Sasaki, Kazuo Kobayashi, Takashi Hagiwara, S. Habata, M. Yokokawa, Hiroyuki Itoh, K. Otsuka
A spectral atmospheric general circulation model called AFES (AGCM for Earth Simulator) was developed and optimized for the architecture of the Earth Simulator (ES). The ES is a massively parallel vector supercomputer that consists of 640 processor nodes interconnected by a single-stage crossbar network, with a total peak performance of 40.96 Tflops. A sustained performance of 26.58 Tflops was achieved for a high-resolution simulation (T1279L96) with AFES by utilizing the full 640-node configuration of the ES. The resulting computing efficiency is 64.9% of the peak performance, well surpassing that of conventional weather/climate applications, which reach just 25-50% efficiency even on vector parallel computers. This remarkable performance proves the effectiveness of the ES as a viable means for practical applications.
Citations: 87
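The figures in this entry can be cross-checked with one division: the 26.58 Tflops in the title, divided by the 40.96 Tflops peak in the abstract, should give the quoted 64.9% efficiency. The variable names below are illustrative only.

```python
peak_tflops = 40.96       # total peak performance quoted in the abstract
sustained_tflops = 26.58  # sustained figure quoted in the paper title

efficiency = sustained_tflops / peak_tflops
print(f"efficiency = {efficiency:.1%}")
```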
Collaborative Simulation Grid: Multiscale Quantum-Mechanical/Classical Atomistic Simulations on Distributed PC Clusters in the US and Japan
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10013
H. Kikuchi, R. Kalia, A. Nakano, P. Vashishta, H. Iyetomi, S. Ogata, T. Kouno, F. Shimojo, K. Tsuruta, S. Saini
A multidisciplinary, collaborative simulation has been performed on a Grid of geographically distributed PC clusters. The multiscale simulation approach seamlessly combines i) atomistic simulation based on the molecular dynamics (MD) method and ii) quantum mechanical (QM) calculation based on the density functional theory (DFT), so that accurate but less scalable computations are performed only where they are needed. The multiscale MD/QM simulation code has been Grid-enabled using i) a modular, additive hybridization scheme, ii) multiple QM clustering, and iii) computation/communication overlapping. The Gridified MD/QM simulation code has been used to study environmental effects of water molecules on fracture in silicon. A preliminary run of the code has achieved a parallel efficiency of 94% on 25 PCs distributed over 3 PC clusters in the US and Japan, and a larger test involving 154 processors on 5 distributed PC clusters is in progress.
Citations: 14
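To put the 94% figure in the abstract above in concrete terms, the sketch below assumes the usual definition, parallel efficiency = speedup / number of processors. The abstract does not say whether the measurement is fixed-size or scaled, so the helper name and the interpretation are assumptions.

```python
def implied_speedup(efficiency, processors):
    """Speedup implied by a parallel efficiency, assuming efficiency = speedup / processors."""
    return efficiency * processors

speedup = implied_speedup(0.94, 25)
print(f"implied speedup on 25 PCs: {speedup:.1f}x")
```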
Implementation and Evaluation of A QoS-Capable Cluster-Based IP Router
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10026
P. Pradhan, T. Chiueh
A major challenge in Internet edge router design is to support both high packet forwarding performance and versatile and efficient packet processing capabilities. The thesis of this research project is that a cluster of PCs connected by a high-speed system area network provides an effective hardware platform for building routers to be used at the edges of the Internet. This paper describes a scalable and extensible edge router architecture called Panama, which supports a novel aggregate route caching scheme, a real-time link scheduling algorithm whose performance overhead is independent of the number of real-time flows, a highly efficient kernel extension mechanism to safely load networking software extensions dynamically, and an integrated resource scheduler which ensures that real-time flows with additional packet processing requirements still meet their end-to-end performance requirements. This paper describes the implementation and evaluation of the first Panama prototype based on a cluster of PCs and Myrinet.
Citations: 6
NAMD: Biomolecular Simulation on Thousands of Processors
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10019
James C. Phillips, G. Zheng, Sameer Kumar, L. Kalé
NAMD is a fully featured, production molecular dynamics program for high performance simulation of large biomolecular systems. We have previously, at SC2000, presented scaling results for simulations with cutoff electrostatics on up to 2048 processors of the ASCI Red machine, achieved with an object-based hybrid force and spatial decomposition scheme and an aggressive measurement-based predictive load balancing framework. We extend this work by demonstrating similar scaling on the much faster processors of the PSC Lemieux Alpha cluster, and for simulations employing efficient (order N log N) particle mesh Ewald full electrostatics. This unprecedented scalability in a biomolecular simulation code has been attained through latency tolerance, adaptation to multiprocessor nodes, and the direct use of the Quadrics Elan library in place of MPI by the Charm++/Converse parallel runtime system.
Citations: 284
Owner Prediction for Accelerating Cache-to-Cache Transfer Misses in a cc-NUMA Architecture
Pub Date : 2002-11-16 DOI: 10.5555/762761.762762
M. Acacio, José González, José M. García, J. Duato
Cache misses for which data must be obtained from a remote cache (cache-to-cache transfer misses) account for an important fraction of the total miss rate. Unfortunately, cc-NUMA designs put the access to the directory information into the critical path of 3-hop misses, which significantly penalizes them compared to SMP designs. This work studies the use of owner prediction as a means of providing cc-NUMA multiprocessors with more efficient support for cache-to-cache transfer misses. Our proposal comprises an effective prediction scheme as well as a coherence protocol designed to support the use of prediction. Results indicate that owner prediction can significantly reduce the latency of cache-to-cache transfer misses, which translates into speed-ups in application performance of up to 12%. In order to also accelerate most of those 3-hop misses that are either not predicted or mispredicted, the inclusion of a small and fast directory cache in every node is evaluated, leading to improvements of up to 16% in final performance.
Citations: 63
Distributed Dynamic Hash Tables Using IBM LAPI
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10041
J. Malard, R. Stewart
An asynchronous communication library for accessing and managing dynamic hash tables over a network of Symmetric Multiprocessors (SMP) is presented. A blocking factor is shown experimentally to reduce the variance of the wall clock time. It is also shown that remote accesses to a distributed hash table can be as effective and scalable as the one-sided operations of the low-level communication middleware on an IBM SP.
Citations: 6
On Increasing Architecture Awareness in Program Optimizations to Bridge the Gap between Peak and Sustained Processor Performance — Matrix-Multiply Revisited
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10054
David Parello, O. Temam, J. Verdun
As the complexity of processor architectures increases, there is a widening gap between peak processor performance and sustained processor performance so that programs now tend to exploit only a fraction of available performance. While there is a tremendous amount of literature on program optimizations, compiler optimizations lack efficiency because they are plagued by three flaws: (1) they often implicitly use simplified, if not simplistic, models of processor architecture, (2) they usually focus on a single processor component (e.g., cache) and ignore the interactions among multiple components, (3) the most heavily investigated components (e.g., caches) sometimes have only a small impact on overall performance. Through the in-depth analysis of a simple program kernel, we want to show that understanding the complex interactions between programs and the numerous processor architecture components is both feasible and critical to designing efficient program optimizations.
Citations: 28
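The abstract above does not say which transformations the authors apply to matrix multiply. As one textbook example of an architecture-aware rewrite of that kernel, the sketch below shows loop tiling (blocking) next to the reference triple loop. The function names and the tile size are hypothetical, and pure Python only illustrates the loop structure: it will not show the cache benefit a compiled kernel would.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_tiled(a, b, tile=2):
    """Same product, visiting tile x tile blocks so that each block of
    A, B and C is reused while it would still be cache-resident."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # Work on one block; min() handles ragged edges.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(matmul_tiled(a, b) == matmul_naive(a, b))
```

Tiling addresses only the cache, which is exactly the kind of single-component focus the abstract warns about; a fuller treatment would also consider registers, the TLB, and instruction scheduling.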
Parallel Multiscale Gauss-Newton-Krylov Methods for Inverse Wave Propagation
Pub Date : 2002-11-16 DOI: 10.5555/762761.762827
V. Akçelik, G. Biros, O. Ghattas
One of the outstanding challenges of computational science and engineering is large-scale nonlinear parameter estimation of systems governed by partial differential equations. These are known as inverse problems, in contradistinction to the forward problems that usually characterize large-scale simulation. Inverse problems are significantly more difficult to solve than forward problems, due to ill-posedness, large dense ill-conditioned operators, multiple minima, space-time coupling, and the need to solve the forward problem repeatedly. We present a parallel algorithm for inverse problems governed by time-dependent PDEs, and scalability results for an inverse wave propagation problem of determining the material field of an acoustic medium. The difficulties mentioned above are addressed through a combination of total variation regularization, preconditioned matrix-free Gauss-Newton-Krylov iteration, algorithmic checkpointing, and multiscale continuation. We are able to solve a synthetic inverse wave propagation problem through a pelvic bone geometry involving 2.1 million inversion parameters in 3 hours on 256 processors of the Terascale Computing System at the Pittsburgh Supercomputing Center.
Citations: 163
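The core idea of a matrix-free Gauss-Newton-Krylov iteration, named in the abstract above, can be shown on a toy problem. The sketch below is not the authors' solver: it fits a two-parameter exponential curve rather than a wave-propagation model, and it omits regularization, preconditioning, checkpointing, multiscale continuation, and parallelism. All function names are hypothetical. Each Gauss-Newton step solves (JᵀJ) d = -Jᵀr with conjugate gradients, using only products with J and Jᵀ and never forming either matrix.

```python
import math

def residual(p, xs, ys):
    """r_i(p) = p0 * exp(p1 * x_i) - y_i"""
    return [p[0] * math.exp(p[1] * x) - y for x, y in zip(xs, ys)]

def jvp(p, xs, v):
    """Jacobian-vector product J v, computed without forming J."""
    return [math.exp(p[1] * x) * (v[0] + p[0] * x * v[1]) for x in xs]

def jtvp(p, xs, w):
    """Transposed product J^T w, again without forming J."""
    e = [math.exp(p[1] * x) for x in xs]
    return [sum(ei * wi for ei, wi in zip(e, w)),
            sum(p[0] * x * ei * wi for x, ei, wi in zip(xs, e, w))]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cg(apply_a, rhs, iters=10, tol=1e-30):
    """Conjugate gradients for A d = rhs, with A given only as a function."""
    d = [0.0] * len(rhs)
    r = list(rhs)   # residual of the linear system (d starts at zero)
    q = list(rhs)   # search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs <= tol:  # converged (also avoids dividing by zero)
            break
        aq = apply_a(q)
        alpha = rs / dot(q, aq)
        d = [di + alpha * qi for di, qi in zip(d, q)]
        r = [ri - alpha * ai for ri, ai in zip(r, aq)]
        rs_new = dot(r, r)
        q = [ri + (rs_new / rs) * qi for ri, qi in zip(r, q)]
        rs = rs_new
    return d

def gauss_newton(p, xs, ys, steps=10):
    for _ in range(steps):
        r = residual(p, xs, ys)
        rhs = [-g for g in jtvp(p, xs, r)]
        # Gauss-Newton system (J^T J) d = -J^T r, solved matrix-free.
        d = cg(lambda v: jtvp(p, xs, jvp(p, xs, v)), rhs)
        p = [pi + di for pi, di in zip(p, d)]
    return p

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]  # synthetic data, true p = (2.0, 0.5)
p = gauss_newton([1.8, 0.45], xs, ys)
print(p)
```

The starting guess is deliberately close to the true parameters; without a line search or the continuation strategy the abstract mentions, plain Gauss-Newton is not guaranteed to converge from a poor start.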