
2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis: Latest Publications

Performance optimization of TCP/IP over 10 Gigabit Ethernet by precise instrumentation
Takeshi Yoshino, Yutaka Sugawara, K. Inagami, J. Tamatsukuri, M. Inaba, K. Hiraki
End-to-end communication over 10 Gigabit Ethernet (10 GbE) WANs has become popular. However, several difficulties must be solved before Long Fat-pipe Networks (LFNs) can be fully utilized with TCP. We observed that the following factors caused performance degradation: short-term bursty data transfer, mismatches between TCP and hardware support, and excess CPU load. In this research, we established systematic methodologies to optimize TCP on LFNs. To pinpoint the causes of performance degradation, we analyzed real networks precisely using our hardware-based wire-rate analyzer with 100-ns time resolution. On the basis of these observations, we took the following actions: (1) using hardware-based pacing to avoid unnecessary packet losses due to collisions at bottlenecks, (2) modifying TCP to adapt to the packet-coalescing mechanism, and (3) modifying programs to reduce memory copies. We achieved a constant throughput of 9.08 Gbps on a 500 ms RTT network for 5 h. Our approach overcomes the difficulties of single-end 10 GbE LFNs.
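To illustrate the pacing idea (this is a back-of-the-envelope sketch, not the authors' hardware design; the frame size and rate below are illustrative), a pacer avoids bursts by enforcing a fixed inter-frame gap that matches the target rate:

```python
def pacing_gap_ns(frame_bytes, target_gbps):
    """Inter-frame gap (in ns) that paces a flow at target_gbps.

    This is the spacing between frame starts that keeps the average
    rate at target_gbps; Gbit/s equals bit/ns, so bits / Gbps = ns.
    """
    frame_bits = frame_bytes * 8
    return frame_bits / target_gbps

# 9000-byte jumbo frames paced at 9 Gbps: one frame start every 8000 ns
gap = pacing_gap_ns(9000, 9.0)
```

Without this spacing, a host can emit frames back-to-back at the full 10 Gbps line rate, creating the short-term bursts the abstract identifies as a loss source at bottleneck queues.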
DOI: https://doi.org/10.5555/1413370.1413382
Citations: 33
Adapting a message-driven parallel application to GPU-accelerated clusters
James C. Phillips, J. Stone, K. Schulten
Graphics processing units (GPUs) have become an attractive option for accelerating scientific computations as a result of advances in the performance and flexibility of GPU hardware, and due to the availability of GPU software development tools targeting general purpose and scientific computation. However, effective use of GPUs in clusters presents a number of application development and system integration challenges. We describe strategies for the decomposition and scheduling of computation among CPU cores and GPUs, and techniques for overlapping communication and CPU computation with GPU kernel execution. We report the adaptation of these techniques to NAMD, a widely-used parallel molecular dynamics simulation package, and present performance results for a 64-core 64-GPU cluster.
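The overlap technique the abstract describes can be sketched in miniature (a hedged illustration only: `transfer` and `kernel` are hypothetical stand-ins for a host-device copy and a GPU kernel launch, and a single worker thread plays the role of the copy engine):

```python
from concurrent.futures import ThreadPoolExecutor

def transfer(chunk):          # stand-in for a host-to-device copy
    return [float(x) for x in chunk]

def kernel(buf):              # stand-in for a GPU kernel launch
    return sum(buf)

def pipelined(chunks):
    """Overlap the transfer of chunk i+1 with the kernel on chunk i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        fut = copier.submit(transfer, chunks[0])
        for nxt in chunks[1:]:
            buf = fut.result()                  # wait for current copy
            fut = copier.submit(transfer, nxt)  # start next copy...
            results.append(kernel(buf))         # ...while computing
        results.append(kernel(fut.result()))    # drain the last chunk
    return results
```

The design point is the same as in the paper: work on chunk i hides the communication latency of chunk i+1, so neither the copy engine nor the compute resource sits idle.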
DOI: https://doi.org/10.1109/SC.2008.5214716
Citations: 185
Scientific application-based performance comparison of SGI Altix 4700, IBM POWER5+, and SGI ICE 8200 supercomputers
S. Saini, Dale Talcott, D. Jespersen, M. J. Djomehri, Haoqiang Jin, R. Biswas
The suitability of next-generation high-performance computing systems for petascale simulations will depend on various performance factors attributable to processor, memory, local and global network, and input/output characteristics. In this paper, we evaluate the performance of the new dual-core SGI Altix 4700, quad-core SGI Altix ICE 8200, and dual-core IBM POWER5+ systems. To measure performance, we used micro-benchmarks from the High Performance Computing Challenge (HPCC), the NAS Parallel Benchmarks (NPB), and four real-world applications - three from computational fluid dynamics (CFD) and one from climate modeling. We used the micro-benchmarks to develop a controlled understanding of individual system components, then analyzed and interpreted the performance of the NPBs and applications. We also explored the hybrid programming model (MPI+OpenMP) using the multi-zone NPBs and the CFD application OVERFLOW-2. Achievable application performance is compared across the systems. For the ICE platform, we also investigated the effect of memory bandwidth on performance by testing 1, 2, 4, and 8 cores per node.
DOI: https://doi.org/10.1145/1413370.1413378
Citations: 35
The cost of doing science on the cloud: The Montage example
E. Deelman, Gurmeet Singh, M. Livny, B. Berriman, J. Good
Utility grids such as the Amazon EC2 cloud and Amazon S3 offer computational and storage resources that can be used on-demand for a fee by compute and data-intensive applications. The cost of running an application on such a cloud depends on the compute, storage and communication resources it will provision and consume. Different execution plans of the same application may result in significantly different costs. Using the Amazon cloud fee structure and a real-life astronomy application, we study via simulation the cost performance tradeoffs of different execution and resource provisioning plans. We also study these trade-offs in the context of the storage and communication fees of Amazon S3 when used for long-term application data archival. Our results show that by provisioning the right amount of storage and compute resources, cost can be significantly reduced with no significant impact on application performance.
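The core tradeoff the abstract studies - recompute intermediates versus store them - reduces to comparing plan costs under a fee model. A minimal sketch, with the caveat that the rates below are illustrative placeholders, not actual Amazon EC2/S3 prices or the paper's figures:

```python
def run_cost(cpu_hours, gb_stored, gb_out,
             cpu_rate=0.10, store_rate=0.15, out_rate=0.17):
    """Total cost of one execution plan under a simple utility-grid
    fee model: compute time + storage + outbound data transfer.
    All rates are made-up illustrative values in $/hour or $/GB."""
    return (cpu_hours * cpu_rate
            + gb_stored * store_rate
            + gb_out * out_rate)

# Plan A recomputes intermediate data; plan B stores it instead,
# trading CPU hours for storage fees:
recompute = run_cost(cpu_hours=20, gb_stored=1, gb_out=5)
store     = run_cost(cpu_hours=12, gb_stored=10, gb_out=5)
```

With these (hypothetical) rates, recomputation is the cheaper plan; flip the rates and the conclusion flips, which is exactly why the paper evaluates plans against the provider's actual fee structure.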
DOI: https://doi.org/10.1109/SC.2008.5217932
Citations: 815
Extending CC-NUMA systems to support write update optimizations
Liqun Cheng, J. Carter
Processor stalls and protocol messages caused by coherence misses limit the performance of shared memory applications. Modern multiprocessors employ write-invalidate coherence protocols, which induce read misses to ensure consistency. Previous research has shown that an invalidate protocol is not optimal for all memory access patterns - an update protocol can significantly outperform an invalidate protocol when data is heavily shared or accessed in predictable patterns. However, update protocols can generate excessive network traffic and are difficult to build on a scalable (non-bus) interconnect. To obtain the benefits of both invalidate and update protocols, we built a speculative sequentially consistent write-update mechanism on top of a write-invalidate protocol. To ensure coherence, a processor wishing to write to a block of data uses a traditional write-invalidate protocol to obtain exclusive access to the block before modifying it. To improve performance, the writing processor can later self-downgrade the modified block to the shared state and flush it back to its home node, which forwards the new data to processors that it predicts are likely to consume the data. We present a practical and cost-effective design for extending CC-NUMA systems to support this speculative update mechanism that requires no changes to the processor core, bus interface, or memory consistency model. We also present two hardware-efficient mechanisms for detecting access patterns that benefit from the speculative update mechanism: stable reader set and stream. We evaluate our update mechanisms on a wide range of scientific benchmarks and commercial applications. Using a cycle-accurate execution-driven simulator of a future 16-node SGI multiprocessor, we find that the mechanisms proposed in this paper reduce the average remote miss rate by 30%, reduce network traffic by 15%, and improve performance by 10%, and in no case hurt performance.
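The "stable reader set" predictor can be illustrated in software terms (a hedged sketch of the idea only - the paper implements this in hardware, and the window size `k` here is an assumed parameter, not the paper's): a block's update candidates are the readers that reappeared in every one of the last few write intervals.

```python
def stable_readers(history, k=2):
    """Readers present in each of the last k write intervals.

    `history` is a list of reader sets, one per interval between
    consecutive writes to a block. Nodes in the intersection of the
    last k intervals are 'stable' and would be sent speculative
    updates; everyone else falls back to plain invalidation.
    """
    recent = history[-k:]
    if len(recent) < k:
        return set()          # not enough history to predict yet
    stable = set(recent[0])
    for readers in recent[1:]:
        stable &= set(readers)
    return stable
```

Restricting updates to this intersection is what keeps the update traffic bounded: transient readers never receive speculative data, so the mechanism degenerates gracefully to pure write-invalidate for unpredictable sharing.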
DOI: https://doi.org/10.1145/1413370.1413401
Citations: 8
Dynamically adapting file domain partitioning methods for collective I/O based on underlying parallel file system locking protocols
W. Liao, A. Choudhary
Collective I/O, such as that provided in MPI-IO, enables process collaboration among a group of processes for greater I/O parallelism. Its implementation involves file domain partitioning, and having the right partitioning is a key to achieving high-performance I/O. As modern parallel file systems maintain data consistency by adopting a distributed file locking mechanism to avoid centralized lock management, different locking protocols can have a significant impact on the degree of parallelism of a given file domain partitioning method. In this paper, we propose dynamic file partitioning methods that adapt to the underlying locking protocols of the parallel file system and evaluate the performance of four partitioning methods under two locking protocols. Our experiments with multiple I/O benchmarks demonstrate that no single partitioning method guarantees the best performance. Using MPI-IO as an implementation platform, we provide guidelines for selecting the most appropriate partitioning methods for various I/O patterns and file systems.
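One common lock-aware partitioning (a sketch of the general idea, not necessarily one of the paper's four methods): split the file into one contiguous domain per aggregator, but round each boundary to a multiple of the file system's lock unit so that no two aggregators ever contend for the same lock.

```python
def lock_aligned_domains(file_size, n_aggregators, lock_size):
    """Partition [0, file_size) into contiguous per-aggregator
    domains whose interior boundaries fall on lock-unit multiples.

    Equal-sized domains would make adjacent aggregators share the
    lock covering the split point; snapping boundaries to lock_size
    removes that false sharing at the cost of slight imbalance.
    """
    bounds = [0]
    for i in range(1, n_aggregators):
        ideal = i * file_size // n_aggregators
        bounds.append(round(ideal / lock_size) * lock_size)
    bounds.append(file_size)
    return list(zip(bounds, bounds[1:]))

domains = lock_aligned_domains(1000, 3, 64)
```

Here every interior boundary is a multiple of 64, so each lock unit is written by exactly one aggregator - the property that makes a partitioning perform well under a block-granular locking protocol.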
DOI: https://doi.org/10.1145/1413370.1413374
Citations: 103
Nimrod/K: Towards massively parallel dynamic Grid workflows
D. Abramson, C. Enticott, I. Altintas
A challenge for Grid computing is the difficulty in developing software that is parallel, distributed and highly dynamic. Whilst there have been many general purpose mechanisms developed over the years, Grid programming still remains a low level, error prone task. Scientific workflow engines can double as programming environments, and allow a user to compose 'virtual' Grid applications from pre-existing components. Whilst existing workflow engines can specify arbitrary parallel programs (where components use message passing), they are typically not effective with large and variable parallelism. Here we discuss dynamic dataflow, originally developed for parallel tagged dataflow architectures (TDAs), and show that these can be used for implementing Grid workflows. TDAs spawn parallel threads dynamically without additional programming. We have added TDAs to Kepler, and show that the system can orchestrate workflows that have large amounts of variable parallelism. We demonstrate the system using case studies in chemistry and in cardiac modelling.
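The tagged-dataflow behaviour - one dynamically spawned instance of an actor per tagged token, with no parallelism expressed by the programmer - can be mimicked in a few lines (an illustrative sketch only; `tagged_fanout` is a hypothetical helper, not Nimrod/K or Kepler API):

```python
from concurrent.futures import ThreadPoolExecutor

def tagged_fanout(actor, tokens, max_workers=8):
    """Fire one copy of `actor` per (tag, value) token, as a
    tagged-dataflow engine would, and collect results by tag.

    The degree of parallelism follows the number of tokens at
    runtime - nothing about it is fixed in the 'workflow'."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {tag: pool.submit(actor, value)
                   for tag, value in tokens}
        return {tag: f.result() for tag, f in futures.items()}

# e.g. a parameter sweep: one actor instance per parameter value
results = tagged_fanout(lambda x: x * x, [(i, i) for i in range(4)])
```

The tag is what lets results from concurrently executing instances be matched back up downstream - the same role coloured tokens play in a TDA.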
DOI: https://doi.org/10.1109/SC.2008.5215726
Citations: 84
EpiSimdemics: An efficient algorithm for simulating the spread of infectious disease over large realistic social networks
C. Barrett, K. Bisset, S. Eubank, Xizhou Feng, M. Marathe
Preventing and controlling outbreaks of infectious diseases such as pandemic influenza is a top public health priority. We describe EpiSimdemics - a scalable parallel algorithm to simulate the spread of contagion in large, realistic social contact networks using individual-based models. EpiSimdemics is an interaction-based simulation of a certain class of stochastic reaction-diffusion processes. Straightforward simulations of such processes do not scale well, limiting the use of individual-based models to very small populations. EpiSimdemics is specifically designed to scale to social networks with 100 million individuals. The scaling is obtained by exploiting the semantics of disease evolution and disease propagation in large networks. We evaluate an MPI-based parallel implementation of EpiSimdemics on a mid-sized HPC system, demonstrating that EpiSimdemics scales well. EpiSimdemics has been used in numerous sponsor-defined case studies targeted at policy planning and course-of-action analysis, demonstrating the usefulness of EpiSimdemics in practical situations.
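To make "interaction-based" concrete, here is a toy single-step SIR update over an explicit contact list (a didactic sketch of the model class, not EpiSimdemics' algorithm or its parallel decomposition; the one-step recovery rule is a simplifying assumption):

```python
import random

def sir_step(contacts, state, p_transmit, rng):
    """One discrete time step of an interaction-based SIR model.

    Every contact between an infected (I) and a susceptible (S)
    individual transmits independently with probability p_transmit,
    evaluated against the state at the start of the step; all
    individuals infected at the start of the step then recover (R).
    """
    new_state = dict(state)
    for a, b in contacts:
        for src, dst in ((a, b), (b, a)):
            if (state[src] == 'I' and state[dst] == 'S'
                    and rng.random() < p_transmit):
                new_state[dst] = 'I'
    for person, s in state.items():
        if s == 'I':
            new_state[person] = 'R'
    return new_state
```

The interactions (edges), not the individuals, drive the computation - which is exactly the property EpiSimdemics exploits to partition work across processors by contact rather than by person.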
DOI: https://doi.org/10.1109/SC.2008.5214892
Citations: 302
Using overlays for efficient data transfer over shared wide-area networks
Gaurav Khanna, Ümit V. Çatalyürek, T. Kurç, R. Kettimuthu, P. Sadayappan, Ian T Foster, J. Saltz
Data-intensive applications frequently transfer large amounts of data over wide-area networks. The performance achieved in such settings can often be improved by routing data via intermediate nodes chosen to increase aggregate bandwidth. We explore the benefits of overlay network approaches by designing and implementing a service-oriented architecture that incorporates two key optimizations - multi-hop path splitting and multi-pathing - within the GridFTP file transfer protocol. We develop a file transfer scheduling algorithm that incorporates the two optimizations in conjunction with the use of available file replicas. The algorithm makes use of information from past GridFTP transfers to estimate network bandwidths and resource availability. The effectiveness of these optimizations is evaluated using several application file transfer patterns: one-to-all broadcast, all-to-one gather, and data redistribution, on a wide-area testbed. The experimental results show that our architecture and algorithm achieve significant performance improvement.
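The multi-hop path-splitting decision reduces to a bottleneck-bandwidth comparison: route via an intermediate node only if the weaker of its two hops still beats the direct edge. A minimal sketch (illustrative only - the paper's scheduler also weighs replicas, multi-pathing, and historical GridFTP measurements):

```python
def best_overlay_path(bw, src, dst):
    """Choose between the direct edge and each two-hop overlay path,
    maximizing bottleneck bandwidth.

    `bw` maps a directed (a, b) pair to an estimate of available
    bandwidth; missing pairs are treated as unreachable (0).
    """
    nodes = {endpoint for edge in bw for endpoint in edge}
    best_path, best_bw = (src, dst), bw.get((src, dst), 0)
    for mid in nodes - {src, dst}:
        bottleneck = min(bw.get((src, mid), 0), bw.get((mid, dst), 0))
        if bottleneck > best_bw:
            best_path, best_bw = (src, mid, dst), bottleneck
    return best_path, best_bw
```

When the direct wide-area edge is congested but both overlay hops are fast, the two-hop route wins even though it moves the data twice - the effect the testbed experiments measure.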
DOI: https://doi.org/10.1145/1413370.1413418
Citations: 48
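The multi-pathing optimization described in the abstract amounts to dividing one file transfer across several overlay paths in proportion to each path's estimated bandwidth, so that all paths finish at roughly the same time. The sketch below illustrates that idea only; the path names, bandwidth figures, and `split_transfer` helper are hypothetical and are not part of the GridFTP implementation in the paper.

```python
# Illustrative sketch (not the paper's implementation): split a file
# across multiple overlay paths in proportion to each path's estimated
# bandwidth, as in the multi-pathing optimization described above.
# Path names and bandwidth figures are hypothetical.

def split_transfer(file_size_bytes, path_bandwidths_mbps):
    """Assign a byte range of the file to each path, sized proportionally
    to that path's estimated bandwidth, so all paths finish together."""
    total_bw = sum(path_bandwidths_mbps.values())
    chunks, offset = {}, 0
    paths = list(path_bandwidths_mbps)
    for i, path in enumerate(paths):
        if i == len(paths) - 1:
            size = file_size_bytes - offset  # last path takes the remainder
        else:
            size = file_size_bytes * path_bandwidths_mbps[path] // total_bw
        chunks[path] = (offset, offset + size)
        offset += size
    return chunks

# Two overlay paths: a direct route and one via an intermediate node.
chunks = split_transfer(1_000_000_000, {"direct": 400, "via-nodeB": 600})
print(chunks)  # each path's share is proportional to its bandwidth (40% / 60%)
```

In the paper's setting, the per-path bandwidth estimates come from logs of past GridFTP transfers rather than being fixed constants as here.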
Capturing performance knowledge for automated analysis
K. Huck, Oscar R. Hernandez, Van Bui, S. Chandrasekaran, B. Chapman, A. Malony, L. McInnes, B. Norris
Automating the process of parallel performance experimentation, analysis, and problem diagnosis can enhance environments for performance-directed application development, compilation, and execution. This is especially true when parametric studies, modeling, and optimization strategies require large amounts of data to be collected and processed for knowledge synthesis and reuse. This paper describes the integration of the PerfExplorer performance data mining framework with the OpenUH compiler infrastructure. OpenUH provides auto-instrumentation of source code for performance experimentation, and PerfExplorer provides automated and reusable analysis of the performance data through a scripting interface. More importantly, PerfExplorer inference rules have been developed to recognize and diagnose performance characteristics important for optimization strategies and modeling. Three case studies are presented that show our success with automation in OpenMP and MPI code tuning, parametric characterization, and power modeling. The paper discusses how the integration supports performance knowledge engineering across applications and feedback-based compiler optimization in general.
doi: 10.1109/SC.2008.5222642
Citations: 26
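The inference rules mentioned in the abstract map measured performance characteristics to diagnoses that can drive optimization. The sketch below shows the general shape of such rule-based diagnosis; the specific rules, metric names, and thresholds are hypothetical and are not PerfExplorer's actual rules.

```python
# Hedged sketch of rule-based performance diagnosis in the spirit of
# inference rules over collected metrics. The rule bodies, metric names,
# and thresholds here are hypothetical illustrations.

def diagnose(metrics, rules):
    """Apply every (condition, diagnosis) rule; collect diagnoses that fire."""
    return [msg for cond, msg in rules if cond(metrics)]

rules = [
    (lambda m: m["mpi_time"] / m["total_time"] > 0.3,
     "communication-bound: consider overlapping MPI with computation"),
    (lambda m: m["l2_miss_rate"] > 0.1,
     "poor cache locality: consider loop tiling or data layout changes"),
    (lambda m: m["load_imbalance"] > 0.2,
     "load imbalance: consider dynamic scheduling"),
]

# One trial's metrics, as a performance framework might supply them.
metrics = {"mpi_time": 42.0, "total_time": 100.0,
           "l2_miss_rate": 0.04, "load_imbalance": 0.35}
report = diagnose(metrics, rules)
print(report)
```

In the paper, such rules run inside PerfExplorer's scripting interface over data mined from many experiments, rather than over a single hand-built dictionary as in this sketch.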