
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Publications

Characterizing and Modeling Power and Energy for Extreme-Scale In-Situ Visualization
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.113
Vignesh Adhinarayanan, Wu-chun Feng, D. Rogers, J. Ahrens, S. Pakin
Plans for exascale computing have identified power and energy as looming problems for simulations running at that scale. In particular, writing to disk all the data generated by these simulations is becoming prohibitively expensive due to the energy consumption of the supercomputer while it idles waiting for data to be written to permanent storage. In addition, the power cost of data movement is also steadily increasing. A solution to this problem is to write only a small fraction of the data generated while still maintaining the cognitive fidelity of the visualization. With domain scientists increasingly amenable to adopting an in-situ framework that can identify and extract valuable data from extremely large simulation results and write them to permanent storage as compact images, a large-scale simulation will commit to disk a reduced dataset of data extracts that will be much smaller than the raw results, resulting in savings in both power and energy. The goal of this paper is two-fold: (i) to understand the role of in-situ techniques in combating power and energy issues of extreme-scale visualization and (ii) to create a model for performance, power, energy, and storage to facilitate what-if analysis. Our experiments on a specially instrumented, dedicated 150-node cluster show that while it is difficult to achieve power savings in practice using in-situ techniques, applications can achieve significant energy savings due to shorter write times for in-situ visualization. We present a characterization of power and energy for in-situ visualization; an application-aware, architecture-specific methodology for modeling and analysis of such in-situ workflows; and results that uncover indirect power savings in visualization workflows for high-performance computing (HPC).
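As a rough, self-contained illustration of the kind of what-if analysis the abstract mentions, the Python sketch below compares the energy of a post-hoc workflow (write all raw data, visualize later) against an in-situ workflow (render during the run, write only compact extracts). Every number and the single-term energy model are assumptions for illustration, not values or formulas from the paper.

```python
# Minimal what-if sketch (not the paper's model): per-node energy for post-hoc
# versus in-situ visualization. All parameters are illustrative assumptions.

def run_energy(node_power_w, sim_time_s, vis_time_s, data_bytes, write_bw_bytes_s):
    """Energy (J) = average node power x (simulation + visualization + write time)."""
    write_time_s = data_bytes / write_bw_bytes_s
    return node_power_w * (sim_time_s + vis_time_s + write_time_s)

# Post-hoc: no in-line rendering, but the full raw dataset goes to disk.
post_hoc = run_energy(node_power_w=350.0, sim_time_s=3600.0, vis_time_s=0.0,
                      data_bytes=2e12, write_bw_bytes_s=1e9)

# In-situ: extra render time, but only compact image extracts are written.
in_situ = run_energy(node_power_w=350.0, sim_time_s=3600.0, vis_time_s=120.0,
                     data_bytes=5e9, write_bw_bytes_s=1e9)

print(f"post-hoc: {post_hoc/1e6:.2f} MJ, in-situ: {in_situ/1e6:.2f} MJ")
```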
Citations: 7
Accommodating Thread-Level Heterogeneity in Coupled Parallel Applications
S. Gutierrez, K. Davis, D. Arnold, R. Baker, R. Robey, P. McCormick, Daniel Holladay, J. Dahl, R. Zerr, Florian Weik, Christoph Junghans
Hybrid parallel program models that combine message passing and multithreading (MP+MT) are becoming more popular, extending the basic message passing (MP) model that uses single-threaded processes for both inter- and intra-node parallelism. A consequence is that coupled parallel applications increasingly comprise MP libraries together with MP+MT libraries with differing preferred degrees of threading, resulting in thread-level heterogeneity. Retroactively matching threading levels between independently developed and maintained libraries is difficult; the challenge is exacerbated because contemporary parallel job launchers provide only static resource binding policies over entire application executions. A standard approach for accommodating thread-level heterogeneity is to under-subscribe compute resources such that the library with the highest degree of threading per process has one processing element per thread. This results in libraries with fewer threads per process utilizing only a fraction of the available compute resources. We present and evaluate a novel approach for accommodating thread-level heterogeneity. Our approach enables full utilization of all available compute resources throughout an application's execution by providing programmable facilities to dynamically reconfigure runtime environments for compute phases with differing threading factors and memory affinities. We show that our approach can improve overall application performance by up to 5.8x in real-world production codes. Furthermore, the practicality and utility of our approach has been demonstrated by continuous production use for over one year, and by more recent incorporation into a number of production codes.
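The arithmetic behind the under-subscription baseline is easy to see with a toy example; the node size and per-library threading degrees below are assumed values, not the paper's configuration.

```python
# Toy illustration of the under-subscription baseline described above.

cores_per_node = 36
threads = {"mp_library": 1, "mp_mt_library": 12}   # preferred threads per process

# Static launch sized for the most-threaded library: one core per thread.
procs_per_node = cores_per_node // max(threads.values())   # 3 processes per node

for name, t in threads.items():
    used = procs_per_node * t
    print(f"{name}: {used}/{cores_per_node} cores busy "
          f"({100.0 * used / cores_per_node:.0f}% utilization)")
# The single-threaded library's phases leave 33 of 36 cores idle; dynamic
# per-phase reconfiguration (the paper's approach) avoids this waste.
```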
Citations: 12
Data Centric Performance Measurement Techniques for Chapel Programs
Hui Zhang, J. Hollingsworth
Chapel is an emerging PGAS (Partitioned Global Address Space) language whose design goal is to make parallel programming more productive and generally accessible. To date, the implementation effort has focused primarily on correctness over performance. We present a performance measurement technique for Chapel and the idea is also applicable to other PGAS models. The unique feature of our tool is that it associates the performance statistics not to the code regions (functions), but to the variables (including the heap allocated, static, and local variables) in the source code. Unlike code-centric methods, this data-centric analysis capability exposes new optimization opportunities that are useful in resolving data locality problems. This paper introduces our idea and implementations of the approach with three benchmarks. We also include a case study optimizing benchmarks based on the information from our tool. The optimized versions improved the performance by a factor of 1.4x for LULESH, 2.3x for MiniMD, and 2.1x for CLOMP with simple modifications to the source code.
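A minimal sketch of what data-centric attribution can look like in general, under the assumption that profile samples carry memory addresses that are mapped back to the variable whose allocation covers them; this illustrates the idea only and is not the tool's actual implementation.

```python
# Map sampled addresses to source-level variables and accumulate metrics per
# variable instead of per function. Allocation records and samples are fake.
from bisect import bisect_right
from collections import defaultdict

# (start_address, size, variable_name) records captured at allocation time.
allocations = sorted([(0x1000, 4096, "grid_A"), (0x9000, 8192, "halo_buf")])
starts = [a[0] for a in allocations]

def owning_variable(addr):
    i = bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = allocations[i]
        if start <= addr < start + size:
            return name
    return "<unknown>"

cache_misses = defaultdict(int)
for sample_addr in (0x1010, 0x9100, 0x9200, 0x5000):   # pretend profile samples
    cache_misses[owning_variable(sample_addr)] += 1

print(dict(cache_misses))   # {'grid_A': 1, 'halo_buf': 2, '<unknown>': 1}
```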
Citations: 7
Memory Compression Techniques for Network Address Management in MPI
Yanfei Guo, C. Archer, M. Blocksome, Scott Parker, Wesley Bland, Kenneth Raffenetti, P. Balaji
MPI allows applications to treat processes as a logical collection of integer ranks for each MPI communicator, while internally translating these logical ranks into actual network addresses. In current MPI implementations the management and lookup of such network addresses use memory sizes that are proportional to the number of processes in each communicator. In this paper, we propose a new mechanism, called AV-Rankmap, for managing such translation. AV-Rankmap takes advantage of logical patterns in rank-address mapping that most applications naturally tend to have, and it exploits the fact that some parts of network address structures are naturally more performance critical than others. It uses this information to compress the memory used for network address management. We demonstrate that AV-Rankmap can achieve performance similar to or better than that of other MPI implementations while using significantly less memory.
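The pattern-exploitation idea can be shown with a hedged sketch: if a communicator's ranks map onto world ranks with a fixed offset and stride, the whole lookup table collapses to two integers. The code below illustrates that general idea only; it is not the AV-Rankmap data structure itself.

```python
# Compress a rank-to-world-rank table when it follows a simple strided pattern,
# falling back to the full table otherwise. Illustrative only.

def compress_rank_map(world_ranks):
    """Return ('strided', offset, stride) if ranks follow offset + i*stride,
    otherwise ('table', list) as the uncompressed fallback."""
    if len(world_ranks) >= 2:
        offset, stride = world_ranks[0], world_ranks[1] - world_ranks[0]
        if all(r == offset + i * stride for i, r in enumerate(world_ranks)):
            return ("strided", offset, stride)
    return ("table", list(world_ranks))

def lookup(mapping, rank):
    if mapping[0] == "strided":
        _, offset, stride = mapping
        return offset + rank * stride
    return mapping[1][rank]

m = compress_rank_map([4, 8, 12, 16, 20])      # O(1) memory instead of O(n)
print(m, lookup(m, 3))                          # ('strided', 4, 4) 16
```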
Citations: 8
DC^2-MTCP: Light-Weight Coding for Efficient Multi-Path Transmission in Data Center Network
Jiyan Sun, Yan Zhang, Xin Wang, Shihan Xiao, Zhen Xu, Hongjing Wu, Xin Chen, Yanni Han
Multi-path TCP has recently shown great potential to take advantage of the rich path diversity in data center networks (DCN) to increase transmission throughput. However, the small flows, which take a large fraction of data center traffic, will easily get a timeout when split onto multiple paths. Moreover, the dynamic congestions and node failures in DCN will exacerbate the reorder problem of parallel multi-path transmissions for large flows. In this paper, we propose DC2-MTCP (Data Center Coded Multi-path TCP), which employs a fast and light-weight coding method to address the above challenges while maintaining the benefit of parallel multi-path transmissions. To meet the high flow performance in DCN, we insert a very low ratio of coded packets with a careful selection of the packets to be coded. We further present a progressive decoding algorithm to decode the packets online with a low time complexity. Extensive ns2-based simulations show that with two orders of magnitude lower coding delay, DC2-MTCP can reduce on average 40% flow completion time for small flows and increase 30% flow throughput for large flows compared to the peer schemes in varying network conditions.
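As a loose illustration of light-weight coding (generic XOR parity, not DC^2-MTCP's actual code), the sketch below adds one coded packet per group of data packets so that a single packet lost or badly delayed on one subflow can be rebuilt at the receiver without waiting for a timeout.

```python
# XOR parity over a group of k equal-length data packets: any single missing
# packet in the group can be reconstructed from the k survivors.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode_group(packets):
    parity = packets[0]
    for p in packets[1:]:
        parity = xor_bytes(parity, p)
    return packets + [parity]                    # k + 1 packets on the wire

def recover(survivors):
    rebuilt = survivors[0]
    for p in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, p)
    return rebuilt

group = [b"AAAA", b"BBBB", b"CCCC"]              # coding ratio 1/3 here; the
coded = encode_group(group)                      # paper targets a much lower ratio
lost = 1                                         # pretend packet 1 never arrived
survivors = [p for i, p in enumerate(coded) if i != lost]
assert recover(survivors) == group[lost]
```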
Citations: 5
Language-Based Optimizations for Persistence on Nonvolatile Main Memory Systems
J. Denny, Seyong Lee, J. Vetter
Substantial advances in nonvolatile memory (NVM) technologies have motivated wide-spread integration of NVM into mobile, enterprise, and HPC systems. Recently, considerable research has focused on architectural integration of NVM and respective programming systems, exploiting NVM's trait of persistence correctly and efficiently. In this regard, we design several novel language-based optimization techniques for programming NVM and demonstrate them as an extension of our NVL-C system. Specifically, we focus on optimizing the performance of atomic updates to complex data structures residing in NVM. We build on two variants of automatic undo logging: canonical undo logging, and shadow updates. We show these techniques can be implemented transparently and efficiently, using dynamic selection and other logging optimizations. Our empirical results on several applications gathered on an NVM testbed illustrate that our cost-model-based dynamic selection technique can accurately choose the best logging variant across different NVM modes and input sizes. In comparison to statically choosing canonical undo logging, this improvement reduces execution time to as little as 53% for block-addressable NVM and 73% for emulated byte-addressable NVM on a Fusion-io ioScale device.
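A hedged sketch of cost-model-driven dynamic selection between the two logging variants follows; the cost formulas and constants are invented for illustration, whereas the paper derives its model from the target architecture.

```python
# Pick between canonical undo logging and shadow updates using a toy cost model.

def undo_logging_cost(bytes_modified, num_stores, persist_latency_ns=500):
    # Canonical undo logging: persist an undo record before each in-place store.
    return num_stores * persist_latency_ns + bytes_modified * 0.5

def shadow_update_cost(object_size, bytes_modified, persist_latency_ns=500):
    # Shadow update: copy the object, modify the copy, then atomically swap pointers.
    return object_size * 0.5 + bytes_modified * 0.5 + persist_latency_ns

def choose_variant(object_size, bytes_modified, num_stores):
    undo = undo_logging_cost(bytes_modified, num_stores)
    shadow = shadow_update_cost(object_size, bytes_modified)
    return ("undo-logging", undo) if undo <= shadow else ("shadow-update", shadow)

# Sparse update of a large structure favors undo logging; rewriting most of a
# small object favors a shadow copy.
print(choose_variant(object_size=1 << 20, bytes_modified=64, num_stores=2))
print(choose_variant(object_size=4096, bytes_modified=4000, num_stores=500))
```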
Citations: 7
Autonomic Resource Management for Program Orchestration in Large-Scale Data Analysis
Masahiro Tanaka, K. Taura, Kentaro Torisawa
Large-scale data analysis applications are becoming more and more prevalent in a wide variety of areas. These applications are composed of many currently available programs called analysis components. Thousands of analysis component processes are orchestrated on many compute nodes. This paper proposes a novel self-tuning framework for optimizing an application's throughput in large-scale data analysis. One challenge is developing efficient orchestration that takes into account the diversity of analysis components and the varying performances of compute nodes. In our previous work, we achieved such an orchestration to a certain degree by introducing our own middleware, which wraps each analysis component as a remote procedure call (RPC) service. The middleware also pools the processes to reduce startup overhead, which is a serious obstacle to achieving high throughput. This work tackles the remaining task of tuning the size of the analysis components' process pools to maximize the application's throughput. This is challenging because analysis components differ drastically in turnaround times and memory footprints. The size of the process pool for each type of analysis component should be set by giving consideration to these properties as well as the constraints on both the memory capacity and the processor core counts. In this work, we formulate this task as a linear programming problem and obtain the optimal pool sizes by solving it. Compared to our previous work, we significantly improved the scalability of our framework by reformulating the performance model to work on hundreds of heterogeneous nodes. We also extended the service allocation mechanism to manage the computational load on each compute node and reduce communication overhead. The experimental results show that our approach is scalable to thousands of analysis component processes running on 200 compute nodes across three clusters. Moreover, our approach significantly reduces memory footprint.
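One plausible way to phrase the pool-sizing task as a linear program (an assumed formulation, not necessarily the paper's exact one) is to maximize a throughput variable t subject to each component's pool keeping up with t and to core and memory budgets. The SciPy sketch below solves the continuous relaxation and rounds the result; all rates and footprints are assumed.

```python
from scipy.optimize import linprog

rate = [50.0, 8.0, 20.0]      # items/s one process of each component can handle
mem = [0.5, 4.0, 1.5]         # GiB per process (assumed footprints)
cores = [1.0, 2.0, 1.0]       # cores per process
MEM_BUDGET, CORE_BUDGET = 64.0, 36.0

# Variables: x = [n_0, n_1, n_2, t]; maximize t  ==  minimize -t.
c = [0.0, 0.0, 0.0, -1.0]
A_ub = [
    mem + [0.0],                                   # sum mem_i * n_i <= MEM_BUDGET
    cores + [0.0],                                 # sum cores_i * n_i <= CORE_BUDGET
    [-rate[0], 0.0, 0.0, 1.0],                     # t <= rate_0 * n_0
    [0.0, -rate[1], 0.0, 1.0],                     # t <= rate_1 * n_1
    [0.0, 0.0, -rate[2], 1.0],                     # t <= rate_2 * n_2
]
b_ub = [MEM_BUDGET, CORE_BUDGET, 0.0, 0.0, 0.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(1, None)] * 3 + [(0, None)])
pools = [round(n) for n in res.x[:3]]              # relax-and-round; a real system
print("pool sizes:", pools, "throughput:", res.x[3])   # would handle integrality
```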
Citations: 0
A Scalable and Resilient Microarchitecture Based on Multiport Binding for High-Radix Router Design
Yi Dai, Kefei Wang, G. Qu, Liquan Xiao, Dezun Dong, Xingyun Qi
High-radix routers with low latency and high bandwidth play an increasingly important role in the design of large-scale interconnection networks such as those used in super-computers and datacenters. The tile-based crossbar approach partitions a single large crossbar into many small tiles and can considerably reduce the complexity of arbitration while providing throughput higher than the conventional switch implementation. However, it is not scalable due to power consumption, placement, and routing problems. In this paper, we propose a truly scalable router microarchitecture called Multiport Binding Tile-based Router (MBTR). By aggregating multiple physical ports into a single tile a high-radix router can be flexibly organized into a different array of tiles, thus the number of tiles and hardware overhead can be considerably reduced. Compared with a hierarchical crossbar, MBTR achieves up to 50%∼75% reduction in memory consumption as well as wire area. Simulation results demonstrate MBTR is indistinguishable from the YARC router in terms of throughput and delay, and can even outperform it by reducing potential contention for output ports. We have fabricated an ASIC MBTR chip with 28nm technology. Internally, it runs at 700MHz and 30ns latency without any speedup. We also discuss how the microarchitecture parameters of MBTR can be adjusted based on the power, area, and design complexity constraints of the arbitration logic.
Citations: 8
Computational Challenges in Constructing the Tree of Life
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.128
T. Warnow
Estimating the Tree of Life is one of the grand computational challenges in Science, and has applications to many areas of science and biomedical research. Despite intensive research over the last several decades, many problems remain inadequately solved. Relatively small datasets can take hundreds of CPU years (e.g., the Avian Phylogenomics Project analysis of just 48 bird genomes used more than 200 CPU years to construct its tree), and larger datasets will require much more time. Thus, the estimation of the Tree of Life, which contains millions of species each with a genome containing millions of nucleotides, will depend on both novel algorithmic designs and effective use of high performance and distributed computing platforms.
Citations: 2
Bounded Reordering Allows Efficient Reliable Message Transmission
Keishla D. Ortiz-Lopez, J. Welch
In the reliable message transmission problem (RMTP), processors communicate by exchanging messages, but the channel that connects two processors is subject to message loss, duplication, and reordering. Previous work focused on proposing protocols in asynchronous systems, where message size is finite and sequence numbers are bounded. However, if the channel can duplicate messages-but not lose them-and arbitrarily reorder the messages, the problem is unsolvable. We consider a strengthening of the asynchronous model in which reordering of messages is bounded. In this model, we develop an efficient protocol to solve the RMTP when messages may be duplicated but not lost. This result is in contrast to the impossibility of such an algorithm when reordering is unbounded. Our protocol has the pleasing property that no messages need to be sent from the receiver to the sender, and, with some minimal modifications, it works when message loss is allowed.
Citations: 0