
ROSS@ICS: Latest Publications

Reduction of operating system jitter caused by page reclaim
Pub Date: 2014-06-10 | DOI: 10.1145/2612262.2612270
Y. Oyama, Shun Ishiguro, J. Murakami, Shin Sasaki, R. Matsumiya, O. Tatebe
Operating system jitter is one of the major causes of runtime overhead in high-performance computing applications. Jitter results from the kernel's execution of services such as interrupt handling and tasklets, or from the execution of daemon processes that provide operating system services, such as memory management daemons. This execution interrupts application computations and increases their execution time. Jitter significantly affects applications in which many processes or threads frequently synchronize with each other. In this paper, we investigate the impact of jitter caused by reclaiming memory pages, and propose a method for reducing that impact. The target operating system is Linux. When the Linux kernel runs out of memory, it awakens a special kernel thread to reclaim memory pages that are unlikely to be used in the near future. If this kernel thread is awakened frequently, its resource consumption degrades application performance. The proposed method reclaims memory pages ahead of the kernel thread, and reclaims more pages at a time, thus reducing both the frequency of page reclaim and the impact of jitter. We implement a system based on the proposed method and conduct an experiment using practical weather forecast software. Results show that the proposed method minimizes the performance degradation caused by jitter.
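The paper's mechanism lives inside the kernel, but the core idea (watch free memory and reclaim a large batch before the reclaim thread is forced to wake) can be sketched in user space. The 512 MiB threshold, the 5 s poll interval, and the use of /proc/sys/vm/drop_caches as the reclaim action (which requires root) are illustrative assumptions, not the authors' implementation:

```c
/* Sketch: a user-space stand-in for proactive page reclaim.
 * Threshold, interval, and drop_caches are illustrative only;
 * the paper's method works inside the kernel itself. */
#include <stdio.h>
#include <unistd.h>

/* Read MemFree (in kB) from /proc/meminfo. */
static long mem_free_kb(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "MemFree: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

int main(void)
{
    const long low_kb = 512 * 1024;          /* assumed trigger */

    for (;;) {
        long free_kb = mem_free_kb();
        if (free_kb >= 0 && free_kb < low_kb) {
            /* Reclaim a large batch at once so reclaim happens
             * rarely and stays off the application's critical path. */
            FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
            if (f) {
                fputs("1\n", f);             /* drop clean page cache */
                fclose(f);
            }
        }
        sleep(5);
    }
}
```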
Citations: 5
Overhead of a decentralized gossip algorithm on the performance of HPC applications
Pub Date: 2014-06-10 | DOI: 10.1145/2612262.2612271
Ely Levy, A. Barak, A. Shiloh, Matthias Lieber, C. Weinhold, Hermann Härtig
Gossip algorithms can provide online information about the availability and state of resources in supercomputers. These algorithms require minimal computing and storage capabilities at each node, and when properly tuned they are not expected to overload the nodes or the network that connects them. These properties make gossip interesting for future exascale systems. This paper examines the overhead of a decentralized gossip algorithm on the performance of parallel MPI applications running on up to 8192 nodes of an IBM BlueGene/Q supercomputer. The applications used in the experiments include PTRANS and MPI-FFT from the HPCC benchmark suite as well as the coupled weather and cloud simulation model COSMO-SPECS+FD4. In most cases, no gossip overhead was observed when the gossip messages were sent at intervals of 256 ms or more. As expected, the overhead observed at higher rates is sensitive to the communication pattern of the application and the amount of gossip information being circulated.
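A minimal sketch of one decentralized push-gossip step over MPI may clarify what is being measured: each rank periodically sends a small piece of local state to a random peer and merges whatever has arrived. The double payload, the max-merge rule, and the 256 ms pacing are assumptions for illustration; the paper's actual algorithm and message contents differ:

```c
/* Sketch: one "push" gossip exchange per round over MPI.
 * Compile with mpicc; payload and merge rule are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    srand(rank + 1);

    double local = rank;                     /* stand-in node state */
    for (int round = 0; round < 10; round++) {
        MPI_Request req;
        int peer = rand() % size;
        MPI_Isend(&local, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);

        /* Drain incoming gossip without blocking the application. */
        for (;;) {
            int flag;
            MPI_Status st;
            MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &st);
            if (!flag)
                break;
            double remote;
            MPI_Recv(&remote, 1, MPI_DOUBLE, st.MPI_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (remote > local)
                local = remote;              /* merge: keep the max */
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        usleep(256 * 1000);                  /* the paper's safe rate */
    }
    /* A real implementation would drain outstanding messages here. */
    printf("rank %d converged toward %.0f\n", rank, local);
    MPI_Finalize();
    return 0;
}
```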
Citations: 6
VMM emulation of Intel hardware transactional memory
Pub Date: 2014-06-10 | DOI: 10.1145/2612262.2612265
Maciej Swiech, Kyle C. Hale, P. Dinda
We describe the design, implementation, and evaluation of emulated hardware transactional memory, specifically the Intel Haswell Restricted Transactional Memory (RTM) architectural extensions for x86/64, within a virtual machine monitor (VMM). Our system allows users to investigate RTM on hardware that does not provide it, debug their RTM-based transactional software, and stress test it on diverse emulated hardware configurations, including potential future configurations that might support arbitrary length transactions. Initial performance results suggest that we are able to accomplish this approximately 60 times faster than under a full emulator. A noteworthy aspect of our system is a novel page-flipping technique that allows us to completely avoid instruction emulation, and to limit instruction decoding to only that necessary to determine instruction length. This makes it possible to implement RTM emulation, and potentially other techniques, far more compactly than would otherwise be possible. We have implemented our system in the context of the Palacios VMM. Our techniques are not specific to Palacios, and could be implemented in other VMMs.
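For readers unfamiliar with RTM, the guest-side code the VMM must emulate looks like the following: a transaction with a lock-based fallback path. The intrinsics are Intel's real RTM API from immintrin.h (build with gcc -mrtm -pthread); on CPUs without RTM the xbegin instruction faults (#UD) unless a layer like this paper's VMM emulates it underneath:

```c
/* Sketch: the transactional guest code an RTM emulator must handle. */
#include <immintrin.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void increment(void)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        counter++;                    /* speculative update */
        _xend();                      /* commit the transaction */
    } else {
        /* Aborted or conflicted: take the conventional lock. */
        pthread_mutex_lock(&fallback);
        counter++;
        pthread_mutex_unlock(&fallback);
    }
}

int main(void)
{
    increment();
    printf("counter = %ld\n", counter);
    return 0;
}
```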
Citations: 2
Hybrid MPI: a case study on the Xeon Phi platform
Pub Date: 2014-06-10 | DOI: 10.1145/2612262.2612267
U. Wickramasinghe, G. Bronevetsky, A. Lumsdaine, A. Friedley
New many-core architectures such as Intel Xeon Phi offer applications significantly higher power efficiency than conventional multi-core processors. However, while this processor's compute and communication performance is an excellent match for MPI applications, leveraging its potential in practice has proven difficult because of the mismatch between the MPI distributed memory model and this processor's shared memory communication hardware. Hybrid MPI is a high performance portable implementation of MPI designed for communication over shared memory hardware. It shares the heaps of all the MPI processes that run on the same node, enabling them to communicate directly without unnecessary copies. This paper describes our work to port Hybrid MPI to the Xeon Phi platform, demonstrating that Hybrid MPI offers better performance than the native Intel MPI implementation in terms of memory bandwidth, latency and benchmark performance.
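The shared-heap idea can be sketched with plain POSIX shared memory: two processes on the same node map one region, so an intra-node "message" reduces to a direct store with no intermediate copy. The region name, the sleep-based synchronization, and the layout below are illustrative assumptions, not Hybrid MPI's actual allocator:

```c
/* Sketch: intra-node messaging through a shared heap.
 * Build with gcc (add -lrt on older glibc). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define HEAP_SIZE (1 << 20)

int main(void)
{
    int fd = shm_open("/hybrid_heap_demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, HEAP_SIZE) < 0)
        return 1;
    char *heap = mmap(NULL, HEAP_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (heap == MAP_FAILED)
        return 1;

    if (fork() == 0) {                /* "rank 1": the receiver */
        sleep(1);                     /* crude sync, sketch only */
        printf("rank 1 received: %s\n", heap);
        _exit(0);
    }
    /* "rank 0": the message is written straight into the shared
     * heap; no kernel copy, no intermediate buffer. */
    strcpy(heap, "hello from rank 0");
    wait(NULL);
    shm_unlink("/hybrid_heap_demo");
    return 0;
}
```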
Citations: 5
Automatic SMT threading for OpenMP applications on the Intel Xeon Phi co-processor
Pub Date: 2014-06-10 | DOI: 10.1145/2612262.2612268
W. Heirman, Trevor E. Carlson, K. V. Craeynest, I. Hur, A. Jaleel, L. Eeckhout
Simultaneous multithreading is a technique that can improve performance when running parallel applications on the Intel Xeon Phi co-processor. Selecting the most efficient thread count is, however, non-trivial, as the potential increase in efficiency has to be balanced against other, potentially negative factors such as inter-thread competition for cache capacity and increased synchronization overheads. In this paper, we extend CRUST (ClusteR-aware Undersubscribed Scheduling of Threads), a technique for finding the optimum thread count of OpenMP applications running on clustered cache architectures, to take into account the behavior of simultaneous multithreading on the Xeon Phi. CRUST can automatically find the optimum thread count at sub-application granularity by exploiting application phase behavior at OpenMP parallel section boundaries, and uses hardware performance counter information to gain insight into the application's behavior. We implement a CRUST prototype inside the Intel OpenMP runtime library and show its efficiency running on real Xeon Phi hardware.
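A crude approximation of per-section thread-count tuning: time one OpenMP section at several counts and keep the fastest. CRUST itself makes this decision online from hardware performance counters and phase detection rather than an exhaustive timing scan, so the sketch below only illustrates the knob being tuned:

```c
/* Sketch: select the best thread count for one OpenMP section
 * by brute-force timing. Build with -fopenmp. */
#include <omp.h>
#include <stdio.h>

#define N (1 << 22)
static double a[N];

static void run_section(void)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = a[i] * 1.000001 + 0.5;   /* stand-in parallel work */
}

int main(void)
{
    int best_t = 1;
    double best_dt = 1e30;

    /* On a 61-core Xeon Phi the interesting candidates would be 1,
     * 2, 3, or 4 SMT threads per core; here we scan powers of two. */
    for (int t = 1; t <= omp_get_max_threads(); t *= 2) {
        omp_set_num_threads(t);
        double t0 = omp_get_wtime();
        run_section();
        double dt = omp_get_wtime() - t0;
        printf("%3d threads: %.4f s\n", t, dt);
        if (dt < best_dt) {
            best_dt = dt;
            best_t = t;
        }
    }
    omp_set_num_threads(best_t);        /* keep for later phases */
    printf("selected %d threads\n", best_t);
    return 0;
}
```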
Citations: 10
Revisiting virtual memory for high performance computing on manycore architectures: a hybrid segmentation kernel approach
Pub Date: 2014-06-10 | DOI: 10.1145/2612262.2612264
Yuki Soma, Balazs Gerofi, Y. Ishikawa
Page-based memory management (paging) is utilized by most current operating systems (OSs) due to its rich features, such as prevention of memory fragmentation and fine-grained access control. Page-based virtual memory, however, stores virtual-to-physical mappings in page tables that also reside in main memory. Because translating virtual to physical addresses requires walking the page tables, which in turn implies additional memory accesses, modern CPUs employ translation lookaside buffers (TLBs) to cache the mappings. Nevertheless, TLBs are limited in size, and applications that consume a large amount of memory and exhibit little or no locality in their memory access patterns, such as graph algorithms, suffer from the high overhead of TLB misses. This paper proposes a new hybrid kernel design targeting many-core CPUs, which manages the application's memory space by segmentation and offloads kernel services to dedicated CPU cores where paging is utilized. The method enables applications to run on top of low-cost segmented memory management while allowing the kernel to use the rich features of paging. We present the design and implementation of our kernel and demonstrate that segmentation can provide superior performance compared to both regular and large-page based virtual memory. For example, running Graph500 on top of our segmentation design on Intel's Xeon Phi chip yields up to 81% and 9% improvement compared to utilizing 4 kB and 2 MB pages in MPSS Linux, respectively.
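The TLB pressure this paper attacks can be reproduced on stock Linux by comparing random accesses over a large buffer mapped with base 4 kB pages versus 2 MB huge pages (MAP_HUGETLB); segmentation goes further still by eliminating the page-table walk altogether. The sizes and access pattern below are illustrative, and huge pages must be reserved beforehand (e.g., echo 512 > /proc/sys/vm/nr_hugepages):

```c
/* Sketch: expose the cost of TLB misses that segmentation avoids. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE  (1UL << 30)   /* 1 GiB working set */
#define STEPS (1UL << 24)

static double walk(char *buf)
{
    struct timespec t0, t1;
    unsigned long idx = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < STEPS; i++) {
        idx = (idx * 1103515245UL + 12345UL) % SIZE;  /* LCG stride */
        buf[idx]++;                   /* one TLB lookup per touch */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    char *base = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *huge = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (base == MAP_FAILED || huge == MAP_FAILED) {
        perror("mmap");               /* likely: no hugepages reserved */
        return 1;
    }
    printf("4 kB pages: %.2f s\n", walk(base));
    printf("2 MB pages: %.2f s\n", walk(huge));
    return 0;
}
```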
Citations: 11
mOS: an architecture for extreme-scale operating systems
Pub Date: 2014-06-10 | DOI: 10.1145/2612262.2612263
R. Wisniewski, T. Inglett, Pardo Keppel, Ravi Murty, R. Riesen
Linux®, or more specifically, the Linux API, plays a key role in HPC computing. Even for extreme-scale computing, a known and familiar API is required for production machines. However, an off-the-shelf Linux distribution faces challenges at extreme scale. To date, two approaches have been used to address the challenges of providing an operating system (OS) at extreme scale. In the Full-Weight Kernel (FWK) approach, an OS, typically Linux, forms the starting point, and work is undertaken to remove features from the environment so that it will scale up across more cores and out across a large cluster. A Light-Weight Kernel (LWK) approach often starts with a new kernel, and work is undertaken to add functionality to provide a familiar API, typically Linux. Either approach, however, results in an execution environment that is not fully Linux compatible. mOS (multi Operating System) runs both an FWK (Linux) and an LWK simultaneously as kernels on the same compute node. mOS thereby achieves the scalability and reliability of LWKs, while providing the full Linux functionality of an FWK. Further, mOS works in concert with Operating System Nodes (OSNs) to offload system calls, e.g., I/O, that are too invasive to run on the compute nodes at extreme scale. Beyond providing full Linux capability with LWK performance, other advantages of mOS include the ability to effectively manage different types of compute and memory resources, interface easily with proposed asynchronous and fine-grained runtimes, and nimbly manage new technologies. This paper is an architectural description of mOS. As a prototype is not yet finished, the contributions of this work are a description of mOS's architecture, an exploration of the tradeoffs and value of this approach for the purposes listed above, and a detailed architecture description of each of the six components of mOS, including the tradeoffs we considered. The uptick of OS research work indicates that many view this as an important area for getting to extreme scale. Thus, most importantly, the goal of the paper is to generate discussion in this area at the workshop.
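A toy sketch of the syscall triage at the center of this design: the LWK keeps latency-critical calls on the compute core and forwards invasive ones to the Linux kernel or an OS node. The partitioning table and function names below are invented for illustration; mOS's actual component boundaries are described in the paper:

```c
/* Sketch: triage syscalls between an LWK and Linux. The specific
 * partitioning (memory and timing stay local, the rest is forwarded)
 * is an assumption made for this illustration. */
#include <stdbool.h>
#include <stdio.h>
#include <sys/syscall.h>

/* Calls the lightweight kernel keeps on the compute core. */
static bool lwk_handles(long nr)
{
    switch (nr) {
    case SYS_mmap:
    case SYS_munmap:
    case SYS_brk:
    case SYS_clock_gettime:
        return true;                 /* latency-critical, noise-free */
    default:
        return false;                /* e.g. file I/O: too invasive */
    }
}

static void dispatch(long nr)
{
    if (lwk_handles(nr))
        printf("syscall %ld: handled locally by the LWK\n", nr);
    else
        printf("syscall %ld: forwarded to Linux / an OS node\n", nr);
}

int main(void)
{
    dispatch(SYS_brk);               /* stays on the compute core */
    dispatch(SYS_write);             /* offloaded */
    return 0;
}
```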
Citations: 69
An evaluation of BitTorrent's performance in HPC environments
Pub Date: 2014-06-10 | DOI: 10.1145/2612262.2612269
Matthew G. F. Dosanjh, P. Bridges, S. M. Kelly, J. Laros, C. Vaughan
A number of novel decentralized systems have recently been developed to address challenges of scale in large distributed systems. The suitability of such systems for meeting the challenges of scale in high performance computing (HPC) systems is unclear, however. In this paper, we begin to answer this question by examining the suitability of the popular BitTorrent protocol to handle dynamic shared library distribution in HPC systems. To that end, we describe the architecture and implementation of a system that uses BitTorrent to distribute shared libraries in HPC systems, evaluate and optimize BitTorrent protocol usage for the HPC environment, and measure the performance of the resulting system. Our results demonstrate the potential viability of BitTorrent-style protocols in HPC systems, but also highlight the challenges of these protocols. In particular, our results show that the protocol mechanisms meant to enforce fairness in a distributed computing environment can have a significant impact on system performance if not properly taken into account in system design and implementation.
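As a reminder of the protocol mechanics being tuned, the sketch below shows BitTorrent's rarest-first piece selection, one of the mechanisms (alongside choking for fairness) whose interaction with HPC workloads the paper evaluates. The availability counts are made up for illustration:

```c
/* Sketch: rarest-first piece selection. A node requests the piece it
 * lacks that the fewest swarm members hold, spreading rare pieces
 * of the shared library quickly. */
#include <stdio.h>

#define PIECES 8

int main(void)
{
    int seen[PIECES] = { 5, 1, 3, 7, 2, 1, 4, 6 };  /* swarm copies */
    int have[PIECES] = { 1, 0, 0, 1, 0, 0, 0, 0 };  /* local pieces */
    int pick = -1;

    for (int i = 0; i < PIECES; i++)
        if (!have[i] && (pick < 0 || seen[i] < seen[pick]))
            pick = i;

    printf("request piece %d (held by %d peers)\n", pick, seen[pick]);
    return 0;
}
```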
Citations: 2
Building blocks for an exa-scale operating system
Pub Date: 2014-06-10 | DOI: 10.1145/2612262.2627355
Hermann Härtig
Currently, high performance systems are mostly used by splitting them into fixed-size partitions which are completely owned and operated by applications. Hardware architecture designs strive to remove the operating system from the critical path, for example using techniques such as RDMA and busy waiting for synchronisation. Operating system functionality is restricted to batch schedulers that load and start applications and to I/O. Applications take over traditional operating system functionality such as balancing load over resources. In exa-scale computing, new challenges and opportunities may put an end to that mode of operation. These developments include applications too complex and too dynamic to do application-level balancing and hardware too diverse to maintain an application-level view of a fixed number of reliable and predictable resources. The talk will discuss examples of operating system building blocks at various system levels that may receive new appreciation in exa-scale supercomputing. These building blocks include schedulers, microkernels, library OSes, virtualization, execution time predictors and gossip algorithms that need to be combined into a coherent architecture.
Citations: 0
PICS: a performance-analysis-based introspective control system to steer parallel applications
Pub Date: 2014-06-10 | DOI: 10.1145/2612262.2612266
Yanhua Sun, J. Lifflander, L. Kalé
Parallel programming has always been difficult due to the complexity of hardware and the diversity of applications. Although significant progress has been achieved through the remarkable efforts of researchers in academia and industry, attaining high parallel efficiency on large supercomputers with millions of cores remains challenging for many applications. Therefore, performance tuning has become even more important and challenging than ever before. In this paper, we describe the design and implementation of PICS, a Performance-analysis-based Introspective Control System used to tune parallel programs. PICS provides a generic set of abstractions that expose application-specific knowledge to the runtime system. The abstractions are called control points: tunable parameters affecting application performance. Application behavior is observed, measured, and automatically analyzed by PICS. Based on the analysis results and expert knowledge rules, program characteristics are extracted to assist the search for optimal configurations of the control points. We have implemented PICS in Charm++, an asynchronous message-driven parallel programming model. We demonstrate the utility and effectiveness of PICS with several benchmarks and a real-world application.
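The control-point abstraction can be sketched as follows: the application registers a tunable knob with its legal range, and the runtime measures each phase and steers the knob toward the best-observed value. The struct and function names are invented for illustration; PICS's real API lives inside the Charm++ runtime and uses much richer analysis than this simple search:

```c
/* Sketch: a "control point" as a runtime-steered knob. The runtime
 * doubles the knob while timing phases, then settles on the best
 * observed value. Build with -fopenmp (for omp_get_wtime). */
#include <omp.h>
#include <stdio.h>

typedef struct {
    const char *name;
    int value, lo, hi;          /* current setting and legal range */
    int best_value;
    double best_time;
} control_point;

static void cp_observe(control_point *cp, double elapsed)
{
    if (elapsed < cp->best_time) {
        cp->best_time = elapsed;
        cp->best_value = cp->value;
    }
    /* Explore upward while legal, then settle on the best value. */
    cp->value = (cp->value * 2 <= cp->hi) ? cp->value * 2
                                          : cp->best_value;
}

int main(void)
{
    control_point grain = { "task_grain", 64, 64, 4096, 64, 1e30 };

    for (int phase = 0; phase < 8; phase++) {
        double t0 = omp_get_wtime();
        /* ... one application phase would run here, sizing its
         * tasks by grain.value ... */
        cp_observe(&grain, omp_get_wtime() - t0);
    }
    printf("steered %s to %d\n", grain.name, grain.best_value);
    return 0;
}
```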
Citations: 14