
Latest publications from the 2008 IEEE International Conference on Cluster Computing

Reinforcement learning for automated performance tuning: Initial evaluation for sparse matrix format selection
Pub Date : 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663802
Warren Armstrong, Alistair P. Rendell
The field of reinforcement learning has developed techniques for choosing beneficial actions within a dynamic environment. Such techniques learn from experience and do not require teaching. This paper explores how reinforcement learning techniques might be used to determine efficient storage formats for sparse matrices. Three different storage formats are considered: coordinate, compressed sparse row, and blocked compressed sparse row. Which format performs best depends heavily on the nature of the matrix and the computer system being used. To test the above, a program has been written to generate a series of sparse matrices, where any given matrix performs optimally using one of the three storage types. For each matrix several sparse matrix-vector products are performed. The goal of the learning agent is to predict the optimal sparse matrix storage format for that matrix. The proposed agent uses five attributes of the sparse matrix: the number of rows, the number of columns, the number of non-zero elements, the standard deviation of non-zeroes per row, and the mean number of neighbours. The agent is characterized by two parameters: an exploration rate and a parameter that determines how the state space is partitioned. The ability of the agent to successfully predict the optimal storage format is analyzed for a series of 1,000 automatically generated test matrices.
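The abstract does not give the agent's learning rule; a minimal epsilon-greedy sketch is shown below, where `state_key`, its bin thresholds, and the reward signal (negative measured SpMV time) are all illustrative assumptions, not the paper's implementation.

```python
import random
from collections import defaultdict

FORMATS = ["COO", "CSR", "BCSR"]  # coordinate, compressed sparse row, blocked CSR

def state_key(rows, cols, nnz, row_std, mean_neighbours, bins=4):
    """Coarsely partition the five matrix attributes into state bins
    (the paper's partitioning parameter); thresholds are illustrative."""
    def bucket(x, scale):
        return min(bins - 1, int(x / scale))
    return (bucket(rows, 10_000), bucket(cols, 10_000), bucket(nnz, 100_000),
            bucket(row_std, 5.0), bucket(mean_neighbours, 2.0))

class FormatAgent:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon  # exploration rate
        self.q = defaultdict(lambda: {f: 0.0 for f in FORMATS})
        self.n = defaultdict(lambda: {f: 0 for f in FORMATS})

    def choose(self, state):
        if random.random() < self.epsilon:   # explore
            return random.choice(FORMATS)
        qs = self.q[state]                   # exploit the best known format
        return max(qs, key=qs.get)

    def update(self, state, fmt, reward):
        """Incremental mean of observed reward, e.g. negative SpMV time."""
        self.n[state][fmt] += 1
        self.q[state][fmt] += (reward - self.q[state][fmt]) / self.n[state][fmt]
```

After a few timed products per format, `choose` converges on the format with the best observed reward for that region of attribute space.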
Citations: 8
A novel hint-based I/O mechanism for centralized file server of cluster
Pub Date : 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663771
Huan Chen, Jin Xiong, Ninghui Sun
In small and medium cluster systems, a centralized file server such as NFS is the main approach to providing storage service with low cost and easy management. However, when multiple parallel applications access the shared storage at the same time, I/O performance drops sharply because I/O requests coming from different clients interfere with one another. In this paper, a hint-based I/O mechanism is proposed and implemented in the United-FS. By analyzing the hint information of the I/O requests, the related requests are grouped, sorted and scheduled by our hint-based I/O scheduler. The experiments show that our hint-based I/O mechanism nearly doubles the read performance compared with NFS, and has better scalability.
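The group-sort-schedule step can be sketched as follows; the request tuple layout `(hint_id, offset, length)` is an assumption for illustration, not the United-FS interface.

```python
from itertools import groupby

def schedule(requests):
    """Group queued I/O requests by their hint (e.g. the owning
    application) and sort each group by file offset, so the server sees
    sequential runs instead of interleaved streams from different clients.
    Each request is an assumed (hint_id, offset, length) tuple."""
    ordered = sorted(requests, key=lambda r: (r[0], r[1]))
    return [list(g) for _, g in groupby(ordered, key=lambda r: r[0])]
```

Dispatching one group at a time keeps each client's stream contiguous on disk, which is where the read-performance gain over plain NFS ordering comes from.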
Citations: 3
Message progression in parallel computing - to thread or not to thread?
Pub Date : 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663774
T. Hoefler, A. Lumsdaine
Message progression schemes that enable communication and computation to be overlapped have the potential to improve the performance of parallel applications. With currently available high-performance networks there are several options for making progress: manual progression, use of a progress thread, and communication offload. In this paper we analyze threaded progression approaches, comparing the effects of using shared or dedicated CPU cores for progression. To perform these comparisons, we propose time-based and work-based benchmark schemes. As expected, threaded progression performs well when a spare core is available to be dedicated to communication progression, but a number of operating system effects prevent the same benefits from being obtained when communication progress must share a core with computation. We show that some limited performance improvement can be obtained in the shared-core case by real-time scheduling of the progress thread.
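The threaded-progression idea can be sketched with Python threads standing in for the MPI progress engine; `ProgressThread`, `FakeRequest`, and the polling interval are illustrative assumptions, and a real progress thread would call `MPI_Test` on actual nonblocking requests.

```python
import threading
import time

class ProgressThread:
    """Poll pending communication requests in the background so transfers
    advance while the main thread computes."""
    def __init__(self, poll_interval=0.001):
        self.pending = []
        self.lock = threading.Lock()
        self.done = threading.Event()
        self.poll_interval = poll_interval
        self.thread = threading.Thread(target=self._run, daemon=True)

    def post(self, request):
        with self.lock:
            self.pending.append(request)

    def _run(self):
        while not self.done.is_set():
            with self.lock:
                # "test" each request; completed ones are retired
                self.pending = [r for r in self.pending if not r.test()]
            time.sleep(self.poll_interval)

    def start(self):
        self.thread.start()

    def stop(self):
        self.done.set()
        self.thread.join()

class FakeRequest:
    """A stand-in transfer that completes after a fixed wall-clock delay."""
    def __init__(self, seconds):
        self.deadline = time.monotonic() + seconds
        self.completed = False

    def test(self):
        if time.monotonic() >= self.deadline:
            self.completed = True
        return self.completed
```

Whether this thread shares a core with the computation or gets a dedicated spare core is exactly the trade-off the paper measures.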
Citations: 105
DWC2: A dynamic weight-based cooperative caching scheme for object-based storage cluster
Pub Date : 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663768
Q. Wei, B. Veeravalli, Lingfang Zeng
Object-based storage is emerging as a next generation of distributed storage technology. Aiming at improving the performance and load balancing of large-scale object-based storage systems, we present a dynamic weight-based cooperative caching scheme referred to as DWC2, which allows an object-based storage device (OSD) to use the available free cache of neighbouring OSDs. Our proposed DWC2 replaces objects based on their weights, a function of object size, popularity and replica number, and dynamically partitions the memory of an OSD into local cache and remote cache according to the activity workload. Object data is cached in the local cache or the remote cache of the cooperative OSDs, thus increasing the cache hit ratio, reducing expensive disk access time and improving load balance. We benchmarked our proposed DWC2 against existing cooperative caching schemes under various OSD environments. Our rigorous experimental results demonstrate that DWC2 is scalable and achieves significant improvements in cache hit ratio, average response time, and load balancing for large-scale OSD clusters.
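The abstract names the weight's inputs but not its form; one plausible weight and a matching eviction loop are sketched below, with the exact formula and all names being assumptions.

```python
def weight(size_bytes, popularity, replicas):
    """Illustrative weight: small, popular, poorly-replicated objects are
    worth keeping; heavily replicated ones can be refetched elsewhere.
    The real DWC2 function is not given in the abstract."""
    return popularity / (size_bytes * replicas)

def evict_until_fits(cache, needed, capacity):
    """Evict lowest-weight objects until `needed` bytes fit.
    cache maps name -> (size_bytes, popularity, replicas)."""
    used = sum(v[0] for v in cache.values())
    victims = sorted(cache, key=lambda k: weight(*cache[k]))
    evicted = []
    for name in victims:
        if used + needed <= capacity:
            break
        used -= cache[name][0]
        evicted.append(name)
        del cache[name]
    return evicted
```

In the cooperative setting, an evicted-but-popular object could be pushed into a neighbouring OSD's remote-cache partition instead of being dropped outright.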
Citations: 3
A trace-driven emulation framework to predict scalability of large clusters in presence of OS Jitter
Pub Date : 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663776
Pradipta De, Ravina Kothari, V. Mann
Various studies have pointed out the debilitating effects of OS jitter on the performance of parallel applications on large clusters such as the ASCI Purple and the Mare Nostrum at Barcelona Supercomputing Center. These clusters use commodity OSes such as AIX and Linux respectively. The biggest hindrance in evaluating any technique to mitigate jitter is getting access to such large scale production HPC systems running a commodity OS. An earlier attempt aimed at solving this problem was to emulate the effects of OS jitter on more widely available and jitter-free systems such as BlueGene/L. In this paper, we point out the shortcomings of previous such approaches and present the design and implementation of an emulation framework that helps overcome those shortcomings by using innovative techniques. We collect jitter traces on a commodity OS with a given configuration, under which we want to study the scaling behavior. These traces are then replayed on a jitter-free system to predict scalability in presence of OS jitter. The application of this emulation framework to predict scalability is illustrated through a comparative scalability study of an off-the-shelf Linux distribution with a minimal configuration (runlevel 1) and a highly optimized embedded Linux distribution, running on the IO nodes of BlueGene/L. We validate the results of our emulation both on a single node as well as on a real cluster. Our results indicate that an optimized OS along with a technique to synchronize jitter can reduce the performance degradation due to jitter from 99% (in case of the off-the-shelf Linux without any synchronization) to a much more tolerable level of 6% (in case of highly optimized BlueGene/L IO node Linux with synchronization) at 2048 processors. Furthermore, perfect synchronization can give linear scaling with less than 1% slowdown, regardless of the type of OS used. 
However, as the jitter at different nodes starts getting desynchronized, even with a minor skew across nodes, the optimized OS starts outperforming the off-the-shelf OS.
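The amplification that makes desynchronized jitter costly, and the benefit of synchronizing it, can be illustrated with a tiny trace-replay model; this is a toy stand-in for the paper's framework, and all names are assumptions.

```python
import random

def predict_slowdown(traces, compute=1.0, iterations=1000, synchronized=False):
    """Replay per-node jitter traces under a bulk-synchronous model: every
    iteration ends in a barrier, so each step costs the *maximum* of
    (compute + jitter) across nodes. traces is a list of per-node jitter
    samples (seconds stolen by the OS per step). Unsynchronized nodes draw
    samples independently, so one unlucky node delays everyone."""
    total = 0.0
    for _ in range(iterations):
        if synchronized:
            # all nodes hit their daemons at the same step
            i = random.randrange(min(len(t) for t in traces))
            total += max(compute + t[i] for t in traces)
        else:
            total += max(compute + random.choice(t) for t in traces)
    return total / (iterations * compute)
```

With many nodes, the unsynchronized max is almost always the worst-case jitter sample, while synchronizing the interruptions pays the cost only on the steps where it actually occurs; this mirrors the paper's observation that synchronization recovers near-linear scaling.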
Citations: 13
High message rate, NIC-based atomics: Design and performance considerations
Pub Date : 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663764
K. Underwood, M. Levenhagen, K. Hemmert, R. Brightwell
Remote atomic memory operations are critical for achieving high-performance synchronization in tightly-coupled systems. Previous approaches to implementing atomic memory operations on high-performance networks have explored providing the primitives necessary to achieve low latency and low host processor overhead. In this paper, we explore the implementation of atomic memory operations with a focus on achieving high message rate. We believe that high message rate is a key performance characteristic that will determine the viability of a high-performance network to support future multi-petascale systems, especially those that expect to employ a partitioned global address space (PGAS) programming model. As an example, many have proposed using network interface level atomic operations to enhance the performance of the HPCC RandomAccess benchmark. This paper explores several issues relevant to the design of an atomic unit on the network interface. We explore the implications of the size of the cache as well as the associativity. Given the growing ratio of bandwidth to latency of modern host interfaces, we explore some of the interactions that impact the concurrency needed to saturate the interface.
Citations: 4
Runtime DVFS control with instrumented Code in power-scalable cluster system
Pub Date : 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663795
Hideaki Kimura, M. Sato, Takayuki Imada, Y. Hotta
Recently, several energy reduction techniques using DVFS have been presented for PC clusters. This work proposes a Code-instrumented Runtime DVFS control, in which the combination of frequency and voltage (called a gear) is managed by instrumented code at runtime. The instrumentation is inserted by defining program regions that have the same characteristics. The Code-instrumented Runtime DVFS control method improves on the Interrupt-based Runtime DVFS control method, in which the gear is managed by a periodic interrupt, because it can use program information to control DVFS. Though Static DVFS control, which makes use of a power profile collected before execution, gives better energy reduction, the proposed Code-instrumented Runtime DVFS control is easier to use because it requires no prior information such as a profile. The proposed DVFS control method was designed and implemented. The beta-adaptation algorithm was used at runtime to choose the appropriate gear. The results show that the proposed method improves performance and energy consumption compared with Interrupt-based Runtime DVFS control. Although our Code-instrumented Runtime DVFS control can select lower voltages and frequencies than the present Runtime DVFS control given a certain deadline, unfortunately, it was also found to increase the power consumption of the PC cluster due to an increase in execution time.
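The beta-adaptation idea referred to above selects the lowest gear whose predicted slowdown stays within a budget; a minimal sketch using the standard beta execution-time model is shown below, with the frequency list and budget being illustrative values rather than the paper's configuration.

```python
def select_gear(frequencies, beta, max_slowdown):
    """Pick the lowest frequency whose predicted slowdown stays within
    budget, using the beta execution-time model

        T(f) / T(f_max) = beta * (f_max / f - 1) + 1,

    where beta in [0, 1] measures how CPU-bound the region is (beta = 1:
    fully CPU-bound, slows linearly; beta = 0: memory- or
    communication-bound, frequency-insensitive). frequencies are gear
    frequencies in GHz; max_slowdown is e.g. 1.05 for a 5% budget."""
    f_max = max(frequencies)
    feasible = [f for f in frequencies
                if beta * (f_max / f - 1) + 1 <= max_slowdown]
    return min(feasible)  # f_max is feasible whenever max_slowdown >= 1
```

A memory-bound region (small beta) can be run at a low gear almost for free, while a CPU-bound region forces the top gear; the instrumented code supplies a per-region beta at runtime.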
Citations: 14
Prediction of behavior of MPI applications
Pub Date : 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663777
Marc Casas, Rosa M. Badia, Jesús Labarta
Scalability and performance of applications is a very important issue today. As high-performance architectures have become more complex, it has become harder to predict the behavior of a given application running on them. In this paper, we propose a methodology which, from a very limited number of runs using very few processors, automatically and quickly predicts the scalability and performance of a given application across a wide range of supercomputers, taking into account details of the architecture and network of the machines.
Citations: 8
In search of sweet-spots in parallel performance monitoring
Pub Date : 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663757
A. Nataraj, A. Malony, A. Morris, D. Arnold, B. Miller
Parallel performance monitoring extends parallel measurement systems with infrastructure and interfaces for online performance data access, communication, and analysis. At the same time it raises concerns for the impact on application execution from monitor overhead. The application monitoring scheme parameterized by performance events to monitor, access frequency and the type of data analysis operation defines a set of monitoring requirements. The monitoring infrastructure presents its own choices, particularly the amount and configuration of resources devoted explicitly to monitoring. The key to scalable, low-overhead parallel performance monitoring is to match the application monitoring demands to the effective operating range of the monitoring system (or vice-versa). A poor match can result in over-provisioning (wasted resources) or in under-provisioning (lack of scalability, high overheads and poor quality of performance data). We present a methodology and evaluation framework to determine the sweet-spots for performance monitoring using TAU and MRNet.
Cited by: 3
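The idea of matching monitoring demand to an operating range can be made concrete with a toy sizing model. This is purely illustrative and not the paper's actual framework: the function names, the fan-out-based load estimate, and the 25% utilization threshold are all hypothetical assumptions, chosen only to show how a configuration might be classified as over- or under-provisioned.

```python
import math

def tree_depth(back_ends, fan_out):
    """Levels of intermediate monitor processes needed to funnel
    back_ends leaf (application) processes through a reduction tree."""
    depth, nodes = 0, back_ends
    while nodes > 1:
        nodes = math.ceil(nodes / fan_out)
        depth += 1
    return depth

def provisioning(event_rate, per_node_capacity, fan_out):
    """Classify a monitoring configuration (toy model).

    event_rate        -- events/s produced by each application process
    per_node_capacity -- events/s one intermediate monitor node can reduce
    fan_out           -- children per intermediate node
    """
    load_per_node = event_rate * fan_out      # worst case at the lowest level
    if load_per_node > per_node_capacity:
        return "under-provisioned"            # overhead, data loss likely
    if load_per_node < 0.25 * per_node_capacity:
        return "over-provisioned"             # monitor resources wasted
    return "sweet-spot"

# E.g. 1024 back-ends with fan-out 16 need a 3-level reduction tree,
# and at 2000 events/s per process each monitor node sees 32000 events/s.
print(tree_depth(1024, fan_out=16))
print(provisioning(event_rate=2000, per_node_capacity=50000, fan_out=16))
```

Raising the fan-out shrinks the tree but concentrates load on each intermediate node, which is exactly the kind of trade-off a sweet-spot search sweeps over.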
A large-grained parallel algorithm for nonlinear eigenvalue problems and its implementation using OmniRPC 非线性特征值问题的大粒度并行算法及其使用 OmniRPC 的实现
Pub Date : 2008-10-31 DOI: 10.1109/CLUSTR.2008.4663754
Takeshi Amako, Yusaku Yamamoto, Shaoliang Zhang
The nonlinear eigenvalue problem plays an important role in various fields such as nonlinear elasticity, electronic structure calculation, and theoretical fluid dynamics. We recently proposed a new algorithm for the nonlinear eigenvalue problem, which reduces the original problem to a smaller generalized linear eigenvalue problem with Hankel coefficient matrices through a complex contour integral. A unique feature of this algorithm is that it can find all the eigenvalues within a closed curve on the complex plane. Moreover, it has large-grain parallelism and is well suited for execution in a grid environment. In this paper, we study the numerical properties of our algorithm theoretically. In particular, we analyze the effect of numerical integration on the computed eigenvalues and give a guideline on how to choose the size of the Hankel matrices properly. We also show the parallel performance of our algorithm implemented on a PC cluster using OmniRPC, a grid RPC system. A parallel efficiency of 75% is achieved when solving a nonlinear eigenvalue problem of order 1000 using 14 processors.
The nonlinear eigenvalue problem plays an important role in fields such as nonlinear elasticity, electronic structure calculation, and theoretical fluid dynamics. We recently proposed a new algorithm for the nonlinear eigenvalue problem, which reduces the original problem to a smaller generalized linear eigenvalue problem with Hankel coefficient matrices through a complex contour integral. A unique feature of this algorithm is that it can find all the eigenvalues within a closed curve on the complex plane. Moreover, it has large-grain parallelism and is well suited for execution in a grid environment. In this paper, we study the numerical properties of our algorithm theoretically. In particular, we analyze the effect of numerical integration on the computed eigenvalues and give a guideline on how to choose the size of the Hankel matrices properly. We also show the parallel performance of our algorithm implemented on a PC cluster using OmniRPC, a grid RPC system. A parallel efficiency of 75% is achieved when solving a nonlinear eigenvalue problem of order 1000 using 14 processors.
{"title":"A large-grained parallel algorithm for nonlinear eigenvalue problems and its implementation using OmniRPC","authors":"Takeshi Amako, Yusaku Yamamoto, Shaoliang Zhang","doi":"10.1109/CLUSTR.2008.4663754","DOIUrl":"https://doi.org/10.1109/CLUSTR.2008.4663754","url":null,"abstract":"The nonlinear eigenvalue problem plays an important role in various fields such as nonlinear elasticity, electronic structure calculation and theoretical fluid dynamics. We recently proposed a new algorithm for the nonlinear eigenvalue problem, which reduces the original problem to a smaller generalized linear eigenvalue problem with Hankel coefficient matrices through complex contour integral. This algorithm has a unique feature that it can find all the eigenvalues in a closed curve on the complex plane. Moreover, it has large-grain parallelism and is suited for execution in a grid environment. In this paper, we study the numerical properties of our algorithm theoretically. In particular, we analyze the effect of numerical integration to the computed eigenvalues and give a guideline on how to choose the size of the Hankel matrices properly. Also, we show the parallel performance of our algorithm implemented on a PC cluster using OmniRPC, a grid RPC system. Parallel efficiency of 75% is achieved when solving a nonlinear eigenvalue problem of order 1000 using 14 processors.","PeriodicalId":198768,"journal":{"name":"2008 IEEE International Conference on Cluster Computing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131370676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 7
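The contour-integral reduction described in the abstract can be sketched for the linear special case T(z) = A − zI, where the result is checkable against a known spectrum. This is an illustrative reconstruction of a Sakurai–Sugiura-type moment/Hankel scheme, not the authors' exact formulation; the probe vectors, trapezoidal quadrature rule, and parameter names below are assumptions. Complex moments μ_k = (1/2πi) ∮ z^k uᵀT(z)⁻¹v dz are accumulated over quadrature nodes on the circle, then the eigenvalues inside the contour fall out of a small m×m Hankel generalized eigenproblem.

```python
import numpy as np
from scipy.linalg import eig

def contour_eigs(T, gamma, rho, n, m=2, N=64, seed=0):
    """Eigenvalues of T(z) inside the circle |z - gamma| = rho, via
    complex moments and a small m-by-m Hankel generalized eigenproblem."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)                # random probe vectors
    v = rng.standard_normal(n)
    mu = np.zeros(2 * m, dtype=complex)
    for j in range(N):                        # trapezoidal rule on the circle
        z = gamma + rho * np.exp(2j * np.pi * j / N)
        w = (z - gamma) / N                   # weight, absorbing dz/(2*pi*i)
        s = u @ np.linalg.solve(T(z), v)      # one linear solve per node --
        for k in range(2 * m):                # these independent solves are
            mu[k] += w * z**k * s             # the large-grained parallel tasks
    H  = np.array([[mu[i + j]     for j in range(m)] for i in range(m)])
    Hs = np.array([[mu[i + j + 1] for j in range(m)] for i in range(m)])
    return eig(Hs, H, right=False)            # eigenvalues of the small pencil

# Check on T(z) = A - z*I with spectrum {1, 2, 5, 10}; the circle
# |z - 1.5| = 1 encloses exactly {1, 2}.
A = np.diag([1.0, 2.0, 5.0, 10.0])
lam = contour_eigs(lambda z: A - z * np.eye(4), gamma=1.5, rho=1.0, n=4)
print(np.sort(lam.real))   # approximately [1. 2.]
```

In a grid setting each quadrature node's linear solve T(z_j)⁻¹v can be farmed out as an independent task (e.g. via OmniRPC), which is the source of the large-grain parallelism the abstract refers to; the final Hankel pencil is tiny and solved serially.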
Journal: 2008 IEEE International Conference on Cluster Computing