Scalable, high performance InfiniBand-attached SAN Volume Controller
D. S. Guthridge
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663807
We have developed a highly reliable InfiniBand host-attached block storage management and virtualization system that supports several off-the-shelf Fibre Channel RAID controllers on the back end. The system is based on the existing IBM TotalStorage SAN Volume Controller (SVC) product, and therefore offers strong performance, a wide array of storage virtualization features, and support for many existing storage controllers. We provide an overview of the driver design as well as performance results. Large read performance from SVC cache exceeds 3 GB/s in a minimal two-node cluster configuration.
{"title":"Scalable, high performance InfiniBand-attached SAN Volume Controller","authors":"D. S. Guthridge","doi":"10.1109/CLUSTR.2008.4663807","DOIUrl":"https://doi.org/10.1109/CLUSTR.2008.4663807","url":null,"abstract":"We have developed a highly reliable InfiniBand host attached block storage management and virtualization system that supports several off-the-shelf Fibre Channel RAID controllers on the back end. The system is based on the existing IBM TotalStorage SAN Volume Controller (SVC) product, and therefore offers performance, a wide array of storage virtualization features, and support for many existing storage controllers. We provide an overview of the driver design as well as performance results. Large read performance from SVC cache exceeds 3 GB/s in a minimal two-node cluster configuration.","PeriodicalId":198768,"journal":{"name":"2008 IEEE International Conference on Cluster Computing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128864973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving message passing over Ethernet with I/OAT copy offload in Open-MX
Brice Goglin
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663775
Open-MX is a new message passing layer implemented on top of the generic Ethernet stack of the Linux kernel. Open-MX works on all Ethernet hardware, but it suffers from expensive memory copy requirements on the receiver side due to the hardware's inability to deposit messages directly in the target application buffers.
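To make the copy overhead concrete, here is a minimal sketch of a receive path without copy offload. This is our simplified model, not Open-MX's actual code; all names are illustrative. The CPU copy in the loop is exactly the work an I/OAT DMA engine would take off the processor.

```python
# Minimal sketch (our simplified model): frames land in kernel-owned ring
# slots, and without receive-side copy offload the CPU must move each
# fragment into the application's buffer.

FRAGMENT_SIZE = 4096  # bytes per Ethernet-level fragment (illustrative)

def receive_message(ring_slots, app_buffer):
    """Reassemble a message by copying each received fragment.

    ring_slots: list of bytes objects, as deposited by the NIC driver.
    app_buffer: bytearray large enough for the full message.
    """
    offset = 0
    for frag in ring_slots:
        app_buffer[offset:offset + len(frag)] = frag  # the expensive CPU copy
        offset += len(frag)
    return offset  # bytes delivered

# Usage: reassembling a three-fragment message.
frags = [bytes(FRAGMENT_SIZE)] * 3
buf = bytearray(3 * FRAGMENT_SIZE)
assert receive_message(frags, buf) == 3 * FRAGMENT_SIZE
```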
{"title":"Improving message passing over Ethernet with I/OAT copy offload in Open-MX","authors":"Brice Goglin","doi":"10.1109/CLUSTR.2008.4663775","DOIUrl":"https://doi.org/10.1109/CLUSTR.2008.4663775","url":null,"abstract":"Open-MX is a new message passing layer implemented on top of the generic Ethernet stack of the Linux kernel. Open-MX works on all Ethernet hardware, but it suffers from expensive memory copy requirements on the receiver side due to the hardwarepsilas inability to deposit messages directly in the target application buffers.","PeriodicalId":198768,"journal":{"name":"2008 IEEE International Conference on Cluster Computing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126508667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Divisible load scheduling with improved asymptotic optimality
R. Suda
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663779
The divisible load model admits scheduling algorithms that achieve nearly optimal makespan with practical computational complexity. Beaumont et al. have shown that their algorithm produces a schedule whose makespan is within a factor of 1+O(1/√T) of the optimal solution as the total amount of tasks T scales up with all other conditions fixed. We have proposed an extension of their algorithm to multiple masters with heterogeneous processor performance, though limited to uniform network performance. This paper analyzes the asymptotic performance of our algorithm and shows that it achieves a ratio of 1+O(1/√T), 1+O(log T/T), or 1+O(1/T), depending on the problem. In the latter two cases, our algorithm asymptotically outperforms the algorithm of Beaumont et al.
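Restating the bounds from the abstract in standard notation (the makespan symbols C_max and C_max^opt are our notation, not necessarily the paper's):

```latex
% T = total amount of tasks; C_max = makespan of the produced schedule.
\[
  \frac{C_{\max}}{C_{\max}^{\mathrm{opt}}}
    \;\le\; 1 + O\!\left(\frac{1}{\sqrt{T}}\right)
  \qquad \text{(Beaumont et al.)}
\]
\[
  \frac{C_{\max}}{C_{\max}^{\mathrm{opt}}}
    \;\le\; 1 + O\!\left(\frac{1}{\sqrt{T}}\right),\quad
    1 + O\!\left(\frac{\log T}{T}\right),\quad \text{or}\quad
    1 + O\!\left(\frac{1}{T}\right)
  \qquad \text{(this paper, depending on the problem)}
\]
```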
{"title":"Divisible load scheduling with improved asymptotic optimality","authors":"R. Suda","doi":"10.1109/CLUSTR.2008.4663779","DOIUrl":"https://doi.org/10.1109/CLUSTR.2008.4663779","url":null,"abstract":"Divisible load model allows scheduling algorithms that give nearly optimal makespan with practical computational complexity. Beaumont et al. have shown that their algorithm produces a schedule whose makespan is within 1+O(1/radicT) times larger than the optimal solution when the total amount of tasks T scales up and the other conditions are fixed. We have proposed an extension of their algorithm for multiple masters with heterogeneous performance of processors but limited to uniform network performance. This paper analyzes the asymptotic performance of our algorithm, and shows that the asymptotic performance of our algorithm is either 1+O(1/radicT), 1+O(log T/T) or 1+O(1/T ), depending on the problem. For the latter two cases, our algorithm asymptotically outperforms the algorithm by Beaumont et al.","PeriodicalId":198768,"journal":{"name":"2008 IEEE International Conference on Cluster Computing","volume":"238 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115662559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting data compression in collective I/O techniques
Rosa Filgueira, D. E. Singh, J. C. Pichel, J. Carretero
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663811
This paper presents Two-Phase Compressed I/O (TPC I/O), an optimization of the Two-Phase collective I/O technique from ROMIO, the most popular MPI-IO implementation. To reduce network traffic, TPC I/O employs the LZO algorithm to compress and decompress the data exchanged in inter-node communication operations. The compression algorithm is fully integrated into the MPI collective technique, allowing compression to be enabled or disabled dynamically. Compared with Two-Phase I/O, Two-Phase Compressed I/O achieves significant improvements in overall execution time for many of the scenarios considered.
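As a rough illustration of the idea, the sketch below compresses a chunk before the exchange phase and decides dynamically whether compression pays off, falling back to raw bytes when it does not. zlib stands in for LZO here (the paper uses LZO; Python's standard library has no LZO binding), and all function names are ours.

```python
# Hedged sketch of the TPC I/O idea: compress data destined for other nodes,
# tag each payload, and decompress on arrival. zlib is a stand-in for LZO.
import os
import zlib

def pack_for_exchange(chunk: bytes, threshold: float = 0.9) -> bytes:
    """Compress a chunk for the inter-node exchange; send it raw when
    compression does not shrink it enough (dynamic use of compression)."""
    compressed = zlib.compress(chunk, 1)  # fastest level, in LZO's spirit
    if len(compressed) < threshold * len(chunk):
        return b"C" + compressed  # tagged: compressed payload
    return b"R" + chunk           # tagged: raw payload

def unpack_from_exchange(payload: bytes) -> bytes:
    tag, body = payload[:1], payload[1:]
    return zlib.decompress(body) if tag == b"C" else body

# Usage: compressible data travels compressed, random data travels raw.
zeros = b"\x00" * 4096
noise = os.urandom(4096)
assert unpack_from_exchange(pack_for_exchange(zeros)) == zeros
assert unpack_from_exchange(pack_for_exchange(noise)) == noise
```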
{"title":"Exploiting data compression in collective I/O techniques","authors":"Rosa Filgueira, D. E. Singh, J. C. Pichel, J. Carretero","doi":"10.1109/CLUSTR.2008.4663811","DOIUrl":"https://doi.org/10.1109/CLUSTR.2008.4663811","url":null,"abstract":"This paper presents Two-Phase Compressed I/O (TPC I/O,) an optimization of the Two-Phase collective I/O technique from ROMIO, the most popular MPI-IO implementation. In order to reduce network traffic, TPC I/O employs LZO algorithm to compress and decompress exchanged data in the inter-node communication operations. The compression algorithm has been fully implemented in the MPI collective technique, allowing to dynamically use (or not) compression. Compared with Two-Phase I/O, Two-Phase Compressed I/O obtains important improvements in the overall execution time for many of the considered scenarios.","PeriodicalId":198768,"journal":{"name":"2008 IEEE International Conference on Cluster Computing","volume":"132 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114104795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multistage switches are not crossbars: Effects of static routing in high-performance networks
T. Hoefler, Timo Schneider, A. Lumsdaine
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663762
Multistage interconnection networks based on central switches are ubiquitous in high-performance computing. Applications and communication libraries typically make use of such networks without consideration of the actual internal characteristics of the switch. However, application performance of these networks, particularly with respect to bisection bandwidth, does depend on communication paths through the switch. In this paper we discuss the limitations of the hardware definition of bisection bandwidth (capacity-based) and introduce a new metric: effective bisection bandwidth. We assess the effective bisection bandwidth of several large-scale production clusters by simulating artificial communication patterns on them. Networks with full bisection bandwidth typically provided effective bisection bandwidth in the range of 55-60%. Simulations with application-based patterns showed that the difference between effective and rated bisection bandwidth could impact overall application performance by up to 12%.
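The metric can be approximated with a small simulation. The sketch below is our simplification, not the authors' simulator: it models a two-stage, full-bisection fat tree with deterministic destination-based routing, draws random matchings of the hosts, and rates each flow by its most congested link. All topology parameters and names are assumptions for illustration.

```python
# Hedged sketch: estimate effective bisection bandwidth under static routing.
import random
from collections import Counter

LEAVES, PORTS, SPINES = 8, 4, 4      # full rated bisection: uplinks == hosts per leaf
N = LEAVES * PORTS

def links(src, dst):
    """Links used by a flow under static, destination-based spine selection."""
    s, d = src // PORTS, dst // PORTS
    if s == d:
        return []                     # stays inside one leaf switch
    spine = dst % SPINES              # deterministic routing decision
    return [("up", s, spine), ("down", spine, d)]

def trial():
    hosts = list(range(N))
    random.shuffle(hosts)
    flows = [links(a, b) for a, b in zip(hosts[:N // 2], hosts[N // 2:])]
    load = Counter(l for ls in flows for l in ls)
    # Each flow runs at the rate allowed by its most congested link (1.0 = full).
    rates = [1.0 / max(load[l] for l in ls) if ls else 1.0 for ls in flows]
    return sum(rates) / len(rates)

# Averaging over many random patterns typically lands well below 1.0,
# qualitatively matching the 55-60% figure reported in the paper.
print(sum(trial() for _ in range(1000)) / 1000)
```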
{"title":"Multistage switches are not crossbars: Effects of static routing in high-performance networks","authors":"T. Hoefler, Timo Schneider, A. Lumsdaine","doi":"10.1109/CLUSTR.2008.4663762","DOIUrl":"https://doi.org/10.1109/CLUSTR.2008.4663762","url":null,"abstract":"Multistage interconnection networks based on central switches are ubiquitous in high-performance computing. Applications and communication libraries typically make use of such networks without consideration of the actual internal characteristics of the switch. However, application performance of these networks, particularly with respect to bisection bandwidth, does depend on communication paths through the switch. In this paper we discuss the limitations of the hardware definition of bisection bandwidth (capacity-based) and introduce a new metric: effective bisection bandwidth. We assess the effective bisection bandwidth of several large-scale production clusters by simulating artificial communication patterns on them. Networks with full bisection bandwidth typically provided effective bisection bandwidth in the range of 55-60%. Simulations with application-based patterns showed that the difference between effective and rated bisection bandwidth could impact overall application performance by up to 12%.","PeriodicalId":198768,"journal":{"name":"2008 IEEE International Conference on Cluster Computing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126057389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DifferStore: A differentiated storage service in object-based storage system
Q. Wei, Zhixiang Li
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663770
This paper presents DifferStore, a differentiated storage service for object-based storage systems. To provide differentiated storage service to different applications on a single object-based storage platform, DifferStore uses a two-layer architecture that efficiently decouples upper-layer, application-specific storage policies from lower-layer, application-independent storage functions. For the lower, application-independent layer, the paper proposes a weight-based object I/O scheduler with differentiated scheduling policies for different request classes, together with a versatile storage manager. The storage manager implements differentiated storage policies for disk layout and free-space allocation, as well as an efficient object namespace that allows an object's on-disk data to be accessed directly by object ID. DifferStore also lets the upper, application-specific layer assign complex striping, placement, and load-balancing policies, as well as application-specific file metadata structures. Experimental evaluation of our user-space prototype demonstrates that DifferStore performs well under mixed workloads and satisfies the requirements of different applications.
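As a sketch of what a weight-based scheduler over request classes might look like (our reconstruction; the abstract gives no code), the class below serves backlogged classes in proportion to their weights by tracking a per-class virtual service time.

```python
# Hedged sketch of weight-based dispatching across request classes.
from collections import deque

class WeightedScheduler:
    def __init__(self, weights):
        self.queues = {c: deque() for c in weights}  # per-class FIFO queues
        self.weights = weights                        # higher weight => more service
        self.vtime = {c: 0.0 for c in weights}        # virtual service time

    def submit(self, cls, request):
        self.queues[cls].append(request)

    def dispatch(self):
        """Pop the next request from the backlogged class whose virtual
        time is lowest; charge it 1/weight so heavy classes run more often."""
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        c = min(backlogged, key=lambda c: self.vtime[c])
        self.vtime[c] += 1.0 / self.weights[c]
        return c, self.queues[c].popleft()

# Usage: class "premium" (weight 3) gets roughly 3x the dispatches.
sched = WeightedScheduler({"premium": 3, "standard": 1})
for i in range(8):
    sched.submit("premium", f"p{i}")
    sched.submit("standard", f"s{i}")
print([sched.dispatch()[0] for _ in range(8)])
```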
{"title":"DifferStore: A differentiated storage service in object-based storage system","authors":"Q. Wei, Zhixiang Li","doi":"10.1109/CLUSTR.2008.4663770","DOIUrl":"https://doi.org/10.1109/CLUSTR.2008.4663770","url":null,"abstract":"This paper presents a differentiated storage service in object-based storage system, called DifferStore. To enable differentiated storage service for different applications in a single object-based storage platform, DifferStore utilizes a two-layer architecture to efficiently decouple upper-layer application specific storage policies and lower-layer application independent storage functions. For the lower application independent layer, this paper proposes a weight-based object I/O scheduler with differentiated scheduling policy for different request classes, and a versatile storage manager. The versatile storage manager implements differentiated storage policies in terms of disk layout and free space allocation, as well as an efficient object namespace management enabling directly access object on-disk data just with object ID. The DifferStore also provides ability for upper application specific layer to assign complex striping, placement, load-balancing policies and specific metadata structure of file. Experimental evaluation on our user space prototype demonstrates that the DifferStore can perform well under mixed workloads and satisfy requirements of different applications.","PeriodicalId":198768,"journal":{"name":"2008 IEEE International Conference on Cluster Computing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121153329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An OSD-based approach to managing directory operations in parallel file systems
N. Ali, A. Devulapalli, D. Dalessandro, P. Wyckoff, P. Sadayappan
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663769
Distributed file systems that use multiple servers to store data in parallel are becoming commonplace. Much work has already gone into such systems to maximize data throughput. However, metadata management has historically been treated as an afterthought. In previous work we focused on improving metadata management techniques by placing file metadata along with data on object-based storage devices (OSDs). However, we did not investigate directory operations. This work looks at the possibility of designing directory structures directly on OSDs, without the need for intervening servers. In particular, the need for atomicity is a fundamental requirement that we explore in depth. Through performance results of benchmarks and applications we show the feasibility of using OSDs directly for metadata, including directory operations.
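To illustrate why atomicity is the crux of server-less directory operations, the sketch below models a directory as an object whose entry creation is atomic, so two racing clients cannot both create the same name. The lock stands in for an OSD-side atomic primitive; the whole class is our illustration, not the paper's protocol.

```python
# Hedged model: atomic directory-entry creation on an object-based device.
import threading

class DirectoryObject:
    def __init__(self):
        self._entries = {}             # name -> object ID
        self._lock = threading.Lock()  # stands in for an atomic OSD command

    def create_entry(self, name, oid):
        """Atomically insert a directory entry; fail if the name exists."""
        with self._lock:
            if name in self._entries:
                return False           # lost the race: name already taken
            self._entries[name] = oid
            return True

# Usage: of two racing creates for "file.txt", exactly one succeeds.
d = DirectoryObject()
results = []
threads = [threading.Thread(target=lambda i=i: results.append(d.create_entry("file.txt", i)))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sorted(results) == [False, True]
```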
{"title":"An OSD-based approach to managing directory operations in parallel file systems","authors":"N. Ali, A. Devulapalli, D. Dalessandro, P. Wyckoff, P. Sadayappan","doi":"10.1109/CLUSTR.2008.4663769","DOIUrl":"https://doi.org/10.1109/CLUSTR.2008.4663769","url":null,"abstract":"Distributed file systems that use multiple servers to store data in parallel are becoming commonplace. Much work has already gone into such systems to maximize data throughput. However, metadata management has historically been treated as an afterthought. In previous work we focused on improving metadata management techniques by placing file metadata along with data on object-based storage devices (OSDs). However, we did not investigate directory operations. This work looks at the possibility of designing directory structures directly on OSDs, without the need for intervening servers. In particular, the need for atomicity is a fundamental requirement that we explore in depth. Through performance results of benchmarks and applications we show the feasibility of using OSDs directly for metadata, including directory operations.","PeriodicalId":198768,"journal":{"name":"2008 IEEE International Conference on Cluster Computing","volume":" 19","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132124010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continuous adaptation for high performance throughput computing across distributed clusters
E. Walker
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663797
A job proxy is an abstraction for provisioning CPU resources. This paper proposes an adaptive algorithm for allocating job proxies to distributed host clusters with the objective of improving large-scale job ensemble throughput. Specifically, the paper proposes a decision metric for selecting appropriate pending job proxies for migration between host clusters, and a self-synchronizing Paxos-style distributed consensus algorithm for performing the migration of these selected job proxies. The algorithm is further described in the context of a concrete application, the MyCluster system, which implements a framework for submitting, managing and adapting job proxies across distributed high performance computing (HPC) host clusters. To date, the system has been used to provision many hundreds of thousands of CPUs for computational experiments requiring high throughput on HPC infrastructures like the NSF TeraGrid. Experimental evaluation of the proposed algorithm shows significant improvement in user job throughput: an average of 8% in simulation, and 15% in a real-world experiment.
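The abstract does not spell out the decision metric, so the sketch below is only a plausible form of one, entirely our assumption: migrate a pending proxy when the expected wait at its current cluster clearly exceeds the expected wait at the best alternative plus the migration cost, with a hysteresis margin to avoid thrashing.

```python
# Hedged sketch of a migration decision metric (our formulation, not the
# paper's): compare expected queue waits, charging the mover its overhead.

def should_migrate(est_wait_here, est_wait_there, migration_cost, margin=1.2):
    """Return True if moving the pending job proxy looks profitable.

    est_wait_here:  estimated remaining queue wait at the current cluster (s)
    est_wait_there: estimated queue wait at the best alternative cluster (s)
    migration_cost: resubmission/startup overhead of migrating (s)
    margin:         hysteresis factor so marginal gains do not trigger moves
    """
    return est_wait_here > margin * (est_wait_there + migration_cost)

# Usage: a proxy facing a 2-hour wait migrates to a cluster with a 10-minute
# queue even after a 5-minute resubmission cost.
print(should_migrate(est_wait_here=7200, est_wait_there=600,
                     migration_cost=300))  # True
```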
{"title":"Continuous adaptation for high performance throughput computing across distributed clusters","authors":"E. Walker","doi":"10.1109/CLUSTR.2008.4663797","DOIUrl":"https://doi.org/10.1109/CLUSTR.2008.4663797","url":null,"abstract":"A job proxy is an abstraction for provisioning CPU resources. This paper proposes an adaptive algorithm for allocating job proxies to distributed host clusters with the objective of improving large-scale job ensemble throughput. Specifically, the paper proposes a decision metric for selecting appropriate pending job proxies for migration between host clusters, and a self-synchronizing Paxos-style distributed consensus algorithm for performing the migration of these selected job proxies. The algorithm is further described in the context of a concrete application, the MyCluster system, which implements a framework for submitting, managing and adapting job proxies across distributed high performance computing (HPC) host clusters. To date, the system has been used to provision many hundreds of thousands of CPUs for computational experiments requiring high throughput on HPC infrastructures like the NSF TeraGrid. Experimental evaluation of the proposed algorithm shows significant improvement in user job throughput: an average of 8% in simulation, and 15% in a real-world experiment.","PeriodicalId":198768,"journal":{"name":"2008 IEEE International Conference on Cluster Computing","volume":"110 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129075506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context-aware address translation for high performance SMP cluster system
Moon-Sang Lee, Joonwon Lee, S. Maeng
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663784
User-level communication allows an application process to access the network interface directly. Bypassing the kernel requires that a user process access the network interface using its own virtual addresses, which must be translated to physical addresses. A small caching structure, similar to the hardware TLB on the host processor, has been used in network interface memory to cache virtual-to-physical address mappings. In this study, we propose a new TLB architecture for the network interface. The proposed architecture splits the original caching structure into as many partitions as there are processors in the SMP system and assigns a separate partition to each application process. In addition, the architecture is aware of user contexts and switches the contents of the caching structure on context switches. According to our experiments, our scheme significantly reduces application execution time compared to the previous approach.
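A software model makes the partitioning idea concrete. The sketch below is our model of the mechanism, not the authors' hardware design: one partition per processor, re-tagged and invalidated whenever a different process is scheduled there, so one process's translations never evict another's.

```python
# Hedged model of a per-processor partitioned NIC translation cache.

class PartitionedNicTlb:
    def __init__(self, num_processors, entries_per_partition):
        self.cap = entries_per_partition
        self.partitions = [{} for _ in range(num_processors)]  # vaddr -> paddr
        self.owner = [None] * num_processors                    # current process

    def context_switch(self, cpu, pid):
        """Invalidate a partition when a different process is scheduled."""
        if self.owner[cpu] != pid:
            self.partitions[cpu].clear()
            self.owner[cpu] = pid

    def translate(self, cpu, vaddr, page_table):
        part = self.partitions[cpu]
        if vaddr in part:
            return part[vaddr]          # hit: no host round-trip needed
        paddr = page_table[vaddr]       # miss: consult the host page table
        if len(part) >= self.cap:
            part.pop(next(iter(part)))  # simple oldest-first eviction
        part[vaddr] = paddr
        return paddr

# Usage: process 7 runs on CPU 0 and fills its own partition only.
tlb = PartitionedNicTlb(num_processors=2, entries_per_partition=64)
tlb.context_switch(0, pid=7)
print(hex(tlb.translate(0, 0x1000, {0x1000: 0x9F000})))
```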
{"title":"Context-aware address translation for high performance SMP cluster system","authors":"Moon-Sang Lee, Joonwon Lee, S. Maeng","doi":"10.1109/CLUSTR.2008.4663784","DOIUrl":"https://doi.org/10.1109/CLUSTR.2008.4663784","url":null,"abstract":"User-level communication allows an application process to access the network interface directly. Bypassing the kernel requires that a user process accesses the network interface using its own virtual address which should be translated to a physical address. A small caching structure which is similar to the hardware TLB on the host processor has been used to cache the mappings between virtual and physical addresses on the network interface memory. In this study, we propose a new TLB architecture for the network interface. The proposed architecture splits an original caching structure into as many partitions as the number of processors on the SMP system and assigns a separate partition to each application process. In addition, the architecture becomes aware of user contexts and switches the content of caching structure in accordance with context switching. According to our experiments, our scheme achieves significant reduction in application execution time compared to the previous approach.","PeriodicalId":198768,"journal":{"name":"2008 IEEE International Conference on Cluster Computing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126535268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and implementation of an effective HyperTransport core in FPGA
Fei Chen, Hailiang Cheng, Xiaojun Yang, R. Liu
Pub Date: 2008-10-31 | DOI: 10.1109/CLUSTR.2008.4663805
This paper presents the design and implementation of a HyperTransport (HT) core in a LatticeSCM FPGA that runs at an 800 MHz DDR link frequency. An effective approach is also proposed to solve the ordering problem caused by different virtual channels, which exists not only in HT but also in PCI Express. HT is a high-performance, low-latency I/O standard that can connect directly to some general-purpose processors, such as AMD's Opteron processor family. The HT interface on Opteron processors runs at a maximum frequency of 1 GHz, yet most HT cores in FPGAs run at a maximum of 500 MHz, which limits communication performance. In this paper, a 16-bit 800 MHz HT core is proposed to narrow the gap between ASIC and FPGA implementations.
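The virtual-channel ordering problem can be illustrated in software. The sketch below is our simplified model of PCI/HT-style producer-consumer rules, not the paper's FPGA logic: non-posted and response packets may not pass an older posted packet, while posted packets may pass everything (which is what prevents deadlock).

```python
# Hedged model of virtual-channel ordering at a merge point.
from collections import deque

class OrderingPoint:
    CHANNELS = ("posted", "nonposted", "response")

    def __init__(self):
        self.q = {c: deque() for c in self.CHANNELS}
        self.seq = 0

    def push(self, channel, pkt):
        self.seq += 1
        self.q[channel].append((self.seq, pkt))  # remember arrival order

    def pop(self):
        """Issue one packet. Simplified rule: a non-posted or response
        packet may not pass an older posted packet; posted passes anything."""
        oldest_posted = self.q["posted"][0][0] if self.q["posted"] else float("inf")
        for c in ("nonposted", "response"):
            if self.q[c] and self.q[c][0][0] < oldest_posted:
                return self.q[c].popleft()[1]
        if self.q["posted"]:
            return self.q["posted"].popleft()[1]
        return None

# Usage: a read (non-posted) that arrived after a posted write waits for it.
op = OrderingPoint()
op.push("posted", "W1")
op.push("nonposted", "R1")
print(op.pop(), op.pop())  # W1 R1
```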
{"title":"Design and implementation of an effective HyperTransport core in FPGA","authors":"Fei Chen, Hailiang Cheng, Xiaojun Yang, R. Liu","doi":"10.1109/CLUSTR.2008.4663805","DOIUrl":"https://doi.org/10.1109/CLUSTR.2008.4663805","url":null,"abstract":"This paper presents a design and implementation of a HyperTransport (HT) core in lattice SCM FPGA which can run at 800 MHz DDR link frequency. An effective approach is also proposed to solve the ordering problem caused by different virtual channels which exists not only in HT but also PCI-e. HT is a high performance, low latency I/O standard which can be used directly to connect with some general-purpose processors, such as AMDpsilas Opteron processor family. HT interface on Opteron processor run at a maximum of 1 GHz frequency. However, most HT core in FPGA runs at a maximum of 500 MHz frequency which limits the performance of communication. In this paper, a 16 bit 800 MHz HT core is proposed to reduce the gap of ASIC and FPGA.","PeriodicalId":198768,"journal":{"name":"2008 IEEE International Conference on Cluster Computing","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128300470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}