Chip Multiprocessors (CMPs) and Non-Uniform Cache Architectures (NUCAs) represent two emerging trends in computer architecture. Targeting future CMP-based systems with NUCA-type L2 caches, this paper proposes and evaluates a novel data migration algorithm for parallel applications. The goal of this migration scheme is to determine a suitable location for each data block within a large L2 space at any given point during execution. A unique characteristic of the proposed scheme is that it models the problem of optimal data placement in the L2 cache space as a two-dimensional post office placement problem, presents a practical architectural implementation of this model, and gives a detailed evaluation of the proposed implementation. In our experimental evaluation, we also compare our approach to a previously-proposed NUCA management scheme using applications from the SPEComp suite, OLTP, SPECjbb, and SPECweb. These experiments show that our migration approach yields about a 35% improvement, on average, in L2 access latency over the previous migration scheme, and these L2 latency savings translate, on average, to a 9.5% improvement in IPC (instructions per cycle). We also observed during our experiments that both the careful initial placement of data (which itself triggers migrations within the L2 space) and subsequent migrations (due to inter-processor data sharing) play an important role in achieving our performance improvements.
{"title":"A novel migration-based NUCA design for Chip Multiprocessors","authors":"M. Kandemir, Feihui Li, M. J. Irwin, S. Son","doi":"10.1109/SC.2008.5216918","DOIUrl":"https://doi.org/10.1109/SC.2008.5216918","url":null,"abstract":"Chip Multiprocessors (CMPs) and Non-Uniform Cache Architectures (NUCAs) represent two emerging trends in computer architecture. Targeting future CMP based systems with NUCA type L2 caches, this paper proposes a novel data migration algorithm for parallel applications and evaluates it. The goal of this migration scheme is to determine a suitable location for each data block within a large L2 space at any given point during execution. A unique characteristic of the proposed scheme is that it models the problem of optimal data placement in the L2 cache space as a two-dimensional post office placement problem, presents a practical architectural implementation of this model, and gives a detailed evaluation of the proposed implementation. In our experimental evaluation, we also compare our approach to a previously-proposed NUCA management scheme using applications from the specomp suite, oltp, specjbb, and specweb. These experiments show that our migration approach generates about 35% improvement, on average, in average L2 access latency over the previous migration scheme, and these L2 latency savings translate, on average, to 9.5% improvement in IPC (instructions per cycle).We also observed during our experiments that both the careful initial placement of data (which itself triggers migrations within the L2 space) and subsequent migrations (due to inter-processor data sharing) play an important role in achieving our performance improvements.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121809251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tiling is a widely used loop transformation for exposing and exploiting parallelism and data locality. Effective use of tiling requires selection and tuning of the tile sizes. This is usually achieved by developing cost models that characterize the performance of the tiled program as a function of tile sizes. All previous approaches to tile size selection (TSS) are cost-model specific. As a result, they are neither extensible (e.g., to richer program classes or newer architectures) nor scalable (e.g., to multiple levels of tiling). This paper identifies positivity as a fundamental property shared by the functions and parameters commonly used in TSS models. We show how this positivity can be used as a basis to derive a TSS framework that is both efficient and scalable. We also show that almost all TSS models proposed in the literature (including those used in production compilers and auto-tuners) can be reduced to our framework.
{"title":"Positivity, posynomials and tile size selection","authors":"Lakshminarayanan Renganarayanan, S. Rajopadhye","doi":"10.1145/1413370.1413426","DOIUrl":"https://doi.org/10.1145/1413370.1413426","url":null,"abstract":"Tiling is a widely used loop transformation for exposing/exploiting parallelism and data locality. Effective use of tiling requires selection and tuning of the tile sizes. This is usually achieved by developing cost models that characterize the performance of the tiled program as a function of tile sizes. All previous approaches to tile size selection (TSS) are cost model specific. Due to this they are neither extensible (e.g., to richer program classes/newer architectures) nor scalable (e.g., to multiple levels of tiling). This paper identifies positivity as a fundamental property shared by the functions and parameters commonly used in TSS models. We show how this positivity can be used as a basis to derive a TSS framework which is both efficient and scalable. We also show that almost all TSS models proposed in the literature (including those used in production compilers and auto-tuners) can be reduced to our framework.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"2015 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130197085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BlueGene/P (BG/P) is the second-generation BlueGene architecture from IBM, succeeding BlueGene/L (BG/L). BG/P is a system-on-a-chip (SoC) design that uses four PowerPC 450 cores operating at 850 MHz, with a double-precision, dual-pipe floating-point unit per core. These chips are connected by multiple interconnection networks, including a 3-D torus, a global collective network, and a global barrier network. The design is intended to provide a highly scalable, physically dense system with relatively low power requirements per flop. In this paper, we report on our examination of BG/P, presented in the context of a set of important scientific applications and compared to other major large-scale supercomputers in use today. Our investigation confirms that BG/P has good scalability, with lower per-processor performance, as expected, than the Cray XT4's Opteron. We also find that BG/P uses very little power per floating-point operation for certain kernels, yet it has less of a power advantage when considering science-driven metrics for mission applications.
{"title":"Early evaluation of IBM BlueGene/P","authors":"S. Alam, R. Barrett, M. Bast, M. Fahey, J. Kuehn, Collin McCurdy, James H. Rogers, P. Roth, R. Sankaran, J. Vetter, P. Worley, Weikuan Yu","doi":"10.1109/SC.2008.5214725","DOIUrl":"https://doi.org/10.1109/SC.2008.5214725","url":null,"abstract":"BlueGene/P (BG/P) is the second generation BlueGene architecture from IBM, succeeding BlueGene/L (BG/L). BG/P is a system-on-a-chip (SoC) design that uses four PowerPC 450 cores operating at 850 MHz with a double precision, dual pipe floating point unit per core. These chips are connected with multiple interconnection networks including a 3-D torus, a global collective network, and a global barrier network. The design is intended to provide a highly scalable, physically dense system with relatively low power requirements per flop. In this paper, we report on our examination of BG/P, presented in the context of a set of important scientific applications, and as compared to other major large scale supercomputers in use today. Our investigation confirms that BG/P has good scalability with an expected lower performance per processor when compared to the Cray XT4's Opteron. We also find that BG/P uses very low power per floating point operation for certain kernels, yet it has less of a power advantage when considering science driven metrics for mission applications.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129817376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we (i) quantify the impact of compiler-directed I/O prefetching on shared caches at I/O nodes. The experimental data collected show that while I/O prefetching brings some benefits, its effectiveness drops significantly as the number of clients (compute nodes) increases; (ii) identify inter-client misses due to harmful I/O prefetches as one of the main sources of this performance loss at higher client counts; and (iii) propose and experimentally evaluate prefetch throttling and data pinning schemes to improve the performance of I/O prefetching. Prefetch throttling prevents one or more clients from issuing further prefetches if such prefetches are predicted to be harmful, i.e., to evict from the memory cache useful data accessed by other clients. Data pinning, on the other hand, makes selected data blocks immune to harmful prefetches by pinning them in the memory cache. We show that these two schemes can be applied in isolation or combined, and at a coarse or fine granularity. Our experiments with these two optimizations using four disk-intensive applications reveal that they improve performance by 9.7% and 15.1% on average over standard compiler-directed I/O prefetching and the no-prefetch case, respectively, when 8 clients are used.
{"title":"Prefetch throttling and data pinning for improving performance of shared caches","authors":"O. Ozturk, S. Son, M. Kandemir, Mustafa Karaköy","doi":"10.1145/1413370.1413430","DOIUrl":"https://doi.org/10.1145/1413370.1413430","url":null,"abstract":"In this paper, we (i) quantify the impact of compiler-directed I/O prefetching on shared caches at I/O nodes. The experimental data collected shows that while I/O prefetching brings some benefits, its effectiveness reduces significantly as the number of clients (compute nodes) is increased; (ii) identify inter-client misses due to harmful I/O prefetches as one of the main sources for this reduction in performance with increased number of clients; and (iii) propose and experimentally evaluate prefetch throttling and data pinning schemes to improve performance of I/O prefetching. Prefetch throttling prevents one or more clients from issuing further prefetches if such prefetches are predicted to be harmful, i.e., replace from the memory cache the useful data accessed by other clients. Data pinning on the other hand makes selected data blocks immune to harmful prefetches by pinning them in the memory cache. We show that these two schemes can be applied in isolation or combined together, and they can be applied at a coarse or fine granularity. Our experiments with these two optimizations using four disk-intensive applications reveal that they can improve performance by 9.7% and 15.1% on average, over standard compiler-directed I/O prefetching and no-prefetch case, respectively, when 8 clients are used.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115988862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present CALU, a Communication Avoiding algorithm for the LU factorization of dense matrices distributed in a two-dimensional cyclic layout. The algorithm is based on a new pivoting strategy, which is stable in practice. The new algorithm is optimal (up to polylogarithmic factors) in the amount of communication it performs. Our experiments show that CALU leads to a reduction in the parallel time, in particular when the latency time is an important factor of the overall time. The factorization of a block-column, a subroutine of CALU, outperforms the corresponding routine PDGETF2 from ScaLAPACK up to a factor of 4.37 on an IBM POWER 5 system and up to a factor of 5.58 on a Cray XT4 system. On square matrices of order 10^4, CALU outperforms the corresponding routine PDGETRF from ScaLAPACK by a factor of 1.24 on IBM POWER 5 and by a factor of 1.31 on Cray XT4.
{"title":"Communication Avoiding Gaussian elimination","authors":"L. Grigori, J. Demmel, Hua Xiang","doi":"10.1109/SC.2008.5214287","DOIUrl":"https://doi.org/10.1109/SC.2008.5214287","url":null,"abstract":"We present CALU, a Communication Avoiding algorithm for the LU factorization of dense matrices distributed in a two-dimensional cyclic layout. The algorithm is based on a new pivoting strategy, which is stable in practice. The new algorithm is optimal (up to polylogarithmic factors) in the amount of communication it performs. Our experiments show that CALU leads to a reduction in the parallel time, in particular when the latency time is an important factor of the overall time. The factorization of a block-column, a subroutine of CALU, outperforms the corresponding routine PDGETF2 from ScaLAPACK up to a factor of 4.37 on an IBM POWER 5 system and up to a factor of 5.58 on a Cray XT4 system. On square matrices of order 104, CALU outperforms the corresponding routine PDGETRF from ScaLAPACK by a factor of 1.24 on IBM POWER 5 and by a factor of 1.31 on Cray XT4.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"423 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117351230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Metagenomics is the study of environmental microbial communities using state-of-the-art genomic tools. Recent advancements in high-throughput technologies have enabled the accumulation of large volumes of metagenomic data that, until a couple of years ago, was deemed impractical to generate. A primary bottleneck, however, is the lack of scalable algorithms and open-source software for large-scale data processing. In this paper, we present the design and implementation of a novel parallel approach to identify protein families from large-scale metagenomic data. Given a set of peptide sequences, we reduce the problem to one of detecting arbitrarily-sized dense subgraphs in bipartite graphs. Our approach efficiently parallelizes this task on a distributed-memory machine through a combination of divide-and-conquer and combinatorial pattern matching heuristics. We present performance and quality results from extensively testing our implementation on 160 K randomly sampled sequences from the CAMERA environmental sequence database using 512 nodes of a BlueGene/L supercomputer.
{"title":"An efficient parallel approach for identifying protein families in large-scale metagenomic data sets","authors":"Changjun Wu, A. Kalyanaraman","doi":"10.1145/1413370.1413406","DOIUrl":"https://doi.org/10.1145/1413370.1413406","url":null,"abstract":"Metagenomics is the study of environmental microbial communities using state-of-the-art genomic tools. Recent advancements in high-throughput technologies have enabled the accumulation of large volumes of metagenomic data that was until a couple of years back was deemed impractical for generation. A primary bottleneck, however, is in the lack of scalable algorithms and open source software for large-scale data processing. In this paper, we present the design and implementation of a novel parallel approach to identify protein families from large-scale metagenomic data. Given a set of peptide sequences we reduce the problem to one of detecting arbitrarily-sized dense subgraphs from bipartite graphs. Our approach efficiently parallelizes this task on a distributed memory machine through a combination of divide-and-conquer and combinatorial pattern matching heuristic techniques. We present performance and quality results of extensively testing our implementation on 160 K randomly sampled sequences from the CAMERA environmental sequence database using 512 nodes of a BlueGene/L supercomputer.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124254337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present timing and performance numbers for a short-range parallel molecular dynamics (MD) code, SPaSM, that has been rewritten for the heterogeneous Roadrunner supercomputer. Each Roadrunner compute node consists of two AMD Opteron dual-core microprocessors and four PowerXCell 8i enhanced Cell microprocessors, so that there are four MPI ranks per node, each with one Opteron and one Cell. The interatomic forces are computed on the Cells (each with one PPU and eight SPU cores), while the Opterons are used to direct inter-rank communication and perform I/O-heavy periodic analysis, visualization, and checkpointing tasks. The performance measured for our initial implementation of a standard Lennard-Jones pair-potential benchmark reached a peak of 369 Tflop/s of double-precision floating-point performance on the full Roadrunner system (27.7% of peak), corresponding to 124 MFlop/s per Watt and a price-performance of approximately 3.69 MFlop/s per dollar. We demonstrate an initial target application, the jetting and ejection of material from a shocked surface.
{"title":"369 Tflop/s molecular dynamics simulations on the Roadrunner general-purpose heterogeneous supercomputer","authors":"S. Swaminarayan, K. Kadau, T. Germann, G. Fossum","doi":"10.1145/1413370.1413436","DOIUrl":"https://doi.org/10.1145/1413370.1413436","url":null,"abstract":"We present timing and performance numbers for a short-range parallel molecular dynamics (MD) code, SPaSM, that has been rewritten for the heterogeneous Roadrunner supercomputer. Each Roadrunner compute node consists of two AMD Opteron dualcore microprocessors and four PowerXCell 8i enhanced Cell microprocessors, so that there are four MPI ranks per node, each with one Opteron and one Cell. The interatomic forces are computed on the Cells (each with one PPU and eight SPU cores), while the Opterons are used to direct inter-rank communication and perform I/O-heavy periodic analysis, visualization, and checkpointing tasks. The performance measured for our initial implementation of a standard Lennard-Jones pair potential benchmark reached a peak of 369 Tflop/s double-precision floating-point performance on the full Roadrunner system (27.7% of peak), corresponding to 124 MFlop/Watt/s at a price of approximately 3.69 MFlops/dollar. We demonstrate an initial target application, the jetting and ejection of material from a shocked surface.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"2581 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128796673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The ever-increasing amounts of simulation data produced by scientists demand high-end parallel visualization capability. However, image compositing, which requires interprocessor communication, is often the bottleneck stage for parallel rendering of large volume data sets. Existing image compositing solutions either incur a large number of messages exchanged among processors (such as the direct send method), or limit the number of processors that can be effectively utilized (such as the binary swap method). We introduce a new image compositing algorithm, called 2-3 swap, which combines the flexibility of the direct send method and the optimality of the binary swap method. The 2-3 swap algorithm allows an arbitrary number of processors to be used for compositing, and fully utilizes all participating processors throughout the course of the compositing. We experiment with this image compositing solution on a supercomputer with thousands of processors, and demonstrate its great flexibility as well as scalability.
{"title":"Massively parallel volume rendering using 2–3 swap image compositing","authors":"Hongfeng Yu, Chaoli Wang, K. Ma","doi":"10.1145/1508044.1508084","DOIUrl":"https://doi.org/10.1145/1508044.1508084","url":null,"abstract":"The ever-increasing amounts of simulation data produced by scientists demand high-end parallel visualization capability. However, image compositing, which requires interprocessor communication, is often the bottleneck stage for parallel rendering of large volume data sets. Existing image compositing solutions either incur a large number of messages exchanged among processors (such as the direct send method), or limit the number of processors that can be effectively utilized (such as the binary swap method). We introduce a new image compositing algorithm, called 2-3 swap, which combines the flexibility of the direct send method and the optimality of the binary swap method. The 2-3 swap algorithm allows an arbitrary number of processors to be used for compositing, and fully utilizes all participating processors throughout the course of the compositing. We experiment with this image compositing solution on a supercomputer with thousands of processors, and demonstrate its great flexibility as well as scalability.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122778720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel I/O prefetching is considered effective in improving I/O performance. However, its effectiveness depends on swiftly determining patterns among future I/O accesses and fetching data in time, which is difficult to achieve in general. In this study, we propose an I/O signature-based prefetching strategy. The idea is to use a predetermined I/O signature of an application to guide prefetching. To put this idea to work, we first derived a classification of patterns and introduced a simple and effective signature notation to represent them. We then developed a toolkit to trace and generate I/O signatures automatically. Finally, we designed and implemented a thread-based, client-side collective prefetching cache layer for the MPI-IO library to support prefetching. A prefetching thread reads an application's I/O signatures and adjusts them by observing I/O accesses at runtime. Experimental results show that the proposed prefetching method significantly improves I/O performance for applications with complex patterns.
{"title":"Parallel I/O prefetching using MPI file caching and I/O signatures","authors":"S. Byna, Yong Chen, Xian-He Sun, R. Thakur, W. Gropp","doi":"10.1109/SC.2008.5213604","DOIUrl":"https://doi.org/10.1109/SC.2008.5213604","url":null,"abstract":"Parallel I/O prefetching is considered to be effective in improving I/O performance. However, the effectiveness depends on determining patterns among future I/O accesses swiftly and fetching data in time, which is difficult to achieve in general. In this study, we propose an I/O signature-based prefetching strategy. The idea is to use a predetermined I/O signature of an application to guide prefetching. To put this idea to work, we first derived a classification of patterns and introduced a simple and effective signature notation to represent patterns. We then developed a toolkit to trace and generate I/O signatures automatically. Finally, we designed and implemented a thread-based client-side collective prefetching cache layer for MPI-IO library to support prefetching. A prefetching thread reads I/O signatures of an application and adjusts them by observing I/O accesses at runtime. Experimental results show that the proposed prefetching method improves I/O performance significantly for applications with complex patterns.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126327793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Desktop Grids use the computing, network, and storage resources of idle desktop PCs distributed over multiple LANs or the Internet to run a large variety of resource-demanding distributed applications. While these applications need to access, compute, store, and circulate large volumes of data, little attention has been paid to data management in such large-scale, dynamic, heterogeneous, volatile, and highly distributed Grids. In most cases, data management relies on ad-hoc solutions, and providing a general approach remains a challenging issue. To address this problem, we propose the BitDew framework, a programmable environment for automatic and transparent data management on computational Desktop Grids. This paper describes the BitDew programming interface, its architecture, and the performance evaluation of its runtime components. BitDew relies on a specific set of metadata to drive key data management operations, namely life cycle, distribution, placement, replication, and fault tolerance, with a high level of abstraction. The BitDew runtime environment is a flexible distributed service architecture that integrates modular P2P components such as DHTs for a distributed data catalog and collaborative transport protocols for data distribution. Through several examples, we describe how application programmers and BitDew users can exploit BitDew's features. The performance evaluation demonstrates that this high level of abstraction and transparency is obtained with reasonable overhead, while offering the benefits of scalability, performance, and fault tolerance at little programming cost.
{"title":"BitDew: A programmable environment for large-scale data management and distribution","authors":"G. Fedak, Haiwu He, F. Cappello","doi":"10.1109/SC.2008.5213939","DOIUrl":"https://doi.org/10.1109/SC.2008.5213939","url":null,"abstract":"Desktop Grids use the computing, network and storage resources from idle desktop PC's distributed over multiple-LAN's or the Internet to compute a large variety of resource-demanding distributed applications. While these applications need to access, compute, store and circulate large volumes of data, little attention has been paid to data management in such large-scale, dynamic, heterogeneous, volatile and highly distributed Grids. In most cases, data management relies on ad-hoc solutions, and providing a general approach is still a challenging issue. To address this problem, we propose the BitDew framework, a programmable environment for automatic and transparent data management on computational Desktop Grids. This paper describes the BitDew programming interface, its architecture, and the performance evaluation of its runtime components. BitDew relies on a specific set of meta-data to drive key data management operations, namely life cycle, distribution, placement, replication and fault-tolerance with a high level of abstraction. The Bitdew runtime environment is a flexible distributed service architecture that integrates modular P2P components such as DHT's for a distributed data catalog and collaborative transport protocols for data distribution. Through several examples, we describe how application programmers and Bitdew users can exploit Bitdew's features. The performance evaluation demonstrates that the high level of abstraction and transparency is obtained with a reasonable overhead, while offering the benefit of scalability, performance and fault tolerance with little programming cost.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129211017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}