The Green500 List: Year two
Pub Date: 2010-04-19 · DOI: 10.1109/IPDPSW.2010.5470905
Wu-chun Feng, Heshan Lin
The Green500 turned two years old this past November at the ACM/IEEE SC|09 Conference. As part of the grassroots movement of the Green500, this paper takes a look back and reflects on how the Green500 has evolved in its second year as well as since its inception. Specifically, it analyzes trends in the Green500 and reports on the implications of these trends. In addition, based on significant feedback from the high-end computing (HEC) community, the Green500 announced three exploratory sub-lists: the Little Green500, the Open Green500, and the HPCC Green500, which are each discussed in this paper.
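The ranking metric behind the list is simple to state: the Green500 orders systems by energy efficiency, conventionally sustained LINPACK performance per watt. A toy sketch of that computation follows; every system name and number in it is invented for illustration, not an actual list entry.

```python
# Illustrative Green500-style ranking: sort systems by MFLOPS per watt.
# All systems and figures below are made up for demonstration only.
systems = [
    {"name": "SystemA", "rmax_mflops": 825_500_000, "power_kw": 1_320.0},
    {"name": "SystemB", "rmax_mflops": 433_200_000, "power_kw": 1_580.0},
    {"name": "SystemC", "rmax_mflops": 191_400_000, "power_kw": 3_090.0},
]

for s in systems:
    # Efficiency = sustained LINPACK performance / total measured power.
    s["mflops_per_watt"] = s["rmax_mflops"] / (s["power_kw"] * 1000.0)

for rank, s in enumerate(sorted(systems, key=lambda s: -s["mflops_per_watt"]), 1):
    print(f"{rank}. {s['name']}: {s['mflops_per_watt']:.1f} MFLOPS/W")
```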
{"title":"The Green500 List: Year two","authors":"Wu-chun Feng, Heshan Lin","doi":"10.1109/IPDPSW.2010.5470905","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470905","url":null,"abstract":"The Green500 turned two years old this past November at the ACM/IEEE SC|09 Conference. As part of the grassroots movement of the Green500, this paper takes a look back and reflects on how the Green500 has evolved in its second year as well as since its inception. Specifically, it analyzes trends in the Green500 and reports on the implications of these trends. In addition, based on significant feedback from the high-end computing (HEC) community, the Green500 announced three exploratory sub-lists: the Little Green500, the Open Green500, and the HPCC Green500, which are each discussed in this paper.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117202688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Statistical predictors of computing power in heterogeneous clusters
Pub Date: 2010-04-19 · DOI: 10.1109/IPDPSW.2010.5470869
R. C. Chiang, A. A. Maciejewski, A. Rosenberg, H. Siegel
If cluster C1 consists of computers with a faster mean speed than the computers in cluster C2, does this imply that cluster C1 is more productive than cluster C2? What if the computers in cluster C1 have the same mean speed as the computers in cluster C2: is the one with computers that have a higher variance in speed more productive? Simulation experiments are performed to explore the above questions within a formal framework for measuring the performance of a cluster. Simulation results show that both mean speed and variance in speed (when mean speeds are equal) are typically correlated with the performance of a cluster, but not always; these statements are quantified statistically for our simulation environments. In addition, simulation results also show that: (1) If the mean speed of computers in cluster C1 is faster by at least a threshold amount than the mean speed of computers in cluster C2, then C1 is more productive than C2. (2) If the computers in clusters C1 and C2 have the same mean speed, then C1 is more productive than C2 when the variance in speed of computers in cluster C1 is higher by at least a threshold amount than the variance in speed of computers in cluster C2.
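The abstract does not reproduce the paper's formal framework, but the question it asks invites exactly this kind of Monte Carlo comparison. The sketch below is a hedged stand-in under an assumed model: unit-work tasks are list-scheduled greedily onto machines whose speeds are drawn from distributions with equal means but different variances; the distributions, cluster sizes, and scheduler are all assumptions, not the paper's setup.

```python
import random

def makespan(speeds, n_tasks):
    """Greedy list scheduling of n_tasks unit-work tasks: each task is
    assigned to the machine that would finish it earliest."""
    finish = [0.0] * len(speeds)
    for _ in range(n_tasks):
        i = min(range(len(speeds)), key=lambda k: finish[k] + 1.0 / speeds[k])
        finish[i] += 1.0 / speeds[i]
    return max(finish)

def sample_cluster(mean, std, n):
    # Truncate at a small positive floor so speeds stay physical.
    return [max(0.05, random.gauss(mean, std)) for _ in range(n)]

random.seed(1)
trials, wins = 200, 0
for _ in range(trials):
    c1 = sample_cluster(1.0, 0.40, 16)   # same mean speed, higher variance
    c2 = sample_cluster(1.0, 0.05, 16)   # same mean speed, lower variance
    if makespan(c1, 256) < makespan(c2, 256):
        wins += 1
print(f"higher-variance cluster won {wins}/{trials} trials")
```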
{"title":"Statistical predictors of computing power in heterogeneous clusters","authors":"R. C. Chiang, A. A. Maciejewski, A. Rosenberg, H. Siegel","doi":"10.1109/IPDPSW.2010.5470869","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470869","url":null,"abstract":"If cluster C<inf>1</inf> consists of computers with a faster mean speed than the computers in cluster C<inf>2</inf>, does this imply that cluster C<inf>1</inf> is more productive than cluster C<inf>2</inf>? What if the computers in cluster C<inf>1</inf> have the same mean speed as the computers in cluster C<inf>2</inf>: is the one with computers that have a higher variance in speed more productive? Simulation experiments are performed to explore the above questions within a formal framework for measuring the performance of a cluster. Simulation results show that both mean speed and variance in speed (when mean speeds are equal) are typically correlated with the performance of a cluster, but not always; these statements are quantified statistically for our simulation environments. In addition, simulation results also show that: (1) If the mean speed of computers in cluster C<inf>1</inf> is faster by at least a threshold amount than the mean speed of computers in cluster C<inf>2</inf>, then C<inf>1</inf> is more productive than C<inf>2</inf>. (2) If the computers in clusters C<inf>1</inf> and C<inf>2</inf> have the same mean speed, then C<inf>1</inf> is more productive than C<inf>2</inf> when the variance in speed of computers in cluster C<inf>1</inf> is higher by at least a threshold amount than the variance in speed of computers in cluster C<inf>2</inf>.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132471935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An efficient GPU implementation of the revised simplex method
Pub Date: 2010-04-19 · DOI: 10.1109/IPDPSW.2010.5470831
Jakob Bieling, Patrick Peschlow, P. Martini
The computational power provided by the massive parallelism of modern graphics processing units (GPUs) has moved increasingly into focus over the past few years. In particular, general purpose computing on GPUs (GPGPU) is attracting attention among researchers and practitioners alike. Yet GPGPU research is still in its infancy, and a major challenge is to rearrange existing algorithms so as to obtain a significant performance gain from the execution on a GPU. In this paper, we address this challenge by presenting an efficient GPU implementation of a very popular algorithm for linear programming, the revised simplex method. We describe how to carry out the steps of the revised simplex method to take full advantage of the parallel processing capabilities of a GPU. Our experiments demonstrate considerable speedup over a widely used CPU implementation, thus underlining the tremendous potential of GPGPU.
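The abstract does not detail the paper's kernel layout, but each revised simplex iteration is dominated by a few dense linear-algebra operations (pricing, direction computation, basis maintenance), which is precisely what a GPU parallelizes well. A CPU-side NumPy sketch of one iteration, shown only to make that data flow concrete rather than to mirror the authors' implementation:

```python
import numpy as np

def revised_simplex_step(A, b, c, basis):
    """One iteration of the revised simplex method for min c@x, Ax=b, x>=0.
    Returns the updated basis, or None if the current basis is optimal.
    Each dense product below is the kind of operation a GPU kernel
    would parallelize."""
    m, n = A.shape
    B_inv = np.linalg.inv(A[:, basis])          # in practice maintained incrementally
    y = c[basis] @ B_inv                        # simplex multipliers (dense mat-vec)
    nonbasic = [j for j in range(n) if j not in basis]
    reduced = c[nonbasic] - y @ A[:, nonbasic]  # pricing: the dominant dense step
    if np.all(reduced >= -1e-9):
        return None                             # optimal
    enter = nonbasic[int(np.argmin(reduced))]   # Dantzig rule: most negative cost
    d = B_inv @ A[:, enter]                     # entering direction (dense mat-vec)
    x_B = B_inv @ b
    ratios = np.where(d > 1e-9, x_B / np.where(d > 1e-9, d, 1.0), np.inf)
    leave = int(np.argmin(ratios))              # ratio test
    if not np.isfinite(ratios[leave]):
        raise ValueError("problem is unbounded")
    basis = list(basis)
    basis[leave] = enter
    return basis
```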
{"title":"An efficient GPU implementation of the revised simplex method","authors":"Jakob Bieling, Patrick Peschlow, P. Martini","doi":"10.1109/IPDPSW.2010.5470831","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470831","url":null,"abstract":"The computational power provided by the massive parallelism of modern graphics processing units (GPUs) has moved increasingly into focus over the past few years. In particular, general purpose computing on GPUs (GPGPU) is attracting attention among researchers and practitioners alike. Yet GPGPU research is still in its infancy, and a major challenge is to rearrange existing algorithms so as to obtain a significant performance gain from the execution on a GPU. In this paper, we address this challenge by presenting an efficient GPU implementation of a very popular algorithm for linear programming, the revised simplex method. We describe how to carry out the steps of the revised simplex method to take full advantage of the parallel processing capabilities of a GPU. Our experiments demonstrate considerable speedup over a widely used CPU implementation, thus underlining the tremendous potential of GPGPU.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134286837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An adaptive I/O load distribution scheme for distributed systems
Pub Date: 2010-04-19 · DOI: 10.1109/IPDPSW.2010.5470787
Xin Chen, J. Langston, Xubin He, Fengjiang Mao
A fundamental issue in a large-scale distributed system consisting of heterogeneous machines, which vary in both I/O and computing capabilities, is how to distribute workloads with respect to the capabilities of each node so as to achieve optimal performance. However, node capabilities are often not stable, and a static workload distribution scheme may not match the capability of each node well. To address this issue, we distribute workload adaptively, in response to changes in node capability. In this paper we present an adaptive I/O load distribution scheme that dynamically captures the I/O capabilities of system nodes and predictively determines a suitable load distribution pattern. A case study is conducted by applying our load distribution scheme to PVFS2, a popular distributed file system. Experimental results show that our adaptive load distribution scheme can dramatically improve performance: up to 70% performance gain for writes and 80% for reads, and up to 63% of overall performance loss can be avoided in the presence of an unstable Object Storage Device (OSD).
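The abstract does not give the scheme's exact estimator. One common way to realize "capture capability dynamically, then distribute proportionally" is to keep an exponentially weighted moving average (EWMA) of each node's observed bandwidth and stripe data in proportion to it; the sketch below takes that reading, and both the EWMA choice and the smoothing factor are assumptions rather than the paper's design.

```python
class AdaptiveStriper:
    """Distribute an I/O request across nodes in proportion to an EWMA
    of each node's recently observed bandwidth (bytes/sec)."""

    def __init__(self, node_ids, alpha=0.3):
        self.alpha = alpha                             # EWMA smoothing (assumed value)
        self.bw = {n: 100.0 * (1 << 20) for n in node_ids}  # assume 100 MiB/s until measured

    def observe(self, node, nbytes, seconds):
        # Fold the newest measurement into the running estimate, so a
        # degraded OSD's share shrinks within a few requests.
        sample = nbytes / seconds
        self.bw[node] = self.alpha * sample + (1 - self.alpha) * self.bw[node]

    def split(self, total_bytes):
        # Stripe sizes proportional to estimated bandwidth.
        total_bw = sum(self.bw.values())
        return {n: int(total_bytes * w / total_bw) for n, w in self.bw.items()}

striper = AdaptiveStriper(["osd0", "osd1", "osd2"])
striper.observe("osd2", nbytes=4 << 20, seconds=2.0)   # osd2 measures slow (2 MiB/s)
print(striper.split(64 << 20))                         # osd2 now gets a smaller stripe
```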
{"title":"An adaptive I/O load distribution scheme for distributed systems","authors":"Xin Chen, J. Langston, Xubin He, Fengjiang Mao","doi":"10.1109/IPDPSW.2010.5470787","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470787","url":null,"abstract":"A fundamental issue in a large-scale distributed system consisting of heterogeneous machines which vary in both I/O and computing capabilities is to distribute workloads with respect to the capabilities of each node to achieve the optimal performance. However, node capabilities are often not stable due to various factors. Simply using a static workload distribution scheme may not well match the capability of each node. To address this issue, we distribute workload adaptively to the change of system node capability. In this paper we present an adaptive I/O load distribution scheme to dynamically capture the I/O capabilities among system nodes and to predictively determine an suitable load distribution pattern. A case study is conducted by applying our load distribution scheme into a popular distributed file system PVFS2. Experiments results show that our adaptive load distribution scheme can dramatically improve the performance: up to 70% performance gain for writes and 80% for reads, and up to 63% overall performance loss can be avoided in the presence of an unstable Object Storage Device (OSD).","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132988057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An empirical study of a scalable Byzantine agreement algorithm
Pub Date: 2010-04-19 · DOI: 10.1109/IPDPSW.2010.5470874
O. Oluwasanmi, Jared Saia, Valerie King
A recent theoretical result by King and Saia shows that it is possible to solve the Byzantine agreement, leader election, and universe reduction problems in the full information model with Õ(n^(3/2)) total bits sent. However, this result, while theoretically interesting, is not practical due to large hidden constants. In this paper, we design a new practical algorithm based on this theoretical result. For networks containing more than about 1,000 processors, our new algorithm sends significantly fewer bits than a well-known algorithm due to Cachin, Kursawe and Shoup. To obtain our practical algorithm, we relax the fault model compared to that of King and Saia by (1) allowing the adversary to control only a 1/8 fraction of the processors, rather than 1/3; and (2) assuming the existence of a cryptographic bit-commitment primitive. Our algorithm assumes a partially synchronous communication model, in which any message sent from one honest player to another needs at most Δ time steps to be received and processed by the recipient, for some fixed Δ, and the clock speeds of the honest players are roughly the same. However, the clocks do not have to be synchronized (i.e., show the same time).
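The relaxed model assumes a cryptographic bit-commitment primitive without fixing a construction. A standard hash-based commitment (publish H(nonce‖bit), reveal both later) is a minimal sketch of what such a primitive provides; it is one textbook instantiation, not necessarily the one the authors use.

```python
import hashlib, os

def commit(bit: int):
    """Commit to a bit: publish the digest, keep (nonce, bit) secret until
    the reveal phase. Hiding comes from the random nonce; binding comes
    from the collision resistance of SHA-256."""
    nonce = os.urandom(32)
    digest = hashlib.sha256(nonce + bytes([bit])).hexdigest()
    return digest, (nonce, bit)

def verify(digest, nonce, bit: int) -> bool:
    return hashlib.sha256(nonce + bytes([bit])).hexdigest() == digest

digest, opening = commit(1)
assert verify(digest, *opening)            # honest reveal checks out
assert not verify(digest, opening[0], 0)   # cannot reopen as the other bit
```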
{"title":"An empirical study of a scalable Byzantine agreement algorithm","authors":"O. Oluwasanmi, Jared Saia, Valerie King","doi":"10.1109/IPDPSW.2010.5470874","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470874","url":null,"abstract":"A recent theoretical result by King and Saia shows that it is possible to solve the Byzantine agreement, leader election and universe reduction problems in the full information model with Õ(n3/2) total bits sent. However, this result, while theoretically interesting, is not practical due to large hidden constants. In this paper, we design a new practical algorithm, based on this theoretical result. For networks containing more than about 1,000 processors, our new algorithm sends significantly fewer bits than a well-known algorithm due to Cachin, Kursawe and Shoup. To obtain our practical algorithm, we relax the fault model compared to the model of King and Saia by (1) allowing the adversary to control only a 1/8, and not a 1/3 fraction of the processors; and (2) assuming the existence of a cryptographic bit commitment primitive. Our algorithm assumes a partially synchronous communication model, where any message sent from one honest player to another honest player needs at most Δ time steps to be received and processed by the recipient for some fixed Δ, and we assume that the clock speeds of the honest players are roughly the same. However, the clocks do not have to be synchronized (i.e., show the same time)","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133213492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Massive streaming data analytics: A case study with clustering coefficients
Pub Date: 2010-04-19 · DOI: 10.1109/IPDPSW.2010.5470687
David Ediger, Karl Jiang, E. J. Riedy, David A. Bader
We present a new approach for parallel massive graph analysis of streaming, temporal data with a dynamic and extensible representation. Handling the constant stream of new data from health care, security, business, and social network applications requires new algorithms and data structures. We examine data structure and algorithm trade-offs that extract the parallelism necessary for high-performance updating analysis of massive graphs. Static analysis kernels often rely on storing input data in a specific structure. Maintaining these structures for each possible kernel with high data rates incurs a significant performance cost. A case study computing clustering coefficients on a general-purpose data structure demonstrates incremental updates can be more efficient than global recomputation. Within this kernel, we compare three methods for dynamically updating local clustering coefficients: a brute-force local recalculation, a sorting algorithm, and our new approximation method using a Bloom filter. On 32 processors of a Cray XMT with a synthetic scale-free graph of 2^24 ≈ 16 million vertices and 2^29 ≈ 537 million edges, the brute-force method processes a mean of over 50,000 updates per second and our Bloom filter approaches 200,000 updates per second.
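The incremental update is easy to state: inserting edge (u, v) creates one new triangle for every common neighbor of u and v, and the approximation replaces the exact membership tests in that intersection with a Bloom filter built over one adjacency list. The sketch below takes that reading; the toy two-hash filter, its size, and the example graph are illustrative choices, not the paper's parameters.

```python
class Bloom:
    """Toy Bloom filter: two hashes over an m-bit array. It can report
    false positives but never false negatives, which is why the update
    below can only over-count triangles."""
    def __init__(self, m=1024):
        self.m, self.bits = m, bytearray(m)
    def _idx(self, x):
        # hash() is stable within one process run, which suffices here.
        return (hash(("a", x)) % self.m, hash(("b", x)) % self.m)
    def add(self, x):
        for i in self._idx(x):
            self.bits[i] = 1
    def __contains__(self, x):
        return all(self.bits[i] for i in self._idx(x))

def new_triangles(adj, u, v):
    """Approximate |N(u) ∩ N(v)| on insertion of edge (u, v): build a
    Bloom filter over the shorter adjacency list, scan the longer one."""
    small, large = sorted((adj[u], adj[v]), key=len)
    bf = Bloom()
    for w in small:
        bf.add(w)
    return sum(1 for w in large if w in bf)

adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(new_triangles(adj, 2, 4))   # common neighbors of 2 and 4: {1, 3} -> 2
```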
{"title":"Massive streaming data analytics: A case study with clustering coefficients","authors":"David Ediger, Karl Jiang, E. J. Riedy, David A. Bader","doi":"10.1109/IPDPSW.2010.5470687","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470687","url":null,"abstract":"We present a new approach for parallel massive graph analysis of streaming, temporal data with a dynamic and extensible representation. Handling the constant stream of new data from health care, security, business, and social network applications requires new algorithms and data structures. We examine data structure and algorithm trade-offs that extract the parallelism necessary for high-performance updating analysis of massive graphs. Static analysis kernels often rely on storing input data in a specific structure. Maintaining these structures for each possible kernel with high data rates incurs a significant performance cost. A case study computing clustering coefficients on a general-purpose data structure demonstrates incremental updates can be more efficient than global recomputation. Within this kernel, we compare three methods for dynamically updating local clustering coefficients: a brute-force local recalculation, a sorting algorithm, and our new approximation method using a Bloom filter. On 32 processors of a Cray XMT with a synthetic scale-free graph of 224 ≈ 16 million vertices and 229 ≈ 537 million edges, the brute-force method processes a mean of over 50 000 updates per second and our Bloom filter approaches 200 000 updates per second.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"197 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133233930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An architectural space exploration tool for domain specific reconfigurable computing
Pub Date: 2010-04-19 · DOI: 10.1109/IPDPSW.2010.5470735
Gayatri Mehta, A. Jones
In this paper, we describe a design space exploration (DSE) tool for domain-specific reconfigurable computing, in which the needs of the applications drive the construction of the device architecture. The tool automates design space case studies, allowing application developers to explore architectural tradeoffs efficiently and reach solutions quickly. We selected some of the core signal processing benchmarks from the MediaBench benchmark suite and some edge-detection benchmarks from the image processing domain for our case studies. We compare the energy consumption of the architecture selected from manual design space case studies with the architectural solution selected by the DSE tool. The architecture selected by the tool consumes approximately 9% less energy on average than the best candidate from the manual design space case studies. The fabric architecture selected from the manual case studies and the one selected by the tool were both synthesized on a 130 nm cell-based ASIC fabrication process from IBM. We compare the energy of the benchmarks implemented on the fabric with other hardware and software implementations. Both fabric architectures (manual and tool) yield energy within 3X of a direct ASIC implementation, 330X better than a Virtex-II Pro FPGA, and 2016X better than an Intel XScale processor.
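The abstract does not describe the tool's search strategy; the skeleton common to such DSE tools is a loop over candidate fabric configurations, each scored by an energy model over the benchmark set. The sketch below is purely hypothetical: the fabric parameters, benchmark tuples, and cost-model coefficients are invented placeholders standing in for the tool's real mapping and estimation steps.

```python
import itertools

# Hypothetical fabric parameters; the real design space (ALU mix,
# interconnect, fabric dimensions) is far richer than this.
widths = [4, 8, 16]   # datapath width
rows = [6, 8, 10]     # functional-unit rows

def estimate_energy(width, n_rows, benchmark):
    """Placeholder cost model standing in for the tool's per-benchmark
    mapping + energy estimation; coefficients are invented."""
    ops, parallelism = benchmark
    cycles = ops / min(parallelism, n_rows * width / 8)
    return cycles * (0.2 + 0.01 * width * n_rows)

benchmarks = [(12_000, 6), (48_000, 12), (30_000, 4)]  # (ops, available parallelism)

# Exhaustive search: pick the configuration minimizing total energy.
best = min(
    itertools.product(widths, rows),
    key=lambda cfg: sum(estimate_energy(cfg[0], cfg[1], b) for b in benchmarks),
)
print("selected fabric (width, rows):", best)
```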
{"title":"An architectural space exploration tool for domain specific reconfigurable computing","authors":"Gayatri Mehta, A. Jones","doi":"10.1109/IPDPSW.2010.5470735","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470735","url":null,"abstract":"In this paper, we describe a design space exploration (DSE) tool for domain specific reconfigurable computing where the needs of the applications drive the construction of the device architecture. The tool has been developed to automate the design space case studies which allows application developers to explore architectural tradeoffs efficiently and reach solutions quickly. We selected some of the core signal processing benchmarks from the MediaBench benchmark suite and some of the edge-detection benchmarks from the image processing domain for our case studies. We compare the energy consumption of the architecture selected from manual design space case studies with the architectural solution selected by the design space exploration tool. The architecture selected by the DSE tool consumes approximately 9% less energy on an average as compared to the best candidate from the manual design space case studies. The fabric architecture selected from the manual design case studies and the one selected by the tool were synthesized on 130 nm cell-based ASIC fabrication process from IBM. We compare the energy of the benchmarks implemented onto the fabric with other hardware and software implementations. Both fabric architectures (manual and tool) yield energy within 3X of a direct ASIC implementation, 330X better than a Virtex-II Pro FPGA and 2016X better than an Intel XScale processor.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124336940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Collaborative execution environment for heterogeneous parallel systems
Pub Date: 2010-04-19 · DOI: 10.1109/IPDPSW.2010.5470835
A. Ilic, L. Sousa
Commodity computers are now complex heterogeneous systems that provide a huge amount of computational power. To take advantage of this power, however, we have to orchestrate the use of processing units with different characteristics. Such distributed-memory systems rely on relatively slow interconnection networks, such as system buses, so most of the time only the central processing unit (CPU) or the processing accelerators, which are simpler homogeneous subsystems, are exploited individually. In this paper we propose a collaborative execution environment for exploiting data parallelism in a heterogeneous system. It is shown that this environment can be used to program both the CPU and graphics processing units (GPUs) to collaboratively compute matrix multiplication and the fast Fourier transform (FFT). Experimental results show that significant performance benefits are achieved when both the CPU and the GPU are used.
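The environment's scheduling interface is not given in the abstract, but the core idea of splitting one data-parallel call across devices can be shown with a row-block partition of a matrix multiplication. In this hedged sketch both "devices" are plain NumPy workers standing in for the CPU and GPU execution paths, and the fixed 0.75 GPU share is an assumption (a real system would calibrate it).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def matmul_block(args):
    device, A_rows, B = args
    # In a real environment this would dispatch to a CPU kernel or a
    # GPU kernel; here both paths are plain NumPy for illustration.
    return device, A_rows @ B

def collaborative_matmul(A, B, gpu_share=0.75):
    """Split A's rows between a 'cpu' and a 'gpu' worker, run both
    concurrently, and merge. gpu_share would come from calibration."""
    cut = int(A.shape[0] * (1 - gpu_share))
    parts = [("cpu", A[:cut], B), ("gpu", A[cut:], B)]
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = dict(pool.map(matmul_block, parts))
    return np.vstack([results["cpu"], results["gpu"]])

A, B = np.random.rand(512, 256), np.random.rand(256, 128)
assert np.allclose(collaborative_matmul(A, B), A @ B)
```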
{"title":"Collaborative execution environment for heterogeneous parallel systems","authors":"A. Ilic, L. Sousa","doi":"10.1109/IPDPSW.2010.5470835","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470835","url":null,"abstract":"Nowadays, commodity computers are complex heterogeneous systems that provide a huge amount of computational power. However, to take advantage of this power we have to orchestrate the use of processing units with different characteristics. Such distributed memory systems make use of relatively slow interconnection networks, such as system buses. Therefore, most of the time we only individually take advantage of the central processing unit (CPU) or processing accelerators, which are simpler homogeneous subsystems. In this paper we propose a collaborative execution environment for exploiting data parallelism in a heterogeneous system. It is shown that this environment can be applied to program both CPU and graphics processing units (GPUs) to collaboratively compute matrix multiplication and fast Fourier transform (FFT). Experimental results show that significant performance benefits are achieved when both CPU and GPU are used.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122801941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multicore-aware reuse distance analysis
Pub Date: 2010-04-19 · DOI: 10.1109/IPDPSW.2010.5470780
Derek L. Schuff, Benjamin S. Parsons, Vijay S. Pai
This paper presents and validates methods to extend reuse distance analysis of application locality characteristics to shared-memory multicore platforms by accounting for invalidation-based cache-coherence and inter-core cache sharing. Existing reuse distance analysis methods track the number of distinct addresses referenced between reuses of the same address by a given thread, but do not model the effects of data references by other threads. This paper shows several methods to keep reuse stacks consistent so that they account for invalidations and cache sharing, either as references arise in a simulated execution or at synchronization points. These methods are evaluated against a Simics-based coherent cache simulator running several OpenMP and transaction-based benchmarks. The results show that adding multicore-awareness substantially improves the ability of reuse distance analysis to model cache behavior, reducing the error in miss ratio prediction (relative to cache simulation for a specific cache size) by an average of 70% for per-core caches and an average of 90% for shared caches.
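Sequential reuse distance is the number of distinct addresses touched between two references to the same address; the multicore extension additionally removes an address from a thread's reuse stack when another thread writes it, so the next access registers as a coherence miss. A minimal sketch of that bookkeeping, assuming a simple list-based stack (a production tool would use a tree for O(log n) updates, and this is not the authors' code):

```python
class ReuseStack:
    """Per-thread LRU stack: distance = number of distinct addresses
    referenced since the last use. A write by another thread invalidates
    an address so its next reuse is an infinite-distance (coherence) miss."""
    def __init__(self):
        self.stack = []  # most recently used address at index 0

    def access(self, addr):
        if addr in self.stack:
            d = self.stack.index(addr)   # distinct addresses above it
            self.stack.remove(addr)
        else:
            d = float("inf")             # cold or invalidated: misses at any size
        self.stack.insert(0, addr)
        return d

    def invalidate(self, addr):
        if addr in self.stack:
            self.stack.remove(addr)

t0 = ReuseStack()
for a in ["x", "y", "z"]:
    t0.access(a)
print(t0.access("x"))   # 2: addresses y and z intervened since last use of x
t0.invalidate("x")      # another core wrote x
print(t0.access("x"))   # inf: coherence miss despite the recent reuse
```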
{"title":"Multicore-aware reuse distance analysis","authors":"Derek L. Schuff, Benjamin S. Parsons, Vijay S. Pai","doi":"10.1109/IPDPSW.2010.5470780","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470780","url":null,"abstract":"This paper presents and validates methods to extend reuse distance analysis of application locality characteristics to shared-memory multicore platforms by accounting for invalidation-based cache-coherence and inter-core cache sharing. Existing reuse distance analysis methods track the number of distinct addresses referenced between reuses of the same address by a given thread, but do not model the effects of data references by other threads. This paper shows several methods to keep reuse stacks consistent so that they account for invalidations and cache sharing, either as references arise in a simulated execution or at synchronization points. These methods are evaluated against a Simics-based coherent cache simulator running several OpenMP and transaction-based benchmarks. The results show that adding multicore-awareness substantially improves the ability of reuse distance analysis to model cache behavior, reducing the error in miss ratio prediction (relative to cache simulation for a specific cache size) by an average of 70% for per-core caches and an average of 90% for shared caches.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123951970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GridP2P: Resource usage in Grids and Peer-to-Peer systems
Pub Date: 2010-04-19 · DOI: 10.1109/IPDPSW.2010.5470917
Sérgio Esteves, L. Veiga, P. Ferreira
The last few years have witnessed huge growth in computer technology and in the resources available throughout the Internet. These resources can be used to run CPU-intensive applications requiring long periods of processing time. Grid systems allow us to take advantage of available resources lying over a network. However, these systems impose several difficulties on their usage (e.g., heavy authentication and configuration management); to overcome them, Peer-to-Peer systems provide open access, making the Grid available to any user. Our solution is a platform for distributed cycle sharing that combines the Grid and Peer-to-Peer models. A major goal is to allow any ordinary user to use remote idle cycles to speed up commodity applications. In turn, users can also provide the spare cycles of their machines when they are not using them. Our solution encompasses the following functionalities: application management, job creation and scheduling, resource discovery, security policies, and overlay network management. The simple and modular organization of the system allows components to be changed at minimum cost. In addition, the use of history-based policies provides powerful usage semantics for resource management.
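The abstract mentions history-based policies without detailing them; one plausible reading is ranking candidate peers by their historically observed completion rate and availability when scheduling jobs. A hedged sketch of such a policy, with invented scoring weights that are not from the paper:

```python
def score(history):
    """Rank a peer by past behavior: fraction of accepted jobs it actually
    finished, blended with observed availability. The 0.7/0.3 weights are
    illustrative assumptions."""
    done, accepted, uptime = history
    completion = done / accepted if accepted else 0.5   # neutral prior for newcomers
    return 0.7 * completion + 0.3 * uptime

# (jobs completed, jobs accepted, availability in [0, 1]) per peer
peers = {"peerA": (18, 20, 0.95), "peerB": (5, 20, 0.99), "peerC": (0, 0, 0.80)}
ranked = sorted(peers, key=lambda p: score(peers[p]), reverse=True)
print(ranked)   # schedule onto the peers most likely to return results
```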
{"title":"GridP2P: Resource usage in Grids and Peer-to-Peer systems","authors":"Sérgio Esteves, L. Veiga, P. Ferreira","doi":"10.1109/IPDPSW.2010.5470917","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470917","url":null,"abstract":"The last few years have witnessed huge growth in computer technology and available resources throughout the Internet. These resources can be used to run CPU-intensive applications requiring long periods of processing time. Grid systems allow us to take advantage of available resources lying over a network. However, these systems impose several difficulties to their usage (e.g. heavy authentication and configuration management); in order to overcome them, Peer-to-Peer systems provide open access making the Grid available to any user. Our solution consists of a platform for distributed cycle sharing which attempts to combine Grid and Peer-to-Peer models. A major goal is to allow any ordinary user to use remote idle cycles in order to speedup commodity applications. On the other hand, users can also provide spare cycles of their machines when they are not using them. Our solution encompasses the following functionalities: application management, job creation and scheduling, resource discovery, security policies, and overlay network management. The simple and modular organization of this system allows that components can be changed at minimum cost. In addition, the use of history-based policies provides powerful usage semantics concerning the resource management.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127700511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}