
Latest publications: 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)

The Green500 List: Year two
Wu-chun Feng, Heshan Lin
The Green500 turned two years old this past November at the ACM/IEEE SC|09 Conference. As part of the grassroots movement of the Green500, this paper takes a look back and reflects on how the Green500 has evolved in its second year as well as since its inception. Specifically, it analyzes trends in the Green500 and reports on the implications of these trends. In addition, based on significant feedback from the high-end computing (HEC) community, the Green500 announced three exploratory sub-lists: the Little Green500, the Open Green500, and the HPCC Green500, which are each discussed in this paper.
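The list's ordering metric, performance per watt, amounts to a simple sort; a minimal sketch (the field names below are illustrative, not the official list schema):

```python
def green500_rank(systems):
    """Order systems by energy efficiency (MFLOPS per watt), descending.
    This is the Green500's ranking metric, in contrast to the Top500's
    ordering by raw performance alone."""
    return sorted(systems, key=lambda s: s["mflops"] / s["watts"], reverse=True)
```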
DOI: 10.1109/IPDPSW.2010.5470905 · Published: 2010-04-19
Citations: 11
Statistical predictors of computing power in heterogeneous clusters
R. C. Chiang, A. A. Maciejewski, A. Rosenberg, H. Siegel
If cluster C1 consists of computers with a faster mean speed than the computers in cluster C2, does this imply that cluster C1 is more productive than cluster C2? What if the computers in cluster C1 have the same mean speed as the computers in cluster C2: is the one with computers that have a higher variance in speed more productive? Simulation experiments are performed to explore the above questions within a formal framework for measuring the performance of a cluster. Simulation results show that both mean speed and variance in speed (when mean speeds are equal) are typically correlated with the performance of a cluster, but not always; these statements are quantified statistically for our simulation environments. In addition, simulation results also show that: (1) If the mean speed of computers in cluster C1 is faster by at least a threshold amount than the mean speed of computers in cluster C2, then C1 is more productive than C2. (2) If the computers in clusters C1 and C2 have the same mean speed, then C1 is more productive than C2 when the variance in speed of computers in cluster C1 is higher by at least a threshold amount than the variance in speed of computers in cluster C2.
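These questions can be explored in miniature with a toy list-scheduling simulation (an illustrative sketch under a simple unit-task model, not the paper's formal framework or its simulator):

```python
import heapq
import random

def makespan(speeds, num_tasks):
    """List-scheduling makespan: each unit-work task goes to the machine
    that is free earliest; a machine of speed s needs 1/s time per task.
    A lower makespan means a more productive cluster."""
    finish = [(0.0, s) for s in speeds]   # (time machine becomes free, speed)
    heapq.heapify(finish)
    for _ in range(num_tasks):
        t, s = heapq.heappop(finish)
        heapq.heappush(finish, (t + 1.0 / s, s))
    return max(t for t, _ in finish)

def sample_speeds(mean, stddev, n, rng):
    """Draw n positive machine speeds with the given mean and spread."""
    return [max(0.1, rng.gauss(mean, stddev)) for _ in range(n)]
```

Under this model, a cluster whose machines have a clearly faster mean speed finishes the same batch of tasks sooner, matching the paper's first observation.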
DOI: 10.1109/IPDPSW.2010.5470869 · Published: 2010-04-19
Citations: 5
An efficient GPU implementation of the revised simplex method
Jakob Bieling, Patrick Peschlow, P. Martini
The computational power provided by the massive parallelism of modern graphics processing units (GPUs) has moved increasingly into focus over the past few years. In particular, general purpose computing on GPUs (GPGPU) is attracting attention among researchers and practitioners alike. Yet GPGPU research is still in its infancy, and a major challenge is to rearrange existing algorithms so as to obtain a significant performance gain from the execution on a GPU. In this paper, we address this challenge by presenting an efficient GPU implementation of a very popular algorithm for linear programming, the revised simplex method. We describe how to carry out the steps of the revised simplex method to take full advantage of the parallel processing capabilities of a GPU. Our experiments demonstrate considerable speedup over a widely used CPU implementation, thus underlining the tremendous potential of GPGPU.
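The revised simplex method the authors port to the GPU can be sketched on the CPU in a few lines of NumPy (a dense, textbook variant for maximization with inequality constraints; the basis inverse is recomputed from scratch here, whereas real implementations, including the paper's GPU version, update it incrementally):

```python
import numpy as np

def revised_simplex(c, A, b):
    """Maximize c @ x subject to A @ x <= b, x >= 0 (b >= 0 assumed).
    Textbook revised simplex: only the basis inverse is kept, and
    columns of A are priced out on demand."""
    m, n = A.shape
    Afull = np.hstack([A, np.eye(m)])      # append slack variables
    cfull = np.concatenate([c, np.zeros(m)])
    basis = list(range(n, n + m))          # start from the all-slack basis
    while True:
        Binv = np.linalg.inv(Afull[:, basis])  # sketch only; real codes update Binv
        xB = Binv @ b
        y = cfull[basis] @ Binv            # simplex multipliers
        reduced = cfull - y @ Afull        # reduced costs of all columns
        reduced[basis] = 0.0
        entering = int(np.argmax(reduced))
        if reduced[entering] <= 1e-9:      # no improving column: optimal
            x = np.zeros(n + m)
            x[basis] = xB
            return x[:n], float(cfull[basis] @ xB)
        d = Binv @ Afull[:, entering]
        ratios = np.full(m, np.inf)        # ratio test over rows with d > 0
        pos = d > 1e-9
        ratios[pos] = xB[pos] / d[pos]
        leaving = int(np.argmin(ratios))
        if ratios[leaving] == np.inf:
            raise ValueError("problem is unbounded")
        basis[leaving] = entering
```

On the classic example max 3x + 5y subject to x <= 4, 2y <= 12, 3x + 2y <= 18, this returns the optimum (2, 6) with objective 36.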
DOI: 10.1109/IPDPSW.2010.5470831 · Published: 2010-04-19
Citations: 43
An adaptive I/O load distribution scheme for distributed systems
Xin Chen, J. Langston, Xubin He, Fengjiang Mao
A fundamental issue in a large-scale distributed system consisting of heterogeneous machines which vary in both I/O and computing capabilities is to distribute workloads with respect to the capabilities of each node to achieve optimal performance. However, node capabilities are often not stable due to various factors. Simply using a static workload distribution scheme may not match the capability of each node well. To address this issue, we distribute workload adaptively to changes in system node capability. In this paper we present an adaptive I/O load distribution scheme to dynamically capture the I/O capabilities among system nodes and to predictively determine a suitable load distribution pattern. A case study is conducted by applying our load distribution scheme to a popular distributed file system, PVFS2. Experimental results show that our adaptive load distribution scheme can dramatically improve performance: up to 70% performance gain for writes and 80% for reads, and up to 63% overall performance loss can be avoided in the presence of an unstable Object Storage Device (OSD).
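The core idea, weighting each node by a running estimate of its observed throughput, can be sketched as follows (the class name, smoothing factor, and interface are illustrative assumptions; PVFS2's actual mechanism differs):

```python
class AdaptiveDistributor:
    """Weights each node by an exponentially weighted moving average (EWMA)
    of its observed throughput and hands out stripe shares in proportion,
    so the distribution tracks changes in node capability over time."""

    def __init__(self, nodes, alpha=0.3):
        self.tput = {n: 1.0 for n in nodes}  # optimistic initial estimate
        self.alpha = alpha                   # weight given to new observations

    def observe(self, node, mb_per_s):
        """Fold a new throughput measurement into the node's EWMA."""
        self.tput[node] = (1 - self.alpha) * self.tput[node] + self.alpha * mb_per_s

    def shares(self, total_stripes):
        """Split total_stripes among nodes in proportion to estimated throughput."""
        total = sum(self.tput.values())
        return {n: round(total_stripes * t / total) for n, t in self.tput.items()}
```

A node that keeps reporting four times the throughput of another converges to roughly four times its stripe share.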
DOI: 10.1109/IPDPSW.2010.5470787 · Published: 2010-04-19
Citations: 2
An empirical study of a scalable Byzantine agreement algorithm
O. Oluwasanmi, Jared Saia, Valerie King
A recent theoretical result by King and Saia shows that it is possible to solve the Byzantine agreement, leader election and universe reduction problems in the full information model with Õ(n^(3/2)) total bits sent. However, this result, while theoretically interesting, is not practical due to large hidden constants. In this paper, we design a new practical algorithm based on this theoretical result. For networks containing more than about 1,000 processors, our new algorithm sends significantly fewer bits than a well-known algorithm due to Cachin, Kursawe and Shoup. To obtain our practical algorithm, we relax the fault model compared to the model of King and Saia by (1) allowing the adversary to control only a 1/8, rather than a 1/3, fraction of the processors; and (2) assuming the existence of a cryptographic bit commitment primitive. Our algorithm assumes a partially synchronous communication model, where any message sent from one honest player to another honest player needs at most Δ time steps to be received and processed by the recipient, for some fixed Δ; we assume that the clock speeds of the honest players are roughly the same, though the clocks do not have to be synchronized (i.e., show the same time).
DOI: 10.1109/IPDPSW.2010.5470874 · Published: 2010-04-19
Citations: 8
Massive streaming data analytics: A case study with clustering coefficients
David Ediger, Karl Jiang, E. J. Riedy, David A. Bader
We present a new approach for parallel massive graph analysis of streaming, temporal data with a dynamic and extensible representation. Handling the constant stream of new data from health care, security, business, and social network applications requires new algorithms and data structures. We examine data structure and algorithm trade-offs that extract the parallelism necessary for high-performance updating analysis of massive graphs. Static analysis kernels often rely on storing input data in a specific structure. Maintaining these structures for each possible kernel with high data rates incurs a significant performance cost. A case study computing clustering coefficients on a general-purpose data structure demonstrates incremental updates can be more efficient than global recomputation. Within this kernel, we compare three methods for dynamically updating local clustering coefficients: a brute-force local recalculation, a sorting algorithm, and our new approximation method using a Bloom filter. On 32 processors of a Cray XMT with a synthetic scale-free graph of 2^24 ≈ 16 million vertices and 2^29 ≈ 537 million edges, the brute-force method processes a mean of over 50 000 updates per second and our Bloom filter approaches 200 000 updates per second.
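Incremental triangle counting, the basis of updating local clustering coefficients, can be sketched on a general-purpose adjacency structure (an exact brute-force update in Python, not the paper's Cray XMT code or its Bloom-filter approximation):

```python
from collections import defaultdict

class StreamingTriangles:
    """Adjacency sets plus per-vertex triangle counts, updated incrementally
    on each inserted edge instead of recomputing over the whole graph."""

    def __init__(self):
        self.adj = defaultdict(set)
        self.tri = defaultdict(int)

    def add_edge(self, u, v):
        # Each common neighbor of u and v closes one new triangle.
        for w in self.adj[u] & self.adj[v]:
            self.tri[u] += 1
            self.tri[v] += 1
            self.tri[w] += 1
        self.adj[u].add(v)
        self.adj[v].add(u)

    def clustering(self, v):
        """Local clustering coefficient: triangles(v) / (deg(v) choose 2)."""
        d = len(self.adj[v])
        return 0.0 if d < 2 else 2.0 * self.tri[v] / (d * (d - 1))
```

After inserting the three edges of a triangle, every vertex has coefficient 1.0; adding a pendant edge to one vertex drops its coefficient to 1/3, with no global recomputation.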
DOI: 10.1109/IPDPSW.2010.5470687 · Published: 2010-04-19
Citations: 71
An architectural space exploration tool for domain specific reconfigurable computing
Gayatri Mehta, A. Jones
In this paper, we describe a design space exploration (DSE) tool for domain specific reconfigurable computing where the needs of the applications drive the construction of the device architecture. The tool has been developed to automate the design space case studies which allows application developers to explore architectural tradeoffs efficiently and reach solutions quickly. We selected some of the core signal processing benchmarks from the MediaBench benchmark suite and some of the edge-detection benchmarks from the image processing domain for our case studies. We compare the energy consumption of the architecture selected from manual design space case studies with the architectural solution selected by the design space exploration tool. The architecture selected by the DSE tool consumes approximately 9% less energy on an average as compared to the best candidate from the manual design space case studies. The fabric architecture selected from the manual design case studies and the one selected by the tool were synthesized on 130 nm cell-based ASIC fabrication process from IBM. We compare the energy of the benchmarks implemented onto the fabric with other hardware and software implementations. Both fabric architectures (manual and tool) yield energy within 3X of a direct ASIC implementation, 330X better than a Virtex-II Pro FPGA and 2016X better than an Intel XScale processor.
DOI: 10.1109/IPDPSW.2010.5470735 · Published: 2010-04-19
Citations: 5
Collaborative execution environment for heterogeneous parallel systems
A. Ilic, L. Sousa
Nowadays, commodity computers are complex heterogeneous systems that provide a huge amount of computational power. However, to take advantage of this power we have to orchestrate the use of processing units with different characteristics. Such distributed memory systems make use of relatively slow interconnection networks, such as system buses. Therefore, most of the time we only individually take advantage of the central processing unit (CPU) or processing accelerators, which are simpler homogeneous subsystems. In this paper we propose a collaborative execution environment for exploiting data parallelism in a heterogeneous system. It is shown that this environment can be applied to program both CPU and graphics processing units (GPUs) to collaboratively compute matrix multiplication and fast Fourier transform (FFT). Experimental results show that significant performance benefits are achieved when both CPU and GPU are used.
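The row-wise decomposition behind collaborative CPU/GPU matrix multiplication can be illustrated as follows (an assumption-laden sketch: both halves are computed with NumPy here, whereas in the real system one half would be dispatched to the GPU):

```python
import numpy as np

def split_matmul(A, B, cpu_share):
    """Static data-parallel split of C = A @ B by rows of A: the first
    cpu_share fraction of rows is one worker's portion, the rest the
    other's. Stacking the partial results reproduces the full product."""
    cut = int(A.shape[0] * cpu_share)
    top = A[:cut] @ B      # the "CPU" portion of the rows
    bottom = A[cut:] @ B   # the "GPU" portion of the rows
    return np.vstack([top, bottom])
```

Because row blocks of A multiply B independently, the split is lossless and the two portions can run concurrently on different devices.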
DOI: 10.1109/IPDPSW.2010.5470835 · Published: 2010-04-19
Citations: 4
Multicore-aware reuse distance analysis
Derek L. Schuff, Benjamin S. Parsons, Vijay S. Pai
This paper presents and validates methods to extend reuse distance analysis of application locality characteristics to shared-memory multicore platforms by accounting for invalidation-based cache-coherence and inter-core cache sharing. Existing reuse distance analysis methods track the number of distinct addresses referenced between reuses of the same address by a given thread, but do not model the effects of data references by other threads. This paper shows several methods to keep reuse stacks consistent so that they account for invalidations and cache sharing, either as references arise in a simulated execution or at synchronization points. These methods are evaluated against a Simics-based coherent cache simulator running several OpenMP and transaction-based benchmarks. The results show that adding multicore-awareness substantially improves the ability of reuse distance analysis to model cache behavior, reducing the error in miss ratio prediction (relative to cache simulation for a specific cache size) by an average of 70% for per-core caches and an average of 90% for shared caches.
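Sequential reuse distance, the quantity the paper extends to multicore, counts the distinct addresses touched between two consecutive references to the same address. A naive single-threaded sketch (quadratic in trace length; production tools use tree-based structures for better asymptotic cost):

```python
def reuse_distances(trace):
    """Reuse distance of each access: the number of distinct addresses
    referenced since the previous access to the same address, or
    infinity for a first touch. An access hits in a fully associative
    LRU cache of C lines exactly when its reuse distance is < C."""
    last_seen = {}
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            distinct = len(set(trace[last_seen[addr] + 1 : i]))
            distances.append(distinct)
        else:
            distances.append(float('inf'))
        last_seen[addr] = i
    return distances
```

For the trace a, b, c, a, b the fourth access has distance 2 (b and c intervene), which is what the multicore extension must adjust when another core invalidates or shares those lines.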
DOI: 10.1109/IPDPSW.2010.5470780 · Published: 2010-04-19
引用次数: 55
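The abstract above builds on the classic (single-threaded) reuse distance metric: for each memory reference, the number of distinct addresses touched since the previous reference to the same address. A minimal sketch of that baseline computation follows; this is a generic illustration in Python, not the paper's multicore-aware implementation, and the function name `reuse_distances` and list-based stack are assumptions for clarity:

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Compute the reuse distance of each reference in an address trace.

    The reuse distance of a reference is the number of *distinct*
    addresses referenced since the previous reference to the same
    address; a first-time reference has infinite distance. This is the
    per-thread baseline that the paper extends with invalidation and
    cache-sharing awareness.
    """
    stack = OrderedDict()  # insertion order tracks recency; last key = MRU
    distances = []
    for addr in trace:
        if addr in stack:
            # distance = number of distinct addresses more recent than addr
            keys = list(stack)
            distances.append(len(keys) - 1 - keys.index(addr))
            del stack[addr]
        else:
            distances.append(float("inf"))
        stack[addr] = None  # (re)insert at the MRU position
    return distances
```

For the trace `a b c a`, the final reference to `a` has distance 2 (two distinct addresses, `b` and `c`, intervene). The linear stack scan is O(n) per reference; production analyzers typically use tree-based structures to bring this down, and the paper's contribution is updating such stacks consistently across cores on invalidations and shared-cache references.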
GridP2P: Resource usage in Grids and Peer-to-Peer systems 网格和点对点系统中的资源使用
Sérgio Esteves, L. Veiga, P. Ferreira
The last few years have witnessed huge growth in computer technology and available resources throughout the Internet. These resources can be used to run CPU-intensive applications requiring long periods of processing time. Grid systems allow us to take advantage of available resources lying over a network. However, these systems impose several difficulties to their usage (e.g. heavy authentication and configuration management); in order to overcome them, Peer-to-Peer systems provide open access making the Grid available to any user. Our solution consists of a platform for distributed cycle sharing which attempts to combine Grid and Peer-to-Peer models. A major goal is to allow any ordinary user to use remote idle cycles in order to speedup commodity applications. On the other hand, users can also provide spare cycles of their machines when they are not using them. Our solution encompasses the following functionalities: application management, job creation and scheduling, resource discovery, security policies, and overlay network management. The simple and modular organization of this system allows that components can be changed at minimum cost. In addition, the use of history-based policies provides powerful usage semantics concerning the resource management.
在过去的几年里,计算机技术和互联网上的可用资源得到了巨大的发展。这些资源可用于运行需要长时间处理的cpu密集型应用程序。网格系统允许我们利用网络上的可用资源。然而,这些系统给它们的使用带来了一些困难(例如繁重的身份验证和配置管理);为了克服这些问题,点对点系统提供了开放访问,使得任何用户都可以使用网格。我们的解决方案包括一个分布式循环共享平台,它试图将网格和点对点模型结合起来。一个主要目标是允许任何普通用户使用远程空闲周期,以加速商品应用程序。另一方面,用户也可以在不使用机器时提供备用周期。我们的解决方案包含以下功能:应用程序管理、作业创建和调度、资源发现、安全策略和覆盖网络管理。该系统的简单和模块化组织允许以最小的成本更改组件。此外,使用基于历史的策略提供了与资源管理相关的强大使用语义。
引用次数: 4
Journal
2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)