
Latest publications: 2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Accelerating Large-Scale Single-Source Shortest Path on FPGA
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.130
Shijie Zhou, C. Chelmis, V. Prasanna
Many real-world problems can be represented as graphs and solved by graph traversal algorithms. Single-Source Shortest Path (SSSP) is a fundamental graph algorithm. Today, large-scale graphs involve millions or even billions of vertices, making efficient parallel graph processing challenging. In this paper, we propose a single-FPGA based design to accelerate SSSP for massive graphs. We adopt the well-known Bellman-Ford algorithm. In the proposed design, the graph is stored in external memory, which is more realistic for processing large-scale graphs. Using the available external memory bandwidth, our design achieves maximum data parallelism to concurrently process multiple edges in each clock cycle, regardless of data dependencies. The performance of our design is independent of the graph structure as well. We propose an optimized data layout to enable efficient utilization of external memory bandwidth. We prototype our design using a state-of-the-art FPGA. Experimental results show that our design is capable of processing 1.6 billion edges per second (GTEPS) using a single FPGA, while simultaneously achieving a high clock rate of over 200 MHz. This would place us in the 131st position of the Graph 500 benchmark list of supercomputing systems for data-intensive applications. Our solution therefore provides comparable performance to state-of-the-art systems.
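For reference, the Bellman-Ford algorithm the design adopts can be sketched in software. This CPU-side version (an illustrative helper, not the authors' FPGA implementation) relaxes edges one at a time, whereas the accelerator processes many edges per clock cycle:

```python
import math

def bellman_ford(num_vertices, edges, source):
    """Classic Bellman-Ford SSSP over an edge list of (u, v, weight).

    Returns shortest distances from `source`, or None if a
    negative-weight cycle is reachable.
    """
    dist = [math.inf] * num_vertices
    dist[source] = 0
    # The FPGA design relaxes many edges per clock cycle; this
    # software sketch sweeps them sequentially instead.
    for _ in range(num_vertices - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            return None  # further relaxation possible: negative cycle
    return dist
```

Bellman-Ford tolerates stale distance values between sweeps, which is why the paper can relax multiple edges concurrently regardless of data dependencies.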
Citations: 35
NIDISC Introduction and Committees
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2014.211
P. Bouvry, F. Seredyński, E. Talbi
This section includes the articles presented at the 18th International Workshop on Nature Inspired Distributed Computing (NIDISC 2015), held in conjunction with the 29th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS 2015), May 25-29, 2015, Hyderabad, India. The NIDISC workshop is an opportunity for researchers to explore the connections between biology, nature-inspired techniques, metaheuristics, and the development of solutions to problems that arise in parallel and distributed processing, communications, and other application areas.
Citations: 0
A Generic Framework for Impossibility Results in Time-Varying Graphs
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.59
Nicolas Braud-Santoni, S. Dubois, Mohamed-Hamza Kaaouachi, F. Petit
We address highly dynamic distributed systems modelled by time-varying graphs (TVGs). We are interested in proofs of impossibility results, which often rely on informal arguments about convergence. First, we provide a topological distance metric over sets of TVGs to correctly define the convergence of TVG sequences in such sets. Next, we provide a general framework that formally proves the convergence of the sequence of executions of any deterministic algorithm over the TVGs of any convergent sequence of TVGs. Finally, we illustrate the relevance of the above result by proving that no deterministic algorithm exists to compute the underlying graph of any connected-over-time TVG, i.e., any TVG of the weakest class of long-lived TVGs.
Citations: 4
Distributed Scheduling Algorithm for Highly Available Component Based Applications
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.114
M. Frîncu
The emergence of multi-clouds makes it difficult for application providers to offer reliable applications to end users. The different levels of infrastructure reliability offered by various cloud providers need to be abstracted at the application level through application-aware algorithms for high availability. This task is challenging due to the closed-world approach taken by the various cloud providers. In the face of differing access and management policies, orchestrated distributed management algorithms are needed instead of centralized solutions. In this paper we present a decentralized autonomic algorithm for achieving application high availability by harnessing the properties of scalable component-based applications and the advantage of overlay networks for communicating between peers. In a multi-cloud environment the algorithm maintains cloud provider independence while achieving global application availability. The algorithm was tested on a simulator, and results show that it gives similar results to a centralized approach without inducing much communication overhead.
Citations: 1
GPU Accelerated Molecular Dynamics with Method of Heterogeneous Load Balancing
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.41
T. Udagawa, M. Sekijima
Molecular Dynamics simulations are widely used to obtain a deeper understanding of chemical reactions, fluid flows, phase transitions, and other physical phenomena due to molecular interactions. The main problem with this method is that it is computationally demanding because of its O(N^2) complexity and the need for prolonged simulations. The use of Graphics Processing Units (GPUs) is an attractive solution and has been applied to this problem thus far. However, such heterogeneous approaches occasionally cause load imbalances between CPUs and GPUs and fail to utilize all computational resources. We propose and implement a method of balancing the workload between CPUs and GPUs. Our method is based on formulating and observing workloads, and it statically distributes work according to spatial decomposition. We succeeded in utilizing processors more efficiently and accelerating simulations by up to 20.7% compared to the original GPU-optimized code.
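The static CPU/GPU split described above can be illustrated by a throughput-proportional partition. The rates and flat item count below are hypothetical; the paper's actual distribution follows spatial decomposition of the simulation domain rather than a simple item count:

```python
def split_work(total_items, cpu_rate, gpu_rate):
    """Split `total_items` so CPU and GPU finish at roughly the
    same time, given measured throughputs (items/second) from a
    short profiling run. Returns (cpu_items, gpu_items)."""
    gpu_share = gpu_rate / (cpu_rate + gpu_rate)
    gpu_items = round(total_items * gpu_share)
    return total_items - gpu_items, gpu_items
```

Giving each device work in proportion to its observed throughput equalizes finish times, which is the essence of avoiding the CPU/GPU load imbalance the abstract mentions.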
Citations: 2
Enhancing Speedups for FPGA Accelerated SPICE through Frequency Scaling and Precision Reduction
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.100
L. Hui, Nachiket Kapre
Frequency scaling and precision reduction optimization of an FPGA accelerated SPICE circuit simulator can enhance performance by 1.5x while lowering implementation cost by 15 -- 20%. This is possible due to the inherent fault-tolerant capabilities of SPICE, which can naturally drive simulator convergence even in the presence of arithmetic errors introduced by frequency scaling and precision reduction. We quantify the impact of these transformations on SPICE by analyzing the resulting convergence residue and runtime. To explain the impact of our optimizations, we develop an empirical error model derived from in-situ frequency scaling experiments and build analytical models of rounding and truncation errors using Gappa-based numerical analysis. Across a range of benchmark SPICE circuits, we are able to tolerate bit-level fault rates of 10^-4 (frequency scaling) and manage up to an 8-bit loss in least-significant digits (precision reduction) without compromising SPICE convergence quality while delivering speedups.
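The precision-reduction idea can be mimicked in software by zeroing low-order mantissa bits of IEEE 754 doubles. This sketch is an illustration only, not the authors' Gappa-based analysis or FPGA datapath; it shows the truncation error staying bounded by the retained precision:

```python
import struct

def truncate_mantissa(x, keep_bits):
    """Zero out the low-order mantissa bits of a 64-bit IEEE 754
    double, keeping only `keep_bits` of the 52-bit mantissa
    (keep_bits <= 52). Emulates a reduced-precision datapath."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    drop = 52 - keep_bits
    mask = (~((1 << drop) - 1)) & 0xFFFFFFFFFFFFFFFF
    (y,) = struct.unpack("<d", struct.pack("<Q", bits & mask))
    return y
```

For a positive value, truncation shaves at most one unit in the `keep_bits`-th mantissa place, so the relative error stays below roughly 2^(-keep_bits), which is the kind of bounded perturbation an iterative solver like SPICE can absorb.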
Citations: 1
A Genetic Algorithm Approach for Adjusting Time Series Based Load Prediction
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.96
Raed Alkharboush, R. E. Grande, A. Boukerche
Distributed virtual simulations are prone to load oscillations, as well as load imbalances, during run-time. Detecting such imbalances and responding accordingly through load redistribution can be of great utility in keeping execution performance close to optimal. A dynamic balancing scheme offers a reactive approach, but a predictive scheme can prevent imbalances before they occur. Several models can be employed for predicting load, but due to the characteristics in which the load is collected and presented, time series offer reasonable load forecasting in a short time. However, Holt's model, a well-known model for time-series representation, shows limitations in forecasting load. To correct this issue, a genetic algorithm approach is introduced to dynamically adjust the model based on recent changes in load behaviour. The convergence of the algorithm can substantially influence the response time of the predictive balancing system, so an analysis is conducted to identify the minimum number of iterations needed to generate a reasonable adjustment.
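Holt's model mentioned above is double (linear) exponential smoothing. A minimal sketch follows, with the smoothing parameters `alpha` and `beta` standing in for the values the paper's genetic algorithm would tune:

```python
def holt_forecast(series, alpha, beta, horizon=1):
    """Holt's linear (double exponential) smoothing.

    `series` needs at least two observations; returns the
    `horizon`-step-ahead forecast, level + horizon * trend.
    """
    level = series[0]
    trend = series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        # Update the smoothed level, then the smoothed trend.
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + horizon * trend
```

On a perfectly linear series the model locks onto the trend exactly; on oscillating load traces, poorly chosen `alpha`/`beta` lag or overshoot, which is the limitation the GA-based adjustment targets.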
Citations: 0
GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.16
D. Sengupta, K. Agarwal, S. Song, K. Schwan
Recent work on graph analytics has sought to leverage the high performance offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. The GraphReduce methods presented in this paper permit a GPU-based accelerator to operate on graphs that exceed its internal memory capacity. GraphReduce operates with a combination of both edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model, achieving high degrees of parallelism supported by methods that partition graphs across GPU and host memories and efficiently move graph data between both. GraphReduce-based programming is performed via device functions that include gather map, gather reduce, apply, and scatter, implemented by programmers for the graph algorithms they wish to realize. Experimental evaluations for a wide variety of graph inputs, algorithms, and system configurations demonstrate that GraphReduce outperforms other competing approaches.
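To make the Gather-Apply-Scatter style concrete, here is one possible GAS formulation of SSSP in plain Python. This is an illustration of the programming model only; GraphReduce's actual device functions, partitioning, and GPU/host data movement are not reproduced:

```python
import math

def gas_sssp(num_vertices, edges, source):
    """SSSP in a gather-apply-scatter style: each sweep gathers
    candidate distances along edges leaving active vertices,
    applies the minimum, and scatters activation to neighbours
    whose distance improved."""
    dist = [math.inf] * num_vertices
    dist[source] = 0.0
    active = {source}
    while active:
        nxt = set()
        for u, v, w in edges:           # gather over frontier edges
            if u in active and dist[u] + w < dist[v]:
                dist[v] = dist[u] + w   # apply: keep the minimum
                nxt.add(v)              # scatter: activate neighbour
        active = nxt
    return dist
```

The active-set loop is what lets a GAS engine touch only the graph partitions containing frontier vertices, which matters when the full graph does not fit in GPU memory.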
Citations: 5
GraphMMU: Memory Management Unit for Sparse Graph Accelerators
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.101
Nachiket Kapre, Han Jianglei, Andrew Bean, P. Moorthy, Siddhartha
Memory management units that use low-level AXI descriptor chains to hold irregular graph-oriented access sequences can help improve the DRAM memory throughput of graph algorithms by almost an order of magnitude. For the Xilinx Zed board, we explore and compare the memory throughputs achievable when using (1) cache-enabled CPUs with an OS, (2) cache-enabled CPUs running bare-metal code, (3) CPU-based control of FPGA-based AXI DMAs, and finally (4) local FPGA-based control of AXI DMA transfers. For short-burst irregular traffic generated from sparse graph access patterns, we observe a performance penalty of almost 10× due to DRAM row activations when compared to cache-friendly sequential access. When using an AXI DMA engine configured in FPGA logic and programmed in AXI register mode from the CPU, we can improve DRAM performance by as much as 2.4× over naïve random access on the CPU. In this mode, we use the host CPU to trigger DMA transfers by writing appropriate control information into the internal registers of the DMA engine. We also encode the sparse graph access patterns as locally stored, BRAM-hosted AXI descriptor chains to drive the AXI DMA engines with minimal CPU involvement under Scatter Gather mode. In this configuration, we deliver an additional 3× speedup, for a cumulative throughput improvement of 7× over a CPU-based approach using caches while running an OS to manage irregular access.
Citations: 5
A Fair Randomized Contention Resolution Protocol for Wireless Nodes without Collision Detection Capabilities
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.86
Marcos F. Caetano, J. Bordim
Contention-based protocols are commonly used to provide channel access to nodes wishing to communicate. Binary Exponential Backoff (BEB) is a well-known contention protocol implemented by the IEEE 802.11 standard. Despite its widespread use, Medium Access Control (MAC) protocols employing BEB struggle to grant channel access as the number of contending nodes increases. The main contribution of this work is a randomized contention protocol for the case where the contending stations have no collision detection (NCD) capabilities. The proposed protocol, termed RNCD, uses tone signaling to provide fair selection of a transmitter. We show that the task of selecting a single transmitter among n ≥ 2 NCD stations can be accomplished in 48n time slots with probability at least 1 - 2^(-1.5n). Furthermore, RNCD works without prior knowledge of the number of contending nodes. For comparison purposes, RNCD and BEB were implemented in the OMNeT++ simulator. For n = 256, the simulation results show that RNCD can deliver twice as many transmissions per second, while channel access resolution takes less than 1% of the time needed by the BEB protocol. Unlike the exponential growth observed in the channel access time of the BEB implementation, RNCD exhibits a logarithmic tendency, allowing it to better meet the QoS demands of real-time applications.
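The abstract's core idea — winnowing contenders with busy tones rather than collision detection — can be illustrated with a toy simulation. This is a hedged sketch of generic tone-based elimination, not the RNCD protocol itself (whose round structure and 48n-slot bound are specified in the paper): in each round every active station flips a fair coin; stations drawing 1 assert a tone, and stations drawing 0 back off if they hear any tone. Distinguishing "some tone" from "silence" requires no collision detection.

```python
import random

def tone_election(n, rng):
    """Winnow n contending stations down to one via tone-signaling rounds.

    Illustrative sketch only: each round, every active station flips a
    fair coin.  Stations drawing 1 assert a busy tone; stations drawing
    0 drop out of this contention period if they hear any tone.  A
    station never needs to know *how many* tones are on the air, so no
    collision detection capability is assumed.
    Returns the winning station id and the number of rounds used.
    """
    active = list(range(n))
    rounds = 0
    while len(active) > 1:
        rounds += 1
        bits = {s: rng.getrandbits(1) for s in active}
        if any(bits.values()):
            # At least one tone heard: silent (0) stations back off.
            active = [s for s in active if bits[s] == 1]
        # All-silent round: nobody heard a tone, nobody is eliminated.
    return active[0], rounds

rng = random.Random(42)
winner, rounds = tone_election(256, rng)
print(f"station {winner} wins after {rounds} rounds")
```

Since each round with at least one tone roughly halves the active set, the expected number of rounds grows logarithmically in n, which matches the logarithmic channel-access tendency the abstract reports for RNCD.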
Citations: 1
Journal
2015 IEEE International Parallel and Distributed Processing Symposium Workshop