
Latest publications from the 2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)

A Parallel Algorithm for Minimum Spanning Tree on GPU
J. Vasconcellos, E. Cáceres, H. Mongelli, S. W. Song
Computing a minimum spanning tree (MST) of a graph is a fundamental problem in Graph Theory and arises as a subproblem in many applications. In this paper, we propose a parallel MST algorithm and implement it on a GPU (Graphics Processing Unit). Previous parallel MST algorithms rely heavily on parallel list ranking, a procedure that, although available in several parallel libraries, is very time-consuming. Using a different graph decomposition, called strut, we devised a new parallel MST algorithm that does not make use of the list ranking procedure. Based on the BSP/CGM model, we proved that our algorithm is correct and that it finds the MST after O(log p) iterations (communication and computation rounds). To show that our algorithm performs well on real parallel machines, we implemented it on a GPU. The way we designed the parallel algorithm allowed us to exploit the computing power of the GPU, and its efficiency was confirmed by our experimental results. The tests performed show that, for randomly constructed graphs with vertex counts varying from 10,000 to 30,000 and density between 0.02 and 0.2, the algorithm constructs an MST in at most six iterations. When the graph is not very sparse, our implementation achieved a speedup of more than 50, and for some instances as high as 296, over a minimum spanning tree sequential algorithm previously proposed in the literature.
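The strut decomposition and the GPU kernels are not given in the abstract; as a rough, hypothetical illustration of the round-based MST construction it describes (each round, every component selects its minimum-weight outgoing edge and components are contracted, so the number of rounds is logarithmic), here is a minimal sequential Borůvka-style sketch; the function and variable names are our own, not the paper's.

```python
def boruvka_mst(n, edges):
    """Sequential Boruvka sketch: each round every component picks its
    minimum-weight outgoing edge, then the chosen edges merge (contract)
    the components. edges is a list of (u, v, weight) tuples."""
    parent = list(range(n))

    def find(x):
        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst, total = [], 0.0
    while True:
        # cheapest outgoing edge per component in this round
        cheapest = {}
        for u, v, w in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][2]:
                    cheapest[r] = (u, v, w)
        if not cheapest:
            break  # single component left: MST complete
        for u, v, w in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append((u, v, w))
                total += w
    return mst, total
```

Each pass over the edge list halves (at least) the number of components, which mirrors the logarithmic iteration count reported above; the parallel version distributes the per-edge work across GPU threads.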
Citations: 6
Comparing Performance of C Compilers Optimizations on Different Multicore Architectures
R. Machado, R. Almeida, Andre D. Jardim, A. Pernas, A. Yamin, G. H. Cavalheiro
Multithreaded programming tools have become popular for exploiting high-performance processing with the dissemination of multicore processors. In this context, it is also common to exploit compiler optimizations to improve performance at execution time. In this work, we evaluate the performance achieved with the flags -O1, -O2, and -O3 of two C compilers (GCC and ICC) combined with five different APIs: Pthreads, C++11, OpenMP, Cilk Plus, and TBB. The experiments were performed on two distinct but compatible architectures (Intel Xeon and AMD Opteron). In our experiments, the use of optimization improves performance independently of the API. We observe that the application scheduling performed by the programming interfaces that provide an application-level scheduler has more impact on the final performance than the optimizations.
Citations: 10
A Dataflow Implementation of Region Growing Method for Cracks Segmentation
L. A. J. Marzulo, A. Sena, G. Mota, O. Gomes
Region growing is an image segmentation algorithm extremely useful for the extraction of continuous regions. It defines an initial set of seeds, according to specific criteria, and iteratively aggregates similar neighboring pixels. The algorithm converges when no pixel aggregation is performed in a given iteration. Within this research project, region growing is employed for the segmentation of cracks in images of ore particles acquired by scanning electron microscopy (SEM). The goal is to help scientists evaluate the efficiency of cracking methods that would improve metal exposure for extraction through heap leaching and bioleaching. However, this is a computationally intensive application that could take hours to analyze even a small set of images if executed sequentially. This paper presents and evaluates a dataflow parallel version of the region growing method for crack segmentation. The solution employs the Sucuri dataflow library for Python to orchestrate the execution in a computer cluster. Since the application processes images of different sizes and complexity, Sucuri played an important role in balancing load between machines in a transparent way. Experimental results show speedups of up to 26.85 in a small cluster with 40 processing cores and 23.75 in a 36-core machine.
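The seed-and-aggregate loop described above can be sketched as a seeded flood fill. This is a minimal illustrative version, not the authors' Sucuri dataflow implementation: the similarity criterion (absolute intensity difference to the seed) and 4-connectivity are our own assumptions.

```python
from collections import deque

def region_grow(image, seeds, tol):
    """Grow one region per seed: a neighbor pixel joins the region when
    its intensity differs from the seed's by at most `tol` (4-connectivity).
    The process stops when no more pixels can be aggregated."""
    h, w = len(image), len(image[0])
    label = [[0] * w for _ in range(h)]
    for region, (sy, sx) in enumerate(seeds, start=1):
        ref = image[sy][sx]
        queue = deque([(sy, sx)])
        label[sy][sx] = region
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w and label[ny][nx] == 0
                        and abs(image[ny][nx] - ref) <= tol):
                    label[ny][nx] = region
                    queue.append((ny, nx))
    return label
```

In the dataflow setting, each image (or image tile) becomes an independent node of this kind, which is what lets the runtime balance work across cluster machines.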
Citations: 0
Strategies to Improve the Performance of a Geophysics Model for Different Manycore Systems
M. Serpa, E. Cruz, M. Diener, Arthur M. Krause, Albert Farrés, C. Rosas, J. Panetta, Mauricio Hanzich, P. Navaux
Many software mechanisms for geophysics exploration in the Oil & Gas industry are based on wave propagation simulation. To perform such simulations, state-of-the-art HPC architectures are employed, generating results faster and with more accuracy at each generation. The software must evolve to support the new features of each design to keep performance scaling. Furthermore, it is important to understand the impact of each change applied to the software, in order to improve the performance as much as possible. In this paper, we propose several optimization strategies for a wave propagation model on five architectures: Intel Haswell, Intel Knights Corner, Intel Knights Landing, NVIDIA Kepler and NVIDIA Maxwell. We focus on improving cache memory usage, vectorization, and locality in the memory hierarchy. We analyze the hardware impact of the optimizations, providing insights into how each strategy can improve the performance. The results show that NVIDIA Maxwell improves over Intel Haswell, Intel Knights Corner, Intel Knights Landing and NVIDIA Kepler performance by up to 17.9x.
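The paper's kernels are not shown in the abstract; as a minimal sketch of the kind of stencil update wave propagation codes are built around (and the target of the cache and vectorization optimizations mentioned), here is a 1-D leapfrog step. The grid size, boundary treatment, and names are illustrative assumptions, not the authors' model.

```python
def wave_step(u_prev, u_curr, cfl2):
    """One leapfrog step of the 1-D wave equation u_tt = c^2 u_xx:
    u(n+1, i) = 2 u(n, i) - u(n-1, i)
                + CFL^2 * (u(n, i+1) - 2 u(n, i) + u(n, i-1)),
    with fixed zero boundaries. cfl2 = (c * dt / dx) ** 2."""
    m = len(u_curr)
    u_next = [0.0] * m
    for i in range(1, m - 1):
        u_next[i] = (2 * u_curr[i] - u_prev[i]
                     + cfl2 * (u_curr[i + 1] - 2 * u_curr[i] + u_curr[i - 1]))
    return u_next
```

The inner loop touches only neighboring cells, which is why tiling for cache locality and vectorizing across `i` (the strategies evaluated in the paper) pay off on all five architectures.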
Citations: 9
Efficient In-Situ Quantum Computing Simulation of Shor's and Grover's Algorithms
A. Avila, R. Reiser, A. Yamin, M. Pilla
The exponential growth of, and global access to, read/write memory states in quantum computing simulation limit both the number of qubits and the quantum transformations that can currently be simulated. Although quantum computing simulation is parallel by nature, spatial and temporal complexity are major performance hazards, making this an important application for HPC. A new methodology employing reduction and decomposition optimizations has shown great results, but its GPU implementation could be further improved. In this work, we present a new implementation for in-situ GPU simulation that better explores its resources without requiring further HPC hardware. Shor's and Grover's algorithms are simulated and compared to the previous version and to the LIQUi|⟩ simulator, showing better results with relative speedups of up to 15.5x and 765.76x, respectively.
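To make the exponential memory pressure mentioned above concrete, here is a toy state-vector simulation of one Grover iteration on two qubits in plain Python (unrelated to the authors' GPU implementation; the names are our own). A state vector holds 2^n amplitudes, so each added qubit doubles the memory footprint, and every gate touches the whole vector.

```python
import math

def apply_gate(state, gate, target, n):
    """Apply a 2x2 single-qubit gate to qubit `target` of an n-qubit state
    vector (2**n amplitudes): pairs of amplitudes differing only in the
    target bit are mixed by the gate matrix."""
    new = state[:]
    step = 1 << target
    for i in range(1 << n):
        if (i & step) == 0:
            a, b = state[i], state[i | step]
            new[i] = gate[0][0] * a + gate[0][1] * b
            new[i | step] = gate[1][0] * a + gate[1][1] * b
    return new

H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

def grover_2qubit(marked):
    """One Grover iteration on 2 qubits finds `marked` with probability 1."""
    n = 2
    state = [0.0] * (1 << n)
    state[0] = 1.0
    for q in range(n):                 # uniform superposition
        state = apply_gate(state, H, q, n)
    state[marked] *= -1                # oracle: phase-flip the marked item
    mean = sum(state) / (1 << n)       # diffusion: inversion about the mean
    state = [2 * mean - a for a in state]
    return [round(abs(a) ** 2, 6) for a in state]
```

For 2 qubits the vector has 4 entries; at 30 qubits it already needs 2^30 amplitudes, which is the scaling limit the abstract refers to.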
Citations: 3
Energy Consumption Improvement of Shared-Cache Multicore Clusters Based on Explicit Simultaneous Multithreading
M. Souza, T. T. Cota, Matheus M. Queiroz, H. Freitas
The use of multicore clusters is one of the strategies used to achieve energy-efficient multicore architecture designs. Even though chips have multiple cores in these designs, cache constraints such as size, latency, concurrency, and scalability still apply. Multicore clusters must therefore implement alternative solutions to the shared cache access problem. Bigger or more frequently accessed caches consume more energy, which is a problem in explicit multithread concurrency. In this work, we simulate different multicore cluster architectures to identify the best configuration in terms of energy efficiency, varying the number of cores, the cache sizes and the sharing strategies. We also observe the simultaneous and individual multithreading concurrency of two application groups. The results showed that for applications with regular task loads, the simultaneous multithreading approach was 43.6% better than the individual one in terms of energy consumption. For irregular task loads, individual executions proved to be the best option, with an increase of up to 81.3% in energy efficiency. We also concluded that shared L2 caches were up to 13.4% more energy-efficient than private cache configurations.
Citations: 0
Parallel Algorithm for Dynamic Community Detection
Hugo Resende, Á. Fazenda, M. G. Quiles
Many real systems can be naturally modeled by complex networks. A complex network represents an abstraction of the system regarding its components and their respective interactions. Thus, by scrutinizing the network, interesting properties of the system can be revealed. Among them, the presence of communities, which consist of groups of densely connected nodes, is a significant one. For instance, a community might reveal patterns, such as the functional units of the system, or even groups of correlated people in social networks. Albeit important, the community detection process is not a simple computational task, especially when the network is dynamic. Thus, several researchers have addressed this problem with distinct methods, especially for static networks. Recently, a new algorithm was introduced to solve this problem. The approach consists of modeling the network as a set of particles inspired by an N-body problem. Besides delivering results similar to state-of-the-art community detection algorithms, the proposed model is dynamic in nature; thus, it can be straightforwardly applied to time-varying complex networks. However, the Particle Model still has a major drawback: its computational cost is quadratic per cycle, which restricts its application to mid-scale networks. To overcome this limitation, we present a novel parallel algorithm using many-core high-performance resources. The implementation of a new data structure, named distance matrix, allowed a massive parallelization of the particle interactions. Simulation results show that our parallel approach, running on both traditional CPUs and hardware accelerators based on multicore CPUs and GPUs, can speed up the method, permitting its application to large-scale networks.
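The authors' particle model is not specified in the abstract; purely to illustrate why each cycle costs O(n^2), here is a toy 1-D sketch in which connected particles attract and unconnected ones repel. The force rules and constants are invented for illustration and are not the paper's model.

```python
def particle_step(pos, adj, attract=0.1, repel=0.05):
    """One iteration of a toy 1-D particle dynamic: each particle feels an
    attraction toward its graph neighbors and a unit repulsion away from
    non-neighbors. The all-pairs inner loop is the O(n^2)-per-cycle cost;
    in a parallel version the pairwise terms (a 'distance matrix') can be
    computed independently by many threads."""
    n = len(pos)
    new = list(pos)
    for i in range(n):
        force = 0.0
        for j in range(n):
            if i == j:
                continue
            d = pos[j] - pos[i]
            if j in adj[i]:
                force += attract * d                      # pull toward neighbor
            else:
                force -= repel * d / (abs(d) + 1e-9)      # push away, unit magnitude
        new[i] = pos[i] + force
    return new
```

After a few iterations, particles in the same community drift together while distinct communities separate, so communities can be read off as spatial clusters.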
Citations: 0
Impact of Version Management for Transactional Memories on Phase-Change Memories
Felipe L. Teixeira, M. Pilla, A. R. D. Bois, D. Mossé
Two of the major issues in current computer systems are energy consumption and how to exploit concurrent systems in a correct and efficient way. Solutions for these hazards may be sought both in hardware and in software. Phase-Change Memory (PCM) is a memory technology intended to replace DRAMs (Dynamic Random Access Memories) as the main memory, providing reduced static power consumption. Its main problem is that write operations are slow and wear out the material. Transactional Memories are synchronization methods developed to reduce the limitations of lock-based synchronization. Their main advantages are being high-level and allowing composition and reuse of code, besides the absence of deadlocks. The objective of this study is to analyze the impact of different versioning managers (VMs) for transactional memories on PCMs. The lazy versioning/lazy acquisition scheme for version management presented the lowest wear on the PCM in 3 of the 7 benchmarks analyzed, and results similar to the alternative versioning for the other 4 benchmarks. These results are related to the number of aborts: this VM presents a much smaller number of aborts than the others, up to 39 times fewer in the experiment with the Kmeans benchmark with 64 threads.
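As a sketch of what lazy version management means, and why aborts are cheap on a write-sensitive memory like PCM (buffered writes never reach main memory until commit), here is a toy transaction object. The API is invented for illustration and is not the simulator or STM used in the paper.

```python
class LazyTransaction:
    """Toy lazy version management: writes go to a private buffer and are
    applied to main memory only at commit time, so an abort simply drops
    the buffer without any (PCM-expensive) rollback writes."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory, a dict here
        self.read_set = {}     # address -> value observed at first read
        self.write_buf = {}    # address -> pending value (lazy versioning)

    def read(self, addr):
        if addr in self.write_buf:          # read-your-own-writes
            return self.write_buf[addr]
        value = self.memory[addr]
        self.read_set.setdefault(addr, value)
        return value

    def write(self, addr, value):
        self.write_buf[addr] = value        # no write to main memory yet

    def commit(self):
        # validate: abort if another transaction changed anything we read
        for addr, seen in self.read_set.items():
            if self.memory[addr] != seen:
                return False                # abort: buffer is discarded
        self.memory.update(self.write_buf)  # lazy acquisition: publish now
        return True
```

Under eager versioning, by contrast, every transactional write would hit main memory immediately and an abort would write the old values back, doubling the wear on aborted paths.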
Citations: 0
Tuning Up TVD HOPMOC Method on Intel MIC Xeon Phi Architectures with Intel Parallel Studio Tools
F. L. Cabral, Carla Osthoff, Gabriel P. Costa, Diego N. Brandão, M. Kischinhevsky, S. L. G. D. Oliveira
This paper focuses on the parallelization of a TVD scheme for the Hopmoc method for numerical time integration of evolutionary differential equations. The Hopmoc method for the numerical integration of differential equations was developed to benefit both from the concept of integration along characteristic lines and from the spatially decomposed Hopscotch method. The set of grid points is initially decomposed into two subsets during the implementation of the integration step. Then, two updates are performed, one explicit and one implicit, on each variable in the course of the iterative process. Each update requires an integration semi-step, carried out along characteristic lines in a Semi-Lagrangian scheme based on the Modified Method of Characteristics. This work analyses two strategies to implement the parallel version of TVD Hopmoc, based on the analysis performed with Intel Parallel Studio tools such as the Parallel and Threading Advisor. A naive solution is substituted by a chunk loop strategy in order to avoid fine-grained tasks inside the main loops.
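The two-subset explicit/implicit idea described above can be sketched for the plain 1-D heat equation u_t = u_xx (the Hopscotch ancestor of Hopmoc, without the Semi-Lagrangian characteristics): one subset of points is updated explicitly, and the implicit update on the other subset becomes pointwise because its neighbors already hold new values. This is a minimal illustration under our own assumptions (parity splitting by i + n, Dirichlet boundaries), not the paper's TVD Hopmoc code.

```python
def hopscotch_step(u, r, n):
    """One Hopscotch time step for u_t = u_xx with fixed (Dirichlet) ends.
    r = dt / dx**2. Grid points are split into two parity subsets: one
    gets an explicit update, the other an implicit one that reduces to a
    pointwise formula once its neighbors hold new values, so no
    tridiagonal solve is needed."""
    new = u[:]
    m = len(u)
    # explicit half-sweep: subset where (i + n) is even
    for i in range(1, m - 1):
        if (i + n) % 2 == 0:
            new[i] = u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
    # implicit half-sweep: neighbors are already new, solve pointwise
    for i in range(1, m - 1):
        if (i + n) % 2 == 1:
            new[i] = (u[i] + r * (new[i - 1] + new[i + 1])) / (1 + 2 * r)
    return new
```

Because each half-sweep only touches points of one parity, the two inner loops are natural targets for the chunked, coarse-grained parallelization the paper adopts after profiling with the Intel tools.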
DOI: 10.1109/SBAC-PADW.2017.12
Citations: 9
Towards a Dataflow Runtime Environment for Edge, Fog and In-Situ Computing 面向边缘、雾和原位计算的数据流运行环境
Caio B. G. Carvalho, V. C. Ferreira, F. França, C. Bentes, Tiago A. O. Alves, A. Sena, L. A. J. Marzulo
In the dataflow computation model, instructions or tasks are fired according to their data dependencies, instead of following program order, thus allowing natural parallelism exploitation. Dataflow has been used, in different flavors and abstraction levels (from processors to runtime libraries), as an interesting alternative for harnessing the potential of modern computing systems. Sucuri is a dataflow library for Python that allows users to specify their application as a dependency graph and execute it transparently on clusters of multicores, while taking care of scheduling issues. Recent trends in Fog and In-situ computing assume that storage and network devices will be equipped with processing elements that usually have lower power consumption and performance. An important decision in such a system is whether to move data to traditional processors (paying the communication costs), or to perform the computation where the data sits, using a potentially slower processor. Hence, runtime environments that deal with that trade-off are extremely necessary. This work takes a first step towards a solution that considers Edge/Fog/In-situ computing in a dataflow runtime. We use Sucuri to manage the execution in a small system with a regular PC and a Parallella board. Experiments with text processing applications running with different input sizes, network latencies and packet loss rates allow a discussion of scenarios where this approach would be fruitful.
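The dependency-driven firing described above can be illustrated with a minimal executor. Sucuri's actual graph/node API is not shown in this abstract, so all names below are hypothetical; this is a generic sketch of the dataflow model (tasks fire when their inputs are ready), assuming an acyclic graph:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_dataflow(graph, funcs, workers=4):
    """Execute a dependency graph in dataflow style.

    graph : node -> list of nodes it depends on (must be acyclic)
    funcs : node -> callable taking the dependencies' results in order
    """
    results = {}            # node -> computed result
    pending = dict(graph)   # nodes not yet submitted
    futures = {}            # running future -> node
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending or futures:
            # Fire every node whose inputs are all available.
            ready = [n for n, deps in pending.items()
                     if all(d in results for d in deps)]
            for n in ready:
                args = [results[d] for d in graph[n]]
                futures[pool.submit(funcs[n], *args)] = n
                del pending[n]
            if futures:
                done, _ = wait(futures, return_when=FIRST_COMPLETED)
                for f in done:
                    results[futures.pop(f)] = f.result()
    return results
```

Independent nodes run concurrently, and a node is scheduled as soon as its last input arrives, mirroring how a dataflow runtime exposes parallelism without the programmer ordering the tasks.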
DOI: 10.1109/SBAC-PADW.2017.28
Citations: 3
Journal
2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)