
2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW): Latest Papers

An Efficient Channel Model for Evaluating Wireless NoC Architectures
Michael Opoku Agyeman, Quoc-Tuan Vien, G. Hill, S. J Turner, T. Mak
Wireless Networks-on-Chip (WiNoCs) have emerged to solve the scalability and performance bottleneck of conventional wired NoC architectures. However, unlike communication in the macro-world, on-chip communication poses several constraints; hence there is a need for simulation and design tools that consider the effect of the wireless channel at the nanotechnology level. In this paper, we present a parameterizable channel model for WiNoCs that takes into account practical issues and constraints of the propagation medium, such as transmission frequency, operating temperature, ambient pressure, and distance between the on-chip antennas. The proposed channel model demonstrates that the total path loss of the wireless channel in WiNoCs suffers not only from dielectric propagation loss (DPL) but also from molecular absorption attenuation (MAA), which reduces the reliability of the system.
DOI: https://doi.org/10.1109/SBAC-PADW.2016.23
Published: 2016-10-26
Citations: 3
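The abstract above describes total path loss as the combination of a propagation term and a molecular absorption term. A minimal sketch of that decomposition follows. It assumes a free-space-style spreading loss (with wave speed scaled by a relative permittivity) for the propagation term and Beer-Lambert attenuation for the absorption term; these formulas, the function names, and the `absorption_coeff_per_m` parameter are illustrative assumptions, not the paper's model. In particular, the dependence of the absorption coefficient on frequency, temperature, and pressure is not modelled here and must be supplied by the caller.

```python
import math

C = 299_792_458.0  # speed of light in vacuum, m/s


def spreading_loss_db(freq_hz, dist_m, rel_permittivity=1.0):
    """Free-space-style spreading loss in dB (assumed form of the propagation term)."""
    wave_speed = C / math.sqrt(rel_permittivity)
    return 20.0 * math.log10(4.0 * math.pi * freq_hz * dist_m / wave_speed)


def absorption_loss_db(absorption_coeff_per_m, dist_m):
    """Beer-Lambert attenuation exp(k * d), expressed in dB (assumed form of the MAA term)."""
    return 10.0 * math.log10(math.e) * absorption_coeff_per_m * dist_m


def total_path_loss_db(freq_hz, dist_m, absorption_coeff_per_m, rel_permittivity=1.0):
    """Total loss in dB: the two terms multiply in linear scale, so they add in dB."""
    return (spreading_loss_db(freq_hz, dist_m, rel_permittivity)
            + absorption_loss_db(absorption_coeff_per_m, dist_m))
```

Under these assumptions the absorption term grows linearly with antenna distance while the spreading term grows logarithmically, so absorption matters more as the link gets longer.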
PY-PITS: A Scalable Python Runtime System for the Computation of Partially Idempotent Tasks
E. Borin, C. Benedicto, I. Rodrigues, F. Pisani, M. Tygel, M. Breternitz
The popularization of multi-core architectures and cloud services has given users access to high performance computing infrastructures. However, programming for these systems can be cumbersome due to challenges involving system failures, load balancing, and task scheduling. To address these problems, we previously introduced SPITS, a programming model and reference architecture for executing bag-of-tasks applications. In this work, we discuss how this programming model allowed us to design and implement PY-PITS, a simple and effective open source runtime system that is scalable, tolerates faults, and allows dynamic provisioning of resources during the computation of tasks. We also discuss how PY-PITS can be used to improve the utilization of multi-user computational clusters equipped with job submission queues, and we propose a performance model to help users understand when the performance of PY-PITS scales with the number of Workers.
DOI: https://doi.org/10.1109/SBAC-PADW.2016.10
Published: 2016-10-01
Citations: 6
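The fault tolerance described above relies on tasks that can safely be executed more than once. The sketch below illustrates that idea with a simulated bag of tasks: a task whose worker "crashes" is simply put back in the queue, and only the first committed result per task is kept. This is not the PY-PITS or SPITS API; `run_bag_of_tasks`, the simulated `failure_rate`, and the single-process loop are all assumptions made for illustration.

```python
import random
from collections import deque


def run_bag_of_tasks(tasks, compute, failure_rate=0.3, seed=0):
    """Run every task to completion, re-queuing tasks whose (simulated) worker fails.

    tasks: dict mapping task id -> payload. compute must be safe to re-run.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    rng = random.Random(seed)          # seeded so the simulation is repeatable
    pending = deque(tasks.items())
    results = {}
    while pending:
        task_id, payload = pending.popleft()
        if rng.random() < failure_rate:
            # Simulated worker crash: the task is idempotent, so just retry later.
            pending.append((task_id, payload))
            continue
        # Commit only the first result seen for a task id.
        results.setdefault(task_id, compute(payload))
    return results
```

Because re-execution is harmless, the job manager in this sketch needs no checkpointing or rollback; it only has to remember which tasks have not yet been committed.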
A Comparative Study of SYCL, OpenCL, and OpenMP
H. C. D. Silva, F. Pisani, E. Borin
Recent trends indicate that future computing systems will be composed of a group of heterogeneous computing devices, including CPUs, GPUs, and other hardware accelerators. These devices provide increased processing performance; however, creating efficient code for them may require that programmers manage memory assignments and use specialized APIs, compilers, or runtime systems, thus making their programs dependent on specific tools. In this scenario, SYCL is an emerging C++ programming model for OpenCL that allows developers to write code for heterogeneous computing devices that is compatible with standard C++ compilation frameworks. In this paper, we analyze the performance and programming characteristics of SYCL, OpenMP, and OpenCL using both a benchmark and a real-world application. Our performance results indicate that programs that rely on available SYCL runtimes are not yet on par with those based on OpenMP and OpenCL. Nonetheless, the gap is narrowing when compared with the results reported by previous studies. In terms of programmability, SYCL presents itself as a competitive alternative to OpenCL, requiring fewer lines of code to implement kernels and fewer calls to essential API functions and methods.
DOI: https://doi.org/10.1109/SBAC-PADW.2016.19
Published: 2016-10-01
Citations: 28
Parallelism and Scalability: A Solution Focused on the Cloud Computing Processing Service Billing
Emmanoel M. De Sousa Junior, I. Sardiña, Frederico Lopes
Application scheduling is an important requirement in the cloud computing context. It defines the resources required to execute application tasks according to predefined criteria, for instance, maximum execution time, number of virtual machines, and volume of data. The process of selecting the most appropriate execution structure is driven by scheduling algorithms. This paper proposes a scheduling mechanism for data processing in cloud computing environments. The mechanism analyzes specific variables in the business context of a software house specialized in software for lawyers and law offices. Its main goal is to meet the company's seasonal demand using IaaS services under two policies: (i) the maximum execution time allowed by the application may not be exceeded and (ii) the data must be processed at the lowest possible monetary cost. The proposed solution generates strategies to select the best set of virtual machines to process the current batch of data, considering the amount of data, the estimated execution time for each strategy, and the monetary cost of each set of virtual machines. In the context of this work, a strategy comprises the schedule of a set of virtual machines to process a specific amount of data, load balancing decisions, and the parallelism of the application's execution flow. The proposed solution has had a great impact on the company, since it allowed a sharp increase in the number of clients served.
DOI: https://doi.org/10.1109/SBAC-PADW.2016.14
Published: 2016-10-01
Citations: 0
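The two policies above (a hard execution-time limit and minimum monetary cost) can be sketched as a small exhaustive search over candidate sets of virtual machines. The sketch assumes perfectly balanced load, linear scaling of throughput, and billing proportional to running time; the `VMType` fields, the `cheapest_strategy` function, and the `max_vms` limit are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import combinations_with_replacement


@dataclass(frozen=True)
class VMType:
    name: str
    records_per_hour: float  # assumed processing rate of one VM
    price_per_hour: float    # assumed hourly price of one VM


def cheapest_strategy(vm_types, n_records, max_hours, max_vms=4):
    """Return (cost, hours, vm_set) for the cheapest VM set meeting the deadline, or None."""
    best = None
    for k in range(1, max_vms + 1):
        for vm_set in combinations_with_replacement(vm_types, k):
            throughput = sum(vm.records_per_hour for vm in vm_set)
            hours = n_records / throughput           # policy (i): estimated execution time
            if hours > max_hours:
                continue                              # deadline would be exceeded
            cost = hours * sum(vm.price_per_hour for vm in vm_set)  # policy (ii)
            if best is None or cost < best[0]:
                best = (cost, hours, vm_set)
    return best
```

An exhaustive search is only practical for a handful of VM types and a small `max_vms`; a real scheduler would likely need a heuristic or an integer programming formulation.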
A Benchmark on Multi Improvement Neighborhood Search Strategies in CPU/GPU Systems
E. Rios, I. M. Coelho, L. Ochi, Cristina Boeres, R. Farias
In combinatorial optimization problems, the neighborhood search (NS) is a fundamental component of local-search-based heuristics. It consists of selecting a solution from a high-cardinality set of neighbor solutions by means of operations called moves. To perform this search, NS algorithms usually adopt one of two main approaches: selecting the first improving move or the best improving move. The Multi Improvement (MI) strategy is a recently proposed method that consists of simultaneously exploring multiple move operations during the NS phase, aiming to reach good-quality solutions in fewer computational steps. This paper presents a benchmark for MI strategies in hybrid CPU/GPU systems. This technique efficiently exploits the CPU processing power together with the massive parallelism achieved by modern GPUs, emerging as an efficient alternative to classic CPU neighborhood search strategies. The advantage of this approach depends heavily on finding the best tradeoff between CPU and GPU processing, as well as on minimizing the memory transfers involved in the process. In the experiments, several MI configurations were tested in a hybrid CPU/GPU environment, presenting better results than classical neighborhood search strategies for the Minimum Latency Problem, a hard combinatorial optimization problem.
DOI: https://doi.org/10.1109/SBAC-PADW.2016.17
Published: 2016-10-01
Citations: 7
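The three move-selection strategies named above can be contrasted on a toy Minimum Latency Problem with a swap neighborhood. This is a sequential CPU sketch only: the paper evaluates moves on a GPU, whereas here the multi-improvement step is approximated by applying, in one pass, every candidate swap that still lowers the cost when re-evaluated. All function names and the choice of swap moves are assumptions for illustration.

```python
from itertools import combinations


def total_latency(route, dist):
    """Sum of arrival times at every stop after the depot (route[0])."""
    elapsed = total = 0
    for a, b in zip(route, route[1:]):
        elapsed += dist[a][b]
        total += elapsed
    return total


def swap(route, i, j):
    new = list(route)
    new[i], new[j] = new[j], new[i]
    return new


def improving_moves(route, dist):
    """All swaps (gain, i, j) that lower the cost; position 0 (depot) stays fixed."""
    base = total_latency(route, dist)
    moves = []
    for i, j in combinations(range(1, len(route)), 2):
        gain = base - total_latency(swap(route, i, j), dist)
        if gain > 0:
            moves.append((gain, i, j))
    return moves


def step_first(route, dist):
    """First improvement: take the first swap that lowers the cost."""
    base = total_latency(route, dist)
    for i, j in combinations(range(1, len(route)), 2):
        cand = swap(route, i, j)
        if total_latency(cand, dist) < base:
            return cand
    return route


def step_best(route, dist):
    """Best improvement: take the single swap with the largest gain."""
    moves = improving_moves(route, dist)
    if not moves:
        return route
    _, i, j = max(moves)
    return swap(route, i, j)


def step_multi(route, dist):
    """Multi-improvement style: apply several swaps per step, re-checking each one."""
    current = list(route)
    for _, i, j in sorted(improving_moves(route, dist), reverse=True):
        cand = swap(current, i, j)
        if total_latency(cand, dist) < total_latency(current, dist):
            current = cand
    return current


def local_search(route, dist, step):
    """Repeat the chosen step until no swap improves the route."""
    route = list(route)
    while True:
        nxt = step(route, dist)
        if nxt == route:
            return route
        route = nxt
```

Re-checking each swap keeps the sketch correct even though gains of different swaps are not independent in this problem, at the price of losing the parallel evaluation that motivates the strategy on GPUs.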
A Hybrid Parallel Algorithm for the Auction Algorithm in Multicore Systems
A. P. Nascimento, C. Vasconcelos, F. S. Jamel, A. Sena
The bipartite graph matching problem is based on finding a point that maximizes the chances of similarity with another one, and it is explored in several areas such as Bioinformatics and Computer Vision. To solve this matching problem, the auction algorithm has been widely used, and its parallel implementation is employed to find matching solutions in a reasonable computational time. For example, image analysis may require a large amount of processing, as dense images can have thousands of points to be considered. Furthermore, to exploit the benefits of multicore architectures, a hybrid implementation can be used to deal with communication in both distributed and shared memory. The main goal of this paper is to implement and evaluate the performance of a hybrid parallel auction algorithm for multicore clusters. The experiments analyze the problem size, the number of iterations needed to solve the matching, and the impact of these parameters on communication costs and execution times.
DOI: https://doi.org/10.1109/SBAC-PADW.2016.21
Published: 2016-10-01
Citations: 5
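For readers unfamiliar with the auction algorithm mentioned above, here is a minimal sequential sketch of the classic forward auction for a square assignment problem. It is not the hybrid parallel implementation evaluated in the paper: there is no distributed- or shared-memory parallelism, and the function name, the `benefit` matrix layout, and the default `eps` are assumptions. With integer benefits and `eps` below `1/n`, the standard analysis of this algorithm guarantees an optimal assignment.

```python
def auction_assignment(benefit, eps=None):
    """Assign n persons to n objects, maximizing total benefit[i][j].

    Returns (assigned, prices), where assigned[i] is the object held by person i.
    """
    n = len(benefit)
    if eps is None:
        eps = 1.0 / (n + 1)          # below 1/n, enough for integer benefits
    prices = [0.0] * n
    owner = [None] * n               # owner[j]: person currently holding object j
    assigned = [None] * n
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        # Net value of each object for bidder i at current prices.
        values = [benefit[i][j] - prices[j] for j in range(n)]
        best_j = max(range(n), key=values.__getitem__)
        best_v = values[best_j]
        second_v = max((v for j, v in enumerate(values) if j != best_j), default=best_v)
        # Raise the price by the bidding increment.
        prices[best_j] += best_v - second_v + eps
        previous = owner[best_j]
        if previous is not None:     # outbid the previous holder
            assigned[previous] = None
            unassigned.append(previous)
        owner[best_j] = i
        assigned[i] = best_j
    return assigned, prices
```

Each bid only reads the price vector and writes one price, which is why the bidding phase is a natural candidate for parallel execution, with price updates being the data that must be communicated.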
A Processor Workload Distribution Algorithm for Massively Parallel Applications
Serge Midonnet, Achille Wattelar
The Directed Acyclic Graph (DAG) is a standard model used to describe tasks that execute according to precedence constraints and that allows intra-task parallelism. This model is well suited to camera-based applications where multiple treatments must be executed in parallel according to the camera input, such as those found in self-driving cars or in image recognition via convolutional neural networks (CNNs). Such applications are used on embedded systems and therefore require low energy cost and limited hardware space. The main contribution of this paper is a new partitioning algorithm based on a DAG stretching technique. This stretching algorithm frees processor cores, which implies energy savings and leads to new hardware designs using a reduced number of processors. We present an experimental evaluation of this algorithm to show its efficiency.
DOI: https://doi.org/10.1109/SBAC-PADW.2016.13
Published: 2016-10-01
Citations: 0
Outline of a Thick Control Flow Architecture
M. Forsell, J. Roivainen, V. Leppänen
The recently invented thick control flow (TCF) model packs together an unbounded number of fibers, thread-like computational entities, flowing through the same control path. This promises to simplify parallel programming by partially eliminating looping and artificial thread arithmetic. In this paper we outline an architecture for efficiently executing programs written for the TCF model. It features scalable latency hiding via replication of instructions, radical synchronization cost reduction via a wave-based synchronization mechanism, and improved low-level parallelism exploitation via chaining of functional units. Replication of instructions is supported by a dynamic multithreading-like mechanism, which saves the fiber-wise data into special replicated register blocks. The architecture provides programmers with a compact, unbounded notation for fibers and groups of fibers, together with strong synchronous shared memory algorithmics. According to evaluations, the architecture is able to efficiently handle workloads featuring computational elements with the same control flow, independently of the number of elements. In turn, this pays off as improved performance and lower power consumption due to the elimination of redundant parts of computation and machinery.
DOI: https://doi.org/10.1109/SBAC-PADW.2016.9
Published: 2016-10-01
Citations: 10
A Dynamic Load Balance Algorithm for the S4 Parallel Stream Processing Engine
V. Gil-Costa, Nicolás Hidalgo, Erika Rosas, Mauricio Marín
Large streams of data can be analyzed in real time by Parallel Stream Processing Engines (PSPEs), which are based on a graph paradigm where vertices represent processing elements (PEs) and edges represent flows of data among PEs. In this work, we propose a new elastic strategy for the S4 PSPE to adjust the overall load of PEs in accordance with the utilization levels and data traffic at each PE. Our approach exploits a producer/consumer model to achieve load balance, where new workers pull events from a buffer queue in order to relieve the traffic at an overloaded PE. Results show that the proposed strategy prevents saturation of PEs and improves the overall throughput of the system by up to 470%.
并行流处理引擎(pspe)可以实时分析大数据流,pspe基于图形范式,其中顶点表示处理元素(pe),边表示pe之间的数据流。在这项工作中,我们为S4 PSPE提出了一种新的弹性策略,以根据每个PE的利用率和数据流量调整PE的总体负载。我们的方法利用生产者/消费者模型来实现负载平衡,其中newworker从缓冲队列中提取事件,以释放过载PE中的流量。结果表明,该策略防止了pe的饱和,并将系统的总吞吐量提高了470%。
{"title":"A Dynamic Load Balance Algorithm for the S4 Parallel Stream Processing Engine","authors":"V. Gil-Costa, Nicolás Hidalgo, Erika Rosas, Mauricio Marín","doi":"10.1109/SBAC-PADW.2016.12","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.12","url":null,"abstract":"Large streams of data can be analyzed in realtimeby Parallel Stream Processing Engines (PSPEs) which arebased on a graph paradigm where vertices represent processingelements (PEs) and edges represent flows of data among PEs. Inthis work, we propose a new elastic strategy for the S4 PSPE toadjust the overall load of PEs in accordance with the utilizationlevels and data traffic at each PE. Our approach exploits aproducer/consumer model to achieve load balance where newworkers pull events from a buffer queue in order to release theamount of traffic in an overloaded PE. Results show that theproposed strategy prevents saturation of PEs and improves theoverall throughput of the system by up to 470%.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133977625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
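The producer/consumer idea in the S4 abstract above (extra workers pull events from a buffer queue when a PE is overloaded) can be illustrated with a minimal sketch. This is not the authors' S4 implementation: the function name `run_elastic_pe`, the queue-depth threshold `high_watermark`, and the worker limits are all hypothetical choices made for illustration, and queue depth is only an assumed stand-in for the utilization and traffic measurements the abstract mentions.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker; real events must not be None


def run_elastic_pe(events, process, base_workers=1, max_workers=4, high_watermark=8):
    """Feed events through a shared buffer and add workers when the buffer backs up."""
    buffer = queue.Queue()
    results = []
    results_lock = threading.Lock()
    workers = []

    def worker():
        # Consumer: pull events from the buffer until the sentinel arrives.
        while True:
            event = buffer.get()
            if event is SENTINEL:
                break
            out = process(event)
            with results_lock:
                results.append(out)

    def spawn():
        t = threading.Thread(target=worker)
        t.start()
        workers.append(t)

    for _ in range(base_workers):
        spawn()

    # Producer: enqueue events and watch the backlog.
    for event in events:
        buffer.put(event)
        # Assumed overload signal: queue depth above a fixed threshold.
        if buffer.qsize() > high_watermark and len(workers) < max_workers:
            spawn()

    # One sentinel per worker so every consumer terminates.
    for _ in workers:
        buffer.put(SENTINEL)
    for t in workers:
        t.join()
    return results, len(workers)


if __name__ == "__main__":
    out, n_workers = run_elastic_pe(range(50), lambda x: x * 2)
    print(len(out), "events processed by", n_workers, "worker(s)")
```

The number of workers actually started depends on thread timing, so it varies between runs. Results arrive in completion order, not input order.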
Thread Footprint Analysis for the Design of Multithreaded Applications and Multicore Systems
R. Santos, Ricardo Aguiar, Paulo Soken, Samuel Ferraz, Liana Duenha
This work presents Coretool, a Pin tool for thread analysis (identification, scheduling, and instruction workload) of multithreaded applications in multicore systems. The main goal of Coretool is to provide enough information to improve performance in multithreaded applications and multicore systems. Coretool can help multithreaded software developers take application performance overheads into account when redesigning an application. A multicore system designer or administrator can use the thread scheduling, thread usage, and instruction workload data to tune a system for better performance or maximum throughput. We performed a set of experiments to characterize multithreaded applications according to their thread footprint on the available multicore resources. The results show that some applications have an unbalanced thread workload, which suggests that those applications need to be redesigned.
DOI: 10.1109/SBAC-PADW.2016.18 (https://doi.org/10.1109/SBAC-PADW.2016.18). Published 2016-10-01 in 2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW).
Citations: 0
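The thread workload imbalance mentioned in the Coretool abstract above can be made concrete with a small sketch. The metric used here (maximum per-thread instruction count divided by the mean) is an assumption chosen for illustration, not a measure taken from the paper. The function names, the threshold, and the sample counts are likewise hypothetical and do not reflect Coretool's actual output.

```python
from statistics import mean


def workload_imbalance(instr_per_thread):
    """Return max/mean of per-thread instruction counts (1.0 means perfectly balanced)."""
    counts = list(instr_per_thread.values())
    if not counts or max(counts) == 0:
        raise ValueError("need at least one thread with a nonzero instruction count")
    return max(counts) / mean(counts)


def flag_unbalanced(apps, threshold=1.25):
    """Return the names of applications whose imbalance exceeds the (assumed) threshold."""
    return sorted(name for name, counts in apps.items()
                  if workload_imbalance(counts) > threshold)


if __name__ == "__main__":
    # Made-up instruction counts keyed by thread id, for illustration only.
    apps = {
        "balanced_app": {0: 100, 1: 100},
        "skewed_app": {0: 300, 1: 100},
    }
    print(flag_unbalanced(apps))
```

A ratio well above 1.0 means one thread executes far more instructions than the average, which is the kind of signal that might prompt a closer look at how the application divides its work.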