Data-layout optimization based on memory-access-pattern analysis for source-code performance improvement
Riyane Sid Lakhdar, H. Charles, Maha Kooli
DOI: 10.1145/3378678.3391874

With the rising impact of the memory wall, selecting an adequate data-structure implementation for a given kernel has become a performance-critical issue. This paper presents a new methodology to solve the data-layout decision problem by adapting an input implementation to the host hardware's memory hierarchy. For a given input program, the proposed method automatically identifies the best-performing data-layout implementation for each selected variable by analyzing its memory-access pattern. The method is designed to be embedded within a general-purpose compiler. Experiments on the PolyBench/C benchmarks, a recursive bilateral filter, and JPEG-compression kernels show that our method accurately determines the optimized data-structure implementation. The optimized implementations reach an execution-time speed-up of up to 48.9X and an L3-miss reduction of up to 98.1X on an x86 Intel Xeon processor with three levels of data caches and a least-recently-used cache-replacement policy.
Analog implementation of arithmetic operations on real memristors
Thore Kolms, Andreas Waldner, Christine Lang, Philipp Grothe, Jan Haase
DOI: 10.1145/3378678.3391883

The emerging field of in-memory computing aims to relieve CPUs by taking over simple calculations that can be performed directly in memory. This reduces both the performance drain caused by those calculations and the energy consumption of the whole system, which is particularly important for embedded systems. Memristors are variable, non-volatile resistors that can store analog values, which makes them suitable for in-memory computing. This paper describes a prototypical implementation of analog calculations (addition, subtraction, multiplication) on real memristors. The prototype is based on an ESP32 microcontroller. Typical calculations currently take around 1 μs.
Design space exploration for layer-parallel execution of convolutional neural networks on CGRAs
C. Heidorn, Frank Hannig, J. Teich
DOI: 10.1145/3378678.3391878

In this work, we systematically explore the design space of throughput, energy, and hardware cost for layer-parallel mappings of convolutional neural networks (CNNs) onto coarse-grained reconfigurable arrays (CGRAs). We derive an analytical model that computes the resources (processing elements) and buffer memory, and thus the hardware cost C, required to sustain a given throughput T, as well as the resulting overall energy consumption E for inference. Further, we propose an efficient design space exploration (DSE) to determine the fronts of Pareto-optimal (T, E, C) solutions. This exploration helps to determine the limits of scalability of the presented tiled CGRA accelerator architectures in terms of throughput, the number of layers that can be processed in parallel, and memory requirements. Finally, we evaluate the energy savings achievable on our architecture in comparison to implementations that execute a CNN sequentially, layer by layer. Experiments show that layer-parallel processing reduces energy consumption E by 3.6X and hardware cost C by 1.2X, and increases the achievable throughput T by 6.2X for MobileNet.
Portable exploitation of parallel and heterogeneous HPC architectures in neural simulation using SkePU
Sotirios Panagiotou, August Ernstsson, Johan Ahlqvist, Lazaros Papadopoulos, C. Kessler, D. Soudris
DOI: 10.1145/3378678.3391889

The complexity of modern HPC systems requires new tools that support advanced programming models and offer portability and programmability across parallel and heterogeneous architectures. In this work, we evaluate the use of the SkePU framework in an HPC application from the neural-computing domain. We demonstrate the successful deployment of the application on top of SkePU using multiple back-ends (OpenMP, OpenCL, and MPI) and present lessons learned towards future extensions of the SkePU framework.
Compiler-based WCET prediction performing function specialization
Kateryna Muts, H. Falk
DOI: 10.1145/3378678.3391879

The worst-case execution time (WCET) is one of the most important criteria for hard real-time systems. Many optimizations have been proposed to improve the WCET of an embedded application at compile time. Since modern embedded systems must also satisfy additional design criteria such as code size or energy consumption, compiler optimizations increasingly turn into multi-objective optimization problems. Evolutionary algorithms are the most widely used method to solve such problems, but finding the set of best trade-offs between the objectives requires extensive evaluations of the objective functions. Treating the WCET as one objective is therefore infeasible in many cases, because WCET analysis at compile time can be very time-consuming. For this reason, we propose a machine-learning-based method to predict WCET values at compile time. A well-known compiler optimization, function specialization, serves as the basis for the proposed prediction model. We analyze a regression method with respect to making WCET predictions as precise as possible when performing function specialization.
Efficient parallel reduction on GPUs with Hipacc
Bo Qiao, Oliver Reiche, M. A. Özkan, J. Teich, Frank Hannig
DOI: 10.1145/3378678.3391885

Hipacc is a domain-specific language that eases the programming of image-processing applications on hardware accelerators such as GPUs. Drawing on domain- and architecture-specific knowledge, it relieves developers of the burden of manually porting algorithms to hardware. One fundamental operation in image processing is reduction: global reduction operators are the building blocks of many widely used algorithms, including image normalization and similarity estimation. This paper presents an efficient approach to performing parallel reductions on GPUs with Hipacc. Our approach benefits from hardware vendors' continuous efforts to improve performance and programmability, for example by utilizing the latest low-level primitives from Nvidia. Results show that our approach achieves a speedup of up to 3.43 over an existing Hipacc implementation with traditional optimization methods, and a speedup of up to 9.02 over an implementation using the Thrust library from Nvidia.
Compiling synchronous languages to optimal move code for exposed datapath architectures
Marc Dahlem, K. Schneider
DOI: 10.1145/3378678.3391877

Conventional processor architectures are limited in exploiting instruction-level parallelism (ILP), one reason being their relatively low number of registers. Recent processor architectures therefore expose their datapaths so that the compiler can directly transport results from one processing unit to another. Among these, the Synchronous Control Asynchronous Dataflow (SCAD) architecture is a recently proposed exposed datapath architecture whose goal is to completely bypass the use of registers. Processor architectures with a high degree of ILP like SCAD are particularly useful for executing synchronous programs: the execution of a synchronous program is a sequence of reaction steps consisting of atomic actions that have to be executed in dataflow order. Synchronous programs typically provide a lot of ILP, so exposed datapath architectures may execute them efficiently. However, optimal code generation for SCAD is a major challenge. Previous work showed how basic blocks can be compiled to optimal move code for SCAD by means of answer set programming (ASP). This paper extends that approach to compile complete synchronous programs, instead of only basic blocks, to optimal move code. As a result, an ASP-based compiler was developed that translates Quartz programs to move code for the SCAD architecture, maximizing the use of the available ILP in the program while respecting the resource limitations of the processor.