
Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems: Latest Publications

Plasmon-based Virus Detection on Heterogeneous Embedded Systems
Olaf Neugebauer, Pascal Libuschewski, M. Engel, H. Müller, P. Marwedel
Embedded systems, e.g. in computer vision applications, are expected to provide significant amounts of computing power to process large data volumes. Many of these systems, such as those used in medical diagnosis, are mobile devices and face significant challenges in providing sufficient performance while operating on a constrained energy budget. Modern embedded MPSoC platforms use heterogeneous CPU and GPU cores, providing a large number of optimization parameters. This makes it possible to find useful trade-offs between energy consumption and performance for a given application. In this paper, we describe how the complex data processing required for PAMONO, a novel type of biosensor for the detection of biological viruses, can be implemented efficiently on a state-of-the-art heterogeneous MPSoC platform. An additional optimization dimension explored is the achieved quality of service: reducing the virus detection accuracy enables additional optimizations not achievable by modifying hardware or software parameters alone. Instead of relying on often inaccurate simulation models, our design space exploration employs a hardware-in-the-loop approach to evaluate performance and energy consumption on the embedded target platform. Trade-offs between performance, energy, and accuracy are controlled by a genetic algorithm running on a PC control system, which deploys the evaluation tasks to a number of connected embedded boards. Using our optimization approach, we are able to achieve frame rates that meet the requirements without losing accuracy. Furthermore, our approach can reduce the energy consumption by 93% while still maintaining reasonable detection quality.
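As a rough illustration of the hardware-in-the-loop exploration loop described above (a genetic algorithm on the host proposes configurations that are then measured on connected boards), the Python sketch below shows the control flow only; the configuration parameters, the fitness weighting, and the evaluate_on_board stand-in are invented for illustration and are not the authors' actual tooling.

```python
# Hypothetical sketch of a hardware-in-the-loop design space exploration loop.
# A genetic algorithm on the host proposes configurations (CPU/GPU frequencies,
# detection quality level); each candidate is evaluated on a real board in the
# authors' flow -- here replaced by a stand-in function for illustration only.
import random

def evaluate_on_board(cfg):
    """Stand-in for deploying cfg to an embedded board and measuring it.
    In the real flow this would run the detection pipeline and read back
    frames per second, energy, and detection accuracy."""
    fps = 10 + 0.02 * cfg["gpu_mhz"] + 0.01 * cfg["cpu_mhz"] - 5 * cfg["quality"]
    energy = 0.5 * cfg["gpu_mhz"] / 1000 + 0.3 * cfg["cpu_mhz"] / 1000 + cfg["quality"]
    accuracy = 0.7 + 0.3 * cfg["quality"]
    return fps, energy, accuracy

def fitness(cfg, min_fps=25.0, min_accuracy=0.9):
    fps, energy, accuracy = evaluate_on_board(cfg)
    if fps < min_fps or accuracy < min_accuracy:   # reject infeasible points
        return -1e9
    return -energy                                  # minimize energy among feasible points

def random_cfg():
    return {"cpu_mhz": random.choice([600, 800, 1000, 1200]),
            "gpu_mhz": random.choice([200, 400, 600]),
            "quality": random.choice([0.5, 0.75, 1.0])}

def mutate(cfg):
    child = dict(cfg)
    key = random.choice(list(child))
    child[key] = random_cfg()[key]
    return child

population = [random_cfg() for _ in range(8)]
for generation in range(20):
    population.sort(key=fitness, reverse=True)
    parents = population[:4]                        # keep the fittest configurations
    population = parents + [mutate(random.choice(parents)) for _ in range(4)]

best = max(population, key=fitness)
print("best configuration:", best, "->", evaluate_on_board(best))
```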
{"title":"Plasmon-based Virus Detection on Heterogeneous Embedded Systems","authors":"Olaf Neugebauer, Pascal Libuschewski, M. Engel, H. Müller, P. Marwedel","doi":"10.1145/2764967.2764976","DOIUrl":"https://doi.org/10.1145/2764967.2764976","url":null,"abstract":"Embedded systems, e.g. in computer vision applications, are expected to provide significant amounts of computing power to process large data volumes. Many of these systems, such as used in medical diagnosis, are mobile devices and face significant challenges to provide sufficient performance while operating on a constrained energy budget. Modern embedded MPSoC platforms use heterogeneous CPU and GPU cores providing a large number of optimization parameters. This allows to find useful trade-offs between energy consumption and performance for a given application. In this paper, we describe how the complex data processing required for PAMONO, a novel type of biosensor for the detection of biological viruses, can efficiently be implemented on a state-of-the-art heterogeneous MPSoC platform. An additional optimization dimension explored is the achieved quality of service. Reducing the virus detection accuracy enables additional optimizations not achievable by modifying hardware or software parameters alone. Instead of relying on often inaccurate simulation models, our design space exploration employs a hardware-in-the-loop approach to evaluate the performance and energy consumption on the embedded target platform. Trade-offs between performance, energy and accuracy are controlled by a genetic algorithm running on a PC control system which deploys the evaluation tasks to a number of connected embedded boards. Using our optimization approach, we are able to achieve frame rates meeting the requirements without losing accuracy. Further, our approach is able to reduce the energy consumption by 93% with a still reasonable detection quality.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115272435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Adaptive Isolation for Predictable MPSoC Stream Processing
J. Teich
Resource sharing and interference between multiple threads of one application, and even more so between multiple application programs running concurrently on a Multi-Processor System-on-a-Chip (MPSoC), make it very hard today to provide timing- or throughput-critical applications with time bounds. Additional interference results from the interaction of OS functions such as thread multiplexing and scheduling, as well as the complex resource (e.g., cache) reservation protocols in heavy use today. Finally, dynamic power and temperature management on a chip might also throttle down processor speed at arbitrary times, leading to additional variations and jitter in execution time. This may be intolerable for many safety-critical applications such as medical imaging or automotive driver assistance systems. Static solutions that provide the required isolation by allocating distinct resources to safety-critical applications may not be feasible for reasons of cost and because of their inefficiency and inflexibility. Shutting off or restricting temperature and power management might not be tolerable either. In this keynote, we propose new techniques for adaptive, on-demand isolation of processor, I/O, memory, and communication resources on an MPSoC, based on the paradigm of Invasive Computing. In Invasive Computing, a programmer may specify bounds on the execution quality of a program, or even of single program segments, followed by an invade command. The system returns a constellation of exclusive resources called a claim, which is subsequently used in a non-shared way by default until it is released again by the invader. Through this principle, it becomes possible to isolate applications automatically and on demand. In Invasive Computing, isolation is supported at all levels of hardware and software, including an invasive OS. Given the abundant number of cores available on an MPSoC today, the problem becomes how to find, in a negligible amount of time, suitable claims that will guarantee a performance bound. For a broad class of streaming applications, we propose a combined static/dynamic approach based on a static design space exploration phase to extract a set of satisfying claim characteristics for which program execution is guaranteed to stay within the desired performance bounds. For a class of compositional and heterogeneous MPSoC systems, only very little information must then be passed to the OS for run-time claim search, in the form of so-called claim constraint graphs (CCGs). A special role is played here by a compositional Network-on-Chip (NoC) architecture that allows guaranteed bandwidth between processor, memory, and I/O tiles to be invaded independently of other applications. We demonstrate the above concepts for a complex object detection algorithm chain taken from robot vision to show that jitter-minimized implementations become possible, even for statically unknown arrivals of other concurrent applications.
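The invade/claim/retreat protocol sketched in the abstract can be pictured with a minimal example. The ResourceManager class, the Claim object, and the method names below are hypothetical illustrations of the paradigm; the real invasive-computing runtime provides hardware support and a much richer API.

```python
# Minimal sketch of the invade / retreat idea from invasive computing.
# ResourceManager, Claim and the method names are hypothetical; the real
# invasive runtime offers a far richer API and hardware support.
class Claim:
    def __init__(self, cores):
        self.cores = cores          # exclusive resources, non-shared by default

class ResourceManager:
    def __init__(self, total_cores):
        self.free = set(range(total_cores))

    def invade(self, min_cores, max_cores):
        """Reserve between min_cores and max_cores exclusively, or fail."""
        if len(self.free) < min_cores:
            raise RuntimeError("cannot satisfy performance bound")
        granted = {self.free.pop() for _ in range(min(max_cores, len(self.free)))}
        return Claim(sorted(granted))

    def retreat(self, claim):
        self.free.update(claim.cores)   # resources become available again

rm = ResourceManager(total_cores=8)
claim = rm.invade(min_cores=2, max_cores=4)   # isolate resources on demand
print("running stream application on cores", claim.cores)
rm.retreat(claim)                              # release the isolation when done
```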
{"title":"Adaptive Isolation for Predictable MPSoC Stream Processing","authors":"J. Teich","doi":"10.1145/2764967.2771821","DOIUrl":"https://doi.org/10.1145/2764967.2771821","url":null,"abstract":"Resource sharing and interferences of multiple threads of one, but even worse between multiple application programs running concurrently on a Multi-Processor System-on-a-Chip (MPSoC) today make it very hard to provide any timing or throughput-critical applications with time bounds. Additional interferences result from the interaction of OS functions such as thread multiplexing and scheduling as well as complex resource (e.g., cache) reservation protocols used heavily today. Finally, dynamic power and temperature management on a chip might also throttle down processor speed at arbitrary times leading to additional variations and jitter in execution time. This may be intolerable for many safety-critical applications such as medical imaging or automotive driver assistance systems. Static solutions to provide the required isolation by allocating distinct resources to safety-critical applications may not be feasible for reasons of cost and due to the lack of efficiency and inflexibility. Also, shutting off or restricting temperature and power management might not be tolerable. In this keynote, we propose new techniques for adaptive isolation of resources including processor, I/O, memory as well as communication resources on demand on an MPSoC based on the paradigm of Invasive Computing. In Invasive Computing, a programmer may specify bounds on the execution quality of a program or even single segments of a program followed by an invade command. This system returns a constellation of exclusive resources called a claim that is subsequently used in a by-default non-shared way until being released again by the invader. Through this principle, it becomes possible to isolate applications automatically and in an on-demand manner. In invasive computing, isolation is supported on all levels of hardware and software including an invasive OS. In case of an abundant number of cores available on an MPSoC today, the problem still becomes how to find suitable claims that will guarantee a performance bound in a negligible amount of time? For a broad class of streaming applications, we propose a combined static/dynamic approach based on a static design space exploration phase to extract a set of satisfying claim characteristics for which program execution is guaranteed to stay within the desired performance bounds. For a class of compositional and heterogeneous MPSoC systems, only very little information must then be passed to the OS for run-time claim search in the form of so-called CCGs (claim constraint graphs). A special role here plays a compositional Network-on-a-Chip (NoC) architecture that allows to invade guaranteed bandwith between processor, memory and I/O tiles independently from other applications. 
We demonstrate the above concepts for a complex object detection application algorithm chain taken from robot vision to show jitter-minimized implementations become possible, even for statically unknown arrivals of other concurrent applications.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"253 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116391555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Efficient Compilation of Stream Programs for Heterogeneous Architectures: A Model-Checking based approach
R. K. Thakur, Y. Srikant
Stream programming based on the synchronous data flow (SDF) model naturally exposes data, task, and pipeline parallelism. Statically scheduling stream programs for homogeneous architectures has been an area of extensive research. With graphics processing units (GPUs) now emerging as general-purpose co-processors, scheduling and distributing these stream programs onto heterogeneous architectures (having both GPUs and CPUs) poses challenging research problems. Exploiting this abundant parallelism in hardware and providing a scalable solution is a hard problem. In this paper we describe a coarse-grained software-pipelined scheduling algorithm for stream programs which statically schedules a stream graph onto heterogeneous architectures. We formulate the problem of partitioning the work between the CPU cores and the GPU as a model-checking problem. The partitioning process takes into account the costs of the buffer layout transformations associated with the partitioning and distribution of the stream graph. The solution trace resulting from model checking provides a mapping of actors onto the different processors/cores. This solution is then divided into stages, from which coarse-grained software-pipelined code is generated. We use CUDA streams to map these programs synergistically onto the CPU and GPUs. We use a performance model for data transfers to determine the optimal number of CUDA streams on GPUs. Our software-pipelined schedule yields a speedup of up to 55.86X and a geometric mean speedup of 9.62X over a single-threaded CPU.
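To make the partitioning problem concrete, the toy sketch below assigns each actor of a small stream graph to the CPU or the GPU and charges a fixed buffer-layout-transformation cost on every edge that crosses the device boundary. The actor costs and the brute-force search are illustrative assumptions only; the paper formulates and solves this as a model-checking problem rather than by enumeration.

```python
# Illustrative sketch of the partitioning problem: assign each actor of a small
# stream graph to CPU or GPU, charging a buffer layout-transformation cost on
# every edge that crosses the device boundary. All cost numbers are made up.
from itertools import product

actors = ["source", "filter", "fft", "sink"]
edges = [("source", "filter"), ("filter", "fft"), ("fft", "sink")]
cost = {  # per-actor execution cost on each device (hypothetical)
    "source": {"cpu": 2, "gpu": 8},
    "filter": {"cpu": 10, "gpu": 3},
    "fft":    {"cpu": 20, "gpu": 4},
    "sink":   {"cpu": 2, "gpu": 8},
}
TRANSFER = 5  # cost of the buffer layout transformation across the CPU/GPU boundary

def makespan(assignment):
    load = {"cpu": 0, "gpu": 0}
    for actor, device in assignment.items():
        load[device] += cost[actor][device]
    crossings = sum(1 for a, b in edges if assignment[a] != assignment[b])
    return max(load.values()) + crossings * TRANSFER

best = min((dict(zip(actors, devs)) for devs in product(["cpu", "gpu"], repeat=len(actors))),
           key=makespan)
print(best, "->", makespan(best))
```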
{"title":"Efficient Compilation of Stream Programs for Heterogeneous Architectures: A Model-Checking based approach","authors":"R. K. Thakur, Y. Srikant","doi":"10.1145/2764967.2764968","DOIUrl":"https://doi.org/10.1145/2764967.2764968","url":null,"abstract":"Stream programming based on the synchronous data flow (SDF) model naturally exposes data, task and pipeline parallelism. Statically scheduling stream programs for homogeneous architectures has been an area of extensive research. With graphic processing units (GPUs) now emerging as general purpose co-processors, scheduling and distribution of these stream programs onto heterogeneous architectures (having both GPUs and CPUs) provides for challenging research. Exploiting this abundant parallelism in hardware, and providing a scalable solution is a hard problem. In this paper we describe a coarse-grained software pipelined scheduling algorithm for stream programs which statically schedules a stream graph onto heterogeneous architectures. We formulate the problem of partitioning the work between the CPU cores and the GPU as a model-checking problem. The partitioning process takes into account the costs of the required buffer layout transformations associated with the partitioning and the distribution of the stream graph. The solution trace result from the model checking provides a map for the distribution of actors across different processors/-cores. This solution is then divided into stages, and then a coarse grained software-pipelined code is generated. We use CUDA streams to map these programs synergistically onto the CPU and GPUs. We use a performance model for data transfers to determine the optimal number of CUDA streams on GPUs. Our software-pipelined schedule yields a speedup of upto 55.86X and a geometric mean speedup of 9.62X over a single threaded CPU.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125479194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Is dynamic compilation possible for embedded systems?
H. Charles, V. Lomüller
JIT compilation and dynamic compilation are powerful techniques that allow final code generation to be delayed until runtime. There are many benefits: improved portability, virtual machine security, etc. Unfortunately, the tools used for JIT compilation and dynamic compilation do not meet the classical requirements of embedded platforms: their memory footprint is huge and code generation incurs large overheads. In this paper we show how dynamic code specialization (JIT) can be used beneficially in terms of execution speed and energy consumption while keeping the memory footprint under control. We base our approach on our tool deGoal and on LLVM, which we extended to produce lightweight runtime specializers from annotated LLVM programs. Benchmarks are transformed into templates, and a specialization routine is built to instantiate them. This approach produces efficient specialization routines with minimal energy consumption and memory footprint compared to a generic JIT application. Through several benchmarks, we present its efficiency in terms of speed, energy, and memory footprint. We show that, compared to static compilation, we can achieve a speed-up of 21% in execution speed as well as a 10% energy reduction with a moderate memory footprint.
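A minimal sketch of the underlying idea of runtime code specialization follows: once some parameters are known at run time, a routine with those constants folded in is generated from a template and then reused. Python's compile/exec is used here only to keep the example self-contained; deGoal and LLVM emit native code, and this is not their API.

```python
# Conceptual sketch of runtime code specialization: once the filter
# coefficients are known at run time, a specialized routine is generated with
# the constants folded in, instead of repeatedly interpreting a generic,
# parameterized kernel.
def specialize_fir(coeffs):
    terms = " + ".join(f"{c} * x[i + {k}]" for k, c in enumerate(coeffs))
    src = (
        "def fir(x):\n"
        f"    return [ {terms} for i in range(len(x) - {len(coeffs) - 1}) ]\n"
    )
    namespace = {}
    exec(compile(src, "<specialized>", "exec"), namespace)   # build the routine once
    return namespace["fir"]

fir_3tap = specialize_fir([0.25, 0.5, 0.25])   # specialization happens once at run time
print(fir_3tap([1.0, 2.0, 4.0, 8.0, 16.0]))    # the generated routine is then reused
```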
{"title":"Is dynamic compilation possible for embedded systems?","authors":"H. Charles, V. Lomüller","doi":"10.1145/2764967.2782785","DOIUrl":"https://doi.org/10.1145/2764967.2782785","url":null,"abstract":"JIT compilation and dynamic compilation are powerful techniques allowing to delay the final code generation to the runtime. There is many benefits: improved portability, virtual machine security, etc. Unforturnately the tools used for JIT compilation and dynamic compilation does not met the classical requirement for embedded platforms: memory size is huge and code generation has big overheads. In this paper we show how dynamic code specialization (JIT) can be used and be beneficial in terms of execution speed and energy consumption with memory footprint kept under control. We based our approaches on our tool deGoal and on LLVM, that we extended to be able to produce lightweight runtime specializers from annotated LLVM programs. Benchmarks are manipulated and transformed into templates and a specialization routine is build to instantiate the routines. Such approach allows to produce efficient specializations routines, with a minimal energy consumption and memory footprint compare to a generic JIT application. Through some benchmarks, we present its efficiency in terms of speed, energy and memory footprint. We show that over static compilation we can achieve a speed-up of 21 % in terms of execution speed but also a 10 % energy reduction with a moderate memory footprint.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114081048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
An Energy Efficient Message Passing Synchronization Algorithm for Concurrent Data Structures in Embedded Systems
Lazaros Papadopoulos, D. Soudris
Nowadays, modern multicore embedded systems often execute complex applications that rely heavily on concurrent data structures. Databases on embedded microservers, file systems, and stream processing algorithms belong to application domains that normally utilize concurrent data structures to store and process their data. The prevalent lock-based synchronization methods based on mutexes provide poor scalability and, most importantly, lead to high energy consumption, which is an important constraint on embedded systems. In this work, we propose an energy-efficient synchronization model for embedded system architectures based on message-passing communication. Our results show that concurrent data structures based on the proposed model provide lower power consumption than the corresponding lock-based implementations, along with comparable performance.
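A minimal sketch of the message-passing alternative to mutex-based synchronization is shown below: one server thread owns the data structure and applies operations that arrive as messages, so clients never contend on a lock around the structure itself. The mailbox, operation encoding, and thread layout are illustrative assumptions, not the authors' algorithm.

```python
# Sketch of message-passing synchronization: a single server thread owns the
# data structure and applies operations sent through a mailbox, so client
# threads never take a mutex protecting the structure itself. The queue below
# stands in for the on-chip message channel.
import threading, queue

requests = queue.Queue()

def server():
    data = []                                # the data structure, owned by one thread
    while True:
        op, arg, reply = requests.get()
        if op == "push":
            data.append(arg)
            reply.put(None)
        elif op == "pop":
            reply.put(data.pop() if data else None)
        elif op == "stop":
            reply.put(None)
            break

def client_push(value):
    reply = queue.Queue(maxsize=1)
    requests.put(("push", value, reply))     # send a message instead of taking a lock
    return reply.get()

t = threading.Thread(target=server, daemon=True)
t.start()
for v in range(4):
    client_push(v)
done = queue.Queue(maxsize=1)
requests.put(("stop", None, done))
done.get()
```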
{"title":"An Energy Efficient Message Passing Synchronization Algorithm for Concurrent Data Structures in Embedded Systems","authors":"Lazaros Papadopoulos, D. Soudris","doi":"10.1145/2764967.2771931","DOIUrl":"https://doi.org/10.1145/2764967.2771931","url":null,"abstract":"Nowadays, modern multicore embedded systems often execute complex applications that rely heavily on concurrent data structures. Databases on embedded microservers, file systems and stream processing algorithms belong in application domains that normally utilize concurrent data structures to store and process their data. The prevalent lock-based synchronization methods based on mutexes provide poor scalability and, most importantly, they lead to high energy consumption, which is an important constraint on embedded systems. In this work, we propose an energy efficient synchronization model for embedded system architectures based on message-passing communication. Our results show that concurrent data structures based on the proposed model provide lower power consumption in comparison with the corresponding lock-based implementations, along with comparable performance.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133871892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling
T. Schwarzer, J. Falk, M. Glaß, J. Teich, C. Zebelein, C. Haubelt
Application modeling using dynamic dataflow graphs is well-suited for multi-core platforms. However, there is often a mismatch between the fine granularity of the application and that of the platform. Tailoring this granularity to the platform promises performance gains by (a) reducing dynamic scheduling overhead and (b) exploiting compiler optimizations. In this paper, we propose a throughput-optimizing compilation approach that uses Quasi-Static Schedules (QSSs) to combine actors of static dataflow subgraphs. Our approach combines core allocation, QSSs, and actor binding in a Design Space Exploration (DSE), optimizing throughput for a number of available cores. During the DSE, each implementation candidate is compiled to and evaluated on the target hardware, here an Intel i7 and an ARM Cortex-A9. Experimental results, including synthetic benchmarks as well as a real-world control application, show that our holistic compilation approach outperforms classic DSEs that are agnostic of QSS, as well as a DSE that employs QSS as a post-processing step. Among other results, we show a case where the compilation approach obtains a speedup of 9.91x for a 4-core implementation, while a classic DSE only obtains a speedup of 2.12x.
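As a toy illustration of clustering static dataflow actors into a quasi-static schedule, the sketch below computes the repetition vector of a two-actor SDF edge and emits one fixed firing sequence that can replace dynamic scheduling of the two actors. Actor names and rates are made up; the paper handles full subgraphs and integrates the clustering into a DSE.

```python
# Toy illustration of clustering static dataflow actors: for an SDF edge
# A -(produce p, consume c)-> B, the repetition vector gives how often A and B
# fire per cluster iteration, and the cluster can then be executed as one
# quasi-static actor with a fixed firing sequence.
from math import gcd

def repetitions(produce, consume):
    g = gcd(produce, consume)
    return consume // g, produce // g      # firings of A and B per cluster iteration

def clustered_schedule(produce, consume):
    rep_a, rep_b = repetitions(produce, consume)
    return ["A"] * rep_a + ["B"] * rep_b   # one static sequence, fixed at compile time

print(repetitions(2, 3))          # A fires 3 times, B fires 2 times
print(clustered_schedule(2, 3))   # ['A', 'A', 'A', 'B', 'B']
```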
{"title":"Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling","authors":"T. Schwarzer, J. Falk, M. Glaß, J. Teich, C. Zebelein, C. Haubelt","doi":"10.1145/2764967.2764972","DOIUrl":"https://doi.org/10.1145/2764967.2764972","url":null,"abstract":"Application modeling using dynamic dataflow graphs is well-suited for multi-core platforms. However, there is often a mismatch between the fine granularity of the application and the platform. Tailoring this granularity to the platform promises performance gains by (a) reducing dynamic scheduling overhead and (b) exploiting compiler optimizations. In this paper, we propose a throughput-optimizing compilation approach that uses Quasi-Static Schedules (QSSs) to combine actors of static dataflow subgraphs. Our proposed approach combines core allocation, QSSs, and actor binding in a Design Space Exploration (DSE), optimizing the throughput for a number of available cores. During the DSE, each implementation candidate is compiled to and evaluated on the target hardware---here an Intel i7 and an ARM Cortex-A9. Experimental results including synthetic benchmarks as well as a real-world control application show that our proposed holistic compilation approach outperforms classic DSEs that are agnostic of QSS as well as a DSE that employs QSS as a post-processing step. Amongst others, we show a case where the compilation approach obtains a speedup of 9.91 x for a 4-core implementation, while a classic DSE only obtains a speedup of 2.12 x.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126153584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Application-Specific Architecture Exploration Based on Processor-Agnostic Performance Estimation
Juan Fernando Eusse Giraldo, L. Murillo, C. McGirr, R. Leupers, G. Ascheid
Early design decisions such as architectural class and instruction set selection largely determine the performance and energy consumption of application-specific processors (ASIPs). However, making decisions that effectively translate into high performance requires that a careful analysis of the target application be done by an experienced designer. Such a process is extremely time-consuming, and confirmation that the processor meets the application requirements can only be obtained after costly architectural implementation, synthesis, and simulation. To shorten design times, this work couples High-Level Synthesis (HLS) with pre-architectural performance estimation. We do so with the aim of providing designers with an initial architectural seed together with quantitative feedback about its performance. This enables a lightweight refinement process based on the obtained feedback, such that the time-consuming microarchitectural implementation is done only once, at the end of the refinement steps. We employed our flow to generate four potential ASIPs for a 1024-point FFT. Estimate validation and gain evaluation are performed on actual ASIP implementations, which achieve performance gains of up to 8.42x and energy gains of up to 1.32x over an existing VLIW processor.
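A very rough sketch of what pre-architectural performance estimation can look like is given below: a processor-agnostic operation profile of the application is combined with a few candidate architecture parameters to obtain cycle estimates before any implementation exists. The operation counts, issue widths, and penalties are illustrative assumptions and not the estimation model used in the paper.

```python
# Very rough sketch of pre-architectural performance estimation: from a
# processor-agnostic operation profile, estimate cycles per run for a few
# candidate architectural classes before committing to an ASIP implementation.
from math import ceil

profile = {"alu": 40960, "mul": 20480, "mem": 12288, "branch": 2048}  # hypothetical FFT kernel profile

candidates = {
    "single-issue RISC": {"issue_width": 1, "mul_penalty": 2},
    "2-issue VLIW":      {"issue_width": 2, "mul_penalty": 1},
    "4-issue VLIW":      {"issue_width": 4, "mul_penalty": 1},
}

def estimate_cycles(profile, arch):
    ops = sum(profile.values()) + profile["mul"] * (arch["mul_penalty"] - 1)
    return ceil(ops / arch["issue_width"])   # ideal packing; real estimators also model dependences

for name, arch in candidates.items():
    print(f"{name:>18}: ~{estimate_cycles(profile, arch)} cycles")
```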
{"title":"Application-Specific Architecture Exploration Based on Processor-Agnostic Performance Estimation","authors":"Juan Fernando Eusse Giraldo, L. Murillo, C. McGirr, R. Leupers, G. Ascheid","doi":"10.1145/2764967.2771932","DOIUrl":"https://doi.org/10.1145/2764967.2771932","url":null,"abstract":"Early design decisions such as architectural class and instruction set selection largely determine the performance and energy consumption of application specific processors (ASIPs). However, making decisions that effectively reflect in high performance require that a careful analysis of the target application is done by an experienced designer. Such process is extremely time consuming, and a confirmation that the processor meets the application requirements can only be extracted after costly architectural implementation, synthesis and simulation. To shorten design times, this work couples High-Level Synthesis (HLS) with pre-architectural performance estimation. We do so with the aim of providing designers with an initial architectural seed together with quantitative feedback about its performance. This enables to perform a light-weight refinement process based on the obtained feedback, such that time-consuming microarchitectural implementation is done only once at the end of the refinement steps. We employed our flow to generate four potential ASIPs for a 1024-point FFT. Estimates validation and gain evaluation is performed on actual ASIP implementations, which achieve performance gains of up to 8.42x and energy gains up to 1.32x over an existing VLIW processor.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128645407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays
É. Sousa, Frank Hannig, J. Teich, Qingqing Chen, Ulf Schlichtmann
Massively Parallel Processor Arrays (MPPAs) are well suited for use in portable devices such as tablets and smartphones. However, applications running on mobile platforms require a certain performance level or quality (e.g., high-resolution image processing) that needs to be satisfied while adhering to a given power budget and temperature threshold. As a solution to the aforementioned challenges, we consider a resource-aware computing paradigm to exploit runtime adaptation without violating any thermal and/or power constraint in a programmable MPPA. For estimating the power consumption, we developed a mathematical model based on post-synthesis implementations of an MPPA in different CMOS technologies, while the temperature variation was emulated. We showcase our hardware/software mechanism for loading new configurations into the accelerator on the fly, considering quality/throughput trade-offs for image processing applications. The results show that the average power consumption of Sobel and Laplace operators using different numbers of processing elements amounts to 1.24 mW and 10.35 mW, respectively. Furthermore, only 1.64 μs are necessary for configuring a class of MPPA running at 550 MHz.
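The runtime adaptation loop described above can be sketched as a simple controller that shrinks or grows the number of active processing elements depending on measured power and emulated temperature. The thresholds, sensor stand-ins, and per-PE power figure below are invented for illustration and are not the paper's model.

```python
# Hypothetical sketch of a runtime adaptation loop: shrink the number of
# active processing elements (PEs) when power or temperature exceeds its
# threshold, grow again when there is headroom.
POWER_BUDGET_MW = 12.0
TEMP_LIMIT_C = 70.0
PE_POWER_MW = 1.3            # assumed average power per active PE

def read_sensors(active_pes):
    # stand-in for on-chip power and temperature monitors
    return active_pes * PE_POWER_MW, 45.0 + 2.0 * active_pes

def adapt(active_pes, min_pes=1, max_pes=16):
    power, temp = read_sensors(active_pes)
    if power > POWER_BUDGET_MW or temp > TEMP_LIMIT_C:
        return max(min_pes, active_pes - 1)     # reconfigure with fewer PEs (lower quality/throughput)
    if power + PE_POWER_MW <= POWER_BUDGET_MW and temp < TEMP_LIMIT_C - 5:
        return min(max_pes, active_pes + 1)     # headroom: restore throughput
    return active_pes

pes = 16
for step in range(8):
    pes = adapt(pes)
    print(f"step {step}: {pes} active PEs")
```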
{"title":"Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays","authors":"É. Sousa, Frank Hannig, J. Teich, Qingqing Chen, Ulf Schlichtmann","doi":"10.1145/2764967.2771933","DOIUrl":"https://doi.org/10.1145/2764967.2771933","url":null,"abstract":"Massively Parallel Processor Arrays (MPPAs) can be nicely used in portable devices such as tablets and smartphones. However, applications running on mobile platforms require a certain performance level or quality (e.g., high-resolution image processing) that need to be satisfied while adhering to a certain power budget and temperature threshold. As a solution to the aforementioned challenges, we consider a resource-aware computing paradigm to exploit runtime adaptation without violating any thermal and/or power constraint in a programmable MPPA. For estimating the power consumption, we developed a mathematical model based on the post-synthesis implementation of an MPPA in different CMOS technologies while the temperature variation was emulated. We showcase our hardware/software mechanism to load new, on-the-fly configurations into the accelerator, considering quality/throughput tradeoffs for image processing applications. The results show that the average power consumption of a Sobel and Laplace operators using different number of processing elements amounts to 1.24 mW and 10.35 mW, respectively. Furthermore, only 1.64 μs are necessary for configuring a class of MPPA running at 550 MHz.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132258275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
VLIW Code Generation for a Convolutional Network Accelerator
Maurice Peemen, W. Pramadi, B. Mesman, H. Corporaal
This paper presents a compiler flow to map Deep Convolutional Networks (ConvNets) onto a highly specialized VLIW accelerator core targeting the low-power embedded market. Earlier works have focused on energy-efficient accelerators for this class of algorithms, but none of them provides a complete and practical programming model. Due to the large parameter set of a ConvNet, it is essential that the user can abstract from the accelerator architecture and does not have to rely on an error-prone, ad-hoc assembly programming model. By using modulo scheduling for software pipelining, we demonstrate that our automatically generated code achieves hardware utilization equal to, or within 5-20% of, code written manually by experts. Our compiler removes the huge manual workload required to efficiently map ConvNets onto an energy-efficient core for next-generation mobile and wearable devices.
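As a pointer to what modulo scheduling optimizes, the sketch below computes the resource-constrained lower bound on the initiation interval (ResMII) of a software-pipelined loop; recurrence constraints (RecMII) are ignored, and the operation mix and issue widths are invented rather than those of the accelerator in the paper.

```python
# Small sketch of the resource-constrained lower bound used in modulo
# scheduling: the initiation interval (II) of the software-pipelined loop can
# be no smaller than ceil(uses / units) for any functional-unit class.
from math import ceil

def res_mii(op_counts, fu_counts):
    """Resource-minimum initiation interval for one loop iteration."""
    return max(ceil(op_counts[fu] / fu_counts[fu]) for fu in op_counts)

# hypothetical inner loop of a convolution: multiply-accumulates, loads, stores
op_counts = {"mac": 9, "load": 10, "store": 1}
fu_counts = {"mac": 4, "load": 2, "store": 1}   # assumed issue slots per cycle on the VLIW core

print("ResMII =", res_mii(op_counts, fu_counts), "cycles per iteration")
```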
{"title":"VLIW Code Generation for a Convolutional Network Accelerator","authors":"Maurice Peemen, W. Pramadi, B. Mesman, H. Corporaal","doi":"10.1145/2764967.2771928","DOIUrl":"https://doi.org/10.1145/2764967.2771928","url":null,"abstract":"This paper presents a compiler flow to map Deep Convolutional Networks (ConvNets) to a highly specialized VLIW accelerator core targeting the low-power embedded market. Earlier works have focused on energy efficient accelerators for this class of algorithms, but none of them provides a complete and practical programming model. Due to the large parameter set of a ConvNet it is essential that the user can abstract from the accelerator architecture and does not have to rely on an error prone and ad-hoc assembly programming model. By using modulo scheduling for software pipelining we demonstrate that our automatic generated code achieves equal or within 5-20% less hardware utilization w.r.t. code written manually by experts. Our compiler removes the huge manual workload to efficiently map ConvNets to an energy-efficient core for the next-generation mobile and wearable devices.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133142986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
A model-based, single-source approach to design-space exploration and synthesis of mixed-criticality systems
F. Herrera, P. Peñil, E. Villar
While Moore's Law is still in place, the complexity of embedded systems continues to grow exponentially. Embedded systems are implemented on complex HW/SW platforms, requiring more powerful design methods and tools. Electronic System Level (ESL) design [1] proposes to raise the level of abstraction at which the system is modeled in order to allow analysis and optimization of the system at earlier stages of the design process.
{"title":"A model-based, single-source approach to design-space exploration and synthesis of mixed-criticality systems","authors":"F. Herrera, P. Peñil, E. Villar","doi":"10.1145/2764967.2784777","DOIUrl":"https://doi.org/10.1145/2764967.2784777","url":null,"abstract":"While the Moore's Law is still in place, the complexity of embedded systems continues to growth exponentially. Embedded Systems are implemented on complex HW/SW platforms, requiring more powerful design methods and tools. Electronic System Level (ESL) design [1] proposes to raise the level of abstraction in which the system is modeled in order to allow the analysis and optimization of the system at earlier stages of the design process.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133331169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8