2012 International Conference on Embedded Computer Systems (SAMOS): latest papers


Energy efficient stream-based configurable architecture for embedded platforms

Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404174

F. Pratas, P. Tomás, P. Trancoso, L. Sousa

Reconfigurable hardware can be used as an energy- and performance-efficient co-processing solution to accelerate certain types of applications. To facilitate the design of hardware accelerators, we have previously proposed a methodology that adopts the stream-based computing model and uses Graphics Processing Units as prototyping platforms. In this paper we go a step further and propose a new modular architecture for low-power reconfigurable systems onto which stream-based algorithms can be mapped easily. In particular, the architecture consists of a semi-programmable accelerator set that can be adapted to the application's needs in terms of functional units and number of streaming engines. The proposed embedded architecture combines the flexibility of reconfigurable hardware with the advantages of stream computing to meet the strict needs of embedded reconfigurable devices. We show a possible organization for this architecture. Moreover, we provide a general case study to analyze the scalability of the proposed architecture on an Altera FPGA. Our experimental results show that low-power FPGA devices can achieve a significant speed-up over general-purpose processors. Our preliminary estimates show that it is also possible to achieve energy savings of up to 118x.


Citations: 1

System modeling and multicore simulation using transactions

Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404156

Amine Anane, E. Aboulhamid, Y. Savaria

With the increasing complexity of digital systems, which are becoming more and more parallel, a better abstraction to describe such systems has become a necessity. This paper shows how, by using the powerful mechanism of transactions as a concurrency model and by taking advantage of .NET introspection and attribute programming capabilities, we were able to develop a system-level modeling and parallel simulation environment. We kept the same concepts to describe the architecture of high-level models, such as modules and communication channels. However, unlike SystemC, the behaviour is no longer described as processes and events but as transactions. We implemented scheduling algorithms to simulate transactional models in parallel on a multicore machine. These algorithms take into account the dependencies between transactions and the number of cores of the simulation machine. We studied two synchronisation strategies: one using locking and the other using partitioning. An experiment on a WiFi 802.11a transmitter achieved a speedup of about 1.9 using two threads. With 8 threads, although the workload of individual transactions was not significant, we could reach a speedup of 5.1. When the workload is significant, the speedup can reach 6.3.
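
As a rough illustration of dependency-aware scheduling, the sketch below groups transactions into batches whose members do not depend on each other, so each batch could be handed to parallel worker threads. It is not the authors' .NET scheduler: the function name `parallel_batches`, the transaction names, and the assumption that every dependency is itself a key of the input mapping are all hypothetical.

```python
def parallel_batches(deps):
    """Group transactions into batches that could run in parallel.

    deps maps each transaction name to the set of transactions it must
    wait for (hypothetical input format; every dependency must be a key).
    """
    remaining = {t: set(d) for t, d in deps.items()}
    batches = []
    while remaining:
        # Transactions with no unfinished dependencies are ready now.
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cyclic or unknown dependency")
        batches.append(ready)
        for t in ready:
            del remaining[t]
        # Mark the finished batch as satisfied for everything still waiting.
        for d in remaining.values():
            d.difference_update(ready)
    return batches


example = {"tx": set(), "rx": set(), "mix": {"tx", "rx"}, "out": {"mix"}}
print(parallel_batches(example))
```

A real scheduler would also cap each batch at the number of available cores, which the abstract names as a second scheduling input.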


Citations: 3

Interleaving methods for hybrid system-level MPSoC design space exploration

Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404152

R. Piscitelli, A. Pimentel

System-level design space exploration (DSE), which is performed early in the design process, is of eminent importance to the design of complex multi-processor embedded system architectures. During system-level DSE, system parameters such as the number and type of processors, the type and size of memories, or the mapping of application tasks to architectural resources are considered. Simulation-based DSE, in which different design instances are evaluated using system-level simulations, is typically computationally costly. Even with high-level simulations and efficient exploration algorithms, the simulation time needed to evaluate design points forms a real bottleneck in such DSE. The vast design space that needs to be searched therefore requires effective design space pruning techniques. This paper presents and studies different strategies for interleaving fast but less accurate analytical performance estimations with slower but more accurate simulations during DSE. By interleaving these analytical estimations with simulations, our hybrid approach significantly reduces the number of simulations needed during DSE. Experimental results demonstrate that such hybrid DSE is a promising technique that can yield solutions of similar quality to simulation-based DSE in only a fraction of the execution time.
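
One simple way to picture the hybrid idea is to rank all candidates with the cheap analytical estimate and spend the expensive simulations only on a short list. The sketch below does exactly that and nothing more; it is not one of the interleaving strategies studied in the paper, and `hybrid_explore`, `estimate`, `simulate`, and `budget` are hypothetical names chosen for illustration (lower cost is assumed to be better).

```python
def hybrid_explore(designs, estimate, simulate, budget):
    """Pick a design using cheap estimates to prune and simulation to decide.

    designs  : list of candidate design points
    estimate : fast, less accurate cost function (lower is better)
    simulate : slow, more accurate cost function (lower is better)
    budget   : maximum number of simulations allowed
    """
    if budget < 1 or not designs:
        raise ValueError("need at least one design and one simulation")
    # Cheap pass: order every candidate by its analytical estimate.
    ranked = sorted(designs, key=estimate)
    # Expensive pass: simulate only the most promising candidates.
    shortlisted = ranked[:budget]
    scored = [(simulate(d), d) for d in shortlisted]
    best_cost, best = min(scored, key=lambda pair: pair[0])
    return best, best_cost, len(shortlisted)


# Toy example: the estimate is slightly off, the simulation is the truth.
best, cost, sims = hybrid_explore(
    list(range(1, 11)),
    estimate=lambda d: abs(d - 6),
    simulate=lambda d: (d - 5) ** 2 + 1,
    budget=3,
)
print(best, cost, sims)
```

In the toy run only 3 of the 10 candidates are simulated, yet the true optimum is still found because the estimate ranks it near the top.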


Citations: 7

Adaptive reinforcement learning method for networks-on-chip

Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404180

F. Farahnakian, M. Ebrahimi, M. Daneshtalab, J. Plosila, P. Liljeberg

In this paper, we propose a congestion-aware routing algorithm based on Dual Reinforcement Q-routing. In this method, learning packets provide each router with local and global congestion information about the network. This information should be updated dynamically according to the changing traffic conditions in the network. For this purpose, a congestion detection method is presented that measures the average number of free buffer slots in a specific time interval. This value is compared with maximum and minimum threshold values, and the learning rate is updated based on the result of the comparison. A large learning rate indicates that the network is congested, so global information is emphasized more than local information. In contrast, local information is more important than global information when a router receives few packets in a time interval. Experimental results for different traffic patterns and network loads show that the proposed method improves network performance compared with the standard Q-routing, DRQ-routing, and Dynamic XY-routing algorithms.
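
The threshold comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the step size, the rate bounds, and the function names `adapt_learning_rate` and `q_update` are hypothetical, and the update shown is the textbook Q-routing rule rather than the exact rule of the paper.

```python
def adapt_learning_rate(avg_free_slots, min_thresh, max_thresh,
                        rate, step=0.1, lowest=0.1, highest=0.9):
    """Adjust the learning rate from the average number of free buffer slots.

    Few free slots (below min_thresh) suggest congestion, so the rate is
    raised; many free slots (above max_thresh) suggest light traffic, so it
    is lowered. step, lowest and highest are assumed values.
    """
    if avg_free_slots < min_thresh:
        return min(highest, rate + step)
    if avg_free_slots > max_thresh:
        return max(lowest, rate - step)
    return rate


def q_update(q_old, local_delay, neighbour_estimate, rate):
    """Textbook Q-routing update toward local delay plus neighbour estimate."""
    target = local_delay + neighbour_estimate
    return q_old + rate * (target - q_old)


rate = adapt_learning_rate(avg_free_slots=1.0, min_thresh=2.0,
                           max_thresh=6.0, rate=0.5)
print(rate, q_update(10.0, 2.0, 4.0, 0.5))
```

With a higher rate, the estimate reported by the neighbour moves the stored Q-value further in a single update, which is one way to read "global information is emphasized more".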


Citations: 28

BEE technology overview

Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404186

Joseph Rothman, Chen Chang

This presentation gives a technology overview of the BEE4 and miniBEE FPGA-based reconfigurable platforms. BEEcube supplies advanced system-level FPGA prototyping platforms targeting a wide range of uses, including multi-core computer architecture, wireless communications, 100Gbps+ networking solutions, HD video processing, signal intelligence, radar/sonar arrays, and High Performance Computing (HPC). The overview reviews the features, capabilities, unique technology, and uses of both BEE platforms: the state-of-the-art Virtex 6 based multi-array FPGA BEE4™ system and the first Research in a Box solution, the miniBEE™. miniBEE combines the latest FPGA, multicore CPU, and high-speed networking technology, all tightly coupled in one integrated, cost-effective solution targeting the research and lab community. This flexible system replaces the need for disjointed FPGA boards, PCs, networking devices, and test equipment. The presentation describes how both algorithm-oriented researchers and seasoned FPGA experts can use BEE technology to achieve their proof-of-concept or application-level prototyping goals based on real-time, real-world data or conditions. The unique BEE technologies covered include its symmetrical Honeycomb Architecture, the Full Speed Sting I/O interface, the Application Control and Debugging Nectar OS, and the BEEcube Platform Studio software environment. The presentation plans to show BEE technology in action, either for real-time imaging manipulation or, as a flexible testing platform, in an Arbitrary Waveform Generation example.


Citations: 6

Virtual prototyping for efficient multi-core ECU development of driver assistance systems

Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404155

Rainer Kiesel, M. Streubühr, C. Haubelt, A. Terzis, J. Teich

In recent years, road vehicles have experienced an enormous increase in driver assistance systems such as traffic sign recognition, lane departure warning, and pedestrian detection. Cost-efficient development of electronic control units (ECUs) for these systems is a complex challenge. The demand for shortened time to market makes the development even more challenging and thus demands efficient design flows. This paper proposes a model-based design flow that permits simulation-based performance evaluation of multi-core ECUs for driver assistance systems in a pre-development stage. The approach is based on a system-level virtual prototype of a multi-core ECU and allows the evaluation of timing effects by mapping application tasks to different platforms. The results show that performance estimation of different parallel implementation candidates is possible with high accuracy even in a pre-development stage. By adapting the best-fitting parallelization strategy to the final ECU, a reduction in the time to market period is possible. Currently, the design flow is being evaluated by Daimler AG and is being applied to a pedestrian detection system. Results from this application illustrate the benefits of the proposed approach.


Citations: 0

Model-driven robot-software design using integrated models and co-simulation

Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404197

J. Broenink, Yunyun Ni

The work presented here is on a methodology for design of hard real-time embedded control software for robots, i.e. mechatronic products. The behavior of the total robot system (machine, control, software and I/O) is relevant, because the dynamics of the machine influences the robot software. Therefore, we use two appropriate Models of Computation, which represent continuous-time equations for the machine / robot part, and discrete event / discrete time equations for the control software part.


Citations: 22

Maximum performance computing for exascale applications

Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404150

O. Mencer

Summary form only given. Ever since Fermi, Pasta and Ulam conducted the first fundamentally important numerical experiments in 1953, science has been driven by the progress of available computational capability. In particular, computational quantum chemistry and computational quantum physics depend on ever increasing amounts of computation. However, due to power density limitations at the chip we have seen the end of single CPU performance scaling. Now the challenge is to improve compute performance through some form of parallel processing without incurring power limits at the system level. One way to deal with the system “power wall” question is to ask “what is the maximum amount of computation that can be achieved within a certain power budget”. We argue that such Maximum Performance Computing needs to focus on end-to-end execution time of complete scientific applications and needs to include a multi-disciplinary approach, bringing together scientists and engineers to optimize the whole process from mathematics and algorithms all the way down to arithmetic and number representation. We have done a number of such multidisciplinary studies with our customers (Chevron, Schlumberger, and JP Morgan). Our current results with Maxeler Dataflow Engines for production PDE solver applications in Earth Sciences and Finance show an improvement of 20-40x in Speed and/or Watts per application run.


Citations: 6

Out-Of-order execution of synchronous data-flow networks

Pub Date : 2012-07-16 DOI: 10.1109/SAMOS.2012.6404171

D. Baudisch, J. Brandt, K. Schneider

Data flow process networks (DPNs) have been introduced as a convenient model of computation for distributed and asynchronous systems, since each process node can work independently of the other nodes, i.e. without the need for global coordination. Synchronous and cyclo-static data flow process networks even allow efficient static schedules to be derived at compile time, so that these systems can run with an efficient use of the available resources, e.g. in embedded systems. Single process nodes of DPNs are stream-based computing devices that transform input streams into uniquely defined corresponding output streams, such that single values of the output streams are computed as soon as sufficient input values are available. In this sense, they are related to the execution of an instruction stream by a conventional microprocessor. In this paper, we show how out-of-order execution, which was introduced for the efficient use of multiple functional units in microprocessors, can also be used to implement DPNs on multiprocessors. This way, the implementation of DPNs on multiprocessors makes it possible to optimize the throughput of single process nodes and, as shown by our experiments, of the entire DPN.
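
The compile-time schedules mentioned for synchronous data flow rest on the balance equations: for every edge, the number of tokens produced per schedule period must equal the number consumed. The sketch below solves them for a small connected graph. It illustrates the standard technique only, not the out-of-order implementation of the paper; the function name, the edge format `(src, dst, produced, consumed)`, and the actor names are assumptions.

```python
from fractions import Fraction
from math import lcm  # available from Python 3.9


def repetition_vector(actors, edges):
    """Solve the SDF balance equations q[src] * produced == q[dst] * consumed.

    actors : list of actor names (graph assumed connected and non-empty)
    edges  : list of (src, dst, produced, consumed) tuples
    Returns integer firing counts per actor for one schedule period.
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, produced, consumed in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
                changed = True
            elif src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    raise ValueError("inconsistent rates: no static schedule")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the rational solution to integers.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(rate[a] * scale) for a in actors}


print(repetition_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)]))
```

For the example graph, A fires 3 times, B twice, and C once per period, so every edge carries as many tokens in as out.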


Citations: 14

A template-based methodology for efficient microprocessor and FPGA accelerator co-design

Pub Date : 2012-07-16 DOI: 10.1109/samos.2012.6404153

A. Kritikakou, F. Catthoor, G. Athanasiou, Vasilios I. Kelefouras, C. Goutis

Embedded applications usually require Software/Hardware (SW/HW) designs to meet hard timing constraints and the required design flexibility. Exhaustive exploration of SW/HW designs is a very time-consuming task, while ad hoc approaches and the use of partially automatic tools usually lead to less efficient designs. To support a more efficient co-design process for FPGA platforms, we propose a systematic methodology to map an application to a SW/HW platform with a custom HW accelerator and a microprocessor core. The mapping steps of the methodology are expressed through parametric templates for the SW/HW Communication Organization, the Foreground (FG) Memory Management, and the Data Path (DP) Mapping. Several Pareto points in the performance-area trade-off space are produced by instantiating the templates. A real-time bioimaging application is mapped on an FPGA to evaluate the gains of our approach, i.e. 44.8% in performance compared with pure SW designs and 58% in area compared with pure HW designs.
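
To make the notion of Pareto points concrete, the sketch below filters a set of candidate designs down to those that are not dominated in both execution time and area. The design points are invented for illustration and are not results from the paper; `pareto_front` is a hypothetical helper name, and both metrics are assumed to be minimized.

```python
def pareto_front(points):
    """Return the non-dominated (time, area) points, both to be minimized.

    A point is dominated if another point is no worse in both metrics
    and differs from it in at least one.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


# Invented (execution time, area) pairs for template instantiations.
candidates = [(10, 5), (8, 7), (12, 4), (9, 9), (8, 6)]
print(pareto_front(candidates))
```

Here (8, 7) and (9, 9) are dropped because (8, 6) and (8, 7) respectively are at least as good in both metrics, leaving three trade-off points.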


Citations: 2
