Pub Date: 2012-07-16; DOI: 10.1109/SAMOS.2012.6404174
F. Pratas, P. Tomás, P. Trancoso, L. Sousa
Reconfigurable hardware can serve as an energy- and performance-efficient co-processing solution for accelerating certain types of applications. To facilitate the design of hardware accelerators, we previously proposed a methodology that adopts the stream-based computing model and uses Graphics Processing Units as prototyping platforms. In this paper we go a step further and propose a new modular architecture for low-power reconfigurable systems onto which stream-based algorithms can be easily mapped. In particular, the architecture consists of a semi-programmable accelerator set that can be adapted to the application's needs in terms of functional units and number of streaming engines. The proposed embedded architecture combines the flexibility of reconfigurable hardware with the advantages of stream computing to meet the strict needs of embedded reconfigurable devices. We show a possible organization for this architecture. Moreover, we provide a general case study to analyze the scalability of the proposed architecture on an Altera FPGA. Our experimental results show that low-power FPGA devices can achieve a significant speed-up over general-purpose processors. Our preliminary estimates show that energy savings of up to 118x are also possible.
Title: Energy efficient stream-based configurable architecture for embedded platforms
Published in: 2012 International Conference on Embedded Computer Systems (SAMOS)
Pub Date: 2012-07-16; DOI: 10.1109/SAMOS.2012.6404156
Amine Anane, E. Aboulhamid, Y. Savaria
As digital systems grow more complex and more parallel, a better abstraction for describing them has become a necessity. This paper shows how, by using the powerful mechanism of transactions as a concurrency model and by taking advantage of .NET introspection and attribute programming capabilities, we were able to develop a system-level modeling and parallel simulation environment. We kept the same concepts for describing the architecture of high-level models, such as modules and communication channels. However, unlike in SystemC, behaviour is no longer described as processes and events but as transactions. We implemented scheduling algorithms that enable transactional models to be simulated in parallel on a multicore machine. These algorithms take into account the dependencies between transactions and the number of cores of the simulation machine. We studied two synchronisation strategies: one using locking and the other using partitioning. An experiment on a WiFi 802.11a transmitter achieved a speedup of about 1.9 using two threads. With 8 threads, although the workload of individual transactions was not significant, we reached a speedup of 5.1. When the workload is significant, the speedup can reach 6.3.
Title: System modeling and multicore simulation using transactions
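To put the reported speedups in perspective, the short Python sketch below converts them into parallel efficiency (speedup divided by thread count). The abstract does not state the thread count for the 6.3 figure; the sketch assumes it was also 8 threads, and the labels are illustrative only.

```python
# Speedups quoted in the abstract above.
# Assumption: the 6.3 figure was also measured with 8 threads.
reported = [
    ("2 threads", 2, 1.9),
    ("8 threads, light transactions", 8, 5.1),
    ("8 threads, heavy transactions", 8, 6.3),
]


def parallel_efficiency(speedup: float, threads: int) -> float:
    """Fraction of ideal linear speedup that was actually achieved."""
    return speedup / threads


for label, threads, speedup in reported:
    eff = parallel_efficiency(speedup, threads)
    print(f"{label}: efficiency = {eff:.2f}")
```

Under that assumption, efficiency falls from about 0.95 at two threads to roughly 0.64 to 0.79 at eight, which is consistent with the abstract's remark that heavier transactions parallelize better.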
Pub Date: 2012-07-16; DOI: 10.1109/SAMOS.2012.6404152
R. Piscitelli, A. Pimentel
System-level design space exploration (DSE), which is performed early in the design process, is of eminent importance to the design of complex multi-processor embedded system architectures. During system-level DSE, system parameters such as the number and type of processors, the type and size of memories, or the mapping of application tasks to architectural resources are considered. Simulation-based DSE, in which different design instances are evaluated using system-level simulations, is typically computationally costly. Even with high-level simulations and efficient exploration algorithms, the simulation time needed to evaluate design points forms a real bottleneck in such DSE. The vast design space that needs to be searched therefore requires effective design space pruning techniques. This paper presents and studies different strategies for interleaving fast but less accurate analytical performance estimations with slower but more accurate simulations during DSE. By interleaving these analytical estimations with simulations, our hybrid approach significantly reduces the number of simulations needed during DSE. Experimental results demonstrate that such hybrid DSE is a promising technique that can yield solutions of similar quality to simulation-based DSE at only a fraction of the execution time.
Title: Interleaving methods for hybrid system-level MPSoC design space exploration
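The general idea of mixing cheap estimates with expensive simulations can be illustrated with a minimal Python sketch. It is not one of the interleaving strategies studied in the paper: it simply ranks every candidate with a fast estimator and simulates only a shortlist. The design-point encoding, both cost models, and all names here are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Design = Tuple[int, int]  # hypothetical design point: (processors, memory_kb)


def hybrid_dse(
    candidates: Sequence[Design],
    estimate: Callable[[Design], float],  # fast, less accurate
    simulate: Callable[[Design], float],  # slow, more accurate
    keep: int,
) -> Tuple[Design, float, int]:
    """Rank all candidates analytically, then simulate only the `keep` best.

    Returns the best simulated design, its simulated cost, and the number
    of simulations that were run.
    """
    shortlist = sorted(candidates, key=estimate)[:keep]
    results = [(simulate(d), d) for d in shortlist]
    best_cost, best = min(results)
    return best, best_cost, len(results)


def toy_estimate(d: Design) -> float:
    """Toy analytical model: ignores contention between processors."""
    p, m = d
    return 1000.0 / p + 0.05 * m


def toy_simulate(d: Design) -> float:
    """Toy 'simulation': same model plus a contention term."""
    p, m = d
    return 1000.0 / p + 0.05 * m + 2.0 * p


candidates = [(p, m) for p in (1, 2, 4, 8, 16) for m in (64, 128, 256)]
best, cost, n_sims = hybrid_dse(candidates, toy_estimate, toy_simulate, keep=4)
print(best, round(cost, 1), f"{n_sims} of {len(candidates)} points simulated")
```

In this toy setup only 4 of 15 points are simulated. The quality of the result depends on how well the estimator's ranking agrees with the simulator's, which is the trade-off the abstract describes.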
Pub Date: 2012-07-16; DOI: 10.1109/SAMOS.2012.6404180
F. Farahnakian, M. Ebrahimi, M. Daneshtalab, J. Plosila, P. Liljeberg
In this paper, we propose a congestion-aware routing algorithm based on Dual Reinforcement Q-routing. In this method, learning packets provide each router with local and global congestion information about the network. This information must be updated dynamically as traffic conditions in the network change. For this purpose, a congestion detection method is presented that measures the average number of free buffer slots in a specific time interval. This value is compared with maximum and minimum threshold values, and the learning rate is updated based on the result of the comparison. A large learning rate indicates that the network is congested, so global information is emphasized over local information. In contrast, local information is more important than global information when a router receives few packets in a time interval. Experimental results for different traffic patterns and network loads show that the proposed method improves network performance compared with the standard Q-routing, DRQ-routing, and Dynamic XY-routing algorithms.
Title: Adaptive reinforcement learning method for networks-on-chip
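A minimal Python sketch of the two ingredients mentioned above: a learning rate driven by buffer occupancy, and a Q-routing table update. The update in `q_update` is the standard forward Q-routing rule; the threshold rule in `adapt_learning_rate`, its step size, and its bounds are assumptions made for illustration, since the abstract does not give the authors' formula. The backward exploration of DRQ-routing is omitted.

```python
from typing import Dict, Tuple

# q[(dest, neighbor)] = estimated delivery time to `dest` via `neighbor`.
QTable = Dict[Tuple[str, str], float]


def adapt_learning_rate(
    rate: float,
    avg_free_slots: float,
    min_threshold: float,
    max_threshold: float,
    step: float = 0.1,      # hypothetical step size
    lowest: float = 0.1,    # hypothetical lower bound
    highest: float = 0.9,   # hypothetical upper bound
) -> float:
    """Assumed rule: few free slots (congestion) raises the rate,
    many free slots lowers it, otherwise it is left unchanged."""
    if avg_free_slots < min_threshold:
        rate += step
    elif avg_free_slots > max_threshold:
        rate -= step
    return min(highest, max(lowest, rate))


def q_update(
    q: QTable,
    dest: str,
    neighbor: str,
    queue_delay: float,
    link_delay: float,
    neighbor_best: float,
    rate: float,
) -> float:
    """Standard Q-routing update: move the old estimate toward
    queue delay + link delay + the neighbor's own best estimate."""
    old = q[(dest, neighbor)]
    target = queue_delay + link_delay + neighbor_best
    q[(dest, neighbor)] = old + rate * (target - old)
    return q[(dest, neighbor)]


q: QTable = {("D", "east"): 10.0}
rate = adapt_learning_rate(0.5, avg_free_slots=1.0,
                           min_threshold=2.0, max_threshold=6.0)
new_estimate = q_update(q, "D", "east", queue_delay=3.0, link_delay=1.0,
                        neighbor_best=8.0, rate=rate)
print(rate, new_estimate)
```

With a higher rate, the router's estimate moves faster toward what its neighbor reports, which is one way to read the abstract's statement that global information is emphasized under congestion.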
Pub Date: 2012-07-16; DOI: 10.1109/SAMOS.2012.6404186
Joseph Rothman, Chen Chang
This presentation gives a technology overview of the BEE4 and miniBEE FPGA-based reconfigurable platforms. BEEcube supplies advanced system-level FPGA prototyping platforms targeting a wide range of uses, including multi-core computer architecture, wireless communications, 100Gbps+ networking solutions, HD video processing, signal intelligence, radar/sonar arrays, and High Performance Computing (HPC). The overview reviews the features, capabilities, unique technology, and uses of both BEE platforms: the state-of-the-art Virtex 6 based multi-array FPGA BEE4™ system and the first Research in a Box solution, the miniBEE™. miniBEE combines the latest FPGA, multicore CPU, and high-speed networking technology, all tightly coupled in one integrated, cost-effective solution targeting the research and lab community. This flexible system replaces the need for disjointed FPGA boards, PCs, networking devices, and test equipment. The presentation describes how both algorithm-oriented researchers and seasoned FPGA experts can use BEE technology to achieve their proof-of-concept or application-level prototyping goals based on real-time, real-world data or conditions. The unique BEE technologies covered include its symmetrical Honeycomb Architecture, the Full Speed Sting I/O interface, the Application Control and Debugging Nectar OS, and the BEEcube Platform Studio software environment. The presentation plans to show BEE technology in action, either for real-time imaging manipulation or, as a flexible testing platform, in an Arbitrary Waveform Generation example.
Title: BEE technology overview
Pub Date: 2012-07-16; DOI: 10.1109/SAMOS.2012.6404155
Rainer Kiesel, M. Streubühr, C. Haubelt, A. Terzis, J. Teich
In recent years, road vehicles have experienced an enormous increase in driver assistance systems such as traffic sign recognition, lane departure warning, and pedestrian detection. Cost-efficient development of electronic control units (ECUs) for these systems is a complex challenge. The demand for shortened time to market makes the development even more challenging and thus demands efficient design flows. This paper proposes a model-based design flow that permits simulation-based performance evaluation of multi-core ECUs for driver assistance systems in a pre-development stage. The approach is based on a system-level virtual prototype of a multi-core ECU and allows the evaluation of timing effects by mapping application tasks to different platforms. The results show that performance estimation of different parallel implementation candidates is possible with high accuracy even in a pre-development stage. By adapting the best-fitting parallelization strategy to the final ECU, a reduction in the time to market period is possible. Currently, the design flow is being evaluated by Daimler AG and is being applied to a pedestrian detection system. Results from this application illustrate the benefits of the proposed approach.
Title: Virtual prototyping for efficient multi-core ECU development of driver assistance systems
Pub Date: 2012-07-16; DOI: 10.1109/SAMOS.2012.6404197
J. Broenink, Yunyun Ni
The work presented here concerns a methodology for the design of hard real-time embedded control software for robots, i.e. mechatronic products. The behavior of the total robot system (machine, control, software and I/O) is relevant, because the dynamics of the machine influence the robot software. Therefore, we use two appropriate Models of Computation: continuous-time equations for the machine / robot part, and discrete-event / discrete-time equations for the control software part.
Title: Model-driven robot-software design using integrated models and co-simulation
Pub Date: 2012-07-16; DOI: 10.1109/SAMOS.2012.6404171
D. Baudisch, J. Brandt, K. Schneider
Data flow process networks (DPNs) have been introduced as a convenient model of computation for distributed and asynchronous systems, since each process node can work independently of the other nodes, i.e. without the need for global coordination. Synchronous and cyclo-static data flow process networks even allow efficient static schedules to be derived at compile time, so that these systems can run with an efficient use of the available resources, e.g. in embedded systems. Single process nodes of DPNs are stream-based computing devices that transform input streams into uniquely defined corresponding output streams, such that single values of the output streams are computed as soon as sufficient input values are available. In this sense, they are related to the execution of an instruction stream by a conventional microprocessor. In this paper, we show how out-of-order execution, which was introduced for the efficient use of multiple functional units in microprocessors, can also be used to implement DPNs on multiprocessors. This way, the implementation of DPNs on multiprocessors allows one to optimize the throughput of single process nodes and, as shown by our experiments, also of the entire DPN.
Title: Out-of-order execution of synchronous data-flow networks
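The firing behaviour of a single process node, which computes an output value as soon as enough input values are available, can be sketched in a few lines of Python. This is only an illustration of the data-driven firing rule for a node with one input stream and a fixed consumption rate; it does not model the out-of-order execution scheme of the paper, and the `StreamNode` class is a hypothetical name.

```python
from collections import deque
from typing import Callable, Deque, List


class StreamNode:
    """A process node that fires whenever `needed` input tokens are buffered."""

    def __init__(self, func: Callable[..., int], needed: int) -> None:
        self.func = func
        self.needed = needed
        self.inbox: Deque[int] = deque()

    def push(self, value: int) -> List[int]:
        """Buffer one input token and return any outputs this enables."""
        self.inbox.append(value)
        outputs = []
        while len(self.inbox) >= self.needed:
            args = [self.inbox.popleft() for _ in range(self.needed)]
            outputs.append(self.func(*args))
        return outputs


# A node that consumes two tokens per firing and emits their sum.
adder = StreamNode(lambda a, b: a + b, needed=2)
outputs = [y for x in [1, 2, 3, 4, 5] for y in adder.push(x)]
print(outputs)            # [3, 7]
print(list(adder.inbox))  # [5] is still waiting for a partner token
```

Because the output stream depends only on the order of the input tokens and not on when they arrive, such a node needs no global coordination.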
Pub Date: 2012-07-16; DOI: 10.1109/samos.2012.6404153
A. Kritikakou, F. Catthoor, G. Athanasiou, Vasilios I. Kelefouras, C. Goutis
Embedded applications usually require Software/Hardware (SW/HW) designs to meet hard timing constraints while retaining the required design flexibility. Exhaustive exploration of SW/HW designs is very time-consuming, while ad hoc approaches and partially automatic tools usually lead to less efficient designs. To support a more efficient co-design process for FPGA platforms, we propose a systematic methodology for mapping an application to a SW/HW platform with a custom HW accelerator and a microprocessor core. The mapping steps of the methodology are expressed through parametric templates for the SW/HW Communication Organization, the Foreground (FG) Memory Management and the Data Path (DP) Mapping. Instantiating the templates produces several Pareto points in the performance-area trade-off space. A real-time bioimaging application is mapped onto an FPGA to evaluate the gains of our approach: 44.8% in performance compared with pure SW designs and 58% in area compared with pure HW designs.
Title: A template-based methodology for efficient microprocessor and FPGA accelerator co-design
Pub Date: 2012-07-16; DOI: 10.1109/SAMOS.2012.6404202
P. Quinton, Anne-Marie Chana, Steven Derrien
Many modern computing systems deal with streams of data that have to be processed in parallel in order to be handled in real time. This is the case in particular for some kinds of cyber-physical systems, which process data provided by physical devices. We consider here an approach, relying on the polyhedral model, to generating efficient hardware for a particular class of such systems. Flexible parallel components, described in the Alpha functional language, are modelled and assembled using a scheduling method that combines the synchronous data-flow principle of balance equations with the polyhedral scheduling technique. The modelling of flexible components relies on a simple, affine-periodic, delayable and stretchable time model, which allows a full system to be assembled and synthesized by combining the component hardware descriptions with automatically generated wrappers. We illustrate this method on a simplified WCDMA system, and we discuss the relationship of this approach with stream languages, latency-insensitive design, and multidimensional data-flow systems.
Title: Efficient hardware implementation of data-flow parallel embedded systems
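The synchronous data-flow balance equations mentioned above require that, on every edge, the tokens produced per schedule period equal the tokens consumed: r[src] * produced = r[dst] * consumed. The Python sketch below solves them for a small connected graph to obtain the smallest integer repetition vector. It covers only this textbook SDF step, not the polyhedral scheduling of the paper, and the three-node example graph is made up.

```python
from fractions import Fraction
from math import lcm
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int, int]  # (producer, consumer, produced, consumed)


def repetition_vector(edges: List[Edge]) -> Dict[str, int]:
    """Solve r[src] * produced == r[dst] * consumed on a connected graph."""
    # Fix the first producer's rate to 1 and propagate along the edges.
    rates: Dict[str, Fraction] = {edges[0][0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, prod, cons in edges:
            if src in rates and dst not in rates:
                rates[dst] = rates[src] * prod / cons
                changed = True
            elif dst in rates and src not in rates:
                rates[src] = rates[dst] * cons / prod
                changed = True
    # Every edge must balance, otherwise no periodic schedule exists.
    for src, dst, prod, cons in edges:
        if rates[src] * prod != rates[dst] * cons:
            raise ValueError("inconsistent rates: no periodic schedule")
    # Scale the rational rates to the smallest integer solution.
    scale = lcm(*(r.denominator for r in rates.values()))
    return {node: int(r * scale) for node, r in rates.items()}


# Hypothetical chain: A emits 2 tokens per firing, B consumes 3 and emits 1,
# C consumes 2.
edges = [("A", "B", 2, 3), ("B", "C", 1, 2)]
reps = repetition_vector(edges)
print(reps)  # {'A': 3, 'B': 2, 'C': 1}
```

For this chain, A must fire three times for every two firings of B and one firing of C, so six tokens cross the first edge and two cross the second in each period. The sketch requires Python 3.9 or later for the multi-argument `math.lcm`.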