Although many efficient high-level algorithms have been proposed for realizing Multiple Constant Multiplications (MCM) with the fewest addition and subtraction operations, they do not consider the low-level implementation issues that directly affect the area, delay, and power dissipation of the MCM design. In this paper, we first present area-efficient addition and subtraction architectures used in the design of the MCM operation. Then, we propose an algorithm that searches for an MCM design with the smallest area, taking into account the gate-level cost of each operation. To address the area and delay tradeoff in MCM design, the proposed algorithm is extended to find the smallest-area solution under a delay constraint. The experimental results show that the proposed algorithms yield low-complexity and high-speed MCM designs compared with those obtained by prominent algorithms designed to optimize the number of operations and the gate-level area.
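As a minimal sketch of the operation-count view that high-level MCM algorithms optimize, the Python below recodes a constant in canonical signed digit (CSD) form and multiplies using only shifts, additions, and subtractions. This is the generic digit-based technique, not the gate-level area algorithm proposed in the paper; the function names `csd_digits`, `shift_add_multiply`, and `operation_count` are hypothetical.

```python
def csd_digits(c):
    """Return the canonical signed digit form of a positive integer c,
    least significant digit first, with digits in {-1, 0, 1}."""
    digits = []
    while c != 0:
        if c & 1:
            # +1 when c mod 4 == 1, -1 when c mod 4 == 3
            d = 2 - (c & 3)
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def shift_add_multiply(x, c):
    """Compute c * x using only shifts, additions, and subtractions."""
    total = 0
    for shift, d in enumerate(csd_digits(c)):
        if d == 1:
            total += x << shift
        elif d == -1:
            total -= x << shift
    return total


def operation_count(c):
    """Additions/subtractions needed by the digit-based method:
    one fewer than the number of nonzero CSD digits."""
    nonzero = sum(1 for d in csd_digits(c) if d != 0)
    return max(nonzero - 1, 0)
```

For example, 7x costs a single subtraction, (x << 3) - x, whereas the plain binary form 4x + 2x + x needs two additions. A count like this treats every operation as equally expensive, which is the simplification the abstract argues against.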
{"title":"Optimization of Area and Delay at Gate-Level in Multiple Constant Multiplications","authors":"L. Aksoy, E. Costa, P. Flores, J. Monteiro","doi":"10.1109/DSD.2010.32","DOIUrl":"https://doi.org/10.1109/DSD.2010.32","url":null,"abstract":"Although many efficient high-level algorithms have been proposed for the realization of Multiple Constant Multiplications (MCM) using the fewest number of addition and subtraction operations, they do not consider the low-level implementation issues that directly affect the area, delay, and power dissipation of the MCM design. In this paper, we initially present area efficient addition and subtraction architectures used in the design of the MCM operation. Then, we propose an algorithm that searches an MCM design with the smallest area taking into account the cost of each operation at gate-level. To address the area and delay tradeoff in MCM design, the proposed algorithm is improved to find the smallest area solution under a delay constraint. The experimental results show that the proposed algorithms yield low-complexity and high-speed MCM designs with respect to those obtained by the prominent algorithms designed for the optimization of the number of operations and the optimization of area at gate-level.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128237466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces a novel synchronous-to-asynchronous logic conversion tool targeted specifically at synchronous field programmable gate arrays (FPGAs). The tool augments the synchronous FPGA design flow and removes the clock network, implementing an asynchronous control network in its place. We evaluate the timing performance benefits of the methods used to implement the asynchronous control network on synchronous FPGA fabric. Industrial video processing circuits are used to demonstrate the iterative timing improvements the tool makes to the asynchronous control network in each circuit. The targeted design constraints used in the tool are intended to improve the robustness and predictability of the placed circuits. This makes the timing benefits of asynchronous bundled-data circuits easier to achieve, making asynchronous circuits a viable design option on modern FPGAs.
{"title":"Optimising Self-Timed FPGA Circuits","authors":"P. Ferguson, A. Efthymiou, T. Arslan, Danny Hume","doi":"10.1109/DSD.2010.97","DOIUrl":"https://doi.org/10.1109/DSD.2010.97","url":null,"abstract":"This paper introduces a novel synchronous to asynchronous logic conversion tool targeted specifically for a synchronous field programmable gate array (FPGA). This tool augments the synchronous FPGA design flow and removes the clock network to implement an asynchronous control network in its place. We evaluate the timing performance benefits of the methods used to implement the asynchronous control network on synchronous FPGA fabric. Industrial video processing circuits are used to demonstrate the iterative timing improvements the tool makes to asynchronous control networks in each circuit. The targeted design constraints used in the tool are intended to improve the robustness and predictability of the placed circuits. This allows the timing benefits of asynchronous bundled data circuits easier to achieve, making asynchronous circuits a viable design option on modern FPGAs.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133856564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Otero, Angel Morales-Cas, J. Portilla, E. D. L. Torre, T. Riesgo
In this paper, a solution to support the run-time readback, relocation, and replication of cores in embedded systems with dynamic and partial reconfiguration capabilities is presented. The proposal is a peripheral structure that allows easy integration and communication with the rest of the system, and it includes an API that makes the reconfiguration details more transparent to software applications. Unlike other proposals, all functionality is implemented in hardware, achieving a higher reconfiguration speed. In addition, several design decisions have been taken to increase the portability of the solution to existing and, possibly, future FPGAs. Finally, a use case is provided that shows the features of this module applied to the run-time scaling of a hardware coprocessor.
{"title":"A Modular Peripheral to Support Self-Reconfiguration in SoCs","authors":"A. Otero, Angel Morales-Cas, J. Portilla, E. D. L. Torre, T. Riesgo","doi":"10.1109/DSD.2010.100","DOIUrl":"https://doi.org/10.1109/DSD.2010.100","url":null,"abstract":"In this paper, a solution to support the run-time read back, relocation and replication of cores in embedded systems with dynamic and partial reconfiguration capabilities is presented. The proposal shows a peripheral structure that allows an easy integration and communication with the rest of the system, including an API to make the reconfiguration details to be more transparent to software applications. Differently to other proposals, all functionality is implemented in hardware, achieving a higher reconfiguration speed. In addition, different design decisions have been taken in order to increase the portability of the solution to existing and, possibly, future FPGAs. Finally, a use case is provided, which shows the features of this module applied to the run-time scaling of a hardware coprocessor.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131408379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-moduli architectures are very useful for reconfigurable digital processors and fault-tolerant systems that are based on the Residue Number System (RNS). In this paper we propose two architectures for multi-moduli squaring that support the most common moduli cases in RNS channels, that is, 2^n-1, 2^n and 2^n+1. The proposed architectures are based on the modified Booth encoding of the input operand for deriving the required partial products and on Dadda adder trees for their addition. Experimental results show that the proposed squarers offer significant savings in area compared to previous proposals while a small improvement in delay is achieved in most cases as well.
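To make the three residue channels concrete, the following Python is a small software reference model of squaring modulo 2^n-1, 2^n, and 2^n+1. It only illustrates the arithmetic; it does not model the Booth encoding or the Dadda adder trees of the proposed hardware, and the names `fold_residue` and `multi_moduli_square` are hypothetical.

```python
def fold_residue(value, n, sign):
    """Reduce a non-negative value modulo 2^n - sign (sign is +1 or -1)
    by combining its n-bit chunks, using 2^n == sign for this modulus."""
    modulus = (1 << n) - sign
    mask = (1 << n) - 1
    acc = 0
    weight = 1
    while value:
        acc += weight * (value & mask)  # chunk k is weighted by sign**k
        value >>= n
        weight *= sign
    return acc % modulus


def multi_moduli_square(x, n):
    """Return x*x reduced in each of the three channels."""
    sq = x * x
    return {
        "2^n-1": fold_residue(sq, n, +1),  # chunks are added
        "2^n": sq & ((1 << n) - 1),        # keep the low n bits
        "2^n+1": fold_residue(sq, n, -1),  # chunks alternate in sign
    }
```

With n = 4 and x = 11, the square 121 reduces to 1 (mod 15), 9 (mod 16), and 2 (mod 17). The shared folding step hints at why a single datapath can serve several moduli with only small changes.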
{"title":"Area-Efficient Multi-moduli Squarers for RNS","authors":"D. Bakalis, H. T. Vergos","doi":"10.1109/DSD.2010.25","DOIUrl":"https://doi.org/10.1109/DSD.2010.25","url":null,"abstract":"Multi-moduli architectures are very useful for reconfigurable digital processors and fault-tolerant systems that are based on the Residue Number System (RNS). In this paper we propose two architectures for multi-moduli squaring that support the most common moduli cases in RNS channels, that is, 2^n-1, 2^n and 2^n+1. The proposed architectures are based on the modified Booth encoding of the input operand for deriving the required partial products and on Dadda adder trees for their addition. Experimental results show that the proposed squarers offer significant savings in area compared to previous proposals while a small improvement in delay is achieved in most cases as well.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"20 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120870596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The ever-increasing complexity of MPSoCs is making the production of software the critical path in embedded system development. Several programming models and tools have recently been proposed to facilitate application development for embedded MPSoCs. OpenMP is a mature and easy-to-use standard for shared memory programming, and it has recently been successfully adopted in embedded MPSoC programming as well. To achieve performance, however, the implementation of OpenMP constructs must efficiently exploit the many peculiarities of MPSoC hardware. In this paper we present an extensive evaluation of the cost associated with supporting OpenMP on such a machine, investigating several implementation variants that efficiently exploit the memory hierarchy. Experimental results on different benchmarks confirm the effectiveness of the optimizations in terms of performance improvements.
{"title":"Evaluating OpenMP Support Costs on MPSoCs","authors":"A. Marongiu, P. Burgio, L. Benini","doi":"10.1109/DSD.2010.99","DOIUrl":"https://doi.org/10.1109/DSD.2010.99","url":null,"abstract":"The ever-increasing complexity of MPSoCs is making the production of software the critical path in embedded system development. Several programming models and tools have been proposed in the recent past that aim at facilitating application development for embedded MPSoCs. OpenMP is a mature and easy-to-use standard for shared memory programming, which has recently been successfully adopted in embedded MPSoC programming as well. To achieve performance, however, it is necessary that the implementation of OpenMP constructs efficiently exploits the many peculiarities of MPSoC hardware. In this paper we present an extensive evaluation of the cost associated with supporting OpenMP on such a machine, investigating several implementative variants that efficiently exploit the memory hierarchy. Experimental results on different benchmarks confirm the effectiveness of the optimizations in terms of performance improvements.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"110 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120870985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Some applications, especially in the area of multimedia processing, need to be implemented on a multi-chip platform because of their size. An efficient communication infrastructure for such systems may be designed using Networks-on-Chip (NoCs). However, a network for multi-chip systems requires a scalable architecture. Moreover, for multimedia purposes, such a NoC should support a multicast transmission mode. To meet these requirements, we propose the NoMC (Network-on-Multi-Chip), a hierarchical interconnect system designed for multi-chip systems. The performance of the proposed network is assessed using a model of an MVC (Multiview Video Coding) coder. In such a system, the multicast transmission mode may yield an overall bandwidth gain of up to 30%. Moreover, the synthesis results show that the proposed network elements are easily synthesizable for FPGA devices.
{"title":"Network-on-Multi-Chip (NoMC) for Multi-FPGA Multimedia Systems","authors":"M. Stepniewska, A. Luczak, J. Siast","doi":"10.1109/DSD.2010.106","DOIUrl":"https://doi.org/10.1109/DSD.2010.106","url":null,"abstract":"Some applications, especially in the area of multimedia processing, need to be implemented in a multichip platform, due to their size. An efficient communication infrastructure for such systems may be designed with the use of the Networks-on-Chip (NoCs). However, a network for multi-chip systems require a scalable architecture. Moreover, for multimedia purposes, such NoC should support a multicast transmission mode. In order to meet this requirements, we propose the NoMC (Network-on-Multi-Chip) which is a hierarchical interconnect system, designed for multi-chip systems. A performance of the proposed network is assessed utilizing a model of the MVC (Multiview Video Coding) coder. In such system, the multicast transmission mode may yield an overall bandwidth gain up to 30%. Moreover, the synthesis results show that the proposed network elements are easily synthesizable for the FPGA devices.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121307241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Dondo, Fernando Rincón Calle, Jesús Barba, F. Moya, Francisco Sánchez, J. C. López
This document presents a persistence management model for reconfigurable SoCs. The model provides an efficient persistence mechanism that preserves the data of hardware components swapped out of dynamically reconfigurable areas, so that these components can be reinserted and resume execution from the point where they were interrupted. The mechanism supports state management not only for components instantiated in reconfigurable areas but also for components in static areas that can be stopped and replaced by new versions, instantiated in hardware or implemented in software, with their state migrated to the new versions.
{"title":"Persistence Management Model for Dynamically Reconfigurable Hardware","authors":"J. Dondo, Fernando Rincón Calle, Jesús Barba, F. Moya, Francisco Sánchez, J. C. López","doi":"10.1109/DSD.2010.90","DOIUrl":"https://doi.org/10.1109/DSD.2010.90","url":null,"abstract":"This document presents a persistence management model for reconfigurable SoC. This model provides an efficient mechanism for persistence to preserve data information of hardware components that are swapped out of dynamically reconfigurable areas, in order to allow the reinsertion of these components and to restart the execution path from the same point where they were interrupted when reinserted. This mechanism allows state management of components instantiated not only in reconfigurable areas, but also for those instantiated in static areas, that are feasible to be stopped and replaced for new versions instantiated in hardware or implemented in software migrating their state to the new ones.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122902947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zdenek Prikryl, Karel Masarík, Tomás Hruska, A. Husár
Application-specific instruction set processors used in embedded systems are highly optimized for a given task, and each such processor runs one specific application. The designer therefore needs a tool that helps with the optimization of both the processor and the application. One such tool is a profiler, which can uncover problematic parts, such as bottlenecks, in the processor and application design. The designer can then easily find which parts of the processor or application should be modified to improve performance or reduce power consumption. In this paper, a method for generating a cycle-accurate profiler for the C language from a processor model described in an architecture description language is proposed.
{"title":"Generated Cycle-Accurate Profiler for C Language","authors":"Zdenek Prikryl, Karel Masarík, Tomás Hruska, A. Husár","doi":"10.1109/DSD.2010.39","DOIUrl":"https://doi.org/10.1109/DSD.2010.39","url":null,"abstract":"Application-specific instruction set processors used in embedded systems are highly optimized for a given task. On this type of processors runs a specific application. Therefore, the designer should have a tool which helps him or her in the task of processor and application optimization. One of such tools is profiler. It can discover problematic parts, such as bottleneck points, in the processor and application design. Then, the designer can easily find which parts of the processor or application should be modified, so that performance gets better or power-consumption is reduced. In this paper, a way how to generate cycle-accurate profiler for C language from a processor model described with an architecture description language is proposed.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"513 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116008861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new moduli set {2^n-1, 2^n+3, 2^n+1, 2^n-3} has recently been proposed to represent numbers in Residue Number Systems (RNS), increasing the number of channels. With this, the processing time can be reduced by simultaneously exploiting the carry-free characteristic of modular arithmetic and improving the parallelism. In this paper, hardware structures for addition and multiplication in RNS for the moduli {2^n-3} and {2^n+3} are proposed and analyzed. To evaluate the performance of the proposed units, they were implemented in an ASIC technology. The experimental results suggest that the performance of the moduli {2^n±3} units is acceptable, but that they demand more area and impose a larger delay than the typically used {2^n±1} arithmetic units. Addition units require at least 42% more area for a performance identical to the {2^n+1} modulo adder. The multiplication units require up to 37% more area and impose a delay 25% higher. This paper also suggests that more balanced moduli sets should be developed in order to achieve more efficient RNS.
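The carry-free, channel-wise arithmetic mentioned above can be sketched in Python as follows, reading the moduli as 2^n-1, 2^n+3, 2^n+1, and 2^n-3. This is a software illustration under that assumption, not the proposed hardware structures; the helper names (`moduli_set`, `to_rns`, `rns_add`, `rns_mul`, `from_rns`) are hypothetical, and the reconstruction uses the standard Chinese Remainder Theorem.

```python
from math import prod


def moduli_set(n):
    """Moduli {2^n-1, 2^n+3, 2^n+1, 2^n-3}; use n >= 3 so all exceed 1."""
    base = 1 << n
    return [base - 1, base + 3, base + 1, base - 3]


def to_rns(x, moduli):
    """Forward conversion: one residue per channel."""
    return [x % m for m in moduli]


def rns_add(a, b, moduli):
    # Each channel is independent: no carries cross between channels.
    return [(ra + rb) % m for ra, rb, m in zip(a, b, moduli)]


def rns_mul(a, b, moduli):
    return [(ra * rb) % m for ra, rb, m in zip(a, b, moduli)]


def from_rns(residues, moduli):
    """Reverse conversion via the Chinese Remainder Theorem
    (requires pairwise coprime moduli; pow(..., -1, m) needs Python 3.8+)."""
    dynamic_range = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = dynamic_range // m
        total += r * partial * pow(partial, -1, m)
    return total % dynamic_range
```

For n = 4 the moduli are 15, 19, 17, and 13, giving a dynamic range of 62985, so the product 123 × 45 = 5535 is represented and recovered without overflow.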
{"title":"Arithmetic Units for RNS Moduli {2n-3} and {2n+3} Operations","authors":"P. M. Matutino, R. Chaves, L. Sousa","doi":"10.1109/DSD.2010.77","DOIUrl":"https://doi.org/10.1109/DSD.2010.77","url":null,"abstract":"A new moduli set {2n-1, 2n+3, 2n+1, 2n-3} has recently been proposed to represent numbers in Residue Number Systems (RNS), increasing the number of channels. With this, the processing time can be reduced by simultaneously exploiting the carry-free characteristic of the modular arithmetic and improving the parallelism. In this paper, hardware structures for addition and multiplication operation in RNS for the moduli {2n-3} and {2n+3} are proposed and analyzed. In order to evaluate the performance of the proposed units they were implemented on an ASIC technology. The obtained experimental results suggest that the performance of the moduli {2npm3} are acceptable but demand more area resource and impose a larger delay than the typically used {2npm1} arithmetic units. Addition units require at least 42% more area for a performance identical to the {2n+1} modulo adder. The multiplication units require up to 37% more area and impose a delay 25% higher. This paper also suggests that more balanced moduli sets should be developed in order to achieve more efficient RNS.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131560324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. López, O. Garnica, D. Albonesi, S. Dropsho, J. Lanchares, J. Hidalgo
Resizable caches can trade off capacity for access speed to dynamically match the needs of the workload. In Simultaneous Multi-Threaded (SMT) cores, caching needs can vary greatly with the number of threads and their characteristics, offering opportunities to dynamically adjust cache resources to the workload. In this paper we propose the use of resizable caches to improve the performance of SMT cores, and we introduce a new control algorithm that provides good results independent of the number of running threads. In workloads with a single thread, the resizable cache control algorithm should optimize for cache miss behavior because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies can be satisfied simultaneously by using the harmonic mean of the per-thread speedups as the metric for evaluating system performance, which lets the control adjust smoothly and naturally to the degree of multithreading.
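The metric named in the abstract can be written down directly. The Python below is a minimal sketch of the harmonic mean of per-thread speedups; how each speedup is measured and how the controller uses the value are not specified here, and the function name `harmonic_mean_speedup` is hypothetical.

```python
def harmonic_mean_speedup(speedups):
    """Harmonic mean of per-thread speedups (each must be positive)."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive values")
    return len(speedups) / sum(1.0 / s for s in speedups)
```

With a single thread the metric reduces to that thread's own speedup. With several threads, a slowed-down thread pulls the value down more than it would under an arithmetic mean: speedups of 1.0 and 0.5 give about 0.67 rather than 0.75.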
{"title":"Adaptive Cache Memories for SMT Processors","authors":"S. López, O. Garnica, D. Albonesi, S. Dropsho, J. Lanchares, J. Hidalgo","doi":"10.1109/DSD.2010.69","DOIUrl":"https://doi.org/10.1109/DSD.2010.69","url":null,"abstract":"Resizable caches can trade-off capacity for access speed to dynamically match the needs of the workload. In Simultaneous Multi-Threaded (SMT) cores, the caching needs can vary greatly across the number of threads and their characteristics, offering opportunities to dynamically adjust cache resources to the workload. In this paper we propose the use of resizable caches in order to improve the performance of SMT cores, and introduce a new control algorithm that provides good results independent of the number of running threads. In workloads with a single thread, the resizable cache control algorithm should optimize for cache miss behavior because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies can be simultaneously satisfied by using the harmonic mean of the per-thread speedups as the metric to evaluate the system performance, and to smoothly and naturally adjust to the degree of multithreading.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128788885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}