A branching program machine (BM) is a special-purpose processor that uses only two kinds of instructions: branch and output instructions. Its architecture is therefore much simpler than that of a general-purpose processor (MPU). Because the BM uses instructions dedicated to a special-purpose application, it is faster than the MPU. This paper presents a packet classifier using a parallel branching program machine (PBM). To reduce computation time and code size, the set of rules for the packet classifier is first partitioned into groups, which are then evaluated by the PBM in parallel. The paper also shows a method to estimate the number of BMs needed to realize the packet classifier. The PBM32, consisting of 32 BMs, has been implemented on an FPGA and compared with an Intel Core2Duo@1.2GHz. The PBM32 is 8.1-11.1 times faster than the Core2Duo and requires only 0.2-10.3 percent of the memory used by the Core2Duo.
{"title":"A Packet Classifier Using a Parallel Branching Program Machine","authors":"Hiroki Nakahara, Tsutomu Sasao, M. Matsuura","doi":"10.1109/DSD.2010.18","DOIUrl":"https://doi.org/10.1109/DSD.2010.18","url":null,"abstract":"A branching program machine (BM) is a special purpose processor that uses only two kinds of instructions: Branch and output instructions. Thus, the architecture for the BM is much simpler than that for a general purpose processor (MPU). Since the BM uses the dedicated instructions for a special purpose application, it is faster than the MPU. This paper presents a packet classifier using a parallel branching program machine (PBM). To reduce computation time and code size, first, a set of rules for the packet classifier is partitioned into groups. Then, they are evaluated by the PBM in parallel. Also, this paper shows a method to estimate the number of necessary BMs to realize the packet classifier. The PBM32 consisting of 32 BMs has been implemented on an FPGA, and compared with the Intel's Core2Duo@1.2GHz. The PBM32 is 8.1-11.1 times faster than the Core2Duo, and the PBM32 requires only 0.2-10.3 percent of the memory for the Core2Duo.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115796301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
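The partition-then-evaluate idea in the abstract above can be sketched in software as follows. This is only an illustration under stated assumptions, not the PBM design: the `Rule` layout, the round-robin `partition`, the "lowest priority number wins" convention, and the default action are all hypothetical, and the groups are evaluated one after another here, whereas the PBM evaluates them in parallel.

```python
from typing import List, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    # Hypothetical rule layout, chosen only for this sketch.
    priority: int              # lower number = higher priority (assumption)
    src: Tuple[int, int]       # inclusive source-address range
    dst_port: Tuple[int, int]  # inclusive destination-port range
    action: str


def matches(rule: Rule, src: int, dst_port: int) -> bool:
    """True if the packet header fields fall inside both ranges of the rule."""
    return (rule.src[0] <= src <= rule.src[1]
            and rule.dst_port[0] <= dst_port <= rule.dst_port[1])


def partition(rules: List[Rule], n_groups: int) -> List[List[Rule]]:
    """Round-robin split of the rule set into n_groups groups (assumption)."""
    return [rules[i::n_groups] for i in range(n_groups)]


def classify_group(group: List[Rule], src: int, dst_port: int) -> Optional[Rule]:
    """Work done by one evaluator: best matching rule of its group, if any."""
    hits = [r for r in group if matches(r, src, dst_port)]
    return min(hits, key=lambda r: r.priority) if hits else None


def classify(groups: List[List[Rule]], src: int, dst_port: int,
             default: str = "deny") -> str:
    """Combine the per-group winners; the groups are independent of each other."""
    winners = [classify_group(g, src, dst_port) for g in groups]
    winners = [w for w in winners if w is not None]
    return min(winners, key=lambda r: r.priority).action if winners else default


RULES = [
    Rule(0, (10, 20), (80, 80), "accept"),
    Rule(1, (0, 255), (22, 22), "log"),
    Rule(2, (0, 255), (0, 1023), "drop"),
]

if __name__ == "__main__":
    groups = partition(RULES, 2)
    print(classify(groups, src=15, dst_port=80))
    print(classify(groups, src=200, dst_port=22))
    print(classify(groups, src=200, dst_port=8080))
```

Because each group only reports its own best match, the final answer does not depend on how the rules were split, which is what makes the groups safe to evaluate concurrently.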
Engineering a hardware platform for a Wireless Sensor Network (WSN) node is known to be a tough challenge, as the design must satisfy many severe constraints, among which energy dissipation is by far the most challenging. Today, most WSN node platforms are based on low-cost, low-power programmable microcontrollers, even though it is acknowledged that their energy efficiency remains limited and hinders the spread of WSNs to new applications. In this paper, we propose a complete system-level flow for an alternative approach based on the concept of hardware micro-tasks, which relies on hardware specialization and power gating to dramatically improve the energy efficiency of the computational part of the node. Early estimates show power savings of more than one order of magnitude over MCU-based implementations.
{"title":"System Level Synthesis for Ultra Low-Power Wireless Sensor Nodes","authors":"Muhammad Adeel Pasha, Steven Derrien, O. Sentieys","doi":"10.1109/DSD.2010.88","DOIUrl":"https://doi.org/10.1109/DSD.2010.88","url":null,"abstract":"Engineering hardware platform for a Wireless Sensor Network (WSN) node is known to be a tough challenge, as the design must enforce many severe constraints, among which energy dissipation is by far the most challenging one. Today, most of the WSN node platforms are based on low cost and low-power programmable micro controllers, even if it is acknowledged that their energy efficiency remains limited and hinders the wide-spreading of WSN to new applications. In this paper, we propose a complete system level flow for an alternative approach based on the concept of hardware micro-tasks, which relies on hardware specialization and power gating to dramatically improve the energy efficiency of the computational part of the node. Early estimates show power saving by more than one order of magnitude over MCU-based implementations.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126735253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The work reported in this paper describes the steps taken towards an FPGA-based implementation of evolvable wavelet transforms for image compression in embedded systems. An Evolutionary Algorithm (EA) for the design and optimization of the transform coefficients is tailored for a System on Chip implementation. The computing requirements of the original algorithm have been reduced in several ways to adapt it for the FPGA implementation. More specifically, this paper addresses the validation of the algorithm using fixed-point arithmetic for the whole optimization process. The results show that high-quality transforms are evolved from scratch with limited-precision arithmetic. Preliminary results of the implementation in an FPGA device are also included.
{"title":"High Level Validation of an Optimization Algorithm for the Implementation of Adaptive Wavelet Transforms in FPGAs","authors":"R. Salvador, F. Moreno, T. Riesgo, L. Sekanina","doi":"10.1109/DSD.2010.96","DOIUrl":"https://doi.org/10.1109/DSD.2010.96","url":null,"abstract":"The work reported in this paper describes the steps given towards an FPGA-based implementation of evolvable wavelet transforms for image compression in embedded systems. An Evolutionary Algorithm (EA) for the design and optimization of the transform coefficients is tailored for a suitable System on Chip implementation. Several cut downs on the computing requirements have been done to the original algorithm, adapting it for the FPGA implementation. What this paper addresses more specifically is the validation of the algorithm using fixed point arithmetic for the whole optimization process. The results show how high quality transforms are evolved from scratch with limited precision arithmetic. Also, preliminary results of the implementation in an FPGA device are included.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124510892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This work proposes a testable QCA (Quantum-Dot Cellular Automata) logic gate (UQCALG) realizing the universal functions. The design of the UQCALG is based on the Coupled Majority Minority (CMVMIN) QCA structure, with the aim of reducing wire crossings as well as the number of clock cycles required to operate a QCA circuit. The characterization of defects in such a design leads to the synthesis of a test block, realized with majority and minority voters, that ensures the desired testability of a circuit. The experimental designs establish that the UQCALG can result in cost-effective designs of testable QCA logic circuits that may not be possible with a conventional ULG (Universal Logic Gate).
{"title":"Design of Testable Universal Logic Gate Targeting Minimum Wire-Crossings in QCA Logic Circuit","authors":"B. Sen, Anik Sengupta, M. Dalui, B. Sikdar","doi":"10.1109/DSD.2010.114","DOIUrl":"https://doi.org/10.1109/DSD.2010.114","url":null,"abstract":"This work proposes a testable QCA (Quantum-Dot Cellular Automata) logic gate (UQCALG) realizing the universal functions. The design of UQCALG is based on the Coupled Majority Minority (CMVMIN) QCA structure with the target to reduce wire crossings as well as the number of clock cycles required to operate a QCA circuit. The characterization of defects in such design leads to synthesis of a test block, realized with the majority and minority voters, that ensures the desired testability of a circuit. The experimental designs establish that the UQCALG can result in cost effective design of testable QCA logic circuits that may not be possible with conventional ULG (Universal Logic Gate).","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114161278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
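As background to the majority and minority voters mentioned above, the short sketch below checks the well-known fact that a three-input majority voter with one input tied to a constant acts as AND or OR, and that the corresponding minority voter acts as NAND or NOR. It is a truth-table illustration only and does not model the UQCALG or CMVMIN structures; all function names are assumptions made for this example.

```python
from itertools import product


def majority(a: int, b: int, c: int) -> int:
    """Three-input majority voter: 1 when at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)


def minority(a: int, b: int, c: int) -> int:
    """Three-input minority voter: complement of the majority voter."""
    return 1 - majority(a, b, c)


# Tying the third input to a constant turns the voters into two-input gates.
def and_gate(a: int, b: int) -> int:
    return majority(a, b, 0)


def or_gate(a: int, b: int) -> int:
    return majority(a, b, 1)


def nand_gate(a: int, b: int) -> int:
    return minority(a, b, 0)


def nor_gate(a: int, b: int) -> int:
    return minority(a, b, 1)


if __name__ == "__main__":
    # Exhaustively compare against the Boolean definitions.
    for a, b in product((0, 1), repeat=2):
        assert and_gate(a, b) == (a & b)
        assert or_gate(a, b) == (a | b)
        assert nand_gate(a, b) == 1 - (a & b)
        assert nor_gate(a, b) == 1 - (a | b)
    print("all four two-input gates verified")
```

Since NAND (or NOR) alone is functionally complete, a minority voter with a programmable constant input is enough to build any Boolean function.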
A. Castagnetti, C. Belleudy, S. Bilavarn, M. Auguin
Many task scheduling algorithms and power management policies have been developed on the basis of simplistic power models, which rarely take into account the power consumption of the different components of a real system. Most of the models on which the study of DVFS scheduling is based assume that the power consumption of a processor can be modelled as E ∝ V². This hypothesis, even if partly true, is not generally applicable when considering the complete system, which consists of the processor, memories and power conversion circuits. In this paper we present a power and energy model for a DVFS-enabled mobile computing platform. The platform is based on a low-power SoC, which integrates both the processor core and memory, as well as other hardware accelerators. We include in our analysis the study of the power conversion components that supply the SoC. Starting from measurements, we first characterize the power consumption of the SoC and the converters; a power and energy model for the processor is then proposed. The model is able to predict the power consumption of the processor core with an average error of less than 10%. It is then used to analyse two DVFS scheduling techniques based on the EDF algorithm, Cycle Conserving and Look Ahead. The results show that the CPU energy saving computed using our model is far less than what would be expected using a model that does not take into account the effect of static power.
{"title":"Power Consumption Modeling for DVFS Exploitation","authors":"A. Castagnetti, C. Belleudy, S. Bilavarn, M. Auguin","doi":"10.1109/DSD.2010.55","DOIUrl":"https://doi.org/10.1109/DSD.2010.55","url":null,"abstract":"A lot of task scheduling algorithms and power management policies have been developed based on simplistic power models, which rarely take into account the effects of the power consumptions of the different components of a real system. Most of the models on which the study of the DVFS scheduling is based, make the assumption that the power consumption of a processor could be modelled as a E ∝ V 2 model. This hypothesis, even if partly true, is not generally applicable when considering the complete system, which consists of the processor, memories and power conversion circuits. In this paper we present a power and energy model for a DVFS enabled mobile computing platform. The platform is based on a low power SoC, which integrates both the processor core and memory, as well as other hardware accelerators. We include in our analisys the study of the power conversion components, which supply the SoC. Starting from measures, we first characterize the power consumption of the SoC and the converters, then a power and energy model for the processor is proposed. The model is able to predict the power consumption of the processor core with an average error less than 10%. This is then used to analyse two DVFS scheduling techniques based on the EDF algorithm, Cycle Conserving and Look Ahead. The results show that the CPU energy saving computed using our model, is far less than what would be expected using a model that does not take into account the effect of the static power.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123694235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
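The point made in the abstract above, that a pure E ∝ V² model overstates DVFS savings once static power is considered, can be illustrated with a small calculation. The sketch below is not the model proposed in the paper: it assumes the textbook form P = C·V²·f + P_static, and every number in it (effective capacitance, operating points, static power) is hypothetical.

```python
def task_energy(cycles: float, volt: float, freq_hz: float,
                c_eff: float, p_static_w: float = 0.0) -> float:
    """Energy in joules to run `cycles` clock cycles at one operating point.

    Assumed model: P = c_eff * V^2 * f + p_static, E = P * (cycles / f).
    """
    exec_time_s = cycles / freq_hz
    p_dynamic_w = c_eff * volt ** 2 * freq_hz
    return (p_dynamic_w + p_static_w) * exec_time_s


def relative_saving(e_high: float, e_low: float) -> float:
    """Fraction of energy saved by running at the low operating point."""
    return 1.0 - e_low / e_high


# Hypothetical operating points and constants, for illustration only.
CYCLES = 1e9
C_EFF = 1e-9            # farads
HIGH = (1.2, 600e6)     # (volts, hertz)
LOW = (0.9, 300e6)
P_STATIC = 0.2          # watts


def saving(p_static_w: float) -> float:
    e_high = task_energy(CYCLES, HIGH[0], HIGH[1], C_EFF, p_static_w)
    e_low = task_energy(CYCLES, LOW[0], LOW[1], C_EFF, p_static_w)
    return relative_saving(e_high, e_low)


if __name__ == "__main__":
    print(f"saving, dynamic power only: {saving(0.0):.1%}")
    print(f"saving, with static power:  {saving(P_STATIC):.1%}")
```

In the dynamic-only case the frequency cancels out and the saving depends on the voltage ratio alone. With a static term, the longer execution time at the lower frequency adds static energy, so the predicted saving shrinks.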
Modeling the complex and computationally intensive applications supported by modern mobile devices with standard modeling languages is a challenging task. Within the GENESYS process model, the application modeling phase is thus of key importance. GENESYS manages complexity by employing cross-domain and platform-based application design. The main contribution of this article is to describe the instantiation of GENESYS application architecture modeling via the MARTE profile and to describe a methodology for validating the nonfunctional properties annotated in the application model.
{"title":"Instantiating GENESYS Application Architecture Modeling via UML 2.0 Constructs and MARTE Profile","authors":"Subayal Khan, Kari Tiensyrjä, J. Nurmi","doi":"10.1109/DSD.2010.36","DOIUrl":"https://doi.org/10.1109/DSD.2010.36","url":null,"abstract":"Modeling of complex and computationally intense applications supported by modern mobile devices via standard modeling languages is a challenging task. Within the GENESYS process model the application modeling phase is thus of key importance. GENESYS manages complexity by employing cross domain and platform-based application design. The main contribution of this article is to describe the instantiation of GENESYS application architecture modeling via MARTE profile and describe a methodology for validation of nonfunctional properties annotated in the application model.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116524788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The syntax-directed synthesis paradigm has been shown to be a powerful synthesis approach. However, its control-driven nature results in significant performance overhead. Some methods to reduce this overhead include peephole optimisations, control resynthesis and component optimisations. This work explores new methods of improving the performance of syntax-directed synthesised asynchronous circuits, using the Balsa synthesis system as the research framework. This includes investigating description styles and the usage of language constructs that exploit the directness of the synthesis method to obtain more concurrent and faster circuits. The techniques and optimisations presented here have been tested on a set of non-trivial examples, including a 32-bit processor, a Viterbi decoder, and a channel-sliced wormhole router.
{"title":"Description-Level Optimisation of Synthesisable Asynchronous Circuits","authors":"L. Tarazona, D. Edwards, A. Bardsley, L. Plana","doi":"10.1109/DSD.2010.71","DOIUrl":"https://doi.org/10.1109/DSD.2010.71","url":null,"abstract":"The syntax-directed synthesis paradigm has shown to be a powerful synthesis approach. However, its control-driven nature results in significant performance overhead. Some methods to reduce this overhead include peephole optimisations, control resynthesis and component optimisations. This work explores new methods of improving the performance of syntax-directed synthesised asynchronous circuits, using the Balsa synthesis system as the research framework. This includes investigating description styles and the usage of language constructs that exploit the directness of the synthesis method to obtain more concurrent and faster circuits. The techniques and optimisations presented here has been tested in a set of non-trivial examples including a 32-bit processor, a Viterbi decoder, and a channel-sliced wormhole router.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121841167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A proven approach to increase performance of general-purpose processors is to add hardware accelerators. In its basic configuration, the FlexCore processor has a limited set of datapath units. But thanks to a flexible datapath interconnect and a wide control word, the FlexCore datapath is explicitly designed to support integration of special units that, on demand, can accelerate certain data-intensive applications. We present the integration of a versatile accelerator for several Cyclic Redundancy Checking (CRC) keys. Furthermore, we investigate the accelerator’s impact on processor execution time and energy efficiency, using the Power Stone CRC benchmark. Our evaluation shows that the accelerated 65-nm 2.7-ns FlexCore datapath is, for example, 86% more energy and cycle efficient than a datapath lacking the CRC accelerator.
{"title":"Cyclic Redundancy Checking (CRC) Accelerator for the FlexCore Processor","authors":"M. Azhar, T. Hoang, P. Larsson-Edefors","doi":"10.1109/DSD.2010.51","DOIUrl":"https://doi.org/10.1109/DSD.2010.51","url":null,"abstract":"A proven approach to increase performance of general-purpose processors is to add hardware accelerators. In its basic configuration, the FlexCore processor has a limited set of datapath units. But thanks to a flexible datapath interconnect and a wide control word, the FlexCore datapath is explicitly designed to support integration of special units that, on demand, can accelerate certain data-intensive applications. We present the integration of a versatile accelerator for several Cyclic Redundancy Checking (CRC) keys. Furthermore, we investigate the accelerator’s impact on processor execution time and energy efficiency, using the Power Stone CRC benchmark. Our evaluation shows that the accelerated 65-nm 2.7-ns FlexCore datapath is, for example, 86% more energy and cycle efficient than a datapath lacking the CRC accelerator.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126528621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
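For readers unfamiliar with what such an accelerator computes, the following is a minimal software sketch of a bitwise CRC parameterised by its generator polynomial, so that several CRC keys can share one routine. It is not the FlexCore accelerator's design: the function name `crc_bitwise`, the 32-bit width, and the default reflected CRC-32 polynomial 0xEDB88320 are assumptions chosen so the result can be checked against Python's standard `zlib.crc32`.

```python
import zlib


def crc_bitwise(data: bytes, poly: int = 0xEDB88320,
                init: int = 0xFFFFFFFF, xor_out: int = 0xFFFFFFFF) -> int:
    """Reflected (least-significant-bit first) 32-bit CRC, one bit per step.

    `poly` is the bit-reversed generator polynomial; passing another value
    selects a different 32-bit CRC key.
    """
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; apply the polynomial when that bit was 1.
            if crc & 1:
                crc = (crc >> 1) ^ poly
            else:
                crc >>= 1
    return crc ^ xor_out


if __name__ == "__main__":
    message = b"123456789"
    # Cross-check the default parameters against the standard library.
    assert crc_bitwise(message) == zlib.crc32(message)
    print(hex(crc_bitwise(message)))
```

The inner loop performs eight dependent shift-and-XOR steps per input byte, which is the kind of data-intensive bit manipulation that a dedicated datapath unit can do in far fewer cycles than a general-purpose instruction sequence.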
Berger-invert codes are coding schemes used to protect communication channels against all asymmetric errors and to decrease power consumption. This paper proposes a method of constructing modified Berger-invert codes that relies on choosing check parts with the smallest possible total weight and assigning low-weight check parts to the most numerous subsets of data with the largest Hamming weights. As a result, the error rate of the transmitted data can be reduced by up to about 23.5% for an 8-bit bus at no cost (no extra bus lines and no increase in the hardware needed to implement the encoding and decoding/checking circuitry).
{"title":"On Reducing Error Rate of Data Protected Using Systematic Unordered Codes in Asymmetric Channels","authors":"S. Piestrak","doi":"10.1109/DSD.2010.117","DOIUrl":"https://doi.org/10.1109/DSD.2010.117","url":null,"abstract":"Berger-invert codes are coding schemes used to protect communication channels against all asymmetric errors and to decrease power consumption. This paper proposes a method of constructing modified Berger-invert codes that relies on the choice of check parts with the smallest possible total weight and assignment of low-weight check parts to the most numerous subsets of data with the largest Hamming weights. As a result, the error rate of the transmitted data can be reduced by up to about 23.5% for a 8-bit bus at no cost (no extra bus lines or increase of hardware to implement encoding and decoding/checking circuitry).","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127754727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
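As background, the sketch below shows the classic Berger code, a systematic unordered code whose check part is the binary count of zeros in the data part. It is not the modified Berger-invert construction proposed in the paper, and it omits the inversion step entirely; the function names and the bit-string representation are assumptions made for this illustration.

```python
def berger_check(data_bits: str) -> str:
    """Check part of a classic Berger code: the number of 0s in the data,
    written in binary with ceil(log2(k + 1)) bits for k data bits."""
    k = len(data_bits)
    width = k.bit_length()  # equals ceil(log2(k + 1)) for k >= 1
    return format(data_bits.count("0"), f"0{width}b")


def encode(data_bits: str) -> str:
    """Systematic codeword: data part followed by its check part."""
    return data_bits + berger_check(data_bits)


def is_valid(codeword: str, k: int) -> bool:
    """Recompute the check part from the first k bits and compare."""
    data, check = codeword[:k], codeword[k:]
    return berger_check(data) == check


def flip_to_zero(word: str, index: int) -> str:
    """Model an asymmetric 1 -> 0 error at the given bit position."""
    assert word[index] == "1"
    return word[:index] + "0" + word[index + 1:]


if __name__ == "__main__":
    codeword = encode("10110010")
    print(codeword)
    # A 1 -> 0 error in the data raises the zero count; a 1 -> 0 error in
    # the check part lowers the stored count. Either way they disagree.
    assert is_valid(codeword, 8)
    assert not is_valid(flip_to_zero(codeword, 0), 8)
    assert not is_valid(flip_to_zero(codeword, 9), 8)
```

Detection works because 1 → 0 errors can only increase the true zero count of the data part and can only decrease the value stored in the check part, so no combination of such errors maps one codeword onto another.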
G. Danese, Mauro Giachero, F. Leporati, Nelson Nazzicari
Biometric identification systems exploit automated methods of recognition based on physiological or behavioural characteristics of people. Among these, fingerprints are very affordable biometric identifiers. Building embedded systems that perform real-time authentication requires a fast computational unit for image processing. In this paper we propose a parallel architecture that efficiently implements the highly computationally demanding core of a matching algorithm based on Band Limited Phase Only spatial Correlation (BLPOC), executed by two concurrent computational units implemented on an Altera Stratix II family FPGA. The realised device is competitive with similar hardware solutions described in the literature and outperforms the processing capabilities of general-purpose PC processors.
{"title":"A Multicore Embedded Processor for Fingerprint Recognition","authors":"G. Danese, Mauro Giachero, F. Leporati, Nelson Nazzicari","doi":"10.1109/DSD.2010.101","DOIUrl":"https://doi.org/10.1109/DSD.2010.101","url":null,"abstract":"Biometric identification systems exploit automated methods of recognition based on physiological or behavioural people characteristics. Among these, fingerprints are very affordable biometric identifiers. In order to build embedded systems performing real-time authentication, a fast computational unit for image processing is required. In this paper we propose a parallel architecture that efficiently implements the high computationally demanding core of a matching algorithm based on Band Limited Phase Only spatial Correlation (BLPOC), elaborated by two concurrent computational units implemented onto Stratix II family Altera FPGA. The realised device is competitive with those provided by similar hardware solutions described in literature and outperforms the elaboration capabilities of general purpose PC processors.","PeriodicalId":356885,"journal":{"name":"2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116266885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}