Pub Date: 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285747
J. Keinert, C. Haubelt, J. Teich
Embedded real-time image processing applications working on large images have to process and store huge amounts of data. Consequently, the organization of the memory buffers and the precise determination of the required buffer sizes are critical steps for efficient system implementation. In this paper, we propose a new method that permits this analysis to be performed automatically for local image processing algorithms. These algorithms are specified using the windowed synchronous data flow (WSDF) model, a multi-dimensional model of computation designed specifically to represent local image processing algorithms. This paper introduces a corresponding buffer organization leading to solutions whose memory requirements are comparable to those of hand-built designs. Special care is taken so that problems with large image sizes can also be analyzed. The applicability of our approach is demonstrated using a JPEG2000 decoder model.
{"title":"Simulative Buffer Analysis of Local Image Processing Algorithms Described by Windowed Synchronous Data Flow","authors":"J. Keinert, C. Haubelt, J. Teich","doi":"10.1109/ICSAMOS.2007.4285747","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2007.4285747","journal":{"name":"2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation"},"publicationDate":"2007-07-16"}
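To give a feel for the kind of buffer size a hand-built design would use for a local (sliding-window) image filter, the sketch below computes the classic line-buffer requirement. It is not the WSDF analysis from the paper; the function name and the raster-scan assumption are illustrative only.

```python
def line_buffer_pixels(image_width: int, window_width: int, window_height: int) -> int:
    """Pixels that must be stored so that one full window is available.

    Assumption (not taken from the paper): pixels arrive in raster-scan
    order, one per step, and the window moves by one pixel at a time.
    """
    if not (1 <= window_width <= image_width) or window_height < 1:
        raise ValueError("window must fit inside one image row")
    # (window_height - 1) complete rows, plus the part of the newest row
    # that the window currently covers.
    return (window_height - 1) * image_width + window_width


# Example: a 3x3 window on a 1920-pixel-wide image.
print(line_buffer_pixels(1920, 3, 3))
```

The result grows linearly with the image width, which is why large images make precise buffer sizing important.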
Pub Date: 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285742
B. Garbinato, R. Guerraoui, J. Hulaas, A. Kounine, Maxime Monod, J. H. Spring
This paper presents the weight-watcher service. This service aims at providing resource consumption measurements and estimations for software executing on resource-constrained devices. By using the weight-watcher, software components can continuously adapt and optimize their quality of service with respect to resource availability. The interface of the service is composed of a profiler and a predictor. We present an implementation that is lightweight in terms of CPU and memory. We also performed various experiments that convey (a) the tradeoff between the memory consumption of the service and the accuracy of the prediction, as well as (b) a maximum overhead of 10% on the execution speed of the VM for the profiler to provide accurate measurements.
{"title":"The Weight-Watcher Service and its Lightweight Implementation","authors":"B. Garbinato, R. Guerraoui, J. Hulaas, A. Kounine, Maxime Monod, J. H. Spring","doi":"10.1109/ICSAMOS.2007.4285742","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2007.4285742","journal":{"name":"2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation"},"publicationDate":"2007-07-16"}
Pub Date: 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285736
H. Blume, Jörg von Livonius, Lisa Rotenberg, T. Noll, Harald Bothe, J. Brakensiek
In this contribution, the potential of parallelized software that implements digital signal processing algorithms on a multicore processor platform is analyzed. For this purpose, various digital signal processing tasks have been implemented on a prototyping platform, namely an ARM MPCore featuring four ARM11 processor cores. In order to analyze the effect of parallelization on the resulting performance-power ratio, influencing parameters such as the number of issued program threads have been studied. For parallelization, the OpenMP programming model has been used, which can be applied efficiently at the C level. In order to develop power-efficient code, a functional and instruction-level power model of the MPCore has also been derived, which features a high estimation accuracy. Using this power model and exploiting the capabilities of OpenMP, a variety of exemplary tasks could be efficiently parallelized. The general efficiency potential of parallelization for multiprocessor architectures can be assembled.
{"title":"Performance and Power Analysis of Parallelized Implementations on an MPCore Multiprocessor Platform","authors":"H. Blume, Jörg von Livonius, Lisa Rotenberg, T. Noll, Harald Bothe, J. Brakensiek","doi":"10.1109/ICSAMOS.2007.4285736","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2007.4285736","journal":{"name":"2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation"},"publicationDate":"2007-07-16"}
Pub Date: 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285753
F. Regazzoni, S. Badel, T. Eisenbarth, J. Großschädl, A. Poschmann, Z. Deniz, Marco Macchetti, L. Pozzi, C. Paar, Y. Leblebici, P. Ienne
This paper explores the resistance of MOS current mode logic (MCML) against differential power analysis (DPA) attacks. Circuits implemented in MCML have unique characteristics both in terms of power consumption and in the dependency of the power profile on the input signal pattern. Therefore, MCML is suitable for protecting cryptographic hardware from DPA and similar side-channel attacks. In order to demonstrate the effectiveness of different logic styles against power analysis attacks, the non-linear bijective function of the Kasumi algorithm (known as substitution box S7) was implemented in CMOS and MCML technology, and a set of attacks was performed using power traces derived from SPICE-level simulations. Although all keys were discovered for CMOS, only very few attacks on MCML were successful.
{"title":"A Simulation-Based Methodology for Evaluating the DPA-Resistance of Cryptographic Functional Units with Application to CMOS and MCML Technologies","authors":"F. Regazzoni, S. Badel, T. Eisenbarth, J. Großschädl, A. Poschmann, Z. Deniz, Marco Macchetti, L. Pozzi, C. Paar, Y. Leblebici, P. Ienne","doi":"10.1109/ICSAMOS.2007.4285753","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2007.4285753","journal":{"name":"2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation"},"publicationDate":"2007-07-16"}
Pub Date: 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285740
Tero Rintaluoma, O. Silvén
In this paper, we consider the energy efficiency of implementations of video codecs for mobile devices in a top-down manner. We start from typical applications and analyse device architectures, codec implementations, and software platforms. The physical size of mobile devices limits their heat dissipation, while the battery capacity needs to be used sparingly to provide satisfactory untethered active use time. Together with the required versatile capabilities of the devices, these are essential constraints that must be taken into account from hardware to application software design. In video decoding, additional constraints come from the need to support multiple digital video coding standards and from the platform-oriented design regimes of the device manufacturers.
{"title":"Energy efficiency of mobile video decoding","authors":"Tero Rintaluoma, O. Silvén","doi":"10.1109/ICSAMOS.2007.4285740","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2007.4285740","journal":{"name":"2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation"},"publicationDate":"2007-07-16"}
Pub Date: 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285749
E. Herruzo, E. Zapata, O. Plata
The paper describes a framework for analyzing the cache content on affine references to arrays in loops. The framework is based on a small set of key cache parameters. We study the relation between these cache parameters and the data memory layout of arrays to demonstrate how to use array padding (static array re-dimensioning) to optimize the use of the cache. Based on the cache model we present a method to carry out intra-array padding for a maximum cache occupation and for a maximum sorted cache occupation, and a simple method to carry out inter-array padding. We also present an experimental evaluation of our techniques using a cache simulator and actual code executions on the MIPS R10K processor.
{"title":"Maximum and Sorted Cache Occupation Using Array Padding","authors":"E. Herruzo, E. Zapata, O. Plata","doi":"10.1109/ICSAMOS.2007.4285749","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2007.4285749","journal":{"name":"2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation"},"publicationDate":"2007-07-16"}
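The effect that intra-array padding exploits can be seen with a small model of a direct-mapped cache. The sketch below counts how many distinct cache sets are touched when walking down one column of a row-major 2D array, with and without padding each row. It is only an illustration of the general idea, not the paper's method for choosing the padding; the cache parameters (32-byte lines, 1024 sets, 8-byte elements) and the function name are assumptions and do not describe the MIPS R10K.

```python
def sets_touched(n_rows: int, row_elems: int, elem_bytes: int = 8,
                 line_bytes: int = 32, n_sets: int = 1024) -> int:
    """Count distinct cache sets hit by accessing a[i][0] for i in range(n_rows).

    Assumptions: direct-mapped cache, array stored row-major starting at
    address 0, row_elems elements per (possibly padded) row.
    """
    touched = set()
    for i in range(n_rows):
        addr = i * row_elems * elem_bytes           # byte address of a[i][0]
        touched.add((addr // line_bytes) % n_sets)  # set index of that line
    return len(touched)


# Row length of 1024 doubles: the column stride is a multiple of the cache
# size divided by 4, so all accesses collide in only a few sets.
print(sets_touched(64, 1024))
# Padding each row by 4 doubles (one cache line) spreads the same accesses
# over 64 different sets.
print(sets_touched(64, 1028))
```

With the unpadded row length, 64 column accesses compete for 4 sets; with one cache line of padding per row, each access maps to its own set.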
Pub Date: 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285728
M. Med, A. Krall
In an embedded system, the cost of storing a program on-chip can be as high as the cost of the microprocessor itself. We examine how much a given application's program size can be reduced when an instruction set is tailored to the application. We provide different algorithms for calculating an optimized instruction set and evaluate their impact on the size of several benchmark programs. Our results show that an average reduction of 11% is possible, and further improvement can be achieved by changing the instruction length of the given architecture. However, compiling other applications with such an optimized instruction set might produce larger code sizes.
{"title":"Instruction Set Encoding Optimization for Code Size Reduction","authors":"M. Med, A. Krall","doi":"10.1109/ICSAMOS.2007.4285728","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2007.4285728","journal":{"name":"2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation"},"publicationDate":"2007-07-16"}
Pub Date: 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285738
N. Saint-Jean, P. Benoit, G. Sassatelli, L. Torres, M. Robert
Scalability of architecture, programming model and task control management will be a major challenge for MP-SOC designs in the coming years. The contribution presented in this paper is HS-Scale, a hardware/software framework to study, define and experiment with scalable solutions for next-generation MP-SOCs. The hardware architecture, H-Scale, is a homogeneous MP-SOC based on RISC processors, distributed memories and a globally asynchronous/locally synchronous network on chip. S-Scale is the software support used to program H-Scale. It is a multithreaded sequential programming model with dedicated communication primitives handled at run-time by a simple operating system we developed. The hardware validations on FPGA and in 90 nm CMOS technology and the experimental case studies on several applications (FIR, DES and MJPEG) demonstrate the scalability of our approach and open interesting perspectives for automating task placement and duplication.
{"title":"Application Case Studies on HS-Scale, a MP-SOC for Embbeded Systems","authors":"N. Saint-Jean, P. Benoit, G. Sassatelli, L. Torres, M. Robert","doi":"10.1109/ICSAMOS.2007.4285738","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2007.4285738","journal":{"name":"2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation"},"publicationDate":"2007-07-16"}
Pub Date: 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285744
S. Xydis, G. Economakos, K. Pekmestzi
This paper presents a design technique for coarse-grained reconfigurable cores targeting mostly DSP applications. The proposed technique inlines flexibility into custom carry-save-arithmetic (CSA) datapaths by exploiting a stable and canonical interconnection scheme. The canonical interconnection is revealed by a uniformity transformation imposed on the basic architectures of CSA multipliers and CSA chain-adders/subtractors. The design flow for the implementation of the core is analyzed in detail, and a novel reconfigurable architecture prototype is presented. The paper concludes with experimental results showing that our architecture achieves an average latency reduction of 32.63% compared with datapaths of primitive computational resources, with a tolerable overhead in hardware utilization.
{"title":"Flexibility Inlining into Arithmetic Data-paths Exploiting A Regular Interconnection Scheme","authors":"S. Xydis, G. Economakos, K. Pekmestzi","doi":"10.1109/ICSAMOS.2007.4285744","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2007.4285744","journal":{"name":"2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation"},"publicationDate":"2007-07-16"}
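For readers unfamiliar with carry-save arithmetic, the following sketch shows the basic 3-to-2 reduction step that CSA chain-adders build on: three operands are reduced to a sum word and a carry word without propagating carries, and a single ordinary addition is done at the end. This is a generic software illustration on non-negative integers, not the reconfigurable datapath of the paper; both function names are made up for this example.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """Reduce three non-negative integers to (sum_word, carry_word).

    The invariant is sum_word + carry_word == a + b + c, and no carry
    ripples between bit positions while computing the two words.
    """
    sum_word = a ^ b ^ c                          # per-bit sum without carries
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, shifted
    return sum_word, carry_word


def csa_chain_sum(operands: list[int]) -> int:
    """Add many operands with a chain of carry-save steps."""
    s, c = 0, 0
    for x in operands:
        s, c = carry_save_add(s, c, x)
    return s + c  # the only carry-propagating addition


print(csa_chain_sum([3, 5, 7, 9]))
```

In hardware, each carry-save step has constant depth regardless of word length, which is what makes such chains attractive for low-latency datapaths.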
Pub Date: 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285750
Vassilis Dimopoulos, I. Papaefstathiou, D. Pnevmatikatos
The Aho-Corasick (AC) algorithm is a very flexible and efficient but memory-hungry pattern matching algorithm that can detect occurrences of multiple pattern strings in an input while looking at each input character exactly once, making it one of the main options for software-based intrusion detection systems such as SNORT. We present the Split-AC algorithm, a reconfigurable variation of the AC algorithm that exploits domain-specific characteristics of intrusion detection to considerably reduce the FSM memory requirements. Split-AC achieves an overall reduction of 28-75% compared to the best proposed implementation.
{"title":"A Memory-Efficient Reconfigurable Aho-Corasick FSM Implementation for Intrusion Detection Systems","authors":"Vassilis Dimopoulos, I. Papaefstathiou, D. Pnevmatikatos","doi":"10.1109/ICSAMOS.2007.4285750","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2007.4285750","journal":{"name":"2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation"},"publicationDate":"2007-07-16"}
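As background, a minimal software sketch of the classic Aho-Corasick automaton is given below. It is the textbook algorithm with failure links, not the Split-AC variant or any hardware FSM from the paper; the function names are chosen for this example. The per-state dictionaries hint at why memory use grows quickly with the number of patterns.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a list of pattern strings."""
    goto, fail, out = [{}], [0], [set()]
    # Phase 1: insert every pattern into a trie.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pattern)
    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
print(sorted(search("ushers", automaton)))
```

A full FSM implementation would precompute a next state for every (state, character) pair so that no failure links are followed at match time, at the price of a much larger transition table.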