Pub Date : 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285745
Fabrizio Ferrandi, P. Lanzi, G. Palermo, C. Pilato, D. Sciuto, Antonino Tumeo
This paper presents a new methodology based on evolutionary multi-objective optimization (EMO) to synthesize multiple complex modules on programmable devices (FPGAs). It starts from a behavioral description written in a common high-level language (for instance C) and automatically produces the register-transfer level (RTL) design in a hardware description language (e.g. Verilog). Since all high-level synthesis problems (scheduling, allocation and binding) are notoriously NP-complete and interdependent, the three problems should be considered simultaneously. This leads to a large design space that must be thoroughly explored to obtain solutions that satisfy the design constraints. Evolutionary algorithms are good candidates for such complex explorations. In this paper we provide a solution based on the non-dominated sorting genetic algorithm (NSGA-II) that explores the design space in order to obtain the best solutions in terms of performance given the area constraints of a target FPGA device. Moreover, a good cost estimation model has been integrated to guarantee the quality of the solutions found without requiring a complete synthesis to validate each generation, which would be impractical and time-consuming. We show on the JPEG case study that the proposed approach provides good results in terms of the trade-off between total occupied area and execution time.
Title : An Evolutionary Approach to Area-Time Optimization of FPGA designs
Venue : 2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation
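The area/time trade-off that NSGA-II ranks can be illustrated with a small non-dominated sorting sketch. This is not the authors' implementation: the candidate tuples, the `max_area` bound and all function names are hypothetical, and the front-peeling loop is the naive quadratic version rather than the faster bookkeeping NSGA-II uses.

```python
from typing import List, Tuple

# Hypothetical design point: (area, execution_time), both to be minimized.
Design = Tuple[float, float]


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b in both objectives and strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def feasible(designs: List[Design], max_area: float) -> List[Design]:
    """Keep only designs that fit the (assumed) area budget of the target device."""
    return [d for d in designs if d[0] <= max_area]


def non_dominated_fronts(designs: List[Design]) -> List[List[Design]]:
    """Peel off successive Pareto fronts; fronts[0] holds the best trade-offs."""
    remaining = list(designs)
    fronts: List[List[Design]] = []
    while remaining:
        front = [d for d in remaining
                 if not any(dominates(other, d) for other in remaining)]
        fronts.append(front)
        remaining = [d for d in remaining if d not in front]
    return fronts


# Made-up candidates, for illustration only.
candidates = [(100, 9.0), (120, 7.5), (150, 7.5), (200, 5.0), (90, 12.0), (130, 8.0)]
fronts = non_dominated_fronts(feasible(candidates, max_area=160))
print(fronts[0])
```

Here the design `(200, 5.0)` is discarded by the area bound before ranking, and the first front contains the designs for which no other feasible design is at least as good in both area and time.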
Pub Date : 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285730
Vassilis D. Papaefstathiou, D. Pnevmatikatos, M. Marazakis, Giorgos Kalokairinos, Aggelos D. Ioannou, Michael Papamichael, S. Kavadias, Giorgos Mihelogiannakis, M. Katevenis
Parallel computing systems are becoming widespread and growing in sophistication. Besides simulation, rapid system prototyping is becoming important in designing and evaluating their architecture. We present an efficient FPGA-based platform that we developed and use for research and experimentation on high-speed interprocessor communication, network interfaces and interconnects. Our platform supports advanced communication capabilities such as remote DMA, remote queues, zero-copy data delivery and flexible notification mechanisms, as well as link bundling for increased performance. We report on the platform architecture, its design cost, complexity and performance (latency and throughput). We also report our experiences from implementing benchmarking kernels and a user-level benchmark application, and show how software can take advantage of the provided features, while also exposing the weaknesses of the system.
Title : Prototyping Efficient Interprocessor Communication Mechanisms
Venue : 2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation
Pub Date : 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285748
Miquel Moretó, F. Cazorla, Alex Ramírez, M. Valero
General-purpose architectures are designed to offer high average performance regardless of the particular application being run. As a consequence, performance and power inefficiencies appear for some programs. Reconfigurable hardware (cache hierarchy, branch predictor, execution units, bandwidth, etc.) has been proposed to overcome these inefficiencies by dynamically adapting the architecture to the application's needs. However, nearly all the proposals use indirect measures or heuristics of performance to decide new configurations, which may lead to inefficiencies. In this paper we propose a runtime mechanism that predicts the throughput of an application on an architecture with a reconfigurable L2 cache. The L2 cache size varies at way granularity, and we predict the performance of the same application on all other L2 cache sizes at the same time. Across the different L2 cache sizes we obtain an average error of 3.11%, a maximum error of 16.4% and a standard deviation of 3.7%. The mechanism needs no profiling or operating system participation. We also give a hardware implementation that reduces the hardware cost to under 0.4% of the total L2 size while maintaining high accuracy. This mechanism can be used to reduce power consumption in single-threaded architectures and to improve performance in multithreaded architectures that dynamically partition shared L2 caches.
Title : Online Prediction of Applications Cache Utility
Venue : 2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation
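The abstract does not describe how the prediction is made. As a related illustration only, the sketch below uses LRU stack-distance counting, a well-known way to obtain the hit count of an LRU set for every number of ways from a single pass over the accesses. It estimates hits, not throughput, and the trace, tags and function names are hypothetical.

```python
from typing import List, Sequence


def stack_distance_histogram(trace: Sequence[int], max_ways: int) -> List[int]:
    """Count hits per LRU stack depth for the accesses to one cache set.

    hist[d] (d < max_ways) counts accesses found at depth d (0 = most recently
    used); hist[max_ways] counts accesses that miss even with max_ways ways.
    """
    stack: List[int] = []  # tags ordered from most to least recently used
    hist = [0] * (max_ways + 1)
    for tag in trace:
        if tag in stack:
            depth = stack.index(tag)
            stack.pop(depth)
            hist[depth] += 1
        else:
            hist[max_ways] += 1
            if len(stack) == max_ways:
                stack.pop()  # evict the least recently used tag
        stack.insert(0, tag)
    return hist


def hits_for_each_size(hist: List[int]) -> List[int]:
    """hits[w - 1] is the number of hits an LRU set with w ways would see."""
    hits, running = [], 0
    for depth in range(len(hist) - 1):
        running += hist[depth]
        hits.append(running)
    return hits


# Made-up tag trace for a single set, for illustration only.
trace = [1, 2, 3, 1, 2, 3, 1, 4, 1]
hist = stack_distance_histogram(trace, max_ways=4)
print(hits_for_each_size(hist))
```

Because LRU has the inclusion property (the contents of a w-way set are always a subset of those of a (w+1)-way set), an access that hits at depth d hits in every configuration with more than d ways, which is why one histogram covers all sizes at once.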
Pub Date : 2007-07-16 DOI: 10.1109/ICSAMOS.2007.4285752
Sascha Mühlbach, S. Wallner
The protection of chip-level microcomputer bus systems in embedded devices is essential to counter the growing number of hardware hacking attacks. This paper presents an authenticated key exchange and encryption solution that secures chip-level microcomputer bus systems via the tree parity machine rekeying architecture (TPMRA). To this end, a scalable TPMRA IP-core is designed and implemented to meet variable bus performance requirements. It provides both the authentication of the bus participants and the encryption of chip-to-chip buses from a single primitive. The solution is transparent and easily applicable to an arbitrary microcomputer bus system for embedded devices on the market. A proof-of-concept implementation shows the applicability of the TPMRA in the standardized advanced microprocessor bus architecture (AMBA) by integrating the IP-core into the peripheral bus-to-bus interface (AHB-APB bridge). It will be shown that the solution is latency-free and can be used to protect the ARM bus system with a low hardware overhead while supporting all AMBA bus features.
Title : Secure and Authenticated Communication in Chip-Level Microcomputer Bus Systems with Tree Parity Machines
Venue : 2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation
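For readers unfamiliar with tree parity machines, the following software sketch shows the textbook mutual-learning scheme in which two machines with K hidden units, N inputs per unit and integer weights in [-L, L] converge to identical weights by exchanging only their output bits. It is not the TPMRA IP-core: the parameters `k`, `n`, `l`, the seed, the step cap and all names are assumptions, and authentication and bus encryption are left out.

```python
import random
from typing import List, Tuple


class TreeParityMachine:
    """Textbook tree parity machine with bounded integer weights."""

    def __init__(self, k: int, n: int, l: int, rng: random.Random) -> None:
        self.k, self.n, self.l = k, n, l
        self.w = [[rng.randint(-l, l) for _ in range(n)] for _ in range(k)]
        self.sigma = [1] * k
        self.tau = 1

    def output(self, x: List[List[int]]) -> int:
        """Hidden unit outputs are signs of local fields; tau is their product."""
        self.sigma = []
        for weights, inputs in zip(self.w, x):
            field = sum(wi * xi for wi, xi in zip(weights, inputs))
            self.sigma.append(1 if field > 0 else -1)  # a zero field maps to -1
        self.tau = 1
        for s in self.sigma:
            self.tau *= s
        return self.tau

    def update(self, x: List[List[int]]) -> None:
        """Hebbian step: only hidden units that agree with tau move, clipped to [-l, l]."""
        for i in range(self.k):
            if self.sigma[i] == self.tau:
                for j in range(self.n):
                    moved = self.w[i][j] + self.tau * x[i][j]
                    self.w[i][j] = max(-self.l, min(self.l, moved))


def synchronize(k: int = 3, n: int = 4, l: int = 3, seed: int = 0,
                max_steps: int = 100_000) -> Tuple[TreeParityMachine, TreeParityMachine, int]:
    """Feed both machines the same public random inputs until their weights match."""
    rng = random.Random(seed)
    a = TreeParityMachine(k, n, l, rng)
    b = TreeParityMachine(k, n, l, rng)
    steps = 0
    while a.w != b.w and steps < max_steps:
        x = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
        if a.output(x) == b.output(x):  # learn only when the exchanged bits agree
            a.update(x)
            b.update(x)
        steps += 1
    return a, b, steps


a, b, steps = synchronize(seed=1)
print(steps, a.w == b.w)
```

Once the weights are equal they stay equal, since both machines see the same inputs and apply the same update, so the shared weights can then serve as key material.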