As systems-on-a-chip (SoCs) grow larger, interconnecting their subsystems becomes more complicated. In this context, alternatives to standard buses, based on network technologies, have emerged as an innovative approach to future SoC interconnect. One of the main advantages of such an alternative is that it can offer a certain quality of service (QoS) over the internal cross-connects while supporting higher transfer rates than existing on-chip buses. This paper presents a chip interconnection architecture based on a buffered crossbar switch. The main advantage of the proposed system is that it efficiently supports different priority levels; it also provides several gigabits per second of aggregate bandwidth while introducing very low latency. Moreover, its hardware complexity is minimal. These properties make the architecture well suited to SoCs that contain IP cores with diverse speed and throughput requirements.
{"title":"A Buffered Crossbar-Based Chip Interconnection Architecture Supporting Quality of Service","authors":"Georgios Kornaros, Y. Papaefstathiou","doi":"10.1109/SPL.2007.371723","DOIUrl":"https://doi.org/10.1109/SPL.2007.371723","journal":{"name":"2007 3rd Southern Conference on Programmable Logic"},"publicationDate":"2007-06-18"}
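The abstract above does not describe the scheduler itself, so the following is only a minimal sketch of how a buffered crossbar can serve several priority levels. It assumes one FIFO per crosspoint and priority level, strict priority (0 is highest), and round-robin among inputs. The class name `BufferedCrossbar` and its methods are hypothetical and are not taken from the paper.

```python
from collections import deque


class BufferedCrossbar:
    """Toy model: one FIFO per (input, output, priority) crosspoint."""

    def __init__(self, ports, priorities):
        self.ports = ports
        self.priorities = priorities
        # queues[inp][out][prio] holds packets waiting at that crosspoint.
        self.queues = [[[deque() for _ in range(priorities)]
                        for _ in range(ports)] for _ in range(ports)]
        # Round-robin pointer per output (an assumed tie-breaking policy).
        self.next_input = [0] * ports

    def enqueue(self, inp, out, prio, payload):
        """Store a packet from input `inp` destined for output `out`."""
        self.queues[inp][out][prio].append(payload)

    def serve_output(self, out):
        """Return (input, priority, payload) for the next packet, or None."""
        for prio in range(self.priorities):          # strict priority order
            for offset in range(self.ports):         # round-robin over inputs
                inp = (self.next_input[out] + offset) % self.ports
                queue = self.queues[inp][out][prio]
                if queue:
                    self.next_input[out] = (inp + 1) % self.ports
                    return inp, prio, queue.popleft()
        return None
```

Because each output only inspects the buffers in its own column, the outputs can be served independently of one another, which is the usual argument for buffering at the crosspoints.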
V. Trujillo-Olaya, Jaime Velasco-Medina, J. C. López-Hernández
This article presents efficient hardware implementations of Gaussian normal basis multiplication over GF(2^163). Hardware implementations of GF(2^m) multiplication algorithms are suitable for designing elliptic curve cryptoprocessors, which allow elliptic-curve-based cryptosystems implemented in hardware to provide more physical security and higher performance than software implementations. The multipliers were designed using conventional, modified, and fast-parallel algorithms for Gaussian normal basis multiplication. Synthesis and simulation were carried out using Altera Quartus II, and the designs were synthesized on the EP2A15B724C7 device. The simulation results show that the multipliers achieve very good performance while using a small area.
{"title":"Efficient Hardware Implementations for the Gaussian Normal Basis Multiplication Over GF(2^163)","authors":"V. Trujillo-Olaya, Jaime Velasco-Medina, J. C. López-Hernández","doi":"10.1109/SPL.2007.371722","DOIUrl":"https://doi.org/10.1109/SPL.2007.371722","journal":{"name":"2007 3rd Southern Conference on Programmable Logic"},"publicationDate":"2007-06-18"}
Reconfigurable computing is gaining attention as an alternative to traditional processing for many applications. Data encryption and decryption is one such application, which can achieve tremendous speedup when run on FPGAs instead of microprocessors. We have developed a block-cipher library that covers 15 of the most popular encryption algorithms, and generated 35 bitstreams running on SGI's latest reconfigurable computer, the RASC RC-100. An end-to-end throughput of 1.136 GB/s has been demonstrated for almost all ciphers; it was limited only by the input/output interface rather than by the FPGA processing time. The library is written in Verilog HDL and can be easily ported to other reconfigurable computing platforms. It provides a means for cryptographers and computer scientists to program reconfigurable computers without detailed knowledge of hardware design.
{"title":"Development of Block-Cipher Library for Reconfigurable Computers","authors":"Miaoqing Huang, T. El-Ghazawi, B. Larson, K. Gaj","doi":"10.1109/SPL.2007.371747","DOIUrl":"https://doi.org/10.1109/SPL.2007.371747","journal":{"name":"2007 3rd Southern Conference on Programmable Logic"},"publicationDate":"2007-06-18"}
M. Gonzalez, M. Funes, R. Petrocelli, M. Benedetti
This work presents a modulator implemented in an FPGA for power matrix converters. It operates as a peripheral unit of a digitally controlled system, generating the duty cycles for each of the converter switches and providing safe commutation of the switching devices. The correct generation of duty cycles and sequences was verified, as well as the safe commutation of the bidirectional switches. The performance of the modulator in conjunction with the power stage and a DSP, which implements the high-level control layer, was also analyzed.
{"title":"FPGA Modulator for Matrix Converter","authors":"M. Gonzalez, M. Funes, R. Petrocelli, M. Benedetti","doi":"10.1109/SPL.2007.371751","DOIUrl":"https://doi.org/10.1109/SPL.2007.371751","journal":{"name":"2007 3rd Southern Conference on Programmable Logic"},"publicationDate":"2007-06-18"}
Advances in programmable hardware have led to new architectures in which the hardware can be dynamically adapted to the application to gain better performance. One of the problems in realizing dynamically reconfigurable systems is the allocation of dynamically reconfigurable modules: when a new module has to be configured in the system, a suitable free location must be found for it. In this work, a genetic algorithm has been developed to solve this allocation problem. Each chromosome represents a configuration status of the programmable device, and both the crossover and mutation processes try to change the previously found location for the new module in order to achieve a better fitness, which measures the quality of the final solution.
{"title":"A Genetic Algorithm Based Solution for Dynamically Reconfigurable Modules Allocation","authors":"V. Rana, C. Sandionigi, M. Santambrogio","doi":"10.1109/SPL.2007.371745","DOIUrl":"https://doi.org/10.1109/SPL.2007.371745","journal":{"name":"2007 3rd Southern Conference on Programmable Logic"},"publicationDate":"2007-06-18"}
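The genetic search described in the abstract above can be illustrated with a deliberately simplified sketch. It assumes a one-dimensional device made of columns, a chromosome that is just the start column of the new module, and a fitness that penalizes overlap and rewards snug placements. The encoding, the fitness function, and all parameter values are assumptions made for illustration, not the authors' formulation.

```python
import random


def fitness(occupied, start, width):
    """Higher is better: overlaps are penalized, snug placements rewarded."""
    overlap = sum(occupied[c] for c in range(start, start + width))
    if overlap:
        return -overlap
    # A placement is snug when it touches the device edge or a used column.
    left_snug = start == 0 or occupied[start - 1]
    right_snug = start + width == len(occupied) or occupied[start + width]
    return int(left_snug) + int(right_snug)


def allocate(occupied, width, generations=40, pop_size=12, seed=0):
    """Search for a start column for a module `width` columns wide.

    Assumes width <= len(occupied) and pop_size >= 4.
    """
    rng = random.Random(seed)
    last = len(occupied) - width                  # last legal start column
    population = [rng.randint(0, last) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda s: fitness(occupied, s, width), reverse=True)
        parents = population[: pop_size // 2]     # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a + b) // 2                  # crossover: blend two locations
            if rng.random() < 0.3:                # mutation: nudge the location
                child += rng.choice([-2, -1, 1, 2])
            children.append(min(max(child, 0), last))
        population = parents + children
    return max(population, key=lambda s: fitness(occupied, s, width))
```

A call such as `allocate([False] * 16, 4)` returns a start column between 0 and 12. On a partly occupied device the search is heuristic and is not guaranteed to find a free slot, so a caller should check that the fitness of the returned placement is non-negative before configuring the module.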
Although radix-10 arithmetic has been gaining renewed importance over the last few years, decimal systems are not yet efficient enough and techniques are still under development. In this paper, a modification of the CORDIC method for decimal arithmetic is proposed and applied to produce fast rotations. The algorithm uses BCD operands as inputs, combining the advantages of both decimal and binary systems. The result is a significant reduction in the number of iterations compared with the original decimal CORDIC method. Finally, an FPGA-based radix-10 architecture that can produce rotations with greater precision and speed is presented, together with experiments showing the advantages of the new method.
{"title":"A Fast Architecture for Radix 10 Coordinates Rotation","authors":"A. J. Morenilla, H. Mora, J. Romero, F.P. Lopez","doi":"10.1109/SPL.2007.371721","DOIUrl":"https://doi.org/10.1109/SPL.2007.371721","journal":{"name":"2007 3rd Southern Conference on Programmable Logic"},"publicationDate":"2007-06-18"}
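For readers unfamiliar with CORDIC, the sketch below shows the classic binary (radix-2) rotation mode in floating point. It is not the decimal variant proposed in the paper above, whose details the abstract does not give; it only illustrates the shift-and-add iteration that such methods build on. The function name and the iteration count are illustrative choices.

```python
import math


def cordic_rotate(x, y, angle, iterations=32):
    """Rotate (x, y) by `angle` radians using binary CORDIC micro-rotations.

    Converges for |angle| up to roughly 1.74 rad, the sum of all atan(2**-i).
    """
    # Each micro-rotation stretches the vector; this is the accumulated gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle                                   # residual angle still to rotate
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0             # rotate toward zero residual
        shift = 2.0 ** (-i)                     # a right shift in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)               # table lookup in hardware
    return x / gain, y / gain
```

Calling `cordic_rotate(1.0, 0.0, theta)` yields approximations of `cos(theta)` and `sin(theta)`. In a hardware implementation the multiplications by `shift` become wire shifts and the `atan` values come from a small lookup table.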
In this paper, we investigate the usability of several well-known real-time scheduling algorithms for a system consisting of a single processor core and multiple dynamically reconfigurable functional units, running a number of processes in parallel. A SystemC simulation model of a wireless sensor network node serves as a case study for assessing the performance of the different algorithms. Specific emphasis is given to the analysis of the configuration cache miss ratio and the number of context switches, which indicate costly operations of the reconfigurable units: reconfiguration and saving of internal state, respectively.
{"title":"Comparative Analysis of Multitask Scheduling Algorithms for Reconfigurable Computing Regarding Context Switches and Configuration Cache Usage","authors":"Christopher Spies, L. Indrusiak, M. Glesner","doi":"10.1109/SPL.2007.371758","DOIUrl":"https://doi.org/10.1109/SPL.2007.371758","journal":{"name":"2007 3rd Southern Conference on Programmable Logic"},"publicationDate":"2007-06-18"}
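The two metrics named in the abstract above can be made concrete with a small sketch. It assumes a configuration cache with least-recently-used replacement and counts a context switch whenever two consecutive schedule slots hold different tasks. Both the replacement policy and the function names are assumptions for illustration; the paper's SystemC model may define them differently.

```python
from collections import OrderedDict


def config_cache_miss_ratio(requests, cache_slots):
    """Fraction of configuration requests that miss an LRU cache."""
    if not requests:
        return 0.0
    cache = OrderedDict()                 # keys ordered from oldest to newest use
    misses = 0
    for config in requests:
        if config in cache:
            cache.move_to_end(config)     # hit: mark as most recently used
        else:
            misses += 1                   # miss: a reconfiguration is needed
            if len(cache) >= cache_slots:
                cache.popitem(last=False) # evict the least recently used entry
            cache[config] = True
    return misses / len(requests)


def context_switches(schedule):
    """Count the points where the running task changes between slots."""
    return sum(1 for prev, cur in zip(schedule, schedule[1:]) if prev != cur)
```

Feeding the same request trace to `config_cache_miss_ratio` with different cache sizes, or the schedules produced by different algorithms to `context_switches`, gives the kind of side-by-side comparison the abstract describes.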
Current work on automatic task partitioning and scheduling for reconfigurable computing (RC) systems addresses only the FPGA hardware and does not take advantage of the synergy between the microprocessor and the FPGA. Partitioning between the microprocessor and the FPGA remains manual and laborious, as a formal methodology for automatic hardware-software partitioning has not been established. Related fields such as heterogeneous computing (HC) and embedded computing (EC) have an extensive body of work on scheduling for heterogeneous processors. Unlike the HC scheduling algorithms, the EC algorithms take into account the differences in the computational capabilities of each processing element. In this work, we adapt EC scheduling algorithms for RC systems and show that simply adapting the algorithms is not sufficient to take advantage of the reconfigurable hardware. We introduce new heuristic algorithms based on EC scheduling algorithms and show that they provide up to an order of magnitude improvement in scheduling and execution times.
{"title":"Extending Embedded Computing Scheduling Algorithms for Reconfigurable Computing Systems","authors":"P. Saha, T. El-Ghazawi","doi":"10.1109/SPL.2007.371729","DOIUrl":"https://doi.org/10.1109/SPL.2007.371729","journal":{"name":"2007 3rd Southern Conference on Programmable Logic"},"publicationDate":"2007-06-18"}
Estimating the parameters of nuclear pulses is needed in many nuclear applications. The precision and performance requirements are very demanding, especially in PET applications, where image quality depends on the energy and time resolution of gamma pulse detection. Neural network estimators were compared with common methods. Two-layer feed-forward networks with three neurons in the hidden layer reached the precision goal. The chosen estimators allowed the use of a 40 MHz free-running ADC, obtaining a precision of 1 ns in timestamp determination, which exceeds coincidence detection requirements. An efficient VHDL implementation on an inexpensive Xilinx Spartan-3 FPGA was achieved that fulfills the performance requirements, adding no dead time due to digital processing. The estimators and their FPGA implementations were verified on hardware, and characterization was done using nuclear-shaped pulses synthesized with an arbitrary function generator.
{"title":"FPGA Neural Networks Implementation for Nuclear Pulses Parameters Estimation","authors":"D. Estryk, G. E. Ríos, C. Verrastro","doi":"10.1109/SPL.2007.371716","DOIUrl":"https://doi.org/10.1109/SPL.2007.371716","journal":{"name":"2007 3rd Southern Conference on Programmable Logic"},"publicationDate":"2007-06-18"}
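To put the numbers in the abstract above in perspective: a 40 MHz ADC delivers one sample every 25 ns, so a 1 ns timestamp precision means resolving the pulse position to 1/25 of a sample period. The sketch below shows the forward pass of a two-layer feed-forward network with three hidden neurons. The tanh activation, the linear output, and all weights are assumptions for illustration; the abstract does not state the inputs, activations, or trained parameters.

```python
import math

# 40 MHz sampling rate -> period in nanoseconds (1000 ns per microsecond / 40).
SAMPLE_PERIOD_NS = 1e3 / 40.0


def forward(samples, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: tanh hidden layer (one weight row per neuron), linear output."""
    hidden = [
        math.tanh(sum(w * s for w, s in zip(weights, samples)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out
```

With three hidden neurons, `w_hidden` has three rows, each as long as the sample window. The small number of multiply-accumulate operations is what makes such a network plausible to evaluate between ADC samples on a low-cost FPGA.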
T. Silva, J. Vortmann, Luciano Agostini, S. Bampi, A. Susin
This paper presents the design of a hardware architecture for the entropy coder of the H.264/AVC video compression standard, considering the baseline profile. The baseline entropy coder is composed of two main blocks, the Exp-Golomb coder and the CAVLC coder, and this paper presents the architectural design of both. The architectures were described in VHDL and synthesized for an Altera Stratix II FPGA. The synthesis results show a throughput of 15.9 million samples per second for the Exp-Golomb coder and 103.8 million samples per second for the CAVLC coder. The H.264/AVC baseline entropy coder is being designed through the integration of these two coders, and preliminary results indicate that this solution will be able to process HDTV frames in real time.
{"title":"FPGA Based Design of CAVLC and Exp-Golomb Coders for H.264/AVC Baseline Entropy Coding","authors":"T. Silva, J. Vortmann, Luciano Agostini, S. Bampi, A. Susin","doi":"10.1109/SPL.2007.371741","DOIUrl":"https://doi.org/10.1109/SPL.2007.371741","journal":{"name":"2007 3rd Southern Conference on Programmable Logic"},"publicationDate":"2007-06-18"}
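The Exp-Golomb codes mentioned in the abstract above are simple enough to show in full. The sketch below is a software model of the unsigned and signed order-0 Exp-Golomb codes used for H.264/AVC syntax elements, producing bit strings for readability. It is not the authors' hardware architecture, and the function names are illustrative.

```python
def ue(code_num):
    """Unsigned Exp-Golomb code word for a non-negative integer."""
    if code_num < 0:
        raise ValueError("ue(v) is defined for non-negative integers only")
    binary = bin(code_num + 1)[2:]             # binary form of code_num + 1
    return "0" * (len(binary) - 1) + binary    # zero prefix, then the value bits


def se(value):
    """Signed Exp-Golomb code word: map to a code number, then apply ue."""
    code_num = 2 * value - 1 if value > 0 else -2 * value
    return ue(code_num)


def decode_ue(bits):
    """Decode one ue(v) word from the front of `bits`.

    Returns (value, number of bits consumed). Assumes `bits` starts with a
    complete, valid code word.
    """
    zeros = 0
    while bits[zeros] == "0":                  # count the zero prefix
        zeros += 1
    length = 2 * zeros + 1                     # prefix + marker bit + suffix
    return int(bits[zeros:length], 2) - 1, length
```

For example, `ue(3)` gives `"00100"` and `se(-1)` gives `"011"`. Code word length grows logarithmically with the value, which is why small, frequent syntax elements are cheap to transmit.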