Pub Date: 2010-06-13 DOI: 10.1109/SASP.2010.5521140
T. Marconi, J. Hur, K. Bertels, G. Gaydadjiev
Long reconfiguration times form a major bottleneck in dynamically reconfigurable systems. Many approaches have been proposed to address this problem. However, improvements in the configuration circuit that introduces this overhead are usually not considered. The high reconfiguration times are due to the large number of configuration bits sent through a constrained data path. To alleviate this, we propose a novel FPGA configuration circuit architecture to speed up bitstream (re)configuration and relocation. Experimental results using the MCNC benchmark set indicate that our proposal reconfigures 4 times faster and relocates 19.8 times more efficiently than state-of-the-art approaches. This is achieved by transporting only the data required for the configuration in flight and by avoiding external communication while relocating. Moreover, the configuration bitstream sizes of the evaluated benchmarks are reduced by 65% on average. In addition, our proposal introduces negligible hardware and communication protocol overheads.
Title: A novel configuration circuit architecture to speedup reconfiguration and relocation for partially reconfigurable devices. Published in: 2010 IEEE 8th Symposium on Application Specific Processors (SASP).
Pub Date: 2010-06-13 DOI: 10.1109/SASP.2010.5521149
A. El-Rayis, T. Arslan, A. Erdogan
The correlation process in direct sequence spread spectrum (DSSS) communication systems is key to successful signal reception. Implementing real-time correlation in digital signal processors is one of the key challenges in realizing positioning systems today; as a result, most realizations are based on either application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs). In this work we introduce a new correlation engine targeting performance-critical Global Positioning System (GPS) based positioning. The processor is based on the Reconfigurable Instruction Cell Array (RICA) paradigm. GPS was chosen due to its extensive integration in handheld devices (e.g. mobile phones) together with rising energy consumption concerns. We have designed, programmed and implemented several time-domain correlator engines based on the RICA architectural paradigm. Various optimization techniques were implemented to adapt the processor to the correlation algorithm and to achieve the best performance. 12- and 24-channel correlators are tested using the new processor architecture.
Title: A processing engine for GPS correlation. Published in: 2010 IEEE 8th Symposium on Application Specific Processors (SASP).
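To make the idea of time-domain correlation in a DSSS receiver concrete, here is a minimal software sketch. It is not the RICA implementation from the paper: the spreading code is a seeded random ±1 sequence standing in for a real GPS code, and the names `make_code` and `circular_correlate` are hypothetical. The correlator multiplies the received samples by a local replica at every code phase and picks the phase with the largest sum.

```python
import random


def make_code(length, seed):
    """Build a pseudo-random +/-1 spreading code (a stand-in, not a real GPS code)."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(length)]


def circular_correlate(received, replica):
    """Return the time-domain circular correlation for every candidate code phase."""
    n = len(replica)
    return [
        sum(received[(i + shift) % n] * replica[i] for i in range(n))
        for shift in range(n)
    ]


# Simulate a noise-free received signal: the code delayed by an unknown phase.
code = make_code(127, seed=1)
true_shift = 40
received = [code[(i - true_shift) % len(code)] for i in range(len(code))]

# One correlator channel: search all code phases and keep the strongest peak.
corr = circular_correlate(received, code)
estimated_shift = max(range(len(corr)), key=lambda s: corr[s])
print(estimated_shift, corr[estimated_shift])
```

In a multi-channel receiver, each channel would run a search of this kind against its own code; the quadratic cost of the inner sums is what makes real-time correlation demanding.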
Pub Date: 2010-06-13 DOI: 10.1109/SASP.2010.5521151
D. Novo, A. Kritikakou, P. Raghavan, L. Perre, J. Huisken, F. Catthoor
Many signal processing applications demand highly energy-efficient, flexible implementations. In this paper, we propose a novel Domain Specific Instruction-set Processor (DSIP) architecture template which is tuned for deployment in the targeted domain of on-line surveillance. The architecture, when implemented using a 40-nm CMOS standard cell library, executes a representative test vehicle with an energy efficiency of nearly 900 MOPS/mW including instruction and data memories. This is about 20 times higher than a state-of-the-art low-power DSP architecture and less than a factor of 2 below a heavily optimized ASIC realization for the same application benchmark.
Title: Ultra low energy Domain Specific Instruction-set Processor for on-line surveillance. Published in: 2010 IEEE 8th Symposium on Application Specific Processors (SASP).
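As a rough unit conversion of the figures quoted in the abstract (a back-of-the-envelope reading, not a result taken from the paper), an efficiency in MOPS/mW can be restated as energy per operation. The DSP figure below is only inferred from the "about 20 times" ratio and is therefore approximate.

```latex
\[
900\ \frac{\text{MOPS}}{\text{mW}}
  = \frac{900 \times 10^{6}\ \text{op/s}}{10^{-3}\ \text{W}}
  = 9 \times 10^{11}\ \frac{\text{op}}{\text{J}}
  \quad\Longrightarrow\quad
  E_{\text{op}} \approx \frac{1}{9 \times 10^{11}}\ \text{J} \approx 1.1\ \text{pJ}
\]
\[
\text{DSP baseline (inferred): } \frac{900}{20} = 45\ \frac{\text{MOPS}}{\text{mW}}
  \quad\Longrightarrow\quad
  E_{\text{op}} \approx \frac{1}{4.5 \times 10^{10}}\ \text{J} \approx 22\ \text{pJ}
\]
```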
Pub Date: 2010-06-13 DOI: 10.1109/SASP.2010.5521153
R. Lee, Yu-Yuan Chen
Software AES cipher performance is not fast enough for encryption to be incorporated ubiquitously for all computing needs. Furthermore, fast software implementations of AES that use table lookups are susceptible to software cache-based side-channel attacks, leaking the secret encryption key. To bridge the gap between software and hardware AES implementations, several Instruction Set Architecture (ISA) extensions have been proposed to provide speedup for software AES programs, most notably the recent introduction of six AES-specific instructions for Intel microprocessors. However, algorithm-specific instructions are less desirable than general-purpose ones for microprocessors. In this paper, we propose an enhanced parallel table lookup instruction that can achieve the fastest reported software AES encryption and decryption of 1.38 cycles/byte for general-purpose microprocessors, a 1.45X speedup over the fastest prior work reported. Security is also improved: cache-based side-channel attacks are thwarted, since all table lookups take the same amount of time. Furthermore, the new instructions can also be used to accelerate any functions that can be accelerated through table lookup operations of one or multiple small tables.
Title: Processor accelerator for AES. Published in: 2010 IEEE 8th Symposium on Application Specific Processors (SASP).
Pub Date: 2010-06-13 DOI: 10.1109/SASP.2010.5521155
Ji Kong, Peilin Liu, Xianmin Chen, Jin Wang, Xingguang Pan, Jun Wang, He-D. Xiao, Zhenqi Wei, R. Ying
For next-generation audio applications, the dominant trends are much higher sample rates, larger word lengths and more audio channels for playback audio data. Traditional DSPs or embedded processors are inefficient for such applications because of their non-specific or limited computing capabilities as well as their on-chip memory architectures. In this paper, an embedded audio processor aimed at next-generation audio applications is proposed. The audio-specific instruction set architecture is based on an analysis of the requirements for next-generation audio processing. In addition, a novel tightly coupled audio memory is proposed to support extremely high audio data throughput and flexible audio data transfers with main memories. To evaluate the performance of the proposed audio processor, a set of benchmarks based on the analysis of next-generation audio applications is used. The implementation and evaluation results lead to the conclusion that the proposed audio processor is of outstanding efficiency and cost-effectiveness for next-generation audio applications.
Title: Next-generation consumer audio application specific embedded processor. Published in: 2010 IEEE 8th Symposium on Application Specific Processors (SASP).
Pub Date: 2010-06-13 DOI: 10.1109/SASP.2010.5521139
J. Hu, C. Xue, Wei-Che Tseng, Qingfeng Zhuge, E. Sha
Non-volatile memories, such as flash memory, Phase Change Memory (PCM), and Magnetic Random Access Memory (MRAM), have many desirable characteristics for embedded DSP systems to employ them as main memory. These characteristics include low cost, shock resistance, non-volatility, power economy and high density. However, there are two common challenges we need to answer before we can apply non-volatile memory as main memory practically. First, non-volatile memory has limited write/erase cycles compared to DRAM. Second, a write operation is slower than a read operation on non-volatile memory. These two challenges can be answered by reducing the number of write activities on non-volatile main memory. In this paper, we propose two optimization techniques, write-aware scheduling and recomputation, to minimize write activities on non-volatile memory. With the proposed techniques, we can both speed up the completion time of programs and extend non-volatile memory's lifetime. The experimental results show that the proposed techniques can reduce the number of write activities on non-volatile memory by 55.71% on average. Thus, the lifetime of non-volatile memory is extended to 2.5 times as long as before on average. The completion time of programs can be reduced by 55.32% on systems with NOR flash memory and by 40.69% on systems with NAND flash memory on average.
Title: Minimizing write activities to non-volatile memory via scheduling and recomputation. Published in: 2010 IEEE 8th Symposium on Application Specific Processors (SASP).
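The abstract does not describe the algorithms themselves, so the following is only an illustrative sketch of the general recomputation trade-off, not the authors' method. It assumes a hypothetical cost model in which each spilled intermediate value can either be written to non-volatile memory and read back later, or discarded and recomputed; the names `Intermediate` and `plan_spills` and all cycle counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Intermediate:
    name: str
    recompute_cycles: int  # hypothetical cost to regenerate the value when needed


def plan_spills(intermediates, write_cycles, read_cycles):
    """Choose, per value, between storing to NVM and recomputing (assumed cost model)."""
    plan, writes, cycles = {}, 0, 0
    store_cost = write_cycles + read_cycles  # a stored value is written once, read once
    for item in intermediates:
        if item.recompute_cycles < store_cost:
            plan[item.name] = "recompute"  # avoids one NVM write entirely
            cycles += item.recompute_cycles
        else:
            plan[item.name] = "store"
            writes += 1
            cycles += store_cost
    return plan, writes, cycles


values = [Intermediate("a", 5), Intermediate("b", 200), Intermediate("c", 30)]
plan, writes, cycles = plan_spills(values, write_cycles=100, read_cycles=10)

# Baseline for comparison: every intermediate is stored to NVM.
baseline_writes = len(values)
baseline_cycles = len(values) * (100 + 10)
print(plan, writes, cycles, baseline_writes, baseline_cycles)
```

Under these made-up numbers, recomputing the two cheap values removes two of three writes and shortens the total time, which mirrors the abstract's point that slow writes make fewer writes beneficial for both lifetime and completion time.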
Pub Date: 2010-06-13 DOI: 10.1109/SASP.2010.5521136
Jelena Trajkovic, D. Gajski
Application-specific processors (ASPs) are increasingly being adopted for optimized implementation of embedded systems. ASP design automation tools are, therefore, critical for meeting the time-to-market goals for ASP-based embedded systems. This paper targets the problem of determining the optimal data path pipeline configuration from a given application C code. We propose a technique for automatically estimating the application execution time on an ASP for various data path pipeline configurations based on estimated clock cycle length and estimated number of cycles. In addition, we compute the cost of each pipelined design, thereby characterizing the ASP by its performance and cost. Our estimation enables fast, accurate and early analysis of trade-offs between different data path pipeline configurations, without the need for creating either a prototype or a cycle-accurate model of the ASP. Our experimental results, based on industrial applications, demonstrate high fidelity for the performance estimation.
Title: Early performance-cost estimation of application-specific data path pipelining. Published in: 2010 IEEE 8th Symposium on Application Specific Processors (SASP).
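The estimate described in the abstract combines an estimated clock cycle length with an estimated cycle count, and pairs the result with a cost figure. A minimal sketch of that arithmetic follows; the configurations, their numbers, and the names `PipelineConfig` and `pareto_front` are hypothetical, and how the paper actually derives its estimates is not stated in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    name: str
    clock_ns: float  # estimated clock cycle length
    cycles: int      # estimated number of cycles for the application
    cost: float      # estimated cost of the design (arbitrary units)

    @property
    def exec_time_ns(self) -> float:
        # Execution time = clock cycle length x number of cycles.
        return self.clock_ns * self.cycles


def pareto_front(configs):
    """Keep configurations that no other configuration beats on both time and cost."""
    front = []
    for c in configs:
        dominated = any(
            o.exec_time_ns <= c.exec_time_ns and o.cost <= c.cost
            and (o.exec_time_ns < c.exec_time_ns or o.cost < c.cost)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front


# Made-up numbers: deeper pipelines shorten the clock but add stall cycles and cost.
configs = [
    PipelineConfig("no pipelining", clock_ns=10.0, cycles=1000, cost=100),
    PipelineConfig("2 stages", clock_ns=6.0, cycles=1300, cost=120),
    PipelineConfig("4 stages", clock_ns=4.0, cycles=2100, cost=150),
]
for c in pareto_front(configs):
    print(c.name, c.exec_time_ns, c.cost)
```

With these illustrative values the 4-stage design is both slower and more expensive than the 2-stage one, so only the first two configurations remain as trade-off candidates.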
Pub Date: 2010-06-13 DOI: 10.1109/SASP.2010.5521142
Nicholas Moore, M. Leeser, L. King
Graphics processing units (GPUs) offer significantly higher peak performance than CPUs, but for a limited problem space. Even within this space, GPU solutions are often restricted to a set of specific problem instances or offer greatly varying performance for slightly different parameters. This makes providing a library of GPU implementations that is adaptable to arbitrary inputs a difficult task. This research is motivated by a MATLAB lung tumor tracking application that relies on two-dimensional correlation and uses large template sizes. While GPU-based template matching has been addressed in the past, template sizes were limited to specific, relatively small sizes and not acceptable for accelerating the target application. This paper discusses a CUDA implementation that supports large template sizes and is adaptable to arbitrary template dimensions. The implementation uses on-demand compilation of kernels and compile-time expansion of various kernel parameters to improve the implementation adaptability without sacrificing performance.
Title: Efficient template matching with variable size templates in CUDA. Published in: 2010 IEEE 8th Symposium on Application Specific Processors (SASP).
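For readers unfamiliar with template matching, the sketch below is a plain CPU reference that accepts arbitrary template dimensions. It is not the CUDA implementation from the paper, and it scores candidates with a sum of squared differences rather than the two-dimensional correlation the abstract mentions; the function name `match_template` is hypothetical.

```python
def match_template(image, template):
    """Slide the template over the image; return the best (row, col) and its SSD score."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_pos, best_score = None, None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            # Sum of squared differences over the whole template window.
            score = sum(
                (image[r + dr][c + dc] - template[dr][dc]) ** 2
                for dr in range(th)
                for dc in range(tw)
            )
            if best_score is None or score < best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score


image = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 3, 0],
    [0, 4, 5, 6, 0],
    [0, 0, 0, 0, 0],
]
template = [
    [1, 2, 3],
    [4, 5, 6],
]  # a non-square 2x3 template
position, score = match_template(image, template)
print(position, score)
```

The four nested loops show why the work grows with both image and template size, and why fixing the template dimensions when a GPU kernel is compiled, as the abstract describes, can help performance.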
Pub Date: 2010-06-13 DOI: 10.1109/SASP.2010.5521147
Naotaka Maruyama, T. Ishihara, H. Yasuura
Many functions of embedded systems are implemented in software to deal flexibly with frequent upgrades and to quickly fix unpredictable bugs in applications. This system architecture is, however, generally less energy efficient than one implemented in application-specific hardware. As a remedy for this issue, this paper proposes a processor-based platform having an RTOS in hardware for energy-efficient and flexible TCP/IP processing. Unlike application-specific hardware, implementing the RTOS in hardware does not sacrifice the flexibility of the applications, while the energy efficiency is comparable to that of application-specific hardware. Experiments with an actual TCP/IP application demonstrate that our approach achieves a 7 times improvement in energy efficiency over an existing commercial firmware RTOS.
Title: An RTOS in hardware for energy efficient software-based TCP/IP processing. Published in: 2010 IEEE 8th Symposium on Application Specific Processors (SASP).
Pub Date: 2010-06-13 DOI: 10.1109/SASP.2010.5521135
I-Wei Wu, J. Shann, C. Chung
Next-generation digital entertainment and mobile communication devices are driving the demand for high-performance processing solutions. To meet this demand, multiple-issue processors, such as very long instruction word (VLIW) architectures, augmented with a reconfigurable hardware accelerator have been proposed in many papers. The reconfigurable hardware accelerator is usually realized by multiple functional units (FUs) organized in a matrix fashion, called a reconfigurable customized functional unit (RCFU). Since a multiple-issue processor can execute several data-independent operations simultaneously, executing operations concurrently on both the RCFU and the FUs of the base processor is reasonable and also improves hardware resource utilization and execution performance. Based on this observation, we propose an RCFU generation algorithm and an RCFU exploitation algorithm in this paper. In our experiments, a further 43% improvement in execution performance is achieved on average compared with previous work.
Title: Reconfigurable custom functional unit generation and exploitation in multiple-issue processors. Published in: 2010 IEEE 8th Symposium on Application Specific Processors (SASP).