{"title":"基于制造后可重构架构平台的自动指令融合提高嵌入式系统效率","authors":"A. Cheng","doi":"10.1109/ISQED.2008.85","DOIUrl":null,"url":null,"abstract":"Portable embedded SoC processor architects are constantly challenged by exponentially increasing demand for newer functionality, faster real-time communication, stronger security, and higher reliability; while the constraint on energy, feature size, NRE cost, and time-to-market (TTM) grows tighter than ever. Existing approaches attempting to achieve these mutual conflicting design goals rely heavily on adopting special-purpose accelerators (SPA) to take charge of the heavy lifting in the aimed embedded SoC designs. These SPAs, synthesized from either ASIC or FPGA, are usually augmented to the base processor as co-processors to execute the performance-critical regions of applications. ASIC-based SPAs achieve performance-energy efficiency at the expense of sacrificing post-manufacturing programmability while incurring large NRE and TTM; FPGA-based SPAs retain programmability at the expense of significant energy and area increase. Furthermore, augmenting these SPAs as co-processors adds considerable communication and synchronization overhead severely compromising their initially promised benefits. This paper proposes an innovative design paradigm that moves away from the common scheme of adding co-processing ASIC/FPGA SPAs to an integrated and reconfigurable design. Specifically, we propose a new class of embedded processor by replacing the processor's conventional ALU with a more powerful and flexible versatile processing unit (VPU). VPU enables multiple interdependent instructions to be fused and processed together as a single atomic VPU instruction by exploring dataflow dependencies of the application code. The instruction fusion is automatically performed by a VPU-aware compiler. The optimized VPU code reduces code size and amplifies the effective processor bandwidth and capacity by eliminating transient computation and register spill code. Experimental results show up to 400% and average 150% speedup for MediaBench with negligible area increase.","PeriodicalId":243121,"journal":{"name":"9th International Symposium on Quality Electronic Design (isqed 2008)","volume":"179 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Amplifying Embedded System Efficiency via Automatic Instruction Fusion on a Post-Manufacturing Reconfigurable Architecture Platform\",\"authors\":\"A. Cheng\",\"doi\":\"10.1109/ISQED.2008.85\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Portable embedded SoC processor architects are constantly challenged by exponentially increasing demand for newer functionality, faster real-time communication, stronger security, and higher reliability; while the constraint on energy, feature size, NRE cost, and time-to-market (TTM) grows tighter than ever. Existing approaches attempting to achieve these mutual conflicting design goals rely heavily on adopting special-purpose accelerators (SPA) to take charge of the heavy lifting in the aimed embedded SoC designs. These SPAs, synthesized from either ASIC or FPGA, are usually augmented to the base processor as co-processors to execute the performance-critical regions of applications. ASIC-based SPAs achieve performance-energy efficiency at the expense of sacrificing post-manufacturing programmability while incurring large NRE and TTM; FPGA-based SPAs retain programmability at the expense of significant energy and area increase. Furthermore, augmenting these SPAs as co-processors adds considerable communication and synchronization overhead severely compromising their initially promised benefits. This paper proposes an innovative design paradigm that moves away from the common scheme of adding co-processing ASIC/FPGA SPAs to an integrated and reconfigurable design. Specifically, we propose a new class of embedded processor by replacing the processor's conventional ALU with a more powerful and flexible versatile processing unit (VPU). VPU enables multiple interdependent instructions to be fused and processed together as a single atomic VPU instruction by exploring dataflow dependencies of the application code. The instruction fusion is automatically performed by a VPU-aware compiler. The optimized VPU code reduces code size and amplifies the effective processor bandwidth and capacity by eliminating transient computation and register spill code. Experimental results show up to 400% and average 150% speedup for MediaBench with negligible area increase.\",\"PeriodicalId\":243121,\"journal\":{\"name\":\"9th International Symposium on Quality Electronic Design (isqed 2008)\",\"volume\":\"179 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2008-03-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"9th International Symposium on Quality Electronic Design (isqed 2008)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISQED.2008.85\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"9th International Symposium on Quality Electronic Design (isqed 2008)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISQED.2008.85","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Amplifying Embedded System Efficiency via Automatic Instruction Fusion on a Post-Manufacturing Reconfigurable Architecture Platform
Portable embedded SoC processor architects are constantly challenged by exponentially increasing demand for newer functionality, faster real-time communication, stronger security, and higher reliability; while the constraint on energy, feature size, NRE cost, and time-to-market (TTM) grows tighter than ever. Existing approaches attempting to achieve these mutual conflicting design goals rely heavily on adopting special-purpose accelerators (SPA) to take charge of the heavy lifting in the aimed embedded SoC designs. These SPAs, synthesized from either ASIC or FPGA, are usually augmented to the base processor as co-processors to execute the performance-critical regions of applications. ASIC-based SPAs achieve performance-energy efficiency at the expense of sacrificing post-manufacturing programmability while incurring large NRE and TTM; FPGA-based SPAs retain programmability at the expense of significant energy and area increase. Furthermore, augmenting these SPAs as co-processors adds considerable communication and synchronization overhead severely compromising their initially promised benefits. This paper proposes an innovative design paradigm that moves away from the common scheme of adding co-processing ASIC/FPGA SPAs to an integrated and reconfigurable design. Specifically, we propose a new class of embedded processor by replacing the processor's conventional ALU with a more powerful and flexible versatile processing unit (VPU). VPU enables multiple interdependent instructions to be fused and processed together as a single atomic VPU instruction by exploring dataflow dependencies of the application code. The instruction fusion is automatically performed by a VPU-aware compiler. The optimized VPU code reduces code size and amplifies the effective processor bandwidth and capacity by eliminating transient computation and register spill code. Experimental results show up to 400% and average 150% speedup for MediaBench with negligible area increase.