Falcon: A Fused-Layer Accelerator With Layer-Wise Hybrid Inference Flow for Computational Imaging CNNs
Authors: Yong-Tai Chen; Yen-Ting Chiu; Hao-Jiun Tu; Chao-Tsung Huang
Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 3, pp. 720-732
Publication date: 2024-11-06
DOI: 10.1109/TVLSI.2024.3488042
URL: https://ieeexplore.ieee.org/document/10745739/
Impact factor: 2.8 (JCR Q2, Computer Science, Hardware & Architecture)
Citation count: 0
Abstract
Computational imaging (CI) has advanced significantly through the use of convolutional neural networks (CNNs). Edge deployment of these networks relies on layer fusion to reduce the heavy external memory access (EMA) of feature maps, which requires handling overlapped features by either reusing or recomputing them. Depending on how the boundary-handling strategy is organized, the induced computing complexity and EMA can be optimized. However, state-of-the-art CI accelerators primarily apply homogeneous inference flows, which employ a single overlap-handling strategy throughout the fused layers and thus limit the ability to balance computation against data access. In this article, we explore layer-wise optimization in fused-layer CNNs by exploiting hybrid-strategy inference flows and devising a corresponding computing architecture. We categorize layer-wise strategies and put forward a layer-wise hybrid inference flow (LHIF) that integrates their advantages, and we propose an optimization procedure that explicitly analyzes essential figures of merit (FoMs), including throughput, EMA, and energy efficiency. Furthermore, we develop a high-throughput accelerator, Falcon, that efficiently supports LHIF under massive parallelism, in particular through a time-division-multiplexing (TDM) buffer interface that enables seamless access to feature maps stored in an interleaved manner. Layout results show that the accelerator, which delivers 41 TOPS with 1.5 MB of feature-map buffers, supports LHIF while increasing the die area by only 1.4% and the power consumption by only 0.7%. Extensive simulations demonstrate the versatility of LHIF in working scenarios at the operational, design, and system levels. Compared with homogeneous inference flows, the proposed LHIF achieves Pareto optimality with up to $2.28\times$ higher throughput and $3.5\times$ lower EMA.
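The reuse-versus-recompute trade-off described in the abstract can be illustrated with a toy model. The sketch below is not the paper's LHIF optimization procedure; it is a simplified one-dimensional cost count under stated assumptions: stride-1 convolutions with one kernel size, one output tile, and a per-layer choice of how the overlapped input columns are obtained. The function name `fused_tile_cost` and the strategy labels are hypothetical.

```python
from typing import List, Tuple


def fused_tile_cost(strategies: List[str], tile: int, kernel: int = 3) -> Tuple[int, int, int]:
    """Toy 1-D cost model for one output tile of a fused stack of stride-1 convolutions.

    strategies[i] states how layer i obtains its overlapped input columns:
      "recompute": the producer (previous layer or external memory) supplies them again
      "reuse":     they are kept in an on-chip buffer from the previous tile
    Returns (computed_columns, buffered_columns, input_columns_read).
    All names and the model itself are illustrative assumptions.
    """
    halo = kernel - 1          # overlapped columns between adjacent tiles
    width = tile               # output width required from the last layer
    computed = 0
    buffered = 0
    # Walk from the last fused layer back to the first one.
    for strategy in reversed(strategies):
        computed += width      # this layer produces `width` output columns
        if strategy == "recompute":
            width += halo      # its producer must deliver a wider region
        elif strategy == "reuse":
            buffered += halo   # overlap comes from the buffer, width is unchanged
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    # `width` is now the number of input columns fetched for the first layer.
    return computed, buffered, width


if __name__ == "__main__":
    for flow in (["recompute"] * 3, ["reuse"] * 3, ["recompute", "reuse", "recompute"]):
        print(flow, fused_tile_cost(flow, tile=8))
```

In this model a homogeneous recompute flow needs no overlap buffer but computes and fetches the most columns, a homogeneous reuse flow does the opposite, and a mixed flow lands in between, which is the kind of balance a layer-wise hybrid flow aims to exploit.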
About the journal:
The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chip and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels.
To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers that emphasize the novel systems-integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chip and wafer fabrication, testing and packaging, and systems-level qualification. Thus, the coverage of these Transactions focuses on VLSI/ULSI microelectronic systems integration.