Pub Date : 2025-07-25 DOI: 10.1109/TCSII.2025.3592480
Nadeem Atif;Saquib Mazhar;Mohammed Ameen;Shaik Rafi Ahamed;M. K. Bhuyan
Semantic segmentation is a pixel-level visual recognition task widely used in autonomous driving. Attaining a good trade-off between accuracy and speed is critical for the physical deployment of networks on resource-constrained edge devices. Towards this challenging task, we propose an efficient basic block designed to leverage local, short-range, and long-range contextual information at different abstraction levels. We introduce a simple technique inside the basic block, called Iterative Context Embedding (ICE), to reinforce the short- and long-range contextual details iteratively. Based on the resulting short- and long-range ICE (SLICE) module, we propose an ultra-lightweight network called SLICENet. Our model is the fastest among existing ultra-lightweight models while achieving decent accuracy. Specifically, with only 0.3 million parameters, it achieves 69.1% mean IoU on the Cityscapes test set, making it the smallest model to achieve this accuracy. In addition, it achieves an inference speed of 224.8 frames per second (FPS) on the RTX 3090 at $512 \times 1024$ resolution. For a power-efficient solution aimed at battery-operated devices, we also deploy our model on Xilinx’s ZCU102 development board (Zynq UltraScale+ MPSoC). Despite its strong performance, the power consumption is only 950 mW, significantly lower than that of GPU-based inference. Our code will be shared at https://github.com/NadeemAtif-Alig/SLICENet.
Title: SLICENet: An FPGA-Based Efficient Semantic Segmentation Network for Edge Deployment
Journal: IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 72, no. 9, pp. 1338-1342
Pub Date : 2025-07-24 DOI: 10.1109/TCSII.2025.3592475
Chun-Tse Su;Chao-Yen Hsu;Tai-Cheng Lee
This brief presents a calibration-free 12-bit 1.5-GS/s pipelined ADC employing a merged sub-ADC quantization (MSAQ) technique. Building upon the conventional pipelined ADC architecture, the proposed technique extends the amplification time, thereby relaxing the design of the inner-stage residue amplifier. A prototype ADC implemented in 28-nm CMOS technology achieves an SFDR of 70.52 dB and an SNDR of 58.03 dB at a Nyquist input, while consuming 18.5 mW from a 1-V supply. It yields Schreier and Walden figures of merit (FoMs) of 164.1 dB and 18.9 fJ/conv.-step, respectively.
Title: A Calibration-Free 12-Bit 1.5-GS/s Pipelined ADC With Merged Sub-ADC Quantization Technique
Journal: IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 72, no. 9, pp. 1168-1172
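The two figures of merit quoted in the ADC abstract above can be checked from the reported SNDR, power, and sampling rate. The sketch below uses the commonly cited textbook definitions (Walden: P / (fs * 2^ENOB); Schreier: SNDR + 10*log10((fs/2)/P)); the abstract does not spell out its formulas, so treating these as the ones the authors used is an assumption, and the function names are illustrative only.

```python
import math

def enob(sndr_db):
    """Effective number of bits from SNDR in dB (standard sine-wave formula)."""
    return (sndr_db - 1.76) / 6.02

def walden_fom(power_w, fs_hz, sndr_db):
    """Walden FoM in joules per conversion step: P / (fs * 2^ENOB)."""
    return power_w / (fs_hz * 2 ** enob(sndr_db))

def schreier_fom(power_w, fs_hz, sndr_db):
    """Schreier FoM in dB: SNDR + 10*log10(bandwidth / P), with bandwidth = fs/2."""
    return sndr_db + 10 * math.log10((fs_hz / 2) / power_w)

# Values reported in the abstract: 18.5 mW, 1.5 GS/s, 58.03 dB SNDR at Nyquist.
P_W, FS_HZ, SNDR_DB = 18.5e-3, 1.5e9, 58.03

fom_w = walden_fom(P_W, FS_HZ, SNDR_DB)
fom_s = schreier_fom(P_W, FS_HZ, SNDR_DB)
print(f"Walden: {fom_w * 1e15:.1f} fJ/conv.-step, Schreier: {fom_s:.1f} dB")
```

Under these definitions the computed values round to the 18.9 fJ/conv.-step and 164.1 dB stated in the abstract.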
Pub Date : 2025-07-22 DOI: 10.1109/TCSII.2025.3591633
Haikuo Shao;Aotao Wang;Zhongfeng Wang
The layer normalization (LN) function is widely adopted in Transformer-based neural networks. Efficient training of Transformers on personal devices is attracting attention because of data privacy and latency concerns. However, the critical LN function involves extreme outliers that complicate quantization, as well as hardware-unfriendly square-root and division operations, posing resource challenges for training deployment on the edge. This brief proposes an efficient LN training architecture with algorithm and hardware co-optimization. Specifically, we present a dynamic quantization algorithm based on integer arithmetic that smooths outliers to preserve training accuracy. Then, we develop a reconfigurable hardware architecture to efficiently support the various operations during LN training, with a vector-wise pipelined dataflow to further improve hardware efficiency. Experimental results show that our architecture achieves up to 0.25 and 1.0 giga inputs per second (GinS) in throughput on FPGA and ASIC platforms, respectively, outperforming prior works.
Title: An Efficient Layer Normalization Training Module With Dynamic Quantization for Transformers
Journal: IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 72, no. 9, pp. 1288-1292
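To make the "hardware-unfriendly square-root and division operations" in the LN abstract above concrete, here is a minimal sketch of the standard floating-point LN forward pass. It is not the brief's quantized integer algorithm, which the abstract does not describe in detail; the function name, the `eps` value, and the sample vector are illustrative assumptions.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard layer normalization over one feature vector (float reference)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    # The square root and the division below are the operations that are
    # costly to implement in fixed-point hardware.
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# A vector with one extreme outlier (hypothetical data), the kind of input
# that makes a single shared quantization scale inaccurate.
x = [1.0, 2.0, 3.0, 100.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print(y)
```

With unit `gamma` and zero `beta`, the output has approximately zero mean and unit variance, regardless of how large the outlier is.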
Pub Date : 2025-07-22 DOI: 10.1109/TCSII.2025.3590998
Qing Yang;Jing Wang;Hao Shen;Ju H. Park
This brief addresses the optimization problem for Markov jump systems (MJSs) with unknown dynamics via a novel scaling-based reinforcement learning scheme. First, by employing a subsystem transformation, the optimal controller design problem for MJSs is reformulated as solving a set of parallel and decoupled algebraic Riccati equations (DAREs). Traditional learning schemes for solving these equations either require initially admissible control policies or suffer from slow convergence. To overcome these limitations, a novel scaling-based reinforcement learning algorithm is proposed. The proposed algorithm has several notable advantages: it eliminates the need for system dynamics during the learning process, achieves faster convergence, and relaxes the requirement for an initially admissible control policy. The effectiveness of the proposed scheme is rigorously proven by mathematical induction. Finally, the feasibility of the proposed scheme is verified using an operational amplifier circuit example, and its superiority is demonstrated through a series of comparative simulations.
Title: Learning-Based Scaling Scheme for Markov Jump Systems and Its Application in Operational Amplifier Circuit
Journal: IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 72, no. 9, pp. 1238-1242
Pub Date : 2025-07-22 DOI: 10.1109/TCSII.2025.3591215
Jiawei Xie;Yandong Chen;Yuhang Zhou;Cong Luo;Jian Guo
The multi-coil configuration presents an effective approach for high-power wireless power transfer (WPT) systems. In such systems, mitigating complex cross-coupling in the magnetic couplers remains critical to achieving high efficiency and stable power delivery. Thus, this brief proposes a compact dual-channel WPT system with decoupled coils to enhance the overall power capacity. The transmitter and receiver have the same structure, with each charging pad constructed from solenoid coils wound around Q-coils and ferrite cores. The solenoid coils and Q-coils are naturally decoupled from each other, which eliminates additional coupling interference so that only the main mutual inductances $M_{1}$ and $M_{2}$ are retained. Furthermore, the principle of power enhancement and constant current (CC) output is thoroughly analyzed, and a more generalized output model is derived. Finally, a 305 W experimental prototype was constructed, with results in agreement with the theoretical analyses. Compared with the single-channel system, the output current (2.82 A) of the proposed system is amplified by $(1 + M_{1}/M_{2})$, with the peak efficiency reaching 90.5%, an improvement of about 6%.
Title: A Compact Dual-Channel WPT System Based on Decoupled Integrated Coils for Power Enhancement
Journal: IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 72, no. 9, pp. 1243-1247
Pub Date : 2025-07-21 DOI: 10.1109/TCSII.2025.3590699
Qifan Wang;Qiangsong Zhao;Yuanqing Xia
This brief presents a novel voltage-sensorless grid voltage full feedforward estimator (GVFFE)-based current control strategy for a grid-connected inverter with an LCL filter. The grid voltage full feedforward (GVFF) signal can be directly estimated by the GVFFE using a closed-loop structure based on a repetitive controller. Furthermore, the grid voltage can be reconstructed from the estimated GVFF signal without relying on a voltage sensor. Compared with traditional GVFF methods, the GVFFE eliminates the noise amplification caused by derivative operations and compensates for computational delay. As a result, the disturbance rejection performance for grid voltage is significantly improved. The stability and harmonic suppression capabilities of the proposed strategy are comprehensively analyzed. Experimental results validate the effectiveness of the proposed control strategy, demonstrating its potential for practical applications in grid-connected inverter systems.
Title: A Novel Voltage-Sensorless Grid Voltage Full Feedforward Estimator-Based Current Control Strategy for a Grid-Connected Inverter
Journal: IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 72, no. 9, pp. 1233-1237
Pub Date : 2025-07-21 DOI: 10.1109/TCSII.2025.3590689
Yao Xu;Chunyu Yang;Gonghe Li;Ju H. Park
This brief focuses on the data-driven near-optimal reduced tracking control problem of linear time-invariant (LTI) singularly perturbed systems (SPSs) from noisy data. Based on singular perturbation theory (SPT), the reduced subsystem of the SPSs is obtained. An augmented error system is then constructed and an optimal tracking control (OTC) problem is formulated. Next, the integral version of the continuous-time augmented error system is constructed to avoid the error-prone calculation of derivatives. The closed-loop augmented error system is then parameterized by the system I/O data, and a data-based semi-definite program (SDP) is proposed for the OTC problem. In addition, because the I/O data of the virtual reduced system are not actually measurable, the virtual reduced system is reconstructed from the I/O data of the original system, and the system performance is analyzed. Finally, a speed tracking control experiment on a permanent magnet synchronous motor (PMSM) verifies the effectiveness of the proposed data-driven control scheme.
Title: Data-Driven Near-Optimal Reduced Tracking Control of SPSs With Application to PMSM
Journal: IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 72, no. 9, pp. 1228-1232
Pub Date : 2025-07-18 DOI: 10.1109/TCSII.2025.3590593
Xiaofei Liao;Dixian Zhao;Chenyu Xu;Hao Gong;Wendi Chen;Xiaohu You
This brief presents a wideband frequency synthesizer with 3.9 to 8.2 GHz continuous frequency coverage for satellite communication applications. The core fractional-N phase-locked loop utilizes four LC-VCOs to achieve a 4.3 GHz tuning range with a 50-MHz reference frequency. The frequency mapping of the four VCOs, along with module-level parameter optimization, is performed to maintain a stable figure of merit and minimize loop jitter across the entire tuning range. A high-isolation low-loss inductive multiplexing output technique is proposed, which uses only one active buffer to drive both the internal loop and the external load, significantly reducing power consumption. Moreover, an on-chip active loop filter is implemented, reducing the capacitance area by 80% and enhancing chip integration. Fabricated in a 65-nm CMOS technology, the frequency synthesizer occupies a chip area of 2.28 mm$^{2}$ while consuming 25–33.5 mW. The phase noise reaches –123.72 dBc/Hz and –116.31 dBc/Hz at 1-MHz offset under 3.9- and 8.2-GHz carriers, respectively. Measured reference and fractional spurs remain below –65 and –55 dBc, respectively.
Title: A 3.9-8.2-GHz Wideband Frequency Synthesizer With an Inductive Multiplexing Output Network for SATCOM Applications
Journal: IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 72, no. 9, pp. 1163-1167
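The frequency plan in the synthesizer abstract above implies a range of loop division ratios and a fractional tuning range that are easy to work out. The sketch below assumes the loop divides the output frequency directly down to the 50-MHz reference (no prescaler or output divider, which the abstract does not specify); the function names and the 5.0125 GHz example channel are hypothetical.

```python
F_REF_HZ = 50e6                      # reference frequency from the abstract
F_MIN_HZ, F_MAX_HZ = 3.9e9, 8.2e9    # continuous coverage from the abstract

def division_ratio(f_out_hz, f_ref_hz=F_REF_HZ):
    """Overall division ratio N such that f_out = N * f_ref (may be fractional)."""
    return f_out_hz / f_ref_hz

def fractional_tuning_range(f_min_hz, f_max_hz):
    """Tuning range relative to the center frequency."""
    return 2 * (f_max_hz - f_min_hz) / (f_max_hz + f_min_hz)

n_min = division_ratio(F_MIN_HZ)
n_max = division_ratio(F_MAX_HZ)
n_example = division_ratio(5.0125e9)   # hypothetical channel needing a fractional N
ftr = fractional_tuning_range(F_MIN_HZ, F_MAX_HZ)

print(f"N spans {n_min:g} to {n_max:g}; example N = {n_example:g}")
print(f"Fractional tuning range: {ftr * 100:.1f}%")
```

Under this assumption the division ratio spans 78 to 164, and the 4.3 GHz span corresponds to roughly 71% of the center frequency, which illustrates why several VCOs are used to cover the band.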
Pub Date : 2025-07-18 DOI: 10.1109/TCSII.2025.3590432
Haiqiu Huang;Mingyu Wang;Xiaojie Li;Baiqing Zhong;Zeqi Yang;Tao Lu;Yicong Zhang;Zhiyi Yu
The attention mechanism has become increasingly popular due to its ability to capture complex dependencies, enabling models like transformers to achieve remarkable performance in large language models (LLMs), computer vision, and other domains. However, the mechanism faces challenges such as low arithmetic intensity, which leads to frequent data movement, and long sequence lengths, which introduce a large amount of redundant information. To mitigate both data movement and computational overhead in attention mechanisms, we propose a hybrid CAM-SRAM processing-in-memory architecture. By leveraging the parallel search and sort capabilities of content-addressable memory (CAM) arrays, we achieve dynamic fine-grained sparsification on features with varying variance, reducing the number of multiply-accumulate (MAC) operations in the matrix multiplication (MatMul). Furthermore, an approximate Booth encoding is employed in our MAC unit to reduce the number of partial products and keep their signs consistent. This eliminates the need for negation operations, simplifying the logic design. Experimental results show that, in different configurations, our feature-level sparsification scheme achieves over 80% sparsity with an acceptable accuracy drop. With sparsity up to 80%, our design achieves a performance of 0.252-1.26 TOPS and a power efficiency of 4.71-21.72 TOPS/W, operating at 1000 MHz in the TSMC 40-nm process.
Title: A Hybrid CAM-SRAM Processing-in-Memory Architecture With Feature Level Sparsity for Attention Mechanisms
Journal: IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 72, no. 9, pp. 1283-1287
Pub Date : 2025-07-17 DOI: 10.1109/TCSII.2025.3589991
Xuetao Xie;Yi-Fei Pu;Jian Wang
This brief proposes a memristor-based volume controller, thus providing a practical application scenario for system identification. To identify the parameter in this system, we propose a fractional momentum enhanced fractional least mean square (FM-EFLMS) algorithm that combines the enhanced fractional derivative with a fractional momentum term. We analyze the stability condition and the resource consumption of the FM-EFLMS algorithm. Simulation experiments demonstrate the potential advantage of the memristor-based volume controller. Moreover, the experimental results show that the FM-EFLMS algorithm converges better than the competing filter algorithms.
Title: A Fractional Momentum Enhanced Fractional Filter for the Memristor-Based Volume Controller
Journal: IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 72, no. 9, pp. 1333-1337
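For readers unfamiliar with the baseline that FM-EFLMS extends, the sketch below shows system identification with the plain least mean square (LMS) filter. It deliberately omits the fractional derivative and fractional momentum terms, whose exact form the abstract above does not give; the two-tap "unknown system", the step size, and all names are hypothetical choices for illustration.

```python
import random

def lms_identify(inputs, desired, num_taps, mu):
    """Plain LMS: adapt FIR weights w so that w . [x[n], x[n-1], ...] tracks desired[n]."""
    w = [0.0] * num_taps
    buf = [0.0] * num_taps            # most recent input first
    for x, d in zip(inputs, desired):
        buf = [x] + buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, buf))
        e = d - y                     # instantaneous estimation error
        w = [wi + mu * e * xi for wi, xi in zip(w, buf)]
    return w

# Hypothetical unknown system: d[n] = 0.8*x[n] - 0.3*x[n-1], noise-free.
random.seed(0)
true_w = [0.8, -0.3]
xs = [random.uniform(-1.0, 1.0) for _ in range(5000)]
ds, prev = [], 0.0
for x in xs:
    ds.append(true_w[0] * x + true_w[1] * prev)
    prev = x

w_hat = lms_identify(xs, ds, num_taps=2, mu=0.05)
print(w_hat)
```

In this noise-free setting the estimated weights converge to the true coefficients; variants such as the one proposed in the brief modify the weight update to change convergence behavior.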