
Latest publications in IEEE Transactions on Computers

ReViT: Vision Transformer Accelerator With Reconfigurable Semantic-Aware Differential Attention
IF 3.6 | Computer Science Zone 2, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-11-21 | DOI: 10.1109/TC.2024.3504263
Xiaofeng Zou;Cen Chen;Hongen Shao;Qinyu Wang;Xiaobin Zhuang;Yangfan Li;Keqin Li
While vision transformers (ViTs) have continued to achieve new milestones in computer vision, their complicated network architectures with high computation and memory costs have hindered their deployment on resource-limited edge devices. Some customized accelerators have been proposed to accelerate the execution of ViTs, achieving improved performance with reduced energy consumption. However, these approaches utilize flattened attention mechanisms and ignore the inherent hierarchical visual semantics in images. In this work, we conduct a thorough analysis of hierarchical visual semantics in real-world images, revealing opportunities and challenges of leveraging visual semantics to accelerate ViTs. We propose ReViT, a systematic algorithm and architecture co-design approach, which aims to exploit the visual semantics to accelerate ViTs. Our proposed algorithm can leverage the same semantic class with strong feature similarity to reduce computation and communication in a differential attention mechanism, and support the semantic-aware attention efficiently. A novel dedicated architecture is designed to support the proposed algorithm and translate it into performance improvements. Moreover, we propose an efficient execution dataflow to alleviate workload imbalance and maximize hardware utilization. ReViT opens new directions for accelerating ViTs by exploring the underlying visual semantics of images. ReViT gains an average of 2.3× speedup and 3.6× energy efficiency over state-of-the-art ViT accelerators.
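To make the computation-saving idea concrete, here is a toy sketch (not ReViT's actual algorithm) of attending over a few semantic group centroids instead of all tokens, shrinking the attention score matrix from N×N to N×G. The grouping heuristic, shapes, and function name are invented for illustration.

```python
import numpy as np

def semantic_group_attention(x, n_groups):
    """Toy sketch: group tokens by feature similarity and attend over the
    G group centroids instead of all N tokens, reducing the attention cost
    from O(N^2) to O(N*G). Illustrates the general idea only, not ReViT's
    differential-attention algorithm."""
    n, d = x.shape
    # crude grouping: sort tokens by mean activation and split into G groups
    order = np.argsort(x.mean(axis=1))
    groups = np.array_split(order, n_groups)
    centroids = np.stack([x[g].mean(axis=0) for g in groups])   # (G, d)
    # every token attends to the G centroids only
    scores = x @ centroids.T / np.sqrt(d)                       # (N, G)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)               # softmax rows
    return weights @ centroids                                  # (N, d)

out = semantic_group_attention(np.random.rand(16, 8), n_groups=4)
print(out.shape)  # (16, 8)
```

With N = 16 tokens and G = 4 groups, the score matrix has 64 entries instead of 256; the real design additionally exploits redundancy between similar tokens within a group.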
IEEE Transactions on Computers, vol. 74, no. 3, pp. 1079-1093.
Citations: 0
NStore: A High-Performance NUMA-Aware Key-Value Store for Hybrid Memory
IF 3.6 | Computer Science Zone 2, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-11-21 | DOI: 10.1109/TC.2024.3504269
Zhonghua Wang;Kai Lu;Jiguang Wan;Hong Jiang;Zeyang Zhao;Peng Xu;Biliang Lai;Guokuan Li;Changsheng Xie
Emerging persistent memory (PM) promises near-DRAM performance, larger capacity, and data persistence, attracting researchers to design PM-based key-value stores. However, existing PM-based key-value stores lack awareness of the Non-Uniform Memory Access (NUMA) architecture on PM, where accessing PM on remote NUMA sockets is considerably slower than accessing local PM. This NUMA-unawareness results in sub-optimal performance when scaling on NUMA. Although DRAM caching alleviates this issue, existing cache policies ignore the performance disparity between remote and local PM accesses, keeping remote PM access as a performance bottleneck when scaling PM stores on NUMA. Furthermore, creating hot data views in each socket's PM fails to eliminate remote PM writes and, worse, induces additional local PM writes. This paper presents NStore, a high-performance NUMA-aware key-value store for the PM-DRAM hybrid memory. NStore introduces a NUMA-aware cache replacement strategy, called Remote Access First (RAF) cache in DRAM, to minimize remote PM accesses. In addition, NStore deploys Nlog, a write-optimized log-structured persistent storage, purposed to eliminate remote PM writes. NStore further mitigates the NUMA impacts through localized scan operations, efficient garbage collection, and multi-thread recovery for Nlog. Evaluations show that NStore outperforms state-of-the-art PM-based key-value stores, achieving up to 13.9× and 11.2× higher write and read throughput, respectively.
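The cache idea can be caricatured in a few lines: on eviction, prefer to drop entries whose backing data is on local PM (cheap to re-fetch) and keep entries backed by remote PM. This is an invented toy policy with made-up names; NStore's actual RAF cache and its interaction with Nlog are considerably more involved.

```python
from collections import OrderedDict

class RemoteFirstCache:
    """Toy cost-aware cache: remote-PM-backed entries are kept preferentially
    because re-fetching them costs more than re-fetching local entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> (value, is_remote), LRU order

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)          # mark as recently used
            return self.entries[key][0]
        return None

    def put(self, key, value, is_remote):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = (value, is_remote)
        if len(self.entries) > self.capacity:
            # evict the least-recent *local* entry if any; else plain LRU
            victim = next((k for k, (_, r) in self.entries.items() if not r),
                          next(iter(self.entries)))
            del self.entries[victim]

cache = RemoteFirstCache(capacity=2)
cache.put("a", 1, is_remote=False)
cache.put("b", 2, is_remote=True)
cache.put("c", 3, is_remote=True)   # evicts "a", the only local-backed entry
print(cache.get("b"))  # 2
```

A plain LRU would have evicted "a" here too, but only by accident of recency; the point of the cost-aware policy is that "b" survives even when it is the least recently used entry.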
IEEE Transactions on Computers, vol. 74, no. 3, pp. 929-943.
Citations: 0
NetMod: Toward Accelerating Cloud RAN Distributed Unit Modulation Within Programmable Switches
IF 3.6 | Computer Science Zone 2, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-11-20 | DOI: 10.1109/TC.2024.3500379
Abdulbary Naji;Xingfu Wang;Ammar Hawbani;Aiman Ghannami;Liang Zhao;XiaoHua Xu;Wei Zhao
Radio Access Networks (RAN) are anticipated to gradually transition towards Cloud RAN (C-RAN), leveraging the full advantages of the cloud-native computing model. While this paradigm shift offers a promising architectural evolution to improve scalability, efficiency, and performance, significant challenges remain in managing the massive computing requirements of physical layer (PHY) processing. To address these challenges and meet the stringent Service Level Objectives (SLOs) in 5G networks, hardware acceleration technologies are essential. In this paper, we aim to mitigate this challenge by offloading 5G modulation mapping, a critical yet demanding function to encode bits into IQ symbols, directly onto the switch ASICs. Specifically, we introduce NetMod, a 5G New Radio (NR) standard-compliant in-network modulation mapper accelerator. NetMod leverages the capabilities of new-generation programmable switches within the C-RAN infrastructure to offload and accelerate PHY modulation functions. We implemented a NetMod prototype on a real-world platform using the Intel Tofino programmable switch and commodity servers running the Data Plane Development Kit (DPDK). Through extensive experiments, we demonstrate that NetMod achieves modulation mapping at switch line rate using minimal switch resources, thereby preserving ample space for traditional switching tasks. Furthermore, comparisons with a GPU-based 5G modulation mapper show that NetMod is 2.2× to 3.3× faster using only a single switch port. These results highlight the potential of in-network acceleration to enhance 5G network performance and efficiency.
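The offloaded function itself is simple enough to show: 5G NR maps each bit pair to a Gray-coded QPSK constellation point (3GPP TS 38.211), and higher orders like 16/64/256-QAM follow the same pattern with larger bit groups. The sketch below is a reference software mapper for comparison; NetMod performs this mapping inside the switch pipeline, not in Python.

```python
# QPSK per 3GPP TS 38.211: d = ((1 - 2*b0) + j(1 - 2*b1)) / sqrt(2)
QPSK = {
    (0, 0): complex(1, 1), (0, 1): complex(1, -1),
    (1, 0): complex(-1, 1), (1, 1): complex(-1, -1),
}

def modulate_qpsk(bits):
    """Map an even-length bit sequence to a list of QPSK IQ symbols."""
    assert len(bits) % 2 == 0
    scale = 2 ** -0.5   # unit average symbol energy
    return [QPSK[(bits[i], bits[i + 1])] * scale
            for i in range(0, len(bits), 2)]

symbols = modulate_qpsk([0, 0, 1, 1])
print(symbols)   # two symbols in opposite quadrants of the constellation
```

On a switch ASIC the same function reduces to a table lookup per bit group, which is why it fits the match-action model of programmable pipelines so well.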
IEEE Transactions on Computers, vol. 74, no. 2, pp. 665-677.
Citations: 0
COSMO: COmpressed Sensing for Models and Logging Optimization in MCU Performance Screening
IF 3.6 | Computer Science Zone 2, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-11-20 | DOI: 10.1109/TC.2024.3500378
Nicolò Bellarmino;Riccardo Cantoro;Sophie M. Fosson;Martin Huch;Tobias Kilian;Ulf Schlichtmann;Giovanni Squillero
In safety-critical applications, microcontrollers must meet stringent quality and performance standards, including the maximum operating frequency $F_{max}$. Machine learning models have proven effective in estimating $F_{max}$ by utilizing data from on-chip ring oscillators. Previous research has shown that increasing the number of ring oscillators on board can enable the deployment of simple linear regression models to predict $F_{max}$. However, the scarcity of labeled data that characterize this context poses a challenge in managing high-dimensional feature spaces; moreover, a very high number of ring oscillators is not desirable due to technological reasons. By modeling $F_{max}$ as a linear combination of the ring oscillators' values, this paper employs Compressed Sensing theory to build the model and perform feature selection, enhancing model efficiency and interpretability. We explore regularized linear methods with convex/non-convex penalties in microcontroller performance screening, focusing on selecting informative ring oscillators. This permits reducing models' footprint while retaining high prediction accuracy. Our experiments on two real-world microcontroller products compare Compressed Sensing with two alternative feature selection approaches: filter and wrapper methods. In our experiments, regularized linear models effectively identify relevant ring oscillators, achieving compression rates of up to 32:1, with no substantial loss in prediction metrics.
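The modeling assumption above, $F_{max}$ as a (nearly) sparse linear combination of ring-oscillator readings, can be illustrated with a small L1-regularized fit that recovers which oscillators matter. The data below are synthetic, and ISTA (iterative soft-thresholding) is just one simple solver; the paper explores several convex and non-convex penalties.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))       # 200 chips x 50 ring-oscillator readings
true_w = np.zeros(50)
true_w[[3, 17, 42]] = [1.5, -2.0, 0.8]   # only 3 oscillators are informative
y = X @ true_w + 0.01 * rng.standard_normal(200)   # synthetic "Fmax"

# ISTA for min_w 0.5*||Xw - y||^2 + lam*||w||_1
w = np.zeros(50)
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz const of the gradient
lam = 0.1
for _ in range(500):
    w = w - step * X.T @ (X @ w - y)                       # gradient step
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold

selected = np.nonzero(np.abs(w) > 0.05)[0]
print(selected)   # should recover the informative oscillator indices
```

The selected indices are exactly the oscillators that would need to stay on the die (and in the test log), which is the "logging optimization" angle of the paper.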
IEEE Transactions on Computers, vol. 74, no. 2, pp. 652-664.
Citations: 0
Improving Efficiency in Multi-Modal Autonomous Embedded Systems Through Adaptive Gating
IF 3.6 | Computer Science Zone 2, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-11-18 | DOI: 10.1109/TC.2024.3500382
Xiaofeng Hou;Cheng Xu;Chao Li;Jiacheng Liu;Xuehan Tang;Kwang-Ting Cheng;Minyi Guo
The parallel advancement of AI and IoT technologies has recently boosted the development of multi-modal computing ($M^{2}C$) on pervasive autonomous embedded systems (AES). $M^{2}C$ takes advantage of data from different modalities such as images, audio, and text and is able to achieve notable improvements in accuracy. However, achieving these accuracy gains often comes at the cost of increased computational complexity and energy consumption. Furthermore, the presence of numerous advanced sensors in these systems significantly contributes to power consumption, exacerbating the issue of limited power resources. Collectively, these challenges pose difficulties in deploying $M^{2}C$ on small embedded devices with scarce energy resources. In this article, we propose an Adaptive Modality Gating technique called AMG for in-situ $M^{2}C$ applications. The primary objective of AMG is to conserve energy while preserving the accuracy advantages of $M^{2}C$. To achieve this goal, AMG incorporates two first-of-its-kind designs. Firstly, it introduces a novel semi-gating architecture that enables partial modality sensor power gating. Specifically, we devise the de-centralized AMG (D-AMG) and centralized AMG (C-AMG) architectures: the former buffers raw data on sensors, while the latter buffers raw data on the computing board; each suits different edge scenarios. Secondly, it facilitates a self-initialization/tuning process on the AES, supported by a carefully built analytical model. Extensive evaluations demonstrate the effectiveness of AMG: it achieves 1.6× to 3.8× higher throughput than other power management methods and extends the lifespan of AES by 10% to 280% within the same energy budget, while satisfying all performance and latency requirements across various scenarios.
IEEE Transactions on Computers, vol. 74, no. 2, pp. 691-704.
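The gating idea, powering only the modality sensors whose accuracy contribution justifies their energy cost, can be sketched as a toy budgeted selection. The sensor names, power figures, accuracy gains, and the greedy heuristic below are all invented for illustration; AMG's real controller relies on an analytical model and semi-gating hardware support.

```python
def gate_modalities(modalities, power_budget):
    """Pick modality sensors to power on under a power budget.
    modalities: list of (name, power_watts, accuracy_gain) tuples."""
    chosen, used = [], 0.0
    # greedy by accuracy gain per watt (a simple knapsack heuristic)
    for name, power, gain in sorted(modalities,
                                    key=lambda m: m[2] / m[1], reverse=True):
        if used + power <= power_budget:
            chosen.append(name)
            used += power
    return chosen

# invented example sensors: (name, power in W, accuracy gain)
sensors = [("camera", 2.0, 0.12), ("lidar", 8.0, 0.20), ("mic", 0.5, 0.04)]
print(gate_modalities(sensors, power_budget=3.0))  # ['mic', 'camera']
```

Here the lidar offers the largest absolute gain but the worst gain-per-watt, so under a 3 W budget the controller keeps the cheaper camera and microphone powered instead.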
Citations: 0
Uncovering the Intricacies and Synergies of Processor Microarchitecture Mechanisms Using Explainable AI
IF 3.6 | Computer Science Zone 2, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-11-18 | DOI: 10.1109/TC.2024.3500377
Abdoulaye Gamatié;Yuyang Wang;Diego Valdez Duran
This paper defines a data-driven methodology seamlessly combining machine learning (ML) and eXplainable Artificial Intelligence (XAI) techniques to address the challenge of understanding the intricate relationships among microarchitecture mechanisms and their joint effect on system performance. By applying the SHapley Additive exPlanations (SHAP) XAI method, it analyzes the synergies of cache replacement, branch prediction, and hardware prefetching on instructions per cycle (IPC) scores. We validate our methodology by using the SPEC CPU 2006 and 2017 benchmark suites with the ChampSim simulator. We illustrate the benefits of the proposed methodology and discuss the major insights and limitations obtained from this study.
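As a minimal illustration of what SHAP-style attribution computes, the sketch below evaluates exact Shapley values for a made-up three-feature "IPC model". The gain and synergy numbers are invented, and the paper applies SHAP to trained ML models of ChampSim IPC scores rather than to a closed-form value function like this one.

```python
from itertools import combinations
from math import factorial

FEATURES = ["cache_repl", "branch_pred", "prefetch"]

def ipc(subset):
    """Toy IPC model: base 1.0, additive gains, plus a synergy between
    branch prediction and prefetching (all numbers invented)."""
    gain = {"cache_repl": 0.10, "branch_pred": 0.20, "prefetch": 0.15}
    v = 1.0 + sum(gain[f] for f in subset)
    if "branch_pred" in subset and "prefetch" in subset:
        v += 0.05   # synergy term
    return v

def shapley(feature):
    """Exact Shapley value: weighted average marginal contribution of
    `feature` over all subsets of the other features."""
    n = len(FEATURES)
    others = [f for f in FEATURES if f != feature]
    total = 0.0
    for k in range(n):
        for s in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (ipc(set(s) | {feature}) - ipc(set(s)))
    return total

for f in FEATURES:
    print(f, round(shapley(f), 4))
```

Note how the synergy term is split evenly between `branch_pred` and `prefetch` (0.025 each on top of their additive gains), which is exactly the kind of mechanism interaction the paper's SHAP analysis surfaces; the three values sum to the total IPC gain over the baseline.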
IEEE Transactions on Computers, vol. 74, no. 2, pp. 637-651.
Citations: 0
Energy-Delay Efficient Segmented Approximate Adder With Smart Chaining
IF 3.6 | Computer Science Zone 2, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-11-18 | DOI: 10.1109/TC.2024.3500371
Tayebeh Karimi;Arezoo Kamran
Approximate computing is a promising approach for high-performance, low-energy computation in inherently error-tolerant applications. This paper proposes an approximate adder comprising a constant-truncation block in the least significant part and several non-overlapping summation blocks in the more significant parts of the adder. The carry-in of each block is supplied using the most significant bit of one of the input operands from the earlier block. In the most significant block, two more-precise approaches are used to generate candidate values for the carry-in. The final value of the carry-in for this block is selected based on the values of the input operands. In fact, the proposed approximate adder is input-aware, and dynamically adjusts its operation in one or two cycles to improve accuracy while limiting the average delay. The experimental results indicate that the proposed adder has a better quality-effort tradeoff than state-of-the-art approximate adders. Different configurations of the proposed adder improve delay, energy, and the energy-delay product (EDP) by 78%, 72%, and 87%, respectively, when compared to state-of-the-art approximate adders, all without any loss in accuracy. Additionally, the efficiency of the proposed adder is confirmed in both image dithering and stock price prediction through regression.
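A behavioral model helps show where the approximation error comes from: the truncated low bits are fixed to a constant, and each block adds independently with a carry-in predicted from one operand's bit in the block below, so block-boundary carries can be mispredicted. The parameters below (16-bit width, 4-bit truncation, 4-bit blocks, all-ones fill) are illustrative, and this model omits the paper's dual-candidate carry selection in the top block.

```python
def approx_add(a, b, width=16, trunc=4, block=4):
    """Toy segmented approximate addition of two `width`-bit integers."""
    result = (1 << trunc) - 1                 # truncated LSBs fixed to all-ones
    for lo in range(trunc, width, block):
        mask = (1 << block) - 1
        a_blk = (a >> lo) & mask
        b_blk = (b >> lo) & mask
        # carry-in predicted from the MSB of operand `a` in the block below
        carry = (a >> (lo - 1)) & 1
        s = (a_blk + b_blk + carry) & mask    # carry-out is not propagated
        result |= s << lo
    return result

exact = 1234 + 5678
approx = approx_add(1234, 5678)
print(exact, approx, abs(exact - approx))
```

Because no carry ever ripples across a block boundary, every block's delay is that of a short `block`-bit adder, which is the source of the delay and energy savings; the error is confined to the low bits and mispredicted boundary carries.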
IEEE Transactions on Computers, vol. 74, no. 2, pp. 597-608.
Citations: 0
Shared Recurrence Floating-Point Divide/Sqrt and Integer Divide/Remainder With Early Termination
IF 3.6 | Computer Science Zone 2, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-11-18 | DOI: 10.1109/TC.2024.3500380
Kevin Kim;Katherine Parry;David Harris;Cedar Turek;Alessandro Maiuolo;Rose Thompson;James Stine
Division, square root, and remainder are fundamental operations required by most computer systems. Floating-point and integer operations are commonly performed on separate datapaths. This paper presents the first detailed implementation of a shared recurrence unit that supports floating-point division/square root and integer division/remainder. It supports early termination and shares the normalization shifter needed for integer and subnormal inputs. Synthesis results show that shared double-precision dividers producing at least 4 bits per cycle are 9-18% smaller and 3-16% faster than separate integer and floating-point units.
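The early-termination idea is easy to model in software: a digit-recurrence divider needs only as many iterations as the quotient has bits, which follows from the operands' leading-zero counts. The radix-2 restoring sketch below is illustrative only; the paper's unit is a shared recurrence producing several quotient bits per cycle and also handling floating-point operands.

```python
def divide(n, d):
    """Restoring division returning (quotient, remainder, cycles).
    `cycles` shrinks with the quotient width: early termination."""
    assert d > 0 and n >= 0
    # the quotient occupies at most this many bits
    steps = max(n.bit_length() - d.bit_length() + 1, 0)
    q, r = 0, n >> steps          # skipped high bits of n already form r < d
    for i in range(steps - 1, -1, -1):   # one quotient bit per "cycle"
        r = (r << 1) | ((n >> i) & 1)
        if r >= d:                # trial subtraction succeeds
            r -= d
            q |= 1 << i
        # else: restore, i.e. keep r unchanged (no subtraction committed)
    return q, r, steps

print(divide(1000, 7))   # (142, 6, 8): 8 cycles instead of a fixed 64
```

Dividing by a large divisor terminates in few cycles regardless of the datapath width, which is why latency in such units varies with the operands rather than being fixed.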
IEEE Transactions on Computers, vol. 74, no. 2, pp. 740-748.
Citations: 0
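The digit-recurrence idea behind the abstract above can be sketched in a few lines. The following is a minimal radix-2 restoring-division model with early termination; it is hypothetical illustration code (function name and termination test are our own), not the paper's shared floating-point/integer unit:

```python
def restoring_divide(dividend: int, divisor: int, width: int = 32):
    """Radix-2 restoring division: one quotient bit per iteration.

    Returns (quotient, remainder, cycles). The loop stops early once the
    partial remainder is zero and no dividend bits remain, mirroring the
    early-termination optimization mentioned in the abstract (sketch only,
    not the paper's recurrence).
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient, remainder, cycles = 0, 0, 0
    for i in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        cycles += 1
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1 << i
        # Early termination: with a zero remainder and only zero bits left,
        # every later iteration would shift in a zero and emit quotient bit 0.
        if remainder == 0 and (dividend & ((1 << i) - 1)) == 0:
            break
    return quotient, remainder, cycles
```

For example, `restoring_divide(20, 5, 8)` finishes in 6 of the 8 possible cycles; a hardware recurrence retiring 4 bits per cycle applies the same idea at a higher radix.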
A System-Level Test Methodology for Communication Peripherals in System-on-Chips
IF 3.6 2区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-11-18 DOI: 10.1109/TC.2024.3500375
Francesco Angione;Paolo Bernardi;Nicola di Gruttola Giardino;Gabriele Filipponi;Claudia Bertani;Vincenzo Tancorre
This paper deals with functional System-Level Test (SLT) for System-on-Chips (SoCs) communication peripherals. The proposed methodology is based on analyzing the potential weaknesses of applied structural tests such as Scan-based. Then, the paper illustrates how to develop a functional SLT programs software suite to address such issues. In case the communication peripheral provides detection/correction features, the methodology proposes the design of a hardware companion module to be added to the Automatic Test Equipment (ATE) to interact with the SoC communication module by purposely corrupting data frames. Experimental results are obtained on an industrial, automotive SoC produced by STMicroelectronics focusing on the Controller Area Network (CAN) communication peripheral and showing the effectiveness of the SLT suite to complement structural tests.
{"title":"A System-Level Test Methodology for Communication Peripherals in System-on-Chips","authors":"Francesco Angione;Paolo Bernardi;Nicola di Gruttola Giardino;Gabriele Filipponi;Claudia Bertani;Vincenzo Tancorre","doi":"10.1109/TC.2024.3500375","DOIUrl":"https://doi.org/10.1109/TC.2024.3500375","url":null,"abstract":"This paper deals with functional System-Level Test (SLT) for System-on-Chips (SoCs) communication peripherals. The proposed methodology is based on analyzing the potential weaknesses of applied structural tests such as Scan-based. Then, the paper illustrates how to develop a functional SLT programs software suite to address such issues. In case the communication peripheral provides detection/correction features, the methodology proposes the design of a hardware companion module to be added to the Automatic Test Equipment (ATE) to interact with the SoC communication module by purposely corrupting data frames. Experimental results are obtained on an industrial, automotive SoC produced by STMicroelectronics focusing on the Controller Area Network (CAN) communication peripheral and showing the effectiveness of the SLT suite to complement structural tests.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"731-739"},"PeriodicalIF":3.6,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10755212","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
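To make the fault-injection idea in the abstract concrete, here is a small model of corrupting a data frame and checking it against CAN's CRC-15 (generator polynomial 0x4599). The helper names are hypothetical; this is not the paper's ATE companion module:

```python
CAN_CRC15_POLY = 0x4599  # x^15+x^14+x^10+x^8+x^7+x^4+x^3+1 (classical CAN)

def can_crc15(bits):
    """Bitwise CRC-15 over a sequence of 0/1 bits, MSB first."""
    crc = 0
    for bit in bits:
        msb = (crc >> 14) & 1
        crc = (crc << 1) & 0x7FFF
        if msb ^ bit:
            crc ^= CAN_CRC15_POLY
    return crc

def flip_bit(bits, index):
    """Model the ATE-side fault injector: corrupt one bit of a frame."""
    corrupted = list(bits)
    corrupted[index] ^= 1
    return corrupted

frame = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]  # arbitrary payload bits
good_crc = can_crc15(frame)
assert can_crc15(flip_bit(frame, 3)) != good_crc  # single-bit fault detected
```

Any single-bit corruption changes the CRC, so a frame deliberately corrupted on the tester side must be flagged by a correctly working peripheral; silent acceptance indicates a defect that the functional SLT program can report.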
Distributed Differentially Private Matrix Factorization for Implicit Data via Secure Aggregation
IF 3.6 2区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-11-18 DOI: 10.1109/TC.2024.3500383
Chenhong Luo;Yong Wang;Yanjun Zhang;Leo Yu Zhang
Implicit feedback data has become the primary choice for building recommendation models due to its abundance and ease of collection in the real world. The strong generalization capability and high computational efficiency of matrix factorization make it one of the principal models for constructing recommender systems. Recommenders must collect vast amounts of user data for model training, which poses a significant threat to user privacy. Most current privacy-enhancing recommendation systems focus on explicit feedback data, and few studies are dedicated to the privacy protection of implicit recommenders. To bridge this research gap, this paper designs a distributed differentially private matrix factorization mechanism for implicit feedback data in scenarios where the recommender is not trusted. Our mechanism not only eliminates the assumption of a trusted recommender, but also achieves the same accuracy as a CDP-based privacy-preserving MF model. We prove that our mechanism satisfies $(epsilon,delta)$-CDP. The experimental results on three public datasets confirm that the proposed mechanism achieves high recommendation quality.
{"title":"Distributed Differentially Private Matrix Factorization for Implicit Data via Secure Aggregation","authors":"Chenhong Luo;Yong Wang;Yanjun Zhang;Leo Yu Zhang","doi":"10.1109/TC.2024.3500383","DOIUrl":"https://doi.org/10.1109/TC.2024.3500383","url":null,"abstract":"Implicit feedback data has become the primary choice for building recommendation models due to its abundance and ease for collection in the real world. The strong generalization capability and high computational efficiency of matrix factorization make it one of the principal models for constructing recommender systems. Recommenders have to collect vast amounts of user data for model training, which poses a significant threat to user privacy. Most of the current privacy enhancing recommendation systems mainly focus on explicit feedback data, and there are limited studies dedicated to the privacy protection of implicit recommender. To bridge the existing research gap, this paper designs a distributed differentially private matrix factorization for implicit feedback data in scenarios where the recommender is not trusted. Our mechanism not only eliminates the assumption of a trusted recommender, but also achieves the same accuracy as CDP-based privacy-preserving MF model. We prove that our mechanism satisfies <inline-formula><tex-math>$(epsilon,delta)$</tex-math></inline-formula>-CDP. The experimental results on three public datasets confirm that the proposed mechanism can achieve high recommendation quality.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"705-716"},"PeriodicalIF":3.6,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143107128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
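A minimal model of the client-side step implied by the abstract: each user clips and perturbs a local gradient with Gaussian noise, and secure aggregation is idealized as a plain sum, so the server sees only the noisy total. All names and the noise calibration here are assumptions for illustration, not the paper's protocol:

```python
import numpy as np

def local_noisy_gradient(user_vec, item_vec, rating, clip, sigma, rng):
    """One client's clipped, noise-perturbed gradient for an item factor.

    Squared loss (r - <p_u, q_i>)^2: the gradient w.r.t. q_i is clipped to
    L2 norm `clip`, then Gaussian noise scaled by `sigma * clip` is added
    (a standard Gaussian-mechanism sketch, not the paper's calibration).
    """
    err = rating - user_vec @ item_vec
    grad = -2.0 * err * user_vec
    norm = np.linalg.norm(grad)
    if norm > clip:
        grad = grad * (clip / norm)
    return grad + rng.normal(0.0, sigma * clip, size=grad.shape)

def secure_aggregate(noisy_grads):
    """Ideal functionality of secure aggregation: pairwise masks cancel,
    so the server learns only the sum of the clients' contributions."""
    return np.sum(noisy_grads, axis=0)

rng = np.random.default_rng(0)
d = 8
item = rng.normal(size=d)                      # one item factor q_i
clients = [(rng.normal(size=d), 1.0) for _ in range(100)]  # (p_u, implicit r)
agg = secure_aggregate(
    [local_noisy_gradient(p, item, r, clip=1.0, sigma=0.1, rng=rng)
     for p, r in clients]
)
item -= 0.01 * agg  # one server-side update step on the item factor
```

Because noise is added before aggregation and individual contributions are hidden by the (here idealized) masking, the server never observes any single user's exact gradient.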