Xiaofeng Zou;Cen Chen;Hongen Shao;Qinyu Wang;Xiaobin Zhuang;Yangfan Li;Keqin Li
While vision transformers (ViTs) have continued to achieve new milestones in computer vision, their complicated network architectures with high computation and memory costs have hindered their deployment on resource-limited edge devices. Customized accelerators have been proposed to speed up the execution of ViTs, achieving improved performance with reduced energy consumption. However, these approaches rely on flattened attention mechanisms and ignore the inherent hierarchical visual semantics in images. In this work, we conduct a thorough analysis of hierarchical visual semantics in real-world images, revealing the opportunities and challenges of leveraging visual semantics to accelerate ViTs. We propose ReViT, a systematic algorithm-architecture co-design approach that exploits visual semantics to accelerate ViTs. Our algorithm leverages the strong feature similarity among tokens of the same semantic class to reduce computation and communication through a differential attention mechanism, and supports semantic-aware attention efficiently. A dedicated architecture is designed to support the proposed algorithm and translate it into performance improvements. Moreover, we propose an efficient execution dataflow to alleviate workload imbalance and maximize hardware utilization. ReViT opens new directions for accelerating ViTs by exploiting the underlying visual semantics of images, gaining an average 2.3× speedup and 3.6× energy efficiency over state-of-the-art ViT accelerators.
"ReViT: Vision Transformer Accelerator With Reconfigurable Semantic-Aware Differential Attention," IEEE Transactions on Computers, vol. 74, no. 3, pp. 1079-1093. DOI: 10.1109/TC.2024.3504263.
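The core reuse idea — computing attention once per semantic group and sharing the result among highly similar tokens — can be sketched in NumPy. This is an illustrative approximation under assumed inputs (precomputed cluster labels, mean-pooled group representatives, single-head attention), not the paper's differential attention mechanism or hardware dataflow:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def clustered_attention(x, labels):
    """Compute attention once over per-cluster representatives and
    broadcast each cluster's output to all tokens sharing that label,
    so tokens of the same semantic class reuse one result."""
    uniq = np.unique(labels)
    # one mean-pooled representative per semantic cluster (assumption)
    reps = np.stack([x[labels == c].mean(axis=0) for c in uniq])
    # full attention over the (much smaller) set of representatives
    scores = reps @ reps.T / np.sqrt(x.shape[1])
    out_reps = softmax(scores) @ reps
    # map each token back to its cluster's attention output
    return out_reps[np.searchsorted(uniq, labels)]
```

With k clusters instead of n tokens, the attention matrix shrinks from n×n to k×k, which is where the computation and communication savings come from.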
Emerging persistent memory (PM) promises near-DRAM performance, larger capacity, and data persistence, attracting researchers to design PM-based key-value stores. However, existing PM-based key-value stores are unaware of the Non-Uniform Memory Access (NUMA) architecture, in which accessing PM on remote NUMA sockets is considerably slower than accessing local PM. This NUMA-unawareness results in sub-optimal performance when scaling on NUMA. Although DRAM caching alleviates this issue, existing cache policies ignore the performance disparity between remote and local PM accesses, leaving remote PM access as a performance bottleneck when scaling PM stores on NUMA. Furthermore, creating hot-data views in each socket's PM fails to eliminate remote PM writes and, worse, induces additional local PM writes. This paper presents NStore, a high-performance NUMA-aware key-value store for PM-DRAM hybrid memory. NStore introduces a NUMA-aware cache replacement strategy in DRAM, called the Remote Access First (RAF) cache, to minimize remote PM accesses. In addition, NStore deploys Nlog, a write-optimized log-structured persistent store designed to eliminate remote PM writes. NStore further mitigates NUMA effects through localized scan operations, efficient garbage collection, and multi-threaded recovery for Nlog. Evaluations show that NStore outperforms state-of-the-art PM-based key-value stores, achieving up to 13.9× higher write throughput and 11.2× higher read throughput.
Zhonghua Wang;Kai Lu;Jiguang Wan;Hong Jiang;Zeyang Zhao;Peng Xu;Biliang Lai;Guokuan Li;Changsheng Xie, "NStore: A High-Performance NUMA-Aware Key-Value Store for Hybrid Memory," IEEE Transactions on Computers, vol. 74, no. 3, pp. 929-943. DOI: 10.1109/TC.2024.3504269.
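The Remote Access First idea — preferring to keep cached the data whose PM backing is remote, because a miss on it costs more — can be sketched as a small eviction policy. This is a hypothetical illustration of the stated principle, not NStore's actual RAF implementation:

```python
from collections import OrderedDict

class RAFCache:
    """DRAM cache that evicts entries backed by local PM before entries
    backed by remote PM, since re-fetching remote-backed data is slower."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> (value, is_remote), LRU->MRU order

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # refresh recency on hit
        return self.entries[key][0]

    def put(self, key, value, is_remote):
        if key in self.entries:
            self.entries.move_to_end(key)
        elif len(self.entries) >= self.capacity:
            # victim: the least-recently-used *local* entry if one exists,
            # otherwise fall back to plain LRU
            victim = next((k for k, (_, r) in self.entries.items() if not r),
                          next(iter(self.entries)))
            del self.entries[victim]
        self.entries[key] = (value, is_remote)
```

A plain LRU would have evicted the oldest entry regardless of its backing; biasing the victim choice toward local-backed entries keeps the expensive-to-refill data resident.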
Radio Access Networks (RANs) are anticipated to gradually transition towards Cloud RAN (C-RAN), leveraging the full advantages of the cloud-native computing model. While this paradigm shift offers a promising architectural evolution toward improved scalability, efficiency, and performance, significant challenges remain in managing the massive computing requirements of physical layer (PHY) processing. To address these challenges and meet the stringent Service Level Objectives (SLOs) of 5G networks, hardware acceleration is essential. In this paper, we mitigate this challenge by offloading 5G modulation mapping, a critical yet computationally demanding function that encodes bits into IQ symbols, directly onto switch ASICs. Specifically, we introduce NetMod, a 5G New Radio (NR) standard-compliant in-network modulation mapper accelerator. NetMod leverages the capabilities of new-generation programmable switches within the C-RAN infrastructure to offload and accelerate PHY modulation functions. We implemented a NetMod prototype on a real-world platform using the Intel Tofino programmable switch and commodity servers running the Data Plane Development Kit (DPDK). Through extensive experiments, we demonstrate that NetMod performs modulation mapping at switch line rate using minimal switch resources, preserving ample space for traditional switching tasks. Furthermore, comparisons with a GPU-based 5G modulation mapper show that NetMod is 2.2× to 3.3× faster using only a single switch port. These results highlight the potential of in-network acceleration to enhance 5G network performance and efficiency.
Abdulbary Naji;Xingfu Wang;Ammar Hawbani;Aiman Ghannami;Liang Zhao;XiaoHua Xu;Wei Zhao, "NetMod: Toward Accelerating Cloud RAN Distributed Unit Modulation Within Programmable Switches," IEEE Transactions on Computers, vol. 74, no. 2, pp. 665-677. DOI: 10.1109/TC.2024.3500379.
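Modulation mapping itself is specified by the 5G NR standard: 3GPP TS 38.211 defines, for each scheme, how groups of bits map to unit-average-power IQ symbols. The QPSK and 16-QAM mappings can be written directly from the standard's formulas (a reference model of the function being offloaded, not NetMod's P4 implementation):

```python
import numpy as np

def qpsk(bits):
    """3GPP TS 38.211 QPSK: 2 bits -> one IQ symbol,
    d = ((1-2*b0) + j*(1-2*b1)) / sqrt(2)."""
    b = np.asarray(bits).reshape(-1, 2)
    return ((1 - 2 * b[:, 0]) + 1j * (1 - 2 * b[:, 1])) / np.sqrt(2)

def qam16(bits):
    """3GPP TS 38.211 16-QAM: 4 bits -> one IQ symbol,
    d = ((1-2*b0)*(2-(1-2*b2)) + j*(1-2*b1)*(2-(1-2*b3))) / sqrt(10)."""
    b = np.asarray(bits).reshape(-1, 4)
    i = (1 - 2 * b[:, 0]) * (2 - (1 - 2 * b[:, 2]))
    q = (1 - 2 * b[:, 1]) * (2 - (1 - 2 * b[:, 3]))
    return (i + 1j * q) / np.sqrt(10)
```

Because the mapping is a fixed function of a few input bits, it is a natural fit for a match-action lookup table in a programmable switch pipeline.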
Nicolò Bellarmino;Riccardo Cantoro;Sophie M. Fosson;Martin Huch;Tobias Kilian;Ulf Schlichtmann;Giovanni Squillero
In safety-critical applications, microcontrollers must meet stringent quality and performance standards, including the maximum operating frequency $F_{max}$. Machine learning models have proven effective in estimating $F_{max}$ from data collected by on-chip ring oscillators. Previous research has shown that increasing the number of on-board ring oscillators enables the deployment of simple linear regression models to predict $F_{max}$. However, the scarcity of labeled data in this context poses a challenge for managing high-dimensional feature spaces; moreover, a very large number of ring oscillators is undesirable for technological reasons. By modeling $F_{max}$ as a linear combination of the ring oscillators' values, this paper employs Compressed Sensing theory to build the model and perform feature selection, enhancing model efficiency and interpretability. We explore regularized linear methods with convex and non-convex penalties for microcontroller performance screening, focusing on selecting informative ring oscillators. This reduces the models' footprint while retaining high prediction accuracy. Our experiments on two real-world microcontroller products compare Compressed Sensing with two alternative feature selection approaches: filter and wrapper methods. In our experiments, regularized linear models effectively identify relevant ring oscillators, achieving compression rates of up to 32:1 with no substantial loss in prediction metrics.
"COSMO: COmpressed Sensing for Models and Logging Optimization in MCU Performance Screening," IEEE Transactions on Computers, vol. 74, no. 2, pp. 652-664. DOI: 10.1109/TC.2024.3500378.
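Modeling $F_{max}$ as a sparse linear combination of ring-oscillator values amounts to an l1-regularized least-squares (lasso) problem. A minimal sketch using the generic ISTA proximal-gradient solver (not necessarily the solver used in the paper) shows how the l1 penalty drives the coefficients of uninformative oscillators exactly to zero, performing feature selection:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5*||y - Xw||^2 + lam*||w||_1 via ISTA: a gradient step
    on the quadratic term followed by soft-thresholding, which zeroes
    most coefficients and so selects a sparse subset of features."""
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant
    for _ in range(n_iter):
        g = w - step * X.T @ (X @ w - y)             # gradient step
        w = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # shrink
    return w

# Illustrative data: only oscillators 0 and 5 actually predict F_max
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))                       # 20 ring oscillators
w_true = np.zeros(20)
w_true[0], w_true[5] = 3.0, -2.0
y = X @ w_true + 0.1 * rng.normal(size=200)
w_hat = lasso_ista(X, y, lam=5.0)
```

The nonzero entries of `w_hat` identify the informative oscillators; the rest can be dropped from the on-chip logging, which is the compression the abstract quantifies as up to 32:1.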
The parallel advancement of AI and IoT technologies has recently boosted the development of multi-modal computing ($M^{2}C$) on pervasive autonomous embedded systems (AES). $M^{2}C$ takes advantage of data from different modalities, such as images, audio, and text, to achieve notable improvements in accuracy. However, these accuracy gains often come at the cost of increased computational complexity and energy consumption. Furthermore, the numerous advanced sensors in these systems contribute significantly to power consumption, exacerbating the issue of limited power resources. Collectively, these challenges make it difficult to deploy $M^{2}C$ on small embedded devices with scarce energy resources. In this article, we propose an Adaptive Modality Gating technique called AMG for in-situ $M^{2}C$ applications. The primary objective of AMG is to conserve energy while preserving the accuracy advantages of $M^{2}C$. To achieve this goal, AMG incorporates two first-of-their-kind designs. First, it introduces a novel semi-gating architecture that enables partial power gating of modality sensors. Specifically, we devise the de-centralized AMG (D-AMG) and centralized AMG (C-AMG) architectures: the former buffers raw data on the sensors while the latter buffers raw data on the computing board, suiting different edge scenarios. Second, AMG facilitates a self-initialization/tuning process on the AES, supported by a carefully built analytical model. Extensive evaluations demonstrate the effectiveness of AMG: it achieves 1.6× to 3.8× higher throughput than other power management methods and extends the lifespan of AES by 10% to 280% within the same energy budget, while satisfying all performance and latency requirements across various scenarios.
Xiaofeng Hou;Cheng Xu;Chao Li;Jiacheng Liu;Xuehan Tang;Kwang-Ting Cheng;Minyi Guo, "Improving Efficiency in Multi-Modal Autonomous Embedded Systems Through Adaptive Gating," IEEE Transactions on Computers, vol. 74, no. 2, pp. 691-704. DOI: 10.1109/TC.2024.3500382.
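The gating decision at the heart of such a scheme — choosing which modality sensors to keep powered under an energy budget — can be illustrated with a toy greedy heuristic. The names, accuracy gains, and power figures below are invented for illustration; AMG's actual self-tuning analytical model is more involved:

```python
def select_modalities(modalities, power_budget):
    """Greedy sketch of modality gating: keep the modalities with the
    best accuracy-gain-per-watt ratio until the power budget is spent;
    everything else is power-gated."""
    ranked = sorted(modalities, key=lambda m: m["gain"] / m["power"],
                    reverse=True)
    kept, used = [], 0.0
    for m in ranked:
        if used + m["power"] <= power_budget:
            kept.append(m["name"])
            used += m["power"]
    return kept

# Hypothetical sensors: (accuracy gain in points, power in watts)
sensors = [
    {"name": "camera", "gain": 5.0, "power": 3.0},
    {"name": "audio",  "gain": 2.0, "power": 0.5},
    {"name": "lidar",  "gain": 4.0, "power": 4.0},
]
```

With a 4 W budget, audio (4.0 points/W) and camera (1.67 points/W) are kept and lidar is gated off; a real system would re-run such a decision as the budget or scene changes.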
This paper defines a data-driven methodology that seamlessly combines machine learning (ML) and eXplainable Artificial Intelligence (XAI) techniques to address the challenge of understanding the intricate relationships among microarchitecture mechanisms and their effect on system performance. By applying the SHapley Additive exPlanations (SHAP) XAI method, it analyzes the synergies of cache replacement, branch prediction, and hardware prefetching on instructions-per-cycle (IPC) scores. We validate our methodology using the SPEC CPU 2006 and 2017 benchmark suites with the ChampSim simulator. We illustrate the benefits of the proposed methodology and discuss the major insights and limitations obtained from this study.
Abdoulaye Gamatié;Yuyang Wang;Diego Valdez Duran, "Uncovering the Intricacies and Synergies of Processor Microarchitecture Mechanisms Using Explainable AI," IEEE Transactions on Computers, vol. 74, no. 2, pp. 637-651. DOI: 10.1109/TC.2024.3500377.
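For intuition, the Shapley values that SHAP approximates for large models can be computed exactly when only a handful of "features" are involved — here, think of a few microarchitecture mechanisms contributing to IPC. This generic sketch is not tied to the paper's ChampSim setup:

```python
import itertools
import math

def shapley_values(n, value_fn):
    """Exact Shapley attribution for n players/features: average each
    feature's marginal contribution over all subsets, weighted by
    |S|! * (n-|S|-1)! / n!. Tractable only for small n."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for s in itertools.combinations(others, r):
                weight = (math.factorial(r) * math.factorial(n - r - 1)
                          / math.factorial(n))
                phi[i] += weight * (value_fn(set(s) | {i}) - value_fn(set(s)))
    return phi
```

For an additive value function each feature's Shapley value is just its own contribution, and the values always sum to the total gain (the "efficiency" property SHAP relies on).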
Approximate computing is a promising approach for high-performance, low-energy computation in inherently error-tolerant applications. This paper proposes an approximate adder comprising a constant-truncation block in the least significant part and several non-overlapping summation blocks in the more significant parts of the adder. The carry-in of each block is supplied by the most significant bit of one of the input operands from the preceding block. In the most significant block, two more precise approaches generate candidate values for the carry-in, and the final carry-in is selected based on the values of the input operands. The proposed adder is thus input-aware, dynamically adjusting its operation over one or two cycles to improve accuracy while limiting the average delay. Experimental results indicate that the proposed adder offers a better quality-effort tradeoff than state-of-the-art approximate adders: different configurations improve delay, energy, and the energy-delay product (EDP) by 78%, 72%, and 87%, respectively, without any loss in accuracy. Additionally, the efficiency of the proposed adder is confirmed in both image dithering and stock price prediction through regression.
Tayebeh Karimi;Arezoo Kamran, "Energy-Delay Efficient Segmented Approximate Adder With Smart Chaining," IEEE Transactions on Computers, vol. 74, no. 2, pp. 597-608. DOI: 10.1109/TC.2024.3500371.
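The segmented structure can be sketched at the bit level: truncated low bits replaced by a constant, independent fixed-width blocks, and each block's carry-in speculated from one operand's bit in the preceding block. The parameters and the all-ones truncation constant below are illustrative choices, not the paper's exact configuration:

```python
def approx_add(a, b, width=16, trunc=4, block=4):
    """Segmented approximate addition sketch: the lowest `trunc` bits are
    replaced by a constant (all ones), and each `block`-bit segment is
    summed independently, with its carry-in speculated from the MSB of
    operand `a` in the previous segment (a common speculative-carry
    scheme). Segment overflow is dropped, so results are approximate."""
    result = (1 << trunc) - 1                    # constant-truncated low bits
    mask = (1 << block) - 1
    for lo in range(trunc, width, block):
        a_seg = (a >> lo) & mask
        b_seg = (b >> lo) & mask
        carry = (a >> (lo - 1)) & 1              # speculate carry from prev MSB
        result |= ((a_seg + b_seg + carry) & mask) << lo
    return result
```

Because segments never wait on a full carry chain, the critical path is one block long; the price is a bounded error whenever the speculation or truncation disagrees with the exact sum.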
Kevin Kim;Katherine Parry;David Harris;Cedar Turek;Alessandro Maiuolo;Rose Thompson;James Stine
Division, square root, and remainder are fundamental operations required by most computer systems, yet floating-point and integer operations are commonly performed on separate datapaths. This paper presents the first detailed implementation of a shared recurrence unit that supports floating-point division/square root and integer division/remainder. The unit supports early termination and shares the normalization shifter needed for integer and subnormal inputs. Synthesis results show that shared double-precision dividers producing at least 4 bits per cycle are 9-18% smaller and 3-16% faster than separate integer and floating-point units.
"Shared Recurrence Floating-Point Divide/Sqrt and Integer Divide/Remainder With Early Termination," IEEE Transactions on Computers, vol. 74, no. 2, pp. 740-748. DOI: 10.1109/TC.2024.3500380.
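The recurrence at the heart of such a unit retires a fixed number of quotient bits per cycle. A minimal radix-2 restoring-division sketch (unsigned integers only, one bit per step, without the paper's early termination or shared floating-point path) conveys the idea:

```python
def divide_recurrence(n, d, width=32):
    """Radix-2 restoring division: each iteration shifts one dividend bit
    into the partial remainder and subtracts the divisor if it fits,
    producing one quotient bit per step. Returns (quotient, remainder)."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    q, r = 0, 0
    for i in range(width - 1, -1, -1):
        r = (r << 1) | ((n >> i) & 1)   # bring down the next dividend bit
        if r >= d:                      # divisor fits: subtract, emit 1
            r -= d
            q |= 1 << i
    return q, r
```

Hardware versions retire 2 or 4 bits per cycle with higher-radix digit selection, and early termination simply stops the loop once the remaining bits are known to be zero — the same recurrence serves both integer and mantissa operands after normalization.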
Francesco Angione;Paolo Bernardi;Nicola di Gruttola Giardino;Gabriele Filipponi;Claudia Bertani;Vincenzo Tancorre
This paper deals with functional System-Level Test (SLT) of communication peripherals in System-on-Chips (SoCs). The proposed methodology begins by analyzing the potential weaknesses of applied structural tests, such as scan-based tests. The paper then illustrates how to develop a suite of functional SLT programs to address those weaknesses. When the communication peripheral provides detection/correction features, the methodology proposes adding a hardware companion module to the Automatic Test Equipment (ATE) that interacts with the SoC communication module by purposely corrupting data frames. Experimental results, obtained on an industrial automotive SoC produced by STMicroelectronics and focusing on the Controller Area Network (CAN) communication peripheral, show the effectiveness of the SLT suite in complementing structural tests.
"A System-Level Test Methodology for Communication Peripherals in System-on-Chips," IEEE Transactions on Computers, vol. 74, no. 2, pp. 731-739. DOI: 10.1109/TC.2024.3500375. Open access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10755212
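For intuition on why deliberately corrupted frames exercise the peripheral's detection logic: CAN protects each frame with a CRC-15 (generator polynomial 0x4599), and flipping any single bit changes the CRC, so the receiver's check flags the injected error. A bit-serial model of the standard CAN CRC computation (illustrating the mechanism, not the paper's ATE companion module):

```python
def crc15_can(bits):
    """Bit-serial CAN CRC-15 (generator polynomial 0x4599), computed as in
    the CAN specification: shift each frame bit in, and XOR the polynomial
    into the register whenever the fed-back bit is 1."""
    crc = 0
    for bit in bits:
        crc_nxt = bit ^ ((crc >> 14) & 1)   # incoming bit XOR CRC MSB
        crc = (crc << 1) & 0x7FFF           # shift, keep 15 bits
        if crc_nxt:
            crc ^= 0x4599
    return crc
```

An ATE-side fault injector only needs to flip bits after the CRC field has been generated; the mismatch at the receiver then drives the error-detection and error-frame paths that structural tests cover poorly.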
Implicit feedback data has become the primary choice for building recommendation models due to its abundance and ease of collection in the real world. The strong generalization capability and high computational efficiency of matrix factorization (MF) make it one of the principal models for constructing recommender systems. Recommenders have to collect vast amounts of user data for model training, which poses a significant threat to user privacy. Most current privacy-enhancing recommender systems focus on explicit feedback data, and only limited studies are dedicated to the privacy protection of implicit-feedback recommenders. To bridge this research gap, this paper designs a distributed differentially private matrix factorization for implicit feedback data in scenarios where the recommender is not trusted. Our mechanism not only eliminates the assumption of a trusted recommender, but also achieves the same accuracy as a CDP-based privacy-preserving MF model. We prove that our mechanism satisfies $(\epsilon,\delta)$-CDP. The experimental results on three public datasets confirm that the proposed mechanism can achieve high recommendation quality.
{"title":"Distributed Differentially Private Matrix Factorization for Implicit Data via Secure Aggregation","authors":"Chenhong Luo;Yong Wang;Yanjun Zhang;Leo Yu Zhang","doi":"10.1109/TC.2024.3500383","DOIUrl":"https://doi.org/10.1109/TC.2024.3500383","url":null,"abstract":"Implicit feedback data has become the primary choice for building recommendation models due to its abundance and ease of collection in the real world. The strong generalization capability and high computational efficiency of matrix factorization (MF) make it one of the principal models for constructing recommender systems. Recommenders have to collect vast amounts of user data for model training, which poses a significant threat to user privacy. Most current privacy-enhancing recommender systems focus on explicit feedback data, and only limited studies are dedicated to the privacy protection of implicit-feedback recommenders. To bridge this research gap, this paper designs a distributed differentially private matrix factorization for implicit feedback data in scenarios where the recommender is not trusted. Our mechanism not only eliminates the assumption of a trusted recommender, but also achieves the same accuracy as a CDP-based privacy-preserving MF model. We prove that our mechanism satisfies <inline-formula><tex-math>$(\epsilon,\delta)$</tex-math></inline-formula>-CDP. 
The experimental results on three public datasets confirm that the proposed mechanism can achieve high recommendation quality.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"705-716"},"PeriodicalIF":3.6,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143107128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
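Achieving $(\epsilon,\delta)$-differential privacy without a trusted recommender typically rests on each client bounding the sensitivity of its contribution and on Gaussian noise calibrated to the privacy budget before (or during) secure aggregation. A minimal, illustrative sketch of that clip-sum-noise step (an assumption for exposition, not the paper's actual protocol; all function names and parameters are hypothetical):

```python
import numpy as np

def gaussian_sigma(epsilon, delta, sensitivity):
    # Standard noise scale of the (epsilon, delta) Gaussian mechanism:
    # sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon.
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

def private_gradient_sum(user_grads, clip_norm, epsilon, delta, rng):
    """Clip each user's gradient to bound L2 sensitivity, sum the
    clipped gradients, then add Gaussian noise calibrated to the clip."""
    total = np.zeros_like(user_grads[0])
    for g in user_grads:
        norm = np.linalg.norm(g)
        total += g * min(1.0, clip_norm / max(norm, 1e-12))
    sigma = gaussian_sigma(epsilon, delta, clip_norm)
    return total + rng.normal(0.0, sigma, size=total.shape)

# Toy per-user gradients for one item's latent factors (dimension 8).
rng = np.random.default_rng(0)
grads = [rng.normal(size=8) for _ in range(50)]
noisy = private_gradient_sum(grads, clip_norm=1.0, epsilon=1.0,
                             delta=1e-5, rng=rng)
```

In a secure-aggregation setting the untrusted recommender would observe only the noisy sum, never an individual user's clipped gradient, which is how such designs drop the trusted-curator assumption while targeting CDP-level accuracy.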