Guest Editorial: Special Section on Applied Software Aging and Rejuvenation
Raffaele Romagnoli; Jianwen Xiang
Pub Date: 2025-06-19 | DOI: 10.1109/TETC.2025.3579813
{"title":"Guest Editorial: Special Section on Applied Software Aging and Rejuvenation","authors":"Raffaele Romagnoli;Jianwen Xiang","doi":"10.1109/TETC.2025.3579813","DOIUrl":"https://doi.org/10.1109/TETC.2025.3579813","url":null,"abstract":"","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"13 2","pages":"281-282"},"PeriodicalIF":5.1,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11045264","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Novel RFET-Based FPGA Architecture Based on Delay-Aware Packing Algorithm
Sheng Lu; Liuting Shang; Sungyong Jung; Qilian Liang; Chenyun Pan
Pub Date: 2025-06-16 | DOI: 10.1109/TETC.2025.3572712
Reconfigurable devices are attracting growing interest as both a potential alternative and a complement to traditional CMOS technology. This paper develops a novel field-programmable gate array (FPGA) architecture based on MClusters, which are built from fast and area-efficient 2-input look-up tables (LUTs) implemented with reconfigurable field-effect transistors (RFETs). To fully utilize the MClusters, we propose a SAT-based delay-aware packing algorithm for technology mapping. In addition, we integrate a partitioning algorithm that divides the circuit into several sub-circuits to further reduce the global routing resources and their associated switching energy. Finally, we develop an efficient technology/circuit/system co-design framework for optimizing the overall performance of FPGAs. Comprehensive benchmarking demonstrates that the optimal design yields significant reductions of up to 39% in area, 36% in wire length, and 40% in switching energy compared to traditional CMOS 6-input LUT FPGAs.
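The packing objective lends itself to a compact illustration. The sketch below is a greedy stand-in for the paper's SAT formulation (the actual SAT encoding is not reproduced here); the cluster capacity, netlist format, and all identifiers are hypothetical:

```python
# Toy delay-aware packing sketch: a netlist of 2-input LUTs is grouped into
# fixed-capacity clusters ("MClusters"), packing nodes on the longest paths
# first so that critical edges tend to stay inside a cluster, where local
# routing is cheaper than global routing. Illustrative heuristic only.

CLUSTER_CAPACITY = 4  # hypothetical number of 2-input LUTs per MCluster

def pack(netlist):
    """netlist: dict mapping each node to its list of fan-in nodes."""
    depth = {}
    def node_depth(n):  # longest path from the primary inputs
        if n not in depth:
            fins = netlist.get(n, [])
            depth[n] = 1 + max((node_depth(f) for f in fins), default=0)
        return depth[n]

    order = sorted(netlist, key=node_depth, reverse=True)  # critical first
    clusters, current = [], []
    for node in order:
        current.append(node)
        if len(current) == CLUSTER_CAPACITY:
            clusters.append(current)
            current = []
    if current:
        clusters.append(current)
    return clusters

example = {"g1": [], "g2": [], "g3": ["g1", "g2"], "g4": ["g3", "g1"]}
print(pack(example))  # e.g. [['g4', 'g3', 'g1', 'g2']]
```

A SAT formulation replaces the greedy loop with constraints (capacity, connectivity, delay bounds) solved exactly, which is what lets the paper trade solver time for better packings.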
{"title":"A Novel RFET-Based FPGA Architecture Based on Delay-Aware Packing Algorithm","authors":"Sheng Lu;Liuting Shang;Sungyong Jung;Qilian Liang;Chenyun Pan","doi":"10.1109/TETC.2025.3572712","DOIUrl":"https://doi.org/10.1109/TETC.2025.3572712","url":null,"abstract":"Reconfigurable devices are attracting growing interest as both a potential alternative and complement to traditional CMOS technology. This paper develops a novel field-programmable gate array (FPGA) architecture based on MClusters, which is made of fast and area-efficient 2-input look-up tables (LUTs) through reconfigurable field-effect transistors (RFETs). To fully utilize the MClusters, we propose an SAT-based delay-aware packing algorithm for the technology mapping. In addition, we integrate a partitioning algorithm to divide the circuit into several sub-circuits to further reduce the global routing resources and their associated switching energy of the system. Finally, we develop an efficient technology/circuit/system co-design framework for optimizing the overall performance of FPGAs. Based on comprehensive benchmarking, results demonstrate that optimal design yields significant reductions of up to 39% area, 36% wire length, and 40% switching energy compared to traditional CMOS 6-input LUT FPGAs.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"13 3","pages":"1230-1241"},"PeriodicalIF":5.4,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145036775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
What, When, Where to Compute-in-Memory for Efficient Matrix Multiplication During Machine Learning Inference
Tanvi Sharma; Mustafa Ali; Indranil Chakraborty; Kaushik Roy
Pub Date: 2025-06-05 | DOI: 10.1109/TETC.2025.3574508
Matrix multiplication is the dominant computation during machine learning (ML) inference. Compute-in-memory (CiM) paradigms have emerged as a highly energy-efficient way to perform such multiplications. However, integrating compute into memory raises key questions: 1) What type of CiM to use: given the multitude of CiM design characteristics, their suitability must be assessed from an architectural perspective. 2) When to use CiM: ML inference spans workloads with a variety of memory and compute requirements, making it difficult to identify when CiM is more beneficial than standard processing cores. 3) Where to integrate CiM: each memory level has different bandwidth and capacity, creating different data-reuse opportunities for CiM integration. To answer these questions about on-chip CiM integration for accelerating ML workloads, we use an analytical architecture-evaluation methodology with a tailored mapping algorithm. The mapping algorithm aims to achieve the highest weight reuse and the fewest data movements for a given CiM prototype and workload. Our analysis considers the integration of CiM prototypes into the cache levels of a tensor-core-like architecture and shows that CiM-integrated memory improves energy efficiency by up to 3.4× and throughput by up to 15.6× compared to an established baseline with INT-8 precision. We believe this work provides insights into what type of CiM to use, and when and where to optimally integrate it in the cache hierarchy for efficient matrix multiplication.
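The "when to use CiM" question can be made concrete with a first-order roofline-style estimate. The sketch below is our illustration, not the paper's methodology; all throughput and bandwidth numbers are invented, and real CiM modeling must also account for precision, array utilization, and mapping:

```python
# First-order model of when compute-in-memory (CiM) beats a compute core for
# one matmul tile: CiM computes more slowly but sees far higher effective
# bandwidth because weights stay in the memory array. Toy numbers only.

def matmul_time(M, N, K, flops_per_s, bytes_per_s, bytes_per_elem=1):
    compute = 2 * M * N * K / flops_per_s
    traffic = bytes_per_elem * (M * K + K * N + M * N) / bytes_per_s
    return max(compute, traffic)  # roofline: compute and transfer overlap

core = dict(flops_per_s=1e12, bytes_per_s=1e11)  # fast core, slow link
cim = dict(flops_per_s=5e11, bytes_per_s=8e11)   # slower compute, near data

for M, N, K in [(256, 256, 256), (1, 4096, 4096)]:  # square vs. GEMV-like
    t_core = matmul_time(M, N, K, **core)
    t_cim = matmul_time(M, N, K, **cim)
    print(f"{M}x{N}x{K}: core {t_core*1e6:.1f} us, CiM {t_cim*1e6:.1f} us")
```

With these toy numbers the square, compute-bound tile favors the core, while the memory-bound GEMV-like tile favors CiM, which is the intuition behind deciding when and where to integrate CiM in the cache hierarchy.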
{"title":"What, When, Where to Compute-in-Memory for Efficient Matrix Multiplication During Machine Learning Inference","authors":"Tanvi Sharma;Mustafa Ali;Indranil Chakraborty;Kaushik Roy","doi":"10.1109/TETC.2025.3574508","DOIUrl":"https://doi.org/10.1109/TETC.2025.3574508","url":null,"abstract":"Matrix multiplication is the dominant computation during Machine Learning (ML) inference. To efficiently perform such multiplication operations, Compute-in-memory (CiM) paradigms have emerged as a highly energy efficient solution. However, integrating compute in memory poses key questions, such as 1) <i>What type of CiM to use:</i> Given a multitude of CiM design characteristics, determining their suitability from architecture perspective is needed. 2) <i>When to use CiM:</i> ML inference includes workloads with a variety of memory and compute requirements, making it difficult to identify when CiM is more beneficial than standard processing cores. 3) <i>Where to integrate CiM:</i> Each memory level has different bandwidth and capacity, creating different data reuse opportunities for CiM integration. To answer such questions regarding on-chip CiM integration for accelerating ML workloads, we use an analytical architecture-evaluation methodology with tailored mapping algorithm. The mapping algorithm aims to achieve highest weight reuse and reduced data movements for a given CiM prototype and workload. Our analysis considers the integration of CiM prototypes into the cache levels of a tensor-core-like architecture, and shows that CiM integrated memory improves energy efficiency by up to <inline-formula><tex-math>$3.4 times$</tex-math></inline-formula> and throughput by up to <inline-formula><tex-math>$15.6 times$</tex-math></inline-formula> compared to established baseline with INT-8 precision. We believe the proposed work provides insights into <i>what</i> type of CiM to use, and <i>when</i> and <i>where</i> to optimally integrate it in the cache hierarchy for efficient matrix multiplication.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"13 3","pages":"1215-1229"},"PeriodicalIF":5.4,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145036894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Cancelable Multimodal Template Protection Algorithm Based on Random Index
Huabin Wang; Mingzhao Wang; Xinxin Liu; Yingfan Cheng; Fei Liu; Jian Zhou; Liang Tao
Pub Date: 2025-06-04 | DOI: 10.1109/TETC.2025.3574359
Current multimodal template protection methods typically require encryption or transformation of the original biometric features. These operations carry risks: attackers may reverse-engineer or decrypt the protected multimodal templates to retrieve partial or complete information about the original templates, leaking the original biometric features. To address this issue, we propose a cancelable multimodal template protection method based on random indexing. First, hash functions generate integer sequences that serve as index values, which are used together with random binary vectors to create single-modal cancelable templates. Second, the single-modal cancelable templates are used as indices into random binary sequences; the template information they locate is filled into the corresponding positions of a fusion cancelable template, achieving template fusion. The resulting template is unrelated to the original biometric features. Finally, rather than directly storing the binary factor sequences, an XOR operation is performed on the extended biometric feature vectors and random binary sequences to generate the encoded key. Experimental results demonstrate that the proposed method significantly improves performance on the FVC2002DB1 fingerprint, MMCBNU_6000 finger-vein, and NUPT_FPV databases, while satisfying the design criteria for cancelable biometrics. We also analyze four privacy and security attacks against the scheme.
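Two of the building blocks the abstract describes, hash-derived indices into a user-specific random vector and an XOR-based key encoding, can be sketched in a few lines. This is an illustrative simplification, not the authors' exact algorithm; the sizes and function names are our own:

```python
# Sketch of random-index template protection: indices derived by iterated
# hashing of the biometric feature select bits from a revocable random
# vector, and an XOR between two bit sequences yields an encoded key in
# which neither operand has to be stored.
import hashlib
import secrets

def indices_from_feature(feature_bytes, n, modulus):
    """Derive n integer indices by iterated hashing of the feature."""
    idx, h = [], feature_bytes
    for _ in range(n):
        h = hashlib.sha256(h).digest()
        idx.append(int.from_bytes(h[:4], "big") % modulus)
    return idx

def cancelable_template(feature_bytes, user_random_bits):
    """Index into the user's random vector; reissuing the vector revokes
    the template without exposing the original biometric feature."""
    idx = indices_from_feature(feature_bytes, 128, len(user_random_bits))
    return [user_random_bits[i] for i in idx]

def encoded_key(extended_bits, random_bits):
    """XOR encoding of an (extended) bit vector with a random sequence."""
    return [a ^ b for a, b in zip(extended_bits, random_bits)]

user_vec = [secrets.randbelow(2) for _ in range(1024)]
tmpl = cancelable_template(b"toy fingerprint feature", user_vec)
key = encoded_key(tmpl, [secrets.randbelow(2) for _ in range(128)])
print(len(tmpl), len(key))  # 128 128
```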
{"title":"The Cancelable Multimodal Template Protection Algorithm Based on Random Index","authors":"Huabin Wang;Mingzhao Wang;Xinxin Liu;Yingfan Cheng;Fei Liu;Jian Zhou;Liang Tao","doi":"10.1109/TETC.2025.3574359","DOIUrl":"https://doi.org/10.1109/TETC.2025.3574359","url":null,"abstract":"Current multimodal template protection methods typically require encryption or transformation of the original biometric features. However, these operations carry certain risks, as attackers may reverse-engineer or decrypt the protected multimodal templates to retrieve partial or complete information about the original templates, leading to the leakage of the original biometric features. To address this issue, we propose a cancelable multimodal template protection method based on random indexing. First, hash functions are used to generate integer sequences as index values, which are then employed to create single-modal cancelable templates using random binary vectors. Second, the single-modal cancelable templates are used as indices for random binary sequences, which locate the corresponding template information and are filled into the fusion cancelable template at the respective positions, achieving template fusion. The resulting template is unrelated to the original biometric features. Finally, without directly storing the binary factor sequences, an XOR operation is performed on the extended biometric feature vectors and random binary sequences to generate the encoded key. Experimental results demonstrate that the proposed method significantly enhances performance on the FVC2002DB1 fingerprint, MMCBNU_6000 finger-vein, and NUPT_FPV databases, while also satisfying the standards for cancelable biometric feature design. We also analyze four privacy and security attacks against this scheme.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"13 3","pages":"1200-1214"},"PeriodicalIF":5.4,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145057468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PipeDAP: An Efficient Communication Framework for Scheduling Decoupled All-Reduce Primitives in Distributed DNN Training
Yunqi Gao; Bing Hu; Mahdi Boloursaz Mashhadi; Wei Wang; Rahim Tafazolli; Mérouane Debbah
Pub Date: 2025-06-02 | DOI: 10.1109/TETC.2025.3573522
Communication scheduling effectively improves the scalability of distributed deep learning by overlapping computation and communication tasks during training. However, existing communication scheduling frameworks based on tensor partitioning suffer from two fundamental issues: (1) partitioning schemes at the data-volume level introduce extensive startup overheads, leading to higher energy consumption, and (2) partitioning schemes at the communication-primitive level do not provide optimal scheduling, resulting in longer training time. In this article, we propose an efficient communication mechanism, PipeDAP, which schedules decoupled all-reduce operations in a near-optimal order to minimize the time and energy consumption of training DNN models. We build a mathematical model for PipeDAP and derive the near-optimal scheduling order of the reduce-scatter and all-gather operations. Meanwhile, we leverage simultaneous communication of reduce-scatter and all-gather operations to further reduce startup overheads. We implement the PipeDAP architecture on the PyTorch framework and apply it to distributed training of benchmark DNN models. Experimental results on two GPU clusters demonstrate that PipeDAP achieves up to 1.82× speedup and saves up to 45.4% of energy consumption compared to state-of-the-art communication scheduling frameworks.
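The benefit of decoupling all-reduce can be seen in a toy timeline. The sketch below is our simplification, not PipeDAP's mathematical model: layer names, timestamps, and costs are invented, and real schedulers must also model bandwidth sharing and startup latency:

```python
# Why decoupling all-reduce into reduce-scatter (RS) + all-gather (AG)
# creates scheduling freedom: RS for a layer can start as soon as its
# gradient is produced during backward, while AG only has to finish before
# that layer's forward pass in the next iteration, so AG can be deferred
# and reordered. Toy event timeline, not a performance model.

layers = ["L4", "L3", "L2", "L1"]            # backward visits last layer first
grad_ready = {"L4": 1.0, "L3": 2.0, "L2": 3.0, "L1": 4.0}  # toy timestamps
comm_cost = {"L4": 0.8, "L3": 0.6, "L2": 0.7, "L1": 0.5}

t, schedule = 0.0, []
for lyr in layers:                           # RS in gradient-ready order
    t = max(t, grad_ready[lyr]) + comm_cost[lyr]
    schedule.append(("RS", lyr, t))
for lyr in reversed(layers):                 # AG in forward order: L1, L2, ...
    t += comm_cost[lyr]                      # earliest-needed layers first
    schedule.append(("AG", lyr, t))

for op, lyr, finish in schedule:
    print(f"{op}({lyr}) finishes at t={finish:.1f}")
```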
{"title":"PipeDAP: An Efficient Communication Framework for Scheduling Decoupled All-Reduce Primitives in Distributed DNN Training","authors":"Yunqi Gao;Bing Hu;Mahdi Boloursaz Mashhadi;Wei Wang;Rahim Tafazolli;Mérouane Debbah","doi":"10.1109/TETC.2025.3573522","DOIUrl":"https://doi.org/10.1109/TETC.2025.3573522","url":null,"abstract":"Communication scheduling effectively improves the scalability of distributed deep learning by overlapping computation and communication tasks during training. However, existing communication scheduling frameworks based on tensor partitioning suffer from two fundamental issues: (1) partitioning schemes at the data volume level introduce extensive startup overheads leading to higher energy consumption, and (2) partitioning schemes at the communication primitive level do not provide optimal scheduling resulting in longer training time. In this article, we propose an efficient communication mechanism, namely PipeDAP, which schedules decoupled all-reduce operations in a near-optimal order to minimize the time and energy consumption of training DNN models. We build the mathematical model for PipeDAP and derive the near-optimal scheduling order of the reduce-scatter and all-gather operations. Meanwhile, we leverage simultaneous communication of reduce-scatter and all-gather operations to further reduce the startup overheads. We implement the PipeDAP architecture on PyTorch framework, and apply it for distributed training of benchmark DNN models. Experimental results on two GPU clusters demonstrate that PipeDAP achieves up to 1.82x speedup and saves up to 45.4% of energy consumption compared to the state-of-the-art communication scheduling frameworks.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"13 3","pages":"1170-1184"},"PeriodicalIF":5.4,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145057426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DT-Net: Point Cloud Completion Network With Neighboring Adaptive Denoiser and Splitting-Based Upsampling Transformer
Aihua Mao; Qing Liu; Yuxuan Tang; Sheng Ye; Ran Yi; Minjing Yu; Yong-Jin Liu
Pub Date: 2025-06-02 | DOI: 10.1109/TETC.2025.3573505
Point cloud completion, which involves inferring the missing regions of 3D objects from partial observations, remains a challenging problem in 3D vision and robotics. Existing learning-based frameworks typically use an encoder-decoder architecture to predict the complete point cloud from a global shape representation extracted from the incomplete input, or further introduce a refinement network that optimizes the result in a coarse-to-fine manner; such pipelines fail to capture fine-grained local geometric details and leave noisy points around thin or complex structures. In this article, we propose a novel coarse-to-fine point cloud completion framework called DT-Net, which focuses on coarse point cloud denoising and multi-level upsampling. Specifically, we propose a Neighboring Adaptive Denoiser (NAD) that effectively denoises the coarse point cloud generated by an autoencoder and reduces noise around slender structures, making them clear and well represented. Moreover, we propose a Splitting-based Upsampling Transformer (SUT) for multi-level upsampling, which effectively incorporates the spatial and semantic relationships between local neighborhoods of the point cloud. Extensive qualitative and quantitative experiments demonstrate that our method outperforms state-of-the-art methods on widely used benchmarks.
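The neighborhood-adaptive denoising idea can be illustrated with a classical k-NN smoothing step. This is a minimal sketch of the concept, not the learned NAD module; k, the weighting rule, and the data are assumptions:

```python
# Minimal k-NN denoising sketch: each point is pulled toward the centroid of
# its k nearest neighbors, with a per-point weight that grows for points
# whose neighborhoods are spread out (likely noise/outliers) and shrinks
# for points in tight, already-clean regions.
import numpy as np

def knn_denoise(points, k=8, strength=0.5):
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]            # skip self (column 0)
    centroids = points[nn].mean(axis=1)
    spread = d[np.arange(len(points))[:, None], nn].mean(axis=1, keepdims=True)
    w = strength * spread / (spread.max() + 1e-9)     # adaptive weight
    return points + w * (centroids - points)

cloud = np.random.rand(256, 3)
cloud[0] += 3.0                                       # inject an outlier
moved = np.linalg.norm(knn_denoise(cloud)[0] - cloud[0])
print(moved)                                          # the outlier moves most
```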
{"title":"DT-Net: Point Cloud Completion Network With Neighboring Adaptive Denoiser and Splitting-Based Upsampling Transformer","authors":"Aihua Mao;Qing Liu;Yuxuan Tang;Sheng Ye;Ran Yi;Minjing Yu;Yong-Jin Liu","doi":"10.1109/TETC.2025.3573505","DOIUrl":"https://doi.org/10.1109/TETC.2025.3573505","url":null,"abstract":"Point cloud completion, which involves inferring missing regions of 3D objects from partial observations, remains a challenging problem in 3D vision and robotics. Existing learning-based frameworks typically leverage an encoder-decoder architecture to predict the complete point cloud based on the global shape representation extracted from the incomplete input, or further introduce a refinement network to optimize the obtained complete point cloud in a coarse-to-fine manner, which is unable to capture fine-grained local geometric details and filled with noisy points in the thin or complex structure. In this article, we propose a novel coarse-to-fine point cloud completion framework called DT-Net, by focusing on coarse point cloud denoising and multi-level upsampling. Specifically, we propose a Neighboring Adaptive Denoiser (NAD) to effectively denoise the coarse point cloud generated by an autoencoder, and reduce noise around the slender structures, making them clear and well represented. Moreover, a novel Splitting-based Upsampling Transformer (SUT), which effectively incorporates spatial and semantic relationships between local neighborhoods in the point cloud, is also proposed for multi-level upsampling. Extensive qualitative and quantitative experiments demonstrate that our method outperforms state-of-the-art methods under widely used benchmarks.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"13 3","pages":"1185-1199"},"PeriodicalIF":5.4,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145057427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hybrid Quantum ResNet for Time Series Classification
Dae-Il Noh; Seon-Geun Jeong; Won-Joo Hwang
Pub Date: 2025-04-30 | DOI: 10.1109/TETC.2025.3563944
Residual networks (ResNets) are known to be effective for image classification. However, challenges such as computational time remain because of their large number of parameters. Quantum computing, which exploits quantum entanglement and quantum parallelism, is an emerging computing paradigm that can address this issue. Although quantum advantage is still being studied in many research fields, quantum machine learning is a research area that combines the strengths of quantum computing and machine learning. In this study, we investigated the quantum speedup with respect to the number of parameters in each model for a time-series classification task. This paper proposes a novel hybrid quantum residual network (HQResNet), inspired by the classical ResNet, for time-series classification. HQResNet introduces a classical layer before a quantum convolutional neural network (QCNN), where the QCNN serves as a residual block. This structure enables shortcut connections and is particularly effective for classification tasks without a data re-uploading scheme. We used ultra-wide-band (UWB) channel impulse response data to demonstrate the performance of the proposed algorithm and compared HQResNet against state-of-the-art benchmarks using standard evaluation metrics. The results show that HQResNet achieves high performance with a small number of trainable parameters.
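Structurally, the hybrid residual block is simple: a classical embedding followed by a quantum branch on a shortcut connection. The sketch below is a toy stand-in in which the QCNN is replaced by a placeholder function (the real model evaluates a parameterized quantum circuit); shapes and names are assumptions:

```python
# Structural sketch of the hybrid residual idea: output = z + QCNN(z), where
# z is a classical embedding of the input. The shortcut means the quantum
# branch only learns a residual correction on top of z.
import numpy as np

def classical_layer(x, W, b):
    return np.tanh(x @ W + b)          # classical layer before the QCNN

def qcnn_branch(z):
    # Placeholder for a quantum convolutional circuit evaluated on z; a
    # fixed nonlinear map is used here so the sketch runs end to end.
    return np.sin(z) * 0.1

def hybrid_residual_block(x, W, b):
    z = classical_layer(x, W, b)
    return z + qcnn_branch(z)          # shortcut connection

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # a toy batch of time-series features
out = hybrid_residual_block(x, rng.normal(size=(8, 8)), np.zeros(8))
print(out.shape)                       # (4, 8)
```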
{"title":"Hybrid Quantum ResNet for Time Series Classification","authors":"Dae-Il Noh;Seon-Geun Jeong;Won-Joo Hwang","doi":"10.1109/TETC.2025.3563944","DOIUrl":"https://doi.org/10.1109/TETC.2025.3563944","url":null,"abstract":"Residual networks (ResNet) are known to be effective for image classification. However, challenges such as computational time remain because of the significant number of parameters. Quantum computing using quantum entanglement and quantum parallelism is an emerging computing paradigm that addresses this issue. Although quantum advantage is still studied in many research fields, quantum machine learning is a research area that leverages the strengths of quantum computing and machine learning. In this study, we investigated the quantum speedup with respect to the number of parameters in each model for a time-series classification task. This paper proposes a novel hybrid quantum residual network (HQResNet) inspired by the classical ResNet for time-series classification. HQResNet introduces a classical layer before a quantum convolutional neural network (QCNN), where the QCNN is used as a residual block. These structures enable shortcut connections and are particularly effective in achieving classification tasks without a data re-uploading scheme. We used ultra-wide-band (UWB) channel impulse response data to demonstrate the performance of the proposed algorithm and compared the state-of-the-art benchmarks with HQResNet using evaluation metrics. The results show that HQResNet achieved high performance with a small number of trainable parameters.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"13 3","pages":"1083-1098"},"PeriodicalIF":5.4,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145057432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CAM4: In-Memory Viral Pathogen Genome Classification Using Similarity Search Dynamic Content-Addressable Memory
Zuher Jahshan; Itay Merlin; Esteban Garzón; Leonid Yavits
Pub Date: 2025-04-28 | DOI: 10.1109/TETC.2025.3563201
We present CAM4, a novel embedded dynamic-storage-based similarity search content-addressable memory. CAM4 is designed for in-memory computational genomics applications, particularly the identification and classification of pathogen DNA. CAM4 employs a novel gain cell design and one-hot encoding of DNA bases to address retention time variations and to mitigate potential data loss from pulldown leakage and soft errors in embedded DRAM. CAM4 features performance-overhead-free refresh and data upload, allowing simultaneous search and refresh without performance degradation. CAM4 offers approximate-search versatility in scenarios involving a variety of industrial sequencers with different error profiles. When classifying DNA reads with a 10% error rate, it achieves, on average, a 25% higher F1 score than the MetaCache-GPU and Kraken2 DNA classification tools. Simulated at 1 GHz, CAM4 provides 1,412× and 1,040× average speedup over MetaCache-GPU and Kraken2, respectively.
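A software analogue makes the encoding-plus-search scheme concrete. The sketch below only illustrates the data representation (one-hot bases, minimum-Hamming-distance matching); CAM4 performs this search in parallel inside a dynamic CAM array, and the reference sequences here are invented:

```python
# One-hot DNA encoding + approximate similarity search: each base becomes a
# 4-bit one-hot code, and a read is classified by the reference entry with
# the smallest Hamming distance, tolerating sequencing errors.
import numpy as np

ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def encode(seq):
    return np.array([bit for base in seq for bit in ONE_HOT[base]],
                    dtype=np.uint8)

reference = {"pathogen_x": "ACGTACGT", "pathogen_y": "TTGGCCAA"}
entries = {name: encode(s) for name, s in reference.items()}

read = encode("ACGTACGA")  # last base mis-called; still nearest to pathogen_x
best = min(entries, key=lambda n: int(np.sum(entries[n] != read)))
print(best)                # -> pathogen_x
```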
{"title":"CAM4: In-Memory Viral Pathogen Genome Classification Using Similarity Search Dynamic Content-Addressable Memory","authors":"Zuher Jahshan;Itay Merlin;Esteban Garzón;Leonid Yavits","doi":"10.1109/TETC.2025.3563201","DOIUrl":"https://doi.org/10.1109/TETC.2025.3563201","url":null,"abstract":"Wepresent CAM4, a novel embedded dynamic storage-based similarity search content addressable memory. CAM4 is designated for in-memory computational genomics applications, particularly the identification and classification of pathogen DNA. CAM4 employs a novel gain cell design and one-hot encoding of DNA bases to address retention time variations, and mitigate potential data loss from pulldown leakage and soft errors in embedded DRAM. CAM4 features performance overhead-free refresh and data upload, allowing simultaneous search and refresh without performance degradation. CAM4 offers approximate search versatility in scenarios with a variety of industrial sequencers with different error profiles. When classifying DNA reads with a 10% error rate, it achieves, on average, a 25% higher <inline-formula><tex-math>$F_{1}$</tex-math></inline-formula> score compared to MetaCache-GPU and Kraken2 DNA classification tools. Simulated at 1 GHz, CAM4 provides <inline-formula><tex-math>$1,412times$</tex-math></inline-formula> and <inline-formula><tex-math>$1,040times$</tex-math></inline-formula> average speedup over MetaCache-GPU and Kraken2 respectively.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"13 4","pages":"1341-1355"},"PeriodicalIF":5.4,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145729401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and Implementation of Cost-Effective End-to-End Authentication Protocol for PUF-Enabled IoT Devices
Sourav Roy; Mahabub Hasan Mahalat; Bibhash Sen
Pub Date: 2025-04-28 | DOI: 10.1109/TETC.2025.3563064
The ubiquitous presence of the Internet of Things (IoT) touches every aspect of human life. The low-powered sensors, actuators, and mobile devices in IoT transfer a high volume of security-sensitive data, yet unmonitored IoT devices are highly susceptible to security vulnerabilities, and their operating environment, with minimal or no safeguards, allows physical invasion. Conventional end-to-end authentication protocols are inadequate given the limited resources and ambient working environment of IoT. In this direction, a lightweight and secure end-to-end authentication protocol is proposed for IoT devices with embedded Physical Unclonable Functions (PUFs), which are processed in pairs. PUFs promise a unique hardware-based security solution for resource-constrained devices. The proposed protocol exploits the combined operation of public- and private-key cryptosystems with PUFs, integrating ECC with ECDH and a cryptographic hash function. The security of the protocol is validated through authentication validation, BAN logic, the Scyther tool, and analysis against different adversarial attacks. A performance evaluation and an extensive comparative study highlight its lightweight nature. The practical feasibility of the protocol is verified by an empirical evaluation using an Arbiter PUF implemented on a Xilinx Spartan-3E FPGA with a Raspberry Pi as the IoT device.
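The challenge-response shape of PUF-based authentication can be sketched compactly. The toy below is heavily simplified and is not the paper's protocol: the PUF is simulated with a device-secret keyed hash, and the ECC/ECDH key agreement is replaced by hashing the response, purely to show the enrollment/authentication flow:

```python
# Toy PUF challenge-response flow: the server enrolls challenge-response
# pairs, then authenticates the device by replaying a challenge and checking
# that the device reproduces the response. Standard library only.
import hashlib, hmac, secrets

def puf_response(device_secret, challenge):
    # Stand-in for hardware: a real PUF derives this from silicon variation,
    # so device_secret never exists as stored data on the device.
    return hmac.new(device_secret, challenge, hashlib.sha256).digest()

device_secret = secrets.token_bytes(32)   # embodied by the PUF in hardware
server_crp_db = {}                        # enrolled challenge-response pairs

challenge = secrets.token_bytes(16)       # enrollment phase (trusted setup)
server_crp_db[challenge] = puf_response(device_secret, challenge)

# Authentication phase: the device proves possession of the PUF.
response = puf_response(device_secret, challenge)
assert hmac.compare_digest(response, server_crp_db[challenge])
session_key = hashlib.sha256(response + challenge).digest()
print("authenticated, session key:", session_key.hex()[:16], "...")
```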
{"title":"Design and Implementation of Cost-Effective End-to-End Authentication Protocol for PUF-Enabled IoT Devices","authors":"Sourav Roy;Mahabub Hasan Mahalat;Bibhash Sen","doi":"10.1109/TETC.2025.3563064","DOIUrl":"https://doi.org/10.1109/TETC.2025.3563064","url":null,"abstract":"The ubiquitous presence of Internet of Things (IoT) prospers in every aspect of human life. The low-powered sensors, actuators, and mobile devices in IoT transfer a high volume of security-sensitive data. Unmonitored IoT devices are highly susceptible to security vulnerabilities. Their operating environment, with minimal or no safeguards, allows physical invasion. The conventional end-to-end authentications protocols are inadequate because of the limited resources and ambient working environment of IoT. In this direction, a lightweight and secure end-to-end authentication protocol is proposed for the Physically Unclonability Function (PUF) embedded IoT devices by processing them in pairs. PUF promises to be a unique hardware-based security solution for resource-constrained devices. The proposed protocol exploits the coherent conduct of public and private key-based cryptosystems with PUF. The protocol integrates the concept of ECC with ECDH and the cryptographic hash function. Security of the proposed protocol is validated using authentication validation, BAN logic, Scyther tool, and against different adversarial attacks. The performance evaluation and extensive comparative study of the proposed protocol highlight its lightweight feature. The practical feasibility of the proposed protocol is verified by an empirical evaluation using an Arbiter PUF implemented on Xilinx Spartan-3E FPGA and Raspberry Pi as an IoT device.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"13 3","pages":"1055-1067"},"PeriodicalIF":5.4,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145057430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Balancing Graph Processing Workloads in Heterogeneous CPU-PIM Systems
Sheng Xu; Chun Li; Le Luo; Ming Zheng; Liang Yan; Xingqi Zou; Xiaoming Chen
Pub Date: 2025-04-28 | DOI: 10.1109/TETC.2025.3563249
Processing-in-memory (PIM) offers a promising architecture for alleviating the memory-wall challenge in graph processing applications. The key idea of PIM is to incorporate logic within the memory, thereby exploiting near-data advantages. State-of-the-art PIM-based graph processing accelerators tend to offload as much as possible to the memory in order to maximize near-data benefits, causing significant load imbalance in PIM systems. In this paper, we demonstrate that this strategy is not always beneficial and that host processors still play a vital role in heterogeneous CPU-PIM systems. To this end, we propose CAPLBS, an online contention-aware processing-in-memory load-balance scheduler for graph processing applications in CPU-PIM systems. The core concept of CAPLBS is to steal workload candidates back to host processors, with minimal off-chip data synchronization overhead, when some host processors are idle. To model data contention among workloads and drive the stealing decision, we propose a measurement structure called the Locality Cohesive Subgraph, derived by exploring the connectivity of the input graph and the memory access patterns of the deployed graph applications. Experimental results show that CAPLBS achieves average speedups of 4.8× and 1.3× (up to 9.1× and 1.9×) compared with CPU-only execution and the upper bound of locality-aware fine-grained in-memory atomics, respectively. Moreover, CAPLBS adds no hardware overhead and works well with existing CPU-PIM graph processing accelerators.
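The stealing policy reduces to a priority decision over offloaded work. The loop below is our illustration of that idea, not the CAPLBS implementation; the cost values, subgraph names, and queue structure are invented stand-ins for the Locality Cohesive Subgraph analysis:

```python
# Toy contention-aware work stealing: work is initially offloaded to PIM;
# when host cores go idle, they steal back the candidates whose estimated
# off-chip synchronization cost is lowest, i.e., the subgraphs sharing the
# least data with work that stays in memory.
import heapq

pim_queue = [  # (sync_cost, subgraph_id); lower cost = cheaper to steal
    (0.9, "sg_hub"),    # dense hub subgraph: expensive to move off PIM
    (0.2, "sg_leafA"),  # loosely connected: cheap to steal back
    (0.3, "sg_leafB"),
]
heapq.heapify(pim_queue)

host_work = []
while pim_queue and len(host_work) < 2:  # two idle host cores in this toy
    cost, sg = heapq.heappop(pim_queue)  # contention-aware: cheapest first
    host_work.append(sg)

print("stolen to host:", host_work)               # ['sg_leafA', 'sg_leafB']
print("left on PIM:", [s for _, s in pim_queue])  # ['sg_hub']
```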
{"title":"Balancing Graph Processing Workloads in Heterogeneous CPU-PIM Systems","authors":"Sheng Xu;Chun Li;Le Luo;Ming Zheng;Liang Yan;Xingqi Zou;Xiaoming Chen","doi":"10.1109/TETC.2025.3563249","DOIUrl":"https://doi.org/10.1109/TETC.2025.3563249","url":null,"abstract":"Processing-in-Memory (PIM) offers a promising architecture to alleviate the memory wall challenge in graph processing applications. The key aspect of PIM is to incorporate logic within the memory, thereby leveraging the near-data advantages. State-of-the-art PIM-based graph processing accelerators tend to offload more to the memory in order to maximize near-data benefits, causing significant load imbalance in PIM systems. In this paper, we demonstrate that this intention is not true and that host processors still play a vital role in heterogeneous CPU-PIM systems. For this purpose, we propose CAPLBS, an online contention-aware Processing-in-Memory load-balance scheduler for graph processing applications in CPU-PIM systems. The core concept of CAPLBS is to steal workload candidates back to host processors with minimal off-chip data synchronization overhead when some host processors are idle. To model data contentions among workloads and determine the stealing decision, a measurement structure called Locality Cohesive Subgraph is proposed by deeply exploring the connectivity of the input graph and the memory access patterns of deployed graph applications. Experimental results show that CAPLBS achieved an average speed-up of 4.8× and 1.3× (up to 9.1× and 1.9×) compared with CPU-only and the upper bound of locality-aware fine-grained in-memory atomics. Moreover, CAPLBS adds no hardware overhead and works well with existing CPU-PIM graph processing accelerators.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"13 3","pages":"1068-1082"},"PeriodicalIF":5.4,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145057429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}