Fog computing based public key encryption with multi-keyword search for Internet of vehicles
Pub Date: 2025-06-13 | DOI: 10.1016/j.jpdc.2025.105131
Mandira Banik, Sanjay Kumar
Fog computing is a paradigm developed to give end users access to real-time services in the Internet of Vehicles (IoV). Achieving secure and efficient data sharing in such a dynamic system architecture is a major challenge. Searchable encryption (SE) is a promising cryptographic primitive that keeps data searchable while preserving its confidentiality. Most existing schemes, however, are vulnerable to attacks that exploit search leakage. We therefore develop a fog-based searchable public key encryption scheme (FC-PEMKS) that achieves forward security and multi-keyword search in IoV networks. The security proof shows that the scheme achieves indistinguishability of trapdoors and keyword indexes. Comparative simulations and performance analysis demonstrate the viability and efficiency of FC-PEMKS in fog-enabled vehicular networks.
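For readers unfamiliar with the primitive, the following toy sketch illustrates the generic searchable-encryption workflow (index generation, trapdoor generation, and a server-side test) that PEKS-style schemes such as FC-PEMKS instantiate. It is not the paper's construction: FC-PEMKS is public-key and forward-secure, whereas this sketch uses a keyed hash purely as a placeholder to show the data flow; all names and keywords are illustrative.

```python
import hmac, hashlib, os

# Toy stand-in for the searchable-encryption workflow. A real PEKS scheme is
# public-key (typically pairing-based); HMAC here only models the data flow.

def keygen():
    return os.urandom(32)  # stands in for the receiver's key material

def build_index(key, keywords):
    # Encrypted searchable index: one opaque tag per keyword.
    return {hmac.new(key, w.encode(), hashlib.sha256).hexdigest() for w in keywords}

def trapdoor(key, keywords):
    # Multi-keyword trapdoor: one tag for every queried keyword.
    return [hmac.new(key, w.encode(), hashlib.sha256).hexdigest() for w in keywords]

def test(index, td):
    # Fog node / server test: match without learning the plaintext keywords.
    return all(tag in index for tag in td)

k = keygen()
idx = build_index(k, ["brake", "alert", "junction-12"])
print(test(idx, trapdoor(k, ["brake", "alert"])))   # True
print(test(idx, trapdoor(k, ["engine"])))           # False
```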
{"title":"Fog computing based public key encryption with multi-keyword search for Internet of vehicles","authors":"Mandira Banik, Sanjay Kumar","doi":"10.1016/j.jpdc.2025.105131","DOIUrl":"10.1016/j.jpdc.2025.105131","url":null,"abstract":"<div><div>Fog-based computation is an interesting computing paradigm developed for giving end users access to real-time services in the Internet of Vehicles (IoV). Achieving safe and effective data sharing is a huge challenge in such a dynamic system architecture. As a promising cryptographic primitive, searchable encryption (SE) aims to maintain data searchability while maintaining data confidentiality. Nonetheless, the majority of current methods are open to attacks by leaking exploitation. So, we develop a fog-based searchable public key encryption scheme (FC-PEMKS) that achieves forward security and multi-keyword search in the IoV network. The security proof shows that our model achieves the indistinguishability of trapdoor and keyword index. The outcomes of the comparative simulations and performance analysis demonstrate the viability and effectiveness of the FC-PEMKS scheme in fog-enabled vehicular networks.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"204 ","pages":"Article 105131"},"PeriodicalIF":3.4,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144271563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SHAP-based intrusion detection in IoT networks using quantum neural networks on IonQ hardware
Pub Date: 2025-06-13 | DOI: 10.1016/j.jpdc.2025.105133
K Rajkumar, S. Mercy Shalinie
Securing IoT networks against cyber-attacks, especially Distributed Denial of Service (DDoS) attacks, is a growing challenge because such attacks can disrupt services and overwhelm network resources. This study introduces a novel post-processing methodology that integrates Explainable AI (XAI) with Quantum Neural Networks (QNNs) to enhance the interpretability of DDoS attack detection. We utilize the CICFlowMeter tool for feature extraction, processing bidirectional network traffic data and generating up to 87 distinct features. Notably, CICFlowMeter removes potentially tampered features such as IP addresses and ports to prevent manipulation, addressing the limitations of relying on these features in the presence of attackers. After the QNN generates an expectation value for a given input, SHAP (SHapley Additive exPlanations) values are applied to interpret the contributions of individual features to the decision-making process. Although the QNN output indicates whether a network flow is benign or malicious, the quantum model's complexity makes it difficult to interpret. Using SHAP values, we identify which features, such as IP addresses, ports, and traffic patterns, significantly influence the QNN's classification, providing human-understandable explanations for the model's predictions. For evaluation, we used the CIC-IoT 2022 dataset and the proposed SDN-DDoS24 dataset, with SDN-DDoS24 outperforming the others when integrated with the proposed methodology. The QNN was implemented on IonQ quantum hardware through Amazon Braket, achieving an expectation value of 0.98 with a low latency of 113 milliseconds, making it suitable for applications requiring both precision and speed. This study demonstrates that integrating XAI with QNNs not only improves DDoS attack detection accuracy but also enhances transparency, making the model more trustworthy for real-world cybersecurity applications. By offering clear explanations of model behavior, the approach ensures that security experts can make informed decisions based on the quantum-enhanced detection system, improving its reliability and usability in dynamic network environments.
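As a rough illustration of the post-processing step, the sketch below applies SHAP's model-agnostic KernelExplainer to a black-box scoring function standing in for the QNN's expectation value. The stand-in function, feature count, and sample data are assumptions for demonstration; the actual model runs on IonQ hardware via Amazon Braket.

```python
import numpy as np
import shap  # pip install shap

rng = np.random.default_rng(0)
X_background = rng.random((50, 4))   # background flows (4 toy features)
X_explain = rng.random((5, 4))       # flows whose classification we explain

def qnn_expectation(X):
    # Placeholder for the QNN's expectation value in [-1, 1]; the real scorer
    # would dispatch circuits to the quantum backend.
    return np.tanh(X @ np.array([2.0, -1.0, 0.5, 0.0]) - 0.7)

# KernelExplainer treats the scorer as a black box, which is exactly why it
# suits a hard-to-interpret quantum model.
explainer = shap.KernelExplainer(qnn_expectation, X_background)
shap_values = explainer.shap_values(X_explain, nsamples=200)
print(shap_values.shape)  # (5, 4): per-flow, per-feature contributions
```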
{"title":"SHAP-based intrusion detection in IoT networks using quantum neural networks on IonQ hardware","authors":"K Rajkumar, S. Mercy Shalinie","doi":"10.1016/j.jpdc.2025.105133","DOIUrl":"10.1016/j.jpdc.2025.105133","url":null,"abstract":"<div><div>Securing IoT networks against cyber-attacks, especially Distributed Denial of Service (DDoS) attacks, is a growing challenge due to their ability to disrupt services and overwhelm network resources. This study introduces a novel post-processing methodology that integrates Explainable AI (XAI) with Quantum Neural Networks (QNN) to enhance the interpretability of DDoS attack detection. We utilize the CICFlowMeter tool for feature extraction, processing bidirectional network traffic data and generating up to 87 distinct features. Notably, the CICFlowMeter removes potentially tampered features such as IP addresses and ports to prevent manipulation, addressing the limitations associated with the use of these features in the presence of attackers. After a QNN generates expectation values for a given input, SHAP (SHapley Additive exPlanations) values are applied to interpret the contributions of individual features in the decision-making process. Although the QNN output indicates whether a network flow is benign or malicious, the quantum model's complexity makes it difficult to interpret. By using SHAP values, we identify which features such as IP addresses, ports, and traffic patterns significantly influence the QNN’s classification, providing human-understandable explanations for the model's predictions. For evaluation, we used the CIC-IoT 2022and proposed SDN-DDoS24 datasets, with SDN-DDoS24 outperforming others when integrated with the proposed methodology. The QNN was implemented on IonQ quantum hardware through Amazon Braket, achieving an expectation value of 0.98 with a low latency of 113 milliseconds, making it suitable for applications requiring both precision and speed. This study demonstrates that integrating XAI with QNN not only improves DDoS attack detection accuracy but also enhances transparency, making the model more trustworthy for real-world cybersecurity applications. By offering clear explanations of model behavior, the approach ensures that security experts can make informed decisions based on the quantum-enhanced detection system, improving its reliability and usability in dynamic network environments.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"204 ","pages":"Article 105133"},"PeriodicalIF":3.4,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144321377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Line formation and scattering in silent programmable matter
Pub Date: 2025-06-11 | DOI: 10.1016/j.jpdc.2025.105129
Alfredo Navarra, Francesco Piselli, Giuseppe Prencipe
Programmable Matter (PM) has been widely investigated in recent years. It refers to a substance with the ability to change its physical properties (e.g., shape or color) in a programmable way. In this paper, we adopt the SILBOT model, where the particles live and move on a triangular grid, are asynchronous in their computations and movements, and possess no direct means of communication (silent) nor any memory of past events (oblivious).
Within SILBOT, we study Spanning problems, i.e., problems where the particles are required to suitably span the whole grid. We first address the Line Formation problem, where the particles must end up in a configuration in which they all lie on a line, i.e., they are aligned and connected. We then deal with the more general Scattering problem: starting from any initial configuration, the aim is to reach a final configuration in which no two particles occupy neighboring nodes. Furthermore, we investigate configurations where some nodes of the grid can be occupied by unmovable elements (i.e., obstacles), from both theoretical and experimental viewpoints.
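As an aside, the two target predicates are easy to state programmatically. The sketch below checks them on a triangular grid represented in axial coordinates (each node has six neighbours); the coordinate convention and helper names are ours, not the paper's formalism.

```python
# Axial coordinates for a triangular grid: each node has six neighbours.
NEIGHBOURS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]

def is_scattered(particles):
    # Scattering target: no two particles on neighbouring nodes.
    occ = set(particles)
    return all((p[0] + dx, p[1] + dy) not in occ
               for p in occ for dx, dy in NEIGHBOURS)

def is_line(particles):
    # Line Formation target: particles aligned along a single grid direction
    # and connected (consecutive steps are all the same unit neighbour step).
    pts = sorted(particles)
    steps = {(b[0] - a[0], b[1] - a[1]) for a, b in zip(pts, pts[1:])}
    return len(steps) <= 1 and steps <= set(NEIGHBOURS)

print(is_line([(0, 0), (1, 0), (2, 0)]))        # True: aligned and connected
print(is_scattered([(0, 0), (2, 0), (0, 2)]))   # True: no neighbouring pair
```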
{"title":"Line formation and scattering in silent programmable matter","authors":"Alfredo Navarra , Francesco Piselli , Giuseppe Prencipe","doi":"10.1016/j.jpdc.2025.105129","DOIUrl":"10.1016/j.jpdc.2025.105129","url":null,"abstract":"<div><div>Programmable Matter (PM) has been widely investigated in recent years. It refers to some kind of substance with the ability to change its physical properties (e.g., shape or color) in a programmable way. In this paper, we refer to the <span><math><mi>SILBOT</mi></math></span> model, where the particles live and move on a triangular grid, are asynchronous in their computations and movements, and do not possess any direct means of communication (silent) or memory of past events (oblivious).</div><div>Within <span><math><mi>SILBOT</mi></math></span>, we aim at studying <em>Spanning</em> problems, i.e., problems where the particles are required to suitably span all over the grid. We first address the <span>Line Formation</span> problem where the particles are required to end up in a configuration where they all lie on a line, i.e., they are aligned and connected. Secondly, we deal with the more general <span>Scattering</span> problem: starting from any initial configuration, we aim at reaching a final one where no particles occupy neighboring nodes. Furthermore, we investigate configurations where some nodes of the grid can be occupied by unmovable elements (i.e., obstacles) from both theoretical and experimental view points.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"204 ","pages":"Article 105129"},"PeriodicalIF":3.4,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144271562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mitigating DDoS attacks in containerized environments: A comparative analysis of Docker and Kubernetes
Pub Date: 2025-06-11 | DOI: 10.1016/j.jpdc.2025.105130
Yung-Ting Chuang, Chih-Han Tu
Containerization has become the primary method for deploying applications, with web services being the most prevalent workload. However, exposing server IP addresses to external connections leaves containerized services vulnerable to DDoS attacks, which can deplete server resources and block legitimate user access. To address this issue, we implement twelve different mitigation strategies, test them across three common types of web services, and conduct experiments on both Docker and Kubernetes deployment platforms. Furthermore, this study introduces a cross-platform, orchestration-aware evaluation framework that simulates realistic multi-service workloads and analyzes defense-strategy performance under varying concurrency conditions. Experimental results indicate that Docker excels at managing whitelisted traffic and delaying attacker responses, while Kubernetes achieves low completion times, minimal response times, and low failure rates by processing all requests simultaneously. Based on these findings, we provide actionable insights for selecting mitigation strategies tailored to different orchestration environments and workload patterns, offering practical guidance for securing containerized deployments against low-rate DDoS threats. Our work not only provides empirical performance evaluations but also reveals deployment-specific trade-offs, offering strategic recommendations for building resilient cloud-native infrastructures.
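The abstract does not enumerate the twelve strategies, but per-client rate limiting is a typical building block of such defenses. The sketch below shows a minimal token-bucket limiter of the kind a reverse proxy in front of a container might apply; the rates, names, and structure are illustrative assumptions, not the paper's implementation.

```python
import time

class TokenBucket:
    """Refill `rate` tokens per second up to `burst`; one token per request."""
    def __init__(self, rate, burst):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request dropped or delayed by the proxy

buckets = {}  # one bucket per client IP

def admit(ip, rate=10, burst=20):
    return buckets.setdefault(ip, TokenBucket(rate, burst)).allow()
```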
{"title":"Mitigating DDoS attacks in containerized environments: A comparative analysis of Docker and Kubernetes","authors":"Yung-Ting Chuang, Chih-Han Tu","doi":"10.1016/j.jpdc.2025.105130","DOIUrl":"10.1016/j.jpdc.2025.105130","url":null,"abstract":"<div><div>Containerization has become the primary method for deploying applications, with web services being the most prevalent. However, exposing server IP addresses to external connections renders containerized services vulnerable to DDoS attacks, which can deplete server resources and hinder legitimate user access. To address this issue, we implement twelve different mitigation strategies, test them across three common types of web services, and conduct experiments on both Docker and Kubernetes deployment platforms. Furthermore, this study introduces a cross-platform, orchestration-aware evaluation framework that simulates realistic multi-service workloads and analyzes defense strategy performance under varying concurrency conditions. Experimental results indicate that Docker excels in managing white-listed traffic and delaying attacker responses, while Kubernetes achieves low completion times, minimum response times, and low failure rates by processing all requests simultaneously. Based on these findings, we provide actionable insights for selecting appropriate mitigation strategies tailored to different orchestration environments and workload patterns, offering practical guidance for securing containerized deployments against low-rate DDoS threats. Our work not only provides empirical performance evaluations but also reveals deployment-specific trade-offs, offering strategic recommendations for building resilient cloud-native infrastructures.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"204 ","pages":"Article 105130"},"PeriodicalIF":3.4,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144280939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging Multi-Instance GPUs through moldable task scheduling
Pub Date: 2025-06-06 | DOI: 10.1016/j.jpdc.2025.105128
Jorge Villarrubia, Luis Costero, Francisco D. Igual, Katzalin Olcoz
NVIDIA MIG (Multi-Instance GPU) allows partitioning a physical GPU into multiple logical instances with fully isolated resources, which can be dynamically reconfigured. This work highlights the untapped potential of MIG through moldable task scheduling with dynamic reconfigurations. Specifically, we formulate a makespan minimization problem for multi-task execution under MIG constraints. Our profiling shows that assuming task work to be monotonic in the allotted resources, as is usual in multicore scheduling, is not viable here. Relying on a state-of-the-art proposal that does not require this assumption, we present FAR, a 3-phase algorithm to solve the problem. Phase 1 of FAR builds on a classical task moldability method, phase 2 combines Longest Processing Time First and List Scheduling with a novel repartitioning tree heuristic tailored to MIG constraints, and phase 3 employs local search via task moves and swaps. FAR schedules tasks in batches offline, concatenating their schedules on the fly in a way that favors resource reuse. Excluding reconfiguration costs, the List Scheduling proof yields an approximation factor of 7/4 on the NVIDIA A30 model; we adapt the technique to the particular constraints of the NVIDIA A100/H100 to obtain an approximation factor of 2. Including the reconfiguration cost, our real-world experiments show a makespan no worse than 1.22× the optimum for a well-known benchmark suite, and 1.10× for synthetic inputs inspired by real kernels. We obtain good experimental results both for individual batches of tasks and for the concatenation of batches, with large improvements over the state of the art and over proposals without GPU reconfiguration. Moreover, we show that the proposed heuristics adapt well to tasks of very different characteristics. Beyond the specific algorithm, the paper demonstrates the research potential of MIG technology and suggests useful metrics, workload characterizations, and evaluation techniques for future work in this field.
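To make the phase-2 backbone concrete, the toy sketch below combines Longest Processing Time First ordering with List Scheduling on a fixed set of GPU instances. FAR additionally chooses per-task instance sizes and repartitions MIG slices via its tree heuristic, which this simplified sketch (with made-up runtimes) omits.

```python
import heapq

def lpt_list_schedule(runtimes, n_instances):
    """List Scheduling in LPT order: longest task first, onto the instance
    that becomes free earliest. Returns placement and makespan."""
    free_at = [(0.0, i) for i in range(n_instances)]  # (free time, instance)
    heapq.heapify(free_at)
    placement = {}
    for task, t in sorted(runtimes.items(), key=lambda kv: -kv[1]):  # LPT
        start, inst = heapq.heappop(free_at)
        placement[task] = (inst, start, start + t)
        heapq.heappush(free_at, (start + t, inst))
    makespan = max(end for _, _, end in placement.values())
    return placement, makespan

tasks = {"k1": 8.0, "k2": 5.0, "k3": 4.0, "k4": 3.0, "k5": 2.0}
plan, ms = lpt_list_schedule(tasks, n_instances=2)
print(ms)  # 11.0: optimal here, since total work 22.0 splits evenly over 2
```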
{"title":"Leveraging Multi-Instance GPUs through moldable task scheduling","authors":"Jorge Villarrubia, Luis Costero, Francisco D. Igual, Katzalin Olcoz","doi":"10.1016/j.jpdc.2025.105128","DOIUrl":"10.1016/j.jpdc.2025.105128","url":null,"abstract":"<div><div>NVIDIA MIG (Multi-Instance GPU) allows partitioning a physical GPU into multiple logical instances with fully-isolated resources, which can be dynamically reconfigured. This work highlights the untapped potential of MIG through moldable task scheduling with dynamic reconfigurations. Specifically, we propose a makespan minimization problem for multi-task execution under MIG constraints. Our profiling shows that assuming monotonicity in task work with respect to resources is not viable, as is usual in multicore scheduling. Relying on a state-of-the-art proposal that does not require such an assumption, we present <span>FAR</span>, a 3-phase algorithm to solve the problem. Phase 1 of FAR builds on a classical task moldability method, phase 2 combines Longest Processing Time First and List Scheduling with a novel repartitioning tree heuristic tailored to MIG constraints, and phase 3 employs local search via task moves and swaps. <span>FAR</span> schedules tasks in batches offline, concatenating their schedules on the fly in an improved way that favors resource reuse. Excluding reconfiguration costs, the List Scheduling proof shows an approximation factor of 7/4 on the NVIDIA A30 model. We adapt the technique to the particular constraints of an NVIDIA A100/H100 to obtain an approximation factor of 2. Including the reconfiguration cost, our real-world experiments reveal a makespan with respect to the optimum no worse than 1.22× for a well-known suite of benchmarks, and 1.10× for synthetic inputs inspired by real kernels. We obtain good experimental results for each batch of tasks, but also in the concatenation of batches, with large improvements over the state-of-the-art and proposals without GPU reconfiguration. Moreover, we show that the proposed heuristics allow a correct adaptation to tasks of very different characteristics. Beyond the specific algorithm, the paper demonstrates the research potential of the MIG technology and suggests useful metrics, workload characterizations and evaluation techniques for future work in this field.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"204 ","pages":"Article 105128"},"PeriodicalIF":3.4,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144254815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Privacy-enabled academic certificate authentication and deep learning-based student performance prediction system using hyperledger blockchain technology
Pub Date: 2025-06-05 | DOI: 10.1016/j.jpdc.2025.105119
Sangeetha A.S., Shunmugan S
Blockchain systems do not rely on trust for electronic transactions and have emerged as a popular technology due to attributes like immutability, transparency, distributed storage, and decentralized control. Student certificates and skill verification play crucial roles in job applications and other purposes. In traditional systems, certificate forgery is a common problem, especially in online education, and processes such as issuing and verifying student certificates, along with predicting student performance for higher education or job recruitment, are often lengthy and time-consuming. Integrating blockchain into certificate verification protocols ensures authenticity and significantly reduces processing times. Hence, this research introduces a novel secure privacy-preservation-based academic certificate authentication system (CertAuthSystem) for verifying students' academic certificates. The CertAuthSystem comprises several entities: Student, System, University, Blockchain, and Company. The university issues certificates to students, which are stored on the blockchain; when a student applies for a job or scholarship, they transmit the certificate and the blockID to the organization, which performs verification based on them. Moreover, the student's performance is predicted by a classifier named Deep Long Short-Term Memory (DLSTM). CertAuthSystem is evaluated on measures such as validation time, memory, throughput, and execution time, achieving values of 53.412 ms, 86.6 MB, 94.876 Mbps, and 73.57 ms, respectively, for block size 7. Finally, the prediction analysis of the DLSTM classifier, based on evaluation metrics such as precision, recall, and F-measure, attained superior values of 90.77 %, 92.99 %, and 91.86 %, respectively.
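A minimal sketch of the verification idea follows: the issuer anchors a hash of the certificate on the ledger, and a verifier recomputes the hash and compares it against the stored entry. The in-memory "ledger", field names, and blockID format are illustrative stand-ins, not the paper's Hyperledger data model.

```python
import hashlib, json, time

ledger = {}  # blockID -> certificate hash (stands in for the blockchain)

def cert_digest(cert: dict) -> str:
    # Canonical JSON so the same certificate always hashes identically.
    return hashlib.sha256(json.dumps(cert, sort_keys=True).encode()).hexdigest()

def issue(cert: dict) -> str:
    # University side: anchor the certificate hash and hand the blockID back.
    block_id = f"blk-{len(ledger)}-{int(time.time())}"
    ledger[block_id] = cert_digest(cert)
    return block_id

def verify(cert: dict, block_id: str) -> bool:
    # Company side: recompute the hash and compare with the anchored value.
    return ledger.get(block_id) == cert_digest(cert)

cert = {"student": "S123", "degree": "B.Tech", "year": 2025}
bid = issue(cert)
print(verify(cert, bid))                    # True: certificate is authentic
print(verify({**cert, "year": 2024}, bid))  # False: tampering detected
```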
{"title":"Privacy-enabled academic certificate authentication and deep learning-based student performance prediction system using hyperledger blockchain technology","authors":"Sangeetha A․S , Shunmugan S","doi":"10.1016/j.jpdc.2025.105119","DOIUrl":"10.1016/j.jpdc.2025.105119","url":null,"abstract":"<div><div>Blockchain systems do not rely on trust for electronic transactions and it emerged as a popular technology due to its attributes like immutability, transparency, distributed storage, and decentralized control. Student certificates and skill verification play crucial roles in job applications and other purposes. In traditional systems, certificate forgery is a common problem, especially in online education. Processes, such as issuing and verifying student certifications along with student performance prediction for higher education or job recruitment are often lengthy and time-consuming. Integrating blockchain into certificate verification protocols offers authenticity and significantly reduces processing times. Hence, this research introduced a novel secure privacy preservation-based academic certificate authentication system (CertAuthSystem) for verifying the academic certificates of students. The CertAuthSystem contains different entities, such as Student, System, University, Blockchain, and Company. The university issues certificates to students, which are stored in Blockchain, and when the student applies for a job/scholarship, he/she transmits the certificate and the blockID to the organization, based on which verification is performed. Moreover, the student’s performance is predicted by a classifier named Deep Long Short-Term Memory (DLSTM). Then, CertAuthSystem is examined for its superiority considering measures, like validation time, memory, throughput and execution time and has achieved values of 53.412 ms, 86.6 MB, 94.876 Mbps, and 73.57 ms, correspondingly for block size 7. Finally, the prediction analysis of the DLSTM classifier is done based on evaluation metrics, such as precision, recall and F measure, which attained superior values of 90.77 %, 92.99 %, and 91.86 %.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"204 ","pages":"Article 105119"},"PeriodicalIF":3.4,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144289001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power, energy, and performance analysis of single- and multi-threaded applications in the ARM ThunderX2
Pub Date: 2025-06-02 | DOI: 10.1016/j.jpdc.2025.105118
Ibai Calero, Salvador Petit, María E. Gómez, Julio Sahuquillo
Energy efficiency has been a major concern in data centers, and the problem is exacerbated as data centers continue to grow in size. However, the lack of tools to measure and manage this energy at a fine granularity (e.g., per processor core or last-level cache) has slowed research progress on this topic. Understanding where (i.e., in which components) and when (at which point in time) energy consumption translates into only minor performance improvements is of paramount importance for designing any energy-aware scheduler. This paper characterizes the relationship between energy consumption and performance in a 28-core ARM ThunderX2 processor for both single-threaded and multi-threaded applications.
This paper shows that single-threaded applications with high CPU activity maintain their performance in spite of the inter-application interference at shared resources, but this comes at the expense of higher power consumption. Conversely, applications that heavily utilize the L3 cache and memory consume less power but suffer significant performance degradation as interference levels rise.
In contrast, multi-threaded applications show two distinct behaviors. On the one hand, some of them experience significant performance gains when they execute in a higher number of cores with more threads, which outweighs the increase in power consumption, leading to high energy efficiency.
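A characterization like this ultimately rests on turning sampled power traces into comparable metrics. The sketch below shows two standard ones, energy (integrated power) and the energy-delay product; the trace values and sampling interval are made up for illustration and are not the paper's measurements.

```python
def energy_joules(power_samples_w, interval_s):
    # Rectangle-rule integration of a power trace sampled every interval_s.
    return sum(power_samples_w) * interval_s

def edp(power_samples_w, interval_s, runtime_s):
    # Energy-delay product: penalizes both high energy and long runtime.
    return energy_joules(power_samples_w, interval_s) * runtime_s

trace = [95.0, 102.5, 110.0, 108.0, 99.5]   # watts, one sample per second
print(energy_joules(trace, 1.0))            # 515.0 J
print(edp(trace, 1.0, runtime_s=5.0))       # 2575.0 J*s
```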
{"title":"Power, energy, and performance analysis of single- and multi-threaded applications in the ARM ThunderX2","authors":"Ibai Calero, Salvador Petit, María E. Gómez, Julio Sahuquillo","doi":"10.1016/j.jpdc.2025.105118","DOIUrl":"10.1016/j.jpdc.2025.105118","url":null,"abstract":"<div><div>Energy efficiency has been a major concern in data centers, and the problem is exacerbated as its size continues to rise. However, the lack of tools to measure and handle this energy at a fine granularity (e.g., processor core or last-level cache) has translated into slow research advances in this topic. Understanding where (i.e., which components) and when (the point in time) energy consumption translates into minor performance improvements is of paramount importance to design any energy-aware scheduler. This paper characterizes the relationship between energy consumption and performance in a 28-core ARM ThunderX2 processor for both single-threaded and multi-threaded applications.</div><div>This paper shows that single-threaded applications with high CPU activity maintain their performance in spite of the inter-application interference at shared resources, but this comes at the expense of higher power consumption. Conversely, applications that heavily utilize the L3 cache and memory consume less power but suffer significant performance degradation as interference levels rise.</div><div>In contrast, multi-threaded applications show two distinct behaviors. On the one hand, some of them experience significant performance gains when they execute in a higher number of cores with more threads, which outweighs the increase in power consumption, leading to high energy efficiency.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"204 ","pages":"Article 105118"},"PeriodicalIF":3.4,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144242749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ConCeal: A Winograd convolution code template for optimising GCU in parallel
Pub Date: 2025-05-21 | DOI: 10.1016/j.jpdc.2025.105108
Tian Chen, Yu-an Tan, Thar Baker, Haokai Wu, Qiuyu Zhang, Yuanzhang Li
By minimising arithmetic operations, Winograd convolution substantially reduces the computational complexity of convolution, a pivotal operation in the training and inference stages of Convolutional Neural Networks (CNNs). This study leverages the hardware architecture and capabilities of Shanghai Enflame Technology's AI accelerator, the General Computing Unit (GCU). We develop a code template named ConCeal for Winograd convolution with 3 × 3 kernels, employing a set of interrelated optimisations, including task partitioning, memory layout design, and parallelism. These optimisations fully exploit the GCU's computing resources by optimising dataflow and parallelising the execution of tasks on GCU cores, thereby accelerating Winograd convolution. Moreover, the optimisations integrated in the template apply efficiently to other operators, such as max pooling. Using this template, we implement and assess the performance of four Winograd convolution operators on the GCU. The experimental results show that ConCeal operators achieve a maximum of 2.04× and an average of 1.49× speedup over the fastest GEMM-based convolution implementations on the GCU. Additionally, the ConCeal operators demonstrate competitive or superior computing-resource utilisation in certain ResNet and VGG convolution layers when compared to cuDNN on an RTX 2080.
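For context, the arithmetic that ConCeal maps onto the GCU is the standard Winograd F(2×2, 3×3) algorithm (Lavin and Gray): a 4×4 input tile and a 3×3 kernel yield a 2×2 output with 16 element-wise multiplications instead of 36. The sketch below verifies the transform against direct convolution; ConCeal's actual contributions (task partitioning, memory layout, and parallelism on GCU cores) are not shown.

```python
import numpy as np

# Standard F(2x2, 3x3) transform matrices from Lavin & Gray.
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], float)
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def winograd_f2x2_3x3(tile, kernel):
    U = G @ kernel @ G.T          # transform the 3x3 kernel to 4x4
    V = Bt @ tile @ Bt.T          # transform the 4x4 input tile
    return At @ (U * V) @ At.T    # 16 multiplies, then inverse transform

rng = np.random.default_rng(1)
d, g = rng.random((4, 4)), rng.random((3, 3))
# Direct 2x2 "valid" correlation of the tile with the kernel, for comparison.
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
print(np.allclose(winograd_f2x2_3x3(d, g), direct))  # True
```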
{"title":"ConCeal: A Winograd convolution code template for optimising GCU in parallel","authors":"Tian Chen , Yu-an Tan , Thar Baker , Haokai Wu , Qiuyu Zhang , Yuanzhang Li","doi":"10.1016/j.jpdc.2025.105108","DOIUrl":"10.1016/j.jpdc.2025.105108","url":null,"abstract":"<div><div>By minimising arithmetic operations, Winograd convolution substantially reduces the computational complexity of convolution, a pivotal operation in the training and inference stages of Convolutional Neural Networks (CNNs). This study leverages the hardware architecture and capabilities of Shanghai Enflame Technology's AI accelerator, the General Computing Unit (GCU). We develop a code template named ConCeal for Winograd convolution with 3 × 3 kernels, employing a set of interrelated optimisations, including task partitioning, memory layout design, and parallelism. These optimisations fully exploit GCU's computing resources by optimising dataflow and parallelizing the execution of tasks on GCU cores, thereby enhancing Winograd convolution. Moreover, the integrated optimisations in the template are efficiently applicable to other operators, such as max pooling. Using this template, we implement and assess the performance of four Winograd convolution operators on GCU. The experimental results showcase that Conceal operators achieve a maximum of 2.04× and an average of 1.49× speedup compared to the fastest GEMM-based convolution implementations on GCU. Additionally, the ConCeal operators demonstrate competitive or superior computing resource utilisation in certain ResNet and VGG convolution layers when compared to cuDNN on RTX2080.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"203 ","pages":"Article 105108"},"PeriodicalIF":3.4,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144114726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}