Pub Date: 2023-09-01 | DOI: 10.1016/j.parco.2023.103034
Bin Yu, Xu Lu, Cong Tian, Meng Wang, Chu Chen, Ming Lei, Zhenhua Duan
Runtime verification is a lightweight verification technique that checks whether a monitored program execution satisfies a desired property. Online runtime verification faces challenges in both efficiency and property expressiveness, which limit its widespread adoption, yet little research addresses both issues together. Building on a distributed network, we propose an adaptively parallel approach to verify full regular temporal properties of C programs online. During program execution, segments of the generated state sequence are verified concurrently by distributed machines, and within each multi-core machine a segment is verified with an adaptive number of threads. Experimental results demonstrate that, while supporting more expressive properties, our approach achieves a speedup of 2.5X–5.0X over other runtime verification approaches.
Title: Adaptively parallel runtime verification based on distributed network for temporal properties
Journal: Parallel Computing, volume 117, Article 103034
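The segment-wise idea in the abstract above can be illustrated with a small sketch. This is not the authors' algorithm: it assumes the property is given as a small deterministic finite automaton (the hypothetical lock protocol `DELTA` below), summarizes each trace segment as a mapping from every possible start state to its end state, and composes the summaries in order. All names, the automaton, and the use of a thread pool are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical monitor: 0 = unlocked, 1 = locked, 2 = error (sink state).
STATES = (0, 1, 2)
ACCEPTING = {0}
DELTA = {
    (0, "acq"): 1, (0, "rel"): 2, (0, "use"): 2,
    (1, "acq"): 2, (1, "rel"): 0, (1, "use"): 1,
    (2, "acq"): 2, (2, "rel"): 2, (2, "use"): 2,
}

def run_sequential(trace, start=0):
    """Run the monitor over a trace from the given start state."""
    state = start
    for event in trace:
        state = DELTA[(state, event)]
    return state

def summarize(segment):
    """Summary of a segment: the end state reached from every start state."""
    return tuple(run_sequential(segment, s) for s in STATES)

def split(trace, parts):
    """Cut the trace into at most `parts` contiguous segments."""
    size = max(1, -(-len(trace) // parts))  # ceiling division
    return [trace[i:i + size] for i in range(0, len(trace), size)]

def verify_parallel(trace, workers=4):
    """Summarize segments concurrently, then compose summaries in order."""
    segments = split(trace, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        summaries = list(pool.map(summarize, segments))
    state = 0
    for summary in summaries:
        state = summary[state]
    return state in ACCEPTING
```

Because each summary covers every possible start state, segments can be processed before the result of the preceding segment is known; the price is a factor of `len(STATES)` extra work per segment.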
Pub Date: 2023-09-01 | DOI: 10.1016/j.parco.2023.103033
R. Halver, Christoph Junghans, G. Sutmann
Title: Using heterogeneous GPU nodes with a Cabana-based implementation of MPCD
Journal: Parallel Computing, volume 117, Article 103033
Pub Date: 2023-09-01 | DOI: 10.1016/j.parco.2023.103025
Srđan Daniel Simić, Nikola Tanković, Darko Etinger
Cloud computing is one of the critical technologies that meet the demand of various businesses for the high-capacity computational processing power needed to gain knowledge from their ever-growing business data. When using cloud computing resources for Big Data processing, companies face the challenge of determining the optimal use of resources within their business processes. Miscalculating the necessary resources directly affects their budget and can delay the cycle time of their key processes. This study investigates the simulation of cloud resource optimization for Big Data workflows modeled with the Business Process Modeling Notation (BPMN). To this end, a BPMN performance evaluation framework was developed. The framework’s capabilities were demonstrated on a real-world data science workflow and later evaluated on workflows consisting of 13, 52, and 104 tasks. The results show that the developed framework is adequate for estimating the overall run-time distribution and optimizing cloud resource deployment, and that BPMN can be used for Big Data processing workflows. This study therefore serves BPMN practitioners by providing a tool for applying BPMN to their Big Data workflows, and decision-makers by giving them critical insights into their key business processes. The framework source code is available at https://github.com/ntankovic/python-bpmn-engine.
Title: Big data BPMN workflow resource optimization in the cloud
Journal: Parallel Computing, volume 117, Article 103025
Pub Date: 2023-08-01 | DOI: 10.1016/j.parco.2023.103042
I. Laguna, Anh Tran, G. Gopalakrishnan
Title: Finding inputs that trigger floating-point exceptions in heterogeneous computing via Bayesian optimization
Journal: Parallel Computing, volume 62, Article 103042
Pub Date: 2023-07-01 | DOI: 10.1016/j.parco.2023.103039
A. Sky, César Polindara, I. Muench, C. Birk
Title: A flexible sparse matrix data format and parallel algorithms for the assembly of finite element matrices on shared memory systems
Journal: Parallel Computing, volume 117, Article 103039
Pub Date: 2023-07-01 | DOI: 10.1016/j.parco.2023.103021
Shelby Lockhart, Amanda Bienz, William D. Gropp, Luke N. Olson
Supercomputer architectures are trending toward higher computational throughput due to the inclusion of heterogeneous compute nodes. These multi-GPU nodes increase on-node computational efficiency, while also increasing the amount of data to be communicated and the number of potential data flow paths. In this work, we characterize the performance of irregular point-to-point communication with MPI on heterogeneous compute environments through performance modeling, demonstrating the limitations of standard communication strategies for both device-aware and staging-through-host communication techniques. The presented models suggest staging communicated data through host processes and then using node-aware communication strategies when inter-node message counts are high. Notably, the models also predict that node-aware communication using all available CPU cores to communicate inter-node data is the most performant strategy when communicating with a large number of nodes. The models are validated via a case study of irregular point-to-point communication patterns in distributed sparse matrix–vector products. We also discuss the implications of the model predictions for communication strategy design on emerging supercomputer architectures.
Title: Characterizing the performance of node-aware strategies for irregular point-to-point communication on heterogeneous architectures
Journal: Parallel Computing, volume 116, Article 103021
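To make the intuition behind node-aware communication concrete, here is a minimal sketch using the textbook postal model (time = latency + bytes × inverse bandwidth). It is not the performance model from the paper: the cost functions, the assumption that messages are sent one after another, and every parameter value are hypothetical and chosen only for illustration.

```python
def postal_time(num_bytes, alpha, beta):
    """Postal model: alpha = per-message latency (s), beta = seconds per byte."""
    return alpha + beta * num_bytes

def direct_cost(msgs_per_node, num_nodes, msg_bytes, inter):
    """Every message crosses the network on its own, sent one after another."""
    return num_nodes * msgs_per_node * postal_time(msg_bytes, *inter)

def node_aware_cost(msgs_per_node, num_nodes, msg_bytes, inter, intra):
    """One aggregated inter-node message per destination node, followed by
    on-node redistribution (assumed to overlap across destination nodes)."""
    aggregated = num_nodes * postal_time(msgs_per_node * msg_bytes, *inter)
    redistribute = msgs_per_node * postal_time(msg_bytes, *intra)
    return aggregated + redistribute

# Hypothetical (alpha, beta) pairs, not measurements from any machine.
INTER_NODE = (2e-6, 1e-10)
INTRA_NODE = (5e-7, 5e-11)

many_small = dict(msgs_per_node=16, num_nodes=32, msg_bytes=1024)
print("direct    :", direct_cost(inter=INTER_NODE, **many_small))
print("node-aware:", node_aware_cost(inter=INTER_NODE, intra=INTRA_NODE, **many_small))
```

Under these assumed parameters aggregation pays off when many small messages share a destination node, because the inter-node latency term is paid once per node rather than once per message; with a single message per node the extra on-node step makes it slightly slower.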
Pub Date: 2023-07-01 | DOI: 10.1016/j.parco.2023.103022
Lei Yu, Tianqi Zhong, Peng Bi, Lan Wang, Fei Teng
Smart Mobile Devices (SMDs) are crucial for real-world sensing in the edge computing paradigm. Real-time applications, which are computationally intensive and periodic with strict time constraints, can typically be used to replicate real-world sensing. Such applications call for increased processing speed, memory capacity, and battery life on SMDs, which are typically resource-constrained due to physical size restrictions. As a result, power-efficient scheduling of real-time applications on SMDs is crucial for the regular operation of edge computing platforms, and downstream decision-making tasks such as computation offloading require predicting power consumption under power-saving approaches such as DVFS. The main question is how to quickly find a better solution to the NP-hard power-efficient scheduling problem with DVFS. To this end, we present a segment-based analysis approach that segments the aligned tasks on an SMD. Building on this analysis, we propose a segment-based scheduling algorithm (SEDF) that achieves power-efficient scheduling for these real-time workloads. The segment-based approach yields a power consumption bound (PB), and a computation offloading use case is developed to demonstrate how PB can be applied in subsequent decision-making. Both simulations and tests on actual devices are used to validate PB, SEDF, and the effectiveness of offloading decision-making. We demonstrate empirically that PB can be used to make approximately optimal decisions in decision-making problems involving computation offloading. SEDF is a straightforward and effective scheduling approach that can cut the power consumption of a multi-core SMD by roughly 30%.
Title: Segment based power-efficient scheduling for real-time DAG tasks on edge devices
Journal: Parallel Computing, volume 116, Article 103022
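For readers unfamiliar with DVFS-based scheduling, the following sketch shows the basic trade-off in a far simpler setting than the one the paper addresses. It is not SEDF and does not handle DAG tasks or multiple cores. It assumes a single core, independent periodic tasks with deadlines equal to their periods, EDF scheduling, execution time inversely proportional to frequency, dynamic power proportional to the cube of the frequency, and no static power. The task set and frequency levels are hypothetical.

```python
def utilization(tasks):
    """tasks: list of (worst-case execution time at full speed, period)."""
    return sum(wcet / period for wcet, period in tasks)

def lowest_edf_frequency(tasks, levels):
    """Smallest normalized frequency level f (0 < f <= 1) with U / f <= 1.

    Under the assumptions above, EDF meets all deadlines on one core
    exactly when the utilization scaled by 1 / f does not exceed 1.
    """
    u = utilization(tasks)
    feasible = [f for f in sorted(levels) if f >= u]
    if not feasible:
        raise ValueError("task set is not schedulable even at full speed")
    return feasible[0]

def relative_dynamic_energy(f):
    """Energy per unit of work relative to full speed.

    Power scales as f**3 and busy time as 1 / f, so energy scales as f**2.
    """
    return f ** 2

# Hypothetical task set and frequency levels, for illustration only.
tasks = [(1, 4), (1, 5), (2, 10)]
levels = (0.4, 0.6, 0.8, 1.0)
f = lowest_edf_frequency(tasks, levels)
print("chosen level:", f, "relative energy:", relative_dynamic_energy(f))
```

The sketch picks one static frequency for the whole task set; finer-grained schemes that vary the frequency over time can save more energy at the cost of a harder analysis.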
Pub Date: 2023-07-01 | DOI: 10.1016/j.parco.2023.103018
Akira Nukada, Taichiro Suzuki, Satoshi Matsuoka
We present NVCR, which enables transparent checkpoint and restart of CUDA applications. NVCR works as an extension of major system-level checkpoint software such as BLCR and DMTCP. It employs a proxy process, and the application accesses GPU devices via this proxy process, which improves compatibility with the latest CUDA runtime software. To reduce the overhead of inter-process communication, NVCR efficiently uses SYSV IPC shared memory as CUDA pinned memory. Performance evaluations using micro benchmarks and Amber as a real application show that NVCR’s overhead is acceptably low.
Title: Efficient checkpoint/Restart of CUDA applications
Journal: Parallel Computing, volume 116, Article 103018
Pub Date: 2023-07-01 | DOI: 10.1016/j.parco.2023.103019
David Castells-Rufas
Computing edit distance for very long strings has been hampered by quadratic time complexity with respect to string length. The WFA algorithm reduces the time complexity to a quadratic factor with respect to the edit distance between the strings. This work presents a GPU implementation of the WFA algorithm and a new optimization that can halve the number of elements to be computed, providing additional performance gains. The implementation makes it possible to compute the edit distance between strings of hundreds of millions of characters. The performance of the algorithm depends on the similarity between the strings. For strings longer than million characters, the performance is the best ever reported, which is above TCUPS for strings with similarities greater than 70% and above one hundred TCUPS for 99.9% similarity.
Title: GPU acceleration of Levenshtein distance computation between long strings
Journal: Parallel Computing, volume 116, Article 103019
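The contrast drawn in the abstract above, between the classic dynamic program and a wavefront formulation whose work grows with the edit distance, can be sketched on the CPU as follows. This is a plain Python illustration of the general wavefront idea for unit-cost edit distance. It is not the paper's GPU implementation and does not include its halving optimization; the function names are hypothetical.

```python
def levenshtein_dp(a, b):
    """Classic dynamic program: time proportional to len(a) * len(b)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]

def wavefront_edit_distance(a, b):
    """Wavefront formulation: for each score s, keep the furthest offset
    reached in b on every diagonal k = (index in b) - (index in a)."""
    n, m = len(a), len(b)
    target = m - n  # diagonal that contains the end point (n, m)

    def extend(k, h):
        # Follow matching characters along diagonal k at no cost.
        v = h - k
        while v < n and h < m and a[v] == b[h]:
            v += 1
            h += 1
        return h

    front = {0: extend(0, 0)}
    score = 0
    while front.get(target, -1) < m:
        score += 1
        new_front = {}
        for k in range(-score, score + 1):
            limit = min(m, n + k)  # largest valid offset on diagonal k
            candidates = []
            if k - 1 in front:
                candidates.append(front[k - 1] + 1)  # insertion
            if k in front:
                candidates.append(front[k] + 1)      # substitution
            if k + 1 in front:
                candidates.append(front[k + 1])      # deletion
            valid = [h for h in candidates if h <= limit]
            if valid:
                new_front[k] = extend(k, max(valid))
        front = new_front
    return score
```

Only the diagonals reachable with the current score are stored, so similar strings need far fewer cells than the full table built by `levenshtein_dp`.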