IEEE Transactions on Parallel and Distributed Systems最新文献

英文中文

ChunkFunc: Dynamic SLO-Aware Configuration of Serverless Functions

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2025-04-09 DOI: 10.1109/TPDS.2025.3559021

Thomas Pusztai;Stefan Nastic

Serverless computing promises to be a cost effective form of on demand computing. To fully utilize its cost saving potential, workflows must be configured with the appropriate amount of resources to meet their response time Service Level Objective (SLO), while keeping costs at a minimum. Since determining and updating these configuration models manually is a nontrivial and error prone task, researchers have developed solutions for automatically finding configurations that meet the aforementioned requirements. However, our initial experiments show that even when following best practices and using state-of-the-art configuration tools, resources may still be considerably over- or underprovisioned, depending on the size of functions’ input payload. In this paper we present ChunkFunc, an SLO- and input data-aware framework for tuning serverless workflows. Our main contributions include: i) an SLO- and input size-aware function performance model for optimized configurations in serverless workflows, ii) ChunkFunc Profiler, an auto-tuned, Bayesian Optimization-guided profiling mechanism for profiling serverless functions with typical input data sizes to build a performance model, and iii) ChunkFunc Workflow Optimizer, which uses these models to determine an input size dependent configuration for each serverless function in a workflow to meet the SLO, while keeping costs to a minimum. We evaluate ChunkFunc on real-life serverless workflows and compare it to two state-of-the-art solutions, showing that it increases SLO adherence by a factor of 1.04 to 2.78, depending on the workflow, and reduces costs by up to 61% .

{"title":"ChunkFunc: Dynamic SLO-Aware Configuration of Serverless Functions","authors":"Thomas Pusztai;Stefan Nastic","doi":"10.1109/TPDS.2025.3559021","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3559021","url":null,"abstract":"Serverless computing promises to be a cost effective form of on demand computing. To fully utilize its cost saving potential, workflows must be configured with the appropriate amount of resources to meet their response time Service Level Objective (SLO), while keeping costs at a minimum. Since determining and updating these configuration models manually is a nontrivial and error prone task, researchers have developed solutions for automatically finding configurations that meet the aforementioned requirements. However, our initial experiments show that even when following best practices and using state-of-the-art configuration tools, resources may still be considerably over- or underprovisioned, depending on the size of functions’ input payload. In this paper we present ChunkFunc, an SLO- and input data-aware framework for tuning serverless workflows. Our main contributions include: i) an SLO- and input size-aware function performance model for optimized configurations in serverless workflows, ii) ChunkFunc Profiler, an auto-tuned, Bayesian Optimization-guided profiling mechanism for profiling serverless functions with typical input data sizes to build a performance model, and iii) ChunkFunc Workflow Optimizer, which uses these models to determine an input size dependent configuration for each serverless function in a workflow to meet the SLO, while keeping costs to a minimum. We evaluate ChunkFunc on real-life serverless workflows and compare it to two state-of-the-art solutions, showing that it increases SLO adherence by a factor of 1.04 to 2.78, depending on the workflow, and reduces costs by up to 61% .","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1237-1252"},"PeriodicalIF":5.6,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10959103","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143871095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Productivity, Portability, Performance, and Reproducibility: Data-Centric Python

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2025-04-09 DOI: 10.1109/TPDS.2025.3549310

Alexandros Nikolaos Ziogas;Timo Schneider;Tal Ben-Nun;Alexandru Calotoiu;Tiziano De Matteis;Johannes de Fine Licht;Luca Lavarini;Torsten Hoefler

Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High-Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. This work presents a workflow that retains Python’s high productivity while achieving portable performance across different architectures. The workflow’s key features are HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer (up to 23,328 cores), with 2.47x and 3.75x speedups over previous-best solutions, first-ever Xilinx and Intel FPGA results of annotated Python, and up to 93.16% scaling efficiency on 512 nodes. Our benchmarks were reproduced in the Student Cluster Competition (SCC) during the Supercomputing Conference (SC) 2022. We present and discuss the student teams’ results.

Python 已成为事实上的科学计算语言。使用 Python 编程具有很高的生产力，这主要归功于围绕 NumPy 模块构建的丰富的面向科学的软件生态系统。因此，高性能计算（HPC）领域对 Python 支持的需求急剧上升。然而，Python 语言本身并不一定能提供高性能。这项工作提出了一种工作流，既能保持 Python 的高生产力，又能在不同架构间实现可移植的性能。该工作流的主要特点是面向高性能计算的语言扩展和一套由以数据为中心的中间表示法驱动的自动优化。我们展示了在 CPU、GPU、FPGA 和 Piz Daint 超级计算机（多达 23328 核）上的性能结果和扩展，与之前的最佳解决方案相比，速度分别提高了 2.47 倍和 3.75 倍，首次在赛灵思和英特尔 FPGA 上实现了注释 Python，并在 512 个节点上实现了高达 93.16% 的扩展效率。我们的基准在 2022 年超级计算大会（SC）期间的学生集群竞赛（SCC）中重现。我们介绍并讨论了学生团队的成果。

{"title":"Productivity, Portability, Performance, and Reproducibility: Data-Centric Python","authors":"Alexandros Nikolaos Ziogas;Timo Schneider;Tal Ben-Nun;Alexandru Calotoiu;Tiziano De Matteis;Johannes de Fine Licht;Luca Lavarini;Torsten Hoefler","doi":"10.1109/TPDS.2025.3549310","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3549310","url":null,"abstract":"Python has become the <italic>de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High-Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. This work presents a workflow that retains Python’s high productivity while achieving portable performance across different architectures. The workflow’s key features are HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer (up to 23,328 cores), with 2.47x and 3.75x speedups over previous-best solutions, first-ever Xilinx and Intel FPGA results of annotated Python, and up to 93.16% scaling efficiency on 512 nodes. Our benchmarks were reproduced in the Student Cluster Competition (SCC) during the Supercomputing Conference (SC) 2022. We present and discuss the student teams’ results.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"804-820"},"PeriodicalIF":5.6,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Guest Editorial:Special Section on SC22 Student Cluster Competition

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2025-04-09 DOI: 10.1109/TPDS.2025.3549281

Omer Rana;Josef Spillner;Stephen Leak;Gerald F Lofstead II;Rafael Tolosana Calasanz

引用次数: 0

Cube-fx: Mapping Taylor Expansion Onto Matrix Multiplier-Accumulators of Huawei Ascend AI Processors

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2025-04-03 DOI: 10.1109/TPDS.2025.3557444

Yifeng Tang;Huaman Zhou;Zhuoran Ji;Cho-Li Wang

Taylor expansion, a mature method for function evaluations used in Artificial Intelligence (AI) applications, approximates functions with polynomials. In addition to the function evaluations, AI applications require massive matrix multiplications, inspiring manufacturers to propose AI processors with matrix multiplier-accumulators (MACs). However, compared with the powerful Matrix MACs, the vectorized units of the AI processors cannot efficiently carry the existing Taylor expansion implementation of Single Instruction Multiple Data (SIMD) parallelism. Leveraging the Matrix MACs for Taylor expansion becomes an ideal direction. In previous studies, migrating optimized algorithms to the Matrix MACs requires matrix generation during the runtime. The generation is expensive and even cancels the accelerations brought by the Matrix MACs on the AI processors, which Taylor expansion also suffers. This article presents Cube-fx, a mapping algorithm of Taylor expansion for multiple functions onto Matrix MACs. Cube-fx expresses the building and computation in matrix multiplications without inefficient dynamic matrix generation. On Huawei Ascend processors, Cube-fx averagely achieves 1.64× speedups compared with vectorized Horner’s Method with 56.38

$%$

vectorized operations reduced.

{"title":"Cube-fx: Mapping Taylor Expansion Onto Matrix Multiplier-Accumulators of Huawei Ascend AI Processors","authors":"Yifeng Tang;Huaman Zhou;Zhuoran Ji;Cho-Li Wang","doi":"10.1109/TPDS.2025.3557444","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3557444","url":null,"abstract":"Taylor expansion, a mature method for function evaluations used in Artificial Intelligence (AI) applications, approximates functions with polynomials. In addition to the function evaluations, AI applications require massive matrix multiplications, inspiring manufacturers to propose AI processors with matrix multiplier-accumulators (MACs). However, compared with the powerful Matrix MACs, the vectorized units of the AI processors cannot efficiently carry the existing Taylor expansion implementation of Single Instruction Multiple Data (SIMD) parallelism. Leveraging the Matrix MACs for Taylor expansion becomes an ideal direction. In previous studies, migrating optimized algorithms to the Matrix MACs requires matrix generation during the runtime. The generation is expensive and even cancels the accelerations brought by the Matrix MACs on the AI processors, which Taylor expansion also suffers. This article presents Cube-fx, a mapping algorithm of Taylor expansion for multiple functions onto Matrix MACs. Cube-fx expresses the building and computation in matrix multiplications without inefficient dynamic matrix generation. On Huawei Ascend processors, Cube-fx averagely achieves 1.64× speedups compared with vectorized Horner’s Method with 56.38<inline-formula><tex-math>$%$</tex-math></inline-formula> vectorized operations reduced.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1115-1129"},"PeriodicalIF":5.6,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Distributed and Adaptive Partitioning for Large Graphs in Geo-Distributed Data Centers

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2025-04-03 DOI: 10.1109/TPDS.2025.3557610

Haobin Tan;Yao Xiao;Amelie Chi Zhou;Kezhong Lu;Xuan Yang

Graph partitioning is of great importance to optimizing the performance and cost of geo-distributed graph analytics applications. However, it is non-trivial to obtain efficient and effective partitioning due to the challenges brought by the large graph scales, dynamic graph changes and the network heterogeneity in geo-distributed data centers (DCs). Existing studies usually adopt heuristic-based methods to achieve fast and balanced partitioning for large graphs, which are not powerful enough to address the complexity in our problem. Further, graph structures of many applications can change at various frequencies. Dynamic partitioning methods usually focus on achieving low latency to quickly adapt to changes, which unfortunately sacrifices partitioning effectiveness. Also, such methods are not aware of the dynamicity of graphs and can over sacrifice effectiveness for unnecessarily low latency. To address the limitations of existing studies, we propose DistRLCut, a novel graph partitioner which leverages Multi-Agent Reinforcement Learning (MARL) to solve the complexity of the partitioning problem. To achieve fast partitioning for large graphs, DistRLCut adapts MARL to a distributed implementation which significantly accelerates the learning process. Further, DistRLCut incorporates two techniques to trade-off between partitioning effectiveness and efficiency, including local training and agent sampling. By adaptively tuning the number of local training iterations and the agent sampling rate, DistRLCut is able to achieve good partitioning results within an overhead constraint required by graph dynamicity. Experiments using real cloud DCs and real-world graphs show that, compared to state-of-the-art static partitioning methods, DistRLCut improves the performance of geo-distributed graph analytics by 11%-95%. DistRLCut can partition over 28 million edges per second, showcasing its scalability for large graphs. With varying graph changing frequencies, DistRLCut can improve the performance by up to 71% compared to state-of-the-art dynamic partitioning.

{"title":"Distributed and Adaptive Partitioning for Large Graphs in Geo-Distributed Data Centers","authors":"Haobin Tan;Yao Xiao;Amelie Chi Zhou;Kezhong Lu;Xuan Yang","doi":"10.1109/TPDS.2025.3557610","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3557610","url":null,"abstract":"Graph partitioning is of great importance to optimizing the performance and cost of geo-distributed graph analytics applications. However, it is non-trivial to obtain efficient and effective partitioning due to the challenges brought by the <italic>large graph scales, <italic>dynamic graph changes and the <italic>network heterogeneity in geo-distributed data centers (DCs). Existing studies usually adopt heuristic-based methods to achieve fast and balanced partitioning for large graphs, which are not powerful enough to address the complexity in our problem. Further, graph structures of many applications can change at various frequencies. Dynamic partitioning methods usually focus on achieving low latency to quickly adapt to changes, which unfortunately sacrifices partitioning effectiveness. Also, such methods are not aware of the dynamicity of graphs and can over sacrifice effectiveness for unnecessarily low latency. To address the limitations of existing studies, we propose <italic>DistRLCut, a novel graph partitioner which leverages Multi-Agent Reinforcement Learning (MARL) to solve the complexity of the partitioning problem. To achieve fast partitioning for large graphs, <italic>DistRLCut adapts MARL to a distributed implementation which significantly accelerates the learning process. Further, <italic>DistRLCut incorporates two techniques to trade-off between partitioning effectiveness and efficiency, including local training and agent sampling. By adaptively tuning the number of local training iterations and the agent sampling rate, <italic>DistRLCut is able to achieve good partitioning results within an overhead constraint required by graph dynamicity. Experiments using real cloud DCs and real-world graphs show that, compared to state-of-the-art static partitioning methods, <italic>DistRLCut improves the performance of geo-distributed graph analytics by 11%-95%. <italic>DistRLCut can partition over 28 million edges per second, showcasing its scalability for large graphs. With varying graph changing frequencies, <italic>DistRLCut can improve the performance by up to 71% compared to state-of-the-art dynamic partitioning.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1161-1174"},"PeriodicalIF":5.6,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

OneOS: Distributed Operating System for the Edge-to-Cloud Continuum

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2025-04-03 DOI: 10.1109/TPDS.2025.3557747

Kumseok Jung;Julien Gascon-Samson;Sathish Gopalakrishnan;Karthik Pattabiraman

Application developers often need to employ a combination of software such as communication middleware and cloud-based services to deal with the challenges of heterogeneity and network dynamism in the edge-to-cloud continuum. Consequently, developers write extra glue code peripheral to the application’s core business logic, to provide interoperability between interacting software frameworks. Each software framework comes with its own framework-specific API, and as technology evolves, the developer must keep up with the changing APIs by updating the glue code in their application. Thus, framework-specific APIs hinder interoperability and cause technology fragmentation. We propose a design of a middleware-based distributed operating system (OS) called OneOS to realize a computing paradigm that alleviates such interoperability challenges. OneOS provides a single system image of the distributed computing platform, and transparently provides interoperability between software components through the standard POSIX API. Using OneOS’s domain-specific language, users can compose complex distributed applications from legacy POSIX programs. OneOS tolerates failures by adopting a distributed checkpoint-restore algorithm. We evaluate the performance of OneOS against an open-source IoT Platform, ThingsJS, using an IoT stream processing benchmark suite, and a video processing application. OneOS executes the programs about 3x faster than ThingsJS, reduces the code size by about 22%, and recovers the state of failed applications within 1 s upon detecting their failure.

{"title":"OneOS: Distributed Operating System for the Edge-to-Cloud Continuum","authors":"Kumseok Jung;Julien Gascon-Samson;Sathish Gopalakrishnan;Karthik Pattabiraman","doi":"10.1109/TPDS.2025.3557747","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3557747","url":null,"abstract":"Application developers often need to employ a combination of software such as communication middleware and cloud-based services to deal with the challenges of heterogeneity and network dynamism in the edge-to-cloud continuum. Consequently, developers write extra glue code peripheral to the application’s core business logic, to provide interoperability between interacting software frameworks. Each software framework comes with its own framework-specific API, and as technology evolves, the developer must keep up with the changing APIs by updating the glue code in their application. Thus, framework-specific APIs hinder interoperability and cause technology fragmentation. We propose a design of a middleware-based distributed operating system (OS) called OneOS to realize a computing paradigm that alleviates such interoperability challenges. OneOS provides a single system image of the distributed computing platform, and transparently provides interoperability between software components through the standard POSIX API. Using OneOS’s domain-specific language, users can compose complex distributed applications from legacy POSIX programs. OneOS tolerates failures by adopting a distributed checkpoint-restore algorithm. We evaluate the performance of OneOS against an open-source IoT Platform, ThingsJS, using an IoT stream processing benchmark suite, and a video processing application. OneOS executes the programs about 3x faster than ThingsJS, reduces the code size by about 22%, and recovers the state of failed applications within 1 s upon detecting their failure.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1175-1192"},"PeriodicalIF":5.6,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

IRHunter: Universal Detection of Instruction Reordering Vulnerabilities for Enhanced Concurrency in Distributed and Parallel Systems

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2025-04-02 DOI: 10.1109/TPDS.2025.3556861

GuoHua Xin;Guangquan Xu;Yao Zhang;Cheng Wen;Cen Zhang;Xiaofei Xie;Neal N. Xiong;Shaoying Liu;Pan Gao

Instruction reordering is an essential optimization technique used in both compilers and multi-core processors to enhance parallelism and resource utilization. Although the original intent of this technique is to benefit the program, some improper reordering can significantly impact the program correctness, which we call instruction reordering vulnerability (IRV). However, existing methods detect IRV by defining CPU instruction reordering rules to schedule execution paths while neglecting compiler reordering, and thus generate false positives that require manual filtering and resulting in inefficiency. To bridge this gap, in this paper, we propose the IRV detection method, IRHunter, which analyzes IRV characteristics and extracts vulnerability patterns, integrating program dependency analysis for compiler reordering and memory model constraints for CPU reordering. Specifically, we use static analysis based on specific patterns to narrow the analysis scope, and adopt log-based dynamic analysis to confirm vulnerability by checking the log constraints. We built the IRV benchmark to compare IRHunter with five state-of-the-art tools (i.e., GENMC, Nidhugg, CBMC, SHB, BiRD). IRHunter detected all 19 errors, doubling the best model checking tools’ performance, with half the false positive rate of leading data race detectors. It was 10× faster on small programs and outperformed data race detectors on large programs.

{"title":"IRHunter: Universal Detection of Instruction Reordering Vulnerabilities for Enhanced Concurrency in Distributed and Parallel Systems","authors":"GuoHua Xin;Guangquan Xu;Yao Zhang;Cheng Wen;Cen Zhang;Xiaofei Xie;Neal N. Xiong;Shaoying Liu;Pan Gao","doi":"10.1109/TPDS.2025.3556861","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3556861","url":null,"abstract":"Instruction reordering is an essential optimization technique used in both compilers and multi-core processors to enhance parallelism and resource utilization. Although the original intent of this technique is to benefit the program, some improper reordering can significantly impact the program correctness, which we call instruction reordering vulnerability (IRV). However, existing methods detect IRV by defining CPU instruction reordering rules to schedule execution paths while neglecting compiler reordering, and thus generate false positives that require manual filtering and resulting in inefficiency. To bridge this gap, in this paper, we propose the IRV detection method, <italic>IRHunter, which analyzes IRV characteristics and extracts vulnerability patterns, integrating program dependency analysis for compiler reordering and memory model constraints for CPU reordering. Specifically, we use static analysis based on specific patterns to narrow the analysis scope, and adopt log-based dynamic analysis to confirm vulnerability by checking the log constraints. We built the IRV benchmark to compare <italic>IRHunter with five state-of-the-art tools (i.e., GENMC, Nidhugg, CBMC, SHB, BiRD). <italic>IRHunter detected all 19 errors, doubling the best model checking tools’ performance, with half the false positive rate of leading data race detectors. It was 10× faster on small programs and outperformed data race detectors on large programs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1220-1236"},"PeriodicalIF":5.6,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143871123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

WCET Estimation for CNN Inference on FPGA SoC With Multi-DPU Engines

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2025-04-01 DOI: 10.1109/TPDS.2025.3555968

Wei Zhang;Yunlong Yu;Xiao Jiang;Nan Guan;Naijun Zhan;Lei Ju

The Deep Learning Processor Unit (DPU) released in the official Xilinx Vitis AI toolchain stands as a commercial off-the-shelf solution tailored for accelerating convolutional neural network (CNN) inference on Xilinx FPGA devices. While most FPGA accelerator focus on high performance and energy-efficiency, analyzing the worst-case execution time (WCET) bound is essential for using CNN accelerations in real-time embedded systems design. In this work, we show that in a multi-DPU environment, the observed worst-case inference time for a CNN inference task could become 3X larger w.r.t. the best case inference time, which prompts the prominent importance of a static timing analysis for FPGA-based CNN inference. We propose, to the best of the authors’ knowledge, the first static timing analysis framework for CNN inference in a multi-DPU environment. The proposed framework introduces a generalized timing behavior model for shared bus arbitration and memory access contention between parallel running DPU engines. Additionally, it incorporates a fine-grained memory access contention analysis that takes into account the characteristics of deep learning applications. For a single-DPU environment, the analysis result is 27% tighter in average compared with the state-of-the-art results. Furthermore, our proposed method produces relatively tight estimated results in the multi-DPU environment.

{"title":"WCET Estimation for CNN Inference on FPGA SoC With Multi-DPU Engines","authors":"Wei Zhang;Yunlong Yu;Xiao Jiang;Nan Guan;Naijun Zhan;Lei Ju","doi":"10.1109/TPDS.2025.3555968","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3555968","url":null,"abstract":"The Deep Learning Processor Unit (DPU) released in the official Xilinx Vitis AI toolchain stands as a commercial off-the-shelf solution tailored for accelerating convolutional neural network (CNN) inference on Xilinx FPGA devices. While most FPGA accelerator focus on high performance and energy-efficiency, analyzing the worst-case execution time (WCET) bound is essential for using CNN accelerations in real-time embedded systems design. In this work, we show that in a multi-DPU environment, the observed worst-case inference time for a CNN inference task could become 3X larger w.r.t. the best case inference time, which prompts the prominent importance of a static timing analysis for FPGA-based CNN inference. We propose, to the best of the authors’ knowledge, the first static timing analysis framework for CNN inference in a multi-DPU environment. The proposed framework introduces a generalized timing behavior model for shared bus arbitration and memory access contention between parallel running DPU engines. Additionally, it incorporates a fine-grained memory access contention analysis that takes into account the characteristics of deep learning applications. For a single-DPU environment, the analysis result is 27% tighter in average compared with the state-of-the-art results. Furthermore, our proposed method produces relatively tight estimated results in the multi-DPU environment.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1146-1160"},"PeriodicalIF":5.6,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

DFU-E: A Dataflow Architecture for Edge DSP and AI Applications

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2025-03-28 DOI: 10.1109/TPDS.2025.3555329

Wenming Li;Zhihua Fan;Tianyu Liu;Zhen Wang;Haibin Wu;Meng Wu;Kunming Zhang;Yanhuan Liu;Ninghui Sun;Xiaochun Ye;Dongrui Fan

Edge computing aims to enable swift, real-time data processing, analysis, and storage close to the data source. However, edge computing platforms are often constrained by limited processing power and efficiency. This paper presents DFU-E, a dataflow-based accelerator specifically designed to meet the demands of edge digital signal processing (DSP) and artificial intelligence (AI) applications. Our design addresses real-world requirements with three main innovations. First, to accommodate the diverse algorithms utilized at the edge, we propose a multi-layer dataflow mechanism capable of exploiting task-level, instruction block-level, instruction-level, and data-level parallelism. Second, we develop an edge dataflow architecture that includes a customized processing element (PE) array, memory, and on-chip network microarchitecture optimized for the multi-layer dataflow mechanism. Third, we design an edge dataflow software stack that enables automatic optimizations through operator fusion, dataflow graph mapping, and task scheduling. We utilize representative real-world DSP and AI applications for evaluation. Comparing with Nvidia's state-of-the-art edge computing processor, DFU-E achieves up to 1.42× geometric mean performance improvement and 1.27× energy efficiency improvement.

{"title":"DFU-E: A Dataflow Architecture for Edge DSP and AI Applications","authors":"Wenming Li;Zhihua Fan;Tianyu Liu;Zhen Wang;Haibin Wu;Meng Wu;Kunming Zhang;Yanhuan Liu;Ninghui Sun;Xiaochun Ye;Dongrui Fan","doi":"10.1109/TPDS.2025.3555329","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3555329","url":null,"abstract":"Edge computing aims to enable swift, real-time data processing, analysis, and storage close to the data source. However, edge computing platforms are often constrained by limited processing power and efficiency. This paper presents DFU-E, a dataflow-based accelerator specifically designed to meet the demands of edge digital signal processing (DSP) and artificial intelligence (AI) applications. Our design addresses real-world requirements with three main innovations. First, to accommodate the diverse algorithms utilized at the edge, we propose a multi-layer dataflow mechanism capable of exploiting task-level, instruction block-level, instruction-level, and data-level parallelism. Second, we develop an edge dataflow architecture that includes a customized processing element (PE) array, memory, and on-chip network microarchitecture optimized for the multi-layer dataflow mechanism. Third, we design an edge dataflow software stack that enables automatic optimizations through operator fusion, dataflow graph mapping, and task scheduling. We utilize representative real-world DSP and AI applications for evaluation. Comparing with Nvidia's state-of-the-art edge computing processor, DFU-E achieves up to 1.42× geometric mean performance improvement and 1.27× energy efficiency improvement.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1100-1114"},"PeriodicalIF":5.6,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Workload-Aware Performance Model Based Soft Preemptive Real-Time Scheduling for Neural Processing Units

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2025-03-28 DOI: 10.1109/TPDS.2025.3553922

Yuan Yao;Yujiao Hu;Yi Dang;Wei Tao;Kai Hu;Qiming Huang;Zhe Peng;Gang Yang;Xingshe Zhou

A neural processing unit (NPU) is a microprocessor which is specially designed for various types of neural network applications. Because of its high acceleration efficiency and lower power consumption, the airborne embedded system has widely deployed NPU to replace GPU as the new accelerator. Unfortunately, the inherent scheduler of NPU does not consider real-time scheduling. Therefore, it cannot meet real-time requirements of airborne embedded systems. At present, there is less research on the multi-task real-time scheduling of the NPU device. In this article, we first design an NPU resource management framework based on Kubernetes. Then, we propose WAMSPRES, a workload-aware NPU performance model based soft preemptive real-time scheduling method. The proposed workload-aware NPU performance model can accurately predict the remaining execution time of the task when it runs with other tasks concurrently. The soft preemptive real-time scheduling algorithm can provide approximate preemption capability by dynamically adjusting the NPU computing resources of tasks. Finally, we implement a prototype NPU scheduler of the airborne embedded system for the fixed-wing UAV. The proposed models and algorithms are validated on both the simulated and realistic task sets. Experimental results illustrate that WAMSPRES can achieve low prediction error and high scheduling success rate.

{"title":"Workload-Aware Performance Model Based Soft Preemptive Real-Time Scheduling for Neural Processing Units","authors":"Yuan Yao;Yujiao Hu;Yi Dang;Wei Tao;Kai Hu;Qiming Huang;Zhe Peng;Gang Yang;Xingshe Zhou","doi":"10.1109/TPDS.2025.3553922","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553922","url":null,"abstract":"A neural processing unit (NPU) is a microprocessor which is specially designed for various types of neural network applications. Because of its high acceleration efficiency and lower power consumption, the airborne embedded system has widely deployed NPU to replace GPU as the new accelerator. Unfortunately, the inherent scheduler of NPU does not consider real-time scheduling. Therefore, it cannot meet real-time requirements of airborne embedded systems. At present, there is less research on the multi-task real-time scheduling of the NPU device. In this article, we first design an NPU resource management framework based on Kubernetes. Then, we propose WAMSPRES, a workload-aware NPU performance model based soft preemptive real-time scheduling method. The proposed workload-aware NPU performance model can accurately predict the remaining execution time of the task when it runs with other tasks concurrently. The soft preemptive real-time scheduling algorithm can provide approximate preemption capability by dynamically adjusting the NPU computing resources of tasks. Finally, we implement a prototype NPU scheduler of the airborne embedded system for the fixed-wing UAV. The proposed models and algorithms are validated on both the simulated and realistic task sets. Experimental results illustrate that WAMSPRES can achieve low prediction error and high scheduling success rate.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1058-1070"},"PeriodicalIF":5.6,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

IEEE Transactions on Parallel and Distributed Systems

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀