Vanderlei Munhoz, Vinicius G. Pinto, João V. F. Lima, Márcio Castro, Daniel Cordeiro, Emilio Francesquini
Task-based programming interfaces introduce a paradigm in which computations are decomposed into fine-grained units of work known as “tasks”. StarPU is a runtime system originally developed to support task-based parallelism on on-premises heterogeneous architectures by abstracting low-level hardware details and efficiently managing resource scheduling. It enables developers to express applications as task graphs with explicit data dependencies, which are then dynamically scheduled across available processing units, such as CPUs and GPUs. In recent years, major cloud providers have begun offering virtual machines equipped with both CPUs and GPUs, allowing researchers to deploy and execute parallel workloads in virtual heterogeneous clusters. However, the performance and cost-effectiveness of executing StarPU-based applications in public cloud environments remain unclear, particularly due to variability in hardware configurations, network performance, ever-changing pricing models, and performance fluctuations caused by virtualization and multi-tenancy. In this paper, we evaluate the performance and cost-efficiency of StarPU on Amazon Elastic Compute Cloud (EC2) using dense linear algebra kernels and N-Body simulations as case studies. Our experiments consider different cluster configurations, including more powerful and expensive instances with four NVIDIA GPUs per node (which we refer to as “fat nodes”), and less powerful, lower-cost instances with a single NVIDIA GPU per node (which we refer to as “thin nodes”). Our results show that arithmetic precision affects the performance–cost trade-off for dense linear algebra applications, whereas N-Body simulations consistently achieve better cost-efficiency on thin-node clusters. These findings underscore the challenges of optimizing HPC workloads for performance and cost in cloud environments.
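The scheduling model described above can be illustrated with a short conceptual sketch. Note that this is not StarPU's actual C API (StarPU uses codelets, data handles, and calls such as starpu_task_insert); it is a minimal, hypothetical Python illustration of the underlying idea: tasks declare which data they read and write, and a task becomes eligible for execution only once all data it reads has been produced.

```python
# Conceptual sketch of task-graph scheduling with explicit data
# dependencies (hypothetical illustration, not StarPU's C API).

class Task:
    def __init__(self, name, reads=(), writes=()):
        self.name = name
        self.reads = set(reads)    # data items this task consumes
        self.writes = set(writes)  # data items this task produces

def schedule(tasks):
    """Return an execution order respecting read-after-write dependencies."""
    produced = set()               # data items already written
    order, pending = [], list(tasks)
    while pending:
        for t in pending:
            # a task is ready when everything it reads has been produced
            if t.reads <= produced:
                order.append(t.name)
                produced |= t.writes
                pending.remove(t)
                break
        else:
            raise RuntimeError("cyclic or unsatisfiable dependencies")
    return order

# A tiny graph: a matrix multiply that depends on two initialization tasks.
tasks = [
    Task("gemm", reads={"A", "B"}, writes={"C"}),
    Task("init_A", writes={"A"}),
    Task("init_B", writes={"B"}),
]
print(schedule(tasks))  # -> ['init_A', 'init_B', 'gemm']
```

In a real runtime such as StarPU, the scheduler additionally chooses *where* each ready task runs (CPU core or GPU) based on performance models and data locality, which is what makes the fat-node versus thin-node trade-off studied in the paper non-trivial.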
{"title":"Performance and Cost Evaluation of StarPU on AWS: Case Studies With Dense Linear Algebra Kernels and N-Body Simulations","authors":"Vanderlei Munhoz, Vinicius G. Pinto, João V. F. Lima, Márcio Castro, Daniel Cordeiro, Emilio Francesquini","doi":"10.1002/cpe.70582","DOIUrl":"10.1002/cpe.70582","url":null,"abstract":"<p>Task-based programming interfaces introduce a paradigm in which computations are decomposed into fine-grained units of work known as “tasks”. StarPU is a runtime system originally developed to support task-based parallelism on on-premise heterogeneous architectures by abstracting low-level hardware details and efficiently managing resource scheduling. It enables developers to express applications as task graphs with explicit data dependencies, which are then dynamically scheduled across available processing units, such as CPUs and GPUs. In recent years, major cloud providers have begun offering virtual machines equipped with both CPUs and GPUs, allowing researchers to deploy and execute parallel workloads in virtual heterogeneous clusters. However, the performance and cost effectiveness of executing StarPU-based applications in public cloud environments remain unclear, particularly due to variability in hardware configurations, network performance, ever-changing pricing models, and computing performance due to virtualization and multi-tenancy. In this paper, we evaluate the performance and cost-efficiency of StarPU on Amazon Elastic Compute Cloud (EC2) using dense linear algebra kernels and N-Body simulations as case studies. Our experiments consider different cluster configurations, including powerful and more expensive instances with four NVIDIA GPUs per node (which we refer to as “fat nodes”), and less powerful and lower-cost instances with a single NVIDIA GPU per node (which we refer to as “thin nodes”). 
Our results show that arithmetic precision affects the performance–cost trade-off for dense linear algebra applications, whereas N-Body simulations consistently achieve better cost-efficiency on thin-node clusters. These findings underscore the challenges of optimizing HPC workloads for performance and cost in cloud environments.</p>","PeriodicalId":55214,"journal":{"name":"Concurrency and Computation-Practice & Experience","volume":"38 3","pages":""},"PeriodicalIF":1.5,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cpe.70582","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146057779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}