MultiPoint: Enabling scalable pre-silicon performance evaluation for multi-task workloads

BenchCouncil Transactions on Benchmarks, Standards and Evaluations. Pub Date: 2024-09-01. DOI: 10.1016/j.tbench.2025.100189

Chenji Han, Xinyu Li, Feng Xue, Weitong Wang, Yuxuan Wu, Wenxiang Wang, Fuxin Zhang

Volume 4, Issue 3, Article 100189. Full text: https://www.sciencedirect.com/science/article/pii/S277248592500002X


Abstract

As the number of cores integrated into a single processor grows and cloud computing develops rapidly, performance evaluation of multi-core systems is increasingly crucial. It is typically conducted by executing multi-task workloads, exemplified by SPEC CPU Rate, to measure metrics such as system throughput. In response, several sampling-based methods have been developed for pre-silicon performance evaluation of such workloads. Nevertheless, these methods capture multi-task checkpoints directly, which scales poorly because of significant storage and time overheads. Enabling more scalable performance evaluation therefore remains a critical problem.

In this work, we propose MultiPoint to enable scalable pre-silicon performance evaluation for multi-task workloads. In the multi-task workloads of interest, each task executes independently without inter-task communication. MultiPoint therefore constructs the required multi-task checkpoints by recovering multiple single-task checkpoints across different cores, and guarantees their smooth execution through address remapping and shuffling. We implemented MultiPoint on the Emulator Accelerator and assessed its evaluation accuracy against the post-silicon Loongson 3A6000 processor. Using SPEC CPU 2017 as the benchmark, MultiPoint achieved estimation errors of 6.20%, 5.45%, and 6.99% for Rate 2, Rate 4, and Rate 8, respectively. This accuracy is comparable to that of direct multi-task checkpointing, but it is obtained in a more scalable manner, with 86.0% lower storage overhead and 93.7% lower time overhead.
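
The idea of building one multi-task checkpoint from several single-task checkpoints can be illustrated with a small sketch. The abstract does not describe MultiPoint's actual data structures or what "shuffling" covers, so everything below is an assumption made for illustration: the `SingleTaskCheckpoint` class, the `compose_multitask` function, the page-frame model of memory, and the reading of "shuffling" as randomized frame placement are all hypothetical and are not the authors' implementation.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SingleTaskCheckpoint:
    """Hypothetical single-task checkpoint: frame number -> page contents."""
    name: str
    pages: Dict[int, bytes]


def compose_multitask(
    checkpoints: List[SingleTaskCheckpoint],
    total_frames: int,
    seed: int = 0,
) -> Tuple[Dict[int, bytes], List[Dict[int, int]]]:
    """Place each checkpoint's pages into disjoint frames of one shared memory.

    Returns the combined memory image and, per core, the old-frame to
    new-frame table that a restorer would use to patch address mappings.
    """
    needed = sum(len(c.pages) for c in checkpoints)
    if needed > total_frames:
        raise ValueError("not enough physical frames for all tasks")

    # Assumed meaning of "shuffling": randomize which free frame each page
    # receives, so identical task copies do not share an identical layout.
    rng = random.Random(seed)
    free_frames = list(range(total_frames))
    rng.shuffle(free_frames)

    memory: Dict[int, bytes] = {}
    remaps: List[Dict[int, int]] = []
    for ckpt in checkpoints:  # one checkpoint per core
        table: Dict[int, int] = {}
        for old_frame in sorted(ckpt.pages):
            new_frame = free_frames.pop()  # each frame is handed out once
            table[old_frame] = new_frame
            memory[new_frame] = ckpt.pages[old_frame]
        remaps.append(table)
    return memory, remaps


# Two copies of the same task use the same original frame numbers,
# as would be the case for a Rate-style run of identical copies.
copy0 = SingleTaskCheckpoint("copy0", {0: b"a0", 1: b"a1"})
copy1 = SingleTaskCheckpoint("copy1", {0: b"b0", 1: b"b1"})
memory, remaps = compose_multitask([copy0, copy1], total_frames=8, seed=1)
```

Because the tasks do not communicate, the only conflict to resolve in this model is that independently captured checkpoints may claim the same frames; handing out each free frame exactly once removes that conflict, and only one single-task checkpoint per task needs to be stored regardless of how many copies are combined.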


