Memory Dependence Speculation for Simultaneous Multi-Threading Processors

Pub Date: 2024-01-17 DOI: 10.1142/s0129626424500014

Jonathan Flores, Wei-Ming Lin


Citations: 0

Abstract

Simultaneous Multi-Threading (SMT) processors improve on the traditional out-of-order superscalar architecture by allowing instructions from several independent threads to execute out of order concurrently. Maintaining the accuracy of values read from and written to memory is a major bottleneck for processor performance: loads must stall until all prior store addresses are known, or else risk reading invalid data. Prior research in this area has mainly focused on superscalar architectures, so it is natural to extend memory dependence speculation techniques to an SMT architecture. In this paper, we allow loads from any thread to execute as soon as their addresses are resolved, without checking for conflicts with prior memory addresses. Each store, in turn, checks all later loads to see whether any of them read too early, as indicated by an address match; if so, the processor state is recovered and the load is re-issued. This aggressive technique offers the greatest potential gain in instructions per clock cycle (IPC) over predictive techniques, since the pipeline never stalls for loads. Our simulations show overall IPC gains of up to 12% and 10% for 4-threaded and 8-threaded workloads, respectively. Conversely, a maximum overall IPC loss of at least 2.3% and 2% for 4-threaded and 8-threaded workloads, respectively, was also observed.
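
The mechanism described in the abstract (loads issue without checking older stores; each store later checks younger loads for an address match) can be illustrated with a minimal sketch. This is not the authors' simulator: the `MemOp` record, the function names, and the single-thread queue are all assumptions made for illustration, and "recovery" is reduced to clearing a flag so the load can be re-issued.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    seq: int                # program order within one thread (hypothetical field)
    kind: str               # "load" or "store"
    addr: int               # resolved memory address
    executed: bool = False


def execute_load(load: MemOp) -> None:
    """Issue a load speculatively, without checking older stores."""
    load.executed = True


def resolve_store(store: MemOp, queue: list[MemOp]) -> list[MemOp]:
    """Execute a store and return the younger loads that read too early.

    A violation is a load that is later in program order, has already
    executed, and matches the store's address. Each violating load is
    marked as not executed so that it can be re-issued.
    """
    store.executed = True
    violators = [
        op for op in queue
        if op.kind == "load"
        and op.seq > store.seq
        and op.executed
        and op.addr == store.addr
    ]
    for load in violators:
        load.executed = False  # stand-in for state recovery and re-issue
    return violators


# One thread's memory operations in program order (assumed example data).
queue = [
    MemOp(0, "store", 0x100),
    MemOp(1, "load", 0x100),   # conflicts with the older store
    MemOp(2, "load", 0x200),   # independent of the store
]

# Both loads run ahead of the store, whose address resolves late.
execute_load(queue[1])
execute_load(queue[2])

violators = resolve_store(queue[0], queue)
print([op.seq for op in violators])  # [1]
```

In this sketch only the load to address `0x100` is squashed; the independent load keeps its result, which is where the potential IPC gain comes from. A real pipeline would also have to squash instructions that consumed the mis-speculated value, which this sketch does not model.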


