PRES: probabilistic replay with execution sketching on multiprocessors

Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP '09). Published: 2009-10-11. DOI: 10.1145/1629575.1629593

Soyeon Park, Yuanyuan Zhou, Weiwei Xiong, Zuoning Yin, Rini T. Kaushik, Kyuhyung Lee, Shan Lu

Pages: 177-192

Citations: 283

Abstract

Bug reproduction is critically important for diagnosing a production-run failure. Unfortunately, reproducing a concurrency bug on multiprocessors (e.g., multi-core machines) is challenging. Previous techniques either incur large overhead or require new, non-trivial hardware extensions.

This paper proposes a novel technique called PRES (probabilistic replay via execution sketching) to help reproduce concurrency bugs on multiprocessors. It relaxes the past (perhaps idealistic) objective of "reproducing the bug on the first replay attempt" in order to significantly lower production-run recording overhead. This is achieved by (1) recording only partial execution information (referred to as "sketches") during the production run, and (2) relying on an intelligent replayer at diagnosis time (when performance is less critical) to systematically explore the unrecorded non-deterministic space and reproduce the bug. With only partial information, our replayer may require more than one coordinated replay run to reproduce a bug. However, once a bug has been reproduced, PRES can reproduce it every time.

We implemented PRES along with five different execution sketching mechanisms. We evaluated them with 11 representative applications (4 servers, 3 desktop/client applications, and 4 scientific/graphics applications) and 13 real-world concurrency bugs of different types, including atomicity violations, order violations, and deadlocks. PRES (with synchronization or system-call sketching) lowered the production-run recording overhead of previous approaches by up to 4416 times, while still reproducing most tested bugs in fewer than 10 replay attempts. Moreover, PRES scaled well with the number of processors, and its feedback generation from unsuccessful replays is critical to bug reproduction.
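To make the record/replay split concrete, here is a minimal Python sketch under heavy simplifying assumptions: a hypothetical two-thread program with an atomicity-violation (lost update) bug, a "synchronization sketch" that records only the global order of lock operations during the production run, and a replayer that enumerates interleavings of the unrecorded memory accesses consistent with that order until the failure reappears. The names `make_thread`, `record_sync_sketch`, `consistent_schedules`, and `replay` are illustrative, not PRES's interfaces. The naive depth-first enumeration stands in for PRES's coordinated replayer and omits the feedback generation from unsuccessful replays described above.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# An event is (is_sync, label, action). Only sync events are recorded in the sketch.
Event = Tuple[bool, str, Callable[[Dict[str, int], int], None]]


def make_thread() -> List[Event]:
    """Hypothetical toy thread: a lock/unlock pair, then an unsynchronized
    read-modify-write of shared x (an atomicity-violation bug)."""
    def read(state: Dict[str, int], tid: int) -> None:
        state[f"r{tid}"] = state["x"]

    def write(state: Dict[str, int], tid: int) -> None:
        state["x"] = state[f"r{tid}"] + 1

    def noop(state: Dict[str, int], tid: int) -> None:
        pass

    return [(True, "lock L", noop), (True, "unlock L", noop),
            (False, "read x", read), (False, "write x", write)]


PROGRAM: List[List[Event]] = [make_thread(), make_thread()]


def run(schedule: List[int]) -> bool:
    """Execute a complete schedule (one thread id per step); True if the bug shows."""
    state = {"x": 0}
    pcs = [0] * len(PROGRAM)
    for tid in schedule:
        _, _, action = PROGRAM[tid][pcs[tid]]
        action(state, tid)
        pcs[tid] += 1
    return state["x"] != len(PROGRAM)  # each thread adds 1; anything less is a lost update


def record_sync_sketch(schedule: List[int]) -> List[Tuple[int, str]]:
    """Production-run side: keep only the global order of sync events."""
    pcs = [0] * len(PROGRAM)
    sketch = []
    for tid in schedule:
        is_sync, label, _ = PROGRAM[tid][pcs[tid]]
        if is_sync:
            sketch.append((tid, label))
        pcs[tid] += 1
    return sketch


def consistent_schedules(sketch: List[Tuple[int, str]]) -> Iterator[List[int]]:
    """Yield every complete schedule whose sync events follow the sketch;
    the order of unsynchronized events (the unrecorded space) is left free."""
    total = sum(len(t) for t in PROGRAM)

    def dfs(pcs: List[int], k: int, prefix: List[int]) -> Iterator[List[int]]:
        if len(prefix) == total:
            yield list(prefix)
            return
        for tid, thread in enumerate(PROGRAM):
            if pcs[tid] == len(thread):
                continue  # thread already finished
            is_sync, label, _ = thread[pcs[tid]]
            if is_sync and (k >= len(sketch) or sketch[k] != (tid, label)):
                continue  # not the next sync event in the recorded order
            pcs[tid] += 1
            prefix.append(tid)
            yield from dfs(pcs, k + int(is_sync), prefix)
            prefix.pop()
            pcs[tid] -= 1

    yield from dfs([0] * len(PROGRAM), 0, [])


def replay(sketch: List[Tuple[int, str]],
           max_attempts: int = 100) -> Tuple[int, Optional[List[int]]]:
    """Diagnosis side: try sketch-consistent schedules until the failure reappears."""
    attempts = 0
    for schedule in consistent_schedules(sketch):
        attempts += 1
        if run(schedule):
            return attempts, schedule  # a full schedule now replays deterministically
        if attempts >= max_attempts:
            break
    return attempts, None


production = [0, 0, 1, 1, 0, 1, 0, 1]  # failing production interleaving (not recorded)
sketch = record_sync_sketch(production)
attempts, schedule = replay(sketch)
print(f"reproduced after {attempts} attempt(s): {schedule}")
```

In this toy, the first three sketch-consistent schedules pass and the fourth exhibits the lost update. Because the schedule found is a complete interleaving, re-running it fails the same way every time, mirroring the claim that once a bug is reproduced, PRES can reproduce it every time.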
