
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems: Latest Publications

Session details: Session 1B: Managed Runtimes and Dynamic Translation
Lei Liu
DOI: 10.1145/3252953 · Published: 2018-03-19
Citations: 0
Static Detection of Event-based Races in Android Apps
Yongjian Hu, Iulian Neamtiu
Event-based races are the main source of concurrency errors in Android apps. Prior approaches for scalable detection of event-based races have been dynamic. Due to their dynamic nature, these approaches suffer from coverage and false negative issues. We introduce a precise and scalable static approach and tool, named SIERRA, for detecting Android event-based races. SIERRA is centered around a new concept of "concurrency action" (that reifies threads, events/messages, system and user actions) and statically-derived order (happens-before relation) between actions. Establishing action order is complicated in Android, and event-based systems in general, because of externally-orchestrated control flow, use of callbacks, asynchronous tasks, and ad-hoc synchronization. We introduce several novel approaches that enable us to infer order relations statically: auto-generated code models which impose order among lifecycle and GUI events; a novel context abstraction for event-driven programs named action-sensitivity; and finally, on-demand path sensitivity via backward symbolic execution to further rule out false positives. We have evaluated SIERRA on 194 Android apps. Of these, we chose 20 apps for manual analysis and comparison with a state-of-the-art dynamic race detector. Experimental results show that SIERRA is effective and efficient, typically taking 960 seconds to analyze an app and revealing 43 potential races. Compared with the dynamic race detector, SIERRA discovered an average of 29.5 true races with 3.5 false positives, whereas the dynamic detector discovered only 4 races (hence missing 25.5 races per app) -- this demonstrates the advantage of a precise static approach. We believe that our approach opens the way for precise analysis and static event race detection in other event-driven systems beyond Android.
DOI: 10.1145/3173162.3173173 · Published: 2018-03-19
Citations: 22
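SIERRA's core check, ordering "concurrency actions" by a happens-before relation and flagging unordered accesses to the same field, can be illustrated with a toy detector. Everything below is a sketch, not the paper's implementation: the action names, field name, and lifecycle edge are invented for the example.

```python
from collections import defaultdict

def hb_reachable(edges, nodes):
    """For each action, compute the set of actions ordered after it
    (transitive closure of the happens-before edges) via DFS."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
    reach = {}
    for n in nodes:
        seen, stack = set(), [n]
        while stack:
            for nxt in adj[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        reach[n] = seen
    return reach

def potential_races(accesses, hb_edges):
    """accesses: (action, field) pairs. Two actions touching the same
    field with no happens-before order either way are a candidate race."""
    nodes = {a for a, _ in accesses}
    reach = hb_reachable(hb_edges, nodes)
    races = []
    for i, (a1, f1) in enumerate(accesses):
        for a2, f2 in accesses[i + 1:]:
            if f1 == f2 and a2 not in reach[a1] and a1 not in reach[a2]:
                races.append((a1, a2, f1))
    return races

# A code model orders onCreate before onResume; the async callback is unordered.
edges = [("onCreate", "onResume")]
accesses = [("onCreate", "mAdapter"), ("onResume", "mAdapter"),
            ("asyncDone", "mAdapter")]
races = potential_races(accesses, edges)
print(races)
```

The ordered lifecycle pair is filtered out; both lifecycle accesses race with the unordered async callback, which is exactly the class of bug the abstract targets.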
In-Memory Data Parallel Processor
Daichi Fujiki, S. Mahlke, R. Das
Recent developments in Non-Volatile Memories (NVMs) have opened up a new horizon for in-memory computing. Despite the significant performance gain offered by computational NVMs, previous works have relied on manual mapping of specialized kernels to the memory arrays, making it infeasible to execute more general workloads. We combat this problem by proposing a programmable in-memory processor architecture and data-parallel programming framework. The efficiency of the proposed in-memory processor comes from two sources: massive parallelism and reduction in data movement. A compact instruction set provides generalized computation capabilities for the memory array. The proposed programming framework seeks to leverage the underlying parallelism in the hardware by merging the concepts of data-flow and vector processing. To facilitate in-memory programming, we develop a compilation framework that takes a TensorFlow input and generates code for our in-memory processor. Our results demonstrate 7.5x speedup over a multi-core CPU server for a set of applications from Parsec and 763x speedup over a server-class GPU for a set of Rodinia benchmarks.
DOI: 10.1145/3173162.3173171 · Published: 2018-03-19
Citations: 114
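The parallelism model the abstract describes, one compact instruction applied by many memory arrays to their local data, can be caricatured in a few lines. This is purely illustrative and assumes nothing about the paper's ISA; the sequential loop stands in for arrays that would execute concurrently in hardware.

```python
def in_memory_map(op, data, n_arrays):
    """Toy model of array-level data parallelism: split the operand vector
    across n_arrays 'memory arrays', each applying the same instruction to
    its local slice, so no operand moves to a central core. In hardware the
    slices would execute concurrently; here they run in a plain loop."""
    chunk = (len(data) + n_arrays - 1) // n_arrays
    slices = [data[i * chunk:(i + 1) * chunk] for i in range(n_arrays)]
    return [op(x) for s in slices for x in s]

doubled = in_memory_map(lambda x: x * 2, list(range(8)), n_arrays=4)
print(doubled)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

The compiler described in the abstract would lower a TensorFlow graph into such per-array instruction streams; here the lambda plays the role of one lowered elementwise op.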
Blasting through the Front-End Bottleneck with Shotgun
Rakesh Kumar, Boris Grot, V. Nagarajan
The front-end bottleneck is a well-established problem in server workloads owing to their deep software stacks and large instruction working sets. Despite years of research into effective L1-I and BTB prefetching, state-of-the-art techniques force a trade-off between performance and metadata storage costs. This work introduces Shotgun, a BTB-directed front-end prefetcher powered by a new BTB organization that maintains a logical map of an application's instruction footprint, which enables high-efficacy prefetching at low storage cost. To map active code regions, Shotgun precisely tracks an application's global control flow (e.g., function and trap routine entry points) and summarizes local control flow within each code region. Because the local control flow enjoys high spatial locality, with most functions comprised of a handful of instruction cache blocks, it lends itself to a compact region-based encoding. Meanwhile, the global control flow is naturally captured by the application's unconditional branch working set (calls, returns, traps). Based on these insights, Shotgun devotes the bulk of its BTB capacity to branches responsible for the global control flow and a spatial encoding of their target regions. By effectively capturing a map of the application's instruction footprint in the BTB, Shotgun enables highly effective BTB-directed prefetching. Using a storage budget equivalent to a conventional BTB, Shotgun outperforms the state-of-the-art BTB-directed front-end prefetcher by up to 14% on a set of varied commercial workloads.
DOI: 10.1145/3173162.3173178 · Published: 2018-03-19
Citations: 46
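The compact region-based encoding the abstract describes can be sketched as a BTB entry that pairs a branch target with a footprint bitvector of nearby instruction-cache blocks. The field names and the 64-byte block size below are assumptions of this sketch, not the paper's exact entry format.

```python
BLOCK = 64  # assumed instruction-cache block size in bytes

class RegionEntry:
    """Sketch of a BTB entry extended with a spatial footprint: bit i set
    means block (target // BLOCK + i) * BLOCK belongs to the target region."""
    def __init__(self, target, footprint):
        self.target = target
        self.footprint = footprint

def prefetch_addresses(entry):
    """On a predicted branch to entry.target, expand the footprint bitvector
    into the block addresses a BTB-directed prefetcher would fetch."""
    base = entry.target // BLOCK
    return [(base + i) * BLOCK
            for i in range(entry.footprint.bit_length())
            if (entry.footprint >> i) & 1]

# A call to 0x4000 whose callee touches blocks 0, 1 and 3 of its region.
entry = RegionEntry(0x4000, 0b1011)
addrs = prefetch_addresses(entry)
print([hex(a) for a in addrs])  # ['0x4000', '0x4040', '0x40c0']
```

Storing one entry per unconditional branch plus a small footprint, rather than one entry per branch in the region, is what lets the global control flow map fit in a conventional BTB budget.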
NEOFog: Nonvolatility-Exploiting Optimizations for Fog Computing
Kaisheng Ma, Xueqing Li, M. Kandemir, J. Sampson, N. Vijaykrishnan, Jinyang Li, Tongda Wu, Zhibo Wang, Yongpan Liu, Yuan Xie
Nonvolatile processors have emerged as one of the promising solutions for energy harvesting scenarios, among which Wireless Sensor Networks (WSN) provide some of the most important applications. In a typical distributed sensing system, due to differences in location, energy-harvester angle, power source, etc., different nodes may have different amounts of energy ready for use. While prior approaches have examined these challenges, they have not done so in the context of the features offered by nonvolatile computing approaches, which disrupt certain foundational assumptions. We propose a new set of nonvolatility-exploiting optimizations and embody them in the NEOFog system architecture. We discuss shifts in the tradeoffs in data and program distribution for nonvolatile processing-based WSNs, showing how non-volatile processing and non-volatile RF support alter the benefits of computation- and communication-centric approaches. We also propose a new algorithm specific to nonvolatile sensing systems for load balancing both computation and communication demands. Collectively, the NV-aware optimizations in NEOFog increase the ability to perform in-fog processing by 4.2X and can increase this to 8X if virtualized nodes are 3X multiplexed.
DOI: 10.1145/3173162.3177154 · Published: 2018-03-19
Citations: 15
Gloss: Seamless Live Reconfiguration and Reoptimization of Stream Programs
S. Rajadurai, Jeffrey Bosboom, W. Wong, Saman P. Amarasinghe
An important class of applications computes on long-running or infinite streams of data, often with known fixed data rates. The latter are referred to as synchronous data flow (SDF) streams. These stream applications need to run on clusters or the cloud due to the high performance requirement. Further, they require live reconfiguration and reoptimization for various reasons such as hardware maintenance, elastic computation, or to respond to fluctuations in resources or application workload. However, reconfiguration and reoptimization without downtime while accurately preserving program state in a distributed environment is difficult. In this paper, we introduce Gloss, a suite of compiler and runtime techniques for live reconfiguration of distributed stream programs. Gloss, for the first time, avoids periods of zero throughput during the reconfiguration of both stateless and stateful SDF based stream programs. Furthermore, unlike other systems, Gloss globally reoptimizes and completely recompiles the program during reconfiguration. This permits it to reoptimize the application for entirely new configurations that it may not have encountered before. All these Gloss operations happen in-situ, requiring no extra hardware resources. We show how Gloss allows stream programs to reconfigure and reoptimize with no downtime and minimal overhead, and demonstrate the wider applicability of it via a variety of experiments.
DOI: 10.1145/3173162.3173170 · Published: 2018-03-19
Citations: 12
Enhancing Cross-ISA DBT Through Automatically Learned Translation Rules
Wenwen Wang, Stephen McCamant, Antonia Zhai, P. Yew
This paper presents a novel approach for dynamic binary translation (DBT) to automatically learn translation rules from guest and host binaries compiled from the same source code. The learned translation rules are then verified via binary symbolic execution and used in an existing DBT system, QEMU, to generate more efficient host binary code. Experimental results on SPEC CINT2006 show that the average time of learning a translation rule is less than two seconds. With the rules learned from a collection of benchmark programs excluding the targeted program itself, an average 1.25X performance speedup over QEMU can be achieved for SPEC CINT2006. Moreover, the translation overhead introduced by this rule-based approach is very small even for short-running workloads.
DOI: 10.1145/3173162.3177160 · Published: 2018-03-19
Citations: 16
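The flavor of a learned translation rule can be shown with a toy pattern matcher: a guest instruction sequence with $-placeholders is matched against concrete instructions, bindings are collected, and a host template is emitted. The rule, the $-syntax, and the instruction mnemonics below are invented for illustration; the paper's rules are learned from matched binaries and additionally verified by symbolic execution.

```python
def tokens(insn):
    # Commas are treated as whitespace in this toy assembly syntax.
    return insn.replace(",", " ").split()

def try_match(pattern, insns):
    """Bind $-placeholders in a guest pattern against concrete instructions;
    return the binding environment, or None if the sequence does not match."""
    if len(pattern) != len(insns):
        return None
    env = {}
    for pat, insn in zip(pattern, insns):
        p, t = tokens(pat), tokens(insn)
        if len(p) != len(t):
            return None
        for a, b in zip(p, t):
            if a.startswith("$"):
                if env.setdefault(a, b) != b:  # same placeholder, same operand
                    return None
            elif a != b:
                return None
    return env

def substitute(template, env):
    """Instantiate the host template with the operands bound by the match."""
    return [" ".join(env.get(w, w) for w in tokens(i)) for i in template]

# Hypothetical rule: materialize-immediate + add folds into a single host add.
rule = (["mov $t, $imm", "add $d, $d, $t"],  # guest pattern
        ["add $d, $imm"])                    # host template

guest = ["mov r1, #8", "add r0, r0, r1"]
env = try_match(rule[0], guest)
host = substitute(rule[1], env) if env else None
print(host)  # ['add r0 #8']
```

A real rule store would index many such pairs and fall back to QEMU's default lowering when no rule matches, which is why the translation overhead stays small.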
Session details: Session 4B: Program Analysis
Shan Lu
DOI: 10.1145/3252959 · Published: 2018-03-19
Citations: 0
Minnow: Lightweight Offload Engines for Worklist Management and Worklist-Directed Prefetching
Dan Zhang, Xiaoyu Ma, Michael Thomson, Derek Chiou
The importance of irregular applications such as graph analytics is rapidly growing with the rise of Big Data. However, parallel graph workloads tend to perform poorly on general-purpose chip multiprocessors (CMPs) due to poor cache locality, low compute intensity, frequent synchronization, uneven task sizes, and dynamic task generation. At high thread counts, execution time is dominated by worklist synchronization overhead and cache misses. Researchers have proposed hardware worklist accelerators to address scheduling costs, but these proposals often harden a specific scheduling policy and do not address high cache miss rates. We address this with Minnow, a technique that augments each core in a CMP with a lightweight Minnow accelerator. Minnow engines offload worklist scheduling from worker threads to improve scalability. The engines also perform worklist-directed prefetching, a technique that exploits knowledge of upcoming tasks to issue nearly perfectly accurate and timely prefetch operations. On a simulated 64-core CMP running a parallel graph benchmark suite, Minnow improves scalability and reduces L2 cache misses from 29 to 1.2 MPKI on average, resulting in 6.01x average speedup over an optimized software baseline for only 1% area overhead.
DOI: 10.1145/3173162.3173197 · Published: 2018-03-19
Citations: 37
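Minnow's two roles, worklist scheduling and worklist-directed prefetching, can be modeled with a toy scheduler that issues a prefetch for the next queued task before processing the current one. This is a sequential stand-in for the hardware engine; the graph, task names, and the idea that "prefetching" is just recording the next task are assumptions of the sketch.

```python
from collections import deque

def run_worklist(initial, process, prefetch):
    """Toy worklist loop: before processing the current task, issue a
    prefetch for the next queued task so its data is warm by the time it
    runs. process(task) may return newly generated tasks (dynamic work)."""
    worklist = deque(initial)
    order = []
    while worklist:
        cur = worklist.popleft()
        if worklist:
            prefetch(worklist[0])  # engine knows exactly which task runs next
        worklist.extend(process(cur))
        order.append(cur)
    return order

# Tiny graph traversal: tasks are vertices, processing enqueues unseen neighbors.
graph = {0: [1, 2], 1: [3], 2: [], 3: []}
visited = {0}
def visit(v):
    fresh = [w for w in graph[v] if w not in visited]
    visited.update(fresh)
    return fresh

prefetched = []
order = run_worklist([0], visit, prefetched.append)
print(order)       # [0, 1, 2, 3]
print(prefetched)  # [2, 3]
```

Because the worklist itself names the upcoming tasks, the prefetches are accurate by construction, which is the property the abstract credits for the large reduction in L2 misses.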
Session details: Session 7B: Memory 2
S. Blackburn
DOI: 10.1145/3252965 · Published: 2018-03-19
Citations: 0