Quantifying the Mismatch between Emerging Scale-Out Applications and Modern Processors
M. Ferdman, Almutaz Adileh, Yusuf Onur Koçberber, Stavros Volos, M. Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, A. Ailamaki, B. Falsafi
Emerging scale-out workloads require extensive computational resources. However, data centers using modern server hardware face physical constraints in space and power that limit further expansion, calling for improvements in per-server computational density and per-operation energy. Continuing to grow the cloud's computational resources within these physical constraints requires optimizing server efficiency so that server hardware closely matches the needs of scale-out workloads. In this work, we introduce CloudSuite, a benchmark suite of emerging scale-out workloads. Using performance counters on modern servers, we find that today's predominant processor microarchitecture is inefficient for running these workloads. The inefficiency comes from the mismatch between workload needs and modern processors, particularly in the organization of the instruction and data memory systems and in the processor core microarchitecture. Moreover, continuing current microarchitectural trends will only exacerbate this inefficiency. Finally, we identify the key microarchitectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.
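The methodology described here amounts to reading hardware performance counters while a workload runs and deriving efficiency metrics from them. Below is a minimal sketch of that style of measurement, assuming a Linux host where `perf stat` is available; the event list and the workload binary are illustrative stand-ins, not the paper's experimental setup.

```python
import subprocess

EVENTS = "cycles,instructions,stalled-cycles-frontend,L1-icache-load-misses"

def measure(cmd):
    """Run cmd under `perf stat` and return {event_name: count}."""
    # -x , selects CSV output, which perf writes to stderr.
    proc = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", EVENTS, "--", *cmd],
        stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True)
    counts = {}
    for line in proc.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].isdigit():  # skip "<not supported>" etc.
            counts[fields[2]] = int(fields[0])
    return counts

counts = measure(["./scaleout_workload"])  # hypothetical benchmark binary
print("IPC: %.2f" % (counts["instructions"] / counts["cycles"]))
print("Frontend stall cycles per kilo-instruction: %.1f"
      % (1000.0 * counts["stalled-cycles-frontend"] / counts["instructions"]))
```

Low IPC combined with high frontend-stall and instruction-cache miss rates is the kind of signature the study attributes to the mismatch in the instruction memory system.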
{"title":"Quantifying the Mismatch between Emerging Scale-Out Applications and Modern Processors","authors":"M. Ferdman, Almutaz Adileh, Yusuf Onur Koçberber, Stavros Volos, M. Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, A. Ailamaki, B. Falsafi","doi":"10.1145/2382553.2382557","DOIUrl":"https://doi.org/10.1145/2382553.2382557","url":null,"abstract":"Emerging scale-out workloads require extensive amounts of computational resources. However, data centers using modern server hardware face physical constraints in space and power, limiting further expansion and calling for improvements in the computational density per server and in the per-operation energy. Continuing to improve the computational resources of the cloud while staying within physical constraints mandates optimizing server efficiency to ensure that server hardware closely matches the needs of scale-out workloads.\u0000 In this work, we introduce CloudSuite, a benchmark suite of emerging scale-out workloads. We use performance counters on modern servers to study scale-out workloads, finding that today’s predominant processor microarchitecture is inefficient for running these workloads. We find that inefficiency comes from the mismatch between the workload needs and modern processors, particularly in the organization of instruction and data memory systems and the processor core microarchitecture. Moreover, while today’s predominant microarchitecture is inefficient when executing scale-out workloads, we find that continuing the current trends will further exacerbate the inefficiency in the future. In this work, we identify the key microarchitectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"10 1","pages":"15:1-15:24"},"PeriodicalIF":1.5,"publicationDate":"2012-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79587002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Design, Implementation, and Evaluation of Cells: A Virtual Smartphone Architecture
Chris Dall, Jeremy Andrus, Alexander Van't Hof, Oren Laadan, Jason Nieh
Smartphones are nearly ubiquitous, and many users carry multiple phones to accommodate work, personal, and geographic mobility needs. We present Cells, a virtualization architecture that enables multiple virtual smartphones to run simultaneously on the same physical cellphone in an isolated, secure manner. Cells introduces a usage model of one foreground virtual phone and multiple background virtual phones. This model enables a new device namespace mechanism and novel device proxies that integrate with lightweight operating system virtualization to multiplex phone hardware across multiple virtual phones while providing native hardware device performance. Cells' virtual phone features include fully accelerated 3D graphics, complete power management, and full telephony functionality with separately assignable telephone numbers and caller ID support. We have implemented a prototype of Cells that supports multiple Android virtual phones on the same phone. Our performance results demonstrate that Cells imposes only modest runtime and memory overhead, works seamlessly across multiple hardware devices including Google Nexus 1 and Nexus S phones, and transparently runs Android applications at native speed without any modifications.
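To make the foreground/background usage model concrete, here is a minimal user-level sketch of the device-proxy idea the abstract describes: every virtual phone keeps its own view of a device, and only the foreground phone's view is propagated to hardware. All names are illustrative; Cells itself implements this with kernel-level device namespaces inside Android/Linux drivers.

```python
class DeviceProxy:
    """Multiplex one hardware device across several virtual phones."""

    def __init__(self, phones):
        # Per-phone virtual device state (a trivial one-key example).
        self.state = {p: {"brightness": 50} for p in phones}
        self.foreground = phones[0]

    def write(self, phone, key, value):
        self.state[phone][key] = value       # always update the virtual view
        if phone is self.foreground:
            self._write_hw(key, value)       # only the foreground reaches hardware

    def switch_foreground(self, phone):
        self.foreground = phone
        for key, value in self.state[phone].items():
            self._write_hw(key, value)       # replay the new foreground's state

    def _write_hw(self, key, value):
        print(f"hw <- {key}={value}")        # stands in for a real driver call

proxy = DeviceProxy(["work", "personal"])
proxy.write("personal", "brightness", 10)    # background write: no hardware effect
proxy.switch_foreground("personal")          # hardware now reflects "personal"
```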
{"title":"The Design, Implementation, and Evaluation of Cells: A Virtual Smartphone Architecture","authors":"Chris Dall, Jeremy Andrus, Alexander Van't Hof, Oren Laadan, Jason Nieh","doi":"10.1145/2324876.2324877","DOIUrl":"https://doi.org/10.1145/2324876.2324877","url":null,"abstract":"Smartphones are increasingly ubiquitous, and many users carry multiple phones to accommodate work, personal, and geographic mobility needs. We present Cells, a virtualization architecture for enabling multiple virtual smartphones to run simultaneously on the same physical cellphone in an isolated, secure manner. Cells introduces a usage model of having one foreground virtual phone and multiple background virtual phones. This model enables a new device namespace mechanism and novel device proxies that integrate with lightweight operating system virtualization to multiplex phone hardware across multiple virtual phones while providing native hardware device performance. Cells virtual phone features include fully accelerated 3D graphics, complete power management features, and full telephony functionality with separately assignable telephone numbers and caller ID support. We have implemented a prototype of Cells that supports multiple Android virtual phones on the same phone. Our performance results demonstrate that Cells imposes only modest runtime and memory overhead, works seamlessly across multiple hardware devices including Google Nexus 1 and Nexus S phones, and transparently runs Android applications at native speed without any modifications.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"48 1","pages":"9:1-9:31"},"PeriodicalIF":1.5,"publicationDate":"2012-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87318591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A File Is Not a File: Understanding the I/O Behavior of Apple Desktop Applications
T. Harter, Chris Dragga, Michael Vaughn, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
We analyze the I/O behavior of iBench, a new collection of productivity and multimedia application workloads. Our analysis reveals a number of differences between iBench and typical file-system workload studies, including the complex organization of modern files, the lack of pure sequential access, the influence of underlying frameworks on I/O patterns, the widespread use of file synchronization and atomic operations, and the prevalence of threads. Our results have strong ramifications for the design of next-generation local and cloud-based storage systems.
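One of the measurements behind a claim like "the lack of pure sequential access" can be illustrated with a small sketch: classify each file's accesses by whether they resume where the previous access ended. The trace format, a list of (path, offset, length) tuples, is an assumption made for illustration, not iBench's actual format.

```python
from collections import defaultdict

def sequential_fraction(trace):
    """Fraction of accesses per file that continue from the previous access."""
    last_end = {}
    seq, total = defaultdict(int), defaultdict(int)
    for path, offset, length in trace:
        total[path] += 1
        if last_end.get(path) == offset:  # first access counts as non-sequential
            seq[path] += 1
        last_end[path] = offset + length
    return {p: seq[p] / total[p] for p in total}

trace = [("a.doc", 0, 4096), ("a.doc", 4096, 4096), ("a.doc", 0, 512)]
print(sequential_fraction(trace))  # {'a.doc': 0.333...}: one of three is sequential
```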
{"title":"A File Is Not a File: Understanding the I/O Behavior of Apple Desktop Applications","authors":"T. Harter, Chris Dragga, Michael Vaughn, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau","doi":"10.1145/2324876.2324878","DOIUrl":"https://doi.org/10.1145/2324876.2324878","url":null,"abstract":"We analyze the I/O behavior of iBench, a new collection of productivity and multimedia application workloads. Our analysis reveals a number of differences between iBench and typical file-system workload studies, including the complex organization of modern files, the lack of pure sequential access, the influence of underlying frameworks on I/O patterns, the widespread use of file synchronization and atomic operations, and the prevalence of threads. Our results have strong ramifications for the design of next generation local and cloud-based storage systems.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"2 1 1","pages":"10:1-10:39"},"PeriodicalIF":1.5,"publicationDate":"2012-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89006738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power Limitations and Dark Silicon Challenge the Future of Multicore
H. Esmaeilzadeh, Emily R. Blem, R. S. Amant, K. Sankaralingam, D. Burger
Since 2004, processor designers have increased core counts to exploit Moore’s Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9× average speedup is possible across commonly used parallel workloads for the topologies we study, leaving a nearly 24-fold gap from a target of doubled performance per generation.
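The core of the argument can be seen in a toy version of this style of model (not the paper's actual Pareto-frontier model): combine an Amdahl's-law speedup with a chip-level power budget, and count how many cores can be powered at once. All parameters below are illustrative.

```python
def multicore_speedup(f_parallel, n_cores, perf_per_core, power_per_core,
                      power_budget):
    """Amdahl's-law speedup when only the power budget's worth of cores run."""
    powered = min(n_cores, int(power_budget / power_per_core))
    serial_time = (1.0 - f_parallel) / perf_per_core
    parallel_time = f_parallel / (perf_per_core * powered)
    dark_cores = n_cores - powered  # cores present on the die but unpowered
    return 1.0 / (serial_time + parallel_time), dark_cores

speedup, dark = multicore_speedup(f_parallel=0.95, n_cores=64,
                                  perf_per_core=1.0, power_per_core=2.0,
                                  power_budget=80.0)
print(f"speedup {speedup:.1f}x, {dark} of 64 cores dark")  # 13.6x, 24 dark
```

Even this crude model shows the paper's qualitative point: once the power budget, rather than the transistor count, bounds the number of active cores, adding cores stops translating into speedup.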
{"title":"Power Limitations and Dark Silicon Challenge the Future of Multicore","authors":"H. Esmaeilzadeh, Emily R. Blem, R. S. Amant, K. Sankaralingam, D. Burger","doi":"10.1145/2324876.2324879","DOIUrl":"https://doi.org/10.1145/2324876.2324879","url":null,"abstract":"Since 2004, processor designers have increased core counts to exploit Moore’s Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9× average speedup is possible across commonly used parallel workloads for the topologies we study, leaving a nearly 24-fold gap from a target of doubled performance per generation.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"1 1","pages":"11:1-11:27"},"PeriodicalIF":1.5,"publicationDate":"2012-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90601149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors
Mark Gebhart, Daniel R. Johnson, D. Tarjan, S. Keckler, W. Dally, Erik Lindholm, K. Skadron
Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access in terms of both energy and latency. We present two complementary techniques for reducing energy on massively threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency, and a larger set of pending threads to hide main memory latency. Reducing the number of threads that the scheduler must consider each cycle improves the scheduler's energy efficiency. Second, we propose replacing the monolithic register file found in modern designs with a hierarchical register file. We explore various trade-offs for the hierarchy, including the number of levels and the number of entries at each level. We consider both a hardware-managed caching scheme and a software-managed scheme in which the compiler is responsible for orchestrating all data movement within the register file hierarchy. Combined with a hierarchical register file, our two-level thread scheduler provides a further reduction in energy by allocating entries in the upper levels of the register file hierarchy only for active threads. Averaging across a variety of real-world graphics and compute workloads, the active thread count can be reduced by a factor of 4 with minimal impact on performance, and our most efficient three-level software-managed register file hierarchy reduces register file energy by 54%.
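A minimal sketch of the two-level scheduling idea follows, assuming a simplified thread interface (`blocks_on_memory()`, `ready()`) invented for illustration: the scheduler selects only among a small active set each cycle, demotes a thread to the pending set when it issues a long-latency memory access, and promotes a ready pending thread in its place.

```python
from collections import deque

class TwoLevelScheduler:
    def __init__(self, threads, active_size=8):
        self.active = deque(threads[:active_size])   # considered every cycle
        self.pending = deque(threads[active_size:])  # hides main-memory latency

    def issue(self):
        thread = self.active.popleft()
        if thread.blocks_on_memory():
            # Promote a ready pending thread first, then park this one
            # until its memory access completes.
            ready = next((t for t in self.pending if t.ready()), None)
            if ready is not None:
                self.pending.remove(ready)
                self.active.append(ready)
            self.pending.append(thread)
        else:
            self.active.append(thread)  # round-robin within the active set
        return thread
```

Keeping the per-cycle selection to a handful of threads rather than the full thread pool is what shrinks the selection logic, and restricting the upper register file levels to active threads is what lets the two techniques compose.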
{"title":"A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors","authors":"Mark Gebhart, Daniel R. Johnson, D. Tarjan, S. Keckler, W. Dally, Erik Lindholm, K. Skadron","doi":"10.1145/2166879.2166882","DOIUrl":"https://doi.org/10.1145/2166879.2166882","url":null,"abstract":"Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Reducing the number of threads that the scheduler must consider each cycle improves the scheduler’s energy efficiency. Second, we propose replacing the monolithic register file found on modern designs with a hierarchical register file. We explore various trade-offs for the hierarchy including the number of levels in the hierarchy and the number of entries at each level. We consider both a hardware-managed caching scheme and a software-managed scheme, where the compiler is responsible for orchestrating all data movement within the register file hierarchy. Combined with a hierarchical register file, our two-level thread scheduler provides a further reduction in energy by only allocating entries in the upper levels of the register file hierarchy for active threads. Averaging across a variety of real world graphics and compute workloads, the active thread count can be reduced by a factor of 4 with minimal impact on performance and our most efficient three-level software-managed register file hierarchy reduces register file energy by 54%.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"28 1","pages":"8:1-8:38"},"PeriodicalIF":1.5,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78471372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multicore Memory Systems
Eiman Ebrahimi, Chang Joo Lee, O. Mutlu, Y. Patt
Cores in chip multiprocessors (CMPs) share multiple memory subsystem resources. If resource sharing is unfair, some applications can be delayed significantly while others are unfairly prioritized. Previous research proposed separate fairness mechanisms for each resource. Such resource-based fairness mechanisms, implemented independently in each resource, can make contradictory decisions, leading to low fairness and performance loss. Therefore, a coordinated mechanism that provides fairness in the entire shared memory system is desirable. This article proposes a new approach that provides fairness in the entire shared memory system, thereby eliminating the need for and complexity of developing fairness mechanisms for each resource. Our technique, Fairness via Source Throttling (FST), estimates unfairness in the entire memory system. If unfairness is above a system-software-set threshold, FST throttles down the cores causing unfairness by limiting the number of requests they create and the frequency at which they issue them. As such, our source-based fairness control ensures that fairness decisions are made in tandem for the entire memory system. FST enforces thread priorities/weights and enables system software to enforce different fairness objectives in the memory system. Our evaluations show that FST provides better system fairness and performance than three systems with state-of-the-art fairness mechanisms implemented in both shared caches and memory controllers.
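A minimal sketch of the control loop the abstract describes, under loudly stated assumptions: the slowdown estimator and the throttle knob (a cap on outstanding requests) are crude stand-ins for FST's actual hardware mechanisms, and picking the least-slowed core as the interferer is a simplification of its interference tracking.

```python
UNFAIRNESS_THRESHOLD = 1.4  # in FST this threshold is set by system software

def fst_interval(cores):
    """One throttling decision over an execution interval."""
    # Estimated slowdown: time sharing the memory system vs. running alone.
    slowdown = {c: c.estimated_shared_time / c.estimated_alone_time
                for c in cores}
    unfairness = max(slowdown.values()) / min(slowdown.values())
    if unfairness > UNFAIRNESS_THRESHOLD:
        # Throttle down the source of interference: the core making the
        # most progress at the others' expense (crudely, the least slowed).
        culprit = min(cores, key=lambda c: slowdown[c])
        culprit.max_outstanding_requests = max(
            1, culprit.max_outstanding_requests // 2)
    else:
        for c in cores:  # gradually restore request rates when fair
            c.max_outstanding_requests += 1
```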
{"title":"Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multicore Memory Systems","authors":"Eiman Ebrahimi, Chang Joo Lee, O. Mutlu, Y. Patt","doi":"10.1145/2166879.2166881","DOIUrl":"https://doi.org/10.1145/2166879.2166881","url":null,"abstract":"Cores in chip-multiprocessors (CMPs) share multiple memory subsystem resources. If resource sharing is unfair, some applications can be delayed significantly while others are unfairly prioritized. Previous research proposed separate fairness mechanisms for each resource. Such resource-based fairness mechanisms implemented independently in each resource can make contradictory decisions, leading to low fairness and performance loss. Therefore, a coordinated mechanism that provides fairness in the entire shared memory system is desirable.\u0000 This article proposes a new approach that provides fairness in the entire shared memory system, thereby eliminating the need for and complexity of developing fairness mechanisms for each resource. Our technique, Fairness via Source Throttling (FST), estimates unfairness in the entire memory system. If unfairness is above a system-software-set threshold, FST throttles down cores causing unfairness by limiting the number of requests they create and the frequency at which they do. As such, our source-based fairness control ensures fairness decisions are made in tandem in the entire memory system. FST enforces thread priorities/weights, and enables system-software to enforce different fairness objectives in the memory system.\u0000 Our evaluations show that FST provides the best system fairness and performance compared to three systems with state-of-the-art fairness mechanisms implemented in both shared caches and memory controllers.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"30 1","pages":"7"},"PeriodicalIF":1.5,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2166879.2166881","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64134582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging Core Specialization via OS Scheduling to Improve Performance on Asymmetric Multicore Systems
J. C. Saez, Alexandra Fedorova, David A. Koufaty, M. Prieto
Asymmetric multicore processors (AMPs) consist of cores with the same ISA (instruction-set architecture) but different microarchitectural features, speed, and power consumption. Because cores with more complex features and higher speed typically occupy more area and consume more energy than simpler, slower cores, they should be reserved for applications that gain significant performance from those features. Having cores of different types in a single system allows optimizing the performance/energy trade-off. To deliver this potential to unmodified applications, the OS scheduler must map threads to cores in consideration of the properties of both. Our work describes a Comprehensive scheduler for Asymmetric Multicore Processors (CAMP) that addresses shortcomings of previous asymmetry-aware schedulers. First, previous schedulers catered to only one of the workload properties crucial for scheduling on AMPs: either efficiency or thread-level parallelism (TLP), but not both. CAMP overcomes this limitation, showing how using efficiency and TLP in synergy in a single scheduling algorithm improves performance. Second, most existing schedulers that rely on models for estimating how much faster a thread executes on a “fast” versus a “slow” core (the speedup factor) were designed for AMP systems whose cores differ only in clock frequency; more realistic AMP systems include cores that differ more significantly in their features. To demonstrate CAMP's effectiveness in such scenarios, we augmented it with a model that predicts the speedup factor on a real AMP prototype that closely matches future asymmetric systems.
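A minimal sketch of the kind of decision such a scheduler faces, assuming a per-thread speedup factor has already been estimated; the utility rule here (discount a thread's speedup factor by its application's thread count) is one plausible way to combine efficiency with TLP, not CAMP's exact formula.

```python
def pick_threads_for_fast_cores(threads, n_fast):
    """Choose which threads occupy the scarce fast cores this interval."""
    def utility(t):
        # A thread of a highly parallel app gains little from one fast core,
        # because its siblings stay on slow cores and bound overall progress;
        # a sequential app's only thread gets the full speedup factor.
        return t.speedup_factor / t.app_thread_count
    return sorted(threads, key=utility, reverse=True)[:n_fast]
```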
{"title":"Leveraging Core Specialization via OS Scheduling to Improve Performance on Asymmetric Multicore Systems","authors":"J. C. Saez, Alexandra Fedorova, David A. Koufaty, M. Prieto","doi":"10.1145/2166879.2166880","DOIUrl":"https://doi.org/10.1145/2166879.2166880","url":null,"abstract":"Asymmetric multicore processors (AMPs) consist of cores with the same ISA (instruction-set architecture), but different microarchitectural features, speed, and power consumption. Because cores with more complex features and higher speed typically use more area and consume more energy relative to simpler and slower cores, we must use these cores for running applications that experience significant performance improvements from using those features. Having cores of different types in a single system allows optimizing the performance/energy trade-off. To deliver this potential to unmodified applications, the OS scheduler must map threads to cores in consideration of the properties of both. Our work describes a Comprehensive scheduler for Asymmetric Multicore Processors (CAMP) that addresses shortcomings of previous asymmetry-aware schedulers. First, previous schedulers catered to only one kind of workload properties that are crucial for scheduling on AMPs; either efficiency or thread-level parallelism (TLP), but not both. CAMP overcomes this limitation showing how using both efficiency and TLP in synergy in a single scheduling algorithm can improve performance. Second, most existing schedulers relying on models for estimating how much faster a thread executes on a “fast” vs. “slow” core (i.e., the speedup factor) were specifically designed for AMP systems where cores differ only in clock frequency. However, more realistic AMP systems include cores that differ more significantly in their features. To demonstrate the effectiveness of CAMP on more realistic scenarios, we augmented the CAMP scheduler with a model that predicts the speedup factor on a real AMP prototype that closely matches future asymmetric systems.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"82 1","pages":"6:1-6:38"},"PeriodicalIF":1.5,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87103410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DoublePlay: Parallelizing Sequential Logging and Replay
K. Veeraraghavan, Dongyoon Lee, Benjamin Wester, Jessica Ouyang, Peter M. Chen, J. Flinn, S. Narayanasamy
Deterministic replay systems record and reproduce the execution of a hardware or software system. In contrast to replaying execution on uniprocessors, deterministic replay on multiprocessors is very challenging to implement efficiently because of the need to reproduce the order of, or the values read by, shared-memory operations performed by multiple threads. In this paper, we present DoublePlay, a new way to efficiently guarantee replay on commodity multiprocessors. Our key insight is that one can use the simpler and faster mechanisms of single-processor record and replay, yet still achieve the scalability offered by multiple cores, by using an additional execution to parallelize the record and replay of an application. DoublePlay timeslices multiple threads on a single processor, then runs multiple time intervals (epochs) of the program concurrently on separate processors. This strategy, which we call uniparallelism, makes logging much easier because each epoch runs on a single processor (so threads in an epoch never simultaneously access the same memory) and different epochs operate on different copies of the memory. Thus, rather than logging the order of shared-memory accesses, we need only log the order in which threads in an epoch are timesliced on the processor. DoublePlay runs an additional execution of the program on multiple processors to generate checkpoints so that epochs run in parallel. We evaluate DoublePlay on a variety of client, server, and scientific parallel benchmarks; with spare cores, DoublePlay reduces logging overhead to an average of 15% with two worker threads and 28% with four threads.
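A minimal sketch of the record-side consequence of uniparallelism: because each epoch's threads are timesliced on one processor, the log only needs the schedule, not every shared-memory access. Epoch checkpointing and the parallel execution that generates checkpoints are elided; the class and its interface are illustrative, not DoublePlay's implementation.

```python
import threading

class EpochRecorder:
    def __init__(self):
        self.schedule = []             # ordered (epoch, thread_id) log
        self.token = threading.Lock()  # one runnable thread per epoch at a time

    def run_slice(self, epoch, thread_id, work):
        with self.token:               # serializes threads within the epoch
            self.schedule.append((epoch, thread_id))
            work()                     # runs alone, so no data race to record

    def replay(self, epoch, slices):
        # Re-execute an epoch by running its slices in the recorded order.
        for ep, tid in self.schedule:
            if ep == epoch:
                slices[tid]()
```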
{"title":"DoublePlay: Parallelizing Sequential Logging and Replay","authors":"K. Veeraraghavan, Dongyoon Lee, Benjamin Wester, Jessica Ouyang, Peter M. Chen, J. Flinn, S. Narayanasamy","doi":"10.1145/2110356.2110359","DOIUrl":"https://doi.org/10.1145/2110356.2110359","url":null,"abstract":"Deterministic replay systems record and reproduce the execution of a hardware or software system. In contrast to replaying execution on uniprocessors, deterministic replay on multiprocessors is very challenging to implement efficiently because of the need to reproduce the order of or the values read by shared memory operations performed by multiple threads. In this paper, we present DoublePlay, a new way to efficiently guarantee replay on commodity multiprocessors. Our key insight is that one can use the simpler and faster mechanisms of single-processor record and replay, yet still achieve the scalability offered by multiple cores, by using an additional execution to parallelize the record and replay of an application. DoublePlay timeslices multiple threads on a single processor, then runs multiple time intervals (epochs) of the program concurrently on separate processors. This strategy, which we call uniparallelism, makes logging much easier because each epoch runs on a single processor (so threads in an epoch never simultaneously access the same memory) and different epochs operate on different copies of the memory. Thus, rather than logging the order of shared-memory accesses, we need only log the order in which threads in an epoch are timesliced on the processor. DoublePlay runs an additional execution of the program on multiple processors to generate checkpoints so that epochs run in parallel. We evaluate DoublePlay on a variety of client, server, and scientific parallel benchmarks; with spare cores, DoublePlay reduces logging overhead to an average of 15% with two worker threads and 28% with four threads.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"83 1","pages":"3:1-3:24"},"PeriodicalIF":1.5,"publicationDate":"2012-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76563108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving Software Diagnosability via Log Enhancement
Ding Yuan, Jing Zheng, Soyeon Park, Yuanyuan Zhou, S. Savage
Diagnosing software failures in the field is notoriously difficult, in part due to the fundamental complexity of troubleshooting any complex software system, but further exacerbated by the paucity of information that is typically available in the production setting. Indeed, for reasons of both overhead and privacy, it is common that only the run-time log generated by a system (e.g., syslog) can be shared with the developers. Unfortunately, such ad hoc reports are frequently insufficient for detailed failure diagnosis. This paper seeks to improve this situation within the rubric of existing practice. We describe a tool, LogEnhancer, that automatically “enhances” existing logging code to aid in future post-failure debugging. We evaluate LogEnhancer on eight large, real-world applications and demonstrate that it can dramatically reduce the set of potential root failure causes that must be considered, while imposing negligible overheads.
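In the spirit of the idea (LogEnhancer itself statically analyzes C programs), here is a toy sketch that walks Python sources instead: find error-level log calls and suggest in-scope variables whose values could be recorded alongside the message. The scoping here is a coarse over-approximation that ignores control flow and assignment order.

```python
import ast

def suggest_log_variables(source):
    """Map each error/warning log call to candidate variables to record."""
    suggestions = []
    tree = ast.parse(source)
    for func in (n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)):
        in_scope = {a.arg for a in func.args.args}       # parameters
        for node in ast.walk(func):
            if isinstance(node, ast.Assign):             # locals (coarse)
                in_scope |= {t.id for t in node.targets
                             if isinstance(t, ast.Name)}
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and node.func.attr in ("error", "warning")):
                suggestions.append((func.name, node.lineno, sorted(in_scope)))
    return suggestions

src = '''
def connect(host, port):
    retries = 3
    log.error("connection failed")
'''
print(suggest_log_variables(src))  # [('connect', 4, ['host', 'port', 'retries'])]
```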
{"title":"Improving Software Diagnosability via Log Enhancement","authors":"Ding Yuan, Jing Zheng, Soyeon Park, Yuanyuan Zhou, S. Savage","doi":"10.1145/2110356.2110360","DOIUrl":"https://doi.org/10.1145/2110356.2110360","url":null,"abstract":"Diagnosing software failures in the field is notoriously difficult, in part due to the fundamental complexity of troubleshooting any complex software system, but further exacerbated by the paucity of information that is typically available in the production setting. Indeed, for reasons of both overhead and privacy, it is common that only the run-time log generated by a system (e.g., syslog) can be shared with the developers. Unfortunately, the ad-hoc nature of such reports are frequently insufficient for detailed failure diagnosis. This paper seeks to improve this situation within the rubric of existing practice. We describe a tool, LogEnhancer that automatically “enhances” existing logging code to aid in future post-failure debugging. We evaluate LogEnhancer on eight large, real-world applications and demonstrate that it can dramatically reduce the set of potential root failure causes that must be considered while imposing negligible overheads.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"89 1","pages":"4:1-4:28"},"PeriodicalIF":1.5,"publicationDate":"2012-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75658685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Introduction to Special Issue: ASPLOS 2011
T. Mowry
It is a great pleasure to welcome you to this special issue of ACM Transactions on Computer Systems, which highlights the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), held at Newport Beach, California, in March 2011. ASPLOS is a multidisciplinary conference for research that spans the boundaries of hardware, computer architecture, compilers, languages, operating systems, networking, and applications. ACM TOCS has recently begun a new tradition of inviting the authors of award-quality ASPLOS papers to submit extended versions of their work for fast-track consideration for publication in ACM TOCS. I am very pleased to announce that extended versions of all four of the papers that were finalists for the Best Paper Award at ASPLOS 2011 appear in this special issue of ACM TOCS. Each of these papers stood out not only for its overall quality and expected research impact, but also because the reviewers and program committee members found it unusually novel and thought provoking. I hope that you enjoy reading each of these papers as much as I did.
{"title":"Introduction to Special Issue APLOS 2011","authors":"T. Mowry","doi":"10.1145/2110356.2110357","DOIUrl":"https://doi.org/10.1145/2110356.2110357","url":null,"abstract":"It is a great pleasure to welcome you to this special issue of ACM Transactions on Computer Systems that is focusing on highlights from the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), held at Newport Beach, California, in March 2011. ASPLOS is a multidisciplinary conference for research that spans the boundaries of hardware, computer architecture, compilers, languages, operating systems, networking, and applications. ACM TOCS has recently begun a new tradition of inviting the authors of awardquality ASPLOS papers to submit extended versions of their work for fast-track consideration for publication in ACM TOCS. I am very pleased to announce that extended versions of all four of the papers that were finalists for the Best Paper Award in ASPLOS 2011 are appearing in this special issue of ACM TOCS. Each of these papers stood out not only due to their overall quality and expected research impact, but also because the reviewers and program committee members found them to be unusually novel and thought provoking. I hope that you enjoy reading each of these papers as much as I did.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"39 1","pages":"1:1"},"PeriodicalIF":1.5,"publicationDate":"2012-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90440093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}