
Latest ASPLOS XI publications

Deconstructing storage arrays
Pub Date: 2004-10-07 DOI: 10.1145/1024393.1024401
Timothy E. Denehy, John Bent, Florentina I. Popovici, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
We introduce Shear, a user-level software tool that characterizes RAID storage arrays. Shear employs a set of controlled algorithms combined with statistical techniques to automatically determine the important properties of a RAID system, including the number of disks, chunk size, level of redundancy, and layout scheme. We illustrate the correctness of Shear by running it upon numerous simulated configurations, and then verify its real-world applicability by running Shear on both software-based and hardware-based RAID systems. Finally, we demonstrate the utility of Shear through three case studies. First, we show how Shear can be used in a storage management environment to verify RAID construction and detect failures. Second, we demonstrate how Shear can be used to extract detailed characteristics about the individual disks within an array. Third, we show how an operating system can use Shear to automatically tune its storage subsystems to specific RAID configurations.
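The probing idea lends itself to a compact illustration. Below is a minimal, self-contained Python sketch — not the authors' Shear implementation — of how controlled timing measurements can recover the chunk size of a simulated RAID-0 array: two reads that land on the same disk serialize, so measured latency drops once the probe distance crosses a chunk boundary. The latency model, array parameters, and function names are all invented for illustration.

```python
import random

NUM_DISKS, CHUNK = 4, 16          # hidden ground truth the probe must recover
DISK_TIME = 5.0                   # simulated per-request service time (ms)

def disk_of(block):               # RAID-0 striping (hidden from the probe)
    return (block // CHUNK) % NUM_DISKS

def timed_pair(a, b):
    # Two requests to the same disk serialize; different disks overlap.
    base = DISK_TIME if disk_of(a) != disk_of(b) else 2 * DISK_TIME
    return base + random.gauss(0, 0.2)     # measurement noise

def estimate_chunk(max_dist=256, trials=40):
    # Smallest distance at which a block pair usually lands on different
    # disks: that is where the average pairwise latency drops.
    for d in range(1, max_dist):
        mean = sum(timed_pair(0, d) for _ in range(trials)) / trials
        if mean < 1.5 * DISK_TIME:         # latency dropped => parallel disks
            return d
    return None

if __name__ == "__main__":
    print("estimated chunk size:", estimate_chunk())   # prints 16
```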
Citations: 25
Low-overhead memory leak detection using adaptive statistical profiling
Pub Date: 2004-10-07 DOI: 10.1145/1024393.1024412
Matthias Hauswirth, Trishul M. Chilimbi
Sampling has been successfully used to identify performance optimization opportunities. We would like to apply similar techniques to check program correctness. Unfortunately, sampling provides poor coverage of infrequently executed code, where bugs often lurk. We describe an adaptive profiling scheme that addresses this by sampling executions of code segments at a rate inversely proportional to their execution frequency. To validate our ideas, we have implemented SWAT, a novel memory leak detection tool. SWAT traces program allocations/frees to construct a heap model and uses our adaptive profiling infrastructure to monitor loads/stores to these objects with low overhead. SWAT reports 'stale' objects that have not been accessed for a 'long' time as leaks. This allows it to find all leaks that manifest during the current program execution. Since SWAT has low runtime overhead (<5%) and low space overhead (<10% in most cases and often less than 5%), it can be used to track leaks in production code that take days to manifest. In addition to identifying the allocations that leak memory, SWAT exposes where the program last accessed the leaked data, which facilitates debugging and fixing the leak. SWAT has been used by several product groups at Microsoft for the past 18 months and has proved effective at detecting leaks with a low false positive rate (<10%).
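The core of the adaptive scheme — sampling each code segment at a rate inversely proportional to how often it has executed — fits in a few lines. This is a hedged Python illustration of the general technique, not SWAT's instrumentation; the counter scheme and the probability floor are invented for the sketch.

```python
import random
from collections import defaultdict

exec_count = defaultdict(int)   # per-segment execution counters
hits = defaultdict(int)         # how often the monitor actually fired

def should_sample(segment_id, floor=1e-4):
    # Sampling probability ~ 1 / execution frequency, with a small floor
    # so even very hot code is still checked occasionally.
    exec_count[segment_id] += 1
    return random.random() < max(1.0 / exec_count[segment_id], floor)

def run_segment(segment_id):
    if should_sample(segment_id):
        hits[segment_id] += 1   # here SWAT would record loads/stores for staleness

for _ in range(10_000):
    run_segment("hot_loop")     # sampled only ~10 times in expectation
run_segment("cold_path")        # first execution: sampled with probability 1
print(dict(hits))
```

Cold code thus gets near-complete coverage (where bugs lurk), while hot code contributes almost no overhead.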
Citations: 254
Dynamic tracking of page miss ratio curve for memory management
Pub Date: 2004-10-07 DOI: 10.1145/1024393.1024415
Pin Zhou, V. Pandey, Jagadeesan Sundaresan, Anand Raghuraman, Yuanyuan Zhou, Sanjeev Kumar
Memory can be efficiently utilized if the dynamic memory demands of applications can be determined and analyzed at run time. The page miss ratio curve (MRC), i.e. the page miss rate vs. memory size curve, is a good performance-directed metric to serve this purpose. However, dynamically tracking the MRC at run time is challenging in systems with virtual memory because not every memory reference passes through the operating system (OS). This paper proposes two methods to dynamically track the MRC of applications at run time. The first method uses a hardware MRC monitor that can track the MRC at fine time granularity. Our simulation results show that this monitor has negligible performance and energy overheads. The second method is an OS-only implementation that can track the MRC at coarse time granularity. Our implementation results on Linux show that it adds only 7--10% overhead. We have also used the dynamic MRC to guide both memory allocation for multiprogramming systems and memory energy management. Our real-system experiments on Linux with applications including the Apache Web Server show that MRC-directed memory allocation can speed up the applications' execution/response time by up to a factor of 5.86 and reduce the number of page faults by up to 63.1%. Our execution-driven simulation results with SPEC2000 benchmarks show that MRC-directed memory energy management can improve the Energy*Delay metric by 27--58% over previously proposed static and dynamic schemes.
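An MRC can be computed in one pass over a page reference trace with an LRU stack (Mattson) simulation, which is the kind of bookkeeping an OS- or hardware-level tracker maintains. The sketch below is a generic illustration, not the paper's implementation, and is quadratic in trace length for clarity.

```python
def miss_ratio_curve(trace, max_pages):
    # One pass records LRU stack distances; the hit count for a memory of
    # S pages is the number of references with stack distance <= S.
    stack, dist_hist = [], [0] * (max_pages + 1)
    for page in trace:
        if page in stack:
            depth = stack.index(page) + 1      # LRU stack distance
            if depth <= max_pages:
                dist_hist[depth] += 1
            stack.remove(page)
        stack.insert(0, page)                  # move page to MRU position
    total, hits, mrc = len(trace), 0, []
    for size in range(1, max_pages + 1):
        hits += dist_hist[size]
        mrc.append((size, (total - hits) / total))  # miss ratio at this size
    return mrc

trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4] * 50
for size, ratio in miss_ratio_curve(trace, 4):
    print(f"{size} pages -> miss ratio {ratio:.2f}")
```

The output shows the knee of the curve directly: once memory holds the working set (4 pages here), only cold misses remain.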
Citations: 252
Heat-and-run: leveraging SMT and CMP to manage power density through the operating system
Pub Date: 2004-10-07 DOI: 10.1145/1024393.1024424
M. Gomaa, Michael D. Powell, T. N. Vijaykumar
Power density in high-performance processors continues to increase with technology generations as scaling of current, clock speed, and device density outpaces the downscaling of supply voltage and the thermal ability of packages to dissipate heat. Power density is characterized by localized chip hot spots that can reach critical temperatures and cause failure. Previous architectural approaches to power density have used global clock gating, fetch toggling, dynamic frequency scaling, or resource duplication to either prevent heating or relieve overheated resources in a superscalar processor. Previous approaches also evaluate design technologies where power density is not a major problem and most applications do not overheat the processor. Future processors, however, are likely to be chip multiprocessors (CMPs) with simultaneously-multithreaded (SMT) cores. SMT CMPs pose unique challenges and opportunities for power density. SMT and CMP increase throughput and thus on-chip heat, but also provide natural granularities for managing power density. This paper is the first work to leverage SMT and CMP to address power density. We propose heat-and-run SMT thread assignment to increase processor-resource utilization before cooling becomes necessary by co-scheduling threads that use complementary resources. We propose heat-and-run CMP thread migration to migrate threads away from overheated cores and assign them to free SMT contexts on alternate cores, leveraging the availability of SMT contexts on alternate CMP cores to maintain throughput while allowing overheated cores to cool. We show that our proposal achieves an average of 9% and up to 34% higher throughput than a previous superscalar technique running the same number of threads.
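The migration policy itself is simple to state. Below is a hedged Python sketch of the scheduling logic only — the temperatures are synthetic inputs rather than a thermal model, and the threshold and SMT width are invented for illustration: threads on a core past the thermal limit are packed onto free SMT contexts of the coolest available cores.

```python
THRESHOLD, SMT_WAYS = 85.0, 2      # invented thermal limit and SMT contexts/core

cores = [
    {"id": 0, "temp": 92.0, "threads": ["A", "B"]},   # overheated
    {"id": 1, "temp": 70.0, "threads": ["C"]},
    {"id": 2, "temp": 65.0, "threads": []},
]

def heat_and_run_step(cores):
    for hot in [c for c in cores if c["temp"] > THRESHOLD]:
        while hot["threads"]:
            # Pick the coolest core that still has a free SMT context.
            targets = [c for c in cores
                       if c is not hot and len(c["threads"]) < SMT_WAYS]
            if not targets:
                break              # nowhere to run: fall back to throttling
            dest = min(targets, key=lambda c: c["temp"])
            dest["threads"].append(hot["threads"].pop())

heat_and_run_step(cores)
for c in cores:
    print(c)   # core 0 drained to cool; its threads run on cores 2 and 1
```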
Citations: 339
Continual flow pipelines
Pub Date: 2004-10-07 DOI: 10.1145/1024393.1024407
Srikanth T. Srinivasan, Ravi Rajwar, Haitham Akkary, A. Gandhi, M. Upton
Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects: how to build a processor that provides high single-thread performance, enables multiple such cores to be placed on the same die for high throughput, and dynamically adapts to future applications. Conventional approaches to high single-thread performance rely on large and complex cores to sustain a large instruction window for memory latency tolerance, making them unsuitable for multi-core chips. We present Continual Flow Pipelines (CFP), a new non-blocking processor pipeline architecture that achieves the performance of a large instruction window without requiring cycle-critical structures such as the scheduler and register file to be large. We show that to achieve the benefits of a large instruction window, inefficiencies in the management of both the scheduler and the register file must be addressed, and we propose a unified solution. The non-blocking property of CFP keeps small the key processor structures that affect cycle time and power (scheduler, register file) and die size (second-level cache). The memory-latency-tolerant CFP core allows multiple cores on a single die while outperforming current processor cores on single-thread applications.
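The mechanism at the heart of CFP — letting miss-independent instructions keep flowing while miss-dependent ones drain out of the small scheduler into a slice buffer — can be caricatured with a toy issue loop. The Python sketch below illustrates only that dataflow idea, under invented instruction encodings and simplifying assumptions (one outstanding miss, no renaming); it is not the CFP microarchitecture.

```python
from collections import deque

# (instruction text, registers written, registers read); the load misses.
scheduler = deque([
    ("load r1, [mem]",  {"r1"}, set()),
    ("add  r2, r1, 4",  {"r2"}, {"r1"}),          # depends on the missing load
    ("mul  r3, r5, r6", {"r3"}, {"r5", "r6"}),    # independent work
    ("sub  r4, r2, 1",  {"r4"}, {"r2"}),          # transitively dependent
])
slice_buffer, poisoned = deque(), set()

while scheduler:
    text, writes, reads = scheduler.popleft()
    if "[mem]" in text or reads & poisoned:
        poisoned |= writes            # poison propagates through consumers
        slice_buffer.append(text)     # drain into the slice buffer
    else:
        print("executed now :", text) # independent instructions keep flowing

print("deferred slice:", list(slice_buffer))
# When the miss data returns, the slice re-enters the pipeline and replays,
# so scheduler entries were never held hostage by the miss.
```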
Citations: 197
HIDE: an infrastructure for efficiently protecting information leakage on the address bus
Pub Date: 2004-10-07 DOI: 10.1145/1024393.1024403
Xiaotong Zhuang, Zhang Tao, S. Pande
The XOM-based secure processor was recently introduced as a mechanism to provide copy- and tamper-resistant execution. XOM provides support for encryption/decryption and integrity checking. However, neither XOM nor any other current approach adequately addresses the problem of information leakage via the address bus. This paper shows that without address bus protection, the XOM model is severely crippled. We present two realistic attacks, and experiments show that 70% of the code might be cracked and sensitive data might be exposed, leading to serious security breaches. Although the problem of address bus leakage has been widely acknowledged in both industry and academia, no practical solution has been proposed that provides an adequate security guarantee, mainly because most solutions carry severe performance degradation in practice. This paper presents an infrastructure called HIDE (Hardware-support for leakage-Immune Dynamic Execution), which provides chunk-level protection with hardware support and a flexible interface that can be orchestrated through the proposed compiler optimization and user specifications, allowing the underlying hardware to be used more efficiently for better security guarantees. Our results show that protecting both data and code with a high level of security guarantee is possible with negligible performance penalty (1.3% slowdown).
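Chunk-level protection hides address patterns by remapping line addresses within a chunk and reshuffling the mapping so repeated access sequences do not repeat on the bus. The Python sketch below illustrates only that general permutation idea; the chunk size, class names, and reshuffle trigger are invented, and none of the paper's hardware design is reproduced.

```python
import random

LINES_PER_CHUNK = 8    # invented chunk granularity for the sketch

class ChunkRemapper:
    def __init__(self):
        self.perm = list(range(LINES_PER_CHUNK))
        random.shuffle(self.perm)

    def translate(self, addr):
        # The bus observes the permuted line address, not the real one.
        chunk, line = divmod(addr, LINES_PER_CHUNK)
        return chunk * LINES_PER_CHUNK + self.perm[line]

    def reshuffle(self):
        # Done under cover of a chunk flush, so old and new mappings
        # cannot be correlated by watching the bus.
        random.shuffle(self.perm)

r = ChunkRemapper()
trace = [0, 1, 2, 0, 1, 2]        # a repeating (leaky) access pattern
print("bus sees:", [r.translate(a) for a in trace[:3]])
r.reshuffle()                     # after the flush, the same pattern
print("bus sees:", [r.translate(a) for a in trace[3:]])   # maps differently
```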
Citations: 193
Helper threads via virtual multithreading on an experimental Itanium® 2 processor-based platform
Pub Date: 2004-10-07 DOI: 10.1145/1024393.1024411
P. Wang, Jamison D. Collins, Hong Wang, Dongkeun Kim, Bill Greene, Kai-Ming Chan, Aamir B. Yunus, T. Sych, Stephen F. Moore, John Paul Shen
Helper threading is a technique for accelerating a program by exploiting a processor's multithreading capability to run "assist" threads. Previous experiments on hyper-threaded processors have demonstrated significant speedups from using helper threads to prefetch hard-to-predict delinquent data accesses. To apply this technique to processors that lack built-in hardware support for multithreading, we introduce virtual multithreading (VMT), a novel form of switch-on-event user-level multithreading capable of fly-weight multiplexing of event-driven thread executions on a single processor without additional operating system support. The compiler plays a key role in minimizing synchronization cost by judiciously partitioning register usage among the user-level threads. The VMT approach makes it possible to launch dynamic helper thread instances in response to long-latency cache miss events, and to run helper threads in the shadow of cache misses when the main thread would otherwise be stalled. The concept of VMT is prototyped on an Itanium® 2 processor using features provided by the Processor Abstraction Layer (PAL) firmware mechanism already present in currently shipping processors. On a 4-way MP physical system equipped with VMT-enabled Itanium 2 processors, helper threading via the VMT mechanism achieves significant performance gains for a diverse set of real-world workloads, ranging from single-threaded workstation benchmarks to heavily multithreaded large-scale decision support systems (DSS) using the IBM DB2 Universal Database. We measure a wall-clock speedup of 5.8% to 38.5% for the workstation benchmarks, and 5.0% to 12.7% on various queries in the DSS workload.
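The switch-on-event flow — yield on a long-latency miss, run the helper in the miss shadow — can be mimicked at user level with coroutines. The sketch below is only an analogy in Python generators, not the PAL-based Itanium mechanism; the cache model and lookahead distance are invented for illustration.

```python
cache = set()

def main_thread(data):
    # Yields on every simulated long-latency miss (the "switch-on-event").
    for i, x in enumerate(data):
        if x not in cache:
            yield i                 # control transfers to the helper
            cache.add(x)            # the miss completes after the switch
        # ... real work on x would happen here ...

def helper_thread(data, miss_index, lookahead=4):
    # Runs in the shadow of the miss: prefetch the next few accesses.
    for x in data[miss_index + 1 : miss_index + 1 + lookahead]:
        cache.add(x)

data = list(range(12))
misses = 0
for i in main_thread(data):         # each yield is one miss event
    misses += 1
    helper_thread(data, i)
print("misses with helper thread:", misses)   # 3 instead of 12
```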
Citations: 28
Software prefetching for mark-sweep garbage collection: hardware analysis and software redesign
Pub Date: 2004-10-07 DOI: 10.1145/1024393.1024417
Chen-Yong Cher, Antony Lloyd Hosking, T. N. Vijaykumar
Tracing garbage collectors traverse references from live program variables, transitively tracing out the closure of live objects. Memory accesses incurred during tracing are essentially random: a given object may contain references to any other object. Since application heaps are typically much larger than hardware caches, tracing results in many cache misses. Technology trends will make cache misses more important, so tracing is a prime target for prefetching. Simulations of Java benchmarks running with the Boehm-Demers-Weiser mark-sweep garbage collector on a projected hardware platform reveal high tracing overhead (up to 65% of elapsed time) and show that cache misses are a problem. Applying Boehm's default prefetching strategy yields improvements in execution time (16% on average with incremental/generational collection for GC-intensive benchmarks), but analysis shows that his strategy suffers from significant timing problems: prefetches occur too early or too late relative to their matching loads. This analysis drives the development of a new prefetching strategy that yields up to three times the performance improvement of Boehm's strategy on GC-intensive benchmarks (27% average speedup), and achieves performance close to that of perfect timing (i.e., few misses for tracing accesses) on some benchmarks. Validating these simulation results with live runs on current hardware produces an average speedup of 6% for the new strategy on GC-intensive benchmarks with a GC configuration that tightly controls heap growth. In contrast, Boehm's default prefetching strategy is ineffective on this platform.
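The timing fix amounts to inserting a small FIFO between the mark stack and the scan loop, so each object is prefetched a fixed distance before it is scanned. The sketch below shows that general buffered-prefetch structure in Python under invented names; `prefetch` stands in for a hardware prefetch hint, and none of the Boehm collector's details are reproduced.

```python
from collections import deque

PREFETCH_DIST = 4    # invented prefetch distance (FIFO depth)

def prefetch(obj):
    pass             # stands in for a hardware prefetch hint on obj's lines

def mark(roots, heap):
    marked, stack, fifo = set(), list(roots), deque()
    while stack or fifo:
        # Keep the FIFO primed: issue prefetches PREFETCH_DIST objects ahead.
        while stack and len(fifo) < PREFETCH_DIST:
            obj = stack.pop()
            prefetch(obj)
            fifo.append(obj)
        obj = fifo.popleft()        # its cache lines should be arriving now
        if obj in marked:
            continue
        marked.add(obj)
        for child in heap.get(obj, []):
            if child not in marked:
                stack.append(child)
    return marked

heap = {"r": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": [], "d": []}
print(sorted(mark(["r"], heap)))    # ['a', 'b', 'c', 'd', 'r']
```

The FIFO depth is the tuning knob: too shallow and data arrives late; too deep and prefetched lines are evicted before use.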
Citations: 26
Application-level checkpointing for shared memory programs
Pub Date: 2004-10-07 DOI: 10.1145/1024393.1024421
G. Bronevetsky, Daniel Marques, K. Pingali, P. Szwed, M. Schulz
Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One advantage of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
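At its smallest, application-level CPR is a save/restore of explicit program state at safe points. The Python miniature below shows that pattern for a single-threaded loop; the paper's pre-compiler inserts equivalent code automatically and adds a protocol to coordinate threads, neither of which is reproduced here. The file name and state layout are invented.

```python
import os
import pickle

CKPT = "app.ckpt"    # invented checkpoint file name

def checkpoint(state):
    # Write to a temp file, then rename: the checkpoint on disk is always
    # either the complete old state or the complete new state.
    with open(CKPT + ".tmp", "wb") as f:
        pickle.dump(state, f)
    os.replace(CKPT + ".tmp", CKPT)

def restore():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)   # resume from the last saved state
    return {"i": 0, "total": 0.0}   # fresh start

state = restore()
for i in range(state["i"], 1000):
    state["total"] += i * 0.5       # the "computation"
    state["i"] = i + 1
    if i % 100 == 0:
        checkpoint(state)           # a crash loses at most 100 iterations
print(state["total"])
```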
Citations: 131
Formal online methods for voltage/frequency control in multiple clock domain microprocessors
Pub Date: 2004-10-07 DOI: 10.1145/1024393.1024423
Qiang Wu, Philo Juang, M. Martonosi, D. Clark
Multiple Clock Domain (MCD) processors are a promising future alternative to today's fully synchronous designs. Dynamic Voltage and Frequency Scaling (DVFS) in an MCD processor has the extra flexibility to adjust the voltage and frequency in each domain independently. Most existing DVFS approaches are profile-based offline schemes that are mainly suitable for applications whose execution characteristics are constrained and repeatable. While some work has been published on online DVFS schemes, prior approaches are typically heuristic-based. In this paper, we present an effective online DVFS scheme for an MCD processor that takes a formal analytic approach, is driven by dynamic workloads, and is suitable for all applications. In our approach, we model an MCD processor as a queue-domain network and online DVFS as a feedback control problem with issue queue occupancies as feedback signals. A dynamic stochastic queuing model is first proposed and linearized through an accurate linearization technique. A controller is then designed and verified by stability analysis. Finally, we evaluate our DVFS scheme through cycle-accurate simulation with a broad set of applications selected from the MediaBench and SPEC2000 benchmark suites. Compared to the best-known prior approach, which is heuristic-based, the proposed online DVFS scheme is substantially more effective due to its ability to regulate automatically; for example, we achieve a 2-3x improvement in the energy-delay product. In addition, our control-theoretic technique is more resilient, requires less tuning effort, and scales better than prior online DVFS schemes. We believe that the techniques and methodology described in this paper can be generalized for energy control in processors beyond MCD designs, such as tiled stream processors.
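The control formulation can be made concrete with a generic PI controller closed around a toy queue model: frequency rises when issue-queue occupancy sits above its setpoint and falls when the queue drains. The sketch below is an illustration only — the gains, plant model, and setpoint are invented, not taken from the paper.

```python
SETPOINT, KP, KI = 0.5, 0.8, 0.2    # target occupancy fraction, PI gains
F_MIN, F_MAX = 0.2, 1.0             # normalized frequency bounds

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

freq, integ, occ = 1.0, 0.0, 0.9
for step in range(10):
    # Toy plant: work arrives at a fixed rate and drains in proportion to
    # the domain frequency, so occupancy settles where the two balance.
    occ = clamp(occ + 0.3 - 0.5 * freq, 0.0, 1.0)
    err = occ - SETPOINT            # queue too full -> positive error
    integ += err                    # integral term removes steady-state error
    freq = clamp(KP * err + KI * integ + 0.6, F_MIN, F_MAX)
    print(f"step {step}: occupancy={occ:.2f} frequency={freq:.2f}")
# Occupancy converges to the setpoint while frequency settles at the
# lowest value that keeps up with the arrival rate (0.6 here).
```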
Citations: 194