
Latest publications — IEEE International Symposium on Workload Characterization (IISWC'10)

Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5649549
Nilanjan Goswami, Ramkumar Shankar, Madhura Joshi, Tao Li
GPUs are emerging as general-purpose high-performance computing devices. Growing GPGPU research has made numerous GPGPU workloads available. However, a systematic approach to characterizing these benchmarks and analyzing their implications for GPU microarchitecture design evaluation is still lacking. In this research, we propose a set of microarchitecture-agnostic GPGPU workload characteristics to represent the workloads in a microarchitecture-independent space. Correlated dimensionality reduction and clustering analysis are used to understand these workloads. In addition, we propose a set of evaluation metrics to accurately evaluate the GPGPU design space. With a growing number of GPGPU workloads, this approach to analysis provides meaningful, accurate and thorough simulation for a proposed GPU architecture design choice. Architects also benefit by choosing a set of workloads that stress the intended functional block of the GPU microarchitecture. We present a diversity analysis of GPU benchmark suites such as the Nvidia CUDA SDK, Parboil and Rodinia. Our results show that, with a large number of diverse kernels, workloads such as Similarity Score, Parallel Reduction, and Scan of Large Arrays show diverse characteristics in different workload spaces. We have also explored diversity in different workload subspaces (e.g. memory coalescing and branch divergence). Similarity Score, Scan of Large Arrays, MUMmerGPU, Hybrid Sort, and Nearest Neighbor workloads exhibit relatively large variation in branch divergence characteristics compared to others. Memory coalescing behavior is diverse in Scan of Large Arrays, K-Means, Similarity Score and Parallel Reduction.
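The pipeline the abstract describes — microarchitecture-independent features, correlated dimensionality reduction, clustering — can be sketched in a few lines. Everything below is illustrative: the feature names and values are invented, and a sign-based two-way split stands in for a real clustering step; none of it is data from the paper.

```python
import numpy as np

# Hypothetical microarchitecture-independent characteristics per workload.
workloads = ["SimilarityScore", "ParallelReduction", "ScanLargeArrays", "KMeans"]
features = np.array([
    # [branch divergence, coalescing ratio, ILP, kernel count] -- made up
    [0.8, 0.2, 2.1, 12.0],
    [0.1, 0.9, 3.5,  1.0],
    [0.3, 0.4, 2.8,  2.0],
    [0.5, 0.6, 3.0,  3.0],
])

# Standardize so no single characteristic dominates the distance metric.
z = (features - features.mean(axis=0)) / features.std(axis=0)

# PCA via SVD: project workloads onto the top-2 principal components.
u, s, vt = np.linalg.svd(z, full_matrices=False)
coords = z @ vt[:2].T

# Crude 2-way split by the sign of the first principal component,
# standing in for a proper k-means / hierarchical clustering step.
clusters = {w: int(coords[i, 0] > 0) for i, w in enumerate(workloads)}
print(coords.shape)
print(clusters)
```

A real study would use many more features and a principled cluster count, but the shape of the analysis — standardize, project, group — is the same.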
Citations: 77
A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5650274
Shuai Che, J. Sheaffer, Michael Boyer, Lukasz G. Szafaryn, Liang Wang, K. Skadron
The recently released Rodinia benchmark suite enables users to evaluate heterogeneous systems including both accelerators, such as GPUs, and multicore CPUs. As Rodinia sees higher levels of acceptance, it becomes important that researchers understand this new set of benchmarks, especially in how they differ from previous work. In this paper, we present recent extensions to Rodinia and conduct a detailed characterization of the Rodinia benchmarks (including performance results on an NVIDIA GeForce GTX480, the first product released based on the Fermi architecture). We also compare and contrast Rodinia with Parsec to gain insights into the similarities and differences of the two benchmark collections; we apply principal component analysis to analyze the application space coverage of the two suites. Our analysis shows that many of the workloads in Rodinia and Parsec are complementary, capturing different aspects of certain performance metrics.
Citations: 305
Characterization of workload and resource consumption for an online travel and booking site
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5649408
Nicolás Poggi, David Carrera, Ricard Gavaldà, J. Torres, E. Ayguadé
Online travel and ticket booking is one of the top E-Commerce industries. Because these sites present a mix of products (flights, hotels, tickets, restaurants, activities and vacation packages), they rely on a wide range of technologies to support them (JavaScript, AJAX, XML, B2B web services, caching, search algorithms and affiliation), resulting in a very rich and heterogeneous workload. Moreover, visits to travel sites vary greatly with time of day, season, promotions, events, and linking, creating bursty traffic and making capacity planning a challenge. It is therefore of great importance to understand how users and crawlers interact on travel sites and their effect on server resources, both for devising cost-effective infrastructures and for improving the Quality of Service for users. In this paper we present a detailed workload and resource consumption characterization of the web site of a top national Online Travel Agency. Characterization is performed on server logs, including both HTTP data and the resource consumption of the requests, as well as the server load status during execution. From the dataset we characterize user sessions, their patterns, and how response time is affected as load on the web servers increases. We provide a fine-grained analysis by performing experiments that differentiate types of request, time of day, products, and the resource requirements of each. Results show that the workload is bursty, as expected; that day and night traffic exhibit different properties in terms of request-type mix; that user session lengths cover a wide range of durations; that response time grows proportionally to server load; and that the response time of external data providers also increases at peak hours, among other results. Such results can be useful for optimizing infrastructure costs, improving QoS for users, and developing realistic workload generators for similar applications.
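Two of the measurements the abstract mentions — traffic burstiness and user session lengths — fall out of server logs with very little machinery. The mini log below is entirely made up; a peak-to-mean ratio serves here as a simple stand-in for a burstiness index.

```python
from collections import Counter
from datetime import datetime

# Hypothetical access log: (session_id, timestamp, request_type).
log = [
    ("s1", "2010-07-01 10:00:05", "search"),
    ("s2", "2010-07-01 10:00:12", "search"),
    ("s1", "2010-07-01 10:00:40", "browse"),
    ("s2", "2010-07-01 10:00:55", "book"),
    ("s1", "2010-07-01 10:03:10", "book"),
    ("s3", "2010-07-01 10:03:15", "search"),
]

def ts(s):
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

# Arrival rate per minute, and a simple peak-to-mean burstiness index.
per_minute = Counter(ts(t).replace(second=0) for _, t, _ in log)
rates = list(per_minute.values())
burstiness = max(rates) / (sum(rates) / len(rates))

# Session length = time between a session's first and last request.
sessions = {}
for sid, t, _ in log:
    first, last = sessions.get(sid, (ts(t), ts(t)))
    sessions[sid] = (min(first, ts(t)), max(last, ts(t)))
lengths = {sid: (b - a).total_seconds() for sid, (a, b) in sessions.items()}
print(burstiness, lengths)
```

On real logs one would bucket by request type and hour of day as well, which is exactly the fine-grained differentiation the paper performs.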
Citations: 30
Toward a more accurate understanding of the limits of the TLS execution paradigm
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5649169
Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Craig Pocock, Gavin Brown, M. Luján, I. Watson, Marcelo H. Cintra
Thread-Level Speculation (TLS) facilitates the extraction of parallel threads from sequential applications. Most prior work has focused on developing the compiler and architecture for this execution paradigm. Such studies often concentrated narrowly on a specific design point. On the other hand, other studies have attempted to assess how well TLS performs if some architectural/compiler constraint is relaxed. Unfortunately, such previous studies have failed to truly assess the performance potential of TLS, because they have been bound to some specific TLS architecture and have ignored one or another important TLS design choice, such as support for out-of-order task spawn or support for intermediate checkpointing.
Citations: 19
Benchmark synthesis for architecture and compiler exploration
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5650208
Luk Van Ertvelde, L. Eeckhout
This paper presents a novel benchmark synthesis framework with three key features. First, it generates synthetic benchmarks in a high-level programming language (C in our case), in contrast to prior work in benchmark synthesis, which generates synthetic benchmarks in assembly. Second, the synthetic benchmarks hide proprietary information from the original workloads they are built after. Hence, companies may want to distribute synthetic benchmark clones to third parties as proxies for their proprietary codes; third parties can then optimize the target system without having access to the original codes. Third, the synthetic benchmarks are shorter-running than the original workloads they are modeled after, yet they are representative. In summary, the proposed framework generates small (thus quick to simulate) and representative benchmarks that can serve as proxies for other workloads without revealing proprietary information; and because the benchmarks are generated in a high-level programming language, they can be used to explore both the architecture and compiler spaces. The results obtained with our initial framework are promising. We demonstrate that we can generate synthetic proxy benchmarks for the MiBench benchmarks, and we show that they are representative across a range of machines with different instruction-set architectures, microarchitectures, compilers and optimization levels, while being, on average, 30 times shorter-running. We also verify, using software plagiarism detection tools, that the synthetic benchmark clones hide proprietary information from the original workloads.
Citations: 27
Analyzing and scaling parallelism for network routing protocols
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5650317
A. Dhanotia, Sabina Grover, G. Byrd
The serial nature of legacy code in routing protocol implementations has inhibited a shift to multicore processing in the control plane, even though there is much inherent parallelism. In this paper, we investigate the use of multicore as the compute platform for routing applications using BGP, the ubiquitous protocol for routing in the Internet backbone, as a representative application. We develop a scalable multithreaded implementation for BGP and evaluate its performance on several multicore configurations using a fully configurable multicore simulation environment. We implement several optimizations at the software and architecture levels, achieving a speedup of 6.5 times over the sequential implementation, which translates to a throughput of ∼170K updates per second. Subsequently, we propose a generic architecture and parallelization methodology which can be applied to all routing protocol implementations to achieve significant performance improvement.
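The generic parallelization methodology the abstract alludes to can be illustrated with a toy update-sharding scheme: route updates are hashed by destination prefix onto worker threads, so each worker exclusively owns a slice of the routing table and per-shard FIFO order keeps updates to the same prefix consistent. The sharding function, queue protocol and update tuples below are assumptions for the sketch, not the paper's actual BGP implementation.

```python
import queue
import threading

NUM_WORKERS = 4
rib_shards = [dict() for _ in range(NUM_WORKERS)]   # per-worker routing tables
queues = [queue.Queue() for _ in range(NUM_WORKERS)]

def shard(prefix):
    # Same prefix always maps to the same worker, preserving update order.
    return hash(prefix) % NUM_WORKERS

def worker(i):
    while True:
        item = queues[i].get()
        if item is None:                     # poison pill: shut down
            return
        prefix, next_hop = item
        rib_shards[i][prefix] = next_hop     # only the owner touches this shard

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()

updates = [("10.0.0.0/8", "r1"), ("192.168.0.0/16", "r2"), ("10.0.0.0/8", "r3")]
for prefix, nh in updates:
    queues[shard(prefix)].put((prefix, nh))
for q in queues:
    q.put(None)
for t in threads:
    t.join()

rib = {k: v for s in rib_shards for k, v in s.items()}
print(rib)   # the later update for 10.0.0.0/8 ("r3") wins
```

Because no two workers ever write the same shard, the per-shard tables need no locks — the same property that makes the scheme scale on real multicore control planes.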
Citations: 1
Characterizing datasets for data deduplication in backup applications
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5650369
Nohhyun Park, D. Lilja
The compression and throughput performance of a data deduplication system are directly affected by the input dataset. We propose two sets of evaluation metrics, and the means to extract those metrics, for deduplication systems. The first set of metrics represents how the composition of segments changes within the deduplication system over five full backups. This in turn allows more insight into how the compression ratio will change as data accumulate. The second set of metrics represents index table fragmentation caused by duplicate elimination, and the arrival rate at the underlying storage system. We show that, while shorter sequences of unique data may be bad for index caching, they provide a more uniform arrival rate, which improves the overall throughput. Finally, we compute the metrics derived from the datasets under evaluation and show how the datasets perform with different metrics. Our evaluation shows that backup datasets typically exhibit patterns in how they change over time and that these patterns are quantifiable in terms of how they affect the deduplication process. This quantification allows us to: 1) decide whether deduplication is applicable, 2) provision resources, 3) tune the data deduplication parameters, and 4) potentially decide which portion of the dataset is best suited for deduplication.
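The mechanics behind these metrics — segment the stream, fingerprint each segment, store only unseen fingerprints — fit in a short sketch. The tiny segment size, synthetic backups, and the run-length accounting for sequences of unique segments below are illustrative assumptions, not the paper's datasets or parameters.

```python
import hashlib

SEG = 4  # unrealistically small segment size, for illustration only

def segments(data):
    return [data[i:i + SEG] for i in range(0, len(data), SEG)]

def dedupe(data, index):
    """Store unseen segments; track run lengths of consecutive unique data."""
    stored, run, runs = 0, 0, []
    for seg in segments(data):
        fp = hashlib.sha1(seg).hexdigest()
        if fp in index:
            if run:
                runs.append(run)   # a run of unique segments just ended
                run = 0
        else:
            index.add(fp)
            stored += len(seg)
            run += 1
    if run:
        runs.append(run)
    return stored, runs

index = set()
backup1 = b"AAAABBBBCCCCAAAA"   # four segments, one repeated within the backup
stored1, runs1 = dedupe(backup1, index)
backup2 = b"AAAABBBBDDDDEEEE"   # two segments already seen in backup1
stored2, runs2 = dedupe(backup2, index)

ratio = (len(backup1) + len(backup2)) / (stored1 + stored2)
print(stored1, stored2, ratio, runs2)
```

Tracking the run lengths across successive backups is one concrete way to quantify the uniform-arrival-rate effect the abstract describes: many short unique runs stress the index but smooth out writes to the underlying storage.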
Citations: 49
Performance of multi-process and multi-thread processing on multi-core SMT processors
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5650174
H. Inoue, T. Nakatani
Many modern high-performance processors support multiple hardware threads in the form of multiple cores and SMT (Simultaneous Multi-Threading). Hence, achieving good performance scalability of programs with respect to the number of cores (core scalability) and the number of SMT threads in one core (SMT scalability) is critical. To identify a way to achieve higher performance on multi-core SMT processors, this paper compares the performance scalability of two parallelization models (using multiple processes, and using multiple threads in one process) on two types of hardware parallelism (core scalability and SMT scalability). We tested standard Java benchmarks and a real-world server program written in PHP on two platforms, Sun's UltraSPARC T1 (Niagara) processor and Intel's Xeon (Nehalem) processor. We show that the multi-thread model achieves better SMT scalability than the multi-process model by reducing the number of cache misses and DTLB misses. However, both models achieve roughly equal core scalability. We show that the multi-thread model generates up to 7.4 times more DTLB misses than the multi-process model when multiple cores are used. To take advantage of both models, we implemented a memory allocator for a PHP runtime to reduce DTLB misses on multi-core SMT processors. The allocator is aware of the core that is running each software thread and allocates memory blocks from the same memory page for each processor core.
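The intuition behind the core-aware allocator can be shown with a toy model rather than a real allocator: if every core bump-allocates from one shared pool, each core's objects end up scattered across all pages (each page needing a TLB entry on every core), whereas per-core pools pack a core's objects onto its own pages. The page size, allocation counts, and round-robin request pattern below are illustrative assumptions, not measurements from the paper.

```python
PAGE, CORES, ALLOCS, BLOCK = 4096, 4, 64, 256

def pages_per_core(per_core_pools):
    """Simulate bump allocation; return the max pages any one core touches."""
    cursors = {}
    touched = {c: set() for c in range(CORES)}
    for i in range(ALLOCS):
        core = i % CORES                      # allocation requests round-robin
        pool = core if per_core_pools else 0  # core-aware vs. shared pool
        addr = cursors.get(pool, 0)
        cursors[pool] = addr + BLOCK
        touched[core].add((pool, addr // PAGE))
    return max(len(pages) for pages in touched.values())

shared = pages_per_core(False)      # cores' data interleaved across pages
core_aware = pages_per_core(True)   # each core's data packed on its own pages
print(shared, core_aware)           # shared pool touches 4x the pages per core
```

Fewer distinct pages per core means fewer DTLB entries in play, which is the effect the paper measures as a 46.7% reduction in DTLB misses.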
{"title":"Performance of multi-process and multi-thread processing on multi-core SMT processors","authors":"H. Inoue, T. Nakatani","doi":"10.1109/IISWC.2010.5650174","DOIUrl":"https://doi.org/10.1109/IISWC.2010.5650174","url":null,"abstract":"Many modern high-performance processors support multiple hardware threads in the form of multiple cores and SMT (Simultaneous Multi-Threading). Hence achieving good performance scalability of programs with respect to the numbers of cores (core scalability) and SMT threads in one core (SMT scalability) is critical. To identify a way to achieve higher performance on the multi-core SMT processors, this paper compares the performance scalability with two parallelization models (using multiple processes and using multiple threads in one process) on two types of hardware parallelism (core scalability and SMT scalability). We tested standard Java benchmarks and a real-world server program written in PHP on two platforms, Sun's UltraSPARC T1 (Niagara) processor and Intel's Xeon (Nehalem) processor. We show that the multi-thread model achieves better SMT scalability compared to the multi-process model by reducing the number of cache misses and DTLB misses. However both models achieve roughly equal core scalability. We show that the multi-thread model generates up to 7.4 times more DTLB misses than the multi-process model when multiple cores are used. To take advantage of the both models, we implemented a memory allocator for a PHP runtime to reduce DTLB misses on multi-core SMT processors. The allocator is aware of the core that is running each software thread and allocates memory blocks from same memory page for each processor core. 
When using all of the hardware threads on a Niagara, the core-aware allocator reduces the DTLB misses by 46.7% compared to the default allocator, and it improves the performance by 3.0%.","PeriodicalId":107589,"journal":{"name":"IEEE International Symposium on Workload Characterization (IISWC'10)","volume":"s3-50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130239289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
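The core-aware allocation idea described in the abstract can be sketched as a toy model (illustrative Python, not the authors' PHP-runtime implementation; the page and block sizes are assumed): each core draws blocks only from its own pages, so a core's allocations span fewer distinct pages and put less pressure on the DTLB.

```python
# Toy model of a core-aware allocator: one page pool per core, so blocks
# handed to a core come from as few distinct pages as possible.
PAGE_SIZE = 4096          # assumed page size
BLOCK = 64                # assumed block size
BLOCKS_PER_PAGE = PAGE_SIZE // BLOCK

class CoreAwareAllocator:
    def __init__(self, num_cores):
        self.pools = [[] for _ in range(num_cores)]        # free blocks per core
        self.next_page = 0                                 # next fresh page id
        self.pages_touched = [set() for _ in range(num_cores)]

    def alloc(self, core):
        pool = self.pools[core]
        if not pool:
            page = self.next_page                          # carve a fresh page
            self.next_page += 1                            # owned by this core only
            pool.extend((page, i) for i in range(BLOCKS_PER_PAGE))
        page, idx = pool.pop()
        self.pages_touched[core].add(page)
        return page, idx

a = CoreAwareAllocator(num_cores=4)
for n in range(4 * BLOCKS_PER_PAGE):
    a.alloc(core=n % 4)   # round-robin allocations across 4 cores

# Despite interleaved requests, each core's blocks stay within one page:
print([len(s) for s in a.pages_touched])  # -> [1, 1, 1, 1]
```

A naive shared pool that hands out interleaved blocks would instead have every core touching every page, multiplying the number of TLB entries each core needs.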
Citations: 18
A limit study of JavaScript parallelism
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5649419
Emily Fortuna, O. Anderson, L. Ceze, S. Eggers
JavaScript is ubiquitous on the web. At the same time, the language's dynamic behavior makes optimizations challenging, leading to poor performance. In this paper we conduct a limit study on the potential parallelism of JavaScript applications, including popular web pages and standard JavaScript benchmarks. We examine dependency types and looping behavior to better understand the potential for JavaScript parallelization. Our results show that the potential speedup is very encouraging— averaging 8.9x and as high as 45.5x. Parallelizing functions themselves, rather than just loop bodies proves to be more fruitful in increasing JavaScript execution speed. The results also indicate in our JavaScript engine, most of the dependencies manifest via virtual registers rather than hash table lookups.
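The best-case calculation at the heart of a limit study can be illustrated in a few lines (a hedged sketch over a made-up dependency trace, not the paper's methodology or engine data): with unit-cost operations, total work divided by the critical-path length bounds the achievable speedup.

```python
# Limit-study sketch: potential speedup = total work / critical path length,
# where the critical path is the longest chain in the dependency DAG.
def potential_speedup(deps):
    """deps: {op: [ops it depends on]}; each op is assumed to cost 1 unit."""
    depth = {}
    def path(op):  # longest dependency chain ending at op
        if op not in depth:
            depth[op] = 1 + max((path(d) for d in deps[op]), default=0)
        return depth[op]
    total_work = len(deps)
    critical_path = max(path(op) for op in deps)
    return total_work / critical_path

trace = {  # a, b, c are independent; d joins a and b; e depends on d
    "a": [], "b": [], "c": [], "d": ["a", "b"], "e": ["d"],
}
print(potential_speedup(trace))  # 5 units of work / critical path of 3 -> ~1.67
```

Scaling this kind of analysis to real dependency traces (virtual registers, hash-table lookups) is what lets the paper report per-application speedup ceilings such as the 8.9x average above.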
Citations: 42
Analysis on semantic transactional memory footprint for hardware transactional memory
Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5649529
Jaewoong Chung, Dhruva R. Chakrabarti, C. Minh
We analyze various characteristics of semantic transactional memory footprint (STMF) that consists of only the memory accesses the underlying hardware transactional memory (HTM) system has to manage for the correct execution of transactional programs. Our analysis shows that STMF can be significantly smaller than declarative transactional memory footprint (DTMF) that contains all memory accesses within transaction boundaries (i.e., only 8.3% of DTMF in the applications examined). This result encourages processor designers and software toolchain developers to explore new design points for low-cost HTM systems and intelligent software toolchains to find and leverage STMF efficiently. We identify seven code patterns that belong to DTMF, but not to STMF, and show that they take up 91.7% of all memory accesses in transactional boundaries, on average, for the transactional programs examined. A new instruction prefix is proposed to express STMF efficiently, and the existing compiler techniques are examined to check their applicability to deduce STMF from DTMF. Our trace analysis shows that using STMF significantly reduces the ratio of transactions overflowing a 32KB L1 cache, from 12.80% to 2.00%, and substantially lowers the false positive probability of Bloom filters used for transaction signature management, from 23.60% to less than 0.001%. The simulation result shows that the STAMP applications with the STMF expression run 40% faster on average than those with the DTMF expression.
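The Bloom-filter claim can be sanity-checked with the standard false-positive estimate (the signature size and hash count below are assumed for illustration, not the paper's hardware parameters): shrinking the tracked set from all accesses to the roughly 8.3% semantic subset drives the false-positive probability down sharply.

```python
# Standard Bloom-filter false-positive estimate for a filter of m bits,
# k hash functions, and n inserted elements (here, tracked addresses).
def bloom_fp(m, k, n):
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k

m, k = 2048, 4                             # 2-Kbit signature, 4 hashes (assumed)
full_footprint = bloom_fp(m, k, n=1000)    # track every access (DTMF-like)
semantic_only = bloom_fp(m, k, n=83)       # ~8.3% of those accesses (STMF-like)

# The semantic footprint's false-positive rate is orders of magnitude lower.
print(f"full: {full_footprint:.4f}  semantic: {semantic_only:.6f}")
```

With these illustrative numbers the full footprint saturates the signature (tens of percent false positives) while the semantic subset stays below 0.1%, the same qualitative gap as the paper's 23.60% versus under 0.001%.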
Citations: 1
Journal
IEEE International Symposium on Workload Characterization (IISWC'10)