
IEEE International Symposium on Performance Analysis of Systems and Software, 2005 (ISPASS 2005): Latest Publications

Motivation for Variable Length Intervals and Hierarchical Phase Behavior
Jeremy Lau, Erez Perelman, Greg Hamerly, T. Sherwood, B. Calder
Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program's execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior techniques focus on fixed length intervals (such as a hundred million instructions) to find phase behavior. Fixed length intervals can make a program's periodic phase behavior difficult to find, because the fixed interval length can be out of sync with the period of the program's actual phase behavior. In addition, a fixed interval length can only express one level of phase behavior. In this paper, we graphically show that there exists a hierarchy of phase behavior in programs and motivate the need for variable length intervals. We describe the changes applied to SimPoint to support variable length intervals. We finally conclude by providing an initial study into using variable length intervals to guide SimPoint
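The abstract describes grouping execution intervals into phases by clustering their behavior vectors, as SimPoint does with basic block vectors. The sketch below is only an illustration of that clustering step, not the authors' tool: it runs plain k-means over hypothetical per-interval frequency vectors and picks one representative interval per phase; the data, the number of phases, and the feature dimensions are all made-up assumptions.

```python
# Minimal sketch of SimPoint-style phase clustering (illustrative, not the authors' code).
# Each execution interval is summarized by a normalized basic-block frequency vector;
# k-means groups similar intervals into phases, and one representative per phase is chosen.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 intervals, each a 32-dimensional basic-block frequency vector.
intervals = rng.random((200, 32))
intervals /= intervals.sum(axis=1, keepdims=True)   # normalize each interval's vector

def kmeans(x, k, iters=50):
    """Plain k-means: returns (labels, centroids)."""
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = x[labels == c].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(intervals, k=4)

# Pick, for each phase, the interval closest to its centroid as the simulation point.
for phase in range(4):
    members = np.flatnonzero(labels == phase)
    rep = members[np.argmin(np.linalg.norm(intervals[members] - centroids[phase], axis=1))]
    print(f"phase {phase}: {len(members)} intervals, representative interval {rep}")
```

With variable length intervals, the vectors above would simply summarize differently sized slices of execution; the clustering step itself is unchanged.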
{"title":"Motivation for Variable Length Intervals and Hierarchical Phase Behavior","authors":"Jeremy Lau, Erez Perelman, Greg Hamerly, T. Sherwood, B. Calder","doi":"10.1109/ISPASS.2005.1430568","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430568","url":null,"abstract":"Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program's execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior techniques focus on fixed length intervals (such as a hundred million instructions) to find phase behavior. Fixed length intervals can make a program's periodic phase behavior difficult to find, because the fixed interval length can be out of sync with the period of the program's actual phase behavior. In addition, a fixed interval length can only express one level of phase behavior. In this paper, we graphically show that there exists a hierarchy of phase behavior in programs and motivate the need for variable length intervals. We describe the changes applied to SimPoint to support variable length intervals. We finally conclude by providing an initial study into using variable length intervals to guide SimPoint","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121211676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 91
A High Performance, Energy Efficient GALS Processor Microarchitecture with Reduced Implementation Complexity
Yongkang Zhu, D. Albonesi, A. Buyuktosunoglu
As the costs and challenges of global clock distribution grow with each new microprocessor generation, a globally asynchronous, locally synchronous (GALS) approach becomes an attractive alternative. One proposed GALS approach, called a multiple clock domain (MCD) processor, achieves impressive energy savings for a relatively low performance cost. However, the approach requires separating the processor into four domains, including separating the integer and memory domains which complicates load scheduling, and the implementation of 32 voltage and frequency levels in each domain. In addition, the hardware-based control algorithm, though effective overall, produces a significant performance degradation for some applications. In this paper, we devise modifications to the MCD design that retain many of its benefits while greatly reducing the implementation complexity. We first determine that the synchronization channels that are most responsible for the MCD performance degradation are those involving cache access, and propose merging the integer and memory domains to virtually eliminate this overhead. We further propose significantly reducing the number of voltage levels, separating the reorder buffer into its own domain to permit front-end frequency scaling, separating the L2 cache to permit standard power optimizations to be used, and a new online algorithm that provides consistent results across our benchmark suite. The overall result is a significant reduction in the performance degradation of the original MCD approach and greater energy savings, with a greatly simplified microarchitecture that is much easier to implement
{"title":"A High Performance, Energy Efficient GALS ProcessorMicroarchitecture with Reduced Implementation Complexity","authors":"Yongkang Zhu, D. Albonesi, A. Buyuktosunoglu","doi":"10.1109/ISPASS.2005.1430558","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430558","url":null,"abstract":"As the costs and challenges of global clock distribution grow with each new microprocessor generation, a globally asynchronous, locally synchronous (GALS) approach becomes an attractive alternative. One proposed GALS approach, called a multiple clock domain (MCD) processor, achieves impressive energy savings for a relatively low performance cost. However, the approach requires separating the processor into four domains, including separating the integer and memory domains which complicates load scheduling, and the implementation of 32 voltage and frequency levels in each domain. In addition, the hardware-based control algorithm, though effective overall, produces a significant performance degradation for some applications. In this paper, we devise modifications to the MCD design that retain many of its benefits while greatly reducing the implementation complexity. We first determine that the synchronization channels that are most responsible for the MCD performance degradation are those involving cache access, and propose merging the integer and memory domains to virtually eliminate this overhead. We further propose significantly reducing the number of voltage levels, separating the reorder buffer into its own domain to permit front-end frequency scaling, separating the L2 cache to permit standard power optimizations to be used, and a new online algorithm that provides consistent results across our benchmark suite. The overall result is a significant reduction in the performance degradation of the original MCD approach and greater energy savings, with a greatly simplified microarchitecture that is much easier to implement","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115230965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Measuring Program Similarity: Experiments with SPEC CPU Benchmark Suites
Aashish Phansalkar, A. Joshi, L. Eeckhout, L. John
It is essential that a subset of benchmark programs used to evaluate an architectural enhancement is well distributed within the target workload space rather than clustered in specific areas. Past efforts for identifying subsets have primarily relied on using microarchitecture-dependent metrics of program performance, such as cycles per instruction and cache miss-rate. The shortcoming of this technique is that the results could be biased by the idiosyncrasies of the chosen configurations. The objective of this paper is to present a methodology to measure similarity of programs based on their inherent microarchitecture-independent characteristics which will make the results applicable to any microarchitecture. We apply our methodology to the SPEC CPU2000 benchmark suite and demonstrate that a subset of 8 programs can be used to effectively represent the entire suite. We validate the usefulness of this subset by using it to estimate the average IPC and L1 data cache miss-rate of the entire suite. The average IPC of 8-way and 16-way issue superscalar processor configurations could be estimated with 3.9% and 4.4% error, respectively. This methodology is applicable not only to find subsets from a benchmark suite, but also to identify programs for a benchmark suite from a list of potential candidates. Studying the four generations of SPEC CPU benchmark suites, we find that other than a dramatic increase in the dynamic instruction count and increasingly poor temporal data locality, the inherent program characteristics have more or less remained the same.
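The paper's methodology applies PCA and clustering to microarchitecture-independent program features and then picks representatives. As a rough flavor of the subsetting goal only, the sketch below uses a simpler greedy farthest-point selection over standardized features; the feature matrix, program names, and subset size are hypothetical and the technique is a stand-in, not the paper's method.

```python
# Illustrative sketch: choosing a representative benchmark subset from
# microarchitecture-independent features (ILP, instruction mix, locality metrics, ...).
# The paper uses PCA + clustering; this greedy farthest-point selection only
# demonstrates the goal of spreading the chosen subset across the workload space.
import numpy as np

rng = np.random.default_rng(1)

programs = [f"bench{i:02d}" for i in range(26)]   # hypothetical suite of 26 programs
features = rng.random((26, 10))                   # hypothetical feature matrix

# Standardize each feature so no single metric dominates the distance.
z = (features - features.mean(axis=0)) / features.std(axis=0)

def pick_subset(x, k):
    chosen = [0]                                   # start from an arbitrary program
    while len(chosen) < k:
        d = np.min(np.linalg.norm(x[:, None, :] - x[chosen][None, :, :], axis=2), axis=1)
        chosen.append(int(d.argmax()))             # add the program farthest from the current subset
    return chosen

subset = pick_subset(z, k=8)
print("representative subset:", [programs[i] for i in subset])
```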
{"title":"Measuring Program Similarity: Experiments with SPEC CPU Benchmark Suites","authors":"Aashish Phansalkar, A. Joshi, L. Eeckhout, L. John","doi":"10.1109/ISPASS.2005.1430555","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430555","url":null,"abstract":"It is essential that a subset of benchmark programs used to evaluate an architectural enhancement, is well distributed within the target workload space rather than clustered in specific areas. Past efforts for identifying subsets have primarily relied on using microarchitecture-dependent metrics of program performance, such as cycles per instruction and cache miss-rate. The shortcoming of this technique is that the results could be biased by the idiosyncrasies of the chosen configurations. The objective of this paper is to present a methodology to measure similarity of programs based on their inherent microarchitecture-independent characteristics which will make the results applicable to any microarchitecture. We apply our methodology to the SPEC CPU2000 benchmark suite and demonstrate that a subset of 8 programs can be used to effectively represent the entire suite. We validate the usefulness of this subset by using it to estimate the average IPC and L1 data cache miss-rate of the entire suite. The average IPC of 8-way and 16-way issue superscalar processor configurations could be estimated with 3.9% and 4.4% error respectively. This methodology is applicable not only to find subsets from a benchmark suite, but also to identify programs for a benchmark suite from a list of potential candidates. Studying the four generations of SPEC CPU benchmark suites, we find that other than a dramatic increase in the dynamic instruction count and increasingly poor temporal data locality, the inherent program characteristics have more or less remained the same","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122693928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 172
Accelerating Multiprocessor Simulation with a Memory Timestamp Record
K. Barr, Heidi Pan, Michael Zhang, K. Asanović
We introduce a fast and accurate technique for initializing the directory and cache state of a multiprocessor system based on a novel software structure called the memory timestamp record (MTR). The MTR is a versatile, compressed snapshot of memory reference patterns which can be rapidly updated during fast-forwarded simulation, or stored as part of a checkpoint. We evaluate MTR using a full-system simulation of a directory-based cache-coherent multiprocessor running a range of multithreaded workloads. Both MTR and a multiprocessor version of functional fast-forwarding (FFW) make similar performance estimates, usually within 15% of our detailed model. In addition to other benefits, we show that MTR has up to a 1.45x speedup over FFW, and a 7.7x speedup over our detailed baseline
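To convey the data-structure idea only, here is a rough sketch of a memory-timestamp-record-like snapshot: per-block write timestamps and per-CPU read timestamps, from which a simplified coherence state is reconstructed when detailed simulation begins. The class, the reconstruction rule, and the example trace are assumptions made for illustration; the real MTR must also account for cache capacity, eviction, and associativity.

```python
# Rough sketch of a memory-timestamp-record-like structure (not the authors' implementation).
# For every block we track the last write (time, cpu) and each cpu's last read time; when the
# detailed simulator starts, a simplified coherence state is reconstructed from those timestamps.
# Capacity, eviction, and associativity effects are ignored here.
from collections import defaultdict

class MemoryTimestampRecord:
    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        self.last_write = {}                                      # block -> (time, cpu)
        self.last_read = defaultdict(lambda: [None] * num_cpus)   # block -> per-cpu read time

    def record_read(self, time, cpu, block):
        self.last_read[block][cpu] = time

    def record_write(self, time, cpu, block):
        self.last_write[block] = (time, cpu)

    def reconstruct(self, block):
        """Return a simplified coherence state for one block."""
        wtime, writer = self.last_write.get(block, (None, None))
        reads = self.last_read.get(block, [None] * self.num_cpus)
        sharers = [c for c, t in enumerate(reads)
                   if t is not None and (wtime is None or t > wtime)]
        if sharers:
            return ("SHARED", sharers)
        if writer is not None:
            return ("MODIFIED", [writer])
        return ("INVALID", [])

mtr = MemoryTimestampRecord(num_cpus=4)
mtr.record_write(time=10, cpu=0, block=0x40)
mtr.record_read(time=15, cpu=2, block=0x40)
print(mtr.reconstruct(0x40))   # -> ('SHARED', [2])
```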
{"title":"Accelerating Multiprocessor Simulation with a Memory Timestamp Record","authors":"K. Barr, Heidi Pan, Michael Zhang, K. Asanović","doi":"10.1109/ISPASS.2005.1430560","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430560","url":null,"abstract":"We introduce a fast and accurate technique for initializing the directory and cache state of a multiprocessor system based on a novel software structure called the memory timestamp record (MTR). The MTR is a versatile, compressed snapshot of memory reference patterns which can be rapidly updated during fast-forwarded simulation, or stored as part of a checkpoint. We evaluate MTR using a full-system simulation of a directory-based cache-coherent multiprocessor running a range of multithreaded workloads. Both MTR and a multiprocessor version of functional fast-forwarding (FFW) make similar performance estimates, usually within 15% of our detailed model. In addition to other benefits, we show that MTR has up to a 1.45x speedup over FFW, and a 7.7x speedup over our detailed baseline","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123611723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 50
Anatomy and Performance of SSL Processing
Li Zhao, R. Iyer, S. Makineni, L. Bhuyan
A wide spectrum of e-commerce (B2B/B2C), banking, financial trading and other business applications require the exchange of data to be highly secure. The Secure Sockets Layer (SSL) protocol provides the essential ingredients of secure communications - privacy, integrity and authentication. Though it is well-understood that security always comes at the cost of performance, these costs depend on the cryptographic algorithms. In this paper, we present a detailed description of the anatomy of a secure session. We analyze the time spent on the various cryptographic operations (symmetric, asymmetric and hashing) during the session negotiation and data transfer. We then analyze the most frequently used cryptographic algorithms (RSA, AES, DES, 3DES, RC4, MD5 and SHA-1). We determine the key components of these algorithms (setting up key schedules, encryption rounds, substitutions, permutations, etc) and determine where most of the time is spent. We also provide an architectural analysis of these algorithms, show the frequently executed instructions and discuss the ISA/hardware support that may be beneficial to improving SSL performance. We believe that the performance data presented in this paper is useful to performance analysts and processor architects to help accelerate SSL performance in future processors
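The abstract breaks SSL session time down by cryptographic primitive. As a small taste of that kind of per-primitive measurement, the sketch below times two of the hash algorithms mentioned (MD5 and SHA-1) using Python's hashlib; the buffer size and iteration count are arbitrary, and the paper itself instruments real SSL sessions and processor behavior rather than a scripting runtime.

```python
# Tiny illustrative timing harness for two of the hash algorithms discussed (MD5, SHA-1).
# This only conveys the flavor of per-primitive cost breakdowns, not the paper's methodology.
import hashlib
import time

payload = b"\x00" * (16 * 1024)   # arbitrary 16 KB record-sized buffer
iterations = 2000

for name in ("md5", "sha1"):
    h = hashlib.new(name)
    start = time.perf_counter()
    for _ in range(iterations):
        h.update(payload)
    elapsed = time.perf_counter() - start
    mb = iterations * len(payload) / (1024 * 1024)
    print(f"{name:5s} {mb / elapsed:8.1f} MB/s")
```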
{"title":"Anatomy and Performance of SSL Processing","authors":"Li Zhao, R. Iyer, S. Makineni, L. Bhuyan","doi":"10.1109/ISPASS.2005.1430574","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430574","url":null,"abstract":"A wide spectrum of e-commerce (B2B/B2C), banking, financial trading and other business applications require the exchange of data to be highly secure. The Secure Sockets Layer (SSL) protocol provides the essential ingredients of secure communications - privacy, integrity and authentication. Though it is well-understood that security always comes at the cost of performance, these costs depend on the cryptographic algorithms. In this paper, we present a detailed description of the anatomy of a secure session. We analyze the time spent on the various cryptographic operations (symmetric, asymmetric and hashing) during the session negotiation and data transfer. We then analyze the most frequently used cryptographic algorithms (RSA, AES, DES, 3DES, RC4, MD5 and SHA-1). We determine the key components of these algorithms (setting up key schedules, encryption rounds, substitutions, permutations, etc) and determine where most of the time is spent. We also provide an architectural analysis of these algorithms, show the frequently executed instructions and discuss the ISA/hardware support that may be beneficial to improving SSL performance. We believe that the performance data presented in this paper is useful to performance analysts and processor architects to help accelerate SSL performance in future processors","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115729967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 70
Dataflow: A Complement to Superscalar
M. Budiu, Pedro V. Artigas, S. Goldstein
There has been a resurgence of interest in dataflow architectures, because of their potential for exploiting parallelism with low overhead. In this paper we analyze the performance of a class of static dataflow machines on integer media and control-intensive programs and we explain why a dataflow machine, even with unlimited resources, does not always outperform a superscalar processor on general-purpose codes, under the assumption that both machines take the same time to execute basic operations. We compare a program-specific dataflow machine with unlimited parallelism to a superscalar processor running the same program. While the dataflow machines provide very good performance on most data-parallel programs, we show that the dataflow machine cannot always take advantage of the available parallelism. Using the dynamic critical path we investigate the mechanisms used by superscalar processors to provide a performance advantage and their impact on a dataflow model
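The analysis in the abstract relies on the dynamic critical path of the executed dataflow graph: with unlimited resources, execution time is bounded by the longest latency-weighted chain of dependent operations. A minimal sketch of that computation follows; the example graph and latencies are made up for illustration.

```python
# Minimal sketch: length of the critical path through a dataflow DAG.
# Each operation finishes only after all of its producers; the critical path is the
# latest finish time assuming unlimited execution resources (the pure dataflow limit).
# The graph and latencies below are illustrative.
from functools import lru_cache

latency = {"ld_a": 3, "ld_b": 3, "add": 1, "mul": 4, "st": 2}
producers = {                 # op -> operations whose results it consumes
    "ld_a": [], "ld_b": [],
    "add": ["ld_a", "ld_b"],
    "mul": ["add", "ld_b"],
    "st":  ["mul"],
}

@lru_cache(maxsize=None)
def finish(op):
    deps = producers[op]
    return latency[op] + (max(finish(d) for d in deps) if deps else 0)

critical_path = max(finish(op) for op in latency)
print("dataflow critical path:", critical_path, "cycles")   # -> 10
```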
{"title":"Dataflow: A Complement to Superscalar","authors":"M. Budiu, Pedro V. Artigas, S. Goldstein","doi":"10.1109/ISPASS.2005.1430572","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430572","url":null,"abstract":"There has been a resurgence of interest in dataflow architectures, because of their potential for exploiting parallelism with low overhead. In this paper we analyze the performance of a class of static dataflow machines on integer media and control-intensive programs and we explain why a dataflow machine, even with unlimited resources, does not always outperform a superscalar processor on general-purpose codes, under the assumption that both machines take the same time to execute basic operations. We compare a program-specific dataflow machine with unlimited parallelism to a superscalar processor running the same program. While the dataflow machines provide very good performance on most data-parallel programs, we show that the dataflow machine cannot always take advantage of the available parallelism. Using the dynamic critical path we investigate the mechanisms used by superscalar processors to provide a performance advantage and their impact on a dataflow model","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115266577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 44
BioBench: A Benchmark Suite of Bioinformatics Applications
K. Albayraktaroglu, A. Jaleel, Xue Wu, Manoj Franklin, Bruce Jacob, C. Tseng, Donald Yeung
Recent advances in bioinformatics and the significant increase in computational power available to researchers have made it possible to make better use of the vast amounts of genetic data that has been collected over the last two decades. As the uses of genetic data expand to include drug discovery and development of gene-based therapies, bioinformatics is destined to take its place in the forefront of scientific computing application domains. Despite the clear importance of this field, common bioinformatics applications and their implication on microarchitectural design have received scant attention from the computer architecture community so far. The availability of a common set of bioinformatics benchmarks could be the first step to motivate further research in this crucial area. To this end, this paper presents BioBench, a benchmark suite that represents a diverse set of bioinformatics applications. The first version of BioBench includes applications from different application domains, with a particular emphasis on mature genomics applications. The applications in the benchmark are described briefly, and basic execution characteristics obtained on a real processor are presented. Compared to SPEC INT and SPEC FP benchmarks, applications in BioBench display a higher percentage of load/store instructions, almost negligible floating-point operation content, and higher IPC than either SPEC INT or SPEC FP applications. Our evaluation suggests that bioinformatics applications have distinctly different characteristics from the applications in both of the mentioned SPEC suites; and our findings indicate that bioinformatics workloads can benefit from architectural improvements to memory bandwidth and techniques that exploit their high levels of ILP. The entire BioBench suite and accompanying reference data will be made freely available to researchers
{"title":"BioBench: A Benchmark Suite of Bioinformatics Applications","authors":"K. Albayraktaroglu, A. Jaleel, Xue Wu, Manoj Franklin, Bruce Jacob, C. Tseng, Donald Yeung","doi":"10.1109/ISPASS.2005.1430554","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430554","url":null,"abstract":"Recent advances in bioinformatics and the significant increase in computational power available to researchers have made it possible to make better use of the vast amounts of genetic data that has been collected over the last two decades. As the uses of genetic data expand to include drug discovery and development of gene-based therapies, bioinformatics is destined to take its place in the forefront of scientific computing application domains. Despite the clear importance of this field, common bioinformatics applications and their implication on microarchitectural design have received scant attention from the computer architecture community so far. The availability of a common set of bioinformatics benchmarks could be the first step to motivate further research in this crucial area. To this end, this paper presents BioBench, a benchmark suite that represents a diverse set of bioinformatics applications. The first version of BioBench includes applications from different application domains, with a particular emphasis on mature genomics applications. The applications in the benchmark are described briefly, and basic execution characteristics obtained on a real processor are presented. Compared to SPEC INT and SPEC FP benchmarks, applications in BioBench display a higher percentage of load/store instructions, almost negligible floating-point operation content, and higher IPC than either SPEC INT or SPEC FP applications. Our evaluation suggests that bioinformatics applications have distinctly different characteristics from the applications in both of the mentioned SPEC suites; and our findings indicate that bioinformatics workloads can benefit from architectural improvements to memory bandwidth and techniques that exploit their high levels of ILP. The entire BioBench suite and accompanying reference data will be made freely available to researchers","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131367897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 150
The Strong Correlation Between Code Signatures and Performance
Jeremy Lau, J. Sampson, Erez Perelman, Greg Hamerly, B. Calder
A recent study examined the use of sampled hardware counters to create sampled code signatures. This approach is attractive because sampled code signatures can be quickly gathered for any application. The conclusion of their study was that there exists a fuzzy correlation between sampled code signatures and performance predictability. The paper raises the question of how much information is lost in the sampling process, and our paper focuses on examining this issue. We first focus on showing that there exists a strong correlation between code signatures and performance. We then examine the relationship between sampled and full code signatures, and how these affect performance predictability. Our results confirm that there is a fuzzy correlation found in recent work for the SPEC programs with sampled code signatures, but that a strong correlation exists with full code signatures. In addition, we propose converting the sampled instruction counts, used in the prior work, into sampled code signatures representing loop and procedure execution frequencies. These sampled loop and procedure code signatures allow phase analysis to more accurately and easily find patterns, and they correlate better with performance
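The measurement at the heart of this study is how well distances between interval code signatures (for example, loop and procedure execution-frequency vectors) track differences in performance such as CPI. The sketch below shows that measurement on synthetic data only: Manhattan distance between signature vectors versus absolute CPI difference, summarized with a Pearson correlation coefficient. The signatures and CPI values are fabricated stand-ins, not the SPEC data used in the paper.

```python
# Illustrative sketch of the correlation measurement: how well does the distance
# between two intervals' code signatures predict the difference in their CPI?
# Signatures and CPI values below are synthetic; the paper gathers them from
# SPEC programs via hardware counters and simulation.
import numpy as np

rng = np.random.default_rng(2)

n_intervals, n_dims = 100, 16
signatures = rng.random((n_intervals, n_dims))
signatures /= signatures.sum(axis=1, keepdims=True)   # loop/procedure frequency vectors
cpi = 0.8 + 3.0 * signatures[:, 0] + 0.05 * rng.standard_normal(n_intervals)  # synthetic CPI

sig_dist, cpi_diff = [], []
for i in range(n_intervals):
    for j in range(i + 1, n_intervals):
        sig_dist.append(np.abs(signatures[i] - signatures[j]).sum())  # Manhattan distance
        cpi_diff.append(abs(cpi[i] - cpi[j]))

r = np.corrcoef(sig_dist, cpi_diff)[0, 1]
print(f"correlation between signature distance and CPI difference: r = {r:.2f}")
```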
{"title":"The Strong correlation Between Code Signatures and Performance","authors":"Jeremy Lau, J. Sampson, Erez Perelman, Greg Hamerly, B. Calder","doi":"10.1109/ISPASS.2005.1430578","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430578","url":null,"abstract":"A recent study examined the use of sampled hardware counters to create sampled code signatures. This approach is attractive because sampled code signatures can be quickly gathered for any application. The conclusion of their study was that there exists a fuzzy correlation between sampled code signatures and performance predictability. The paper raises the question of how much information is lost in the sampling process, and our paper focuses on examining this issue. We first focus on showing that there exists a strong correlation between code signatures and performance. We then examine the relationship between sampled and full code signatures, and how these affect performance predictability. Our results confirm that there is a fuzzy correlation found in recent work for the SPEC programs with sampled code signatures, but that a strong correlation exists with full code signatures. In addition, we propose converting the sampled instruction counts, used in the prior work, into sampled code signatures representing loop and procedure execution frequencies. These sampled loop and procedure code signatures allow phase analysis to more accurately and easily find patterns, and they correlate better with performance","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126342115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 100
Partitioning Multi-Threaded Processors with a Large Number of Threads
A. El-Moursy, Rajeev Garg, D. Albonesi, S. Dwarkadas
Today's general-purpose processors are increasingly using multithreading in order to better leverage the additional on-chip real estate available with each technology generation. Simultaneous multi-threading (SMT) was originally proposed as a large dynamic superscalar processor with monolithic hardware structures shared among all threads. Intel's hyper-threaded Pentium 4 processor partitions the queue structures among two threads, demonstrating more balanced performance by reducing the hoarding of structures by a single thread. IBM's Power5 processor is a 2-way chip multiprocessor (CMP) of SMT processors, each supporting 2 threads, which significantly reduces design complexity and can improve power efficiency. This paper examines processor partitioning options for larger numbers of threads on a chip. While growing transistor budgets permit four and eight-thread processors to be designed, design complexity, power dissipation, and wire scaling limitations create significant barriers to their actual realization. We explore the design choices of sharing, or of partitioning and distributing, the front end (instruction cache, instruction fetch, and dispatch), the execution units and associated state, as well as the L1 Dcache banks, in a clustered multi-threaded (CMT) processor. We show that the best performance is obtained by restricting the sharing of the L1 Dcache banks and the execution engines among threads. On the other hand, significant sharing of the front-end resources is the best approach. When compared against large monolithic SMT processors, a CMT processor provides very competitive IPC performance, on average 90-96% of that of a partitioned SMT, while being more scalable and much more power efficient. In a CMP organization, the gap between SMT and CMT processors shrinks further, making a CMP of CMT processors a highly viable alternative for the future.
{"title":"Partitioning Multi-Threaded Processors with a Large Number of Threads","authors":"A. El-Moursy, Rajeev Garg, D. Albonesi, S. Dwarkadas","doi":"10.1109/ISPASS.2005.1430566","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430566","url":null,"abstract":"Today's general-purpose processors are increasingly using multithreading in order to better leverage the additional on-chip real estate available with each technology generation. Simultaneous multi-threading (SMT) was originally proposed as a large dynamic superscalar processor with monolithic hardware structures shared among all threads. Inters hyper-threaded Pentium 4 processor partitions the queue structures among two threads, demonstrating more balanced performance by reducing the hoarding of structures by a single thread. IBM's Power5 processor is a 2-way chip multiprocessor (CMP) of SMT processors, each supporting 2 threads, which significantly reduces design complexity and can improve power efficiency. This paper examines processor partitioning options for larger numbers of threads on a chip. While growing transistor budgets permit four and eight-thread processors to be designed, design complexity, power dissipation, and wire scaling limitations create significant barriers to their actual realization. We explore the design choices of sharing, or of partitioning and distributing, the front end (instruction cache, instruction fetch, and dispatch), the execution units and associated state, as well as the L1 Dcache banks, in a clustered multi-threaded (CMT) processor. We show that the best performance is obtained by restricting the sharing of the L1 Dcache banks and the execution engines among threads. On the other hand, significant sharing of the front-end resources is the best approach. When compared against large monolithic SMT processors, a CMT processor provides very competitive IPC performance on average, 90-96% of that of partitioned SMT while being more scalable and much more power efficient. In a CMP organization, the gap between SMT and CMT processors shrinks further, making a CMP of CMT processors a highly viable alternative for the future","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128990192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
Pro-active Page Replacement for Scientific Applications: A Characterization
M. Vilayannur, A. Sivasubramaniam, M. Kandemir
Paging policies implemented by today's operating systems cause scientific applications to exhibit poor performance, when the application's working set does not fit in main memory. This has been typically attributed to the sub-optimal performance of LRU-like virtual-memory replacement algorithms. On one end of the spectrum, researchers in the past have proposed fully automated compiler-based techniques that provide crucial information on future access patterns (reuse-distances, release hints etc) of an application that can be exploited by the operating system to make intelligent prefetching and replacement decisions. Static techniques like the aforementioned can be quite accurate, but require that the source code be available and analyzable. At the other end of the spectrum, researchers have also proposed pure system-level algorithmic innovations to improve the performance of LRU-like algorithms, some of which are only interesting from the theoretical sense and may not really be implementable. Instead, in this paper we explore the possibility of tracking application's runtime behavior in the operating system, and find that there are several useful characteristics in the virtual memory behavior that can be anticipated and used to pro-actively manage physical memory usage. Specifically, we show that LRU-like replacement algorithms hold onto pages long after they outlive their usefulness and propose a new replacement algorithm that exploits the predictability of the application's page-fault patterns to reduce the number of page-faults. Our results demonstrate that such techniques can reduce page-faults by as much as 78% over both LRU and EELRU that is considered to be one of the state-of-the-art algorithms towards addressing the performance shortcomings of LRU. Further, we also present an implementable replacement algorithm within the operating system, that performs considerably better than the Linux kernel's replacement algorithm
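The baseline in this comparison is LRU-style replacement, whose shortcomings on scientific access patterns motivate the proposed predictive policy. For reference, here is a minimal LRU simulator that counts page faults on a toy reference string; the memory size and trace are illustrative, and the authors' pro-active algorithm is not reproduced here.

```python
# Minimal LRU page-replacement simulator used only to count page faults on a toy trace.
# It illustrates the baseline the paper compares against; the proposed predictive,
# pro-active policy is not reproduced here.
from collections import OrderedDict

def lru_faults(trace, frames):
    resident = OrderedDict()          # page -> None, ordered from least to most recently used
    faults = 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)            # hit: mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)      # evict the least recently used page
            resident[page] = None
    return faults

# A looping reference pattern slightly larger than memory is LRU's worst case:
# every access misses, which is exactly the behavior the paper characterizes.
trace = list(range(5)) * 10
print("page faults with 4 frames:", lru_faults(trace, frames=4))   # -> 50
```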
{"title":"Pro-active Page Replacement for Scientific Applications: A Characterization","authors":"M. Vilayannur, A. Sivasubramaniam, M. Kandemir","doi":"10.1109/ISPASS.2005.1430579","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430579","url":null,"abstract":"Paging policies implemented by today's operating systems cause scientific applications to exhibit poor performance, when the application's working set does not fit in main memory. This has been typically attributed to the sub-optimal performance of LRU-like virtual-memory replacement algorithms. On one end of the spectrum, researchers in the past have proposed fully automated compiler-based techniques that provide crucial information on future access patterns (reuse-distances, release hints etc) of an application that can be exploited by the operating system to make intelligent prefetching and replacement decisions. Static techniques like the aforementioned can be quite accurate, but require that the source code be available and analyzable. At the other end of the spectrum, researchers have also proposed pure system-level algorithmic innovations to improve the performance of LRU-like algorithms, some of which are only interesting from the theoretical sense and may not really be implementable. Instead, in this paper we explore the possibility of tracking application's runtime behavior in the operating system, and find that there are several useful characteristics in the virtual memory behavior that can be anticipated and used to pro-actively manage physical memory usage. Specifically, we show that LRU-like replacement algorithms hold onto pages long after they outlive their usefulness and propose a new replacement algorithm that exploits the predictability of the application's page-fault patterns to reduce the number of page-faults. Our results demonstrate that such techniques can reduce page-faults by as much as 78% over both LRU and EELRU that is considered to be one of the state-of-the-art algorithms towards addressing the performance shortcomings of LRU. Further, we also present an implementable replacement algorithm within the operating system, that performs considerably better than the Linux kernel's replacement algorithm","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130271902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3