
IEEE International Symposium on Performance Analysis of Systems and Software, 2005 (ISPASS 2005): Latest Publications

Motivation for Variable Length Intervals and Hierarchical Phase Behavior
Jeremy Lau, Erez Perelman, Greg Hamerly, T. Sherwood, B. Calder
Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program's execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior techniques focus on fixed length intervals (such as a hundred million instructions) to find phase behavior. Fixed length intervals can make a program's periodic phase behavior difficult to find, because the fixed interval length can be out of sync with the period of the program's actual phase behavior. In addition, a fixed interval length can only express one level of phase behavior. In this paper, we graphically show that there exists a hierarchy of phase behavior in programs and motivate the need for variable length intervals. We describe the changes applied to SimPoint to support variable length intervals. We finally conclude by providing an initial study into using variable length intervals to guide SimPoint
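The abstract describes grouping execution intervals into phases by clustering their behavior vectors, as SimPoint does with basic block vectors. The sketch below is only an illustration of that clustering step, not the authors' tool: it runs plain k-means over hypothetical per-interval frequency vectors and picks one representative interval per phase; the data, the number of phases, and the feature dimensions are all made-up assumptions.

```python
# Minimal sketch of SimPoint-style phase clustering (illustrative, not the authors' code).
# Each execution interval is summarized by a normalized basic-block frequency vector;
# k-means groups similar intervals into phases, and one representative per phase is chosen.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 intervals, each a 32-dimensional basic-block frequency vector.
intervals = rng.random((200, 32))
intervals /= intervals.sum(axis=1, keepdims=True)   # normalize each interval's vector

def kmeans(x, k, iters=50):
    """Plain k-means: returns (labels, centroids)."""
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = x[labels == c].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(intervals, k=4)

# Pick, for each phase, the interval closest to its centroid as the simulation point.
for phase in range(4):
    members = np.flatnonzero(labels == phase)
    rep = members[np.argmin(np.linalg.norm(intervals[members] - centroids[phase], axis=1))]
    print(f"phase {phase}: {len(members)} intervals, representative interval {rep}")
```

With variable length intervals, the vectors above would simply summarize differently sized slices of execution; the clustering step itself is unchanged.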
{"title":"Motivation for Variable Length Intervals and Hierarchical Phase Behavior","authors":"Jeremy Lau, Erez Perelman, Greg Hamerly, T. Sherwood, B. Calder","doi":"10.1109/ISPASS.2005.1430568","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430568","url":null,"abstract":"Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program's execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior techniques focus on fixed length intervals (such as a hundred million instructions) to find phase behavior. Fixed length intervals can make a program's periodic phase behavior difficult to find, because the fixed interval length can be out of sync with the period of the program's actual phase behavior. In addition, a fixed interval length can only express one level of phase behavior. In this paper, we graphically show that there exists a hierarchy of phase behavior in programs and motivate the need for variable length intervals. We describe the changes applied to SimPoint to support variable length intervals. We finally conclude by providing an initial study into using variable length intervals to guide SimPoint","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121211676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 91
A High Performance, Energy Efficient GALS Processor Microarchitecture with Reduced Implementation Complexity
Yongkang Zhu, D. Albonesi, A. Buyuktosunoglu
As the costs and challenges of global clock distribution grow with each new microprocessor generation, a globally asynchronous, locally synchronous (GALS) approach becomes an attractive alternative. One proposed GALS approach, called a multiple clock domain (MCD) processor, achieves impressive energy savings for a relatively low performance cost. However, the approach requires separating the processor into four domains, including separating the integer and memory domains which complicates load scheduling, and the implementation of 32 voltage and frequency levels in each domain. In addition, the hardware-based control algorithm, though effective overall, produces a significant performance degradation for some applications. In this paper, we devise modifications to the MCD design that retain many of its benefits while greatly reducing the implementation complexity. We first determine that the synchronization channels that are most responsible for the MCD performance degradation are those involving cache access, and propose merging the integer and memory domains to virtually eliminate this overhead. We further propose significantly reducing the number of voltage levels, separating the reorder buffer into its own domain to permit front-end frequency scaling, separating the L2 cache to permit standard power optimizations to be used, and a new online algorithm that provides consistent results across our benchmark suite. The overall result is a significant reduction in the performance degradation of the original MCD approach and greater energy savings, with a greatly simplified microarchitecture that is much easier to implement
{"title":"A High Performance, Energy Efficient GALS ProcessorMicroarchitecture with Reduced Implementation Complexity","authors":"Yongkang Zhu, D. Albonesi, A. Buyuktosunoglu","doi":"10.1109/ISPASS.2005.1430558","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430558","url":null,"abstract":"As the costs and challenges of global clock distribution grow with each new microprocessor generation, a globally asynchronous, locally synchronous (GALS) approach becomes an attractive alternative. One proposed GALS approach, called a multiple clock domain (MCD) processor, achieves impressive energy savings for a relatively low performance cost. However, the approach requires separating the processor into four domains, including separating the integer and memory domains which complicates load scheduling, and the implementation of 32 voltage and frequency levels in each domain. In addition, the hardware-based control algorithm, though effective overall, produces a significant performance degradation for some applications. In this paper, we devise modifications to the MCD design that retain many of its benefits while greatly reducing the implementation complexity. We first determine that the synchronization channels that are most responsible for the MCD performance degradation are those involving cache access, and propose merging the integer and memory domains to virtually eliminate this overhead. We further propose significantly reducing the number of voltage levels, separating the reorder buffer into its own domain to permit front-end frequency scaling, separating the L2 cache to permit standard power optimizations to be used, and a new online algorithm that provides consistent results across our benchmark suite. The overall result is a significant reduction in the performance degradation of the original MCD approach and greater energy savings, with a greatly simplified microarchitecture that is much easier to implement","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115230965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Measuring Program Similarity: Experiments with SPEC CPU Benchmark Suites
Aashish Phansalkar, A. Joshi, L. Eeckhout, L. John
It is essential that a subset of benchmark programs used to evaluate an architectural enhancement is well distributed within the target workload space rather than clustered in specific areas. Past efforts for identifying subsets have primarily relied on using microarchitecture-dependent metrics of program performance, such as cycles per instruction and cache miss-rate. The shortcoming of this technique is that the results could be biased by the idiosyncrasies of the chosen configurations. The objective of this paper is to present a methodology to measure similarity of programs based on their inherent microarchitecture-independent characteristics which will make the results applicable to any microarchitecture. We apply our methodology to the SPEC CPU2000 benchmark suite and demonstrate that a subset of 8 programs can be used to effectively represent the entire suite. We validate the usefulness of this subset by using it to estimate the average IPC and L1 data cache miss-rate of the entire suite. The average IPC of 8-way and 16-way issue superscalar processor configurations could be estimated with 3.9% and 4.4% error, respectively. This methodology is applicable not only to find subsets from a benchmark suite, but also to identify programs for a benchmark suite from a list of potential candidates. Studying the four generations of SPEC CPU benchmark suites, we find that other than a dramatic increase in the dynamic instruction count and increasingly poor temporal data locality, the inherent program characteristics have more or less remained the same.
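The paper's methodology applies PCA and clustering to microarchitecture-independent program features and then picks representatives. As a rough flavor of the subsetting goal only, the sketch below uses a simpler greedy farthest-point selection over standardized features; the feature matrix, program names, and subset size are hypothetical and the technique is a stand-in, not the paper's method.

```python
# Illustrative sketch: choosing a representative benchmark subset from
# microarchitecture-independent features (ILP, instruction mix, locality metrics, ...).
# The paper uses PCA + clustering; this greedy farthest-point selection only
# demonstrates the goal of spreading the chosen subset across the workload space.
import numpy as np

rng = np.random.default_rng(1)

programs = [f"bench{i:02d}" for i in range(26)]   # hypothetical suite of 26 programs
features = rng.random((26, 10))                   # hypothetical feature matrix

# Standardize each feature so no single metric dominates the distance.
z = (features - features.mean(axis=0)) / features.std(axis=0)

def pick_subset(x, k):
    chosen = [0]                                   # start from an arbitrary program
    while len(chosen) < k:
        d = np.min(np.linalg.norm(x[:, None, :] - x[chosen][None, :, :], axis=2), axis=1)
        chosen.append(int(d.argmax()))             # add the program farthest from the current subset
    return chosen

subset = pick_subset(z, k=8)
print("representative subset:", [programs[i] for i in subset])
```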
{"title":"Measuring Program Similarity: Experiments with SPEC CPU Benchmark Suites","authors":"Aashish Phansalkar, A. Joshi, L. Eeckhout, L. John","doi":"10.1109/ISPASS.2005.1430555","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430555","url":null,"abstract":"It is essential that a subset of benchmark programs used to evaluate an architectural enhancement, is well distributed within the target workload space rather than clustered in specific areas. Past efforts for identifying subsets have primarily relied on using microarchitecture-dependent metrics of program performance, such as cycles per instruction and cache miss-rate. The shortcoming of this technique is that the results could be biased by the idiosyncrasies of the chosen configurations. The objective of this paper is to present a methodology to measure similarity of programs based on their inherent microarchitecture-independent characteristics which will make the results applicable to any microarchitecture. We apply our methodology to the SPEC CPU2000 benchmark suite and demonstrate that a subset of 8 programs can be used to effectively represent the entire suite. We validate the usefulness of this subset by using it to estimate the average IPC and L1 data cache miss-rate of the entire suite. The average IPC of 8-way and 16-way issue superscalar processor configurations could be estimated with 3.9% and 4.4% error respectively. This methodology is applicable not only to find subsets from a benchmark suite, but also to identify programs for a benchmark suite from a list of potential candidates. Studying the four generations of SPEC CPU benchmark suites, we find that other than a dramatic increase in the dynamic instruction count and increasingly poor temporal data locality, the inherent program characteristics have more or less remained the same","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122693928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 172
Accelerating Multiprocessor Simulation with a Memory Timestamp Record
K. Barr, Heidi Pan, Michael Zhang, K. Asanović
We introduce a fast and accurate technique for initializing the directory and cache state of a multiprocessor system based on a novel software structure called the memory timestamp record (MTR). The MTR is a versatile, compressed snapshot of memory reference patterns which can be rapidly updated during fast-forwarded simulation, or stored as part of a checkpoint. We evaluate MTR using a full-system simulation of a directory-based cache-coherent multiprocessor running a range of multithreaded workloads. Both MTR and a multiprocessor version of functional fast-forwarding (FFW) make similar performance estimates, usually within 15% of our detailed model. In addition to other benefits, we show that MTR has up to a 1.45x speedup over FFW, and a 7.7x speedup over our detailed baseline
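To convey the data-structure idea only, here is a rough sketch of a memory-timestamp-record-like snapshot: per-block write timestamps and per-CPU read timestamps, from which a simplified coherence state is reconstructed when detailed simulation begins. The class, the reconstruction rule, and the example trace are assumptions made for illustration; the real MTR must also account for cache capacity, eviction, and associativity.

```python
# Rough sketch of a memory-timestamp-record-like structure (not the authors' implementation).
# For every block we track the last write (time, cpu) and each cpu's last read time; when the
# detailed simulator starts, a simplified coherence state is reconstructed from those timestamps.
# Capacity, eviction, and associativity effects are ignored here.
from collections import defaultdict

class MemoryTimestampRecord:
    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        self.last_write = {}                                      # block -> (time, cpu)
        self.last_read = defaultdict(lambda: [None] * num_cpus)   # block -> per-cpu read time

    def record_read(self, time, cpu, block):
        self.last_read[block][cpu] = time

    def record_write(self, time, cpu, block):
        self.last_write[block] = (time, cpu)

    def reconstruct(self, block):
        """Return a simplified coherence state for one block."""
        wtime, writer = self.last_write.get(block, (None, None))
        reads = self.last_read.get(block, [None] * self.num_cpus)
        sharers = [c for c, t in enumerate(reads)
                   if t is not None and (wtime is None or t > wtime)]
        if sharers:
            return ("SHARED", sharers)
        if writer is not None:
            return ("MODIFIED", [writer])
        return ("INVALID", [])

mtr = MemoryTimestampRecord(num_cpus=4)
mtr.record_write(time=10, cpu=0, block=0x40)
mtr.record_read(time=15, cpu=2, block=0x40)
print(mtr.reconstruct(0x40))   # -> ('SHARED', [2])
```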
{"title":"Accelerating Multiprocessor Simulation with a Memory Timestamp Record","authors":"K. Barr, Heidi Pan, Michael Zhang, K. Asanović","doi":"10.1109/ISPASS.2005.1430560","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430560","url":null,"abstract":"We introduce a fast and accurate technique for initializing the directory and cache state of a multiprocessor system based on a novel software structure called the memory timestamp record (MTR). The MTR is a versatile, compressed snapshot of memory reference patterns which can be rapidly updated during fast-forwarded simulation, or stored as part of a checkpoint. We evaluate MTR using a full-system simulation of a directory-based cache-coherent multiprocessor running a range of multithreaded workloads. Both MTR and a multiprocessor version of functional fast-forwarding (FFW) make similar performance estimates, usually within 15% of our detailed model. In addition to other benefits, we show that MTR has up to a 1.45x speedup over FFW, and a 7.7x speedup over our detailed baseline","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123611723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 50
Anatomy and Performance of SSL Processing
Li Zhao, R. Iyer, S. Makineni, L. Bhuyan
A wide spectrum of e-commerce (B2B/B2C), banking, financial trading and other business applications require the exchange of data to be highly secure. The Secure Sockets Layer (SSL) protocol provides the essential ingredients of secure communications - privacy, integrity and authentication. Though it is well-understood that security always comes at the cost of performance, these costs depend on the cryptographic algorithms. In this paper, we present a detailed description of the anatomy of a secure session. We analyze the time spent on the various cryptographic operations (symmetric, asymmetric and hashing) during the session negotiation and data transfer. We then analyze the most frequently used cryptographic algorithms (RSA, AES, DES, 3DES, RC4, MD5 and SHA-1). We determine the key components of these algorithms (setting up key schedules, encryption rounds, substitutions, permutations, etc) and determine where most of the time is spent. We also provide an architectural analysis of these algorithms, show the frequently executed instructions and discuss the ISA/hardware support that may be beneficial to improving SSL performance. We believe that the performance data presented in this paper is useful to performance analysts and processor architects to help accelerate SSL performance in future processors
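The abstract breaks SSL session time down by cryptographic primitive. As a small taste of that kind of per-primitive measurement, the sketch below times two of the hash algorithms mentioned (MD5 and SHA-1) using Python's hashlib; the buffer size and iteration count are arbitrary, and the paper itself instruments real SSL sessions and processor behavior rather than a scripting runtime.

```python
# Tiny illustrative timing harness for two of the hash algorithms discussed (MD5, SHA-1).
# This only conveys the flavor of per-primitive cost breakdowns, not the paper's methodology.
import hashlib
import time

payload = b"\x00" * (16 * 1024)   # arbitrary 16 KB record-sized buffer
iterations = 2000

for name in ("md5", "sha1"):
    h = hashlib.new(name)
    start = time.perf_counter()
    for _ in range(iterations):
        h.update(payload)
    elapsed = time.perf_counter() - start
    mb = iterations * len(payload) / (1024 * 1024)
    print(f"{name:5s} {mb / elapsed:8.1f} MB/s")
```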
{"title":"Anatomy and Performance of SSL Processing","authors":"Li Zhao, R. Iyer, S. Makineni, L. Bhuyan","doi":"10.1109/ISPASS.2005.1430574","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430574","url":null,"abstract":"A wide spectrum of e-commerce (B2B/B2C), banking, financial trading and other business applications require the exchange of data to be highly secure. The Secure Sockets Layer (SSL) protocol provides the essential ingredients of secure communications - privacy, integrity and authentication. Though it is well-understood that security always comes at the cost of performance, these costs depend on the cryptographic algorithms. In this paper, we present a detailed description of the anatomy of a secure session. We analyze the time spent on the various cryptographic operations (symmetric, asymmetric and hashing) during the session negotiation and data transfer. We then analyze the most frequently used cryptographic algorithms (RSA, AES, DES, 3DES, RC4, MD5 and SHA-1). We determine the key components of these algorithms (setting up key schedules, encryption rounds, substitutions, permutations, etc) and determine where most of the time is spent. We also provide an architectural analysis of these algorithms, show the frequently executed instructions and discuss the ISA/hardware support that may be beneficial to improving SSL performance. We believe that the performance data presented in this paper is useful to performance analysts and processor architects to help accelerate SSL performance in future processors","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115729967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 70
Dataflow: A Complement to Superscalar
M. Budiu, Pedro V. Artigas, S. Goldstein
There has been a resurgence of interest in dataflow architectures, because of their potential for exploiting parallelism with low overhead. In this paper we analyze the performance of a class of static dataflow machines on integer media and control-intensive programs and we explain why a dataflow machine, even with unlimited resources, does not always outperform a superscalar processor on general-purpose codes, under the assumption that both machines take the same time to execute basic operations. We compare a program-specific dataflow machine with unlimited parallelism to a superscalar processor running the same program. While the dataflow machines provide very good performance on most data-parallel programs, we show that the dataflow machine cannot always take advantage of the available parallelism. Using the dynamic critical path we investigate the mechanisms used by superscalar processors to provide a performance advantage and their impact on a dataflow model
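The analysis in the abstract relies on the dynamic critical path of the executed dataflow graph: with unlimited resources, execution time is bounded by the longest latency-weighted chain of dependent operations. A minimal sketch of that computation follows; the example graph and latencies are made up for illustration.

```python
# Minimal sketch: length of the critical path through a dataflow DAG.
# Each operation finishes only after all of its producers; the critical path is the
# latest finish time assuming unlimited execution resources (the pure dataflow limit).
# The graph and latencies below are illustrative.
from functools import lru_cache

latency = {"ld_a": 3, "ld_b": 3, "add": 1, "mul": 4, "st": 2}
producers = {                 # op -> operations whose results it consumes
    "ld_a": [], "ld_b": [],
    "add": ["ld_a", "ld_b"],
    "mul": ["add", "ld_b"],
    "st":  ["mul"],
}

@lru_cache(maxsize=None)
def finish(op):
    deps = producers[op]
    return latency[op] + (max(finish(d) for d in deps) if deps else 0)

critical_path = max(finish(op) for op in latency)
print("dataflow critical path:", critical_path, "cycles")   # -> 10
```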
{"title":"Dataflow: A Complement to Superscalar","authors":"M. Budiu, Pedro V. Artigas, S. Goldstein","doi":"10.1109/ISPASS.2005.1430572","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430572","url":null,"abstract":"There has been a resurgence of interest in dataflow architectures, because of their potential for exploiting parallelism with low overhead. In this paper we analyze the performance of a class of static dataflow machines on integer media and control-intensive programs and we explain why a dataflow machine, even with unlimited resources, does not always outperform a superscalar processor on general-purpose codes, under the assumption that both machines take the same time to execute basic operations. We compare a program-specific dataflow machine with unlimited parallelism to a superscalar processor running the same program. While the dataflow machines provide very good performance on most data-parallel programs, we show that the dataflow machine cannot always take advantage of the available parallelism. Using the dynamic critical path we investigate the mechanisms used by superscalar processors to provide a performance advantage and their impact on a dataflow model","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115266577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 44
BioBench: A Benchmark Suite of Bioinformatics Applications
K. Albayraktaroglu, A. Jaleel, Xue Wu, Manoj Franklin, Bruce Jacob, C. Tseng, Donald Yeung
Recent advances in bioinformatics and the significant increase in computational power available to researchers have made it possible to make better use of the vast amounts of genetic data that has been collected over the last two decades. As the uses of genetic data expand to include drug discovery and development of gene-based therapies, bioinformatics is destined to take its place in the forefront of scientific computing application domains. Despite the clear importance of this field, common bioinformatics applications and their implication on microarchitectural design have received scant attention from the computer architecture community so far. The availability of a common set of bioinformatics benchmarks could be the first step to motivate further research in this crucial area. To this end, this paper presents BioBench, a benchmark suite that represents a diverse set of bioinformatics applications. The first version of BioBench includes applications from different application domains, with a particular emphasis on mature genomics applications. The applications in the benchmark are described briefly, and basic execution characteristics obtained on a real processor are presented. Compared to SPEC INT and SPEC FP benchmarks, applications in BioBench display a higher percentage of load/store instructions, almost negligible floating-point operation content, and higher IPC than either SPEC INT or SPEC FP applications. Our evaluation suggests that bioinformatics applications have distinctly different characteristics from the applications in both of the mentioned SPEC suites; and our findings indicate that bioinformatics workloads can benefit from architectural improvements to memory bandwidth and techniques that exploit their high levels of ILP. The entire BioBench suite and accompanying reference data will be made freely available to researchers
{"title":"BioBench: A Benchmark Suite of Bioinformatics Applications","authors":"K. Albayraktaroglu, A. Jaleel, Xue Wu, Manoj Franklin, Bruce Jacob, C. Tseng, Donald Yeung","doi":"10.1109/ISPASS.2005.1430554","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430554","url":null,"abstract":"Recent advances in bioinformatics and the significant increase in computational power available to researchers have made it possible to make better use of the vast amounts of genetic data that has been collected over the last two decades. As the uses of genetic data expand to include drug discovery and development of gene-based therapies, bioinformatics is destined to take its place in the forefront of scientific computing application domains. Despite the clear importance of this field, common bioinformatics applications and their implication on microarchitectural design have received scant attention from the computer architecture community so far. The availability of a common set of bioinformatics benchmarks could be the first step to motivate further research in this crucial area. To this end, this paper presents BioBench, a benchmark suite that represents a diverse set of bioinformatics applications. The first version of BioBench includes applications from different application domains, with a particular emphasis on mature genomics applications. The applications in the benchmark are described briefly, and basic execution characteristics obtained on a real processor are presented. Compared to SPEC INT and SPEC FP benchmarks, applications in BioBench display a higher percentage of load/store instructions, almost negligible floating-point operation content, and higher IPC than either SPEC INT or SPEC FP applications. Our evaluation suggests that bioinformatics applications have distinctly different characteristics from the applications in both of the mentioned SPEC suites; and our findings indicate that bioinformatics workloads can benefit from architectural improvements to memory bandwidth and techniques that exploit their high levels of ILP. The entire BioBench suite and accompanying reference data will be made freely available to researchers","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131367897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 150
The Strong Correlation Between Code Signatures and Performance
Jeremy Lau, J. Sampson, Erez Perelman, Greg Hamerly, B. Calder
A recent study examined the use of sampled hardware counters to create sampled code signatures. This approach is attractive because sampled code signatures can be quickly gathered for any application. The conclusion of their study was that there exists a fuzzy correlation between sampled code signatures and performance predictability. The paper raises the question of how much information is lost in the sampling process, and our paper focuses on examining this issue. We first focus on showing that there exists a strong correlation between code signatures and performance. We then examine the relationship between sampled and full code signatures, and how these affect performance predictability. Our results confirm that there is a fuzzy correlation found in recent work for the SPEC programs with sampled code signatures, but that a strong correlation exists with full code signatures. In addition, we propose converting the sampled instruction counts, used in the prior work, into sampled code signatures representing loop and procedure execution frequencies. These sampled loop and procedure code signatures allow phase analysis to more accurately and easily find patterns, and they correlate better with performance
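The measurement at the heart of this study is how well distances between interval code signatures (for example, loop and procedure execution-frequency vectors) track differences in performance such as CPI. The sketch below shows that measurement on synthetic data only: Manhattan distance between signature vectors versus absolute CPI difference, summarized with a Pearson correlation coefficient. The signatures and CPI values are fabricated stand-ins, not the SPEC data used in the paper.

```python
# Illustrative sketch of the correlation measurement: how well does the distance
# between two intervals' code signatures predict the difference in their CPI?
# Signatures and CPI values below are synthetic; the paper gathers them from
# SPEC programs via hardware counters and simulation.
import numpy as np

rng = np.random.default_rng(2)

n_intervals, n_dims = 100, 16
signatures = rng.random((n_intervals, n_dims))
signatures /= signatures.sum(axis=1, keepdims=True)   # loop/procedure frequency vectors
cpi = 0.8 + 3.0 * signatures[:, 0] + 0.05 * rng.standard_normal(n_intervals)  # synthetic CPI

sig_dist, cpi_diff = [], []
for i in range(n_intervals):
    for j in range(i + 1, n_intervals):
        sig_dist.append(np.abs(signatures[i] - signatures[j]).sum())  # Manhattan distance
        cpi_diff.append(abs(cpi[i] - cpi[j]))

r = np.corrcoef(sig_dist, cpi_diff)[0, 1]
print(f"correlation between signature distance and CPI difference: r = {r:.2f}")
```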
{"title":"The Strong correlation Between Code Signatures and Performance","authors":"Jeremy Lau, J. Sampson, Erez Perelman, Greg Hamerly, B. Calder","doi":"10.1109/ISPASS.2005.1430578","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430578","url":null,"abstract":"A recent study examined the use of sampled hardware counters to create sampled code signatures. This approach is attractive because sampled code signatures can be quickly gathered for any application. The conclusion of their study was that there exists a fuzzy correlation between sampled code signatures and performance predictability. The paper raises the question of how much information is lost in the sampling process, and our paper focuses on examining this issue. We first focus on showing that there exists a strong correlation between code signatures and performance. We then examine the relationship between sampled and full code signatures, and how these affect performance predictability. Our results confirm that there is a fuzzy correlation found in recent work for the SPEC programs with sampled code signatures, but that a strong correlation exists with full code signatures. In addition, we propose converting the sampled instruction counts, used in the prior work, into sampled code signatures representing loop and procedure execution frequencies. These sampled loop and procedure code signatures allow phase analysis to more accurately and easily find patterns, and they correlate better with performance","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126342115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 100
Partitioning Multi-Threaded Processors with a Large Number of Threads
A. El-Moursy, Rajeev Garg, D. Albonesi, S. Dwarkadas
Today's general-purpose processors are increasingly using multithreading in order to better leverage the additional on-chip real estate available with each technology generation. Simultaneous multi-threading (SMT) was originally proposed as a large dynamic superscalar processor with monolithic hardware structures shared among all threads. Intel's hyper-threaded Pentium 4 processor partitions the queue structures among two threads, demonstrating more balanced performance by reducing the hoarding of structures by a single thread. IBM's Power5 processor is a 2-way chip multiprocessor (CMP) of SMT processors, each supporting 2 threads, which significantly reduces design complexity and can improve power efficiency. This paper examines processor partitioning options for larger numbers of threads on a chip. While growing transistor budgets permit four and eight-thread processors to be designed, design complexity, power dissipation, and wire scaling limitations create significant barriers to their actual realization. We explore the design choices of sharing, or of partitioning and distributing, the front end (instruction cache, instruction fetch, and dispatch), the execution units and associated state, as well as the L1 Dcache banks, in a clustered multi-threaded (CMT) processor. We show that the best performance is obtained by restricting the sharing of the L1 Dcache banks and the execution engines among threads. On the other hand, significant sharing of the front-end resources is the best approach. When compared against large monolithic SMT processors, a CMT processor provides very competitive IPC performance, on average 90-96% of that of a partitioned SMT, while being more scalable and much more power efficient. In a CMP organization, the gap between SMT and CMT processors shrinks further, making a CMP of CMT processors a highly viable alternative for the future.
{"title":"Partitioning Multi-Threaded Processors with a Large Number of Threads","authors":"A. El-Moursy, Rajeev Garg, D. Albonesi, S. Dwarkadas","doi":"10.1109/ISPASS.2005.1430566","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430566","url":null,"abstract":"Today's general-purpose processors are increasingly using multithreading in order to better leverage the additional on-chip real estate available with each technology generation. Simultaneous multi-threading (SMT) was originally proposed as a large dynamic superscalar processor with monolithic hardware structures shared among all threads. Inters hyper-threaded Pentium 4 processor partitions the queue structures among two threads, demonstrating more balanced performance by reducing the hoarding of structures by a single thread. IBM's Power5 processor is a 2-way chip multiprocessor (CMP) of SMT processors, each supporting 2 threads, which significantly reduces design complexity and can improve power efficiency. This paper examines processor partitioning options for larger numbers of threads on a chip. While growing transistor budgets permit four and eight-thread processors to be designed, design complexity, power dissipation, and wire scaling limitations create significant barriers to their actual realization. We explore the design choices of sharing, or of partitioning and distributing, the front end (instruction cache, instruction fetch, and dispatch), the execution units and associated state, as well as the L1 Dcache banks, in a clustered multi-threaded (CMT) processor. We show that the best performance is obtained by restricting the sharing of the L1 Dcache banks and the execution engines among threads. On the other hand, significant sharing of the front-end resources is the best approach. When compared against large monolithic SMT processors, a CMT processor provides very competitive IPC performance on average, 90-96% of that of partitioned SMT while being more scalable and much more power efficient. In a CMP organization, the gap between SMT and CMT processors shrinks further, making a CMP of CMT processors a highly viable alternative for the future","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128990192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
Pro-active Page Replacement for Scientific Applications: A Characterization
M. Vilayannur, A. Sivasubramaniam, M. Kandemir
Paging policies implemented by today's operating systems cause scientific applications to exhibit poor performance, when the application's working set does not fit in main memory. This has been typically attributed to the sub-optimal performance of LRU-like virtual-memory replacement algorithms. On one end of the spectrum, researchers in the past have proposed fully automated compiler-based techniques that provide crucial information on future access patterns (reuse-distances, release hints etc) of an application that can be exploited by the operating system to make intelligent prefetching and replacement decisions. Static techniques like the aforementioned can be quite accurate, but require that the source code be available and analyzable. At the other end of the spectrum, researchers have also proposed pure system-level algorithmic innovations to improve the performance of LRU-like algorithms, some of which are only interesting from the theoretical sense and may not really be implementable. Instead, in this paper we explore the possibility of tracking application's runtime behavior in the operating system, and find that there are several useful characteristics in the virtual memory behavior that can be anticipated and used to pro-actively manage physical memory usage. Specifically, we show that LRU-like replacement algorithms hold onto pages long after they outlive their usefulness and propose a new replacement algorithm that exploits the predictability of the application's page-fault patterns to reduce the number of page-faults. Our results demonstrate that such techniques can reduce page-faults by as much as 78% over both LRU and EELRU that is considered to be one of the state-of-the-art algorithms towards addressing the performance shortcomings of LRU. Further, we also present an implementable replacement algorithm within the operating system, that performs considerably better than the Linux kernel's replacement algorithm
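The baseline in this comparison is LRU-style replacement, whose shortcomings on scientific access patterns motivate the proposed predictive policy. For reference, here is a minimal LRU simulator that counts page faults on a toy reference string; the memory size and trace are illustrative, and the authors' pro-active algorithm is not reproduced here.

```python
# Minimal LRU page-replacement simulator used only to count page faults on a toy trace.
# It illustrates the baseline the paper compares against; the proposed predictive,
# pro-active policy is not reproduced here.
from collections import OrderedDict

def lru_faults(trace, frames):
    resident = OrderedDict()          # page -> None, ordered from least to most recently used
    faults = 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)            # hit: mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)      # evict the least recently used page
            resident[page] = None
    return faults

# A looping reference pattern slightly larger than memory is LRU's worst case:
# every access misses, which is exactly the behavior the paper characterizes.
trace = list(range(5)) * 10
print("page faults with 4 frames:", lru_faults(trace, frames=4))   # -> 50
```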
{"title":"Pro-active Page Replacement for Scientific Applications: A Characterization","authors":"M. Vilayannur, A. Sivasubramaniam, M. Kandemir","doi":"10.1109/ISPASS.2005.1430579","DOIUrl":"https://doi.org/10.1109/ISPASS.2005.1430579","url":null,"abstract":"Paging policies implemented by today's operating systems cause scientific applications to exhibit poor performance, when the application's working set does not fit in main memory. This has been typically attributed to the sub-optimal performance of LRU-like virtual-memory replacement algorithms. On one end of the spectrum, researchers in the past have proposed fully automated compiler-based techniques that provide crucial information on future access patterns (reuse-distances, release hints etc) of an application that can be exploited by the operating system to make intelligent prefetching and replacement decisions. Static techniques like the aforementioned can be quite accurate, but require that the source code be available and analyzable. At the other end of the spectrum, researchers have also proposed pure system-level algorithmic innovations to improve the performance of LRU-like algorithms, some of which are only interesting from the theoretical sense and may not really be implementable. Instead, in this paper we explore the possibility of tracking application's runtime behavior in the operating system, and find that there are several useful characteristics in the virtual memory behavior that can be anticipated and used to pro-actively manage physical memory usage. Specifically, we show that LRU-like replacement algorithms hold onto pages long after they outlive their usefulness and propose a new replacement algorithm that exploits the predictability of the application's page-fault patterns to reduce the number of page-faults. Our results demonstrate that such techniques can reduce page-faults by as much as 78% over both LRU and EELRU that is considered to be one of the state-of-the-art algorithms towards addressing the performance shortcomings of LRU. Further, we also present an implementable replacement algorithm within the operating system, that performs considerably better than the Linux kernel's replacement algorithm","PeriodicalId":230669,"journal":{"name":"IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130271902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3