IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. Latest articles


Partitioning Multi-Threaded Processors with a Large Number of Threads

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.

Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430566

A. El-Moursy, Rajeev Garg, D. Albonesi, S. Dwarkadas

Today's general-purpose processors are increasingly using multithreading in order to better leverage the additional on-chip real estate available with each technology generation. Simultaneous multi-threading (SMT) was originally proposed as a large dynamic superscalar processor with monolithic hardware structures shared among all threads. Intel's hyper-threaded Pentium 4 processor partitions the queue structures among two threads, demonstrating more balanced performance by reducing the hoarding of structures by a single thread. IBM's Power5 processor is a 2-way chip multiprocessor (CMP) of SMT processors, each supporting 2 threads, which significantly reduces design complexity and can improve power efficiency. This paper examines processor partitioning options for larger numbers of threads on a chip. While growing transistor budgets permit four- and eight-thread processors to be designed, design complexity, power dissipation, and wire scaling limitations create significant barriers to their actual realization. We explore the design choices of sharing, or of partitioning and distributing, the front end (instruction cache, instruction fetch, and dispatch), the execution units and associated state, as well as the L1 Dcache banks, in a clustered multi-threaded (CMT) processor. We show that the best performance is obtained by restricting the sharing of the L1 Dcache banks and the execution engines among threads. On the other hand, significant sharing of the front-end resources is the best approach. When compared against large monolithic SMT processors, a CMT processor provides very competitive IPC performance, on average 90-96% of that of partitioned SMT, while being more scalable and much more power efficient. In a CMP organization, the gap between SMT and CMT processors shrinks further, making a CMP of CMT processors a highly viable alternative for the future.



Citations: 33

On the Scalability of 1- and 2-Dimensional SIMD Extensions for Multimedia Applications

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.

Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430571

Friman Sánchez, M. Alvarez, E. Salamí, Alex Ramírez, M. Valero

SIMD extensions are the most common technique used in current processors for multimedia computing. In order to obtain more performance for emerging applications, SIMD extensions need to be scaled. In this paper we perform a scalability analysis of SIMD extensions for multimedia applications. Scaling a 1-dimensional extension, like Intel MMX, is compared to scaling a 2-dimensional (matrix) extension. Evaluations demonstrate that the 2-d architecture is able to use more parallel hardware than the 1-d extension. Speed-ups over a 2-way superscalar processor with an MMX-like extension go up to 4X for kernels and up to 3.3X for complete applications, and the matrix architecture can deliver, in some cases, more performance with simpler processor configurations. The experiments also show that the scaled matrix architecture is reaching the limits of the DLP available in the internal loops of common multimedia kernels.


Citations: 10

Power-Performance Implications of Thread-level Parallelism on Chip Multiprocessors

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.

Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430567

Jian Li, José F. Martínez

We discuss power-performance implications of running parallel applications on chip multiprocessors (CMPs). First, we develop an analytical model that, for the first time, puts together parallel efficiency, granularity, and voltage/frequency scaling to quantify the performance and power consumption delivered by a CMP running a parallel code. Then, we conduct detailed simulations of parallel applications running on a power-performance CMP model. Our experiments confirm that our analytical model predicts power-performance behavior reasonably well. Both analytical and experimental models show that parallel computing can bring significant power savings and still meet a given performance target, by choosing granularity and voltage/frequency levels judiciously. The particular choice, however, is dependent on the application's parallel efficiency curve and the process technology utilized, which our model captures. Likewise, the analytical model and experiments show the effect of a limited power budget on the application's scalability curve. In particular, we show that a limited power budget can cause a rapid performance degradation beyond a number of cores, even in the case of applications with excellent scalability properties. On the other hand, our experiments show that power-thrifty memory-bound applications can actually enjoy better scalability than more "nominally scalable" applications (i.e., without regard to power) when a limited power budget is in place.
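
The interplay of parallel efficiency and voltage/frequency scaling described in this abstract can be illustrated with a textbook-style estimate. This is not the model from the paper: it is a rough sketch that assumes dynamic power only, supply voltage scaling linearly with frequency, and no voltage floor. The symbols $N$ (core count), $\varepsilon_N$ (parallel efficiency on $N$ cores), $f$, $V$, and $C$ are introduced here for illustration.

```latex
% Dynamic power of one core at frequency f_1 and voltage V_1:
P_1 \propto C \, V_1^2 \, f_1
% Matching single-core performance on N cores with efficiency \varepsilon_N:
f_N = \frac{f_1}{N \, \varepsilon_N}
% Assuming V \propto f, total dynamic power on N cores:
P_N = N \, P_1 \left( \frac{f_N}{f_1} \right)^{3}
    = \frac{P_1}{N^{2} \, \varepsilon_N^{3}}
% Example: N = 4, \varepsilon_4 = 0.8
\frac{P_4}{P_1} = \frac{1}{16 \times 0.512} \approx 0.12
```

Under these idealized assumptions, four cores at 80% efficiency would meet the same performance target at roughly 12% of the single-core dynamic power. Static power and the limited voltage range of a real process technology reduce this benefit, which is consistent with the abstract's remark that the best choice depends on the process technology.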



Citations: 65

Performance Analysis of a New Packet Trace Compressor based on TCP Flow Clustering

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.

Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430576

R. Holanda, Javier Verdú, J. García-Vidal, M. Valero

In this paper we study the properties of a new packet trace compression method based on clustering of TCP flows. With our proposed method, the compressed file is around 3% of the original size, reducing a file, for instance, from 100 MB to 3 MB. Although the method defines a lossy compressed data format, it preserves important statistical properties present in the original trace. In order to validate the method, memory performance studies were done with the Radix Tree algorithm executing a trace generated by our method. To support these studies, we measured memory accesses and cache miss ratio. So far, the results show that our proposed method provides a good solution for packet trace compression.


Citations: 7

Scalarization on Short Vector Machines

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.

Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430573

Yuan Zhao, K. Kennedy

Scalarization is a process that converts array statements into loop nests so that they can run on a scalar machine. One technical difficulty of scalarization is that temporary storage often needs to be allocated in order to preserve the "fetch before store" semantics of array syntax. Many techniques have been developed to reduce the temporary storage requirement in order to improve memory hierarchy performance. With the emergence of short vector units on modern microprocessors, it is interesting to see how to extend the preexisting scalarization methods so that the underlying vector infrastructure is fully utilized, while at the same time keeping the temporary storage minimized. In this paper, we extend a loop alignment algorithm for scalarization on short vector machines. The revised algorithm not only achieves vector execution with minimum temporary storage, but also handles data alignment properly, which is very important for performance. Our experiments on two types of widely available architectures demonstrate the effectiveness of our strategy.
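
The "fetch before store" problem mentioned in this abstract can be shown with a small sketch. It is only an illustration of why temporaries arise, not the loop alignment algorithm from the paper; the array statement `a[1:] = a[:-1] + 1` and all function names are chosen here as assumptions for the example.

```python
def array_statement(a):
    """Reference semantics of the array statement a[1:] = a[:-1] + 1.

    Every right-hand-side value is fetched before any element is stored.
    """
    rhs = [x + 1 for x in a[:-1]]  # fetch all operands first
    return [a[0]] + rhs            # then store the results


def naive_scalarization(a):
    """Forward loop without a temporary: reads values it already overwrote."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1] + 1        # a[i - 1] may already hold a new value
    return a


def scalarization_with_temp(a):
    """Full-size temporary: always correct, but needs O(n) extra storage."""
    a = list(a)
    temp = [a[i - 1] + 1 for i in range(1, len(a))]
    for i in range(1, len(a)):
        a[i] = temp[i - 1]
    return a


def scalarization_reversed(a):
    """Reversed loop: each a[i - 1] is read before it is overwritten."""
    a = list(a)
    for i in range(len(a) - 1, 0, -1):
        a[i] = a[i - 1] + 1
    return a


data = [10, 20, 30, 40]
print(array_statement(data))          # [10, 11, 21, 31]
print(naive_scalarization(data))      # [10, 11, 12, 13] (wrong)
print(scalarization_with_temp(data))  # [10, 11, 21, 31]
print(scalarization_reversed(data))   # [10, 11, 21, 31]
```

Here the reversed loop removes the temporary entirely. On a short vector machine the same dependence has to be respected across whole vector operations, and the operands must also be aligned, which is the setting the abstract addresses.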


Citations: 9

Simulation Differences Between Academia and Industry: A Branch Prediction Case Study

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.

Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430556

G. Loh

Computer architecture research in academia and industry is heavily reliant on simulation studies. While microprocessor companies have the resources to develop highly detailed simulation infrastructures that they correlate against their own silicon, academic researchers tend to use free, widely available simulators. The differences in instruction set architectures, operating systems, simulator models and benchmarks create a disconnect between academic and industrial research studies. This paper presents a comparative study to find correlations and differences between the same microarchitecture studies conducted in two different frameworks. Due to the limited availability of industrial simulation frameworks, this research is limited to a case study of branch predictors. Encouragingly, our simulations indicate that several recently proposed branch predictors behave similarly in both environments when evaluated with the SPEC CPU benchmark suite. Unfortunately, we also present results showing that conclusions drawn from studies based on SPEC CPU do not necessarily hold when other applications are considered.



Citations: 10
