
2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT): Latest Papers

Discovering and understanding performance bottlenecks in transactional applications
Ferad Zyulkyarov, Srdjan Stipic, T. Harris, O. Unsal, A. Cristal, I. Hur, M. Valero
Many researchers have developed applications using transactional memory (TM) with the purpose of benchmarking different implementations, and studying whether or not TM is easy to use. However, comparatively little has been done to provide general-purpose tools for profiling and tuning programs which use transactions.
DOI: https://doi.org/10.1145/1854273.1854311
Published: 2010-09-11
Citations: 39
Approximating age-based arbitration in on-chip networks
M. J. Lee, John Kim, D. Abts, Michael R. Marty, Jae W. Lee
The on-chip network of emerging many-core CMPs enables the sharing of numerous on-chip components. This on-chip network needs to ensure fairness when accessing the shared resources. In this work, we propose providing equality of service (EoS) in the on-chip networks of future many-core CMPs by leveraging distance, or hop count, to approximate the age of packets in the network. We propose probabilistic arbitration combined with distance-based weights to achieve EoS and overcome the limitations of conventional round-robin arbiters. We describe how nonlinear weights need to be used with probabilistic arbiters and propose three different arbitration weight metrics: fixed weight, constantly increasing weight, and variably increasing weight. By modifying only the arbitration of an on-chip router, we do not require any additional buffers or virtual channels and create a complexity-effective mechanism for achieving EoS.
DOI: https://doi.org/10.1145/1854273.1854359
Published: 2010-09-11
Citations: 6
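The idea of probabilistic arbitration with distance-based weights, as described in the abstract above, can be illustrated with a minimal sketch. This is not the authors' design: the exponential weight function, the `base` parameter, and the port names are assumptions chosen only to show a nonlinear weight, since the abstract does not give the actual formulas.

```python
import random

def weight(hops, base=2.0):
    # Hypothetical nonlinear weight: grows exponentially with the hop count,
    # so packets that have travelled farther are treated as "older".
    return base ** hops

def probabilistic_arbiter(requests, rng):
    """Grant one requesting port.

    requests maps a port name to the hop count of the packet at its head.
    The grant is drawn with probability proportional to each port's weight.
    """
    ports = sorted(requests)
    weights = [weight(requests[p]) for p in ports]
    return rng.choices(ports, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    # A packet that travelled 4 hops competes with one that travelled 1 hop.
    requests = {"north": 1, "west": 4}
    grants = {"north": 0, "west": 0}
    for _ in range(10_000):
        grants[probabilistic_arbiter(requests, rng)] += 1
    print(grants)
```

With these assumed weights (2 versus 16), the far-travelled packet should win roughly eight out of nine arbitration rounds, whereas a round-robin arbiter would alternate between the two ports regardless of distance.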
SPACE: Sharing pattern-based directory coherence for multicore scalability
Hongzhou Zhao, Arrvindh Shriraman, S. Dwarkadas
An important challenge in multicore processors is the maintenance of cache coherence in a scalable manner. Directory-based protocols save bandwidth and achieve scalability by associating information about sharer cores with every cache block. As the number of cores and cache sizes increase, the directory itself adds significant area and energy overheads.
DOI: https://doi.org/10.1145/1854273.1854294
Published: 2010-09-11
Citations: 77
Ordered and unordered algorithms for parallel breadth first search
M. A. Hassaan, Martin Burtscher, K. Pingali
We describe and evaluate ordered and unordered algorithms for shared-memory parallel breadth-first search. The unordered algorithm is based on viewing breadth-first search as a fixpoint computation, and in general, it may perform more work than the ordered algorithms while requiring less global synchronization.
DOI: https://doi.org/10.1145/1854273.1854341
Published: 2010-09-11
Citations: 7
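The contrast between the ordered and unordered formulations in the abstract above can be sketched sequentially as follows. This is an illustration only, not the authors' parallel implementation; the function names and the random worklist order are assumptions. The ordered version advances one level at a time (where a parallel version would synchronize between levels), while the unordered version treats distances as a fixpoint of the relaxation `dist[v] = min(dist[v], dist[u] + 1)` and may relabel a node more than once.

```python
import random

def bfs_ordered(graph, source):
    """Level-by-level BFS: every reachable node is labelled exactly once."""
    dist = {source: 0}
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:  # a parallel version would place a barrier after each level
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return dist

def bfs_unordered(graph, source, seed=0):
    """Fixpoint BFS: relax edges in arbitrary order until no label changes.

    Returns the distances and the number of label updates performed, which
    can exceed the number of reachable nodes because labels may be lowered
    several times before they settle.
    """
    rng = random.Random(seed)
    dist = {source: 0}
    worklist = [source]
    updates = 0
    while worklist:
        # Remove an arbitrary worklist item (no level ordering is enforced).
        i = rng.randrange(len(worklist))
        worklist[i], worklist[-1] = worklist[-1], worklist[i]
        u = worklist.pop()
        for v in graph[u]:
            if dist[u] + 1 < dist.get(v, float("inf")):
                dist[v] = dist[u] + 1
                worklist.append(v)
                updates += 1
    return dist, updates

if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [4], 4: []}
    print(bfs_ordered(graph, 0))
    print(bfs_unordered(graph, 0))
```

Both functions compute the same distances; the unordered one trades possible redundant updates for the freedom to process work items in any order.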
System-level Max POwer (SYMPO) - a systematic approach for escalating system-level power consumption using synthetic benchmarks
K. Ganesan, Jungho Jo, W. Bircher, Dimitris Kaseridis, Zhibin Yu, L. John
To effectively design a computer system for the worst-case power consumption scenario, system architects often use hand-crafted maximum power consuming benchmarks at the assembly language level. These stressmarks, also called power viruses, are very tedious to generate and require significant domain knowledge. In this paper, we propose SYMPO, an automatic SYstem-level Max POwer virus generation framework, which maximizes the power consumption of the CPU and the memory system using a genetic algorithm and an abstract workload generation framework. For a set of three ISAs, we show the efficacy of the power viruses generated using SYMPO by comparing their power consumption with that of the MPrime torture test, which is widely used by industry to test system stability. Our results show that SYMPO generates power viruses that consume 14–41% more power than MPrime on the SPARC ISA. The genetic algorithm achieved this result in about 70 to 90 generations, in 11 to 15 hours, when using a full system simulator. We also show that the power viruses generated in the Alpha ISA consume 9–24% more power than the previous approach of stressmark generation. We measure and provide the power consumption of these benchmarks on hardware by instrumenting a quad-core AMD Phenom II X4 system. The SYMPO power virus consumes more power than various industry-grade power viruses on x86 hardware. We also provide a microarchitecture-independent characterization of various industry-standard power viruses.
DOI: https://doi.org/10.1145/1854273.1854282
Published: 2010-09-11
Citations: 38
Tiled-MapReduce: Optimizing resource usages of data-parallel applications on multicore with tiling
Rong-Xin Chen, Haibo Chen, B. Zang
The prevalence of chip multiprocessors opens opportunities for running data-parallel applications, originally run on clusters, on a single machine with many cores. MapReduce, a simple and elegant programming model for large-scale clusters, has recently been shown to be a promising alternative for harnessing the multicore platform.
DOI: https://doi.org/10.1145/1854273.1854337
Published: 2010-09-11
Citations: 130
Energy efficient speculative threads: Dynamic thread allocation in same-ISA heterogeneous multicore systems
Yangchun Luo, Venkatesan Packirisamy, W. Hsu, Antonia Zhai
Thread-level parallelism at the chip level is critical in overcoming some of the challenges that have been ushered in through the advent of modern multicore processors (CMP). Extracting speculatively parallel threads from sequential applications and executing these threads on multicore processors is a promising technique to speed up these applications on multicore systems. However, the associated potential degradation in energy efficiency is an important factor that hinders the deployment of this technique. For multicore systems that integrate same-ISA heterogeneous cores, it is possible to judiciously allocate speculative threads to achieve energy-efficient performance improvement.
DOI: https://doi.org/10.1145/1854273.1854329
Published: 2010-09-11
Citations: 18
Adaptive spatiotemporal node selection in dynamic networks
P. Hari, John B. P. McCabe, Jon Banafato, Marcus Henry, Kevin Ko, Emmanouil Koukoumidis, U. Kremer, M. Martonosi, L. Peh
Dynamic networks—spontaneous, self-organizing groups of devices—are a promising new computing platform. Writing applications for such networks is a daunting task, however, due to their extreme variability and unpredictability, with many devices having significant resource limitations. Intelligent, automated distribution of work across network nodes is needed to get the most out of limited resource budgets.
DOI: https://doi.org/10.1145/1854273.1854304
Published: 2010-09-11
Citations: 3
Automatic vector instruction selection for dynamic compilation
R. Barik, Jisheng Zhao, Vivek Sarkar
Accelerating program performance via short SIMD vector units is very common in modern processors, as evidenced by the use of SSE, MMX, and AltiVec SIMD instructions in multimedia, scientific, and embedded applications. To take full advantage of the vector capabilities, a compiler needs to generate efficient vector code automatically. However, most commercial and open-source compilers still fall short of using the full potential of vector units, and only generate vector code for simple loop nests. In this poster, we present the design and implementation of an auto-vectorization framework in the back-end of a dynamic compiler that not only generates optimized vector code but is also well integrated with the instruction scheduler and register allocator. Additionally, we describe a vector instruction selection algorithm based on dynamic programming. Our results obtained in the JikesRVM dynamic compilation environment show a performance improvement of up to 57.71% on an Intel Xeon processor, compared to non-vectorized execution.
DOI: https://doi.org/10.1145/1854273.1854358
Published: 2010-09-11
Citations: 6
Feedback-directed pipeline parallelism
M. A. Suleman, Moinuddin K. Qureshi, Khubaib, Y. Patt
Extracting high performance from Chip Multiprocessors requires that the application be parallelized. A common software technique to parallelize loops is pipeline parallelism in which the programmer/compiler splits each loop iteration into stages and each stage runs on a certain number of cores. It is important to choose the number of cores for each stage carefully because the core-to-stage allocation determines performance and power consumption. Finding the best core-to-stage allocation for an application is challenging because the number of possible allocations is large, and the best allocation depends on the input set and machine configuration. This paper proposes Feedback-Directed Pipelining (FDP), a software framework that chooses the core-to-stage allocation at run-time. FDP first maximizes the performance of the workload and then saves power by reducing the number of active cores, without impacting performance. Our evaluation on a real SMP system with two Core2Quad processors (8 cores) shows that FDP provides an average speedup of 4.2x which is significantly higher than the 2.3x speedup obtained with a practical profile-based allocation. We also show that FDP is robust to changes in machine configuration and input set.
DOI: https://doi.org/10.1145/1854273.1854296
Published: 2010-09-11
Citations: 64
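The two goals stated in the abstract above (first maximize throughput, then release cores that do not contribute to it) can be illustrated with a small sketch. This is not the FDP algorithm itself: FDP works from run-time feedback, whereas this sketch assumes a hypothetical analytical model in which a stage's throughput scales linearly with its core count. The function names and the `work` values are illustrative assumptions, and the sketch expects at least one core per stage.

```python
def pipeline_throughput(work, cores):
    # The slowest stage limits the pipeline; assume ideal linear scaling,
    # so a stage with per-item work w and c cores processes c / w items per unit time.
    return min(c / w for w, c in zip(work, cores))

def allocate(work, total_cores):
    """Return a core count per stage under the assumed linear-scaling model."""
    cores = [1] * len(work)
    # Phase 1 (performance): hand each spare core to the current bottleneck stage.
    for _ in range(total_cores - len(work)):
        bottleneck = min(range(len(work)), key=lambda i: cores[i] / work[i])
        cores[bottleneck] += 1
    # Phase 2 (power): take cores back while throughput stays unchanged.
    best = pipeline_throughput(work, cores)
    for i in range(len(work)):
        while cores[i] > 1:
            cores[i] -= 1
            if pipeline_throughput(work, cores) < best:
                cores[i] += 1  # this core was needed; restore it and move on
                break
    return cores

if __name__ == "__main__":
    # Three stages whose per-item work is in the ratio 1 : 4 : 2, on 8 cores.
    print(allocate([1.0, 4.0, 2.0], 8))
```

In this example the second phase leaves one of the eight cores idle, because giving it to any single stage would not raise the throughput of the slowest stage.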