
Latest publications from the ACM International Conference on Computing Frontiers

Scalable memory registration for high performance networks using helper threads
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016652
Dong Li, K. Cameron, Dimitrios S. Nikolopoulos, B. Supinski, M. Schulz
Remote DMA (RDMA) enables high performance networks to reduce data copying between an application and the operating system (OS). However, RDMA operations in some high performance networks require communication memory to be explicitly registered with the network adapter and pinned by the OS. Memory registration and pinning limit the flexibility of the memory system and reduce the amount of memory that user processes can allocate. These issues become more significant on multicore platforms, since registered memory demand grows linearly with the number of processor cores. In this paper we propose a new memory registration/deregistration strategy to reduce registered memory on multicore architectures for HPC applications. We hide the cost of dynamic memory management by offloading all dynamic memory registration and deregistration requests to a dedicated memory management helper thread. We investigate design policies and performance implications of the helper thread approach. We evaluate our framework with the NAS parallel benchmarks, for which our registration scheme significantly reduces the registered memory (23.62% on average and up to 49.39%) and avoids memory registration/deregistration costs for reused communication memory. We show that our system enables the execution of problem sizes that could not complete under existing memory registration strategies.
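The core mechanism above lends itself to a compact sketch: application threads enqueue registration and deregistration requests, and a dedicated helper thread services them asynchronously. The following C sketch is a minimal illustration, not the paper's implementation; nic_register() and nic_deregister() are hypothetical stand-ins for a real NIC registration API such as InfiniBand's ibv_reg_mr()/ibv_dereg_mr().

/* Offload memory (de)registration to a dedicated helper thread. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct request {
    void  *addr;
    size_t len;
    bool   deregister;            /* false = register, true = deregister */
    struct request *next;
} request_t;

static request_t *head = NULL, *tail = NULL;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
static bool done = false;

/* Hypothetical stand-ins for the NIC registration calls. */
static void nic_register(void *addr, size_t len)   { printf("reg   %p %zu\n", addr, len); }
static void nic_deregister(void *addr, size_t len) { printf("dereg %p %zu\n", addr, len); }

/* Application threads enqueue requests instead of registering inline. */
static void submit(void *addr, size_t len, bool dereg)
{
    request_t *r = malloc(sizeof *r);
    r->addr = addr; r->len = len; r->deregister = dereg; r->next = NULL;
    pthread_mutex_lock(&lock);
    if (tail) tail->next = r; else head = r;
    tail = r;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

/* The helper thread drains the queue, hiding registration latency
 * from the application threads. */
static void *helper(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!head && !done)
            pthread_cond_wait(&nonempty, &lock);
        if (!head) { pthread_mutex_unlock(&lock); return NULL; }
        request_t *r = head;
        head = r->next;
        if (!head) tail = NULL;
        pthread_mutex_unlock(&lock);
        if (r->deregister) nic_deregister(r->addr, r->len);
        else               nic_register(r->addr, r->len);
        free(r);
    }
}

int main(void)
{
    static char buf[4096];
    pthread_t t;
    pthread_create(&t, NULL, helper, NULL);
    submit(buf, sizeof buf, false);   /* asynchronous register   */
    submit(buf, sizeof buf, true);    /* asynchronous deregister */
    pthread_mutex_lock(&lock);
    done = true;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return 0;
}

Because the queue decouples application threads from the potentially slow registration call, deregistration of reused communication buffers can also be deferred, which is the effect the paper exploits for reused communication memory.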
Citations: 5
Parametrizing multicore architectures for multiple sequence alignment
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016642
S. Isaza, Friman Sánchez, F. Cabarcas, Alex Ramírez, G. Gaydadjiev
Sequence alignment is one of the fundamental tasks in bioinformatics. Due to the exponential growth of biological data and the computational complexity of the algorithms used, high performance computing systems are required. Although multicore architectures have the potential to exploit the task-level parallelism found in these workloads, efficiently harnessing systems with hundreds of cores requires deep understanding of the applications and the architecture. As the number of cores grows, shared hardware resources such as buses and memories are likely to saturate and limit performance scalability. In this paper we evaluate the performance impact of various configurations of an accelerator-based multicore architecture with the aim of revealing and quantifying the bottlenecks. We then compare against a multicore built from general-purpose processors and discuss the performance gap. Our target application is ClustalW, one of the most popular programs for Multiple Sequence Alignment. We characterize different input data sets and show how they influence performance. Simulation results show that, due to the high computation-to-communication ratio and the transfer of data in large chunks, memory latency is well tolerated. However, bandwidth is critical to achieving maximum performance. A 32KB cache configuration with 4 banks can capture most of the memory traffic and therefore avoid expensive off-chip transactions. On the other hand, using a hardware queue for task synchronization allows us to handle a large number of cores. Finally, we show that a simple load balancing strategy can increase the performance of the general-purpose cores by 28%.
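The load balancing result in the last sentence is the easiest part to make concrete: instead of statically partitioning alignment tasks across cores, workers pull the next unclaimed task from a shared counter, so long-running alignments no longer leave some cores idle. The C sketch below is an illustrative assumption of such a scheme; the task count and the align_pair() cost model are invented for the example and are not ClustalW's actual code.

/* Dynamic load balancing: workers claim tasks from a shared counter. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTASKS   1000
#define NWORKERS 4

static atomic_int next_task;

/* Placeholder for aligning one pair of sequences; cost varies by task
 * to mimic the uneven work of real alignments. */
static void align_pair(int task)
{
    volatile long w = 0;
    for (long i = 0; i < (task % 97) * 1000L; i++) w++;
}

static void *worker(void *arg)
{
    long id = (long)arg, count = 0;
    for (;;) {
        int t = atomic_fetch_add(&next_task, 1);  /* claim next task */
        if (t >= NTASKS) break;
        align_pair(t);
        count++;
    }
    printf("worker %ld completed %ld tasks\n", id, count);
    return NULL;
}

int main(void)
{
    pthread_t th[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&th[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(th[i], NULL);
    return 0;
}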
Citations: 3
Tolerating correlated failures for generalized Cartesian distributions via bipartite matching
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016649
N. Ali, S. Krishnamoorthy, M. Halappanavar, J. Daily
Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance (ABFT) is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra (FTLA) algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. The evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.
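To ground the parity idea, the sketch below shows its simplest form: a parity block holds the XOR of a group of data blocks, so any single lost block can be rebuilt from the parity plus the survivors. The block and group sizes are illustrative, and the sketch deliberately omits the paper's actual contributions (generalized Cartesian distributions, correlated-failure groups, and matching-based parity placement).

/* XOR parity over a group of data blocks, with single-loss recovery. */
#include <stdio.h>
#include <string.h>

#define BLOCK 8   /* words per block */
#define GROUP 4   /* data blocks covered by one parity block */

static void compute_parity(unsigned data[GROUP][BLOCK], unsigned parity[BLOCK])
{
    memset(parity, 0, BLOCK * sizeof parity[0]);
    for (int b = 0; b < GROUP; b++)
        for (int w = 0; w < BLOCK; w++)
            parity[w] ^= data[b][w];
}

/* Rebuild block 'lost' by XOR-ing the parity with the survivors. */
static void recover(unsigned data[GROUP][BLOCK], unsigned parity[BLOCK], int lost)
{
    for (int w = 0; w < BLOCK; w++) {
        unsigned v = parity[w];
        for (int b = 0; b < GROUP; b++)
            if (b != lost) v ^= data[b][w];
        data[lost][w] = v;
    }
}

int main(void)
{
    unsigned data[GROUP][BLOCK], parity[BLOCK];
    for (int b = 0; b < GROUP; b++)
        for (int w = 0; w < BLOCK; w++)
            data[b][w] = (unsigned)(b * 100 + w);
    compute_parity(data, parity);
    memset(data[2], 0, sizeof data[2]);   /* simulate a failed processor */
    recover(data, parity, 2);
    printf("recovered data[2][3] = %u (expected 203)\n", data[2][3]);
    return 0;
}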
Citations: 14
Efficient stack distance computation for priority replacement policies
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016607
G. Bilardi, K. Ekanadham, P. Pattnaik
The concept of stack distance, applicable to the important class of inclusion replacement policies for the memory hierarchy, makes it possible to efficiently compute the number of misses incurred on a given address trace, for all cache sizes. The concept was introduced by Mattson, Gecsei, Slutz, and Traiger (Evaluation techniques for storage hierarchies, IBM Systems Journal, 9(2):78-117, 1970), together with a Linear-Scan algorithm, which takes time O(V) per access in the worst case, where V is the number of distinct (virtual) items referenced within the trace. While subsequent work has lowered the time bound to O(log V) per access in the special case of the Least Recently Used policy, no improvements have been obtained for the general case. This work introduces a class of inclusion policies called policies with nearly static priorities, which encompasses several of the policies considered in the literature. The Min-Tree algorithm is proposed for these policies. The performance of the Min-Tree algorithm is very sensitive to the replacement policy as well as to the address trace. Under suitable probabilistic assumptions, the expected time per access is O(log^2 V). Experimental evidence collected on a mix of benchmarks shows that the Min-Tree algorithm is significantly faster than Linear-Scan for interesting policies such as OPT (or Belady), Least Frequently Used (LFU), and Most Recently Used (MRU). As a further advantage, Min-Tree can be parallelized to run in time O(log V) using O(V/log V) processors, in the worst case. A more sophisticated Lazy Min-Tree algorithm is also developed, which achieves O(√(log V)) worst-case time per access. This bound applies, in particular, to the policies OPT, LFU, and Least Recently/Frequently Used (LRFU), for which the best previously known bound was O(V).
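For reference, the Linear-Scan baseline is short enough to state in full for the LRU special case: the stack is kept in recency order, the depth at which a referenced item is found is its stack distance, and the item is then moved to the front. Since a reference with stack distance d hits in every LRU cache of size greater than d, one pass over the trace yields miss counts for all cache sizes at once. A minimal C sketch (the trace and the MAXV bound are illustrative):

/* Mattson-style Linear-Scan stack distance computation for LRU. */
#include <stdio.h>

#define MAXV 1024

static int stack[MAXV];   /* stack[0] is the most recently used item */
static int depth = 0;

/* Returns the 0-based stack distance of item v, or -1 on a cold miss. */
static int access_item(int v)
{
    int pos = -1;
    for (int i = 0; i < depth; i++)        /* O(V) linear scan */
        if (stack[i] == v) { pos = i; break; }
    int top = (pos < 0) ? depth++ : pos;   /* cold miss: stack grows */
    for (int i = top; i > 0; i--)          /* move-to-front update */
        stack[i] = stack[i - 1];
    stack[0] = v;
    return pos;
}

int main(void)
{
    int trace[] = {1, 2, 3, 1, 2, 3, 4, 1};
    int n = (int)(sizeof trace / sizeof trace[0]);
    for (int i = 0; i < n; i++)
        printf("ref %d -> stack distance %d\n", trace[i], access_item(trace[i]));
    return 0;
}

The Min-Tree and Lazy Min-Tree algorithms replace this linear scan with priority-based tree machinery to reach the stated sub-linear bounds for nearly-static-priority policies.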
Citations: 8
Pruning hardware evaluation space via correlation-driven application similarity analysis
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016610
Rosario Cammarota, A. Kejariwal, P. D'Alberto, Sapan Panigrahi, A. Veidenbaum, A. Nicolau
System evaluation is routinely performed in industry to select one amongst a set of different systems to improve the performance of proprietary applications. However, a wide range of system configurations becomes available on the market every year. This makes an exhaustive system evaluation progressively challenging and expensive. In this paper we propose a novel similarity-based methodology for system selection. Our methodology prunes the set of candidate systems by eliminating those systems that are likely to reduce the performance of a given proprietary application. The pruning process relies on applications that are similar to a given application of interest and whose performance on the candidate systems is known. This obviates the need to install and run the given application on each and every candidate system. The concept of similarity we introduce is performance centric. For a given application, we compute Pearson's correlation between different types of resource stall and cycles per instruction. We refer to the vector of Pearson's correlation coefficients as an application signature. Next, we assess the similarity between two applications as Spearman's correlation between their respective signatures. We use the former type of correlation to quantify the association between pipeline stalls and cycles per instruction, whereas we use the latter to quantify the association of two signatures, and hence to assess similarity, based on the difference in the rank ordering of their components. We evaluate the proposed methodology on three different micro-architectures, viz., Intel's Harpertown, Nehalem and Westmere, using the industry-standard SPEC CINT2006. We assess performance centric similarity among the applications in SPEC CINT2006. We show how our methodology clusters applications with common performance issues. Finally, we show how to use the notion of similarity among applications to compare the three architectures with respect to a given Yahoo! property.
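Both correlation measures are standard, so a small sketch can make the signature construction concrete: Pearson's correlation between one stall-event series and the CPI series yields one signature component, and Spearman's correlation (Pearson computed on rank vectors) scores the similarity of two signatures. The C sketch below uses invented sample data, and the rank computation omits tie handling for brevity.

/* Pearson and Spearman correlation, as used for the signatures above. */
#include <math.h>
#include <stdio.h>

static double pearson(const double *x, const double *y, int n)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += x[i];        sy  += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n;
    double vy  = syy - sy * sy / n;
    return cov / sqrt(vx * vy);
}

/* Rank of each element (1 = smallest); no tie handling for brevity. */
static void ranks(const double *x, double *r, int n)
{
    for (int i = 0; i < n; i++) {
        int lower = 0;
        for (int j = 0; j < n; j++)
            if (x[j] < x[i]) lower++;
        r[i] = lower + 1;
    }
}

/* Spearman = Pearson computed on the rank vectors. */
static double spearman(const double *x, const double *y, int n)
{
    double rx[64], ry[64];
    ranks(x, rx, n);
    ranks(y, ry, n);
    return pearson(rx, ry, n);
}

int main(void)
{
    /* Invented per-interval samples: one stall type versus CPI. */
    double stalls[] = {0.10, 0.40, 0.35, 0.80, 0.60};
    double cpi[]    = {1.10, 1.60, 1.40, 2.20, 1.90};
    printf("Pearson(stalls, CPI) = %.3f\n", pearson(stalls, cpi, 5));

    /* Two toy application signatures compared by rank order. */
    double sigA[] = {0.9, 0.2, 0.5, 0.7};
    double sigB[] = {0.8, 0.1, 0.6, 0.4};
    printf("Spearman(sigA, sigB) = %.3f\n", spearman(sigA, sigB, 4));
    return 0;
}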
Citations: 10
BarrierWatch: characterizing multithreaded workloads across and within program-defined epochs
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016611
Socrates Demetriades, Sangyeun Cho
Characterizing the dynamic behavior of a program is essential for optimizing the program on a given system. Once the program's repetitive execution phases (and their boundaries) have been correctly identified, various phase-aware optimizations can be applied. Multithreaded workloads exhibit dynamic behavior that is further affected by the sharing of data and platform resources. As computer systems and workloads become denser and more parallel, this effect will intensify the dynamic behavior of the executed workload. In this work, we introduce a new relaxed concept for a parallel program phase, called an epoch. Epochs are defined as the time intervals between the global synchronization points that programmers insert into their program code for correct parallel execution. We characterize the behavior of multithreaded workloads across and within epochs and show that epochs have consistent and repetitive behaviors while their boundaries naturally indicate a shift in program behavior. We show that epoch changes can be easily captured at run time without complex monitoring and decision mechanisms, and we employ simple run-time techniques to enable epoch-based adaptation. To highlight the efficacy of our approach, we present a case study of an epoch-based adaptive chip multiprocessor (CMP) architecture. We conclude that our approach provides an attractive new framework for lightweight phase-based resource management for future CMPs.
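Because epochs are delimited by the barriers the programmer already inserted, capturing epoch boundaries at run time amounts to instrumenting the barrier call. The C sketch below shows one plausible way to do this with POSIX barriers; the read_counters() and adapt() hooks are assumptions added for illustration and are not the paper's interface.

/* Mark epoch boundaries at global synchronization points. */
#include <pthread.h>
#include <stdio.h>

static pthread_barrier_t barrier;
static int epoch = 0;

/* Illustrative hooks: sample hardware counters, then reconfigure. */
static void read_counters(int e) { printf("epoch %d ended\n", e); }
static void adapt(int e)         { (void)e; /* resize shared resources */ }

/* Drop-in replacement for pthread_barrier_wait(): exactly one thread
 * receives PTHREAD_BARRIER_SERIAL_THREAD and marks the boundary. */
static int epoch_barrier_wait(void)
{
    int rc = pthread_barrier_wait(&barrier);
    if (rc == PTHREAD_BARRIER_SERIAL_THREAD) {
        read_counters(epoch);
        adapt(epoch);
        epoch++;
    }
    return rc;
}

static void *worker(void *arg)
{
    (void)arg;
    for (int iter = 0; iter < 3; iter++) {
        /* ... parallel phase of the computation ... */
        epoch_barrier_wait();   /* global sync = epoch boundary */
    }
    return NULL;
}

int main(void)
{
    enum { N = 4 };
    pthread_t t[N];
    pthread_barrier_init(&barrier, NULL, N);
    for (int i = 0; i < N; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < N; i++) pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}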
Citations: 7
SoftHV: a HW/SW co-designed processor with horizontal and vertical fusion
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016606
Abhishek Deb, J. M. Codina, Antonio González
In this paper we propose SoftHV, a high-performance HW/SW co-designed in-order processor that performs horizontal and vertical fusion of instructions. SoftHV consists of a co-designed virtual machine (Cd-VM) which reorders, removes and fuses instructions from frequently executed regions of code. On the hardware front, SoftHV implements features for efficient execution of the Cd-VM and of the fused instructions. In particular, (1) Interlock Collapsing ALUs (ICALUs) are included to execute pairs of dependent simple arithmetic operations in a single cycle, and (2) Vector Load Units (VLDUs) are added to execute parallel loads. The key novelty of SoftHV resides in the efficient use of hardware through a Cd-VM in order to provide high performance while drastically cutting down processor complexity. The co-designed processor provides efficient mechanisms to exploit ILP and to reduce the latency of certain code sequences. Results presented in this paper show that SoftHV produces average performance improvements of 85% in SPECFP and 52% in SPECINT, and up to 2.35x, over a conventional four-way in-order processor. For a two-way in-order processor configuration, SoftHV obtains performance improvements of 72% and 47% for SPECFP and SPECINT, respectively. Overall, we show that such a co-designed processor based on an in-order core provides a compelling alternative to out-of-order processors for the low-end domain, where high performance at low complexity is a key feature.
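A toy model can show what vertical fusion with an ICALU buys: scanning an instruction trace and collapsing each adjacent dependent pair of simple adds into one single-cycle three-input operation. The instruction format and fusion rule below are illustrative assumptions, not the Cd-VM's internal representation.

/* Count cycles with and without fusing dependent add pairs. */
#include <stdio.h>

typedef struct { int dst, src1, src2; } op_t;   /* r[dst] = r[src1] + r[src2] */

int main(void)
{
    op_t trace[] = {
        {3, 1, 2},   /* r3 = r1 + r2                         */
        {4, 3, 5},   /* r4 = r3 + r5: depends on previous op */
        {6, 7, 8},   /* r6 = r7 + r8: independent            */
        {9, 6, 1},   /* r9 = r6 + r1: depends on previous op */
    };
    int n = (int)(sizeof trace / sizeof trace[0]);
    int cycles = 0;

    for (int i = 0; i < n; i++, cycles++) {
        /* ICALU: collapse a dependent successor into this cycle,
         * turning the pair into a single three-input add. */
        if (i + 1 < n &&
            (trace[i + 1].src1 == trace[i].dst ||
             trace[i + 1].src2 == trace[i].dst)) {
            printf("fuse r%d=r%d+r%d with r%d=r%d+r%d\n",
                   trace[i].dst, trace[i].src1, trace[i].src2,
                   trace[i + 1].dst, trace[i + 1].src1, trace[i + 1].src2);
            i++;   /* the fused pair retires together */
        }
    }
    printf("%d ops in %d cycles (vs %d cycles unfused)\n", n, cycles, n);
    return 0;
}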
Citations: 11
Quantitative analysis of parallelism and data movement properties across the Berkeley computational motifs
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016625
V. Cabezas, Phillip Stanley-Marbell
This work presents the first thorough quantitative study of the available instruction-level parallelism, basic-block-granularity thread parallelism, and data movement across the Berkeley dwarfs/computational motifs. Although this classification was intended to group applications with common computation and (albeit coarse-grained) communication patterns, the applications analyzed exhibit a wide range of available machine-extractable parallelism and data motion within and across dwarfs.
Citations: 4
Increasing power/performance resource efficiency on virtualized enterprise servers
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016615
Emmanuel Arzuaga, D. Kaeli
In this work, we analyze the impact that live VM migration has on virtualized data centers in terms of performance and power consumption. We present a metric that captures system efficiency in terms of VM resource usage. This metric is used to build a resource efficiency manager (REM) framework that issues live VM migrations to enhance the efficiency of system resources. We compare our framework to other commercially available solutions and show that we can improve performance by up to 9% while providing a better overall power/performance solution.
Citations: 0
Elastic pipeline: addressing GPU on-chip shared memory bank conflicts
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016608
C. Gou, G. Gaydadjiev
One of the major problems with the GPU on-chip shared memory is bank conflicts. We observed that the throughput of the GPU processor core is often constrained neither by the shared memory bandwidth nor by the shared memory latency (as long as it stays constant), but rather by the variable latencies caused by memory bank conflicts. These variable latencies result in conflicts at the writeback stage of the in-order pipeline and in pipeline stalls, thus degrading system throughput. Based on this observation, we investigate and propose a novel elastic pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput by decoupling bank conflicts from pipeline stalls. Simulation results show that our proposed elastic pipeline, together with the co-designed bank-conflict-aware warp scheduling, reduces pipeline stalls by up to 64.0% (42.3% on average) and improves overall performance by up to 20.7% (13.3% on average) for our benchmark applications, at trivial hardware overhead.
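For background, the sketch below models the bank-conflict mechanism the paper targets (not the elastic pipeline itself): with 16 four-byte banks, as on pre-Fermi NVIDIA GPUs, a half-warp reading a column of a 16x16 shared-memory array maps every thread to the same bank and is serialized 16-fold, while padding the array to 16x17 spreads the accesses across all banks. The assumed bank count and interleaving are stated in the code.

/* Conflict degree = max threads mapped to one bank, i.e. the factor
 * by which a shared-memory access is serialized. */
#include <stdio.h>

#define BANKS 16   /* assumption: 16 banks with 4-byte interleaving */

static int conflict_degree(const unsigned *byte_addr, int nthreads)
{
    int count[BANKS] = {0}, worst = 0;
    for (int t = 0; t < nthreads; t++) {
        int bank = (byte_addr[t] / 4) % BANKS;
        if (++count[bank] > worst) worst = count[bank];
    }
    return worst;
}

int main(void)
{
    unsigned a[16], b[16];
    int col = 3;
    for (int t = 0; t < 16; t++) {
        a[t] = (t * 16 + col) * 4;   /* 16x16 array: column stride of 16 words */
        b[t] = (t * 17 + col) * 4;   /* 16x17 padded array: stride of 17 words */
    }
    printf("unpadded column access: %2d-way conflict\n", conflict_degree(a, 16));
    printf("padded column access:   %2d-way conflict\n", conflict_degree(b, 16));
    return 0;
}

Padding removes such conflicts statically; the elastic pipeline instead tolerates the variable latencies dynamically, so that bank conflicts no longer translate into writeback conflicts and pipeline stalls.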
Citations: 20