
Latest publications from the 2011 23rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

High Performance by Exploiting Information Locality through Reverse Computing
Mouad Bahi, C. Eisenbeis
In this paper we present performance results for our register rematerialization technique based on reverse recomputing. Rematerialization adds instructions, and we show on one specifically designed example that reverse computing alleviates the impact of these additional instructions on performance. We also show how thread parallelism may be optimized on GPUs by performing register allocation with reverse recomputing, which increases the number of threads per Streaming Multiprocessor (SM). This is done on the main kernel of the Lattice Quantum Chromo Dynamics (LQCD) simulation program, where we gain a 10.84% speedup.
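To make the rematerialization-by-reverse-computing idea concrete, here is a deliberately tiny Python sketch (our own illustration, not the authors' compiler pass): instead of spilling a value to memory and reloading it later, a reversible producer instruction such as an addition is inverted so the value can be recomputed from operands that are still live in registers. The names forward and rematerialize_a are hypothetical.

```python
# Illustrative toy only: reverse recomputation as an alternative to spilling.

def forward(a, b):
    """Producer: c = a + b. Afterwards, suppose 'a' must give up its
    register because register pressure is high."""
    return a + b

def rematerialize_a(c, b):
    """Reverse computation: recover 'a' from values that are still live
    (c and b), so no store/load to a memory spill slot is needed."""
    return c - b

a, b = 7, 5
c = forward(a, b)
# ... 'a' is released to free a register for other live values ...
a_again = rematerialize_a(c, b)   # 'a' recomputed by inverting the add
assert a_again == a
```

The trade-off being measured is of this kind: extra recompute instructions versus avoided spill traffic and, on GPUs, registers freed per thread.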
DOI: 10.1109/SBAC-PAD.2011.10 (published 2011-10-26)
Citations: 1
Accelerating Maximum Likelihood Based Phylogenetic Kernels Using Network-on-Chip
Turbo Majumder, P. Pande, A. Kalyanaraman
Probability-based approaches for phylogenetic inference, like Maximum Likelihood (ML) and Bayesian Inference, provide the most accurate estimate of evolutionary relationships among species. But they come at a high algorithmic and computational cost. Network-on-chip (NoC), being an emerging paradigm, has not been explored yet to achieve fine-grained parallelism for these applications. In this paper, we present the design and performance evaluation of an NoC architecture for RAxML, which is one of the most widely used ML software suites. Specifically, we implement the top three function kernels that account for more than 85% of the total run-time. Simulations show that through novel core design, allocation and placement strategies our NoC-based implementation can achieve function-level speedups of 388x to 786x and system-level speedups in excess of 5000x over state-of-the-art multithreaded software.
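For context, the dominant kernels in maximum-likelihood phylogenetics codes such as RAxML evaluate per-site conditional likelihood vectors over the tree (Felsenstein's pruning algorithm). The NumPy sketch below shows that arithmetic for a single internal node under a simple Jukes-Cantor model; it is a stand-in written under our own assumptions, not the paper's kernels or their NoC mapping.

```python
import numpy as np

def jc69_transition(t, mu=1.0):
    """Jukes-Cantor transition matrix P(t) over the 4 DNA states:
    probability of ending in state j after time t given state i."""
    stay = 0.25 + 0.75 * np.exp(-4.0 * mu * t / 3.0)
    change = 0.25 - 0.25 * np.exp(-4.0 * mu * t / 3.0)
    P = np.full((4, 4), change)
    np.fill_diagonal(P, stay)
    return P

def combine_children(L_left, L_right, t_left, t_right):
    """Conditional likelihoods of an internal node from its two children:
    L[site, i] = (sum_j P_l[i, j] * L_left[site, j]) *
                 (sum_k P_r[i, k] * L_right[site, k])"""
    P_l = jc69_transition(t_left)
    P_r = jc69_transition(t_right)
    return (L_left @ P_l.T) * (L_right @ P_r.T)

# One alignment site: the two leaf children are observed as A and C.
leaf_A = np.array([[1.0, 0.0, 0.0, 0.0]])
leaf_C = np.array([[0.0, 1.0, 0.0, 0.0]])
print(combine_children(leaf_A, leaf_C, 0.1, 0.2))
```

Because alignment sites are independent in this computation, the loop over sites (the first array dimension here) is the kind of fine-grained parallelism that many processing elements can share.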
DOI: 10.1109/SBAC-PAD.2011.17 (published 2011-10-26)
Citations: 6
Parallel Biological Sequence Comparison on Heterogeneous High Performance Computing Platforms with BSP++
Khaled Hamidouche, F. Mendonca, J. Falcou, D. Etiemble
Biological sequence comparison is an important operation in bioinformatics that is often used to relate organisms. Smith and Waterman proposed an exact algorithm (SW) that compares two sequences in quadratic time and space. Due to its high computing and memory requirements, SW is usually executed on HPC platforms such as multicore clusters and CellBEs. Since HPC architectures exhibit very different hardware characteristics, porting an application between them is an error-prone, time-consuming task. BSP++ is an implementation of BSP that aims to reduce the effort needed to write parallel code. In this paper, we propose and evaluate a parallel BSP++ strategy to execute SW on multiple platforms (MPI, OpenMP, MPI/OpenMP, CellBE and MPI/CellBE). The results obtained with real DNA sequences show that the performance of our versions is comparable to the results reported in the literature, evidencing the appropriateness and flexibility of our approach.
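For reference, the quadratic-time SW recurrence being parallelized can be written as the short textbook Python sketch below; the linear gap penalty and the scoring values are placeholders of ours, not the paper's settings.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Textbook Smith-Waterman: best local alignment score of a and b,
    computed in O(len(a) * len(b)) time and space with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap
            left = H[i][j - 1] + gap
            H[i][j] = max(0, diag, up, left)      # local alignment: never below 0
            best = max(best, H[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCU"))
```

The wavefront dependence (each cell needs its left, upper and upper-left neighbours) is what typically leads to anti-diagonal parallelization, which BSP-style supersteps can express naturally.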
DOI: 10.1109/SBAC-PAD.2011.16 (published 2011-10-26)
Citations: 3
FAIRIO: An Algorithm for Differentiated I/O Performance
Sarala Arunagiri, Yipkei Kwok, P. Teller, Ricardo Portillo, Seetharami R. Seelam
Providing differentiated service in a consolidated storage environment is a challenging task. To address this problem, we introduce FAIRIO, a cycle-based I/O scheduling algorithm that provides differentiated service to workloads concurrently accessing a consolidated RAID storage system. FAIRIO enforces proportional sharing of I/O service through fair scheduling of disk time. During each cycle of the algorithm, I/O requests are scheduled according to workload weights and disk-time utilization history. Experiments, which were driven by the I/O request streams of real and synthetic I/O benchmarks and run on a modified version of DiskSim, provide evidence of FAIRIO's effectiveness and demonstrate that fair scheduling of disk time is key to achieving differentiated service. In particular, the experimental results show that, for a broad range of workload request types, sizes, and access characteristics, the algorithm provides differentiated storage throughput that is within 10% of being perfectly proportional to workload weights, and it achieves this with little or no degradation of aggregate throughput. The core design concepts of FAIRIO, including service-time allocation and history-driven compensation, can potentially be used to design I/O scheduling algorithms that provide workloads with differentiated service in storage systems comprised of RAIDs, multiple RAIDs, SANs, and hypervisors for Clouds.
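A toy model of cycle-based, weight-proportional disk-time planning with history-driven compensation might look like the sketch below; the variable names and the simple deficit-credit rule are our assumptions for illustration, not FAIRIO's actual bookkeeping.

```python
def plan_cycle(weights, used_last_cycle, cycle_time=100.0):
    """Toy cycle planner: each workload's target is its weight-proportional
    share of disk time; a workload that fell short of its share in the last
    cycle gets that deficit credited back, and the plan is rescaled so the
    allocations still sum to one cycle."""
    total_w = sum(weights.values())
    fair = {w: cycle_time * weights[w] / total_w for w in weights}
    credit = {w: max(0.0, fair[w] - used_last_cycle.get(w, fair[w]))
              for w in weights}
    raw = {w: fair[w] + credit[w] for w in weights}
    scale = cycle_time / sum(raw.values())
    return {w: raw[w] * scale for w in weights}

weights = {"db": 4, "backup": 1}            # 4:1 differentiated service
used = {"db": 60.0, "backup": 20.0}         # db got less than its 80-unit share
print(plan_cycle(weights, used))            # db is compensated this cycle
```

One reason to schedule disk time rather than request counts is that requests of different sizes and access characteristics consume very different amounts of disk service.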
DOI: 10.1109/SBAC-PAD.2011.26 (published 2011-10-26)
Citations: 7
Speeding Up Learning in Real-Time Search through Parallel Computing
Vinícius Marques, L. Chaimowicz, R. Ferreira
Real-time search algorithms solve the path-planning problem regardless of the size and complexity of the maps and the massive presence of entities in the same environment. In such methods, the learning step aims to avoid local minima and improve the results of future searches, ensuring convergence to the optimal path when the same planning task is solved repeatedly. However, performing search in a limited area due to real-time constraints makes the run to convergence a lengthy process. In this work, we present a parallelization strategy that aims to reduce the time to convergence while maintaining the real-time properties of the search. The parallelization technique consists of using auxiliary searches without the real-time restrictions present in the main search. In addition, the same learning is shared by all searches. The empirical evaluation shows that even with the additional cost required to coordinate the auxiliary searches, the reduction in time to convergence is significant, with gains ranging from searches in environments with few local minima to larger searches on complex maps, where the performance improvement is even better.
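The learning step referred to above is the heuristic-update rule used by LRTA*-style real-time search; the minimal single-agent grid sketch below shows it (the grid, costs, and step bound are our assumptions). In the paper's scheme, auxiliary searches free of the lookahead restriction would write into the same learned table (h here), but that coordination is omitted.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def neighbors(s, blocked, size=5):
    x, y = s
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx < size and 0 <= ny < size and (nx, ny) not in blocked:
            yield (nx, ny)

def lrta_star(start, goal, blocked, h, max_steps=1000):
    """LRTA*-style agent: look only one move ahead (the real-time constraint),
    raise h(current) toward the best neighbor estimate (the learning step),
    then move greedily. Repeated trials sharing h converge to the optimal path."""
    s = start
    for _ in range(max_steps):
        if s == goal:
            return True
        scored = [(1 + h.get(n, manhattan(n, goal)), n)
                  for n in neighbors(s, blocked)]
        best_f, best_n = min(scored)
        h[s] = max(h.get(s, manhattan(s, goal)), best_f)   # learning update
        s = best_n
    return False

h = {}                                  # learned heuristic, shared by searches
blocked = {(1, 1), (1, 2), (1, 3)}      # a small wall on a 5x5 grid
print(lrta_star((0, 0), (4, 4), blocked, h))
print(len(h), "states had their heuristic updated")
```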
DOI: 10.1109/SBAC-PAD.2011.30 (published 2011-10-26)
Citations: 1
Why Online Dynamic Mesh Refinement is Better for Parallel Climatological Models
C. Schepke, N. Maillard, Jörg Schneider, Hans-Ulrich Heiß
The forecast precision of climatological models is limited by the computing power and time available for their execution. As more and faster processors are used in the computation, the resolution of the mesh adopted to represent the Earth's atmosphere can be increased, and consequently the numerical forecast is more accurate and captures local phenomena. However, a mesh resolution fine enough to include local phenomena in a global atmosphere integration is still not feasible. To overcome this situation, different mesh refinement levels can be used at the same time for different areas. In this context, this paper evaluates how mesh refinement at run time can improve performance for climatological models. To support this analysis, an online dynamic mesh refinement was developed. It increases mesh resolution in parts of a parallel distributed model when special atmospheric conditions are detected during the execution. The results show that the parallel execution of this improvement provides better resolution for the meshes without a significant increase in execution time.
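As a rough illustration of refining only where the atmosphere demands it while the run is in progress, consider the toy 1-D sketch below; the cell representation, the refinement criterion, and the "storm" indicator are invented for illustration and are not the paper's climatological model or its process distribution.

```python
def refine_online(cells, indicator, threshold, max_level=3):
    """Toy online mesh refinement: split any cell whose activity indicator
    exceeds the threshold (up to max_level), leaving the rest of the mesh
    at its current resolution. Cells are (lo, hi, level) intervals."""
    refined = []
    for lo, hi, level in cells:
        if level < max_level and indicator(lo, hi) > threshold:
            mid = 0.5 * (lo + hi)
            refined += [(lo, mid, level + 1), (mid, hi, level + 1)]
        else:
            refined.append((lo, hi, level))
    return refined

# A localized event near x = 0.3 triggers refinement only in that region.
storm = lambda lo, hi: 1.0 if lo <= 0.3 <= hi else 0.1
mesh = [(i / 4, (i + 1) / 4, 0) for i in range(4)]
for _ in range(3):                # re-evaluate the criterion each time step
    mesh = refine_online(mesh, storm, 0.5)
print(mesh)                       # fine cells around 0.3, coarse elsewhere
```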
DOI: 10.1109/SBAC-PAD.2011.14 (published 2011-10-26)
Citations: 2
Improving the Accuracy of High Performance BLAS Implementations Using Adaptive Blocked Algorithms
M. Badin, P. D'Alberto, L. Bic, M. Dillencourt, A. Nicolau
Matrix multiply is ubiquitous in scientific computing. Considerable effort has been spent on improving its performance. Once methods that make efficient use of the processor have been exhausted, methods that use fewer operations than the canonical matrix multiply must be explored. Combining the two methods yields a hybrid matrix multiply algorithm. Hybrid matrix multiply algorithms tend to be less accurate than the canonical matrix multiply implementation, leaving room for improvement. There are well-known techniques for improving accuracy, but they tend to be slow, and it is not immediately obvious how best to apply them to hybrid algorithms without lowering performance. Previous attempts have focused on the bottom of the hybrid matrix multiply algorithm, modifying the high-performance matrix multiply implementation. In contrast, the top-down approach presented here requires modification of neither the high-performance matrix multiply implementation at the bottom nor the fast asymptotic matrix multiply algorithm at the top. The three-level hybrid algorithm presented here not only has up to 10% better performance than the fastest high-performance matrix multiply, but is also more accurate.
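To fix ideas, a hybrid multiply places a fast asymptotic algorithm on top of a high-performance multiply at the bottom. The sketch below uses the classic Strassen recursion with NumPy's BLAS-backed product as the base case, assuming square power-of-two matrices; it illustrates only the hybrid structure, not the authors' adaptive blocking or their accuracy treatment.

```python
import numpy as np

def hybrid_strassen(A, B, cutoff=64):
    """Hybrid multiply: Strassen recursion at the top (7 sub-multiplies per
    level instead of 8), handing off to the tuned base-case multiply (np.dot
    via '@') once blocks are at or below the cutoff."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = hybrid_strassen(A11 + A22, B11 + B22, cutoff)
    M2 = hybrid_strassen(A21 + A22, B11, cutoff)
    M3 = hybrid_strassen(A11, B12 - B22, cutoff)
    M4 = hybrid_strassen(A22, B21 - B11, cutoff)
    M5 = hybrid_strassen(A11 + A12, B22, cutoff)
    M6 = hybrid_strassen(A21 - A11, B11 + B12, cutoff)
    M7 = hybrid_strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
print(np.max(np.abs(hybrid_strassen(A, B) - A @ B)))  # error vs. reference
```

The additional additions and subtractions in the fast top levels are a typical source of the accuracy loss such hybrids exhibit, which is why the paper intervenes at the top of the hierarchy rather than inside the base-case multiply.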
DOI: 10.1109/SBAC-PAD.2011.21 (published 2011-10-26)
Citations: 4
A New Parallel Schema for Branch-and-Bound Algorithms Using GPGPU
T. Carneiro, A. Muritiba, Marcos Negreiros, G. Campos
This work presents a new parallel procedure designed to process combinatorial B&B algorithms using GPGPU. In our schema we dispatch a number of threads that intelligently use the massively parallel processors of NVIDIA GeForce graphical units. The strategy is to sequentially build a series of initial searches that map a subspace of the B&B tree by starting a limited number of threads after reaching a specific level of the tree. The search is then processed massively by DFS. The whole subspace is optimized according to the memory and the limits on threads and blocks available on the GPU. We compare our results with OpenMP and serial versions of the same search schema, using explicit enumeration (all possible solutions) of instances of the Asymmetric Travelling Salesman Problem. We also show the great superiority of our GPGPU-based method.
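The structure being mapped to the GPU is depth-first branch-and-bound over partial tours. A compact serial Python sketch of that structure is shown below; the cost matrix and the deliberately weak bound (prune as soon as the partial cost reaches the best complete tour found so far) are our illustrative choices, not the paper's GPU kernels or its bounding function.

```python
import math

def atsp_branch_and_bound(dist):
    """Serial DFS branch-and-bound for the (asymmetric) TSP: extend partial
    tours city by city, pruning any branch whose accumulated cost already
    matches or exceeds the incumbent complete tour."""
    n = len(dist)
    best = {"cost": math.inf, "tour": None}

    def dfs(tour, cost, visited):
        if cost >= best["cost"]:
            return                                   # bound: prune this subtree
        if len(tour) == n:
            total = cost + dist[tour[-1]][tour[0]]   # close the tour
            if total < best["cost"]:
                best["cost"], best["tour"] = total, tour[:]
            return
        for city in range(n):
            if city not in visited:
                visited.add(city)
                tour.append(city)
                dfs(tour, cost + dist[tour[-2]][city], visited)
                tour.pop()
                visited.remove(city)

    dfs([0], 0.0, {0})
    return best["cost"], best["tour"]

dist = [[0, 10, 15, 20],
        [5,  0,  9, 10],
        [6, 13,  0, 12],
        [8,  8,  9,  0]]
print(atsp_branch_and_bound(dist))
```

In the paper's schema, the tree is first expanded sequentially down to a chosen level and the resulting subtrees are then handed to GPU threads, each of which runs a DFS like the inner function above.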
DOI: 10.1109/SBAC-PAD.2011.20 (published 2011-10-26)
Citations: 34
Structure-Constrained Microcode Compression
E. Borin, G. Araújo, M. Breternitz, Youfeng Wu
Microcode enables programmability of (micro) architectural structures to enhance functionality and to apply patches to an existing design. As more features get added to a CPU core, the area and power costs associated with microcode increase. One solution to address the microcode size issue is to store the microcode in a compressed form and decompress it during execution. Furthermore, the reuse of a single hardware building block layout to implement different dictionaries in the two-level microcode compression reduces the cost and the design time of the decompression engine. However, the reuse of the hardware building block imposes structural constraints to the compression algorithm, and existing algorithms may yield poor compression. In this paper, we develop the SC2 algorithm that considers the structural constraint in its objective function and reduces the area expansion when reusing hardware building blocks to implement different dictionaries. Our experimental results show that the SC2 algorithm is able to produce similar sized dictionaries and achieves the similar compression ratio to the non-constrained algorithm.
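A two-level dictionary organization of the kind being constrained can be illustrated with the toy sketch below: each microword is split into two fields, each field is looked up in its own dictionary of unique patterns, and the control store keeps only index pairs. The 32-bit words and the 16/16 split are assumptions for illustration; the SC2 objective function itself is not reproduced here.

```python
def two_level_compress(words, split=16):
    """Toy two-level microcode compression: split each microword into a high
    and a low field, keep one dictionary of unique patterns per field, and
    encode each word as a pair of dictionary indices."""
    mask = (1 << split) - 1
    high_dict, low_dict, encoded = [], [], []
    for w in words:
        hi, lo = w >> split, w & mask
        if hi not in high_dict:
            high_dict.append(hi)
        if lo not in low_dict:
            low_dict.append(lo)
        encoded.append((high_dict.index(hi), low_dict.index(lo)))
    return high_dict, low_dict, encoded

def decompress(high_dict, low_dict, encoded, split=16):
    return [(high_dict[i] << split) | low_dict[j] for i, j in encoded]

# 32-bit microwords with heavy repetition in both halves.
rom = [0xDEAD0001, 0xDEAD0002, 0xBEEF0001, 0xDEAD0001]
hd, ld, enc = two_level_compress(rom)
assert decompress(hd, ld, enc) == rom
print(len(hd), len(ld), enc)          # dictionary sizes and the index stream
```

The structural constraint the paper addresses arises when the hardware block implementing one dictionary (high_dict here) must be reused, unchanged in layout, to implement the other, which limits how freely the two dictionaries can be sized.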
DOI: 10.1109/SBAC-PAD.2011.32 (published 2011-10-26)
Citations: 1
Classification and Elimination of Conflicts in Hardware Transactional Memory Systems
M. Waliullah, P. Stenström
This paper analyzes the sources of performance losses in hardware transactional memory and investigates techniques to reduce the losses. It dissects the root causes of data conflicts in hardware transactional memory (HTM) systems into four classes of conflicts: true sharing, false sharing, silent store, and write-write conflicts. These conflicts can cause performance and energy losses due to aborts and extra communication. To quantify the losses, the paper first proposes the 5C cache-miss classification model, which extends the well-established 4C model with a new class of cache misses known as contamination misses. The paper also contributes two techniques for the removal of data conflicts: one for the removal of false sharing conflicts and another for the removal of silent store conflicts. In addition, it revisits and adapts a technique that is able to reduce losses due to both true and false conflicts. All of the proposed techniques can be accommodated in a lazy-versioning, lazy-conflict-resolution HTM built on top of a MESI cache-coherence infrastructure with quite modest extensions. Their ability to reduce the losses is quantitatively established, individually as well as in combination. Performance is improved substantially.
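The four conflict classes can be made concrete with a small classifier over two transactions' access sets, sketched below; the 64-byte line size, word-granularity addresses, the value comparison used to detect silent stores, and the precedence among classes are our simplifying assumptions, not the paper's detection hardware.

```python
LINE = 64                      # assumed cache-line size in bytes

def line_of(addr):
    return addr // LINE

def classify_conflicts(t1_writes, t2_reads, t2_writes, memory):
    """Classify the conflicts that T1's writes raise against transaction T2.
    t1_writes: {addr: new_value}; t2_reads / t2_writes: sets of addresses;
    memory: {addr: current_value}, used to recognize silent stores."""
    t2_lines = {line_of(a) for a in t2_reads | t2_writes}
    report = []
    for addr, value in t1_writes.items():
        if addr in t2_writes:
            report.append((addr, "write-write"))
        elif addr in t2_reads:
            if memory.get(addr) == value:
                report.append((addr, "silent store"))   # same value rewritten
            else:
                report.append((addr, "true sharing"))
        elif line_of(addr) in t2_lines:
            report.append((addr, "false sharing"))      # same line, other word
    return report

memory = {0: 7, 8: 1, 128: 3}
print(classify_conflicts(t1_writes={0: 7, 8: 5, 72: 9},
                         t2_reads={0, 8, 64},
                         t2_writes={128},
                         memory=memory))
```

Following the abstract, the two removal techniques target the false-sharing and silent-store classes, while a further adapted technique addresses losses from both true and false conflicts.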
DOI: 10.1109/SBAC-PAD.2011.18 (published 2011-10-26)
Citations: 11