
Latest publications from the 2011 23rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

Applying CUDA Architecture to Accelerate Full Search Block Matching Algorithm for High Performance Motion Estimation in Video Encoding
Eduarda Monteiro, B. Vizzotto, C. Diniz, B. Zatt, S. Bampi
This work presents a parallel GPU-based solution for the Motion Estimation (ME) process in a video encoding system. We propose a way to partition the steps of the Full Search block matching algorithm on the CUDA architecture. We also compare the performance achieved by this solution against a theoretical model and against two other implementations (a sequential one, and a parallel one using the OpenMP library). We obtained an O(n^2/log^2 n) speed-up, which fits the proposed theoretical model for different search areas. This represents a gain of up to 600x over the serial implementation and 66x over the parallel OpenMP implementation.
DOI: 10.1109/SBAC-PAD.2011.19 | Published: 2011-10-26
Citations: 14
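The Full Search algorithm that the paper maps onto CUDA can be summarized with a serial sketch: for each candidate displacement inside the search window, compute the SAD (sum of absolute differences) against the reference frame and keep the minimum. Names below are illustrative; the paper's contribution is partitioning exactly this nested loop across CUDA threads.

```python
# Serial sketch of full-search block matching. For one block of the
# current frame, exhaustively test every displacement within a search
# window of the reference frame and keep the candidate with the lowest
# SAD. Frames are plain 2D lists of pixel intensities.

def full_search(cur, ref, bx, by, bsize, search):
    """Return ((dy, dx), sad) minimizing SAD for the block at (by, bx)."""
    h, w = len(ref), len(ref[0])
    best, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y0, x0 = by + dy, bx + dx
            if y0 < 0 or x0 < 0 or y0 + bsize > h or x0 + bsize > w:
                continue  # candidate block falls outside the frame
            sad = sum(
                abs(cur[by + i][bx + j] - ref[y0 + i][x0 + j])
                for i in range(bsize) for j in range(bsize)
            )
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad
```

Each (dy, dx) candidate is independent of the others, which is what makes the GPU mapping natural: one thread or thread block per candidate, followed by a parallel reduction over the SAD values.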
Workload Balancing Methodology for Data-Intensive Applications with Divisible Load
C. Rosas, A. Sikora, Josep Jorba, Eduardo César
Data-intensive applications are those that explore, query, analyze, and, in general, process very large data sets. In High Performance Computing (HPC), the main performance problem associated with these applications is generally load imbalance or inefficient resource utilization. This paper proposes a methodology for improving the performance of data-intensive applications based on performing multiple data partitions prior to execution and ordering the data chunks according to their processing times during the application's execution. As a first step, we consider that a single execution includes multiple related explorations of the same data set. Consequently, we propose to monitor the processing of each exploration and use the gathered data to dynamically tune the performance of the application. The tuning parameters included in the methodology are the partition factor of the data set, the distribution of the data chunks, and the number of processing nodes to be used by the application. The methodology has been initially tested with the well-known bioinformatics tool BLAST, obtaining encouraging results (up to a 40% improvement).
DOI: 10.1109/SBAC-PAD.2011.15 | Published: 2011-10-26
Citations: 10
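As an illustration of the divisible-load idea, here is a minimal sketch (function names and the greedy policy are assumptions, not the paper's methodology) that dispatches measured per-chunk processing times in longest-first order onto the least-loaded worker, which is why partitioning into more chunks than workers helps balance:

```python
# Greedy longest-processing-time-first dispatch: sort chunks by their
# (previously measured) processing times and always hand the next chunk
# to the currently least-loaded worker, tracked with a min-heap.
import heapq

def schedule_chunks(chunk_times, n_workers):
    """Return (per-worker loads, makespan) for an LPT assignment."""
    loads = [0.0] * n_workers
    heap = [(0.0, w) for w in range(n_workers)]  # (load, worker id)
    heapq.heapify(heap)
    for t in sorted(chunk_times, reverse=True):  # largest chunks first
        load, w = heapq.heappop(heap)
        loads[w] = load + t
        heapq.heappush(heap, (loads[w], w))
    return loads, max(loads)
```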
Distributed Skycube Computation with Anthill
R. R. Veloso, L. Cerf, Chedy Raïssi, Wagner Meira Jr
Recently, skyline queries have gained considerable attention and are among the most important tools for multi-criteria analysis. In order to process all possible combinations of criteria along with their inherent analysis, researchers introduced and studied the notion of a skycube. Simply put, a skycube is a pre-materialization of all possible subspaces with their associated skylines. An efficient skycube computation relies on detecting redundancies across the different processing steps and on enhanced result sharing between subspaces. Recently, the Orion algorithm was proposed to compute the skycube in a very efficient way; the approach relies on deriving skyline points across different subspaces. Nevertheless, because a skycube contains 2^|D| - 1 subspaces (where D is the set of dimensions), the running time still grows exponentially with the number of dimensions and easily becomes intractable on real-world datasets. In this study, we detail the distribution of Orion within a filter-stream framework, and we conduct an extensive set of experiments on large datasets collected from Twitter to demonstrate the efficiency of our method.
DOI: 10.1109/SBAC-PAD.2011.29 | Published: 2011-10-26
Citations: 3
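A skycube materializes the skyline of every non-empty subspace of D. A minimal, quadratic sketch of the skyline operator itself (assuming smaller is better on every dimension) helps fix the dominance definition the abstract relies on:

```python
# Skyline by block-nested-loop dominance test: a point survives iff no
# other point is at least as good on every dimension of the subspace and
# strictly better on at least one.

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points, dims):
    """Skyline of `points` projected onto the subspace given by `dims`."""
    proj = [tuple(p[d] for d in dims) for p in points]
    return [points[i] for i, pp in enumerate(proj)
            if not any(dominates(qq, pp) for j, qq in enumerate(proj) if j != i)]
```

Computing this for all 2^|D| - 1 subspaces is exactly the exponential blow-up that the paper distributes over Anthill's filter-stream framework.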
Data Parallelism for Belief Propagation in Factor Graphs
N. Ma, Yinglong Xia, V. Prasanna
We investigate data parallelism for belief propagation in cyclic factor graphs on multicore/many-core processors. Belief propagation is a key problem in exploring factor graphs, a probabilistic graphical model that has found applications in many domains. In this paper, we identify basic operations, called node-level primitives, for updating the distribution tables in a factor graph. We develop algorithms for these primitives to exploit data parallelism. We also propose a complete belief propagation algorithm to perform exact inference in such graphs. We implement the proposed algorithms on state-of-the-art multicore processors and show that they exhibit good scalability on a representative set of factor graphs. On a 32-core Intel Nehalem-EX based system, we achieve 30× speedup for the primitives and 29× for the complete algorithm using factor graphs with large distribution tables.
DOI: 10.1109/SBAC-PAD.2011.34 | Published: 2011-10-26
Citations: 4
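One way to picture a node-level primitive is the sum-product message a factor sends to one of its variables: a reduction over the factor's distribution table weighted by the incoming messages. The sketch below is a generic serial version with an assumed table layout, not the paper's parallel formulation; the data parallelism comes from splitting exactly this reduction over rows of the table.

```python
# Factor-to-variable sum-product message for binary variables. `table`
# maps full assignments (tuples of 0/1) to potentials; `messages[i]` is
# the incoming message for variable i (the target's own entry is
# ignored). The result is an unnormalized message over the target.
from itertools import product

def factor_to_var(table, messages, target):
    n = len(messages)
    out = [0.0, 0.0]
    for assign in product((0, 1), repeat=n):
        weight = table[assign]
        for i, msg in enumerate(messages):
            if i != target:
                weight *= msg[assign[i]]
        out[assign[target]] += weight
    return out
```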
Predictive and Distributed Routing Balancing on High-Speed Cluster Networks
Carlos Nunez Castillo, D. Lugones, Daniel Franco, E. Luque
In high performance clusters, the communication needs of current parallel applications (traffic pattern, communication volume, etc.) change over time and are difficult to know in advance. Such needs often exceed or do not match the available resources, causing resource-use imbalance, network congestion, throughput reduction, and increased message latency, thus degrading overall system performance. Studies of parallel applications show repetitive behavior that can be characterized by a set of representative phases. This work presents Predictive and Distributed Routing Balancing (PRDRB), a new technique developed to gradually control network congestion based on path expansion, traffic distribution, the repetitiveness of application patterns, and speculative adaptive routing, in order to maintain low latency values. PRDRB monitors message latencies on routers and logs the solutions applied to congestion, so as to respond quickly in similar future situations. Traffic congestion experiments were conducted to evaluate the performance of the method, and improvements were observed.
DOI: 10.1109/SBAC-PAD.2011.27 | Published: 2011-10-26
Citations: 2
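A toy sketch of the control loop described above, with illustrative names and thresholds (the real PRDRB mechanics live in the routers): expand the path set when observed latency crosses a threshold, and remember the solution keyed by the traffic pattern so a repeated phase gets its known-good configuration immediately.

```python
# Latency-triggered path expansion with a memory of past solutions.
# `alternatives` lists the candidate paths per destination; `memory`
# plays the role of the logged congestion solutions.

class PathBalancer:
    def __init__(self, alternatives, threshold):
        self.alternatives = alternatives        # dest -> candidate paths
        self.threshold = threshold
        self.active = {}                        # dest -> paths in use
        self.memory = {}                        # pattern -> learned paths

    def observe(self, pattern, dest, latency):
        if pattern in self.memory:              # repeated phase: reuse
            self.active[dest] = self.memory[pattern]
            return self.active[dest]
        paths = self.active.setdefault(dest, self.alternatives[dest][:1])
        if latency > self.threshold and len(paths) < len(self.alternatives[dest]):
            paths = self.alternatives[dest][:len(paths) + 1]  # expand
            self.active[dest] = paths
            self.memory[pattern] = paths        # log the solution
        return paths
```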
Watershed: A High Performance Distributed Stream Processing System
Thatyene Louise Alves de Souza Ramos, R. S. Oliveira, Ana Paula de Carvalho, R. Ferreira, Wagner Meira Jr
The task of extracting information from datasets that grow larger on a daily basis, such as those collected from the web, is an increasing challenge, but it also enables more interesting insights and analyses. Current analyses go beyond content and now focus on tracking and understanding users' relationships and interactions. Such computation is intensive both in terms of the processing demand imposed by the algorithms and the sheer amount of data that has to be handled. In this paper we introduce Watershed, a distributed computing framework designed to support the analysis of very large data streams online and in real time. Data are obtained from streams by the system's processing components, transformed, and directed to other streams, creating large flows of information. The processing components are decoupled from each other and their connections are strictly data-driven. They can be dynamically inserted and removed, providing an environment in which different applications can share intermediate results or cooperate toward a global purpose. Our experiments demonstrate the flexibility of creating a set of data analysis algorithms and composing them into a powerful stream analysis environment.
DOI: 10.1109/SBAC-PAD.2011.31 | Published: 2011-10-26
Citations: 10
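The decoupled, stream-connected composition style can be pictured with a single-process toy: components that know only the names of the streams they consume and produce, so they can be attached without knowing about each other. This is an assumption-laden miniature, not Watershed's distributed online runtime.

```python
# Tiny filter-stream fabric: named streams are queues, filters are
# (input stream, output stream, function) triples, and run() drains the
# fabric until no filter can make progress.
from collections import defaultdict, deque

class StreamFabric:
    def __init__(self):
        self.streams = defaultdict(deque)
        self.filters = []                 # (in_stream, out_stream, fn)

    def attach(self, in_stream, out_stream, fn):
        self.filters.append((in_stream, out_stream, fn))

    def publish(self, stream, item):
        self.streams[stream].append(item)

    def run(self):
        progress = True
        while progress:                   # iterate until a full quiet pass
            progress = False
            for src, dst, fn in self.filters:
                while self.streams[src]:
                    self.publish(dst, fn(self.streams[src].popleft()))
                    progress = True
```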
MRU-Tour-based Replacement Algorithms for Last-Level Caches
A. Valero, J. Sahuquillo, S. Petit, P. López, J. Duato
Memory hierarchy design is a major concern in current microprocessors. Much research focuses on the Last-Level Cache (LLC), which is designed to hide the long miss penalty of accessing main memory. To reduce both capacity and conflict misses, LLCs are implemented as large memory structures with high associativity. To exploit temporal locality, LRU is the replacement algorithm usually implemented in caches. However, for a highly associative cache, its implementation is costly in terms of area and power consumption. Indeed, LRU is not well suited to the LLC: because this cache level does not see all memory accesses, it cannot fully exploit temporal locality. In addition, blocks must descend to the LRU position of the stack before eviction, even when they are no longer useful. In this paper, we show that most blocks are not referenced again once they leave the MRU position. Moreover, the probability of being referenced again does not depend on the position in the LRU stack. Based on these observations, we define the number of MRU-Tours (MRUTs) of a block as the number of times the block occupies the MRU position while it is stored in the cache, and we propose the MRUT replacement algorithm, which selects the victim among the blocks that show only one MRUT. Variations of this algorithm have also been proposed to exploit both MRUT behavior and recency information. Experimental results show that, compared to LRU, the proposal reduces the MPKI by up to 22%, while IPC is improved by 48%.
DOI: 10.1109/SBAC-PAD.2011.13 | Published: 2011-10-26
Citations: 6
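A simplified sketch of the MRUT bookkeeping (victim choice among single-tour blocks is arbitrary here, oldest first; the paper also evaluates recency-aware variants): count how many times each cached block enters the MRU position, and on a miss prefer to evict a block that completed only a single tour, falling back to LRU.

```python
# Toy fully-associative cache with MRUT-based replacement. The recency
# stack is an OrderedDict with the MRU block at the end; a block's tour
# count is incremented each time it (re)enters the MRU position.
from collections import OrderedDict

class MRUTCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.stack = OrderedDict()       # recency order, MRU last
        self.tours = {}

    def access(self, block):
        if block in self.stack:
            mru = next(reversed(self.stack))
            self.stack.move_to_end(block)
            if mru != block:             # entered MRU again: a new tour
                self.tours[block] += 1
            return True                  # hit
        if len(self.stack) >= self.capacity:
            single = [b for b in self.stack if self.tours[b] == 1]
            victim = single[0] if single else next(iter(self.stack))
            del self.stack[victim]
            del self.tours[victim]
        self.stack[block] = None
        self.tours[block] = 1            # first tour starts on insertion
        return False                     # miss
```

In the test below, block A survives the insertion of C because its second tour marks it as likely to be reused, while single-tour B is evicted.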
A Power-Efficient Co-designed Out-of-Order Processor
Abhishek Deb, J. M. Codina, Antonio González
A co-designed processor helps cut down both complexity and power consumption by co-designing certain key performance enablers. In this paper, we propose a FIFO-based co-designed out-of-order processor. Multiple FIFOs are added in order to dynamically schedule the micro-ops in a complexity-effective manner. We propose commit logic that is able to commit the program state as a superblock commits atomically. This enables us to get rid of the Reorder Buffer (ROB) entirely. Instead, to maintain the correct program state, we propose a four/eight-entry Superblock Ordering Buffer (SOB). We also propose a per-superblock Register Rename Table (SRRT) that holds the register state pertaining to each superblock. Our proposed processor dissipates 6% less power and obtains a 12% speedup for SPECFP; as a result, it consumes less energy. Furthermore, we propose an enhanced steering heuristic and an early release mechanism to increase the performance of a FIFO-based out-of-order processor. We obtain performance improvements of nearly 25% and 70% for four-FIFO and two-FIFO configurations, respectively. We also show that our proposed steering-heuristic-based processor consumes 10% less energy than the previously proposed steering heuristic.
DOI: 10.1109/SBAC-PAD.2011.9 | Published: 2011-10-26
Citations: 1
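Dependence-based steering into issue FIFOs, the kind of heuristic such designs build on, can be sketched as: append a micro-op to a FIFO whose tail produces one of its source operands (the dependence then resolves in FIFO order), otherwise to the shortest FIFO. This is a generic policy for illustration, not the paper's exact enhanced heuristic.

```python
# Steer a trace of micro-ops, each given as (dest_reg, src_regs), into
# n_fifos issue FIFOs. Dependent ops chain into the same FIFO;
# independent ops balance across the shortest one.

def steer(ops, n_fifos):
    """Return, per FIFO, the list of op indices steered into it."""
    fifos = [[] for _ in range(n_fifos)]
    producer_at_tail = [None] * n_fifos   # dest reg of each FIFO's tail op
    for idx, (dest, srcs) in enumerate(ops):
        for f in range(n_fifos):
            if producer_at_tail[f] in srcs:
                choice = f                # chain behind our producer
                break
        else:
            choice = min(range(n_fifos), key=lambda f: len(fifos[f]))
        fifos[choice].append(idx)
        producer_at_tail[choice] = dest
    return fifos
```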
Modeling the Performance of the Hadoop Online Prototype
Emanuel Vianna, Giovanni V. Comarela, Tatiana Pontes, J. Almeida, Virgílio A. F. Almeida, K. Wilkinson, Harumi A. Kuno, U. Dayal
MapReduce is an important paradigm for supporting modern data-intensive applications. In this paper we address the challenge of modeling the performance of one implementation of MapReduce, the Hadoop Online Prototype (HOP), with a specific focus on intra-job pipeline parallelism. We use a hierarchical model that combines a precedence model and a queuing network model to capture the intra-job synchronization constraints. We first show how to build a precedence graph that represents the dependencies among multiple tasks of the same job. We then apply it jointly with an approximate Mean Value Analysis (aMVA) solution to predict mean job response time and resource utilization. We validate our solution against a queuing network simulator in various scenarios, finding that our performance model presents a close agreement, with a maximum relative difference under 15%.
DOI: 10.1109/SBAC-PAD.2011.24 | Published: 2011-10-26
Cited 14 times
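The aMVA solution mentioned in the abstract above is a standard fixed-point technique for closed queuing networks. The following is a minimal sketch of the Schweitzer-Bard approximate MVA iteration, not the authors' model: the function name, inputs (per-station service demands), and convergence settings are assumptions for illustration only.

```python
# Schweitzer-Bard approximate Mean Value Analysis (aMVA) for a closed
# queuing network with N jobs and K queueing stations.
# Hypothetical sketch; station demands and names are illustrative.

def approx_mva(service_demands, n_jobs, tol=1e-6, max_iter=10000):
    """service_demands: per-station service demand D_k (visit count * service time).
    Returns (mean_response_time, throughput, per_station_queue_lengths)."""
    K = len(service_demands)
    # Schweitzer approximation: start with the jobs spread evenly.
    q = [n_jobs / K] * K
    for _ in range(max_iter):
        # Arrival-theorem approximation: an arriving job sees
        # (N-1)/N of the steady-state queue length at each station.
        r = [d * (1.0 + q_k * (n_jobs - 1) / n_jobs)
             for d, q_k in zip(service_demands, q)]
        x = n_jobs / sum(r)             # system throughput (Little's law)
        q_new = [x * r_k for r_k in r]  # Little's law per station
        if max(abs(a - b) for a, b in zip(q_new, q)) < tol:
            q = q_new
            break
        q = q_new
    return sum(r), x, q
```

For two identical stations with demand 1.0 and two jobs, the iteration converges immediately to a mean response time of 3.0 and a throughput of 2/3, matching exact MVA for that case.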
Efficiently Managing Advance Reservations Using Lists of Free Blocks
Jörg Schneider, B. Linnert
Advance reservation has been identified as a key technology for enabling guaranteed Quality of Service and co-allocation in the Grid. Nonetheless, most Grid and local resource management systems still use the queuing approach because of the additional complexity introduced by advance reservation. A planning-based resource management system has to keep track of future reservations and needs a good overview of the available capacity while negotiating incoming reservations. With advance reservation, the resource management problem becomes two-dimensional. In this paper, different data structures are investigated and discussed with respect to their fit for planning-based resource management, exposing the benefits of using lists of resource allocations or free blocks. This general idea, widely used to manage continuous resources, is extended to cover not only the resource dimension but also the time dimension. The list-of-blocks approach is evaluated in a Grid-level and a resource-level resource management system. Extensive simulations showed a better runtime and a higher reservation success rate compared with the currently favored slotted-time approach.
{"title":"Efficiently Managing Advance Reservations Using Lists of Free Blocks","authors":"Jörg Schneider, B. Linnert","doi":"10.1109/SBAC-PAD.2011.25","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.25","url":null,"abstract":"Advance reservation was identified as a key technology to enable guaranteed Quality of Service and co-allocation in the Grid. Nonetheless, most Grid and local resource management systems still use the queuing approach because of the additional complexity introduced by advance reservation. A planning based resource management system has to keep track of the reservations in the future and needs a good overview on the available capacity during the negotiation of incoming reservations. For advance reservation, the resource management problem becomes a two dimensional problem. In this paper different data structures are investigated and discussed in order to fit to planning based resource management. As a result the benefits of using lists of resource allocation or free blocks are exposed. This general idea widely used to manage continuous resources is extended to cover not only the resource dimension but also the time dimension. The list of blocks approach is evaluated in a Grid level and a resource level resource management system. 
The extensive simulations showed a better runtime and higher reservation success rate compared with the currently favored approach of a slotted time.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130997366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited 2 times
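The abstract above describes managing advance reservations with a list of free blocks spanning both the resource and the time dimension. The following is a minimal sketch of that idea under stated assumptions: a single resource with integer capacity, half-open time intervals, and a hypothetical `FreeBlockList` class whose name and interface are illustrative, not taken from the paper.

```python
# Hypothetical sketch: advance reservations over a list of free blocks.
# Each block is (start, end, free_capacity); blocks are contiguous and
# sorted, so the list describes free capacity over the whole horizon.

class FreeBlockList:
    def __init__(self, capacity, horizon):
        # Initially one block: the full capacity is free for all of [0, horizon).
        self.blocks = [(0, horizon, capacity)]

    def reserve(self, start, end, amount):
        """Try to reserve `amount` capacity for [start, end); True on success."""
        # Admission test: every block overlapping [start, end) must have
        # at least `amount` free capacity.
        for s, e, free in self.blocks:
            if s < end and start < e and free < amount:
                return False
        # Commit: split the overlapping blocks and subtract the amount.
        new_blocks = []
        for s, e, free in self.blocks:
            if e <= start or end <= s:          # no overlap: keep as-is
                new_blocks.append((s, e, free))
                continue
            if s < start:                       # free part before the reservation
                new_blocks.append((s, start, free))
            new_blocks.append((max(s, start), min(e, end), free - amount))
            if end < e:                         # free part after the reservation
                new_blocks.append((end, e, free))
        self.blocks = new_blocks
        return True
```

Because the list only splits blocks where reservations begin or end, its length grows with the number of distinct reservation boundaries rather than with a fixed slot count, which is the advantage the paper's simulations measure against the slotted-time approach.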