
2008 IEEE International Conference on Computer Design: Latest Papers

A simple latency tolerant processor
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751889
Satyanarayana Nekkalapu, Haitham Akkary, K. Jothi, Renjith Retnamma, Xiaoyu Song
The advent of multi-core processors and the emergence of new parallel applications that take advantage of such processors pose difficult challenges to designers. With relatively constant die sizes, limited on-chip cache, and scarce pin bandwidth, placing more cores on a chip reduces the cache and bus bandwidth available per core, thereby exacerbating the memory wall problem. How can a designer build a processor that provides a core with good single-thread performance in the presence of long-latency cache misses, while enabling as many of these cores as possible to be placed on the same die for high throughput? Conventional latency-tolerant architectures that use out-of-order superscalar execution have become too complex and power hungry for the multi-core era. Instead, we present a simple, non-blocking architecture that achieves memory latency tolerance without requiring complex out-of-order execution hardware or large, cycle-critical, and power-hungry structures such as dynamic schedulers, fully associative load and store queues, and reorder buffers. The non-blocking property of this architecture provides tolerance to hundreds of cycles of cache miss latency on a simple in-order issue core, thus allowing many more such cores to be integrated on the same die than is possible with a conventional out-of-order superscalar architecture.
Citations: 21
Fine-grained parallel application specific computing for RNA secondary structure prediction on FPGA
Pub Date : 2008-10-01 DOI: 10.1142/S0218126614500315
Qianghua Zhu, Fei Xia, Guoqing Jin
In the field of RNA secondary structure prediction, the Zuker algorithm is one of the most popular methods based on free energy minimization. However, general-purpose computers, including parallel and multi-core computers, exhibit a parallel efficiency of no more than 50% on Zuker. FPGA chips provide a new approach to accelerating the Zuker algorithm by exploiting fine-grained custom design. Zuker exhibits complicated data dependences, in which the dependence distance is variable and the dependence direction spans two dimensions. We propose a systolic array structure, consisting of one master PE and multiple slave PEs, for fine-grained hardware implementation on FPGA. We exploit data reuse schemes to reduce the need to load energy matrices from external memory. We also propose several methods that reduce the energy table parameter size by 85%. To our knowledge, our implementation with 16 PEs is the only FPGA accelerator implementing the complete Zuker algorithm. The experimental results show a speedup of a factor of 14 over the ViennaRNA-1.6.5 software for a 2981-residue RNA sequence running on a PC platform with a Pentium 4 2.6 GHz CPU.
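The dependence pattern mentioned in the abstract (variable dependence distance, spanning two dimensions of the table) can be illustrated with a much simpler relative of Zuker: a Nussinov-style dynamic program that maximizes the number of base pairs instead of minimizing free energy. This is only a sketch under that substitution; it is not the Zuker algorithm and not the authors' FPGA design, and the function name `nussinov_max_pairs` and the `min_loop` parameter are assumptions made for illustration.

```python
from typing import List

# Watson-Crick and wobble pairs allowed in this toy model.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def nussinov_max_pairs(seq: str, min_loop: int = 3) -> int:
    """Return the maximum number of nested base pairs in `seq`.

    Cell (i, j) depends on (i+1, j), (i, j-1), (i+1, j-1) and on every split
    (i, k) + (k+1, j), so the dependence distance grows with j - i and the
    dependences run along both table dimensions.
    """
    n = len(seq)
    if n == 0:
        return 0
    dp: List[List[int]] = [[0] * n for _ in range(n)]
    # Fill the table diagonal by diagonal; short spans cannot form a pair.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])       # i or j unpaired
            if (seq[i], seq[j]) in PAIRS:
                best = max(best, dp[i + 1][j - 1] + 1)   # i pairs with j
            for k in range(i + 1, j):                    # bifurcation
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]

print(nussinov_max_pairs("GGGAAACCC"))  # prints 3
```

All cells on one diagonal are independent of each other, which is the kind of fine-grained parallelism an array of processing elements can exploit; the inner loop over `k` is where the variable-distance dependences arise.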
Citations: 11
Run-time active leakage reduction by power gating and reverse body biasing: An energy view
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751925
Hao Xu, R. Vemuri, W. Jone
Run-time active leakage reduction (RALR) is a recent technique that aims at aggressively reducing leakage power consumption. This paper studies the feasibility of RALR from the energy aspect, for both power gating (PG) and reverse body bias (RBB) implementations. We develop two energy saving models, one for PG and one for RBB. These models can accurately estimate the circuit energy saving at any time, even when the circuit is in state transition. In PG modeling, we discover a physical phenomenon called "instant saving", which can affect the model accuracy by 30%-50%. Based on the RBB model, we derive the optimum design point of RBB for RALR. Finally, in terms of energy saving, we define four figures of merit to compare the efficacy of using PG and RBB to implement RALR.
Citations: 21
Near-optimal oblivious routing on three-dimensional mesh networks
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751852
R. Ramanujam, Bill Lin
The increasing viability of three-dimensional (3D) silicon integration technology has opened new opportunities for chip architecture innovations. One direction is the extension of two-dimensional (2D) mesh-based tiled chip-multiprocessor architectures into three dimensions. In this paper, we focus on efficient routing algorithms for such 3D mesh networks. As in the case of 2D mesh networks, throughput and latency are important design metrics for routing algorithms. Existing routing algorithms suffer from either poor worst-case throughput (DOR, ROMM) or poor latency (VAL). Although the previously proposed minimal routing algorithm O1TURN already achieves near-optimal worst-case throughput for the 2D case, the optimality result does not extend to higher dimensions. For 3D and higher-dimensional meshes, the worst-case throughput of O1TURN degrades tremendously. The main contribution of this paper is the design of a new oblivious routing algorithm for 3D mesh networks called randomized partially-minimal (RPM) routing. RPM provably achieves optimal worst-case throughput for 3D meshes when the network radix k is even, and comes within a factor of 1/k^2 of optimal worst-case throughput when k is odd. RPM also outperforms VAL, DOR, ROMM, and O1TURN in average-case throughput by 33.3%, 111%, 47%, and 30%, respectively, when averaged over one million random traffic patterns on an 8 × 8 × 8 topology. Finally, whereas VAL achieves optimal worst-case throughput at a penalty factor of 2 in average latency over DOR, RPM achieves (near) optimal worst-case throughput with a much smaller factor of 1.33. In practice, the average latency of RPM is expected to be closer to that of minimal routing because 3D mesh networks are not expected to be symmetric in 3D chip designs. The number of available device layers is expected to be much smaller than the number of processor tiles that can be placed along an edge of a device layer. For practical asymmetric 3D mesh configurations, the average latency of RPM reduces to just a factor of 1.11 of DOR.
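For readers unfamiliar with the DOR baseline named in the abstract, dimension-order routing on a 3D mesh can be sketched as follows. This is a minimal illustration of the baseline only, not the RPM algorithm, whose details the abstract does not give; the function name `dor_route` and the X-then-Y-then-Z order are assumptions made for the example.

```python
from typing import List, Tuple

Node = Tuple[int, int, int]

def dor_route(src: Node, dst: Node) -> List[Node]:
    """Return the nodes visited by dimension-order routing on a 3D mesh.

    The packet corrects its X coordinate first, then Y, then Z, one hop at a
    time. The route is minimal and fully deterministic, so every packet
    between the same pair of nodes uses the same links.
    """
    path = [src]
    cur = list(src)
    for dim in range(3):  # 0 = X, 1 = Y, 2 = Z (assumed order)
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append((cur[0], cur[1], cur[2]))
    return path

route = dor_route((0, 0, 0), (2, 1, 3))
print(len(route) - 1)  # hop count equals the Manhattan distance: prints 6
```

Because the path is fixed by the source and destination alone, adversarial traffic patterns can concentrate load on a few links, which is the worst-case throughput weakness the abstract attributes to DOR.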
Citations: 28
Exploiting spare resources of in-order SMT processors executing hard real-time threads
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751887
Jörg Mische, S. Uhrig, Florian Kluge, T. Ungerer
We developed an SMT processor that allows a static WCET analysis of several hard real-time threads and uses the remaining resources for soft or non-real-time threads. The analysis is possible because one Dominant Meta Thread (DMT) is executed as if it were the only thread on the processor, and thus single-threaded WCET techniques can be applied. To provide more than one hard real-time thread, the execution time of the Dominant Meta Thread is distributed by time sharing, whereby the length of the time slices and periods can be adjusted at runtime. Our technique, called Dominant Time Sharing (DTS), can be used to minimize the number of control units in embedded hard real-time systems and hence reduces the overall energy consumption and material demand. In contrast to many other studies, we are able to handle multicycle memory latencies while preserving analyzability. The proposed technique can easily be extended to access other external resources such as coprocessors or reconfigurable arrays.
Citations: 20
Frequency and voltage planning for multi-core processors under thermal constraints
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751902
M. Kadin, S. Reda
Clock frequency and transistor density increases have resulted in elevated chip temperatures. In order to meet temperature constraints while still exploiting the performance opportunities enabled by continued scaling, chip designers have migrated towards multi-core architectures. Multi-core architectures use multiple cores running at moderate clock frequencies to run several threads concurrently, which increases overall system throughput. In this work, we propose novel methods to find the optimal operating parameters, i.e., frequency and voltage, that maximize the throughput of a multi-core system under thermal constraints. By adjusting core clock frequencies and voltages, on-chip power dissipation can be spatially and temporally distributed to maximize the chip's physical performance during runtime. We propose a simple yet efficient model that accurately characterizes the effects that changes in clock frequency and voltage have on on-chip temperatures. Using the model, we find the optimal operating conditions for the following scenarios: (1) standard processor performance, where all cores operate using identical operating parameters, (2) optimal processor performance, where each core can have its own frequency and voltage, and (3) optimal processor performance with thread priorities, where each core runs a thread of varied importance. We run several experiments across six different technology nodes to validate the work, assuring that our models and methods are accurate. Our methods demonstrate that the total physical performance of a multi-core system can be increased by up to 33.4% without violating the maximum temperature constraints.
Citations: 21
Variation-aware thermal characterization and management of multi-core architectures
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751874
E. Kursun, Chen-Yong Cher
The accuracy and efficiency of dynamic power and thermal management are both affected by the increased levels of on-chip variation, mainly because dynamic thermal management schemes are oblivious to the variation characteristics of the underlying hardware. We propose a technique that utilizes the existing on-chip sensor infrastructure to reduce the inherent thermal imbalances among different cores in a multi-core architecture. Thermal sensor readings are compiled to generate an on-chip variation map, which is provided to the system power/thermal management so that it can effectively manage the existing on-chip variation. Experimental analysis based on live measurements on a special test chip shows reduced on-chip heating with no performance loss, which improves the power/thermal efficiency of the chip at no cost.
Citations: 35
Fast arbiters for on-chip network switches
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751932
G. Dimitrakopoulos, N. Chrysos, C. Galanopoulos
The need for efficient implementation of simple crossbar schedulers has increased in recent years due to the advent of on-chip interconnection networks that require low-latency message delivery. The core function of any crossbar scheduler is arbitration, which resolves conflicting requests for the same output. Since the delay of the arbiters directly determines the operation speed of the scheduler, the design of faster arbiters is of paramount importance. In this paper, we present a new bit-level algorithm and new circuit techniques for the design of programmable priority arbiters that offer significantly more efficient implementations than already-known solutions. The experimental results show that the proposed circuits are more than 15% faster than the most efficient previous implementations, which, under equal delay comparisons, translates to 40% less energy.
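What a programmable priority arbiter computes can be shown with a small behavioral model. The sketch below captures only the input/output behavior (grant the first active request at or after a programmable priority position, wrapping around); it is not the bit-level algorithm or the circuits proposed in the paper, and the function name and list-based interface are assumptions made for illustration.

```python
from typing import List, Optional

def programmable_priority_arbiter(requests: List[bool], priority: int) -> Optional[int]:
    """Return the index of the granted request, or None if nothing is requested.

    `priority` is the position with the highest priority in this cycle.
    Priority then decreases cyclically: priority, priority + 1, ..., wrapping
    around to index 0 after the last position.
    """
    n = len(requests)
    for offset in range(n):
        idx = (priority + offset) % n
        if requests[idx]:
            return idx
    return None

reqs = [True, False, True, False]
print(programmable_priority_arbiter(reqs, 1))  # prints 2
print(programmable_priority_arbiter(reqs, 3))  # prints 0 (search wraps around)
```

Moving the priority position to just past the last granted index after each grant yields a round-robin policy. A hardware implementation evaluates all positions in parallel rather than scanning sequentially, which is why the arbiter's logic depth sets the scheduler's speed.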
Citations: 47
CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751886
Andrea Pellegrini, Kypros Constantinides, Dan Zhang, Shobana Sudhakar, V. Bertacco, T. Austin
Extreme scaling practices in silicon technology are quickly leading to integrated circuit components with limited reliability, where phenomena such as early-transistor failures, gate-oxide wearout, and transient faults are becoming increasingly common. In order to overcome these issues and develop robust design techniques for large-market silicon ICs, it is necessary to rely on accurate failure analysis frameworks which enable design houses to faithfully evaluate both the impact of a wide range of potential failures and the ability of candidate reliable mechanisms to overcome them. Unfortunately, while failure rates are already growing beyond economically viable limits, no fault analysis framework is yet available that is both accurate and can operate on a complex integrated system. To address this void, we present CrashTest, a fast, high-fidelity and flexible resiliency analysis system. Given a hardware description model of the design under analysis, CrashTest is capable of orchestrating and performing a comprehensive design resiliency analysis by examining how the design reacts to faults while running software applications. Upon completion, CrashTest provides a high-fidelity analysis report obtained by performing a fault injection campaign at the gate-level netlist of the design. The fault injection and analysis process is significantly accelerated by the use of an FPGA hardware emulation platform. We conducted experimental evaluations on a range of systems, including a complex LEON-based system-on-chip, and evaluated the impact of gate-level injected faults at the system level. We found that CrashTest is 16-90x faster than an equivalent software-based framework, when analyzing designs through direct primary I/Os. As shown by our LEON-based SoC experiments, CrashTest exhibits emulation speeds that are six orders of magnitude faster than simulation.
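The idea of a gate-level fault injection campaign can be illustrated in software on a toy circuit. The sketch below injects a stuck-at fault into a hypothetical full-adder netlist and counts the input vectors whose outputs differ from the fault-free run. It is only an illustration of the concept: CrashTest itself runs on an FPGA emulation platform, and the netlist, the stuck-at fault model as used here, and all names in the code are assumptions, not taken from the paper.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# Each gate: (output net, operation, input nets), listed in topological order.
Gate = Tuple[str, str, Tuple[str, str]]
Fault = Tuple[str, int]  # (net name, stuck-at value)

# Hypothetical 1-bit full adder netlist, used only for illustration.
NETLIST: List[Gate] = [
    ("t1", "XOR", ("a", "b")),
    ("sum", "XOR", ("t1", "cin")),
    ("t2", "AND", ("a", "b")),
    ("t3", "AND", ("t1", "cin")),
    ("cout", "OR", ("t2", "t3")),
]

OPS = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}

def simulate(inputs: Dict[str, int], fault: Optional[Fault] = None) -> Tuple[int, int]:
    """Evaluate the netlist and return (sum, cout), optionally with one fault."""
    nets = dict(inputs)
    if fault and fault[0] in nets:
        nets[fault[0]] = fault[1]          # stuck-at fault on a primary input
    for out, op, (in0, in1) in NETLIST:
        value = OPS[op](nets[in0], nets[in1])
        if fault and fault[0] == out:
            value = fault[1]               # stuck-at fault on a gate output
        nets[out] = value
    return nets["sum"], nets["cout"]

def count_exposing_vectors(fault: Fault) -> int:
    """Count input vectors whose outputs differ from the fault-free circuit."""
    exposed = 0
    for a, b, cin in product((0, 1), repeat=3):
        vec = {"a": a, "b": b, "cin": cin}
        if simulate(vec) != simulate(vec, fault):
            exposed += 1
    return exposed

print(count_exposing_vectors(("t2", 0)))  # prints 2
```

Exhaustive simulation is feasible only for toy circuits like this one; for a full system-on-chip running real software, this per-fault cost is what motivates moving the campaign onto hardware emulation.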
Citations: 69
Acceleration of a 3D target tracking algorithm using an application specific instruction set processor
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751870
S. Fontaine, Sylvain Goyette, J. Langlois, G. Bois
In today's high-tech world, intelligent video-surveillance is becoming a part of everyday life. In addition to minimizing the need for constant monitoring by an operator, it can automatically perform tasks such as accident detection or estimation of vehicle speed. A particularly useful algorithm for video surveillance is three-dimensional target tracking but, since it is both quite computationally expensive and requires the use of two cameras, it is seldom used. In this paper, we concentrate on accelerating an implementation of 3D tracking using a multiprocessor ASIP architecture based on the Tensilica Xtensa processor. Our experiments show that a speedup factor of 22 can be achieved using an extensible platform expressly optimized for this application as opposed to using a general-purpose processor.
Citations: 9