2008 IEEE International Conference on Computer Design: Latest Papers


Analysis and minimization of practical energy in 45nm subthreshold logic circuits

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751876

D. Bol, R. Ambroise, D. Flandre, J. Legat

Over the last decade, the design of ultra-low-power digital circuits in the subthreshold regime has been driven by the quest for minimum energy per operation. In this contribution, we observe that operating at the minimum-energy point is not straightforward, as design constraints from real-life applications have an important impact on energy. Therefore, we introduce the alternative concept of practical energy, taking functional-yield and throughput constraints on minimum Vdd into account. In this context, we demonstrate for the first time the detrimental impact of DIBL on minimum Vdd. Practical energy gives a useful analysis framework for circuit optimization to reach the minimum-energy point, while considering the throughput as an input variable dictated by the application. From simulation of a benchmark multiplier in 45 nm technology, we find that practical energy can be far higher than the minimum-energy point in the case of low-throughput applications (approximately 10-100 kOp/s) because of static leakage energy and robustness-limited minimum Vdd. With the proposed framework, we investigate the capability of conventional optimization techniques to make practical energy meet the minimum-energy point. Amongst these techniques, channel length upsize is shown to be more efficient than MTCMOS power gating, body biasing, Vt selection or device width upsize, as it increases robustness while simultaneously reducing static leakage energy. A small length upsize with low area overhead is shown to reduce practical energy at low throughput to less than 2.1 times the minimum energy level. At medium throughput, it even brings practical energy 30% lower than the minimum energy level without optimization techniques.
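The distinction between minimum energy and practical energy can be illustrated with a toy model. The sketch below is not the authors' model: it assumes a textbook exponential subthreshold current with a DIBL term, no power gating (so leakage is integrated over the whole cycle), and placeholder constants chosen only to make the trend visible. The function names `minimum_energy` and `practical_energy` and every constant are hypothetical.

```python
import math

# All constants below are illustrative placeholders, not values from the paper.
UT = 0.026       # thermal voltage near room temperature (V)
N_SLOPE = 1.5    # subthreshold slope factor
ETA = 0.1        # DIBL coefficient
C_EFF = 1e-12    # switched capacitance per operation (F)
C_CRIT = 1e-12   # lumped capacitance setting the critical-path delay (F)
I_ON0 = 1e-9     # on-current prefactor of the critical path (A)
I_OFF0 = 1e-7    # off-current prefactor summed over all gates (A)


def delay(vdd):
    """Time per operation when the circuit runs as fast as it can at vdd."""
    i_on = I_ON0 * math.exp((1 + ETA) * vdd / (N_SLOPE * UT))
    return C_CRIT * vdd / i_on


def leakage_current(vdd):
    """Total off-current; the DIBL term makes it grow with vdd."""
    return I_OFF0 * math.exp(ETA * vdd / (N_SLOPE * UT))


def energy_per_op(vdd, t_cycle):
    """Dynamic switching energy plus leakage integrated over one cycle."""
    dynamic = C_EFF * vdd ** 2
    static = leakage_current(vdd) * vdd * t_cycle
    return dynamic + static


def minimum_energy(vdds):
    """Unconstrained optimum: each vdd is evaluated at its own maximum speed."""
    return min(energy_per_op(v, delay(v)) for v in vdds)


def practical_energy(throughput, vdd_min_functional, vdds):
    """Optimum when the application fixes the throughput (operations/s) and
    robustness imposes a lower bound on vdd. Returns None if nothing fits."""
    t_cycle = 1.0 / throughput
    feasible = [v for v in vdds
                if v >= vdd_min_functional and delay(v) <= t_cycle]
    if not feasible:
        return None
    return min(energy_per_op(v, t_cycle) for v in feasible)


# Candidate supply voltages from 0.15 V to 0.50 V in 10 mV steps.
VDDS = [mv / 1000 for mv in range(150, 501, 10)]
```

Calling `practical_energy(1e4, 0.3, VDDS)` and comparing it with `minimum_energy(VDDS)` shows the qualitative effect described in the abstract: at low throughput the cycle is long, so leakage dominates and the practical figure sits well above the unconstrained minimum. The numbers themselves carry no meaning beyond that trend.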

Citations: 53

Simulation points for SPEC CPU 2006

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751891

Arun A. Nair, L. John

Increasing sizes of benchmarks make detailed simulation an extremely time-consuming process. Statistical techniques such as the SimPoint methodology have been proposed in order to address this problem during the initial design phase. The SimPoint methodology attempts to identify repetitive, long, large-grain phases in programs and predict the performance of the architecture based on its aggregate performance on the individual phases. This study compares the accuracy of the SimPoint methodology for the SPEC CPU 2006 benchmark suite with that for SPEC CPU 2000, and studies the large-grain phases in the two benchmark suites using the SimPoint methodology. We find that there has not been a significant increase in the number of simulation points required to accurately predict the behavior of the programs in SPEC CPU 2006, despite its significantly larger data footprint and dynamic instruction count. We also find that the programs in both benchmark suites have similar characteristics in terms of the number of phases that contribute significantly towards overall behavior, further emphasizing the similarity between the two benchmark suites with respect to the number of simulation points required for similar accuracies.
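The "aggregate performance on the individual phases" step can be sketched as a weighted average. The snippet below assumes each simulation point comes with a weight (the fraction of execution its phase represents) and a CPI measured by detailed simulation of that point alone; the function name `estimate_cpi` and the input layout are hypothetical and not taken from the SimPoint tool.

```python
def estimate_cpi(sim_points):
    """Estimate whole-program CPI from per-phase measurements.

    sim_points: iterable of (weight, cpi) pairs, one per simulation point.
    Weights are normalized here, so they need not sum to exactly 1.
    """
    points = list(sim_points)
    total_weight = sum(weight for weight, _ in points)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weight * cpi for weight, cpi in points) / total_weight
```

For example, two phases of equal weight with CPIs of 1.0 and 3.0 give an estimate of 2.0. The accuracy of such an estimate depends on how well the chosen points represent their phases, which is the question the study examines.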

Citations: 35

A parallel Steiner tree heuristic for macro cell routing

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751836

C. Fobel, G. Grewal

Global routing of macro cells remains an important but time-consuming step in the VLSI design cycle. Macro cells are large, irregularly sized parameterized circuit modules that typically contain large numbers of terminals that must be interconnected. The interconnection pattern for each set of terminals (net) that must be connected is a Steiner tree, and the primary sub-problem in the global routing of macro cells is to find a set of dissimilar, low-cost Steiner trees for each net that must be routed. In this paper, a two-phase, parallel (multi-processor) algorithm is proposed for quickly constructing a diverse pool of high-quality Steiner trees for routing of multi-terminal nets. In the first phase, a single Steiner tree is constructed using a heuristic called Shrubbery. Then, in the second phase, a pool of dissimilar, high-quality trees is created from the original tree by running multiple instances of a local search in parallel. Computational experiments performed on over 800 commonly used benchmarks show that running multiple instances of the local search in parallel results in near-linear speed-up over the serial case. Most importantly, the trees produced are both high-quality and dissimilar, allowing for numerous routing possibilities for each net.

Citations: 3

Fine-grained parallel application specific computing for RNA secondary structure prediction on FPGA

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1142/S0218126614500315

Qianghua Zhu, Fei Xia, Guoqing Jin

In the field of RNA secondary structure prediction, the Zuker algorithm is one of the most popular methods using free energy minimization. However, general-purpose computers, including parallel and multi-core computers, exhibit a parallel efficiency of no more than 50% on Zuker. FPGA chips provide a new approach to accelerating the Zuker algorithm by exploiting fine-grained custom design. Zuker shows complicated data dependences, in which the dependence distance is variable and the dependence direction spans two dimensions. We propose a systolic array structure including one master PE and multiple slave PEs for fine-grained hardware implementation on FPGA. We exploit data reuse schemes to reduce the need to load energy matrices from external memory. We also propose several methods to reduce the energy table parameter size by 85%. To our knowledge, our implementation with 16 PEs is the only FPGA accelerator implementing the complete Zuker algorithm. The experimental results show a factor of 14 speedup over the ViennaRNA-1.6.5 software for a 2981-residue RNA sequence running on a PC platform with a Pentium 4 2.6 GHz CPU.

Citations: 11

A high-performance parallel CAVLC encoder on a fine-grained many-core system

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751869

Zhibin Xiao, B. Baas

This paper presents a high-performance parallel context-based adaptive variable length coding (CAVLC) encoder implemented on a fine-grained many-core system. The software encoder is designed for an H.264/AVC baseline profile encoder. By utilizing arithmetic table elimination and compression techniques, the data-flow of the CAVLC encoder has been partitioned and mapped to an array of 15 small processors. The parallel workload of each processor is characterized and balanced for further throughput optimization. The proposed parallel CAVLC encoder achieves the real-time processing requirement of 30 frames per second for 720p HDTV. Our experiments show that the presented CAVLC encoder has 4.86 to 6.83 times higher throughput and requires far smaller chip area than the identical encoder implemented on state-of-the-art general-purpose processors. In comparison to published implementations on common DSP processors, the design has approximately 1.0 to 6.15 times higher throughput while requiring less than 6 times smaller area.

Citations: 23

A novel, highly SEU tolerant digital circuit design approach

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751834

Rajesh Garg, S. Khatri

In this paper, we present a new radiation tolerant CMOS standard cell library, and demonstrate its effectiveness in implementing radiation hardened digital circuits. We exploit the fact that if a gate is implemented using only PMOS (NMOS) transistors, then a radiation particle strike can result only in a logic 0 to 1 (1 to 0) flip. Based on this observation, we derive our radiation hardened gates from regular static CMOS gates. In particular, we separate the PMOS and NMOS devices, and split the gate output into two signals. One of these outputs of our radiation tolerant gate is generated using PMOS transistors, and it drives only other PMOS transistors in its fanout. Similarly, the other output (generated from NMOS transistors) drives only other NMOS transistors in its fanout. Now, if a radiation particle strikes one of the outputs of the radiation tolerant gate, then the gates in the fanout enter a high-impedance state, and hence preserve their output values. Our radiation hardened gates exhibit an extremely high degree of SEU tolerance, which is validated at the circuit level. Using these gates, we also implement circuit level hardening based on logical masking, to selectively harden those gates in a circuit which contribute most to the soft error failure of the circuit. The gates with a low probability of logical masking are replaced by SEU tolerant gates from our new library, such that the digital design achieves a 90% soft error rate reduction. Experimental results demonstrate that this reduction is achieved with a modest layout area and delay penalty of 62% and 29% respectively, for area mapped designs. In contrast with existing approaches, our approach results in SEU immunity for extremely large critical charge values (>650 fC).

Citations: 24

Optimizing data sharing and address translation for the Cell BE Heterogeneous Chip Multiprocessor

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751904

M. Gschwind

Heterogeneous Chip Multiprocessors (HMPs), such as the Cell Broadband Engine, offer a new design optimization opportunity by allowing designers to provide accelerators for application-specific domains. Data sharing between CPUs and accelerators, and memory access mechanisms and protocols, are crucial decisions in the design of an HMP. In this article, we analyze the choices between hardware- and software-managed coherence between CPU and accelerators for DMA-based data sharing, and find that hardware-coherent DMA shows a performance benefit of up to 3x, even for simple workloads. We explore memory address translation architecture choices for DMA-based data sharing. In multiprogramming environments, address translation is commonly used to separate processes. For efficiency, direct access to system memory requires address translation capabilities in the accelerator. We find that hardware-managed address translation shows a performance benefit of up to 5x, even for simple workloads, by avoiding the costs of accelerator/CPU communication and supervisor management of the translation context and the introduction of a serial bottleneck on the CPU.

Citations: 10

A family of scalable FFT architectures and an implementation of 1024-point radix-2 FFT for real-time communications

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751880

A. Suleiman, H. Saleh, A. Hussein, D. Akopian

The paper presents a family of architectures for FFT implementation based on the decomposition of the perfect shuffle permutation, which can be designed with a variable number of processing elements. This provides designers with a trade-off between speed and complexity (cost and area). A detailed case study is provided on the implementation of a 1024-point FFT with 2 processing elements using 45 nm process technology, including area, timing, power and place-and-route results.
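For readers unfamiliar with the perfect shuffle permutation named above, the following minimal sketch shows the permutation itself on a Python list. It does not reproduce the paper's decomposition or its hardware mapping; the function name `perfect_shuffle` is hypothetical.

```python
def perfect_shuffle(seq):
    """Interleave the first and second halves of an even-length sequence.

    For a length that is a power of two, this moves the element at index i
    to the index obtained by rotating the bits of i left by one position.
    """
    n = len(seq)
    if n % 2 != 0:
        raise ValueError("perfect shuffle needs an even number of elements")
    half = n // 2
    shuffled = []
    for first, second in zip(seq[:half], seq[half:]):
        shuffled.extend((first, second))
    return shuffled
```

Because each application rotates the index bits by one, applying the shuffle log2(N) times to N = 2^k elements restores the original order. This regularity is what makes the permutation attractive as the interconnect between radix-2 butterfly stages.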

Citations: 16

Configurable rectilinear Steiner tree construction for SoC and nano technologies

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751837

I. Jiang, Yen-Ting Yu

The rectilinear Steiner minimal tree (RSMT) problem is essential in physical design. Moreover, the various constraints arising from fabrication issues, including obstacle avoidance, multiple routing layers, and layer-specific routing directions, cannot be ignored during RSMT construction for modern SoC and nano technologies. This paper proposes a construction-by-correction approach for obstacle-avoiding preferred direction rectilinear Steiner tree construction. Experimental results show that our algorithm is promising and outperforms the state-of-the-art works.

Citations: 2

Timing analysis considering IR drop waveforms in power gating designs

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751912

Shih-Hung Weng, Yu-Min Kuo, Shih-Chieh Chang, M. Marek-Sadowska

IR drop noise has become a critical issue in advanced process technologies. Traditionally, timing analysis in which the IR drop noise is considered assumes a worst-case IR drop for each gate; however, using this assumption provides unduly pessimistic results. In this paper, we describe a timing analysis approach for power gating designs. To improve the accuracy of the gate delay calculation we determine the virtual voltage level by taking into account the IR drop waveforms across the sleep transistors. These can be obtained efficiently using a linear programming approach. Our experimental results are very promising.

Citations: 5
