2008 IEEE International Conference on Computer Design: Latest Papers

Synthesis of parallel prefix adders considering switching activities

2008 IEEE International Conference on Computer Design

Pub Date : 2008-12-01 DOI: 10.1109/ICCD.2008.4751892

T. Matsunaga, S. Kimura, Y. Matsunaga

This paper addresses parallel prefix adder synthesis that targets minimization of the total switching activity under bitwise timing constraints. The problem is treated as the synthesis of prefix graphs, which represent the global structures of parallel prefix adders at a technology-independent level. An approach for timing-driven area minimization has previously been proposed: it first finds the exact minimum solution on a specific subset of prefix graphs by dynamic programming, then restructures the result for further reduction by removing the restriction to that subset. This approach can be applied to switching cost minimization almost directly, though in some cases it is not as effective as it is for area minimization. In this paper, a heuristic is proposed that estimates the effect of the restructuring phase and improves the cost calculation for some specific cases. Through various experiments, the conditions under which this approach can be executed effectively are also discussed.
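
For readers unfamiliar with prefix graphs, the following is a minimal sketch of one well-known prefix structure (a Kogge-Stone-style network) modeled in Python. It is not the graph synthesized in the paper; the function name `prefix_add` and the `width` parameter are hypothetical and chosen only for illustration. Each pass of the loop corresponds to one level of the prefix graph, and different graphs trade the number of such nodes (area, switching) against depth (timing).

```python
def prefix_add(a: int, b: int, width: int = 8) -> int:
    """Add two unsigned integers using a Kogge-Stone-style prefix network.

    Illustrative software model only; returns (a + b) modulo 2**width.
    """
    # Per-bit generate (both inputs are 1) and propagate (exactly one is 1).
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    xor_bits = p[:]  # original propagate bits, needed for the sum stage

    # Prefix stage: combine (g, p) pairs at distances 1, 2, 4, ...
    dist = 1
    while dist < width:
        new_g, new_p = g[:], p[:]
        for i in range(dist, width):
            # Prefix operator: (g, p) o (g', p') = (g | (p & g'), p & p')
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        g, p = new_g, new_p
        dist *= 2

    # g[i] is now the carry out of bit i, so the carry into bit i is g[i - 1].
    total = 0
    for i in range(width):
        carry_in = g[i - 1] if i > 0 else 0
        total |= (xor_bits[i] ^ carry_in) << i
    return total
```

A call such as `prefix_add(100, 55)` models an 8-bit addition; results wrap around at `2**width` because the final carry out is discarded.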

Citations: 6

A fine-grain dynamic sleep control scheme in MIPS R3000

2008 IEEE International Conference on Computer Design

Pub Date : 2008-12-01 DOI: 10.1109/ICCD.2008.4751924

N. Seki, Lei Zhao, J. Kei, D. Ikebuchi, Y. Kojima, Y. Hasegawa, H. Amano, Toshihiro Kashima, S. Takeda, T. Shirai, M. Nakata, K. Usami, T. Sunata, J. Kanai, M. Namiki, Masaaki Kondo, Hiroshi Nakamura

A fine-grain dynamic power gating scheme is proposed that saves leakage power in the MIPS R3000 through sleep control, and it is applied to a processor pipeline. The execution unit is divided into four small units: multiplier, divider, shifter and other (CLU). The power of each unit is cut off dynamically, based on the operation. We taped out the prototype chip Geyser-0, which provides an R3000 core with the power reduction technique, 16 KB caches and a translation lookaside buffer (TLB) using 90 nm CMOS technology. Evaluation results for four benchmark programs for embedded applications show that leakage power is reduced by 47% on average, with 41% area overhead.

Citations: 38

Efficiency of thread-level speculation in SMT and CMP architectures - performance, power and thermal perspective

2008 IEEE International Conference on Computer Design

Pub Date : 2008-12-01 DOI: 10.1109/ICCD.2008.4751875

Venkatesan Packirisamy, Yangchun Luo, W. Hung, Antonia Zhai, P. Yew, Tin-fook Ngai

The computer industry adopted multi-threaded and multi-core architectures as clock rate increases stalled in the early 2000s. However, because of the lack of compilers and other related software technologies, most general-purpose applications today still cannot take advantage of such architectures to improve their performance. Thread-level speculation (TLS) has been proposed as a way of using these multi-threaded architectures to parallelize general-purpose applications. Both simultaneous multithreading (SMT) and chip multiprocessors (CMP) have been extended to implement TLS. While the characteristics of SMT and CMP have been widely studied under multi-programmed and parallel workloads, their behavior under TLS workloads is not well understood. Because its threads are speculative and may be rolled back, and because the degree of parallelism available in applications varies, the TLS workload exhibits unique characteristics that distinguish it from other workloads. In this paper, we present a detailed study of the performance, power consumption and thermal effects of these multithreaded architectures against those of a superscalar with equal chip area. A wide spectrum of design choices and tradeoffs is also studied using commonly used simulation techniques. We show that the SMT-based TLS architecture performs about 21% better than the best CMP-based configuration while suffering about 16% power overhead. In terms of the Energy-Delay-Squared product (ED2), SMT-based TLS performs about 26% better than the best CMP-based TLS configuration and 11% better than the superscalar architecture. However, the SMT-based TLS configuration causes more thermal stress than the CMP-based TLS architectures.

Citations: 13

A floating-point fused dot-product unit

2008 IEEE International Conference on Computer Design

Pub Date : 2008-11-10 DOI: 10.1109/ICCD.2008.4751896

H. Saleh, E. Swartzlander

A floating-point fused dot-product unit is presented that performs single-precision floating-point multiplication and addition operations on two pairs of data in only 150% of the time required for a conventional floating-point multiplication. When placed and routed in a 45 nm process, the fused dot-product unit occupied about 70% of the area needed to implement a parallel dot-product unit using conventional floating-point adders and multipliers. The fused dot-product unit is 27% faster than the conventional parallel approach. The numerical result of the fused unit is more accurate because only one rounding operation is needed, versus at least three for other approaches.
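
The accuracy claim (one rounding versus at least three) can be illustrated with a small software model. The sketch below is not the hardware design from the paper; the helper names `f32`, `dot2_unfused` and `dot2_fused` are hypothetical, and single precision is emulated by round-tripping Python floats through the standard `struct` module. The fused variant computes `a*b + c*d` exactly with `fractions.Fraction` and rounds once at the end.

```python
import struct
from fractions import Fraction


def f32(x) -> float:
    """Round a value to the nearest IEEE 754 single-precision number."""
    # Note: the value passes through a double first, which in rare cases can
    # differ from one direct rounding; that does not affect the example below.
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]


def dot2_unfused(a: float, b: float, c: float, d: float) -> float:
    """Three roundings: each product is rounded, then the sum is rounded."""
    return f32(f32(a * b) + f32(c * d))


def dot2_fused(a: float, b: float, c: float, d: float) -> float:
    """One rounding: the exact value of a*b + c*d is rounded once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c) * Fraction(d)
    return f32(exact)


# Inputs are exactly representable in single precision.
a = b = 1.0 + 2.0 ** -12   # a*b = 1 + 2**-11 + 2**-24 (needs more than 24 bits)
c = -1.0
d = 1.0 + 2.0 ** -11       # c*d = -(1 + 2**-11)

unfused = dot2_unfused(a, b, c, d)  # the 2**-24 term is lost when a*b is rounded
fused = dot2_fused(a, b, c, d)      # the 2**-24 term survives to the final rounding
```

With these inputs the two products nearly cancel, so the small term dropped by the intermediate rounding is the entire true result, which is why the unfused path returns zero while the fused path returns `2**-24`.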

Citations: 60

Area and power-delay efficient state retention pulse-triggered flip-flops with scan and reset capabilities

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751857

K. Shi

This paper presents two area and power-delay efficient state retention pulsed flops with scan and reset capabilities for sub-90 nm production low-power designs. The proposed flops also mitigate area overhead and integration complexity in SoC designs by implementing a single retention control signal and shared function/scan mode clock.

Citations: 3

On-chip high performance signaling using passive compensation

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751859

Yulei Zhang, Ling Zhang, A. Tsuchiya, M. Hashimoto, Chung-Kuan Cheng

To address the performance limitation brought by the scaling issues of on-chip global wires, a new configuration for global wiring using on-chip lossy transmission lines (T-lines) is proposed and optimized in this paper. First, we use passive compensation and repeated transceivers composed of a sense amplifier and an inverter chain to compensate for the distortion and attenuation of on-chip T-lines. Second, an optimization flow for designing this scheme based on eye-diagram prediction and sequential quadratic programming (SQP) is proposed. This flow is employed to study the latency, power dissipation and throughput of the new global wiring scheme as the technology scales from 90 nm to 22 nm. Compared with conventional repeater insertion methods, our experimental results demonstrate that, at the 22 nm technology node, the new scheme reduces the normalized delay by 85.1% and the normalized energy consumption by 98.8%. Furthermore, all the performance metrics scale as the technology advances, which makes this new signaling scheme a potential candidate to break the “interconnect wall” of digital system performance.

Citations: 3

Power-state-aware buffered tree construction

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751835

I. Jiang, Ming-Hua Wu

Interconnect delay and low power are two of the main issues in nano technology. Buffer insertion during routing effectively reduces interconnect delay; power state management and multiple supply voltages significantly lower power consumption. However, buffering without considering power states in multiple-supply-voltage designs may cause signal integrity problems. This paper is the first to incorporate power states into buffered tree construction. Based on a hierarchical approach combined with dynamic programming, we can simultaneously minimize power, satisfy timing constraints and maintain signal integrity.

Citations: 2

Probabilistic error propagation in logic circuits using the Boolean difference calculus

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751833

Nasir Mohyuddin, E. Pakbaznia, Massoud Pedram

A gate-level probabilistic error propagation model is presented that takes as input the Boolean function of the gate, the signal and error probabilities of the gate inputs, and the gate error probability, and produces the error probability at the output of the gate. The presented model uses the Boolean difference calculus and can be applied to calculating the error probability at the primary outputs of a multi-level Boolean circuit with a time complexity that is linear in the number of gates in the circuit. This is done by starting from the primary inputs and moving toward the primary outputs using a post-order traversal. Experimental results demonstrate the accuracy and efficiency of the proposed approach compared to other known methods for error calculation in VLSI circuits.
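
To make the inputs and output of such a gate-level model concrete, here is a minimal brute-force sketch. It is not the closed-form Boolean-difference expression from the paper: it enumerates every input vector and every flip pattern, so it is exponential in the gate fan-in. It also assumes that all inputs and errors are statistically independent. The function name `output_error_probability` and its parameters are hypothetical.

```python
from itertools import product
from typing import Callable, Sequence


def output_error_probability(
    f: Callable[..., int],
    signal_probs: Sequence[float],   # P(correct value of input i is 1)
    error_probs: Sequence[float],    # P(input i arrives flipped)
    gate_error: float,               # P(the gate itself flips its output)
) -> float:
    """Probability that the gate output differs from its error-free value.

    Assumption (not from the paper): inputs and errors are independent.
    """
    n = len(signal_probs)
    propagated = 0.0  # P(input errors alone change the output)
    for values in product((0, 1), repeat=n):
        p_values = 1.0
        for v, p in zip(values, signal_probs):
            p_values *= p if v else 1.0 - p
        for flips in product((0, 1), repeat=n):
            p_flips = 1.0
            for fl, e in zip(flips, error_probs):
                p_flips *= e if fl else 1.0 - e
            faulty = tuple(v ^ fl for v, fl in zip(values, flips))
            # An input error reaches the output only if it changes f.
            if f(*values) != f(*faulty):
                propagated += p_values * p_flips
    # The output is wrong if exactly one of (propagated error, gate error) occurs.
    return propagated * (1.0 - gate_error) + (1.0 - propagated) * gate_error


def and2(a: int, b: int) -> int:
    return a & b


def xor2(a: int, b: int) -> int:
    return a ^ b
```

For a two-input AND gate with only the first input error-prone, an error propagates only when the second input is 1, which is the condition captured by the Boolean difference of the gate function with respect to the first input. Applying the function gate by gate from primary inputs to primary outputs would mirror the traversal described in the abstract, under the same independence assumption.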

Citations: 106

Quantitative global dataflow analysis on virtual instruction set simulators for hardware/software co-design

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751888

Carsten Gremzow

One of the main challenges in system design, whether for high performance computing or for embedded systems, is to partition software for target architectures such as multi-core, heterogeneous, or even hardware/software co-design systems. Several compiler techniques handle partitioning and related problems by using static analysis and therefore have no means of capturing the quantity and dynamics of the global data flow, which is essential for extracting tasks or exploiting coarse-grained parallelism. In this paper, we present a novel solution for capturing and analyzing an application's quantitative data flow. The core part is the LLILA (Low Level Intermediate Language Analyzer) tool set, which automatically generates and augments self-profiling instruction set simulators from assembly-level descriptions for a virtual machine. During run time of the augmented program, several properties of data exchange (frequency, quantity and locality, reflecting inter-procedural communication) are captured at instruction level and, as a consequence, with the highest possible degree of accuracy.

Citations: 3

Highly reliable A/D converter using analog voting

2008 IEEE International Conference on Computer Design

Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751882

A. Namazi, S. Askari, M. Nourani

Analog and digital circuits are both prone to failure due to transient upsets, variations, etc. Redundancy techniques, such as N-tuple Modular Redundancy, have been widely used to correct faulty behavior of components and achieve high reliability for digital circuits, whereas not much has been done on the analog side. In this paper, we propose a redundancy-based fault-tolerant methodology for designing highly reliable analog-to-digital converters (ADCs). Our methodology employs redundant analog blocks and chooses the best result using an innovative analog voter. Experimental results are reported to verify the concepts, measure the system's reliability, and trade off reliability against cost and power.
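
The idea of voting among redundant analog blocks can be sketched in software. The abstract does not describe how the paper's voter works, so the example below substitutes a plain median voter, a common choice for redundant analog values, followed by an ideal quantizer. The function name `vote_and_quantize` and the parameters `v_ref` and `bits` are hypothetical.

```python
from statistics import median
from typing import Sequence


def vote_and_quantize(samples: Sequence[float], v_ref: float = 1.0, bits: int = 8) -> int:
    """Pick the median of redundant analog samples, then quantize it.

    Assumption (not from the paper): a median voter stands in for the
    analog voter, and the quantizer is an ideal unipolar ADC over [0, v_ref].
    """
    voted = median(samples)                   # a single outlier cannot win the vote
    clamped = min(max(voted, 0.0), v_ref)     # keep the value inside the input range
    code = int(clamped / v_ref * (2 ** bits)) # ideal truncating quantizer
    return min(code, 2 ** bits - 1)           # full-scale input maps to the top code
```

With three redundant blocks, one block producing a wildly wrong voltage (for example after a transient upset) does not change the output code, because the median ignores a single outlier.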

Citations: 12
