2014 IEEE Computer Society Annual Symposium on VLSI: Latest Papers


Effective Combination of Algebraic Techniques and Decision Diagrams to Formally Verify Large Arithmetic Circuits

2014 IEEE Computer Society Annual Symposium on VLSI

Pub Date : 2014-07-09 DOI: 10.1109/ISVLSI.2014.109

Farimah Farahmandi, B. Alizadeh, Z. Navabi

Arithmetic circuits require a verification process to prove that the gate-level circuit is functionally equivalent to a high-level specification. This paper presents an automatic equivalence checking technique to verify combinational arithmetic circuits at the bit level. To verify gate-level arithmetic circuits efficiently, we use a computer-algebra-based approach in which the circuit and the specification are modeled as a polynomial system, and the verification problem is formulated as polynomial reduction using a Groebner basis of the ideal generated by the circuit polynomials. To avoid costly Groebner basis computation and intensive polynomial reduction, we use a canonical decision diagram named Horner Expansion Diagram (HED), derive a suitable term order to represent and manipulate polynomials efficiently, and find repetitive components based on automata. To evaluate the effectiveness of our verification technique, we have applied it to very large arithmetic circuits, including multipliers. Preliminary experimental results show that the proposed technique scales well enough that large multipliers can be verified with reasonable run time and memory usage.

Citations: 10
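
The algebraic formulation in the abstract above can be illustrated with a small sketch. It is not the authors' HED-based tool: it assumes a hand-written 2-bit multiplier netlist, models each gate as a multilinear integer polynomial over Boolean variables (using x*x = x), and replaces Groebner basis reduction with direct substitution of gate polynomials, which is what reduction amounts to under a topological term order. All function names are hypothetical.

```python
# Multilinear polynomials over the integers: {frozenset_of_variables: coefficient}.
# Boolean variables satisfy x*x = x, so a monomial is just a set of variables.

def var(name):
    return {frozenset([name]): 1}

def add(p, q, k=1):
    """Return p + k*q, dropping terms whose coefficient cancels to zero."""
    out = dict(p)
    for mono, coeff in q.items():
        total = out.get(mono, 0) + k * coeff
        if total:
            out[mono] = total
        else:
            out.pop(mono, None)
    return out

def mul(p, q):
    """Return p*q, using x*x = x (set union of the monomials)."""
    out = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out = add(out, {m1 | m2: c1 * c2})
    return out

def AND(a, b):
    return mul(a, b)                        # a*b

def XOR(a, b):
    return add(add(a, b), mul(a, b), -2)    # a + b - 2ab

def OR(a, b):
    return add(add(a, b), mul(a, b), -1)    # a + b - ab

def multiplier_remainder(z2_gate=XOR):
    """Remainder of (circuit output word) - (a * b) for a 2-bit multiplier.

    An empty remainder means the netlist matches the specification.
    z2_gate lets the caller inject a wrong gate to see a nonzero remainder.
    """
    a0, a1, b0, b1 = (var(n) for n in ("a0", "a1", "b0", "b1"))
    p00, p01, p10, p11 = AND(a0, b0), AND(a0, b1), AND(a1, b0), AND(a1, b1)
    z0 = p00
    z1, c1 = XOR(p01, p10), AND(p01, p10)
    z2, z3 = z2_gate(p11, c1), AND(p11, c1)
    impl = {}
    for weight, bit in ((1, z0), (2, z1), (4, z2), (8, z3)):
        impl = add(impl, bit, weight)       # z0 + 2*z1 + 4*z2 + 8*z3
    spec = mul(add(a0, a1, 2), add(b0, b1, 2))   # (a0 + 2*a1) * (b0 + 2*b1)
    return add(impl, spec, -1)

print(multiplier_remainder())      # correct netlist
print(multiplier_remainder(OR))    # netlist with a wrong gate on z2
```

Because a multilinear polynomial is a canonical form for a function of Boolean inputs, a zero remainder is equivalent to functional correctness in this small setting. The scalability measures the abstract describes (HED, term ordering, detection of repeated components) are outside the scope of this sketch.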

Glitch Resistant Private Circuits Design Using HORNS

2014 IEEE Computer Society Annual Symposium on VLSI

Pub Date : 2014-07-09 DOI: 10.1109/ISVLSI.2014.93

M. Gomathisankaran, A. Tyagi

Cryptographic algorithms and their specific instantiations in computing engines leak information through both information channels and physical channels (side channels). CMOS circuits implementing these cryptographic engines leak information through their physical attributes. Overlooked vulnerabilities in communication, cryptographic, or other system protocols and software inadvertently leak the internal state of the computation. These are the explicitly designed computational channels, which are Turing channels. An unintended, lower-barrier leakage occurs, however, through the side channels or physical channels. An actual implementation of an abstract algorithm goes through a model refinement to include the physical properties of the underlying computing machinery. Since the abstract model places no constraints on the many physical attributes that are not visible in the algorithm specification, any refinement is acceptable. This is where the problem occurs. Some of these implementations reveal significant details about the private control and data flow of the underlying computation. In general, there are two approaches to solving this problem. The first is to design cryptographic algorithms that can tolerate some information leakage. The second is to remove the correlation between the leaked information and the secret. We propose a novel circuit design technique that uses the second approach.

Citations: 4
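
The "second approach" in the abstract above, removing the correlation between what leaks and the secret, is commonly illustrated with first-order Boolean masking. The sketch below shows that generic idea only. The abstract does not describe how HORNS works, so nothing here should be read as the authors' circuit technique, and the sketch does not model glitches at all. The function names are hypothetical.

```python
import secrets

def mask(secret_bit):
    """Split a bit into two shares (r, secret ^ r) with a fresh random mask r."""
    r = secrets.randbits(1)
    return r, secret_bit ^ r

def masked_xor(x_shares, y_shares):
    """XOR two shared bits share by share; the secrets are never recombined."""
    return x_shares[0] ^ y_shares[0], x_shares[1] ^ y_shares[1]

def unmask(shares):
    return shares[0] ^ shares[1]

def share_distribution(secret_bit):
    """Count how often the second share is 0 or 1 over all equally likely masks."""
    counts = [0, 0]
    for r in (0, 1):
        counts[secret_bit ^ r] += 1
    return counts

# Each share on its own is uniform whatever the secret is,
# so observing a single share reveals nothing about the secret.
print(share_distribution(0), share_distribution(1))
print(unmask(masked_xor(mask(1), mask(0))))
```

In real CMOS logic, transient glitches can momentarily combine shares and reintroduce the correlation, which is the problem the paper's title refers to.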

Slicing Floorplans with Handling Symmetry and General Placement Constraints

2014 IEEE Computer Society Annual Symposium on VLSI

Pub Date : 2014-07-09 DOI: 10.1109/ISVLSI.2014.62

Hongxia Zhou, Chiu-Wing Sham, Hailong Yao

Floorplan design is an essential step in the physical design of VLSI circuits, and its results directly determine the performance of the final packing. Existing floorplanners that use slicing floorplans are efficient in runtime and capable of producing a tight and regular packing, which can significantly improve the routability of the placement result. Nevertheless, to obtain satisfactory floorplans for analog or mixed-signal circuits, a series of constraints should be considered during this stage, including symmetry and other general constraints. So far, most of these constraints have been handled only by non-slicing floorplanners. In this paper, we present a unified method based on the Polish expression representation to handle these constraints in slicing floorplans. Experimental results demonstrate that our approach is effective and feasible in solving constraint-driven slicing floorplan problems.

Citations: 4
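
To make the Polish expression representation mentioned above concrete, here is a minimal sketch that evaluates a postfix slicing expression into a bounding box. It assumes the usual convention that `V` is a vertical cut (blocks side by side) and `H` is a horizontal cut (blocks stacked); the module names and sizes are invented, and the symmetry and placement constraints that the paper handles are not modeled.

```python
def evaluate_polish(expr, sizes):
    """Return (width, height) of the slicing floorplan described by expr.

    expr  : postfix token list, e.g. ["A", "B", "V", "C", "H"]
    sizes : {module_name: (width, height)}; names must not be "H" or "V"
    """
    stack = []
    for tok in expr:
        if tok in ("H", "V"):
            w2, h2 = stack.pop()
            w1, h1 = stack.pop()
            if tok == "V":                       # side by side
                stack.append((w1 + w2, max(h1, h2)))
            else:                                # stacked
                stack.append((max(w1, w2), h1 + h2))
        else:
            stack.append(sizes[tok])
    if len(stack) != 1:
        raise ValueError("not a well-formed Polish expression")
    return stack[0]

sizes = {"A": (2, 3), "B": (2, 1), "C": (4, 2)}
width, height = evaluate_polish(["A", "B", "V", "C", "H"], sizes)
print(width, height, width * height)   # A beside B, that pair stacked with C
```

A floorplanner typically perturbs the expression (swapping operands, complementing operator chains) and re-evaluates the cost with a routine like this one inside a search loop such as simulated annealing.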

Swarm Intelligence Driven Simultaneous Adaptive Exploration of Datapath and Loop Unrolling Factor during Area-Performance Tradeoff

2014 IEEE Computer Society Annual Symposium on VLSI

Pub Date : 2014-07-09 DOI: 10.1109/ISVLSI.2014.10

A. Sengupta, V. Mishra

Multi-objective (MO) design space exploration (DSE) in high-level synthesis (HLS) is a tedious task that requires intelligent decision-making strategies at multiple stages to yield quality results. The DSE problem becomes intractable and intricate when an auxiliary variable such as the loop unrolling factor plays a vital role in the decision-making process. This paper solves the above problem by proposing a novel DSE approach for fully automated parallel (simultaneous) exploration of the optimal datapath and unrolling factor (UF) during area-performance tradeoff in HLS. The proposed DSE approach is driven by hyper-dimensional particle swarm optimization (PSO). The major sub-contributions of the proposed algorithm include: a) a model for computing the execution delay of a loop-unrolled control data flow graph (CDFG) under resource constraints, without the need to tediously unroll the entire CDFG in most cases; b) consideration of loop unrolling and its impact on i) the tradeoff between control states and execution delay during loop unrolling and ii) the area versus execution delay tradeoff during the DSE process; c) novel comparative results for area-performance tradeoff on multiple DFG and CDFG benchmarks. Results of the proposed approach indicate an average improvement in Quality of Results (QoR) of more than 30% and a reduction in runtime of more than 92% compared to recent approaches.

Citations: 14
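
Contribution (a) above, estimating the delay of an unrolled loop without building the unrolled graph, can be pictured with a toy closed-form model. This is not the authors' model. It assumes that every operation takes one cycle, that operations are mutually independent, that all functional units are interchangeable, and that leftover iterations run one at a time after the unrolled passes. The function and parameter names are hypothetical.

```python
import math

def unrolled_loop_cycles(iterations, unroll_factor, ops_per_iteration, functional_units):
    """Estimate total cycles for a loop unrolled by unroll_factor.

    Toy assumptions: single-cycle, independent operations and identical units,
    so a body of k operations needs ceil(k / functional_units) cycles.
    """
    full_passes, leftover = divmod(iterations, unroll_factor)
    body_cycles = math.ceil(unroll_factor * ops_per_iteration / functional_units)
    tail_cycles = leftover * math.ceil(ops_per_iteration / functional_units)
    return full_passes * body_cycles + tail_cycles

# Delay versus unrolling factor for 10 iterations, 3 ops each, 2 units.
for uf in (1, 2, 4):
    print(uf, unrolled_loop_cycles(10, uf, 3, 2))
```

Even in this simplified model the delay is not monotonic in the unrolling factor, because leftover iterations pay the non-unrolled cost. That is one reason a search procedure such as PSO would explore the unrolling factor together with the resource allocation instead of fixing it in advance.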

Framework of an Adaptive Delay-Insensitive Asynchronous Platform for Energy Efficiency

2014 IEEE Computer Society Annual Symposium on VLSI

Pub Date : 2014-07-09 DOI: 10.1109/ISVLSI.2014.39

Liang Men, B. Hollosi, J. Di

Asynchronous circuits do not have the clock-related issues of their synchronous counterparts, which enables further design tradeoffs and in-operation adaptive adjustments for energy efficiency. This paper introduces the framework of a parallel delay-insensitive asynchronous platform implementing adaptive dynamic voltage scaling (DVS) based on the observation of system fullness and on workload prediction. The voltage reference generator and the voltage regulator have been integrated with the Finite Impulse Response (FIR) cores using the IBM 130nm 8RF process. Results show that the platform exhibits energy savings across various input workload scenarios.

Citations: 2
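
The control idea in the abstract above, choosing a supply voltage from observed buffer fullness and a workload prediction, might be sketched as follows. Everything specific here is an assumption made for illustration: the moving-average predictor, the 25% and 75% thresholds, and the three voltage levels are invented, and the abstract does not state how the authors' platform makes this decision.

```python
def predict_workload(previous_prediction, observed_arrivals, alpha=0.5):
    """Exponentially weighted moving average of arrivals per interval."""
    return alpha * observed_arrivals + (1 - alpha) * previous_prediction

def select_voltage(fifo_fill, predicted_arrivals, fifo_depth, levels=(0.6, 0.9, 1.2)):
    """Pick a supply level (volts, hypothetical) from occupancy plus expected arrivals."""
    pressure = (fifo_fill + predicted_arrivals) / fifo_depth
    if pressure < 0.25:
        return levels[0]      # nearly idle: lowest voltage
    if pressure < 0.75:
        return levels[1]      # moderate load
    return levels[2]          # buffer about to fill: highest voltage

prediction = predict_workload(previous_prediction=2.0, observed_arrivals=4)
print(prediction, select_voltage(fifo_fill=8, predicted_arrivals=prediction, fifo_depth=16))
```

A delay-insensitive circuit keeps working correctly as the supply changes, only faster or slower, which is why a controller of this kind can adjust the voltage during operation without retiming a clock.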

Computational Architectures Based on Coupled Oscillators

2014 IEEE Computer Society Annual Symposium on VLSI

Pub Date : 2014-07-09 DOI: 10.1109/ISVLSI.2014.87

M. Cotter, Yan Fang, S. Levitan, D. Chiarulli, N. Vijaykrishnan

Recent advances in device technology have opened the door to the exploration of new computing paradigms. These paradigms are geared to exploit the behavior of emerging devices such as coupled oscillator arrays. Coupled oscillators are one such device technology: they offload some of the computational complexity to the devices, exploiting the physics of the device to perform computation rather than relying purely on Boolean logic. In this work, we explore the variety of available coupled oscillator architectures. Additionally, we evaluate some basic computation using these architectures in the image processing domain, with an eye towards more complex algorithms and applications.

Citations: 28
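
As a rough illustration of how oscillator physics can stand in for a computation, the sketch below integrates the phase difference of two coupled oscillators under the textbook Kuramoto model. This model is an assumption chosen for simplicity, not necessarily the device model used in the paper. With zero frequency mismatch the pair locks in phase, and with a small mismatch the locked phase difference grows with it, so the steady state acts as an analog measure of how well two inputs match.

```python
import math

def simulate_pair(delta0, coupling, freq_mismatch=0.0, dt=0.01, steps=2000):
    """Euler-integrate the phase difference delta of two Kuramoto oscillators.

    d(delta)/dt = freq_mismatch - 2 * coupling * sin(delta)
    """
    delta = delta0
    for _ in range(steps):
        delta += dt * (freq_mismatch - 2.0 * coupling * math.sin(delta))
    return delta

print(simulate_pair(1.0, coupling=1.0))                     # identical inputs: locks near 0
print(simulate_pair(0.0, coupling=1.0, freq_mismatch=1.0))  # mismatch: locks near pi/6
```

The locked value satisfies sin(delta) = freq_mismatch / (2 * coupling), so locking is only possible while the mismatch stays below twice the coupling strength.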

Patterned Heterogeneous CMPs: The Case for Regularity-Driven System-Level Synthesis

2014 IEEE Computer Society Annual Symposium on VLSI

Pub Date : 2014-07-09 DOI: 10.1109/ISVLSI.2014.68

N. Nikitin, Magnus Jahre

The imminent limitations imposed by the utilization-wall phenomenon compel architects to introduce heterogeneity in on-chip systems. The existing approaches to system-level synthesis of chip multiprocessors (CMPs) typically ignore physical aspects, leading to degradation in efficiency during later design stages. In this work, we propose and study the concept of patterned heterogeneous CMPs to handle the complexity of synthesis and extenuate the severity of physical aspects in post-synthesis engineering. A patterned CMP is a variation of a tiled architecture, in which layout regularity is enforced by fixing the tile size. Nevertheless, individual tiles may be implemented with different patterns, enabling a parameterizable degree of heterogeneity, which we leverage for synthesis efficiency. The results of synthesis demonstrate the effectiveness of patterned heterogeneity, suggesting it as a powerful tool for mastering the complexity of future on-chip systems.

Citations: 0

Parallel Multi-core Verilog HDL Simulation Using Domain Partitioning

2014 IEEE Computer Society Annual Symposium on VLSI

Pub Date : 2014-07-09 DOI: 10.1109/ISVLSI.2014.47

T. B. Ahmad, M. Ciesielski

While multi-core computing has become pervasive, scaling single-core computations to multi-core computations remains a challenge. This paper aims to accelerate RTL and functional gate-level simulation in the current multi-core computing environment. This work addresses two types of partitioning schemes for multi-core simulation: functional and domain-based. We discuss the limitations of functional partitioning, offered by new commercial multi-core simulators to speed up functional gate-level simulations. We also present a novel solution to increase RTL and functional gate-level simulation performance based on domain partitioning. This is the first known work that improves simulation performance by leveraging open source technology against commercial simulators.

Citations: 2

Energy-Aware Thread Scheduling for Embedded Multi-threaded Processors: Architectural Level Design and Implementation

2014 IEEE Computer Society Annual Symposium on VLSI

Pub Date : 2014-07-09 DOI: 10.1109/ISVLSI.2014.55

M. Wickramasinghe, Hui Guo

Energy consumption is a critical issue in embedded systems design. One way of being energy efficient is to complete the execution as early as possible. Multi-threaded processors reduce the execution time by exploiting both instruction-level and thread-level parallelism, and offer an effective solution for energy saving. With a typical multi-threaded processor design, whenever the instruction pipeline has to stall due to high-latency operations, the processor execution is switched to another thread so that the computing resources are effectively utilized and the processor throughput is improved. However, traditional designs use basic scheduling schemes, such as round robin, for thread selection; these are not suitable for real-time execution and are inefficient for a set of threads that have unbalanced execution durations. In this paper, we propose 1) a thread scheduling approach that extends the life span of short threads to ensure the utilization efficiency of processor resources, and 2) a zero-switching-time hardware design, to achieve a minimal execution time for a set of given applications. We demonstrate the effectiveness of our design through experiments.

Citations: 3

A Reconfigurable Architecture for QR Decomposition Using a Hybrid Approach

2014 IEEE Computer Society Annual Symposium on VLSI

Pub Date : 2014-07-09 DOI: 10.1109/ISVLSI.2014.92

Xinying Wang, Phillip H. Jones, Joseph Zambreno

QR decomposition has been widely used in many signal processing applications to solve linear inverse problems. However, QR decomposition is considered a computationally expensive process, and its sequential implementations fail to meet the requirements of many time-sensitive applications. The Householder transformation and the Givens rotation are the most popular techniques for conducting QR decomposition. Each of these approaches has its own strengths and weaknesses. The Householder transformation lends itself to efficient sequential implementation; however, its inherent data dependencies complicate parallelization. On the other hand, the structure of the Givens rotation provides many opportunities for concurrency, but is typically limited by the availability of computing resources. We propose a deeply pipelined reconfigurable architecture that can be dynamically configured to perform either approach in a manner that takes advantage of the strengths of each. At runtime, the input matrix is first partitioned into numerous sub-matrices. Our architecture then performs parallel Householder transformations on the sub-matrices in the same column block, followed by parallel Givens rotations to annihilate the remaining unneeded individual off-diagonals. Analysis of our design indicates the potential to achieve a performance of 10.5 GFLOPS with speedups of up to 1.46×, 1.15×, and 13.75× compared to the MKL implementation, a recent FPGA design, and a Matlab solution, respectively.

Citations: 7
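
For readers unfamiliar with the Givens half of the hybrid scheme described above, here is a minimal software sketch of QR decomposition by Givens rotations. It is a plain sequential reference, not the paper's pipelined architecture, and it omits the block-wise Householder stage. Each rotation touches only two rows, which is why rotations on disjoint row pairs can run concurrently in hardware.

```python
import math

def givens_qr(a):
    """Return (q, r) with a = q * r, q orthogonal and r upper triangular."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for col in range(n):
        # Zero the sub-diagonal entries of this column from the bottom up.
        for row in range(m - 1, col, -1):
            x, y = r[row - 1][col], r[row][col]
            if y == 0.0:
                continue
            h = math.hypot(x, y)
            c, s = x / h, y / h
            # Rotate rows (row-1, row) of r so that r[row][col] becomes zero.
            for k in range(n):
                u, v = r[row - 1][k], r[row][k]
                r[row - 1][k] = c * u + s * v
                r[row][k] = -s * u + c * v
            # Apply the transposed rotation to the matching columns of q.
            for k in range(m):
                u, v = q[k][row - 1], q[k][row]
                q[k][row - 1] = c * u + s * v
                q[k][row] = -s * u + c * v
    return q, r

a = [[12.0, -51.0, 4.0], [6.0, 167.0, -68.0], [-4.0, 24.0, -41.0]]
q, r = givens_qr(a)
for row in r:
    print([round(value, 6) for value in row])
```

The Householder approach would instead clear an entire column below the diagonal with one reflection, at the cost of a dependency on the whole trailing sub-matrix.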
