Arithmetic circuits require a verification process to prove that the gate level circuit is functionally equivalent to a high level specification. This paper presents an automatic equivalence checking technique to verify combinational arithmetic circuits at bit level. In order to efficiently verify gate level arithmetic circuits, we make use of computer algebra based approach so that the circuit and the specification are modeled in polynomial system and the verification problem is formulated as polynomial reduction techniques using Groebner basis of circuit polynomial corresponding ideal. To overcome costly Groebner basis computation as well as intensive polynomial reduction, we make use of a canonical decision diagram named Horner Expansion Diagram (HED), derive a suitable term order to represent and manipulate polynomials efficiently and find repetitive components based on automata. To evaluate the effectiveness of our verification technique, we have applied it to very large arithmetic circuits including multipliers. Preliminary experimental results show that the proposed verification technique is scalable enough so that large multipliers can efficiently be verified in reasonable run time and memory usage.
{"title":"Effective Combination of Algebraic Techniques and Decision Diagrams to Formally Verify Large Arithmetic Circuits","authors":"Farimah Farahmandi, B. Alizadeh, Z. Navabi","doi":"10.1109/ISVLSI.2014.109","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.109","url":null,"abstract":"Arithmetic circuits require a verification process to prove that the gate level circuit is functionally equivalent to a high level specification. This paper presents an automatic equivalence checking technique to verify combinational arithmetic circuits at bit level. In order to efficiently verify gate level arithmetic circuits, we make use of computer algebra based approach so that the circuit and the specification are modeled in polynomial system and the verification problem is formulated as polynomial reduction techniques using Groebner basis of circuit polynomial corresponding ideal. To overcome costly Groebner basis computation as well as intensive polynomial reduction, we make use of a canonical decision diagram named Horner Expansion Diagram (HED), derive a suitable term order to represent and manipulate polynomials efficiently and find repetitive components based on automata. To evaluate the effectiveness of our verification technique, we have applied it to very large arithmetic circuits including multipliers. Preliminary experimental results show that the proposed verification technique is scalable enough so that large multipliers can efficiently be verified in reasonable run time and memory usage.","PeriodicalId":405755,"journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126580549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cryptographic algorithms and their specific instantiation in computing engines leak information both through information channels and physical channels (side-channels). CMOS circuits implementing these cryptographic algorithms engines leak information through its physical attributes. The overlooked vulnerabilities in communication, cryptographic, or other system protocols and software, leak computation internal state inadvertently. These are the explicitly designed computational channels which are Turing channels. An unintended, lower barrier leakage occurs, however, through the side channels or physical channels. An actual implementation of an abstract algorithm goes through a model refinement to include the physical properties of the underlying computing machinery. Since there are no constraints placed on many of the physical attributes not visible in the algorithm specification in an abstract model, any refinement is acceptable. This is where the problem occurs. Some of these implementations reveal significant details about the private control and data flow of the underlying computation. In general there are two approaches to solve this problem. First approach is to design cryptographic algorithms which can tolerate some information leakage. Second approach is to remove the correlation between the leaked information and the secret. We propose a novel circuit design technique which uses the second approach.
{"title":"Glitch Resistant Private Circuits Design Using HORNS","authors":"M. Gomathisankaran, A. Tyagi","doi":"10.1109/ISVLSI.2014.93","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.93","url":null,"abstract":"Cryptographic algorithms and their specific instantiation in computing engines leak information both through information channels and physical channels (side-channels). CMOS circuits implementing these cryptographic algorithms engines leak information through its physical attributes. The overlooked vulnerabilities in communication, cryptographic, or other system protocols and software, leak computation internal state inadvertently. These are the explicitly designed computational channels which are Turing channels. An unintended, lower barrier leakage occurs, however, through the side channels or physical channels. An actual implementation of an abstract algorithm goes through a model refinement to include the physical properties of the underlying computing machinery. Since there are no constraints placed on many of the physical attributes not visible in the algorithm specification in an abstract model, any refinement is acceptable. This is where the problem occurs. Some of these implementations reveal significant details about the private control and data flow of the underlying computation. In general there are two approaches to solve this problem. First approach is to design cryptographic algorithms which can tolerate some information leakage. Second approach is to remove the correlation between the leaked information and the secret. We propose a novel circuit design technique which uses the second approach.","PeriodicalId":405755,"journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126085766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Floorplan design is an essential step in physical design of VLSI circuits and its results directly determine the performance of the final packing. Existing floorplanners that use slicing floorplans are efficient in runtime and capable of getting a tight and regular packing, which can significantly improve the routability of placement result. Nevertheless, in order to obtain satisfactory floorplans for analog or mixed-signal circuits, a series of constraints should be considered during this stage, including symmetry, and other general constraints. And, most of these constraints have only been achieved by using non-slicing floorplanners in the present. In this paper, we will present a unified method based on polish expression representation to handle these constraints in slicing floorplans. Experimental results demonstrate that our approach is effective and feasible in solving the constraint-driven slicing floorplan problems.
{"title":"Slicing Floorplans with Handling Symmetry and General Placement Constraints","authors":"Hongxia Zhou, Chiu-Wing Sham, Hailong Yao","doi":"10.1109/ISVLSI.2014.62","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.62","url":null,"abstract":"Floorplan design is an essential step in physical design of VLSI circuits and its results directly determine the performance of the final packing. Existing floorplanners that use slicing floorplans are efficient in runtime and capable of getting a tight and regular packing, which can significantly improve the routability of placement result. Nevertheless, in order to obtain satisfactory floorplans for analog or mixed-signal circuits, a series of constraints should be considered during this stage, including symmetry, and other general constraints. And, most of these constraints have only been achieved by using non-slicing floorplanners in the present. In this paper, we will present a unified method based on polish expression representation to handle these constraints in slicing floorplans. Experimental results demonstrate that our approach is effective and feasible in solving the constraint-driven slicing floorplan problems.","PeriodicalId":405755,"journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI","volume":"2100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127467216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi objective (MO) design space exploration (DSE) in high level synthesis (HLS) is a tedious task which administers the usage of intelligent decision making strategies at multiple stages to yield quality results. The problem of DSE becomes intractable and intricate when an auxiliary variable such as loop unrolling factor plays a vital role in the decision making process. This paper successfully solves the above problem by proposing the novel DSE approach for fully automated parallel (simultaneous) exploration of optimal datapath and unrolling factor (UF) during area-performance tradeoff in HLS. The proposed DSE approach is driven by hyper-dimensional particle swarm optimization (PSO). The major sub-contributions of this proposed algorithm includes: a) deriving a model for computation of execution delay of a loop unrolled control data flow graph (CDFG) based on resource constraint, without the necessity of tediously unrolling the entire CDFG in most cases, b) Consideration of loop unrolling and its impact on: i) control states and execution delay tradeoff during loop unrolling ii) area-execution delay tradeoff during the DSE process, c) novel comparative results for area-performance tradeoff with respect to multiple DFG and CDFG benchmarks. Results of the proposed approach indicated an average improvement in Quality of Results (QoR) of > 30% and reduction in runtime of > 92% compared to recent approaches.
{"title":"Swarm Intelligence Driven Simultaneous Adaptive Exploration of Datapath and Loop Unrolling Factor during Area-Performance Tradeoff","authors":"A. Sengupta, V. Mishra","doi":"10.1109/ISVLSI.2014.10","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.10","url":null,"abstract":"Multi objective (MO) design space exploration (DSE) in high level synthesis (HLS) is a tedious task which administers the usage of intelligent decision making strategies at multiple stages to yield quality results. The problem of DSE becomes intractable and intricate when an auxiliary variable such as loop unrolling factor plays a vital role in the decision making process. This paper successfully solves the above problem by proposing the novel DSE approach for fully automated parallel (simultaneous) exploration of optimal datapath and unrolling factor (UF) during area-performance tradeoff in HLS. The proposed DSE approach is driven by hyper-dimensional particle swarm optimization (PSO). The major sub-contributions of this proposed algorithm includes: a) deriving a model for computation of execution delay of a loop unrolled control data flow graph (CDFG) based on resource constraint, without the necessity of tediously unrolling the entire CDFG in most cases, b) Consideration of loop unrolling and its impact on: i) control states and execution delay tradeoff during loop unrolling ii) area-execution delay tradeoff during the DSE process, c) novel comparative results for area-performance tradeoff with respect to multiple DFG and CDFG benchmarks. Results of the proposed approach indicated an average improvement in Quality of Results (QoR) of > 30% and reduction in runtime of > 92% compared to recent approaches.","PeriodicalId":405755,"journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI","volume":"300 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122266502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Asynchronous circuits do not have the clock-related issues as in their synchronous counterparts, thereby enabling further design tradeoffs and in-operation adaptive adjustments for energy efficiency. This paper introduces the framework of a parallel delay-insensitive asynchronous platform implementing adaptive dynamic voltage scaling (DVS), which is based on the observation of system fullness and workload prediction. The voltage reference generator and the voltage regulator have been integrated with the Finite Impulse Response (FIR) cores using the IBM 130nm 8RF process. Results show that the platform exhibits energy savings across various input workload scenarios.
{"title":"Framework of an Adaptive Delay-Insensitive Asynchronous Platform for Energy Efficiency","authors":"Liang Men, B. Hollosi, J. Di","doi":"10.1109/ISVLSI.2014.39","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.39","url":null,"abstract":"Asynchronous circuits do not have the clock-related issues as in their synchronous counterparts, thereby enabling further design tradeoffs and in-operation adaptive adjustments for energy efficiency. This paper introduces the framework of a parallel delay-insensitive asynchronous platform implementing adaptive dynamic voltage scaling (DVS), which is based on the observation of system fullness and workload prediction. The voltage reference generator and the voltage regulator have been integrated with the Finite Impulse Response (FIR) cores using the IBM 130nm 8RF process. Results show that the platform exhibits energy savings across various input workload scenarios.","PeriodicalId":405755,"journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128030143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Cotter, Yan Fang, S. Levitan, D. Chiarulli, N. Vijaykrishnan
Recent advances in device technology have opened the door to exploration of new computing paradigms. These paradigms are geared to exploit the behavior of emerging devices such as coupled oscillator arrays. Coupled oscillators are one such device technology that offloads some of the computational complexity to the devices exploiting the physics of the device to perform computation, rather than relying purely on Boolean logic. In this work, we explore the variety of available coupled oscillator architectures. Additionally, we evaluate some basic computation using these architectures in the image processing domain, with an eye towards more complex algorithms and applications.
{"title":"Computational Architectures Based on Coupled Oscillators","authors":"M. Cotter, Yan Fang, S. Levitan, D. Chiarulli, N. Vijaykrishnan","doi":"10.1109/ISVLSI.2014.87","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.87","url":null,"abstract":"Recent advances in device technology have opened the door to exploration of new computing paradigms. These paradigms are geared to exploit the behavior of emerging devices such as coupled oscillator arrays. Coupled oscillators are one such device technology that offloads some of the computational complexity to the devices exploiting the physics of the device to perform computation, rather than relying purely on Boolean logic. In this work, we explore the variety of available coupled oscillator architectures. Additionally, we evaluate some basic computation using these architectures in the image processing domain, with an eye towards more complex algorithms and applications.","PeriodicalId":405755,"journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131376090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Energy consumption is a critical issue in embedded systems design. One way of being energy efficient is to complete the execution as early as possible. Multi-threaded processors reduce the execution time by exploiting both the instruction level and thread level parallelism, and offer an effective solution for energy saving. With a typical multi-threaded processor design, whenever the instruction pipeline has to stall due to high latency operations, the processor execution is switched to another thread so that the computing resources are effectively utilized and the processor throughput is improved. However, traditional designs use basic scheduling schemes, such as round robin, in thread selection, which is not suitable for real time execution and is inefficient for a set of threads that have unbalanced execution durations. In this paper, we propose 1) a thread scheduling approach that extends the life span of short threads to ensure the utilization efficiency of processor resources, and 2) zero-switching-time hardware design, to achieve a minimal execution time for a set of given applications. We demonstrate through experiment the effectiveness of our design.
{"title":"Energy-Aware Thread Scheduling for Embedded Multi-threaded Processors: Architectural Level Design and Implementation","authors":"M. Wickramasinghe, Hui Guo","doi":"10.1109/ISVLSI.2014.55","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.55","url":null,"abstract":"Energy consumption is a critical issue in embedded systems design. One way of being energy efficient is to complete the execution as early as possible. Multi-threaded processors reduce the execution time by exploiting both the instruction level and thread level parallelism, and offer an effective solution for energy saving. With a typical multi-threaded processor design, whenever the instruction pipeline has to stall due to high latency operations, the processor execution is switched to another thread so that the computing resources are effectively utilized and the processor throughput is improved. However, traditional designs use basic scheduling schemes, such as round robin, in thread selection, which is not suitable for real time execution and is inefficient for a set of threads that have unbalanced execution durations. In this paper, we propose 1) a thread scheduling approach that extends the life span of short threads to ensure the utilization efficiency of processor resources, and 2) zero-switching-time hardware design, to achieve a minimal execution time for a set of given applications. We demonstrate through experiment the effectiveness of our design.","PeriodicalId":405755,"journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116067001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The imminent limitations imposed by the utilizationwall phenomenon compel architects to introduce heterogeneity in on-chip systems. The existing approaches to system-level synthesis of chip multiprocessors (CMPs) typically ignore physical aspects, leading to degradation in efficiency during later design stages. In this work, we propose and study the concept of patterned heterogeneous CMPs to handle the complexity of synthesis and extenuate the severity of physical aspects in post-synthesis engineering. A patterned CMP is a variation of a tiled architecture, in which layout regularity is enforced by fixing the tile size. Nevertheless, individual tiles may be implemented with different patterns, enabling a parameterizable degree of heterogeneity, which we leverage for synthesis efficiency. The results of synthesis demonstrate the effectiveness of patterned heterogeneity, suggesting it as a powerful tool for mastering the complexity of future on-chip systems.
{"title":"Patterned Heterogeneous CMPs: The Case for Regularity-Driven System-Level Synthesis","authors":"N. Nikitin, Magnus Jahre","doi":"10.1109/ISVLSI.2014.68","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.68","url":null,"abstract":"The imminent limitations imposed by the utilizationwall phenomenon compel architects to introduce heterogeneity in on-chip systems. The existing approaches to system-level synthesis of chip multiprocessors (CMPs) typically ignore physical aspects, leading to degradation in efficiency during later design stages. In this work, we propose and study the concept of patterned heterogeneous CMPs to handle the complexity of synthesis and extenuate the severity of physical aspects in post-synthesis engineering. A patterned CMP is a variation of a tiled architecture, in which layout regularity is enforced by fixing the tile size. Nevertheless, individual tiles may be implemented with different patterns, enabling a parameterizable degree of heterogeneity, which we leverage for synthesis efficiency. The results of synthesis demonstrate the effectiveness of patterned heterogeneity, suggesting it as a powerful tool for mastering the complexity of future on-chip systems.","PeriodicalId":405755,"journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124776472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While multi-core computing has become pervasive, scaling single core computations to multi-core computations remains a challenge. This paper aims to accelerate RTL and functional gate-level simulation in the current multi-core computing environment. This work addresses two types of partitioning schemes for multi-core simulation: functional, and domain-based. We discuss the limitations of functional partitioning, offered by new commercial multi-core simulators to speedup functional gate-level simulations. We also present a novel solution to increase RTL and functional gate-level simulation performance based on domain partitioning. This is the first known work that improves simulation performance by leveraging open source technology against commercial simulators.
{"title":"Parallel Multi-core Verilog HDL Simulation Using Domain Partitioning","authors":"T. B. Ahmad, M. Ciesielski","doi":"10.1109/ISVLSI.2014.47","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.47","url":null,"abstract":"While multi-core computing has become pervasive, scaling single core computations to multi-core computations remains a challenge. This paper aims to accelerate RTL and functional gate-level simulation in the current multi-core computing environment. This work addresses two types of partitioning schemes for multi-core simulation: functional, and domain-based. We discuss the limitations of functional partitioning, offered by new commercial multi-core simulators to speedup functional gate-level simulations. We also present a novel solution to increase RTL and functional gate-level simulation performance based on domain partitioning. This is the first known work that improves simulation performance by leveraging open source technology against commercial simulators.","PeriodicalId":405755,"journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114905896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
QR decomposition has been widely used in many signal processing applications to solve linear inverse problems. However, QR decomposition is considered a computationally expensive process, and its sequential implementations fail to meet the requirements of many time-sensitive applications. The Householder transformation and the Givens rotation are the most popular techniques to conduct QR decomposition. Each of these approaches have their own strengths and weakness. The Householder transformation lends itself to efficient sequential implementation, however its inherent data dependencies complicate parallelization. On the other hand, the structure of Givens rotation provides many opportunities for concurrency, but is typically limited by the availability of computing resources. We propose a deeply pipelined reconfigurable architecture that can be dynamically configured to perform either approach in a manner that takes advantage of the strengths of each. At runtime, the input matrix is first partitioned into numerous sub-matrices. Our architecture then performs parallel Householder transformations on the sub-matrices in the same column block, which is followed by parallel Givens rotations to annihilate the remaining unneeded individual off-diagonals. Analysis of our design indicates the potential to achieve a performance of 10.5 GFLOPS with speedups of up to 1.46fiX, 1.15Xfi and 13.75fiX compared to the MKL implementation, a recent FPGA design and a Matlab solution, respectively.
{"title":"A Reconfigurable Architecture for QR Decomposition Using a Hybrid Approach","authors":"Xinying Wang, Phillip H. Jones, Joseph Zambreno","doi":"10.1109/ISVLSI.2014.92","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.92","url":null,"abstract":"QR decomposition has been widely used in many signal processing applications to solve linear inverse problems. However, QR decomposition is considered a computationally expensive process, and its sequential implementations fail to meet the requirements of many time-sensitive applications. The Householder transformation and the Givens rotation are the most popular techniques to conduct QR decomposition. Each of these approaches have their own strengths and weakness. The Householder transformation lends itself to efficient sequential implementation, however its inherent data dependencies complicate parallelization. On the other hand, the structure of Givens rotation provides many opportunities for concurrency, but is typically limited by the availability of computing resources. We propose a deeply pipelined reconfigurable architecture that can be dynamically configured to perform either approach in a manner that takes advantage of the strengths of each. At runtime, the input matrix is first partitioned into numerous sub-matrices. Our architecture then performs parallel Householder transformations on the sub-matrices in the same column block, which is followed by parallel Givens rotations to annihilate the remaining unneeded individual off-diagonals. Analysis of our design indicates the potential to achieve a performance of 10.5 GFLOPS with speedups of up to 1.46fiX, 1.15Xfi and 13.75fiX compared to the MKL implementation, a recent FPGA design and a Matlab solution, respectively.","PeriodicalId":405755,"journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI","volume":"232 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122058275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}