"Area-efficient instruction set synthesis for reconfigurable system-on-chip designs," Proceedings of the 41st Design Automation Conference, 2004. doi:10.1145/996566.996679
P. Brisk, A. Kaplan, M. Sarrafzadeh
Silicon compilers are often used in conjunction with Field Programmable Gate Arrays (FPGAs) to deliver flexibility, fast prototyping, and accelerated time-to-market. Many of these compilers produce hardware that is larger than necessary, as they do not allow instructions to share hardware resources. This study presents an efficient heuristic which transforms a set of custom instructions into a single hardware datapath on which they can execute. Our approach is based on the classic problems of finding the longest common subsequence and substring of two (or more) sequences. This heuristic produces circuits which are as much as 85.33% smaller than those synthesized by integer linear programming (ILP) approaches which do not explore resource sharing. On average, we obtained 55.41% area reduction for pipelined datapaths, and 66.92% area reduction for VLIW datapaths. Our solution is simple and effective, and can easily be integrated into an existing silicon compiler.
"A fast hardware/software co-verification method for system-on-a-chip by using a C/C++ simulator and FPGA emulator with shared register communication," Proceedings of the 41st Design Automation Conference, 2004. doi:10.1145/996566.996655
Yuichi Nakamura, Koh Hosokawa, I. Kuroda, Ko Yoshikawa, T. Yoshimura
This paper describes a new hardware/software co-verification method for System-On-a-Chip, based on the integration of a C/C++ simulator and an inexpensive FPGA emulator. Communication between the simulator and emulator occurs via a flexible interface based on shared communication registers. This method enables easy debugging, rich portability, and high verification speed, at a low cost. We describe the application of this environment to the verification of three different complex commercial SoCs, supporting concurrent hardware and embedded software development. In these projects, our verification methodology was used to perform complete system verification at 0.2-1.1 MHz, while supporting full graphical interface functions such as "waveform" or "signal dump" viewers, and debugging functions such as "step" or "break".
{"title":"A fast hardware/software co-verification method for systern-on-a-chip by using a C/C++ simulator and FPGA emulator with shared register communication","authors":"Yuichi Nakamura, Koh Hosokawa, I. Kuroda, Ko Yoshikawa, T. Yoshimura","doi":"10.1145/996566.996655","DOIUrl":"https://doi.org/10.1145/996566.996655","url":null,"abstract":"This paper describes a new hardware/software co-verification method for System-On-a-Chip, based on the integration of a C/C++ simulator and an inexpensive FPGA emulator. Communication between the simulator and emulator occurs via a flexible interface based on shared communication registers. This method enables easy debugging, rich portability, and high verification speed, at a low cost. We describe the application of this environment to the verification of three different complex commercial SoCs, supporting concurrent hardware and embedded software development. In these projects, our verification methodology was used to perform complete system verification at 0.2-1.1 MHz, while supporting full graphical interface functions such as \"waveform\" or \"signal dump\" viewers, and debugging functions such as \"step\" or \"break\".","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133136033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"An algorithm for converting floating-point computations to fixed-point in MATLAB based FPGA design," Proceedings of the 41st Design Automation Conference, 2004. doi:10.1145/996566.996701
Sanghamitra Roy, P. Banerjee
Most practical FPGA designs of digital signal processing applications are limited to fixed-point arithmetic owing to the cost and complexity of floating-point hardware. While mapping DSP applications onto FPGAs, a DSP algorithm designer, who often develops his applications in MATLAB, must determine the dynamic range and desired precision of input, intermediate and output signals in a design implementation to ensure that the algorithm fidelity criteria are met. The first step in a flow to map MATLAB applications into hardware is the conversion of the floating-point MATLAB algorithm into a fixed-point version. This paper describes an approach to automate this conversion for mapping to FPGAs, profiling the expected inputs to estimate errors. Our algorithm attempts to minimize the hardware resources while constraining the quantization error within a specified limit.
"Compact thermal modeling for temperature-aware design," Proceedings of the 41st Design Automation Conference, 2004. doi:10.1145/996566.996800
Wei Huang, M. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, S. Velusamy
Thermal design in sub-100nm technologies is one of the major challenges to the CAD community. In this paper, we first introduce the idea of temperature-aware design. We then propose a compact thermal model which can be integrated with modern CAD tools to achieve a temperature-aware design methodology. Finally, we use the compact thermal model in a case study of microprocessor design to show the importance of using temperature as a guideline for the design. Results from our thermal model show that a temperature-aware design approach can provide more accurate estimations, and therefore better decisions and faster design convergence.
{"title":"Compact thermal modeling for temperature-aware design","authors":"Wei Huang, M. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, S. Velusamy","doi":"10.1145/996566.996800","DOIUrl":"https://doi.org/10.1145/996566.996800","url":null,"abstract":"Thermal design in sub-100nm technologies is one of the major challenges to the CAD community. In this paper, we first introduce the idea of temperature-aware design. We then propose a compact thermal model which can be integrated with modern CAD tools to achieve a temperature-aware design methodology. Finally, we use the compact thermal model in a case study of microprocessor design to show the importance of using temperature as a guideline for the design. Results from our thermal model show that a temperature-aware design approach can provide more accurate estimations, and therefore better decisions and faster design convergence.","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"236 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132055884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Statistical optimization of leakage power considering process variations using dual-Vth and sizing," Proceedings of the 41st Design Automation Conference, 2004. doi:10.1145/996566.996775
A. Srivastava, D. Sylvester, D. Blaauw
Increasing levels of process variability in sub-100nm CMOS design have become a critical concern for performance- and power-constrained designs. In this paper, we propose a new statistically aware dual-Vt and sizing optimization that considers both the variability in performance and leakage of a design. While extensive work has been performed in the past on statistical analysis methods, circuit optimization is still largely performed using deterministic methods. We show in this paper that deterministic optimization quickly loses effectiveness for stringent performance and leakage constraints in designs with significant variability. We then propose a statistically aware dual-Vt and sizing algorithm where both delay constraints and sensitivity computations are performed in a statistical manner. We demonstrate that using this statistically aware optimization, leakage power can be reduced by 15-35% compared to traditional deterministic analysis. The improvements increase for strict delay constraints, making statistical optimization especially important for high performance designs.
"Fast statistical timing analysis handling arbitrary delay correlations," Proceedings of the 41st Design Automation Conference, 2004. doi:10.1145/996566.996664
M. Orshansky, A. Bandyopadhyay
An efficient statistical timing analysis algorithm that can handle arbitrary (spatial and structural) causes of delay correlation is described. The algorithm derives the entire cumulative distribution function of the circuit delay using a new mathematical formulation. Spatial as well as structural correlations between gate and wire delays can be taken into account. The algorithm can handle node delays described by non-Gaussian distributions. Because the analytical computation of an exact cumulative distribution function for a probabilistic graph with arbitrary distributions is infeasible, we find tight upper and lower bounds on the true cumulative distribution. An efficient algorithm to compute the bounds is based on a PERT-like single traversal of the sub-graph containing the set of N deterministically longest paths. The efficiency and accuracy of the algorithm are demonstrated on a set of ISCAS'85 benchmarks. Across all the benchmarks, the average RMS error between the exact distribution and lower bound is 0.7%, and the average maximum error at the 95th percentile is 0.6%. The computation of bounds for the largest benchmark takes 39 seconds.
"Debugging HW/SW interface for MPSoC: video encoder system design case study," Proceedings of the 41st Design Automation Conference, 2004. doi:10.1145/996566.996808
M. Youssef, S. Yoo, A. Sasongko, Y. Paviot, A. Jerraya
This paper reports a case study of multiprocessor SoC (MPSoC) design of a complex video encoder, namely OpenDivX. OpenDivX is a popular MPEG-4 codec. It requires massive computation resources and deals with complex data structures to represent video streams. In this study, the initial specification is given as sequential C code that had to be parallelized for execution on four different processors. A high-level programming model, namely the Message Passing Interface (MPI), was used for inter-task communication in the parallelized C code. A four-processor hardware prototyping platform was used to debug the parallelized software before the final SoC hardware was ready. Targeting the abstract MPI-based parallel code to the multiprocessor architecture required the design of an additional hardware-dependent software layer to refine the abstract programming model. The design was carried out by a team comprising three types of designers: application software, hardware-dependent software, and hardware platform designers. The collaboration was necessary to master the whole flow from the specification to the platform. The study showed that HW/SW interface debug was the most time-consuming step. This is identified as a potential killer for application-specific MPSoC design. To further investigate ways to accelerate HW/SW interface debug, we analyzed the bugs found in the case study and the available debug environments. Finally, we propose a debug strategy that efficiently exploits existing debug environments to reduce the time spent on HW/SW interface debug.
{"title":"Debugging HW/SW interface for MPSoC: video encoder system design case study","authors":"M. Youssef, S. Yoo, A. Sasongko, Y. Paviot, A. Jerraya","doi":"10.1145/996566.996808","DOIUrl":"https://doi.org/10.1145/996566.996808","url":null,"abstract":"This paper reports a case study of multiprocessor SoC (MPSoC) design of a complex video encoder, namely OpenDivX. OpenDivX is a popular version of MPEG4. It requires massive computation resources and deals with complex data structures to represent video streams. In this study, the initial specification is given in sequential C code that had to be parallelized to be executed on four different processors. High level programming model, namely Message Passing Interface (MPI) was used to enable inter-task communication among parallelized C code. A four processor hardware prototyping platform was used to debug the parallelized software before final SoC hardware is ready. The targeting of abstract parallel code using MPI to the multiprocessor architecture required the design of an additional hardware-dependent software layer to refine the abstract programming model. The design was made by a team work of three types of designer: application software, hardware-dependent software and hardware platform designers. The collaboration was necessary to master the whole flow from the specification to the platform.The study showed that HW/SW interface debug was the most time-consuming step. This is identified as a potential killer for application-specific MPSoC design. To further investigate the ways to accelerate the HW/SW interface debug, we analyzed bugs found in the case study and the available debug environments. Finally, we address a debug strategy that exploits efficiently existing debug environments to reduce the time for HW/SW interface debug.","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127238820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Performance analysis of different arbitration algorithms of the AMBA AHB bus," Proceedings of the 41st Design Automation Conference, 2004. doi:10.1145/996566.996734
M. Conti, M. Caldari, G. Vece, S. Orcioni, C. Turchetti
Bus performance is extremely important in platform-based design. System-level analysis of bus performance provides important information for comparing and choosing between different architectures, driven by the functional, timing and power constraints of the System-on-Chip. This paper presents the effect of different arbitration algorithms and bus usage methodologies on AMBA AHB performance in terms of effective throughput and power dissipation. SystemC and VHDL models have been developed and simulations have been performed.
{"title":"Performance analysis of different arbitration algorithms of the AMBA AHB bus","authors":"M. Conti, M. Caldari, G. Vece, S. Orcioni, C. Turchetti","doi":"10.1145/996566.996734","DOIUrl":"https://doi.org/10.1145/996566.996734","url":null,"abstract":"Bus performances are extremely important in a platform-based design. System Level analysis of bus performances gives important information for the analysis and choice between different architectures driven by functional, timing and power constraints of the System-on-Chip. This paper presents the effect of different arbitration algorithms and bus usage methodologies on the bus AMBA AHB performances in terms of effective throughput and power dissipation. SystemC and VHDL models have been developed and simulations have been performed.","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127342906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Virtual memory window for application-specific reconfigurable coprocessors," Proceedings of the 41st Design Automation Conference, 2004. doi:10.1145/996566.996818
M. Vuletic, L. Pozzi, P. Ienne
The complexity of hardware/software (HW/SW) interfacing and the lack of portability across different platforms restrain the widespread use of reconfigurable accelerators and limit designer productivity. Furthermore, communication between the SW and HW parts of codesigned applications is typically exposed to SW programmers and HW designers. In this work, we introduce a virtualization layer that allows reconfigurable application-specific coprocessors to access the user-space virtual memory and share the memory address space with user applications. The layer, consisting of an operating system (OS) extension and a HW component, shifts the burden of moving data between processor and coprocessor from the programmer to the OS, lowers the complexity of interfacing, and hides physical details of the system. Not only does the virtualization layer enhance programming abstraction and portability, but it also performs runtime optimizations: by predicting future memory accesses and speculatively prefetching data, the virtualization layer improves coprocessor execution, so applications achieve better performance without any user intervention. We use two different reconfigurable systems-on-chip (SoCs) running Linux and codesigned applications to prove the viability of our concept. The applications run faster than their SW versions, and the overhead due to the virtualization is limited. Dynamic prefetching in the virtualization layer further reduces the abstraction overhead.
"Large-scale full-wave simulation," Proceedings of the 41st Design Automation Conference, 2004. doi:10.1145/996566.996782
S. Kapur, D. Long
We describe a new extraction tool, EMX (Electro-Magnetic eXtractor), for the analysis of RF, analog and high-speed digital circuits. EMX is a fast full-wave field solver. It incorporates two new techniques which make it significantly faster and more memory-efficient than previous solvers. First, it takes advantage of layout regularity in typical designs. Second, EMX uses a new method for computing the vector-potential component in the mixed potential integral equation. These techniques give a speed-up of more than a factor of ten, together with a corresponding reduction in memory.