STAC: statistical timing analysis with correlation. Jiayong Le, Xin Li, L. Pileggi. Proceedings of the 41st Design Automation Conference, 2004. DOI: 10.1145/996566.996665

Current technology trends have led to the growing impact of both inter-die and intra-die process variations on circuit performance. While it is imperative to model parameter variations for sub-100nm technologies to produce an upper-bound prediction on timing, it is equally important to consider the correlation of these variations for the bound to be useful. In this paper we present an efficient block-based statistical static timing analysis algorithm that can account for correlations from process parameters and re-converging paths. The algorithm can also accommodate dominant interconnect coupling effects to provide an accurate compilation of statistical timing information. The generality and efficiency of the proposed algorithm are obtained from a novel simplification technique derived from statistical independence theory and principal component analysis (PCA). The technique significantly reduces the cost of computing the mean, variance, and covariance of a set of correlated random variables.
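For intuition on the PCA-based simplification, here is a minimal numpy sketch (with assumed sensitivity and covariance values, not the authors' implementation): once correlated process parameters are rotated into independent, unit-variance principal components, every delay variance and covariance reduces to a dot product of component weights.

```python
import numpy as np

# Assumed covariance of two correlated process parameters
# (e.g., channel length and oxide thickness).
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])
eigvals, eigvecs = np.linalg.eigh(cov)  # PCA: rotate into independent components

# Assumed sensitivities of two gate delays to the original parameters.
S = np.array([[0.5, 1.2],   # delay 1
              [0.8, 0.3]])  # delay 2

# Sensitivities to the independent principal components, scaled so each
# component has unit variance.
A = S @ eigvecs * np.sqrt(eigvals)

# With independent unit-variance components, second-order statistics are cheap:
var_d1 = np.sum(A[0] ** 2)
var_d2 = np.sum(A[1] ** 2)
cov_d1_d2 = np.dot(A[0], A[1])  # covariance = dot product of component weights

# Sanity check against the direct (more expensive) computation S * cov * S^T.
assert np.allclose(A @ A.T, S @ cov @ S.T)
print(var_d1, var_d2, cov_d1_d2)
```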
{"title":"STAC: statistical timing analysis with correlation","authors":"Jiayong Le, Xin Li, L. Pileggi","doi":"10.1145/996566.996665","DOIUrl":"https://doi.org/10.1145/996566.996665","url":null,"abstract":"Current technology trends have led to the growing impact of both inter-die and intra-die process variations on circuit performance. While it is imperative to model parameter variations for sub-100nm technologies to produce an upper bound prediction on timing, it is equally important to consider the correlation of these variations for the bound to be useful. In this paper we present an efficient block-based statistical static timing analysis algorithm that can account for correlations from process parameters and re-converging paths. The algorithm can also accommodate dominant interconnect coupling effects to provide an accurate compilation of statistical timing information. The generality and efficiency for the proposed algorithm is obtained from a novel simplification technique that is derived from the statistical independence theories and principal component analysis (PCA) methods. The technique significantly reduces the cost for mean, variance and covariance computation of a set of correlated random variables.","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"77 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113933235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An efficient scalable and flexible data transfer architecture for multiprocessor SoC with massive distributed memory. Sang-Il Han, A. Baghdadi, M. Bonaciu, S. Chae, A. Jerraya. Proceedings of the 41st Design Automation Conference, 2004. DOI: 10.1145/996566.996636
The massive data transfers encountered in emerging multimedia embedded applications require an architecture that supports both a highly distributed memory structure and multiprocessor computation. The key issue is then how to manage data transfers between large numbers of distributed memories. To address this issue, this paper proposes a scalable Distributed Memory Server (DMS) for multiprocessor SoC (MPSoC). The proposed DMS is composed of (1) high-performance and flexible memory service access points (MSAPs), which execute data transfers without intervention from the processing elements, (2) a data network, and (3) a control network. It can handle direct massive data transfers between the distributed memories of an MPSoC. The scalability and flexibility of the proposed DMS are illustrated through the implementation of an MPEG4 video encoder for QCIF and CIF formats. The experiments show clearly how the DMS can be adapted to accommodate different SoC configurations requiring various data transfer bandwidths. Synthesis results show that bandwidth can scale up to 28.8 GB/sec.
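As a rough illustration of the descriptor-driven transfer idea, here is a toy Python model (the MSAP interface and field names are invented for illustration, not the paper's actual interface): processing elements only post transfer descriptors, and the service point moves data directly between memories.

```python
from dataclasses import dataclass

@dataclass
class Transfer:
    src: str       # source memory name
    src_off: int   # source offset
    dst: str       # destination memory name
    dst_off: int   # destination offset
    size: int      # bytes to move

class MSAP:
    """Toy service point: executes queued memory-to-memory copies
    without further involvement of the processing element."""
    def __init__(self, memories):
        self.memories = memories
        self.queue = []

    def post(self, t: Transfer):
        self.queue.append(t)  # the PE only posts a descriptor and moves on

    def run(self):
        for t in self.queue:  # data moves directly memory-to-memory
            src = self.memories[t.src]
            self.memories[t.dst][t.dst_off:t.dst_off + t.size] = \
                src[t.src_off:t.src_off + t.size]
        self.queue.clear()

mems = {"mem0": bytearray(b"frame-data------"), "mem1": bytearray(16)}
msap = MSAP(mems)
msap.post(Transfer("mem0", 0, "mem1", 0, 10))
msap.run()
print(mems["mem1"])  # b'frame-data\x00\x00\x00\x00\x00\x00'
```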
{"title":"An efficient scalable and flexible data transfer architecture for multiprocessor SoC with massive distributed memory","authors":"Sang-Il Han, A. Baghdadi, M. Bonaciu, S. Chae, A. Jerraya","doi":"10.1145/996566.996636","DOIUrl":"https://doi.org/10.1145/996566.996636","url":null,"abstract":"Massive data transfer encountered in emerging multimedia embedded applications requires architecture allowing both highly distributed memory structure and multiprocessor computation to be handled. The key issue that needs to be solved is then how to manage data transfers between large numbers of distributed memories. To overcome this issue, our paper proposes a scalable Distributed Memory Server (DMS) for multiprocessor SoC (MPSoC). The proposed DMS is composed of: (1) high-performance and flexible memory service access points (MSAPs), which execute data transfers without intervention of the processing elements, (2) data network, and (3) control network. It can handle direct massive data transfer between the distributed memories of an MPSoC. The scalability and flexibility of the proposed DMS are illustrated through the implementation of an MPEG4 video encoder for QCIF and CIF formats. The experiments show clearly how DMS can be adapted to accommodate different SoC configurations requiring various data transfer bandwidths. Synthesis results show that bandwidth can scale up to 28.8 GB/sec.","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127755636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power minimization using simultaneous gate sizing, dual-Vdd and dual-Vth assignment. A. Srivastava, D. Sylvester, D. Blaauw. Proceedings of the 41st Design Automation Conference, 2004. DOI: 10.1145/996566.996777

We develop an approach to minimizing total power in a dual-Vdd and dual-Vth design. The algorithm runs in two distinct phases. The first phase relies on upsizing to create slack and maximizes low-Vdd assignments in a backward topological manner. The second phase proceeds in a forward topological fashion and both sizes gates and re-assigns them to high Vdd to enable significant static power savings through high-Vth assignment. The proposed algorithm is implemented and tested on a set of combinational benchmark circuits. A comparison with traditional CVS and dual-Vth/sizing algorithms demonstrates the advantage of the algorithm over a range of activity factors, including an average power reduction of 30% (50%) at high (nominal) primary input activities.
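A heavily simplified sketch of the two-phase structure described above (the slack and delay-penalty numbers are assumed, and the real algorithm sizes gates and propagates timing, which this omits): slack is first spent on low-Vdd assignments to cut dynamic power, then remaining slack is spent on high-Vth assignments to cut leakage.

```python
from dataclasses import dataclass

@dataclass
class Gate:
    slack: float          # timing slack in ps (toy values)
    low_vdd: bool = False
    high_vth: bool = False

LOW_VDD_PENALTY = 20.0    # assumed extra delay for a low-Vdd gate, ps
HIGH_VTH_PENALTY = 10.0   # assumed extra delay for a high-Vth gate, ps

gates = [Gate(50.0), Gate(15.0), Gate(80.0), Gate(5.0)]

# Phase 1 (backward pass): use slack for low Vdd, reducing dynamic power.
for g in reversed(gates):
    if g.slack >= LOW_VDD_PENALTY:
        g.low_vdd = True
        g.slack -= LOW_VDD_PENALTY

# Phase 2 (forward pass): spend remaining slack on high Vth to cut leakage.
for g in gates:
    if g.slack >= HIGH_VTH_PENALTY:
        g.high_vth = True
        g.slack -= HIGH_VTH_PENALTY

print([(g.low_vdd, g.high_vth, g.slack) for g in gates])
```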
{"title":"Power minimization using simultaneous gate sizing, dual-Vdd and dual-Vth assignment","authors":"A. Srivastava, D. Sylvester, D. Blaauw","doi":"10.1145/996566.996777","DOIUrl":"https://doi.org/10.1145/996566.996777","url":null,"abstract":"We develop an approach to minimize total power in a dual-Vdd and dual-Vth design. The algorithm runs in two distinct phases. The first phase relies on upsizing to create slack and maximize low Vdd assignments in a backward topological manner. The second phase proceeds in a forward topological fashion and both sizes and re-assigns gates to high Vdd to enable significant static power savings through high Vth assignment. The proposed algorithm is implemented and tested on a set of combinational benchmark circuits. A comparison with traditional CVS and dual-Vth/sizing algorithms demonstrate the advantage of the algorithm over a range of activity factors, including an average power reduction of 30% (50%) at high (nominal) primary input activities.","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"301 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131404403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and reliability challenges in nanometer technologies. S. Borkar, T. Karnik, V. De. Proceedings of the 41st Design Automation Conference, 2004. DOI: 10.1145/996566.996588

CMOS technology scaling has pushed channel lengths below the wavelength of light. Parameter variation, caused by sub-wavelength lithography, will pose a major challenge for the design and reliability of future high-performance microprocessors in nanometer technologies. In this paper, we present the impact of these variations on processor functionality, predictability, and reliability. We propose design and CAD solutions for variation tolerance. We conclude the paper with soft error rate scaling trends and soft-error-tolerant circuits for reliability enhancement.
{"title":"Design and reliability challenges in nanometer technologies","authors":"S. Borkar, T. Karnik, V. De","doi":"10.1145/996566.996588","DOIUrl":"https://doi.org/10.1145/996566.996588","url":null,"abstract":"CMOS technology scaling is causing the channel lengths to be sub-wavelength of light. Parameter variation, caused by sub-wavelength lithography, will pose a major challenge for design and reliability of future high performance microprocessors in nanometer technologies. In this paper, we present the impact of these variations on processor functionality, Predictability and reliability. We propose design and CAD solutions for variation tolerance. We conclude this paper with sofi error rate scaling trends and sofl error tolerant circuits for reliabilitv enhancement.","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131436504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Statistical optimization of leakage power considering process variations using dual-Vth and sizing. A. Srivastava, D. Sylvester, D. Blaauw. Proceedings of the 41st Design Automation Conference, 2004. DOI: 10.1145/996566.996775

Increasing levels of process variability in sub-100nm CMOS design have become a critical concern for performance- and power-constrained designs. In this paper, we propose a new statistically aware dual-Vth and sizing optimization that considers the variability in both the performance and the leakage of a design. While extensive work has been done on statistical analysis methods, circuit optimization is still largely performed using deterministic methods. We show that deterministic optimization quickly loses effectiveness under stringent performance and leakage constraints in designs with significant variability. We then propose a statistically aware dual-Vth and sizing algorithm in which both delay constraints and sensitivity computations are handled statistically. We demonstrate that this statistically aware optimization reduces leakage power by 15-35% compared to traditional deterministic analysis. The improvements increase for strict delay constraints, making statistical optimization especially important for high-performance designs.
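A minimal sketch of why statistical delay constraints differ from deterministic ones (Gaussian, independent gate delays and all numbers are assumed; this is not the paper's sensitivity machinery): a path that easily meets timing at nominal delays can still have a non-trivial probability of failing.

```python
import math

# Toy path: per-gate (mean, sigma) delays in ps; values are assumed.
gates = [(100.0, 8.0), (120.0, 10.0), (90.0, 6.0)]

mu = sum(m for m, s in gates)
sigma = math.sqrt(sum(s * s for m, s in gates))  # independence assumed

T = 340.0  # delay constraint, ps
# Deterministic check at nominal delays: passes with 30 ps to spare.
print("nominal delay:", mu, "<=", T, "->", mu <= T)
# Statistical check: probability the path meets T under a Gaussian model.
yield_prob = 0.5 * (1.0 + math.erf((T - mu) / (sigma * math.sqrt(2))))
print("timing yield: %.3f" % yield_prob)  # ~0.983, i.e. ~1.7% of dies fail
```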
{"title":"Statistical optimization of leakage power considering process variations using dual-Vth and sizing","authors":"A. Srivastava, D. Sylvester, D. Blaauw","doi":"10.1145/996566.996775","DOIUrl":"https://doi.org/10.1145/996566.996775","url":null,"abstract":"Increasing levels of process variability in sub-100nm CMOS design has become a critical concern for performance and power constraint designs. In this paper, we propose a new statistically aware Dual-Vt and sizing optimization that considers both the variability in performance and leakage of a design. While extensive work has been performed in the past on statistical analysis methods, circuit optimization is still largely performed using deterministic methods. We show in this paper that deterministic optimization quickly looses effectiveness for stringent performance and leakage constraints in designs with significant variability. We then propose a statistically aware dual-Vt and sizing algorithm where both delay constraints and sensitivity computations are performed in a statistical manner. We demonstrate that using this statistically aware optimization, leakage power can be reduced by 15-35% compared to traditional deterministic analysis. The improvements increase for strict delay constraints making statistical optimization especially important for high performance designs.","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116112814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast statistical timing analysis handling arbitrary delay correlations. M. Orshansky, A. Bandyopadhyay. Proceedings of the 41st Design Automation Conference, 2004. DOI: 10.1145/996566.996664

An efficient statistical timing analysis algorithm that can handle arbitrary (spatial and structural) causes of delay correlation is described. The algorithm derives the entire cumulative distribution function of the circuit delay using a new mathematical formulation. Spatial as well as structural correlations between gate and wire delays can be taken into account, and node delays described by non-Gaussian distributions can be handled. Because the analytical computation of an exact cumulative distribution function for a probabilistic graph with arbitrary distributions is infeasible, we find tight upper and lower bounds on the true cumulative distribution. An efficient algorithm to compute the bounds is based on a PERT-like single traversal of the sub-graph containing the set of N deterministically longest paths. The efficiency and accuracy of the algorithm are demonstrated on a set of ISCAS'85 benchmarks. Across all the benchmarks, the average RMS error between the exact distribution and the lower bound is 0.7%, and the average maximum error at the 95th percentile is 0.6%. The computation of bounds for the largest benchmark takes 39 seconds.
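For intuition on why per-path CDF bounds exist at all, here is a classical observation checked by Monte Carlo (this is not the paper's PERT-based construction, and all path statistics are invented): the circuit delay is the maximum over path delays, so its CDF can never exceed the weakest per-path CDF, and for positively correlated Gaussian paths it is bounded below by the independence product.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
# Three correlated Gaussian path delays; means/sigmas/correlation assumed.
mean = np.array([300.0, 295.0, 290.0])
sig = np.array([10.0, 10.0, 10.0])
corr = 0.5
cov = corr * np.outer(sig, sig) + (1 - corr) * np.diag(sig ** 2)
samples = rng.multivariate_normal(mean, cov, size=200_000)
circuit = samples.max(axis=1)  # circuit delay = slowest path

t = 310.0
phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF
F = [phi((t - m) / s) for m, s in zip(mean, sig)]       # per-path CDFs at t

print("lower bound (independence product):", np.prod(F))
print("Monte Carlo  P(D <= t):            ", (circuit <= t).mean())
print("upper bound (weakest path):        ", min(F))
```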
{"title":"Fast statistical timing analysis handling arbitrary delay correlations","authors":"M. Orshansky, A. Bandyopadhyay","doi":"10.1145/996566.996664","DOIUrl":"https://doi.org/10.1145/996566.996664","url":null,"abstract":"An efficient statistical timing analysis algorithm that can handle arbitrary (spatial and structural) causes of delay correlation is described. The algorithm derives the entire cumulative distribution function of the circuit delay using a new mathematical formulation. Spatial as well as structural correlations between gate and wire delays can be taken into account. The algorithm can handle node delays described by non-Gaussian distributions. Because the analytical computation of an exact cumulative distribution function for a probabilistic graph with arbitrary distributions is infeasible, we find tight upper and lower bounds on the true cumulative distribution. An efficient algorithm to compute the bounds is based on a PERT-like single traversal of the sub-graph containing the set of N deterministically longest paths. The efficiency and accuracy of the algorithm is demonstrated on a set of ISCAS'85 benchmarks. Across all the benchmarks, the average rms error between the exact distribution and lower bound is 0.7%, and the average maximum error at 95th percentile is 0.6%. The computation of bounds for the largest benchmark takes 39 seconds.","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116357250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Debugging HW/SW interface for MPSoC: video encoder system design case study. M. Youssef, S. Yoo, A. Sasongko, Y. Paviot, A. Jerraya. Proceedings of the 41st Design Automation Conference, 2004. DOI: 10.1145/996566.996808
This paper reports a case study of the multiprocessor SoC (MPSoC) design of a complex video encoder, namely OpenDivX, a popular version of MPEG4. It requires massive computation resources and deals with complex data structures to represent video streams. In this study, the initial specification was given as sequential C code that had to be parallelized to execute on four different processors. A high-level programming model, the Message Passing Interface (MPI), was used to enable inter-task communication among the parallelized C code. A four-processor hardware prototyping platform was used to debug the parallelized software before the final SoC hardware was ready. Targeting the abstract parallel code using MPI to the multiprocessor architecture required the design of an additional hardware-dependent software layer to refine the abstract programming model. The design was carried out by a team of three types of designers: application software, hardware-dependent software, and hardware platform designers. This collaboration was necessary to master the whole flow from the specification to the platform. The study showed that HW/SW interface debugging was the most time-consuming step; we identify it as a potential killer for application-specific MPSoC design. To further investigate ways to accelerate HW/SW interface debugging, we analyzed the bugs found in the case study and the available debug environments. Finally, we present a debug strategy that efficiently exploits existing debug environments to reduce the time spent on HW/SW interface debugging.
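A minimal sketch of the kind of master/worker partitioning such an MPI parallelization implies (in Python with mpi4py for brevity; the paper's encoder used MPI from C, and the "encoding" below is a stand-in):

```python
# Run with at least two processes, e.g.: mpiexec -n 4 python encode.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # "Master" splits a frame into stripes and sends one to each worker.
    frame = list(range(12))  # stand-in for pixel data
    stripes = [frame[i::comm.size - 1] for i in range(comm.size - 1)]
    for w, stripe in enumerate(stripes, start=1):
        comm.send(stripe, dest=w, tag=0)
    encoded = [comm.recv(source=w, tag=1) for w in range(1, comm.size)]
    print("encoded stripes:", encoded)
else:
    # Worker: receive a stripe, "encode" it, send the result back.
    stripe = comm.recv(source=0, tag=0)
    comm.send([x * 2 for x in stripe], dest=0, tag=1)
```

The point of the hardware-dependent software layer described above is precisely to implement such send/recv primitives on top of the platform's actual interconnect and memories.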
{"title":"Debugging HW/SW interface for MPSoC: video encoder system design case study","authors":"M. Youssef, S. Yoo, A. Sasongko, Y. Paviot, A. Jerraya","doi":"10.1145/996566.996808","DOIUrl":"https://doi.org/10.1145/996566.996808","url":null,"abstract":"This paper reports a case study of multiprocessor SoC (MPSoC) design of a complex video encoder, namely OpenDivX. OpenDivX is a popular version of MPEG4. It requires massive computation resources and deals with complex data structures to represent video streams. In this study, the initial specification is given in sequential C code that had to be parallelized to be executed on four different processors. High level programming model, namely Message Passing Interface (MPI) was used to enable inter-task communication among parallelized C code. A four processor hardware prototyping platform was used to debug the parallelized software before final SoC hardware is ready. The targeting of abstract parallel code using MPI to the multiprocessor architecture required the design of an additional hardware-dependent software layer to refine the abstract programming model. The design was made by a team work of three types of designer: application software, hardware-dependent software and hardware platform designers. The collaboration was necessary to master the whole flow from the specification to the platform.The study showed that HW/SW interface debug was the most time-consuming step. This is identified as a potential killer for application-specific MPSoC design. To further investigate the ways to accelerate the HW/SW interface debug, we analyzed bugs found in the case study and the available debug environments. Finally, we address a debug strategy that exploits efficiently existing debug environments to reduce the time for HW/SW interface debug.","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127238820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance analysis of different arbitration algorithms of the AMBA AHB bus. M. Conti, M. Caldari, G. Vece, S. Orcioni, C. Turchetti. Proceedings of the 41st Design Automation Conference, 2004. DOI: 10.1145/996566.996734
Bus performance is extremely important in a platform-based design. System-level analysis of bus performance provides key information for analyzing and choosing between different architectures driven by the functional, timing, and power constraints of the System-on-Chip. This paper presents the effect of different arbitration algorithms and bus usage methodologies on the performance of the AMBA AHB bus in terms of effective throughput and power dissipation. SystemC and VHDL models have been developed and simulations have been performed.
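A toy simulation contrasting two common arbitration policies (illustrative Python, not the paper's SystemC/VHDL models; request probabilities are assumed): fixed priority skews grants toward the highest-priority master, while round-robin spreads them evenly.

```python
import random

def simulate(policy, cycles=10000, n=4, p_req=0.4, seed=1):
    """Each cycle, masters request the bus with probability p_req;
    the arbiter grants exactly one pending request."""
    rng = random.Random(seed)
    grants = [0] * n
    rr = 0  # round-robin pointer
    for _ in range(cycles):
        reqs = [m for m in range(n) if rng.random() < p_req]
        if not reqs:
            continue
        if policy == "fixed":
            g = min(reqs)  # master 0 always has top priority
        else:
            # Round-robin: first requester at or after the pointer wins.
            g = min(reqs, key=lambda m: (m - rr) % n)
            rr = (g + 1) % n
        grants[g] += 1
    return grants

print("fixed-priority:", simulate("fixed"))
print("round-robin:  ", simulate("rr"))
```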
{"title":"Performance analysis of different arbitration algorithms of the AMBA AHB bus","authors":"M. Conti, M. Caldari, G. Vece, S. Orcioni, C. Turchetti","doi":"10.1145/996566.996734","DOIUrl":"https://doi.org/10.1145/996566.996734","url":null,"abstract":"Bus performances are extremely important in a platform-based design. System Level analysis of bus performances gives important information for the analysis and choice between different architectures driven by functional, timing and power constraints of the System-on-Chip. This paper presents the effect of different arbitration algorithms and bus usage methodologies on the bus AMBA AHB performances in terms of effective throughput and power dissipation. SystemC and VHDL models have been developed and simulations have been performed.","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127342906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Virtual memory window for application-specific reconfigurable coprocessors. M. Vuletic, L. Pozzi, P. Ienne. Proceedings of the 41st Design Automation Conference, 2004. DOI: 10.1145/996566.996818

The complexity of hardware/software (HW/SW) interfacing and the lack of portability across different platforms restrain the widespread use of reconfigurable accelerators and limit designer productivity. Furthermore, communication between the SW and HW parts of codesigned applications is typically exposed to SW programmers and HW designers. In this work, we introduce a virtualization layer that allows reconfigurable application-specific coprocessors to access the user-space virtual memory and share the memory address space with user applications. The layer, consisting of an operating system (OS) extension and a HW component, shifts the burden of moving data between processor and coprocessor from the programmer to the OS, lowers the complexity of interfacing, and hides the physical details of the system. Not only does the virtualization layer enhance programming abstraction and portability, it also performs runtime optimizations: by predicting future memory accesses and speculatively prefetching data, the virtualization layer improves coprocessor execution, and applications achieve better performance without any user intervention. We use two different reconfigurable systems-on-chip (SoC) running Linux and codesigned applications to prove the viability of our concept. The applications run faster than their SW versions, and the overhead due to virtualization is limited. Dynamic prefetching in the virtualization layer further reduces the abstraction overhead.
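A toy model of the window-plus-prefetch idea (the names, page size, and sequential prefetch policy here are invented for illustration, not the paper's mechanism): on a miss, the manager maps the faulting page into the coprocessor's local window and speculatively fetches the next one, halving the faults on a streaming access pattern.

```python
PAGE = 4
# "Main memory": 8 pages of synthetic data.
memory = {p: [p * PAGE + i for i in range(PAGE)] for p in range(8)}

class Window:
    """Small local buffer the coprocessor reads from; an OS-like manager
    services misses and prefetches the next page sequentially."""
    def __init__(self):
        self.local = {}   # page -> data resident in the window
        self.faults = 0

    def read(self, addr):
        page, off = divmod(addr, PAGE)
        if page not in self.local:
            self.faults += 1
            self.local[page] = memory[page]        # demand fetch
            if page + 1 in memory:                 # sequential prefetch
                self.local[page + 1] = memory[page + 1]
        return self.local[page][off]

w = Window()
total = sum(w.read(a) for a in range(32))  # streaming access pattern
print("faults with prefetch:", w.faults)   # 4 instead of 8
```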
{"title":"Virtual memory window for application-specific reconfigurable coprocessors","authors":"M. Vuletic, L. Pozzi, P. Ienne","doi":"10.1145/996566.996818","DOIUrl":"https://doi.org/10.1145/996566.996818","url":null,"abstract":"The complexity of hardware/software (HW/SW) interfacing and the lack of portability across different platforms, restrain the widespread use of reconfigurable accelerators and limit the designer productivity. Furthermore, communication between SW and HW parts of codesigned applications are typically exposed to SW programmers and HW designers. In this work, we introduce a virtualization layer that allows reconfigurable application-specific coprocessors to access the user-space virtual memory and share the memory address space with user applications. The layer, consisting of an operating system (OS) extension and a HW component, shifts the burden of moving data between processor and coprocessor from the programmer to the OS, lowers the complexity of interfacing, and hides physical details of the system. Not only does the virtualization layer enhance programming abstraction and portability, but it also performs runtime optimizations: by predicting future memory accesses and speculatively prefetching data, the virtualization layer improves the coprocessor execution-applications achieve better performance without any user intervention. We use two different reconfigurable system-on-chip (SoC) running Linux and codesigned applications to prove the viability of our concept. The applications run faster than their SW versions, and the overhead due to the virtualisation is limited. Dynamic prefetching in the virtualisation layer further reduces the abstraction overhead","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126746583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale full-wave simulation. S. Kapur, D. Long. Proceedings of the 41st Design Automation Conference, 2004. DOI: 10.1145/996566.996782

We describe a new extraction tool, EMX (Electro-Magnetic eXtractor), for the analysis of RF, analog and high-speed digital circuits. EMX is a fast full-wave field solver. It incorporates two new techniques which make it significantly faster and more memory-efficient than previous solvers. First, it takes advantage of layout regularity in typical designs. Second, EMX uses a new method for computing the vector-potential component in the mixed potential integral equation. These techniques give a speed-up of more than a factor of ten, together with a corresponding reduction in memory.
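For reference, the mixed-potential integral equation that the vector-potential remark refers to, in its common textbook form (from the general literature, not EMX's internal formulation): the fields scattered by the surface current J and charge density rho must cancel the tangential incident field on the conductor surfaces S.

```latex
% Mixed-potential integral equation (MPIE) on conductor surfaces S.
\begin{align}
\left[\, j\omega \mathbf{A}(\mathbf{r}) + \nabla \Phi(\mathbf{r})\,\right]_{\tan}
  &= \mathbf{E}^{\mathrm{inc}}_{\tan}(\mathbf{r}),
  && \mathbf{r} \in S, \\
\mathbf{A}(\mathbf{r})
  &= \mu \int_{S} G_{A}(\mathbf{r},\mathbf{r}')\,\mathbf{J}(\mathbf{r}')\,\mathrm{d}S',
  && \text{(vector potential)} \\
\Phi(\mathbf{r})
  &= \frac{1}{\varepsilon} \int_{S} G_{\Phi}(\mathbf{r},\mathbf{r}')\,\rho(\mathbf{r}')\,\mathrm{d}S',
  && \rho = -\frac{1}{j\omega}\,\nabla_{s}\!\cdot\mathbf{J}.
\end{align}
```

The vector-potential term A is the computationally dominant component, which is why a faster method for it translates directly into overall solver speed-up.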
{"title":"Large-scale full-wave simulation","authors":"S. Kapur, D. Long","doi":"10.1145/996566.996782","DOIUrl":"https://doi.org/10.1145/996566.996782","url":null,"abstract":"We describe a new extraction tool, EMX (Electro-Magnetic eXtractor), for the analysis of RF, analog and high-speed digital circuits. EMX is a fast full-wave field solver. It incorporates two new techniques which make it significantly faster and more memory-efficient than previous solvers. First, it takes advantage of layout regularity in typical designs. Second, EMX uses a new method for computing the vector-potential component in the mixed potential integral equation. These techniques give a speed-up of more than a factor of ten, together with a corresponding reduction in memory.","PeriodicalId":115059,"journal":{"name":"Proceedings. 41st Design Automation Conference, 2004.","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126923551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}