Pub Date : 2004-02-16DOI: 10.1109/DATE.2004.1268896
M. Kandemir
High-energy consumption presents a problem for sustainable clock frequency and deliverable performance. In particular, memory energy consumption of array-intensive applications can be overwhelming due to poor cache locality. One option for reducing memory energy is to adopt a banked memory architecture, where memory space is divided into banks and each bank can be powered down if it is not in active use. An important issue here is the bank access pattern, which determines opportunities for saving energy. In this paper, we present a compiler-based data layout transformation strategy for increasing the effectiveness of a banked memory architecture. The idea is to transform the array layouts in memory in such a way that two loop iterations executed one after another access the data in the same bank as much as possible; the remaining banks can be placed into a low-power mode. Our simulation-based experiments with nine array-intensive applications show significant savings in memory energy consumption.
{"title":"Impact of data transformations on memory bank locality","authors":"M. Kandemir","doi":"10.1109/DATE.2004.1268896","DOIUrl":"https://doi.org/10.1109/DATE.2004.1268896","url":null,"abstract":"High-energy consumption presents a problem for sustainable clock frequency and deliverable performance. In particular, memory energy consumption of array-intensive applications can be overwhelming due to poor cache locality. One option for reducing memory energy is to adopt a banked memory architecture, where memory space is divided into banks and each bank can be powered down if it is not in active use. An important issue here is the bank access pattern, which determines opportunities for saving energy. In this paper, we present a compiler-based data layout transformation strategy for increasing the effectiveness of a banked memory architecture. The idea is to transform the array layouts in memory in such a way that two loop iterations executed one after another access the data in the same bank as much as possible; the remaining banks can be placed into a low-power mode. Our simulation-based experiments with nine array-intensive applications show significant savings in memory energy consumption.","PeriodicalId":335658,"journal":{"name":"Proceedings Design, Automation and Test in Europe Conference and Exhibition","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114305133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2004-02-16DOI: 10.1109/DATE.2004.1268981
H. Vranken, Ferry Syafei Sapei, H. Wunderlich
This paper presents an experimental investigation on the impact of test point insertion on circuit size and performance. Often test points are inserted into a circuit in order to improve the circuit's testability, which results in smaller test data volume, shorter test time, and higher fault coverage. Inserting test points however requires additional silicon area and influences the timing of a circuit. The paper shows how placement and routing is affected by test point insertion during layout generation. Experimental data for industrial circuits show that inserting 1% test points in general increases the silicon area after layout by less than 0.5% while the performance of the circuit may be reduced by 5% or more.
{"title":"Impact of test point insertion on silicon area and timing during layout","authors":"H. Vranken, Ferry Syafei Sapei, H. Wunderlich","doi":"10.1109/DATE.2004.1268981","DOIUrl":"https://doi.org/10.1109/DATE.2004.1268981","url":null,"abstract":"This paper presents an experimental investigation on the impact of test point insertion on circuit size and performance. Often test points are inserted into a circuit in order to improve the circuit's testability, which results in smaller test data volume, shorter test time, and higher fault coverage. Inserting test points however requires additional silicon area and influences the timing of a circuit. The paper shows how placement and routing is affected by test point insertion during layout generation. Experimental data for industrial circuits show that inserting 1% test points in general increases the silicon area after layout by less than 0.5% while the performance of the circuit may be reduced by 5% or more.","PeriodicalId":335658,"journal":{"name":"Proceedings Design, Automation and Test in Europe Conference and Exhibition","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114524513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2004-02-16DOI: 10.1109/DATE.2004.1268935
A. Iranli, Kihwan Choi, Massoud Pedram
This paper presents a dynamic energy management policy for a wireless video streaming system, consisting of battery-powered client and server. The paper starts from the observation that the video quality in wireless streaming is a function of three factors: encoding aptitude of the server, decoding aptitude of the client, and the wireless channel. Based on this observation, the energy consumption of a wireless video streaming system is modeled and analyzed. Using the proposed model, the optimal energy assignment to each video frame is done such that the maximum system lifetime is achieved while satisfying a given minimum video quality requirement. Experimental results show that the proposed policy increases the system lifetime by 20%.
{"title":"A game theoretic approach to low energy wireless video streaming","authors":"A. Iranli, Kihwan Choi, Massoud Pedram","doi":"10.1109/DATE.2004.1268935","DOIUrl":"https://doi.org/10.1109/DATE.2004.1268935","url":null,"abstract":"This paper presents a dynamic energy management policy for a wireless video streaming system, consisting of battery-powered client and server. The paper starts from the observation that the video quality in wireless streaming is a function of three factors: encoding aptitude of the server, decoding aptitude of the client, and the wireless channel. Based on this observation, the energy consumption of a wireless video streaming system is modeled and analyzed. Using the proposed model, the optimal energy assignment to each video frame is done such that the maximum system lifetime is achieved while satisfying a given minimum video quality requirement. Experimental results show that the proposed policy increases the system lifetime by 20%.","PeriodicalId":335658,"journal":{"name":"Proceedings Design, Automation and Test in Europe Conference and Exhibition","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114564930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2004-02-16DOI: 10.1109/DATE.2004.1268879
T. Seceleanu, T. Westerlund
This study shows the derivation of a local segmented bus arbiter from an original single segment bus arbiter. The operations are performed in the formal framework of action systems and illustrated in a graphical manner using the corresponding action systems - UML profile notations. The derivation is useful both to demonstrate the capability of preserving correctness when considering an important hardware design decision and also to identify means through which this kind of decisions can be performed in a graphical environment.
{"title":"Aspects of formal and graphical design of a bus system","authors":"T. Seceleanu, T. Westerlund","doi":"10.1109/DATE.2004.1268879","DOIUrl":"https://doi.org/10.1109/DATE.2004.1268879","url":null,"abstract":"This study shows the derivation of a local segmented bus arbiter from an original single segment bus arbiter. The operations are performed in the formal framework of action systems and illustrated in a graphical manner using the corresponding action systems - UML profile notations. The derivation is useful both to demonstrate the capability of preserving correctness when considering an important hardware design decision and also to identify means through which this kind of decisions can be performed in a graphical environment.","PeriodicalId":335658,"journal":{"name":"Proceedings Design, Automation and Test in Europe Conference and Exhibition","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124315332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2004-02-16DOI: 10.1109/DATE.2004.1269114
K. Rahimi, S. Bridges, C. Diorio
This paper introduces adaptive delay sequential elements (ADSEs). ADSEs are registers that use nonvolatile, floating-gate transistors to tune their internal clock delays. We propose ADSEs for correcting timing violations and optimizing circuit performance. We present an ADSE circuit example, system architecture, and tuning methodology. We present experimental results that demonstrate the correct operation of our example circuit and discuss the die-area impact of using ADSEs. Our experiments also show that voltage and temperature sensitivity of ADSEs are comparable to non-adaptive flip-flops.
{"title":"Timing correction and optimization with adaptive delay sequential elements","authors":"K. Rahimi, S. Bridges, C. Diorio","doi":"10.1109/DATE.2004.1269114","DOIUrl":"https://doi.org/10.1109/DATE.2004.1269114","url":null,"abstract":"This paper introduces adaptive delay sequential elements (ADSEs). ADSEs are registers that use nonvolatile, floating-gate transistors to tune their internal clock delays. We propose ADSEs for correcting timing violations and optimizing circuit performance. We present an ADSE circuit example, system architecture, and tuning methodology. We present experimental results that demonstrate the correct operation of our example circuit and discuss the die-area impact of using ADSEs. Our experiments also show that voltage and temperature sensitivity of ADSEs are comparable to non-adaptive flip-flops.","PeriodicalId":335658,"journal":{"name":"Proceedings Design, Automation and Test in Europe Conference and Exhibition","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127873890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2004-02-16DOI: 10.1109/DATE.2004.1269001
Mikael Millberg, E. Nilsson, R. Thid, A. Jantsch
In today's emerging network-on-chips, there is a need for different traffic classes with different quality-of-service guarantees. Within our NoC architecture nostrum, we have implemented a service of guaranteed bandwidth (GB), and latency, in addition to the already existing service of best-effort (BE) packet delivery. The guaranteed bandwidth is accessed via virtual circuits (VC). The VCs are implemented using a combination of two concepts that we call 'Looped Containers' and 'Temporally Disjoint Networks'. The looped containers are used to guarantee access to the network-independently of the current network load without dropping packets; and the TDNs are used in order to achieve several VCs, plus ordinary BE traffic, in the network. The TDNs are a consequence of the deflective routing policy used, and gives rise to an explicit time-division-multiplexing within the network. To prove our concept an HDL implementation has been synthesised and simulated. The cost in terms of additional hardware needed, as well as additional bandwidth is very low-less than 2 percent in both cases! Simulations showed that ordinary BE traffic is practically unaffected by the VCs.
{"title":"Guaranteed bandwidth using looped containers in temporally disjoint networks within the nostrum network on chip","authors":"Mikael Millberg, E. Nilsson, R. Thid, A. Jantsch","doi":"10.1109/DATE.2004.1269001","DOIUrl":"https://doi.org/10.1109/DATE.2004.1269001","url":null,"abstract":"In today's emerging network-on-chips, there is a need for different traffic classes with different quality-of-service guarantees. Within our NoC architecture nostrum, we have implemented a service of guaranteed bandwidth (GB), and latency, in addition to the already existing service of best-effort (BE) packet delivery. The guaranteed bandwidth is accessed via virtual circuits (VC). The VCs are implemented using a combination of two concepts that we call 'Looped Containers' and 'Temporally Disjoint Networks'. The looped containers are used to guarantee access to the network-independently of the current network load without dropping packets; and the TDNs are used in order to achieve several VCs, plus ordinary BE traffic, in the network. The TDNs are a consequence of the deflective routing policy used, and gives rise to an explicit time-division-multiplexing within the network. To prove our concept an HDL implementation has been synthesised and simulated. The cost in terms of additional hardware needed, as well as additional bandwidth is very low-less than 2 percent in both cases! Simulations showed that ordinary BE traffic is practically unaffected by the VCs.","PeriodicalId":335658,"journal":{"name":"Proceedings Design, Automation and Test in Europe Conference and Exhibition","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126218891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2004-02-16DOI: 10.1109/DATE.2004.1268933
Tarvo Raudvere, A. Singh, I. Sander, A. Jantsch
Today's integrated circuits with increasing complexity cause the well known state space explosion problem in verification tools. In order to handle this problem a much simpler abstract model of the design has to be created for verification. We introduce the polynomial abstraction technique, which efficiently simplifies the verification task of sequential design blocks whose functionality can be expressed as a polynomial. Through our technique, the domains of possible values of data input signals can be reduced. This is done in such a way that the abstract model is still valid for model checking of the design functionality in terms of the system's control and data properties. We incorporate polynomial abstraction into the ForSyDe methodology, for the verification of clock domain design refinements.
{"title":"Polynomial abstraction for verification of sequentially implemented combinational circuits","authors":"Tarvo Raudvere, A. Singh, I. Sander, A. Jantsch","doi":"10.1109/DATE.2004.1268933","DOIUrl":"https://doi.org/10.1109/DATE.2004.1268933","url":null,"abstract":"Today's integrated circuits with increasing complexity cause the well known state space explosion problem in verification tools. In order to handle this problem a much simpler abstract model of the design has to be created for verification. We introduce the polynomial abstraction technique, which efficiently simplifies the verification task of sequential design blocks whose functionality can be expressed as a polynomial. Through our technique, the domains of possible values of data input signals can be reduced. This is done in such a way that the abstract model is still valid for model checking of the design functionality in terms of the system's control and data properties. We incorporate polynomial abstraction into the ForSyDe methodology, for the verification of clock domain design refinements.","PeriodicalId":335658,"journal":{"name":"Proceedings Design, Automation and Test in Europe Conference and Exhibition","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125575149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2004-02-16DOI: 10.1109/DATE.2004.1269041
Suvodeep Gupta, S. Katkoori
Given word-level statistics, namely mean, standard deviation, and lag-one temporal correlation of input data, we estimate the bit-level crosstalk probability on a system bus using a non-enumerative statistical approach. We introduce a sampling technique for fast evaluation of integrals during the estimation process. We had proposed two techniques previously - (a) a stream-based estimator that counts crosstalk events on a bus; and (b) a statistical enumeration technique that enumerates crosstalk-producing values on a bus and computes their occurrence probability. Both these techniques suffer from exponential time complexity with respect to the bus-width. In this work, we propose a statistical non-enumerative technique that has linear time complexity with respect to the bus-width. We achieve the linear complexity by resorting to: (1) manipulating the data stream to make the crosstalk-producing values contiguous and (2) sampling the distribution function and storing it as a lookup table. Experimental results for data streams from different data environments are presented, compared against the stream-based approach. Average errors of less than 12% are obtained for bus-widths ranging from 8b to 32b.
{"title":"A fast word-level statistical estimator of intra-bus crosstalk","authors":"Suvodeep Gupta, S. Katkoori","doi":"10.1109/DATE.2004.1269041","DOIUrl":"https://doi.org/10.1109/DATE.2004.1269041","url":null,"abstract":"Given word-level statistics, namely mean, standard deviation, and lag-one temporal correlation of input data, we estimate the bit-level crosstalk probability on a system bus using a non-enumerative statistical approach. We introduce a sampling technique for fast evaluation of integrals during the estimation process. We had proposed two techniques previously - (a) a stream-based estimator that counts crosstalk events on a bus; and (b) a statistical enumeration technique that enumerates crosstalk-producing values on a bus and computes their occurrence probability. Both these techniques suffer from exponential time complexity with respect to the bus-width. In this work, we propose a statistical non-enumerative technique that has linear time complexity with respect to the bus-width. We achieve the linear complexity by resorting to: (1) manipulating the data stream to make the crosstalk-producing values contiguous and (2) sampling the distribution function and storing it as a lookup table. Experimental results for data streams from different data environments are presented, compared against the stream-based approach. Average errors of less than 12% are obtained for bus-widths ranging from 8b to 32b.","PeriodicalId":335658,"journal":{"name":"Proceedings Design, Automation and Test in Europe Conference and Exhibition","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128202231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2004-02-16DOI: 10.1109/DATE.2004.1268838
S. Wong, C. Tsui
In very deep sub-micron designs, cross coupling capacitances become the dominant factor of the total bus loading and have a significant impact on the power consumption. In this paper, we propose two reconfigurable bus encoding schemes, which are based on the correlation among the bit lines, to reduce the power consumption at the cross coupling capacitances of the instruction buses. The instruction is encoded by flipping and reordering the bit lines during compilation time to reduce the total switching capacitances. A crossbar is used to map back the data to the original instruction code before sending to the instruction decoder. The reordering can be re-configured during run-time by using different configurations in the crossbar. We propose two types of re-configuration, static and dynamic. Static coding uses a fix flipping and re-configuring pattern after the corresponding program is compiled. Dynamic coding allows different re-configuring patterns during program execution. Experimental results show that by using the proposed schemes, significant energy reduction, 17-23%, can be achieved. Comparisons with existing bit lines reordering encoding scheme have also been made and on average more than 15% reduction can be obtained using our method.
{"title":"Re-configurable bus encoding scheme for reducing power consumption of the cross coupling capacitance for deep sub-micron instruction bus","authors":"S. Wong, C. Tsui","doi":"10.1109/DATE.2004.1268838","DOIUrl":"https://doi.org/10.1109/DATE.2004.1268838","url":null,"abstract":"In very deep sub-micron designs, cross coupling capacitances become the dominant factor of the total bus loading and have a significant impact on the power consumption. In this paper, we propose two reconfigurable bus encoding schemes, which are based on the correlation among the bit lines, to reduce the power consumption at the cross coupling capacitances of the instruction buses. The instruction is encoded by flipping and reordering the bit lines during compilation time to reduce the total switching capacitances. A crossbar is used to map back the data to the original instruction code before sending to the instruction decoder. The reordering can be re-configured during run-time by using different configurations in the crossbar. We propose two types of re-configuration, static and dynamic. Static coding uses a fix flipping and re-configuring pattern after the corresponding program is compiled. Dynamic coding allows different re-configuring patterns during program execution. Experimental results show that by using the proposed schemes, significant energy reduction, 17-23%, can be achieved. Comparisons with existing bit lines reordering encoding scheme have also been made and on average more than 15% reduction can be obtained using our method.","PeriodicalId":335658,"journal":{"name":"Proceedings Design, Automation and Test in Europe Conference and Exhibition","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130834465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2004-02-16DOI: 10.1109/DATE.2004.1269065
H. Krupnova
Today, having a fast hardware platform for SoC software development prior to silicon is an important challenge to gain the time-to-market. The FPGAs offer an excellent prototyping basis for building hardware platforms since more than ten years. However, as the circuit complexity increases and project timeframes shrink, building a multi-FPGA prototype represents a real challenge from the complexity viewpoint. The paper describes the state-of-the-art mapping methodology, prototyping tools and flows, shows the most difficult mapping problems and the ways to overcome them. The paper is issued from the experience of mapping on FPGA platform of four latest highly complex ST Microelectronics SoCs ranging from 1.5 to 4 million real ASIC gates mapped to up to 9 highest capacity FPGAs.
{"title":"Mapping multi-million gate SoCs on FPGAs: industrial methodology and experience","authors":"H. Krupnova","doi":"10.1109/DATE.2004.1269065","DOIUrl":"https://doi.org/10.1109/DATE.2004.1269065","url":null,"abstract":"Today, having a fast hardware platform for SoC software development prior to silicon is an important challenge to gain the time-to-market. The FPGAs offer an excellent prototyping basis for building hardware platforms since more than ten years. However, as the circuit complexity increases and project timeframes shrink, building a multi-FPGA prototype represents a real challenge from the complexity viewpoint. The paper describes the state-of-the-art mapping methodology, prototyping tools and flows, shows the most difficult mapping problems and the ways to overcome them. The paper is issued from the experience of mapping on FPGA platform of four latest highly complex ST Microelectronics SoCs ranging from 1.5 to 4 million real ASIC gates mapped to up to 9 highest capacity FPGAs.","PeriodicalId":335658,"journal":{"name":"Proceedings Design, Automation and Test in Europe Conference and Exhibition","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126583623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}