A technique for selecting CMOS transistor orders
Pub Date: 2007-10-01 | DOI: 10.1109/ICCD.2007.4601936 | Pages: 438-443
T. Chiang, C. Y. Chen, Weiyu Chen
Transistor reordering is known to be effective in reducing circuit delay with nearly zero penalty. However, techniques for determining good transistor orders have not been proposed in the literature. Previous work has had to resort to running SPICE for all meaningful transistor orders and selecting the best one, which is extremely time-consuming. This paper proposes an efficient and accurate technique for determining the best transistor orders without running SPICE simulations. Experimental results from SPICE3 show that the predictions are very accurate.
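The abstract does not detail the proposed prediction technique, but it does describe the exhaustive baseline it replaces: simulate every meaningful transistor order and keep the best. Below is a minimal Python sketch of that baseline only; the delay_of callback (standing in for one SPICE run per ordering) is a hypothetical placeholder, not part of the paper.

from itertools import permutations

def best_transistor_order(transistors, delay_of):
    # Exhaustive baseline: evaluate every ordering of the transistor stack.
    # delay_of(order) is a hypothetical callback returning the simulated
    # worst-case delay for that ordering (e.g., one SPICE run per call),
    # which is exactly the cost the paper's prediction technique avoids.
    best_order, best_delay = None, float("inf")
    for order in permutations(transistors):
        d = delay_of(order)
        if d < best_delay:
            best_order, best_delay = order, d
    return best_order, best_delay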
{"title":"A technique for selecting CMOS transistor orders","authors":"T. Chiang, C. Y. Chen, Weiyu Chen","doi":"10.1109/ICCD.2007.4601936","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601936","url":null,"abstract":"Transistor reordering has been known to be effective in reducing delays of a circuit with nearly zero penalties. However, techniques to determine good transistor orders have not been proposed in literature. Previous work on this has to resort to running SPICE for all meaningful transistor orders and selecting a best one, which is extremely time-consuming. This paper proposes an efficient and accurate technique for determining best transistor orders without running SPICE simulations. Experimental results from SPICE3 show that the predictions are very accurate.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"5 1","pages":"438-443"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84995446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamically compressible context architecture for low power coarse-grained reconfigurable array
Pub Date: 2007-10-01 | DOI: 10.1109/ICCD.2007.4601930 | Pages: 395-400
Yoonjin Kim, R. Mahapatra
Most coarse-grained reconfigurable array architectures (CGRAs) are composed of reconfigurable ALU arrays and a configuration cache (or context memory) to achieve high performance and flexibility. In particular, the configuration cache is the main component of a CGRA that enables dynamic reconfiguration in every cycle. However, the frequent memory-read operations required for dynamic reconfiguration consume considerable power. Reducing the power of the configuration cache has therefore become critical for CGRAs to be more competitive and reliable for use in embedded systems. In this paper, we propose a dynamically compressible context architecture for power saving in the configuration cache. This power-efficient context architecture works without degrading the performance or flexibility of the CGRA. Experimental results show that the proposed approach saves up to 39.72% of the configuration cache power with negligible area overhead.
{"title":"Dynamically compressible context architecture for low power coarse-grained reconfigurable array","authors":"Yoonjin Kim, R. Mahapatra","doi":"10.1109/ICCD.2007.4601930","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601930","url":null,"abstract":"Most of the coarse-grained reconfigurable array architectures (CGRAs) are composed of reconfigurable ALU arrays and configuration cache (or context memory) to achieve high performance and flexibility. Specially, configuration cache is the main component in CGRA that provides distinct feature for dynamic reconfiguration in every cycle. However, frequent memory-read operations for dynamic reconfiguration cause much power consumption. Thus, reducing power in configuration cache has become critical for CGRA to be more competitive and reliable for its use in embedded systems. In this paper, we propose dynamically compressible context architecture for power saving in configuration cache. This power-efficient design of context architecture works without degrading the performance and flexibility of CGRA. Experimental results show that the proposed approach saves up to 39.72% power in configuration cache with negligible area overhead.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"1 1","pages":"395-400"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85328436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cache replacement based on reuse-distance prediction
Pub Date: 2007-10-01 | DOI: 10.1109/ICCD.2007.4601909 | Pages: 245-250
G. Keramidas, Pavlos Petoumenos, S. Kaxiras
Several cache management techniques have been proposed that indirectly base their decisions on cacheline reuse distance. Cache Decay, for example, is a postdiction of reuse distance: if a cacheline has not been accessed for some "decay interval", we know that its reuse distance is at least as large as that interval. In this work, we propose to directly predict reuse distances via instruction-based (PC) prediction and to use this information for cache-level optimizations. We choose the replacement policy of the L2 cache as our optimization target, because the gap between LRU and the theoretical optimal replacement algorithm is comparatively large for L2 caches, indicating that in many situations there is ample room for improvement. We evaluate our reuse-distance-based replacement policy using a subset of the most memory-intensive SPEC2000 benchmarks, and our results show significant benefits across the board.
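As an illustration of the general idea (not the paper's exact mechanism, which the abstract does not spell out), a PC-indexed table can learn typical reuse distances per memory instruction, and the replacement policy can then evict the line whose predicted time until next reuse is furthest away. The Python sketch below is a minimal model; the table organization, smoothing rule, and per-line bookkeeping are assumptions.

class ReuseDistancePredictor:
    # PC-indexed table of predicted reuse distances (illustrative only).
    def __init__(self):
        self.table = {}

    def update(self, pc, observed_distance):
        # Simple smoothing of observed reuse distances for this PC.
        old = self.table.get(pc, observed_distance)
        self.table[pc] = (old + observed_distance) // 2

    def predict(self, pc):
        return self.table.get(pc, 0)

def choose_victim(cache_set, predictor, now):
    # Evict the line whose predicted time until next reuse is largest.
    # Each entry is assumed to record the filling instruction's PC and the
    # time of its last access; this bookkeeping is hypothetical.
    def time_to_reuse(line):
        return predictor.predict(line["pc"]) - (now - line["last_access"])
    return max(cache_set, key=time_to_reuse)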
{"title":"Cache replacement based on reuse-distance prediction","authors":"G. Keramidas, Pavlos Petoumenos, S. Kaxiras","doi":"10.1109/ICCD.2007.4601909","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601909","url":null,"abstract":"Several cache management techniques have been proposed that indirectly try to base their decisions on cacheline reuse-distance, like Cache Decay which is a postdiction of reuse-distances: if a cacheline has not been accessed for some ldquodecay intervalrdquo we know that its reuse-distance is at least as large as this decay interval. In this work, we propose to directly predict reuse-distances via instruction-based (PC) prediction and use this information for cache level optimizations. In this paper, we choose as our target for optimization the replacement policy of the L2 cache, because the gap between the LRU and the theoretical optimal replacement algorithm is comparatively large for L2 caches. This indicates that, in many situations, there is ample room for improvement. We evaluate our reusedistance based replacement policy using a subset of the most memory intensive SPEC2000 and our results show significant benefits across the board.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"97 1","pages":"245-250"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90815539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power reduction of chip multi-processors using shared resource control cooperating with DVFS
Pub Date: 2007-10-01 | DOI: 10.1109/ICCD.2007.4601961 | Pages: 615-622
Ryoma Watanabe, Masaaki Kondo, Hiroshi Nakamura, T. Nanya
This paper presents a novel power reduction method for chip multi-processors (CMPs) under real-time constraints. While the power consumption of the processing units (PUs) on a CMP can be reduced without violating real-time constraints by dynamic voltage and frequency scaling (DVFS), the clock frequency of each PU cannot be determined independently because of the performance impact of contention for shared resources. To minimize power consumption in this situation, we first derive an analytical model that provides the optimal priority and clock frequency settings, and then propose a method of controlling the priority of shared-resource accesses in cooperation with DVFS. From the analytical model, we show that in dual-core CMPs the total power consumption is minimized when the clock frequencies of the two PUs are the same. An experiment with a synthetic benchmark supports the validity of the analytical model, and evaluation results with real applications show that the proposed method reduces power consumption by up to 15%, and by 6.7% on average, compared with a conventional DVFS technique.
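The analytical model itself is not reproduced in the abstract. The toy Python calculation below only illustrates why an equal-frequency operating point can be optimal, under two simplifying assumptions that are ours rather than the paper's: dynamic power grows roughly as f^3 (voltage scaled with frequency), and meeting the deadline fixes the sum of the two PUs' frequencies.

def total_power(f1, f2, k=1.0):
    # Assumed convex power model: P = k * f**3 per processing unit.
    return k * (f1 ** 3 + f2 ** 3)

F = 2.0                                    # assumed fixed frequency budget f1 + f2
splits = [i / 100.0 for i in range(1, 100)]
best_split = min(splits, key=lambda a: total_power(a * F, (1.0 - a) * F))
print(best_split)                          # 0.5, i.e. f1 == f2 minimizes total power

Because the assumed cost is convex and symmetric in f1 and f2, the sweep lands on the equal split, which is consistent with the paper's dual-core observation.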
{"title":"Power reduction of chip multi-processors using shared resource control cooperating with DVFS","authors":"Ryoma Watanabe, Masaaki Kondo, Hiroshi Nakamura, T. Nanya","doi":"10.1109/ICCD.2007.4601961","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601961","url":null,"abstract":"This paper presents a novel power reduction method for chip multi-processors (CMPs) under real-time constraints. While the power consumption of processing units (PUs) on CMPs can be reduced without violating real-time constraints by dynamic voltage and frequency scaling (DVFS), the clock frequency of each PU cannot be determined independently because of the performance impact caused by the conflict for the shared resources. To minimize power consumption in this situation, we first derive an analytical model which provides the optimal priority and clock frequency setting, and then propose a method of controlling the priority of shared resource accesses in cooperation with DVFS. From the analytical model, in dual-core CMPs, we reveal that the total power consumption is minimized when the clock frequency of two PUs becomes the same. An experiment with a synthetic benchmark supports the validity of the analytical model and the evaluation results with real applications show that the proposed method reduces the power consumption by up to 15% and 6.7% on average compared with a conventional DVFS technique.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"96 1","pages":"615-622"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86609224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FPGA routing architecture analysis under variations
Pub Date: 2007-10-01 | DOI: 10.1109/ICCD.2007.4601894 | Pages: 152-157
S. Srinivasan, P. Mangalagiri, Yuan Xie, N. Vijaykrishnan
Systems that combine the features of ASICs and field-programmable gate arrays (FPGAs) are increasingly considered technology forerunners because of their substantial benefits. This pulls FPGAs into the technology-scaling race alongside ASICs and exposes the FPGA industry to the problems associated with scaling. Extensive process variation is one such issue, and it directly impacts the profit margins of hardware designs beyond the 65 nm gate-length node. Since FPGA resources are dominated by the interconnect fabric, interconnect variations that affect critical-path timing and leakage yield need rigorous analysis. In this work, we provide a statistical model of the individual routing components in an FPGA, followed by a statistical methodology for analyzing the timing and leakage distributions. This statistical model is incorporated into the routing algorithm to form a new statistically intelligent routing algorithm (SIRA) that simultaneously optimizes the leakage and timing yield of the FPGA device. We demonstrate an average leakage-yield increase of 9% and a timing-yield increase of 11% using our final algorithm.
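SIRA's cost function is not given in the abstract. As a hedged illustration of how a statistical delay and leakage model can be folded into routing, one plausible edge cost combines the delay mean plus a sigma multiple with a leakage penalty, as sketched below; the weights k and lam are hypothetical knobs, not values from the paper.

def edge_cost(mu_delay, sigma_delay, mean_leakage, k=3.0, lam=0.1):
    # Penalize the statistical delay tail (mu + k*sigma) plus expected leakage.
    return (mu_delay + k * sigma_delay) + lam * mean_leakage

# A variation-aware router would accumulate this cost along candidate routes
# instead of a nominal-delay cost; k weights timing yield, lam weights leakage.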
{"title":"FPGA routing architecture analysis under variations","authors":"S. Srinivasan, P. Mangalagiri, Yuan Xie, N. Vijaykrishnan","doi":"10.1109/ICCD.2007.4601894","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601894","url":null,"abstract":"Systems with the combined features of ASICs and field programmable gate arrays(FPGAs) are increasingly being considered as technology forerunners looking at their extraordinary benefits. This drags FPGAs into the technology scaling race along with ASICs exposing the FPGA industries to the problems associated with scaling. Extensive process variations is one such issue which directly impacts the profit margins of hardware design beyond 65 nm gate length technology. Since the resources in FPGAs are primarily dominated by the interconnect fabric, variations in the interconnect impacting the critical path timing and leakage yield needs rigorous analysis. In this work we provide a statistical modeling of individual routing components in an FPGA followed by a statistical methodology to analyze the timing and leakage distribution. This statistical model is incorporated into the routing algorithm to model a new statistically intelligent routing algorithm (SIRA), which simultaneously optimizes the leakage and timing yield of the FPGA device. We demonstrate and average leakage yield increase of 9% and timing yield by 11% using our final algorithm.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"100 1","pages":"152-157"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87002578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Negative-skewed shadow registers for at-speed delay variation characterization
Pub Date: 2007-10-01 | DOI: 10.1109/ICCD.2007.4601924 | Pages: 354-359
Jie Li, J. Lach
The increased process, voltage, and temperature (PVT) variability that comes with integrated circuit (IC) technology scaling has become a major problem in the semiconductor industry. In order to refine manufacturing processes and develop circuit design techniques to cope with variability, we must be able to accurately and precisely characterize the variations that occur. In this paper, we introduce a technique for characterizing combinational path delay variations by measuring a designer-controlled number of register-to-register delays in manufactured ICs with negative-skewed shadow registers. This technique enables delay measurements to be performed with at-speed tests that are run in parallel with and are orthogonal to other testing techniques, and therefore does not add combinatorial complexity to the testing process. This technique can be implemented cost-effectively on a large number of otherwise unobservable internal combinational paths to get accurate, precise data about delay variability.
{"title":"Negative-skewed shadow registers for at-speed delay variation characterization","authors":"Jie Li, J. Lach","doi":"10.1109/ICCD.2007.4601924","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601924","url":null,"abstract":"The increased process, voltage, and temperature (PVT) variability that comes with integrated circuit (IC) technology scaling has become a major problem in the semiconductor industry. In order to refine manufacturing processes and develop circuit design techniques to cope with variability, we must be able to accurately and precisely characterize the variations that occur. In this paper, we introduce a technique for characterizing combinational path delay variations by measuring a designer-controlled number of register-to-register delays in manufactured ICs with negative-skewed shadow registers. This technique enables delay measurements to be performed with at-speed tests that are run in parallel with and are orthogonal to other testing techniques, and therefore does not add combinatorial complexity to the testing process. This technique can be implemented cost-effectively on a large number of otherwise unobservable internal combinational paths to get accurate, precise data about delay variability.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"32 1","pages":"354-359"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88482713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Constraint satisfaction in incremental placement with application to performance optimization under power constraints
Pub Date: 2007-10-01 | DOI: 10.1109/ICCD.2007.4601910 | Pages: 251-258
Huan Ren, S. Dutt
We present new techniques for explicit constraint satisfaction in the incremental placement process. Our algorithm employs a Lagrangian relaxation (LR) type approach in the analytical global placement stage to solve the constrained optimization problem, and we establish theoretical results that prove the optimality of this stage. In the detailed placement stage, we develop a constraint-monitoring and satisfaction mechanism within a recently proposed network-flow-based detailed placement framework, and empirically show its near-optimality. We establish the effectiveness of our general constraint-satisfaction methods by applying them to the problem of timing-driven optimization under power constraints, overlaying our algorithms on a recently developed unconstrained timing-driven incremental placement method, flow-place. On a large set of benchmarks with up to 210K cells, our constraint-satisfaction algorithms obtain an average timing improvement of 12.4% under a 3% power-increase limit (the actual average power increase incurred is only 2.1%), while the original unconstrained method gives an average power increase of 8.4% for a timing improvement of 17.3%. Our techniques thus trade a 28% loss in timing improvement for a 75% reduction in power increase under the given constraint. Our constraint-satisfying incremental placer is also quite fast; for example, its run time for the 210K-cell circuit ibm18 is only 1541 seconds.
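The abstract names a Lagrangian-relaxation formulation for timing optimization under a power budget but does not give its details. The sketch below shows only the generic LR loop (minimize the Lagrangian, then update the multiplier with a subgradient step); the solver callback, step size, and iteration count are placeholders, not the paper's method.

def lagrangian_placement(solve_unconstrained, timing, power, budget,
                         steps=50, step_size=0.1):
    # Generic LR loop: minimize timing(x) + lam * (power(x) - budget),
    # then raise the multiplier lam while the power budget is violated.
    lam, x = 0.0, None
    for _ in range(steps):
        x = solve_unconstrained(lambda p: timing(p) + lam * (power(p) - budget))
        lam = max(0.0, lam + step_size * (power(x) - budget))
    return x, lam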
{"title":"Constraint satisfaction in incremental placement with application to performance optimization under power constraints","authors":"Huan Ren, S. Dutt","doi":"10.1109/ICCD.2007.4601910","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601910","url":null,"abstract":"We present new techniques for explicit constraint satisfaction in the incremental placement process. Our algorithm employs a Lagrangian relaxation (LR) type approach in the analytical global placement stage to solve the constrained optimization problem. We establish theoretical results that prove the optimality of this stage. In the detailed placement stage, we develop a constraint-monitoring and satisfaction mechanism in a network (n/w) flow based detailed placement framework proposed recently, and empirically show its near-optimality. We establish the effectiveness of our general constraint-satisfaction methods by applying them to the problem of timing-driven optimization under power constraints. We overlay our algorithms on a recently developed unconstrained timing-driven incremental placement method flow-place. On a large number of benchmarks with up to 210K cells, our constraint satisfaction algorithms obtain an average timing improvement of 12.4% under a 3% power increase limit (the actual average power increase incurred is only 2.1%), while the original unconstrained method gives an average power increase of 8.4% for a timing improvement of 17.3%. Our techniques thus yield a tradeoff of 75% power improvement to 28% timing deterioration for the given constraint. Our constraint-satisfying incremental placer is also quite fast, e.g., its run time for the 210 K-cell circuit ibm18 is only 1541 secs.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"66 1","pages":"251-258"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79532801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring the interplay of yield, area, and performance in processor caches
Pub Date: 2007-10-01 | DOI: 10.1109/ICCD.2007.4601905 | Pages: 216-223
Hyunjin Lee, Sangyeun Cho, B. Childers
The deployment of future deep submicron technology calls for a careful review of existing cache organizations and design practices in terms of yield and performance. This paper presents a cache design flow that enables processor architects to consider yield, area, and performance (YAP) together in a unified framework. Since there is a complex, changing trade-off between these metrics depending on the technology, the cache organization, and the yield enhancement scheme employed, such a design flow becomes invaluable to processor architects when they assess a design and explore the design space quickly at an early stage. We develop a complete set of tools supporting the proposed design flow, from injecting defects into a wafer to evaluating program performance of individual processors in the wafer. A case study is presented to demonstrate the effectiveness of the proposed design flow and developed tools.
{"title":"Exploring the interplay of yield, area, and performance in processor caches","authors":"Hyunjin Lee, Sangyeun Cho, B. Childers","doi":"10.1109/ICCD.2007.4601905","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601905","url":null,"abstract":"The deployment of future deep submicron technology calls for a careful review of existing cache organizations and design practices in terms of yield and performance. This paper presents a cache design flow that enables processor architects to consider yield, area, and performance (YAP) together in a unified framework. Since there is a complex, changing trade-off between these metrics depending on the technology, the cache organization, and the yield enhancement scheme employed, such a design flow becomes invaluable to processor architects when they assess a design and explore the design space quickly at an early stage. We develop a complete set of tools supporting the proposed design flow, from injecting defects into a wafer to evaluating program performance of individual processors in the wafer. A case study is presented to demonstrate the effectiveness of the proposed design flow and developed tools.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"51 1","pages":"216-223"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86469673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic SystemC TLM generation for custom communication platforms
Pub Date: 2007-10-01 | DOI: 10.1109/ICCD.2007.4601878 | Pages: 41-46
Lochi Yu, S. Abdi
This paper presents a tool for the automatic generation of transaction-level models (TLMs) in SystemC for MPSoC designs with custom communication platforms. The MPSoC platform is captured as a graphical netlist of components, busses, and bridge elements, and the application is captured as C processes mapped to the platform components. Once the platform is decided, a set of transaction-level communication APIs is automatically generated for each application C process. After the C code is input, an executable SystemC TLM of the design is automatically generated by our tool. This TLM can be executed with standard SystemC simulators for early functional verification of the design. Although several TLM styles and standards have been proposed in the past, our approach differs in that designers do not need to understand the underlying SystemC code or TLM modeling style to verify that their application executes on the selected platform. Another key advantage of our tool is that the platform can easily be customized for the application and a new TLM for that platform generated automatically. The TLM can be used to program the custom platform early in the design cycle, before the components are available. Our experimental results demonstrate that for large industrial applications such as an MP3 decoder and H.264, high-speed TLMs can be generated for several platforms in a few seconds.
{"title":"Automatic SystemC TLM generation for custom communication platforms","authors":"Lochi Yu, S. Abdi","doi":"10.1109/ICCD.2007.4601878","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601878","url":null,"abstract":"This paper presents a tool for automatic generation of transaction level models (TLMs) in SystemC for MPSoC designs with custom communication platforms. The MPSoC platform is captured as a graphical net-list of components, busses and bridge elements. The application is captured as C processes mapped to the platform components. Once the platform is decided, a set of transaction level communication APIs is automatically generated for each application C process. After the C code is input, an executable SystemC TLM of the design is automatically generated using our tool. This TLM can be executed using standard SystemC simulators for early functional verification of the design. Although, several TLM styles and standards have been proposed in the past, our approach differs in the fact that the designers do not need to understand the underlying SystemC code or TLM modeling style to verify that their application executes on the selected platform. Another key advantage of our tool is that the platform can be easily customized for the application and a new TLM for that platform can be automatically generated. The TLM can be used to program the custom platform early in the design cycle before the components are available. Our experimental results demonstrate that for large industrial applications such as MP3 decoder and H.264, high-speed TLMs can be generated for several platforms in a few seconds.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"59 1","pages":"41-46"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91538619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A radix-10 SRT divider based on alternative BCD codings
Pub Date: 2007-10-01 | DOI: 10.1109/ICCD.2007.4601914 | Pages: 280-287
Álvaro Vázquez, E. Antelo, P. Montuschi
In this paper we present the algorithm and architecture of a radix-10 floating-point divider based on an SRT non-restoring digit-by-digit algorithm. The algorithm uses conventional techniques developed to speed up radix-2^k division, such as a signed-digit (SD) redundant quotient and digit selection by constant comparison using a carry-save estimate of the partial remainder. To optimize area and latency for decimal, we include novel features such as the use of alternative BCD codings to represent decimal operands, estimates obtained by truncation at any binary position inside a decimal digit, a single customized fast carry-propagate decimal adder for partial remainder computation, initial odd-multiple generation, final normalization with rounding, and register placement that exploits advanced high-fanin mux-latch circuits. Rough area-delay estimates show that the proposed divider has similar latency but lower hardware complexity (a 1.3 area ratio) than a recently published high-performance digit-by-digit implementation.
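For readers unfamiliar with digit-by-digit division, the Python sketch below shows the basic radix-10 recurrence w[j+1] = 10*w[j] - q[j+1]*d with exact digit selection. It deliberately omits the paper's SRT machinery (signed-digit quotient, selection by constant comparison on a carry-save remainder estimate, alternative BCD codings) and is only meant to convey the iteration structure.

def radix10_divide(x, d, digits=16):
    # Digit recurrence: w[j+1] = 10 * w[j] - q[j+1] * d, one decimal digit per step.
    assert 0 < x < d            # normalized so the quotient is 0.q1 q2 q3 ...
    w, q = x, []
    for _ in range(digits):
        w *= 10
        qj = w // d             # exact selection; SRT instead compares a
        w -= qj * d             # truncated remainder estimate against constants
        q.append(qj)
    return q

# Example: radix10_divide(1, 3, 6) returns [3, 3, 3, 3, 3, 3], the digits of 1/3.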
{"title":"A radix-10 SRT divider based on alternative BCD codings","authors":"Álvaro Vázquez, E. Antelo, P. Montuschi","doi":"10.1109/ICCD.2007.4601914","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601914","url":null,"abstract":"In this paper we present the algorithm and architecture a radix-10 floating-point divider based on an SRT non-restoring digit-by-digit algorithm. The algorithm uses conventional techniques developed to speed-up radix-2k division such as signed-digit (SD) redundant quotient and digit selection by constant comparison using a carry-save estimate of the partial remainder. To optimize area and latency for decimal, we include novel features such as the use of alternative BCD codings to represent decimal operands, estimates by truncation at any binary position inside a decimal digit, a single customized fast carry propagate decimal adder for partial remainder computation, initial odd multiple generation and final normalization with rounding, and register placement to exploit advanced high fanin mux-latch circuits. The rough area-delay estimations performed show that the proposed divider has a similar latency but less hardware complexity (1.3 area ratio) than a recently published high performance digit-by-digit implementation.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"69 1","pages":"280-287"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91176668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}