We present a method for co-partitioning affine indexed algorithms that yields a processor array with optimized data reuse. The method derives a memory hierarchy with optimized data transfer, which significantly reduces the power consumed by memory accesses. Unlike earlier design flows, which begin with a space-time transformation, we start by co-partitioning the iteration space. This allows the resulting processor array to be adapted to the constraints of the target architecture at the beginning of the design. We illustrate our method with the full-search motion estimation algorithm, which offers high potential for data reuse.
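For orientation, full-search motion estimation can be sketched as an exhaustive block-matching loop. The sketch below is only an illustration of the algorithm named in the abstract, not the authors' co-partitioning method; the function names, the sum-of-absolute-differences (SAD) cost, and the list-of-lists frame layout are assumptions made for this example. Neighbouring candidate displacements read almost the same reference pixels, which is where the data-reuse potential comes from.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block of `cur`
    at (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def full_search(cur, ref, bx, by, n, r):
    """Try every displacement in [-r, r] x [-r, r] that keeps the
    candidate block inside `ref`; return (best_sad, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            inside = (0 <= by + dy and by + dy + n <= height
                      and 0 <= bx + dx and bx + dx + n <= width)
            if not inside:
                continue  # candidate block would leave the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Every candidate displacement recomputes a full SAD here, so each reference pixel is fetched many times; a memory hierarchy that keeps the shared search window close to the processing elements avoids those repeated fetches.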
{"title":"Optimized data-reuse in processor arrays","authors":"Sebastian Siegel, R. Merker","doi":"10.1109/ASAP.2004.10024","DOIUrl":"https://doi.org/10.1109/ASAP.2004.10024","journal":{"name":"Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004."},"publicationDate":"2004-09-27"}
The Compaan compiler automatically derives a process network (PN) description from an application written in Matlab. The basic element of a PN is a producer/consumer (P/C) pair. Four different communication patterns for a P/C pair have been identified, and the complexity of the communication structure depends on the pattern involved. Therefore, to obtain cost-efficient process networks, our compiler automatically identifies the communication pattern of each P/C pair. This problem is equivalent to integer linear programming (ILP) and thus in general cannot be solved efficiently. In this paper we present simpler techniques for classifying the interprocess communication in a PN. In some cases, however, those techniques cannot find an answer, and an ILP test must still be applied. We therefore introduce a hierarchical classification scheme that correctly classifies the interprocess communication but uses dramatically less integer linear programming: in only 5% of the cases do we still rely on ILP; in the remaining 95%, the techniques presented classify the case correctly.
{"title":"A hierarchical classification scheme to derive interprocess communication in process networks","authors":"A. Turjan, B. Kienhuis, E. Deprettere","doi":"10.1109/ASAP.2004.10025","DOIUrl":"https://doi.org/10.1109/ASAP.2004.10025","journal":{"name":"Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004."},"publicationDate":"2004-09-27"}
Public-key cryptosystems generally involve computation-intensive arithmetic operations, making them impractical for software implementation on constrained devices such as smart cards. We investigate the potential of architectural enhancements and instruction set extensions for low-level arithmetic used in public-key cryptography, most notably multiplication in finite fields of large order. The focus of the present work is directed towards a special type of finite fields, the so-called optimal extension fields GF(p^m) where p is a pseudo-Mersenne (PM) prime of the form p = 2^n - c that fits into a single register. Based on the MIPS32 instruction set architecture, we introduce two custom instructions to accelerate the reduction modulo a PM prime. Moreover, we show that the multiplication in an optimal extension field can take advantage of a multiply/accumulate unit with a wide accumulator so that a certain number of 64-bit products can be summed up without overflow. The proposed extensions support a wide range of PM primes and allow a reduction modulo 2^n - c to complete in only four clock cycles when n ≤ 32.
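The reason pseudo-Mersenne primes are attractive is that 2^n is congruent to c modulo p = 2^n - c, so the high part of a product can be folded into the low part with one small multiplication instead of a division. The following is a minimal software sketch of that folding idea, assuming non-negative inputs and 0 < c < 2^n; the function name `reduce_pm` is hypothetical, and the code does not model the two custom instructions proposed in the paper.

```python
def reduce_pm(x: int, n: int, c: int) -> int:
    """Reduce a non-negative integer x modulo p = 2**n - c.

    Uses the identity 2**n ≡ c (mod p): writing x = hi * 2**n + lo,
    we have x ≡ hi * c + lo (mod p).
    """
    p = (1 << n) - c
    mask = (1 << n) - 1
    # Fold the high part into the low part until x fits in n bits.
    while x >> n:
        x = (x >> n) * c + (x & mask)
    # x is now below 2**n, so at most one subtraction of p remains.
    if x >= p:
        x -= p
    return x
```

For example, with n = 32 and c = 5 (p = 4294967291), a 64-bit product of two field elements typically needs only two or three folding steps, because each fold shrinks the value by roughly n minus the bit length of c bits.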
{"title":"Architectural support for arithmetic in optimal extension fields","authors":"J. Großschädl, Sandeep S. Kumar, C. Paar","doi":"10.1109/ASAP.2004.10004","DOIUrl":"https://doi.org/10.1109/ASAP.2004.10004","journal":{"name":"Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004."},"publicationDate":"2004-09-27"}
This work presents a novel architecture, which is both device and circuit independent. The starting idea is that computations can be performed in three fundamentally different ways: entirely digital (using Boolean gates), entirely analog (using analog circuits), or mixed (using both digital and analog circuits). The boundaries between these are sometimes very thin. As an example, a threshold logic gate is already mixed: even if the inputs and the output are Boolean, the weighted sum of inputs is a multiple-valued logic signal, i.e. a low-precision analog signal. It has already been suggested that, at least for CMOS, a mixed analog/digital approach is the most power-efficient solution. Still, the main disadvantages of using analog circuits are: (i) their more complex (handcrafted) design, and (ii) their (expected) lower reliability (signal-to-noise or precision), which will be exacerbated by scaling. Here, we will show how both these disadvantages could be tackled. A constructive solution for Kolmogorov's superposition and (multi-threshold) threshold logic synthesis could be used for automating the design. Digital or threshold logic circuits will compensate for the accumulation of noise in the cascaded (very) low-precision analog circuits. These digital circuits will also contribute to a von Neumann multiplexing scheme used to augment the defect- and fault-tolerance of the architecture. A few examples will show how this architectural approach could be mapped onto a given (nano) technology.
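The remark that a threshold logic gate is already "mixed" can be made concrete with a small behavioural model. This is an illustrative sketch only: the function name and the integer weights are assumptions for the example, and nothing here models the circuits or the synthesis procedure of the paper. The inputs and output are Boolean, while the intermediate weighted sum takes several distinct values.

```python
def threshold_gate(inputs, weights, threshold):
    """Behavioural model of a threshold logic gate.

    `inputs` are 0/1 values; the weighted sum is a multiple-valued
    intermediate signal, and only the final comparison is Boolean.
    """
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0


def majority3(a, b, c):
    """3-input majority expressed as a single threshold gate."""
    return threshold_gate([a, b, c], [1, 1, 1], 2)
```

With unit weights and a threshold of 2, the intermediate sum ranges over 0 to 3, which is the low-precision multiple-valued signal the abstract refers to.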
{"title":"A novel highly reliable low-power nano architecture when von Neumann augments Kolmogorov","authors":"Valeriu Beiu","doi":"10.1109/ASAP.2004.10021","DOIUrl":"https://doi.org/10.1109/ASAP.2004.10021","journal":{"name":"Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004."},"publicationDate":"2004-09-27"}
O. Amft, M. Lauffer, Stijn Ossevoort, Fabrizio Macaluso, P. Lukowicz, G. Tröster
Wearable computing systems can be broadly defined as mobile electronic devices that can be unobtrusively embedded in a user's outfit as part of the garment or an accessory. Unlike conventional mobile devices, such systems should be virtually invisible, not hinder physical activity, and be always active, running without the user's attention. We present our wearability-driven design approach and the philosophy for a novel wearable computing system integrated into a fully functional belt. This system integrates the main electronics in the buckle of a belt and uses the belt itself as an extension bus and mechanical support for add-ons. The system runs the GNU/Linux operating system and has sufficient resources to address a variety of applications in the field of wearable computing. Considerations regarding ergonomic design, system architecture, first implementation results and applications are presented.
{"title":"Design of the QBIC wearable computing platform","authors":"O. Amft, M. Lauffer, Stijn Ossevoort, Fabrizio Macaluso, P. Lukowicz, G. Tröster","doi":"10.1109/ASAP.2004.10001","DOIUrl":"https://doi.org/10.1109/ASAP.2004.10001","journal":{"name":"Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004."},"publicationDate":"2004-09-27"}
J. I. Gómez, P. Marchal, Sven Verdoolaege, L. Piñuel, F. Catthoor
The memory bandwidth largely determines the performance of embedded systems. However, compilers very often ignore the actual behavior of the memory architecture, causing a large performance loss. To better utilize the memory bandwidth, several researchers have introduced instruction scheduling/data assignment techniques. Because these only optimize the bandwidth inside each basic block, they often fail to use all available bandwidth. Loop fusion is an interesting alternative for optimizing the memory access schedule more globally. By fusing loops we increase the number of independent memory operations inside each basic block. The compiler can then better exploit the available bandwidth and increase the system's performance. However, existing fusion techniques can only combine loops with a conformable header. To overcome this limitation we present loop morphing, which combines fusion with strip mining and loop splitting. We also introduce a technique to steer loop morphing such that we find a compact memory access schedule. Experimental results show that our approach can decrease the execution time by up to 88%.
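The limitation on conformable headers can be illustrated with two loops of different trip counts. The sketch below is a simplified illustration, assuming independent loop bodies and hypothetical function names; it shows only loop splitting followed by fusion of the common iteration range, and it omits strip mining and the steering technique described in the paper.

```python
def separate_loops(a, b):
    """Two loops with different trip counts (non-conformable headers)."""
    out_a = [0] * len(a)
    out_b = [0] * len(b)
    for i in range(len(a)):
        out_a[i] = 2 * a[i]
    for j in range(len(b)):
        out_b[j] = b[j] + 1
    return out_a, out_b


def morphed_loops(a, b):
    """Split both loops at the common trip count, then fuse the common part."""
    out_a = [0] * len(a)
    out_b = [0] * len(b)
    common = min(len(a), len(b))
    # Fused loop: one basic block now holds independent accesses to a and b.
    for i in range(common):
        out_a[i] = 2 * a[i]
        out_b[i] = b[i] + 1
    # Remainder loops left over from the split; at most one of them runs.
    for i in range(common, len(a)):
        out_a[i] = 2 * a[i]
    for j in range(common, len(b)):
        out_b[j] = b[j] + 1
    return out_a, out_b
```

Both versions compute the same results; the difference is that the fused body exposes two independent memory streams to the scheduler in each iteration.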
{"title":"Optimizing the memory bandwidth with loop morphing","authors":"J. I. Gómez, P. Marchal, Sven Verdoolaege, L. Piñuel, F. Catthoor","doi":"10.1109/ASAP.2004.10020","DOIUrl":"https://doi.org/10.1109/ASAP.2004.10020","journal":{"name":"Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004."},"publicationDate":"2004-09-27"}
François Charot, Madeleine Nyamsi, P. Quinton, Charles Wagner
Many multimedia and telecommunications applications are modeled as multi-rate, parallel data flow systems. We present techniques to model and schedule such applications using structured systems of recurrence equations. We show that the schedule can be obtained first by computing the period of each component of the system, then by applying structured scheduling to the entire system. This method is implemented in the MMAlpha software, and it is applied to model a WCDMA uplink receiver.
{"title":"Modeling and scheduling parallel data flow systems using structured systems of recurrence equations","authors":"François Charot, Madeleine Nyamsi, P. Quinton, Charles Wagner","doi":"10.1109/ASAP.2004.10032","DOIUrl":"https://doi.org/10.1109/ASAP.2004.10032","journal":{"name":"Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004."},"publicationDate":"2004-09-27"}
Binary finite fields GF(2^n) are very commonly used in cryptography, particularly in public-key algorithms such as elliptic curve cryptography (ECC). On word-oriented programmable processors, field elements are generally represented as polynomials with coefficients from {0, 1}. Key arithmetic operations on these polynomials, such as squaring and multiplication, are not supported by integer-oriented processor architectures. Instead, these are implemented in software, causing a very large fraction of the cryptography execution time to be dominated by a few elementary operations. For example, more than 90% of the execution time of 163-bit ECC may be consumed by two simple field operations: squaring and multiplication. A few processor architectures have been proposed recently that include instructions for binary field arithmetic. However, these have only considered processors with small wordsizes and in-order, single-issue execution. The first contribution of this paper is to validate these new arithmetic instructions for processors with wider wordsizes and multiple-issue (e.g. superscalar) execution. We also consider the effects of varying the number of functional units and load/store pipes. We demonstrate that the combination of microarchitecture and new instructions provides speedups up to 22.4x for ECC point multiplication. Second, we show that if a bit-level reverse instruction is included in the instruction set, the size of the multiplier can be reduced by half without significant performance degradation. Third, we compare the benefits of superscalar execution with wordsize scaling. The latter has been used in recent processor architectures such as PLX and PAX as a new way to extract parallelism. We show that 2x wordsize scaling provides 70% better performance than 2-way superscalar execution. Finally, we suggest a low-cost method, which we call multi-word result execution, to realize some of the benefits of wordsize scaling in existing processors with fixed wordsizes.
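To see why these operations are costly without dedicated instructions, here is a minimal software sketch of binary field multiplication: a shift-and-XOR (carry-less) polynomial multiply followed by reduction modulo an irreducible polynomial. The function names are hypothetical, polynomials are packed into Python integers with bit i holding the coefficient of x^i, and the small field GF(2^8) with modulus 0x11B is used in the usage note purely as a convenient assumption; the paper itself targets much larger fields such as those of 163-bit ECC.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less multiply: polynomial product over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition of polynomials over GF(2) is XOR
        a <<= 1
        b >>= 1
    return result


def reduce_poly(x: int, modulus: int) -> int:
    """Remainder of polynomial x modulo the polynomial `modulus`."""
    degree = modulus.bit_length() - 1
    while x.bit_length() - 1 >= degree:
        # Cancel the leading term of x with a shifted copy of the modulus.
        x ^= modulus << (x.bit_length() - 1 - degree)
    return x


def gf2n_mul(a: int, b: int, modulus: int) -> int:
    """Multiply two field elements of GF(2^n) defined by `modulus`."""
    return reduce_poly(clmul(a, b), modulus)


def gf2n_square(a: int, modulus: int) -> int:
    """Squaring is just a multiplication of an element by itself."""
    return gf2n_mul(a, a, modulus)
```

As a usage example, `gf2n_mul(0x57, 0x83, 0x11B)` evaluates to `0xC1`, matching the worked multiplication example in the AES specification. Each call loops once per bit of the operand, which is the per-bit software overhead that a single-cycle polynomial multiply instruction would remove.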
{"title":"Evaluating instruction set extensions for fast arithmetic on binary finite fields","authors":"A. M. Fiskiran, R. Lee","doi":"10.1109/ASAP.2004.10003","DOIUrl":"https://doi.org/10.1109/ASAP.2004.10003","journal":{"name":"Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004."},"publicationDate":"2004-09-27"}
Dynamic programming for approximate string matching is a large family of different algorithms, which vary significantly in purpose, complexity, and hardware utilization. Many implementations have reported impressive speed-ups, but have typically been point solutions: highly specialized and addressing only one or a few of the many possible options. The problem to be solved is creating a hardware description that implements a broad range of behavioral options without losing efficiency due to feature bloat. We report a set of three component types that address different parts of the DP string matching problem. Multiple, interchangeable implementations are available for each component type. This allows each application to choose the feature set required, then make maximum use of the FPGA fabric according to that application's specific resource requirements. Synthesis estimates show a 4:1 improvement in time-space performance, depending on the options chosen for a specific matching task.
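One well-known member of this algorithm family is the unit-cost edit (Levenshtein) distance, sketched below as a software reference point. It is chosen here only as a representative example; the abstract does not say which variants the three component types cover, and the function name is hypothetical. Each cell depends only on its left, upper, and upper-left neighbours, so the cells on an anti-diagonal are independent of one another.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance computed row by row."""
    # previous[j] holds the distance between s[:i-1] and t[:j].
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]  # turning s[:i] into the empty string costs i deletions
        for j, ct in enumerate(t, start=1):
            substitution = previous[j - 1] + (0 if cs == ct else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]
```

Swapping the cost function or the cell recurrence (for example, affine gap penalties or local alignment) changes the variant while leaving the overall table traversal the same, which is the kind of option space the abstract describes.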
{"title":"Families of FPGA-based algorithms for approximate string matching","authors":"T. Court, M. Herbordt","doi":"10.1109/ASAP.2004.10013","DOIUrl":"https://doi.org/10.1109/ASAP.2004.10013","journal":{"name":"Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004."},"publicationDate":"2004-09-27"}
Decreasing feature sizes allow additional functionality to be added to future microprocessors to improve the performance of important application domains. As a result of rapid growth in financial, commercial, and Internet-based applications, hardware support for decimal floating-point arithmetic is now being considered by various computer manufacturers and specifications for decimal floating-point arithmetic have been added to the draft revision of the IEEE-754 Standard for Floating-Point Arithmetic (IEEE-754R). This work presents an efficient arithmetic algorithm and hardware design for decimal floating-point division. The design uses an optimized piecewise linear approximation, a modified Newton-Raphson iteration, a specialized rounding technique, and a simplified combined decimal incrementer/decrementer. Synthesis results show that a 64-bit (16-digit) implementation of the decimal divider, which is compliant with IEEE-754R, has an estimated critical path delay of 0.69 ns when implemented using LSI Logic's 0.11 micron gflx-p standard cell library.
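The core of Newton-Raphson division is the reciprocal iteration x_{k+1} = x_k (2 - d x_k), which roughly doubles the number of correct digits per step. The sketch below shows that iteration in software using Python's `decimal` module. It is a simplified illustration under stated assumptions: a single crude linear initial estimate stands in for the paper's optimized piecewise linear approximation, the final step is an ordinary round to 16 digits rather than the paper's specialized rounding technique (so correct rounding is not guaranteed in every case), and the function names are hypothetical.

```python
from decimal import Decimal, localcontext


def nr_reciprocal(b: Decimal, iterations: int = 7) -> Decimal:
    """Approximate 1/b for b > 0 with the iteration x <- x * (2 - d * x)."""
    if b <= 0:
        raise ValueError("this sketch handles positive divisors only")
    with localcontext() as ctx:
        ctx.prec = 40                      # working precision with guard digits
        shift = b.adjusted()               # exponent of the leading digit of b
        d = b.scaleb(-shift)               # normalized divisor, 1 <= d < 10
        # Crude linear initial estimate of 1/d (an assumption of this sketch).
        x = Decimal("0.55") - Decimal("0.05") * d
        for _ in range(iterations):
            x = x * (2 - d * x)            # quadratic convergence toward 1/d
        return x.scaleb(-shift)            # undo the normalization


def nr_divide(a: Decimal, b: Decimal, digits: int = 16) -> Decimal:
    """Compute a / b as a * (1/b), then round to `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = 40
        quotient = a * nr_reciprocal(b)
    with localcontext() as ctx:
        ctx.prec = digits
        return +quotient                   # unary plus rounds to the context
```

The initial estimate has a relative error of at most about 0.51 over the normalized range, so seven iterations drive the error well below the 16-digit target; a better starting approximation, as used in the paper, is what allows a hardware design to get by with far fewer iterations.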
{"title":"Decimal floating-point division using Newton-Raphson iteration","authors":"Liang-Kai Wang, M. Schulte","doi":"10.1109/ASAP.2004.10005","DOIUrl":"https://doi.org/10.1109/ASAP.2004.10005","journal":{"name":"Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004."},"publicationDate":"2004-09-27"}