PAPA - packed arithmetic on a prefix adder for multimedia applications
Pub Date: 2002-07-17 | DOI: 10.1109/ASAP.2002.1030719
N. Burgess
This paper introduces PAPA: packed arithmetic on a prefix adder, a new approach to parallel prefix adder design that supports a wide variety of packed arithmetic computations, including packed add and subtract with saturation, packed rounded average, and packed absolute difference. The approach consists of altering the prefix adder cell logic equations to take advantage of a previously unused "don't care" state. The principle of logical effort is employed to assess the delay of the new adder architecture by establishing the extra effort needed to select and drive the appropriate carry signal to the requisite sum sub-word. This adder will find applications in video processors and other multimedia-orientated processor chips that implement packed arithmetic operations.
{"title":"PAPA - packed arithmetic on a prefix adder for multimedia applications","authors":"N. Burgess","doi":"10.1109/ASAP.2002.1030719","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030719","url":null,"abstract":"This paper introduces PAPA: packed arithmetic on a prefix adder, a new approach to parallel prefix adder design that supports a wide variety of packed arithmetic computations, including packed add and subtract with saturation, packed rounded average, and packed absolute difference. The approach consists of altering the prefix adder cell logic equations to take advantage of a previously unused \"don't care\" state. The principle of logical effort is employed to assess the delay of the new adder architecture by establishing the extra effort needed to select and drive the appropriate carry signal to the requisite sum sub-word. This adder will find applications in video processors and other multimedia-orientated processor chips that implement packed arithmetic operations.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128626057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matrix engine for signal processing applications using the logarithmic number system
Pub Date: 2002-07-17 | DOI: 10.1109/ASAP.2002.1030730
E. Chester, J. N. Coleman
An architecture design is presented for a device based upon the logarithmic number system (LNS) that is capable of performing general matrix and complex arithmetic, with features useful for DSP system-on-chip applications. A modified LNS addition/subtraction unit is employed in multiple execution units to achieve a maximum single-precision floating-point (FP) equivalent throughput of 3.2 Gflop/s at a clock frequency of 200 MHz. Each execution unit is capable of computing functions of the form (ab + cd)^e for e ∈ {±0.5, ±1, ±2} in a 5-stage arithmetic pipeline and returning a result every cycle, yielding a considerable per-cycle improvement over both floating- and fixed-point systems. Comparisons with existing devices and a single floating-point unit are given.
{"title":"Matrix engine for signal processing applications using the logarithmic number system","authors":"E. Chester, J. N. Coleman","doi":"10.1109/ASAP.2002.1030730","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030730","url":null,"abstract":"An architecture design is presented for a device based upon the logarithmic number system (LNS) that is capable of performing general matrix and complex arithmetic, with features useful for DSP system-on-chip applications. A modified LNS addition/subtraction unit is employed in multiple execution units to achieve a maximum single-precision floating-point (FP) equivalent throughput of 3.2 Gflop/s at a clock frequency of 200 MHz. Each execution unit is capable of computing functions of the form (ab + cd)/sup e/ for e /spl isin/ {/spl plusmn/0.5, /spl plusmn/1, /spl plusmn/2} in a 5-stage arithmetic pipeline and returning a result every cycle, yielding a considerable per-cycle improvement over both floating- and fixed-point systems. Comparisons with existing devices and a single floating-point unit are given.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125622358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating products of nonlinear functions by indirect bipartite table lookup
Pub Date: 2002-07-17 | DOI: 10.1109/ASAP.2002.1030710
D. Matula, A. Fit-Florea, L. McFearin
Many function approximation procedures can obtain enhanced accuracy by an efficient table lookup of a product z = f(x)g(y). Both x and y are represented by indices of i leading bits (typically 7 < i < 16) for arguments normalized to [0, 1] or [1, 2]. Direct bipartite lookup employs i/2 bits each of x and y, yielding roughly an i/2-bit result which can lose 2 to 3 bits of accuracy when f and g are nonlinear. Indirect bipartite lookup first generates i/2-bit interval index values for f(x) and g(y) using separate j-bits-in, i/2-bits-out tables for f(x) and g(y), where i/2 < j < i and j is chosen large enough to substantially reduce the effect of nonlinearity in f(x) and g(y). The separate tables readily compensate for the high nonlinearity in f and/or g and generate interval index values representing intervals that can be tailored to minimize the maximum error of the product z = f(x)g(y), which is determined by an interval product table with the concatenated interval indices as the i-bit input. We describe several variations in interval index generation methodology and in the design of the interval product table lookup architecture so as to obtain accuracy of i/2 bits (or better) in the output in 2-3 cycles of table lookup latency.
{"title":"Evaluating products of nonlinear functions by indirect bipartite table lookup","authors":"D. Matula, A. Fit-Florea, L. McFearin","doi":"10.1109/ASAP.2002.1030710","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030710","url":null,"abstract":"Many function approximation procedures can obtain enhanced accuracy by an efficient table lookup of a product z=f(x)g(y). Both x and y are represented by indices of i leading bits (typically 7<i<16) for arguments normalized to [0, 1] or [1, 2]. Direct bipartite lookup employs 1/2 bits each of x and y yielding roughly an 1/2 bit result which can lose 2 to 3 bits of accuracy when f and g are nonlinear. Indirect bipartite lookup first generates i/2 bit interval index values for f(x) and g(y) using separate j-bits-in 1/2bits-out tables for f(x) and g(y) where i/2<j<i and is chosen large enough to substantially reduce the effect of nonlinearity in f(x) and g(y). The separate tables readily compensate for the high nonlinearity in f and/or g and generate interval index values representing intervals that can be tailored to minimize the maximum error of the product z=f(x)g(y) determined by an interval product table with the concatenated interval indices as the i bit input. We describe several variations in interval index generation methodology and in the design of the interval product table lookup architecture so as to obtain accuracy of 1/2 bits (or better) in output in 2-3 cycles of table lookup latency.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132247345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reduced power consumption for MPEG decoding with LNS
Pub Date: 2002-07-17 | DOI: 10.1109/ASAP.2002.1030705
M. Arnold
By reducing the accuracy of the logarithmic number system (LNS) it is possible to achieve lower power consumption for multimedia applications, such as MPEG, without significantly lowering the visual quality of the output. An LNS wordsize of 8 to 10 bits produces MPEG output comparable to that of a fixed-point wordsize of 14 to 16 bits. The switching activity of an LNS ALU that computes the inverse discrete cosine transform (IDCT) is one quarter that of fixed point, implying lower power consumption. By skipping inputs that are zero (which MPEG can do naturally with its run-length coding and zigzag ordering), the switching activity of LNS MPEG becomes one-tenth that of fixed point, in contrast to the minimal impact zero skipping has on fixed-point power consumption.
{"title":"Reduced power consumption for MPEG decoding with LNS","authors":"M. Arnold","doi":"10.1109/ASAP.2002.1030705","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030705","url":null,"abstract":"By reducing the accuracy of the logarithmic number system (LNS) it is possible to achieve lower power consumption for multimedia applications, such as MPEG, without significantly lowering the visual quality of the output. An LNS wordsize of 8 to 10 bits produces a comparable MPEG output as a fixed-point wordsize of 14 to 16 bits. The switching activity of an LNS ALU that computes the inverse discrete cosine transform (IDCT) is one quarter that of fixed point, implying lower power consumption. By skipping inputs that are zero (which MPEG can do naturally with its run-length coding and zigzag ordering) the switching activity of LNS MPEG becomes one-tenth that of fixed point, in contrast to the minimal impact zero skipping has on fixed-point power consumption.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127628861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Instruction stream mutation for non-deterministic processors
Pub Date: 2002-07-17 | DOI: 10.1109/ASAP.2002.1030727
J. Irwin, D. Page, N. Smart
Differential power analysis (DPA) has become a real-world threat to the security of cryptographic hardware devices such as smart-cards. Using cheap and readily available equipment, an attacker can easily compromise algorithms running on these devices in a non-invasive manner. Adding non-determinism to the execution of cryptographic algorithms has been proposed as a defence against these attacks. One way of achieving this non-determinism is to introduce random additional operations into the algorithm, which produce noise in the power profile of the device. We describe the addition of a specialised processor pipeline stage which increases the level of potential non-determinism and hence guards against the revelation of secret information.
{"title":"Instruction stream mutation for non-deterministic processors","authors":"J. Irwin, D. Page, N. Smart","doi":"10.1109/ASAP.2002.1030727","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030727","url":null,"abstract":"Differential power analysis (DPA) has become a real-world threat to the security of cryptographic hardware devices such as smart-cards. By using cheap and readily available equipment, attacks can easily compromise algorithms running on these devices in a non-invasive manner. Adding non-determinism to the execution of cryptographic algorithms has been proposed as a defence against these attacks. One way of achieving this non-determinism is to introduce random additional operations to the algorithm which produce noise in the power profile of the device. We describe the addition of a specialised processor pipeline stage which increases the level of potential non-determinism and hence guards against the revelation of secret information.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114536779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
New results on array contraction [memory optimization]
Pub Date: 2002-07-17 | DOI: 10.1109/ASAP.2002.1030735
A. Darte, Guillaume Huard
Array contraction is an optimization that transforms array variables into scalar variables within a loop. While the opposite transformation, scalar expansion, is used to enable parallelism (at a cost in memory size), array contraction is used to save memory by removing temporary arrays and to increase locality. Several heuristics have already been proposed to perform array contraction through loop fusion and/or loop shifting, but until now the complexity of the problem was unknown and no exact approach was available. In this paper, we prove two NP-completeness results that characterize the problem precisely, and we give a practical integer linear programming formulation to solve the problem exactly.
{"title":"New results on array contraction [memory optimization]","authors":"A. Darte, Guillaume Huard","doi":"10.1109/ASAP.2002.1030735","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030735","url":null,"abstract":"Array contraction is an optimization that transforms array variables into scalar variables within a loop. While the opposite transformation, scalar expansion, is used for enabling parallelism (with a penalty in memory size), array contraction is used to save memory by removing temporary arrays and to increase locality. Several heuristics have already been proposed to perform array contraction through loop fusion and/or loop shifting, but thus far, the complexity of the problem was unknown, and no exact approach was available. In this paper, we prove two NP-complete results that characterize precisely the problem and we give a practical integer linear programming formulation to solve the problem exactly.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114647787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nanocomputing with delays
Pub Date: 2002-07-17 | DOI: 10.1109/ASAP.2002.1030699
J. Fortes
The push to obtain smaller and denser circuits solely based on lithography and silicon technology is quickly reaching limits imposed by device physics and processing technology. It is anticipated that these limits will invalidate Moore's law and lead to unacceptable manufacturing costs, unreliable devices, and hard-to-manage power dissipation and interconnect problems. Nanotechnologies that rely on self-assembly, biomolecular components, and nanoelectronics are promising alternatives to silicon-based microelectronics. They will eventually enable levels of integration that exceed those of today's silicon-based microelectronics by three orders of magnitude. These nascent technologies present intriguing challenges and exciting opportunities to use biologically inspired solutions to address system architecture questions. This paper discusses recent results of an ongoing collaborative research effort by nanotechnologists, neurocomputing experts, and computer and circuit designers to explore novel architectures for nanoscale neuromorphic systems. The focus is placed on implementations whose behavior depends on how propagation delays affect communication among system components. The components under consideration are reminiscent of spiking neurons and, unlike in classical systems, interconnect is used for computation as well as communication purposes. Hybrid systems are also briefly discussed.
{"title":"Nanocomputing with delays","authors":"J. Fortes","doi":"10.1109/ASAP.2002.1030699","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030699","url":null,"abstract":"The push to obtain smaller and denser circuits solely based on lithography and silicon technology is quickly reaching limits imposed by device physics and processing technology. It is anticipated that these limits will invalidate Moore's law and lead to unacceptable manufacturing costs, unreliable devices, and hard-to-manage power dissipation and interconnect problems. Nanotechnologies that rely on self-assembly, biomolecular components, and nanoelectronics are promising alternatives to silicon-based microelectronics. They will eventually enable levels of integration that exceed that of today's silicon-based microelectronics by three orders of magnitude. These nascent technologies present intriguing challenges and exciting opportunities to use biologically inspired solutions to address system architecture questions. This paper discusses recent results of an ongoing collaborative research effort by nanotechnologists, neurocomputing experts, and computer and circuit designers to explore novel architectures for nanoscale neuromorphic systems. The focus is placed on implementations whose behavior depends on how propagation delays affect communication among system components. The components under consideration are reminiscent of spiking neurons and, unlike in classical systems, interconnect is used for computation as well as communication purposes. Hybrid systems are also briefly discussed.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117202708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Polynomial evaluation on multimedia processors
Pub Date: 2002-07-17 | DOI: 10.1109/ASAP.2002.1030725
J. Villalba, G. Bandera, Mario A. González, J. Hormigo, E. Zapata
In this paper we deal with polynomial evaluation on new processor architectures for multimedia applications. We introduce algorithms that take advantage of the new attributes of multimedia processors, such as VLIW (very long instruction word) and SIMD (single instruction, multiple data) architectures. Algorithms that support polynomial evaluation using only addition/shift operations, and other algorithms based on MAC (multiply-and-add) instructions, are analyzed and tailored to the subword parallelism units of the new processors. Both potential instruction-level and machine-level parallelism are fully exploited through concurrent use of all functional units.
{"title":"Polynomial evaluation on multimedia processors","authors":"J. Villalba, G. Bandera, Mario A. González, J. Hormigo, E. Zapata","doi":"10.1109/ASAP.2002.1030725","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030725","url":null,"abstract":"In this paper we deal with polynomial evaluation based on new processor architectures for multimedia applications. We introduce some algorithms to take advantage of the new attributes of multimedia processors, such as VLIW (very long instruction word) and SIMD (single instruction multiple data architecture) architectures. Algorithms to support polynomial evaluation based only in addition/shift operations and other different algorithms with MAC (multiply-and-add) instructions are analyzed and tailored to subword parallelism units of the new processors. Both potential instruction-level and machine-level parallelism are fully exploited through concurrent use of all functional units.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122659328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A mathematical model of trace cache
Pub Date: 2002-07-17 | DOI: 10.1109/ASAP.2002.1030715
A. Hossain, D. Pease, James S. Burns, N. Parveen
Wide-issue superscalar processors are capable of executing several basic blocks in a cycle. A regular instruction cache fetch mechanism cannot support this high fetch throughput requirement. Several improvements to the fetch mechanism are currently in use; one of the most successful is the addition of an instruction memory structure known as a trace cache. In this paper an analytical model of the instruction fetch performance of a trace cache microarchitecture is presented. Parameters that affect trace cache instruction fetch performance are explored and several analytical expressions are presented. The model can be used to understand performance tradeoffs in trace cache design. Results from the validation of the model are presented: the instruction fetch rates predicted by the model differ by seven percent from the simulated fetch rates for SPEC2000 benchmark programs. The model is implemented in a computer program named Tulip, and results from Tulip are also presented to show how different parameters influence performance.
{"title":"A mathematical model of trace cache","authors":"A. Hossain, D. Pease, James S. Burns, N. Parveen","doi":"10.1109/ASAP.2002.1030715","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030715","url":null,"abstract":"Wide-issue superscalar processors have capabilities to execute several basic blocks in a cycle. A regular instruction cache fetch mechanism is not capable of supporting this high fetch throughput requirement. Several improvements of the fetch mechanism are currently in use. One of the most successful of these improvements is the addition of an instruction memory structure known as a trace cache. In this paper an analytical model of instruction fetch performance of a trace cache microarchitecture is presented. Parameters, which affect trace cache instruction fetch performance, are explored and several analytical expressions are presented. The presented model can be used to understand performance tradeoffs in trace cache design. Results from the validation of the model are presented. The instruction fetch rates predicted by the model differ by seven percent from the simulated fetch rates for SPEC2000 benchmark programs. The model is implemented in a computer program named Tulip. To show how different parameters influence performance, results from Tulip are also presented.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129835793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A mid-texturing pixel rasterization pipeline architecture for 3D rendering processors
Pub Date: 2002-07-17 | DOI: 10.1109/ASAP.2002.1030717
W. Park, Kilwhan Lee, Il-San Kim, T. Han, Sung-Bong Yang
As 3D scenes become increasingly complex and screen resolutions increase, the design of an effective memory architecture is one of the most important issues for 3D rendering processors. We propose a pixel rasterization architecture that performs the depth test twice, before and after texture mapping. By performing the depth test before texture mapping, the proposed architecture eliminates the memory bandwidth wasted on fetching texture data for obscured pixels. It also reduces the miss penalties of the pixel cache by using a pre-fetch scheme: a frame memory access due to a cache miss at the first depth test is performed simultaneously with texture mapping. The proposed pixel rasterization architecture uses memory bandwidth effectively and reduces power consumption, producing high performance gains.
{"title":"A mid-texturing pixel rasterization pipeline architecture for 3D rendering processors","authors":"W. Park, Kilwhan Lee, Il-San Kim, T. Han, Sung-Bong Yang","doi":"10.1109/ASAP.2002.1030717","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030717","url":null,"abstract":"As a 3D scene becomes increasingly complex and the screen resolution increases, the design of effective memory architecture is one of the most important issues for 3D rendering processors. We propose a pixel rasterization architecture, which performs a depth test operation twice, before and after texture mapping. The proposed architecture eliminates memory bandwidth waste caused by fetching unnecessary obscured texture data, by performing the depth test before texture mapping. The proposed architecture reduces the miss penalties of the pixel cache by using a pre-fetch scheme - that is, a frame memory access, due to a cache miss at the first depth test, is done simultaneously with texture mapping. The proposed pixel rasterization architecture achieves memory bandwidth effectiveness and reduces power consumption, producing high-performance gains.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122849466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}