Pub Date : 2002-07-17DOI: 10.1109/ASAP.2002.1030709
C. Kang, E. Swartzlander
The circular-mode CORDIC (coordinate rotation digital computer) algorithm is analyzed for DDFS (direct digital frequency synthesis) applications. It is shown how the CORDIC parameters should be chosen to meet given DDFS parameters. Also, three methods of CORDIC datapath quantization: rounding, truncation, and jamming, have been investigated and their error bounds are derived. Through a set of simulations, it is demonstrated that jamming has desirable characteristics in many aspects such as complexity, speed, error, and bias. Finally, it is shown that the CORDIC output can be made exact to the digits by an additional rounding process, which is especially useful for DDFS applications where the CORDIC output should be truncated to the final DAC (digital-to-analog converter) width.
{"title":"An analysis of the CORDIC algorithm for direct digital frequency synthesis","authors":"C. Kang, E. Swartzlander","doi":"10.1109/ASAP.2002.1030709","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030709","url":null,"abstract":"The circular-mode CORDIC (coordinate rotation digital computer) algorithm is analyzed for DDFS (direct digital frequency synthesis) applications. It is shown how the CORDIC parameters should be chosen to meet given DDFS parameters. Also, three methods of CORDIC datapath quantization: rounding, truncation, and jamming, have been investigated and their error bounds are derived. Through a set of simulations, it is demonstrated that jamming has desirable characteristics in many aspects such as complexity, speed, error, and bias. Finally, it is shown that the CORDIC output can be made exact to the digits by an additional rounding process, which is especially useful for DDFS applications where the CORDIC output should be truncated to the final DAC (digital-to-analog converter) width.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131674105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2002-07-17DOI: 10.1109/ASAP.2002.1030718
E. Antelo, T. Lang, P. Montuschi, A. Nannarelli
Since a large portion of the critical path in an implementation of radix-4 division corresponds to the delay of the quotient-digit selection module, it is of interest to reduce this delay. The proposal of this paper extends the approach presented recently of prestoring the selection constants corresponding to the actual value of the divisor and to perform the determination of the quotient digit by carry-free subtraction and sign detection. This extension consists in advancing the subtraction so that it is outside of the critical path. This advancement also provides the possibility of placing the registers so as to minimize the cycle time. We present the method and report results of synthesis using a family of standard cells. We conclude that the extension results in a speedup of 1.35 with respect to the basic implementation and of 1.3 with respect to the previously mentioned approach. We estimate that the areas of all three units are about the same.
{"title":"Fast radix-4 retimed division with selection by comparisons","authors":"E. Antelo, T. Lang, P. Montuschi, A. Nannarelli","doi":"10.1109/ASAP.2002.1030718","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030718","url":null,"abstract":"Since a large portion of the critical path in an implementation of radix-4 division corresponds to the delay of the quotient-digit selection module, it is of interest to reduce this delay. The proposal of this paper extends the approach presented recently of prestoring the selection constants corresponding to the actual value of the divisor and to perform the determination of the quotient digit by carry-free subtraction and sign detection. This extension consists in advancing the subtraction so that it is outside of the critical path. This advancement also provides the possibility of placing the registers so as to minimize the cycle time. We present the method and report results of synthesis using a family of standard cells. We conclude that the extension results in a speedup of 1.35 with respect to the basic implementation and of 1.3 with respect to the previously mentioned approach. We estimate that the areas of all three units are about the same.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128999010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2002-07-17DOI: 10.1109/ASAP.2002.1030716
J. Draper, J. Sondeen, S. Mediratta, Ihn Kim
The Data-Intensive Architecture(DIVA) system employs Processing-In-Memory(PIM) chips as smart-memory coprocessors to a micorprocessor. This architecture exploits inherent memory bandwidth both on chip and across the system to target several classes of bandwidth- limited applications, including multimedia applications and pointer-based and sparse-matrix computations. The DIVA project is building a prototype workstation-class system using PIM chips in place of standard DRAMs to demonstrate these concepts. We have recently completed initial testing of the rst version of the prototype PIM device. A key component of this architecture is the scalar processor that coordinates all activ-ity within a PIM node. Since such a component is present in each PIM node,we exploit parallelism to achieve significant speedups rather than relying on costly, high-performance processor design. The resulting scalar processor is then an in-order 32-bit RISC microcontroller that is extremely area-efficient. This paper details the design and implementation of this scalar processor in TSMC 0.18cm technology. In conjunction with other publications, this paper demonstrates that impressive gains can be achieved with very little "smart" logic added to memory devices.
{"title":"Implementation of a 32-bit RISC processor for the data-intensive architecture processing-in-memory chip","authors":"J. Draper, J. Sondeen, S. Mediratta, Ihn Kim","doi":"10.1109/ASAP.2002.1030716","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030716","url":null,"abstract":"The Data-Intensive Architecture(DIVA) system employs Processing-In-Memory(PIM) chips as smart-memory coprocessors to a micorprocessor. This architecture exploits inherent memory bandwidth both on chip and across the system to target several classes of bandwidth- limited applications, including multimedia applications and pointer-based and sparse-matrix computations. The DIVA project is building a prototype workstation-class system using PIM chips in place of standard DRAMs to demonstrate these concepts. We have recently completed initial testing of the rst version of the prototype PIM device. A key component of this architecture is the scalar processor that coordinates all activ-ity within a PIM node. Since such a component is present in each PIM node,we exploit parallelism to achieve significant speedups rather than relying on costly, high-performance processor design. The resulting scalar processor is then an in-order 32-bit RISC microcontroller that is extremely area-efficient. This paper details the design and implementation of this scalar processor in TSMC 0.18cm technology. In conjunction with other publications, this paper demonstrates that impressive gains can be achieved with very little \"smart\" logic added to memory devices.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"945 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123303509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2002-07-17DOI: 10.1109/ASAP.2002.1030728
Mehboob Alam, Wael Badawy, G. Jullien
This paper presents a single-chip parallel architecture for advanced encryption standard (AES). The proposed architecture uses the thread approach, which integrates fully pipelined parallel units, that process 128 bits/cycle and quadruples the data throughput. The threads architecture allows a reduction of the clock rate by a factor of four, while maintaining the data throughput, and consumes less power. The prototype runs at a data rate of 7.68 Gbps on a Xilinx xc2V1500 Virtex-II FPGA. The data rate shows that the proposed thread approach produces one of the fastest single-chip FPGA implementations currently available. In addition, the proposed architecture is scalable to 192, 256 and higher bits.
{"title":"A novel pipelined threads architecture for AES encryption algorithm","authors":"Mehboob Alam, Wael Badawy, G. Jullien","doi":"10.1109/ASAP.2002.1030728","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030728","url":null,"abstract":"This paper presents a single-chip parallel architecture for advanced encryption standard (AES). The proposed architecture uses the thread approach, which integrates fully pipelined parallel units, that process 128 bits/cycle and quadruples the data throughput. The threads architecture allows a reduction of the clock rate by a factor of four, while maintaining the data throughput, and consumes less power. The prototype runs at a data rate of 7.68 Gbps on a Xilinx xc2V1500 Virtex-II FPGA. The data rate shows that the proposed thread approach produces one of the fastest single-chip FPGA implementations currently available. In addition, the proposed architecture is scalable to 192, 256 and higher bits.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124506467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}