Pub Date : 1997-04-16DOI: 10.1109/FPGA.1997.625366
S. Kelem
This paper describes the implementation of a real-time video algorithm on a context-switched FPGA. The FPGA is based on the Xilinx XC4000E FPGA, and includes extensions for dealing with state saving and forwarding and for increased routing demand due to time-multiplexing the hardware. The algorithm makes use of special features of this architecture to achieve high utilization of the silicon at run time. Two configuration planes are programmed as distributed RAM and two planes perform replications of the calculation in parallel. The interplay between the CLB architecture, communication between configuration planes, context-switching overhead, and the end-user application are examined as we map the algorithm onto this architecture.
{"title":"Mapping a real-time video algorithm to a context-switched FPGA","authors":"S. Kelem","doi":"10.1109/FPGA.1997.625366","DOIUrl":"https://doi.org/10.1109/FPGA.1997.625366","url":null,"abstract":"This paper describes the implementation of a real-time video algorithm on a context-switched FPGA. The FPGA is based on the Xilinx XC4000E FPGA, and includes extensions for dealing with state saving and forwarding and for increased routing demand due to time-multiplexing the hardware. The algorithm makes use of special features of this architecture to achieve high utilization of the silicon at run time. Two configuration planes are programmed as distributed RAM and two planes perform replications of the calculation in parallel. The interplay between the CLB architecture, communication between configuration planes, context-switching overhead, and the end-user application are examined as we map the algorithm onto this architecture.","PeriodicalId":303064,"journal":{"name":"Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124935963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1997-04-16DOI: 10.1109/FPGA.1997.624618
M. Abramovici, P. R. Menon
The authors introduce a new approach to fault simulation, using reconfigurable hardware to implement a critical path tracing algorithm. The performance estimate shows that the approach is at least on order of magnitude faster than serial fault emulation used in prior work.
{"title":"Fault simulation on reconfigurable hardware","authors":"M. Abramovici, P. R. Menon","doi":"10.1109/FPGA.1997.624618","DOIUrl":"https://doi.org/10.1109/FPGA.1997.624618","url":null,"abstract":"The authors introduce a new approach to fault simulation, using reconfigurable hardware to implement a critical path tracing algorithm. The performance estimate shows that the approach is at least on order of magnitude faster than serial fault emulation used in prior work.","PeriodicalId":303064,"journal":{"name":"Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117318004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1997-04-16DOI: 10.1109/FPGA.1997.624622
C. Paar, M. Rosner
Reed-Solomon (RS) error correction codes are being widely used in modern communication systems such as compact disk players or satellite communication links. RS codes rely on arithmetic in finite, or Galois fields. The specific field GF(2/sup 8/) is of central importance for many practical systems. The most costly, and thus most critical, elementary operations in RS decoders are multiplication and inversion in Galois fields. Although there have been considerable efforts in the area of Galois field arithmetic architectures, there appears to be very little reported work for Galois field arithmetic for reconfigurable hardware. This contribution provides a systematic comparison of two promising arithmetic architecture classes. The first one is based on a standard base representation, and the second one is based on composite fields. For both classes a multiplier and an inverter for GF(2/sup 8/) are described and theoretical gate counts are provided. Using a design entry based on a VHDL description, each architecture is mapped to a popular FPGA and EPLD device. For each mapping an area and a speed optimization was performed. Absolute values with respect to logic cell counts and critical path simulations are provided. The results show that the composite field architectures can have great advantages on both types of reconfigurable platforms. In particular it is found that composite field multipliers can be more than twice as fast as polynomial base multipliers on FPGAs.
{"title":"Comparison of arithmetic architectures for Reed-Solomon decoders in reconfigurable hardware","authors":"C. Paar, M. Rosner","doi":"10.1109/FPGA.1997.624622","DOIUrl":"https://doi.org/10.1109/FPGA.1997.624622","url":null,"abstract":"Reed-Solomon (RS) error correction codes are being widely used in modern communication systems such as compact disk players or satellite communication links. RS codes rely on arithmetic in finite, or Galois fields. The specific field GF(2/sup 8/) is of central importance for many practical systems. The most costly, and thus most critical, elementary operations in RS decoders are multiplication and inversion in Galois fields. Although there have been considerable efforts in the area of Galois field arithmetic architectures, there appears to be very little reported work for Galois field arithmetic for reconfigurable hardware. This contribution provides a systematic comparison of two promising arithmetic architecture classes. The first one is based on a standard base representation, and the second one is based on composite fields. For both classes a multiplier and an inverter for GF(2/sup 8/) are described and theoretical gate counts are provided. Using a design entry based on a VHDL description, each architecture is mapped to a popular FPGA and EPLD device. For each mapping an area and a speed optimization was performed. Absolute values with respect to logic cell counts and critical path simulations are provided. The results show that the composite field architectures can have great advantages on both types of reconfigurable platforms. In particular it is found that composite field multipliers can be more than twice as fast as polynomial base multipliers on FPGAs.","PeriodicalId":303064,"journal":{"name":"Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124437291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1997-04-16DOI: 10.1109/FPGA.1997.624621
J. Greenbaum, Michael Baxter
The need to partition computation across multiple programmable devices in array architecture CCMs leads to performance bottlenecks in data flow through the computer and wiring delays between adjacent devices. However, significant improvements in FPGA capacities have brought one to a threshold where direct inter-chip connections are not required because an entire algorithm can be implemented on a single device for important problems in areas such as image processing. One can now implement architectures that are similar to today's parallel computers in which interprocessor communication is done through shared memory or dedicated communication hardware. The benefits of this approach are system-wide scalability and flexibility. The authors illustrate this new style of CCM with examples from image processing, in particular a novel FPGA implementation of block motion estimation (as for MPEG encoding). Based on the lessons learned from these specific examples, they generalize and speculate on implications for new CCM architectures.
{"title":"Increased FPGA capacity enables scalable, flexible CCMs: an example from image processing","authors":"J. Greenbaum, Michael Baxter","doi":"10.1109/FPGA.1997.624621","DOIUrl":"https://doi.org/10.1109/FPGA.1997.624621","url":null,"abstract":"The need to partition computation across multiple programmable devices in array architecture CCMs leads to performance bottlenecks in data flow through the computer and wiring delays between adjacent devices. However, significant improvements in FPGA capacities have brought one to a threshold where direct inter-chip connections are not required because an entire algorithm can be implemented on a single device for important problems in areas such as image processing. One can now implement architectures that are similar to today's parallel computers in which interprocessor communication is done through shared memory or dedicated communication hardware. The benefits of this approach are system-wide scalability and flexibility. The authors illustrate this new style of CCM with examples from image processing, in particular a novel FPGA implementation of block motion estimation (as for MPEG encoding). Based on the lessons learned from these specific examples, they generalize and speculate on implications for new CCM architectures.","PeriodicalId":303064,"journal":{"name":"Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115303647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1997-04-16DOI: 10.1109/FPGA.1997.624628
N. Bergmann, Yuk Ying Chung, B. Gunther
The discrete cosine transform (DCT) is a key step in many image and video coding applications, and its efficient implementation has been extensively studied for software implementations and for custom VLSI. We analyse the use of the distributed arithmetic algorithm for the efficient implementation of the DCT in reconfigurable logic.
{"title":"Efficient implementation of the DCT on custom computers","authors":"N. Bergmann, Yuk Ying Chung, B. Gunther","doi":"10.1109/FPGA.1997.624628","DOIUrl":"https://doi.org/10.1109/FPGA.1997.624628","url":null,"abstract":"The discrete cosine transform (DCT) is a key step in many image and video coding applications, and its efficient implementation has been extensively studied for software implementations and for custom VLSI. We analyse the use of the distributed arithmetic algorithm for the efficient implementation of the DCT in reconfigurable logic.","PeriodicalId":303064,"journal":{"name":"Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125683434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1997-04-16DOI: 10.1109/FPGA.1997.624623
Yamin Li, Wanming Chu
The square root operation is hard to implement on FPGAs because of the complexity of the algorithms. In this paper, we present a non-restoring square root algorithm and two very simple single precision floating point square root implementations based on the algorithm on FPGAs. One is low-cost iterative implementation that uses a traditional adder/subtracter. The operation latency is 25 clock cycles and the issue rate is 24 clock cycles. The other is high-throughput pipelined implementation that uses multiple adder/subtracters. The operation latency is 15 clock cycles and the issue rate is one clock cycle. It means that the pipelined implementation is capable of accepting a square root instruction on every clock cycle.
{"title":"Implementation of single precision floating point square root on FPGAs","authors":"Yamin Li, Wanming Chu","doi":"10.1109/FPGA.1997.624623","DOIUrl":"https://doi.org/10.1109/FPGA.1997.624623","url":null,"abstract":"The square root operation is hard to implement on FPGAs because of the complexity of the algorithms. In this paper, we present a non-restoring square root algorithm and two very simple single precision floating point square root implementations based on the algorithm on FPGAs. One is low-cost iterative implementation that uses a traditional adder/subtracter. The operation latency is 25 clock cycles and the issue rate is 24 clock cycles. The other is high-throughput pipelined implementation that uses multiple adder/subtracters. The operation latency is 15 clock cycles and the issue rate is one clock cycle. It means that the pipelined implementation is capable of accepting a square root instruction on every clock cycle.","PeriodicalId":303064,"journal":{"name":"Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126424525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1997-04-16DOI: 10.1109/FPGA.1997.624611
Bruce Culbertson, R. Amerson, R. Carter, P. Kuekes, G. Snider
Teramac is a large custom computer which works correctly despite the fact that three quarters of its FPGAs contain defects. This is accomplished through unprecedented use of defect tolerance, which substantially reduces Teramac's cost and permits it to have an unusually complex interconnection network. Teramac tolerates defective resources, like gates and wires, that are introduced during the manufacture of its FPGAs and other components, and during assembly of the system. We have developed methods to precisely locate defects. User designs are mapped onto the system by a completely automated process that avoids the defects and hides the defect tolerance from the user. Defective components are not physically removed from the system.
{"title":"Defect tolerance on the Teramac custom computer","authors":"Bruce Culbertson, R. Amerson, R. Carter, P. Kuekes, G. Snider","doi":"10.1109/FPGA.1997.624611","DOIUrl":"https://doi.org/10.1109/FPGA.1997.624611","url":null,"abstract":"Teramac is a large custom computer which works correctly despite the fact that three quarters of its FPGAs contain defects. This is accomplished through unprecedented use of defect tolerance, which substantially reduces Teramac's cost and permits it to have an unusually complex interconnection network. Teramac tolerates defective resources, like gates and wires, that are introduced during the manufacture of its FPGAs and other components, and during assembly of the system. We have developed methods to precisely locate defects. User designs are mapped onto the system by a completely automated process that avoids the defects and hides the defect tolerance from the user. Defective components are not physically removed from the system.","PeriodicalId":303064,"journal":{"name":"Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125056696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1997-04-16DOI: 10.1109/FPGA.1997.624626
G. Chapman, B. Dufort
The complexity and speed of monolithic FPGA based custom computers has been set by the presence of defective sections which limit chip area. Test FPGAs show that laser link defect avoidance routing around flawed blocks generates delays <50% of active switches, making the error cell distribution nearly invisible.
{"title":"Laser defect correction applications to FPGA based custom computers","authors":"G. Chapman, B. Dufort","doi":"10.1109/FPGA.1997.624626","DOIUrl":"https://doi.org/10.1109/FPGA.1997.624626","url":null,"abstract":"The complexity and speed of monolithic FPGA based custom computers has been set by the presence of defective sections which limit chip area. Test FPGAs show that laser link defect avoidance routing around flawed blocks generates delays <50% of active switches, making the error cell distribution nearly invisible.","PeriodicalId":303064,"journal":{"name":"Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127895980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1997-04-16DOI: 10.1109/FPGA.1997.624603
T. McDermott, P. Ryan, M. Shand, D. Skellern, T. Percival, N. Weste
We have implemented the digital section of a wireless local area network (WLAN) demodulator in a reconfigurable interface card called the PCI Pamette. The entire baseband section of the demodulator has been implemented using the Pamette and a simple analog to digital mezzanine board. This is the second version of the demodulator, the first being a card-based design using a mixture of discrete and reconfigurable logic. The Pamette design took far less time to complete than the card-based one. Moreover, the reconfigurable substrate is much more versatile. We describe the Pamette implementation and discuss our experiences with the two different design styles and technologies.
{"title":"A wireless LAN demodulator in a Pamette: design and experience","authors":"T. McDermott, P. Ryan, M. Shand, D. Skellern, T. Percival, N. Weste","doi":"10.1109/FPGA.1997.624603","DOIUrl":"https://doi.org/10.1109/FPGA.1997.624603","url":null,"abstract":"We have implemented the digital section of a wireless local area network (WLAN) demodulator in a reconfigurable interface card called the PCI Pamette. The entire baseband section of the demodulator has been implemented using the Pamette and a simple analog to digital mezzanine board. This is the second version of the demodulator, the first being a card-based design using a mixture of discrete and reconfigurable logic. The Pamette design took far less time to complete than the card-based one. Moreover, the reconfigurable substrate is much more versatile. We describe the Pamette implementation and discuss our experiences with the two different design styles and technologies.","PeriodicalId":303064,"journal":{"name":"Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186)","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129179580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1997-04-16DOI: 10.1109/FPGA.1997.624599
N. Margolus
We propose an FPGA chip architecture based on a conventional FPGA logic array core, in which I/O pins are clocked at a much higher rate than that of the logic array that they serve. Wide data paths within the chip are time multiplexed at the edge of the chip into much faster and narrower data paths that run off-chip. This kind of arrangement makes it possible to interface a relatively slow FPGA core with high speed memories and data streams, and is useful for many pin-limited FPGA applications. For efficient use of the highest bandwidth DRAMs, our proposed chip includes a RAMBUS DRAM interface, a burst-transfer controller, and burst buffers. This proposal is motivated by our work with virtual processor cellular automata (CA) machines-a kind of SIMD computer. Our next generation of CA machines requires reconfigurable FPGA-like processors coupled to the highest speed DRAMs and SRAMs available. Unfortunately, no current FPGA chips have appropriate DRAM I/O support or the speed needed to easily interface with pipelined SRAMs. The chips proposed would make a wide range of large-scale CA simulations of 3D physical systems practical and economical-simulations that are currently well beyond the reach of any existing computer. These chips would also be well suited to a broad range of other simulation, graphics and DSP-like applications.
{"title":"An FPGA architecture for DRAM-based systolic computations","authors":"N. Margolus","doi":"10.1109/FPGA.1997.624599","DOIUrl":"https://doi.org/10.1109/FPGA.1997.624599","url":null,"abstract":"We propose an FPGA chip architecture based on a conventional FPGA logic array core, in which I/O pins are clocked at a much higher rate than that of the logic array that they serve. Wide data paths within the chip are time multiplexed at the edge of the chip into much faster and narrower data paths that run off-chip. This kind of arrangement makes it possible to interface a relatively slow FPGA core with high speed memories and data streams, and is useful for many pin-limited FPGA applications. For efficient use of the highest bandwidth DRAMs, our proposed chip includes a RAMBUS DRAM interface, a burst-transfer controller, and burst buffers. This proposal is motivated by our work with virtual processor cellular automata (CA) machines-a kind of SIMD computer. Our next generation of CA machines requires reconfigurable FPGA-like processors coupled to the highest speed DRAMs and SRAMs available. Unfortunately, no current FPGA chips have appropriate DRAM I/O support or the speed needed to easily interface with pipelined SRAMs. The chips proposed would make a wide range of large-scale CA simulations of 3D physical systems practical and economical-simulations that are currently well beyond the reach of any existing computer. These chips would also be well suited to a broad range of other simulation, graphics and DSP-like applications.","PeriodicalId":303064,"journal":{"name":"Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127357597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}