Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558368
Hsiang-Ling Li, C. Chakrabarti
A novel feature-domain 2D motion estimation system based on the straight-line Hough transform (SLHT) is presented. The system implements the motion estimation technique proposed by Li and Chakrabarti (see Pattern Recognition, vol. 29, no. 8, 1996). It operates on 256×256-pixel binary images and consists of two main blocks. The first block performs preprocessing: smoothing the boundary, tracing and integrating the contours, and detecting dominant points. The second block computes the Hough transform on contour segments and derives the rotation and translation parameters. Each module has been implemented at the gate level and simulated using Mentor Graphics tools. Experimental results are presented and compared with those of the software implementation.
Title : Hardware design of a Hough transform based 2-D motion estimation system Journal : VLSI Signal Processing, IX
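As a rough illustration of the straight-line Hough transform named in the abstract above, the following Python sketch votes in a (theta, rho) table for a set of contour points using rho = x cos(theta) + y sin(theta). The function names `slht_accumulate` and `strongest_line`, the one-degree angular resolution, and the sample points are assumptions made for this illustration; it is not the gate-level design or the motion estimation procedure of the paper.

```python
import math

def slht_accumulate(points, n_theta=180, rho_max=363):
    """Vote in (theta, rho) space; rho_max=363 covers a 256x256 image diagonal."""
    # rho lies in [-rho_max, rho_max]; shift by rho_max to get a table index.
    acc = [[0] * (2 * rho_max + 1) for _ in range(n_theta)]
    for t in range(n_theta):
        theta = math.pi * t / n_theta
        c, s = math.cos(theta), math.sin(theta)
        for x, y in points:
            rho = int(round(x * c + y * s))
            acc[t][rho + rho_max] += 1
    return acc

def strongest_line(acc, rho_max=363):
    """Return (theta_index, rho, votes) of the cell with the most votes."""
    best = (0, 0, 0)
    for t, row in enumerate(acc):
        for r, votes in enumerate(row):
            if votes > best[2]:
                best = (t, r - rho_max, votes)
    return best

# Hypothetical contour segment: 100 points on the horizontal line y = 50.
segment = [(x, 50) for x in range(100)]
print(strongest_line(slht_accumulate(segment)))
```

Every point of a straight segment votes for the same (theta, rho) cell, so the peak identifies the line. A rotation of the object shifts the peak along the theta axis, which is the kind of property a Hough-based motion estimator can exploit.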
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558314
Lan-Rong Dung, V. K. Madisetti, J. Hines
The paper describes how rapid model-year architectural synthesis (e.g., HW/SW codesign) of embedded signal processors can be performed to optimize various cost objective functions using a reuse library of models, followed by simulation-based optimization. Sponsored as part of DARPA's RASSP program, this approach has developed and released a number of interoperable and verified architectural component libraries at the system level (processors, communication protocols, and topologies). These libraries have been used in actual demonstrations of avionics and military systems, such as the MIT Lincoln Laboratory's SAR Benchmark and the F-14 legacy Infrared Search and Track System (IRST), and as part of NASA/JPL's Remote Exploration/Experimentation (REE) program studies. The authors introduce the methodology of conceptual prototyping and establish the requirements and features of the proposed environment. They also illustrate its use on some common applications with relatively sophisticated architectural building blocks, such as the IEEE SCI protocol and Analog Devices' SHARC processor family.
Title : Model-based architectural design and verification of scalable embedded DSP systems: a RASSP approach Journal : VLSI Signal Processing, IX
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558307
S. Dutta, A. Wolfe, W. Wolf, K. O'Connor
This paper is a design study of a very long instruction word (VLIW) video signal processor (VSP), concentrating on the VLSI tradeoffs which affect the processor's architecture. VLIW architectures provide high parallelism and excellent high-level language programmability, but require careful attention to VLSI design. Flexible, high-bandwidth interconnect, high-connectivity register files, and fast cycle time are required to achieve real-time video signal processing. The design targets 32-64 operations per cycle at clock rates exceeding 500 MHz. Parameterizable versions of key modules have been designed in a 0.25 µm CMOS process, allowing us to explore the VLIW VSP design space and study the tradeoffs defined by the characteristics of the process.
Title : Design issues for very-long-instruction-word VLSI video signal processors Journal : VLSI Signal Processing, IX
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558371
S. Simon, P. Rieder, J. Nossek
A variety of architectures for the discrete wavelet transform (DWT) is examined to derive an efficient VLSI implementation. The comparison leads to a lattice filter structure which uses single steps of the CORDIC algorithm. Because the proposed architecture is modular, the approach is especially suited to a full-custom design style in which module generators automate the otherwise manual design process.
Title : Efficient VLSI suited architectures for discrete wavelet transforms Journal : VLSI Signal Processing, IX
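To make the phrase "single steps of the CORDIC algorithm" more concrete, here is a minimal Python sketch of one CORDIC micro-rotation and of a full rotation built from such steps. The names `cordic_step` and `cordic_rotate`, the iteration count, and the use of floating point are assumptions for readability; in hardware the multiplications by 2^-i would be shifts, and the lattice filter structure of the paper is not modeled here.

```python
import math

def cordic_step(x, y, i, d):
    """One micro-rotation by d*atan(2**-i), d in {+1, -1}; gain is not corrected."""
    return x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i

def cordic_rotate(x, y, angle, n_iter=24):
    """Rotate (x, y) by angle (radians, roughly |angle| < 1.74) using micro-rotations."""
    z = angle
    for i in range(n_iter):
        d = 1 if z >= 0 else -1          # steer the residual angle toward zero
        x, y = cordic_step(x, y, i, d)
        z -= d * math.atan(2.0 ** -i)
    # Each step stretches the vector by sqrt(1 + 2**(-2i)); divide the total out.
    gain = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(n_iter))
    return x / gain, y / gain

print(cordic_rotate(1.0, 0.0, math.pi / 6))
```

Each step needs only two shift-and-add operations, which is why a structure built from individual CORDIC steps maps well onto small, regular hardware modules.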
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558369
J.C. Limqueco, M. Bayoumi
We propose an efficient and simple systolic-like architecture for VLSI implementation of a 2-D discrete wavelet transform (DWT). The "approximation" and "detail" components of a signal are computed simultaneously in the first octave and alternately in the other octave(s). Each processing element has its own local memory for storing intermediate data, and its routing is minimal, limited to its neighbors. The proposed architecture uses the same clock frequency for every octave level and achieves 100% utilization for the j=2 architecture, with a period of N²+N cycles. The architecture is scalable for different filter lengths (divisible by 2) and different octave levels.
Title : A scalable architecture for 2-D discrete wavelet transform Journal : VLSI Signal Processing, IX
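The octave structure mentioned in the abstract above can be sketched in a few lines of Python. This is a deliberately simplified 1-D example with a length-2 (Haar-like, unnormalized) filter pair, chosen only to show how the approximation and detail components come out of each octave and how the approximation feeds the next one. The function names and the filter choice are assumptions; the 2-D systolic-like architecture itself is not modeled.

```python
def haar_octave(signal):
    """One octave: return (approximation, detail), both computed from the same sample pairs."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    approx, detail = [], []
    for i in range(0, len(signal), 2):
        a, b = signal[i], signal[i + 1]
        approx.append((a + b) / 2)   # low-pass output, decimated by 2
        detail.append((a - b) / 2)   # high-pass output, decimated by 2
    return approx, detail

def dwt(signal, octaves):
    """Apply haar_octave repeatedly to the approximation; collect details per octave."""
    approx, details = list(signal), []
    for _ in range(octaves):
        approx, d = haar_octave(approx)
        details.append(d)
    return approx, details

print(dwt([4, 2, 6, 8], octaves=2))
```

Each octave halves the amount of data, so later octaves leave processing elements idle unless their work is interleaved, which is one reason utilization is a key figure for such architectures.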
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558278
J. Kneip, P. Pirsch
The paper describes the principle and practical implementation of an object-based cache concept that allows conflict-free regular access to data structures for a cluster of processing units. For access to shared data located in on-chip caches, the cache is based on a virtual, object-bound address space instead of the conventional linear address space. By extending the conventional block-based cache principle to 2-D blocks and using virtual addresses for address arithmetic and hit/miss detection, the time-critical address calculations in the load/store pipeline can be performed fast and at low hardware cost. Translation to physical addresses is performed during block transfer between internal caches and external system memory, where it is much less time critical and needs to be performed only once per block. The object-based cache is compiler friendly, fully transparent to the programmer, and allows a hardware-efficient implementation of a shared on-chip memory system for future parallel digital image processors.
Title : An object based data cache with conflict free concurrent access as shared memory for a parallel DSP Journal : VLSI Signal Processing, IX
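The idea of doing hit/miss detection on object-relative 2-D coordinates, and touching external memory only on a block transfer, can be sketched as follows. This Python model is purely illustrative: the class name `ObjectBlockCache`, the 8×8 block size, the unbounded dictionary used as storage, and the `fetch_block` callback are all assumptions, and neither replacement policy nor concurrent access is modeled.

```python
class ObjectBlockCache:
    """Toy cache addressed by (object, x, y) instead of a linear address."""

    def __init__(self, fetch_block, block_w=8, block_h=8):
        self.fetch_block = fetch_block   # called only on a miss (block transfer)
        self.block_w, self.block_h = block_w, block_h
        self.lines = {}                  # tag -> 2-D block of pixels
        self.misses = 0

    def read(self, obj, x, y):
        # The tag is formed from virtual, object-bound coordinates.
        tag = (obj, x // self.block_w, y // self.block_h)
        if tag not in self.lines:
            self.misses += 1
            # Physical address translation would happen inside fetch_block,
            # once per block rather than once per access.
            self.lines[tag] = self.fetch_block(*tag)
        return self.lines[tag][y % self.block_h][x % self.block_w]

# Hypothetical 16x16 image held in "external memory".
image = [[y * 16 + x for x in range(16)] for y in range(16)]

def fetch(obj, bx, by):
    return [row[bx * 8:(bx + 1) * 8] for row in image[by * 8:(by + 1) * 8]]

cache = ObjectBlockCache(fetch)
print(cache.read("img", 3, 2), cache.read("img", 4, 2), cache.misses)
```

Neighboring pixels in both directions fall into the same 2-D block, so a second access next to the first one hits without any further address translation.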
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558325
R. Hezar, V. K. Madisetti
We propose an efficient design procedure for digital FIR filters whose coefficients are restricted to the ternary set (-1, 0, +1), cascaded by a multiplication-free architecture. A dynamic programming algorithm, minimizing the instantaneous error, is also proposed to assist in the search for the optimal ternary filter coefficient set. Power reductions in a VLSI implementation appear feasible, when compared to other published approaches.
Title : Low-power digital filter implementations using ternary coefficients Journal : VLSI Signal Processing, IX
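A short Python sketch may help show why ternary coefficients lead to a multiplication-free filter: every tap either adds the delayed sample, subtracts it, or skips it. The function name `ternary_fir` and the example coefficients are assumptions for illustration; the dynamic programming search for the coefficient set described in the abstract is not reproduced here.

```python
def ternary_fir(x, coeffs):
    """FIR filter y[n] = sum_k c[k] * x[n-k] with c[k] in {-1, 0, +1}, using only add/subtract."""
    if any(c not in (-1, 0, 1) for c in coeffs):
        raise ValueError("coefficients must be -1, 0 or +1")
    y = []
    for n in range(len(x)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k < 0:
                break                # samples before the start are taken as zero
            if c == 1:
                acc += x[n - k]      # +1 tap: add
            elif c == -1:
                acc -= x[n - k]      # -1 tap: subtract
            # 0 tap: no operation at all
        y.append(acc)
    return y

print(ternary_fir([1, 2, 3, 4], [1, 0, -1]))
```

Because zero-valued taps cost nothing and the others need only an adder, the arithmetic activity per output sample is low, which is consistent with the power argument made in the abstract.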
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558376
K. Nadehara, H. Stolberg, M. Ikekawa, E. Murata, I. Kuroda
This paper presents a real-time MPEG-1 video decoder implemented in software on a DSP-enhanced, 160-mW, 100-MHz, 32-bit microprocessor. The processor's DSP-oriented instructions improve the performance of generic DSP operations such as the inverse discrete cosine transform, while fast software algorithms that operate in parallel on packed-pixel data are developed for processes unique to video decoding, such as motion compensation. Furthermore, to reduce the clock count as well as the instruction count, load/store scheduling and cache miss reduction are performed. In total, the processor can achieve 30 frames/sec MPEG-1 video decoding at a cost and power dissipation (160 mW) comparable to dedicated LSIs.
Title : Real-time software MPEG-1 video decoder design for low-cost, low-power applications Journal : VLSI Signal Processing, IX
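The abstract does not spell out its packed-pixel algorithms, so the following Python sketch only shows a well-known technique of the same flavor: averaging four 8-bit pixels at once inside a 32-bit word, as needed for half-pixel interpolation in motion compensation. The helper names and the sample pixel values are assumptions for illustration.

```python
LOW_BITS_CLEARED = 0xFEFEFEFE  # clears bit 0 of every byte so shifts cannot leak across pixels

def pack4(p):
    """Pack four 8-bit pixels into one 32-bit word."""
    return (p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3]

def unpack4(w):
    """Unpack a 32-bit word into four 8-bit pixels."""
    return [(w >> 24) & 0xFF, (w >> 16) & 0xFF, (w >> 8) & 0xFF, w & 0xFF]

def avg_floor(a, b):
    """Per-pixel floor((a + b) / 2), using a + b = 2*(a & b) + (a ^ b)."""
    return (a & b) + (((a ^ b) & LOW_BITS_CLEARED) >> 1)

def avg_round(a, b):
    """Per-pixel (a + b + 1) >> 1, using a + b = 2*(a | b) - (a ^ b)."""
    return (a | b) - (((a ^ b) & LOW_BITS_CLEARED) >> 1)

a = pack4([10, 255, 0, 7])
b = pack4([20, 255, 1, 8])
print(unpack4(avg_floor(a, b)), unpack4(avg_round(a, b)))
```

Both forms avoid the 9-bit intermediate sum, so no carry or borrow ever crosses a pixel boundary and four pixels are processed with a handful of ordinary 32-bit instructions.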
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558306
T. Aoki, Hiroshi Tokoyo, T. Higuchi
This paper presents a unified approach for designing high-radix dividers for on-line signal and data processing applications. It has long been recognized that higher radices reduce the number of computational steps in the division process. However, most conventional high-radix algorithms are not suited to high-speed parallel dividers, since they require lookup tables for selecting the quotient digits. We present a high-radix divider design that does not assume the use of lookup tables and is applicable to arbitrary radices. By prescaling the operands and converting each partial remainder into a partially non-redundant representation, the quotient digit can be obtained directly from the integer part of the partial remainder. This paper also discusses the design of a radix-8 fully parallel divider as an example.
Title : High-radix parallel dividers for VLSI signal processing Journal : VLSI Signal Processing, IX
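The claim that higher radices reduce the number of division steps can be illustrated with a plain digit-recurrence divider. The Python sketch below produces one radix-8 quotient digit (three bits) per iteration. It selects each digit by repeated comparison and subtraction, which is an assumption made to keep the example short; it does not implement the prescaling and partially non-redundant remainder scheme of the paper. The function name `digit_recurrence_divide` is likewise hypothetical.

```python
def digit_recurrence_divide(n, d, digits, radix=8):
    """Return (q, w) with q = floor(n * radix**digits / d) and w the final remainder.

    Requires 0 <= n < d, so every quotient digit lies in 0 .. radix-1.
    """
    if not 0 <= n < d:
        raise ValueError("need 0 <= n < d")
    w, q = n, 0
    for _ in range(digits):
        w *= radix                 # shift the partial remainder by one radix digit
        digit = 0
        while w >= d:              # simple digit selection by comparison
            w -= d
            digit += 1
        q = q * radix + digit      # append the new quotient digit
    return q, w

# 1/3 to four radix-8 digits (12 bits of quotient in 4 iterations).
print(digit_recurrence_divide(1, 3, digits=4))
```

A radix-2 divider would need twelve iterations for the same twelve quotient bits. The price of the higher radix is a harder digit selection, which is the step the paper targets.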
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558311
V. Zivojnovic, S. Pees, Heinrich Meyr
A machine description language is presented. The language, LISA, and its generic machine model are able to produce bit- and cycle/phase-accurate processor models covering the specific needs of HW/SW codesign and cosimulation environments. The development of a new language was necessary in order to close the gap between the coarse ISA models used in compilers and instruction set simulators on the one hand, and the detailed models used for hardware design on the other. The main part of the paper is devoted to behavioral pipeline modeling. The pipeline controller of the generic machine model is represented as an ASAP (as soon as possible) sequencer parameterized by the precedence and resource constraints of the operations of each instruction. The standard pipeline description based on reservation tables and Gantt charts is extended by additional operation descriptors which enable the detection of data and control hazards and permit modeling of pipeline flushes. Using the newly introduced L-charts, we reduce the parameterization of the pipeline controller to a minimum while still covering the typical pipeline controls found in state-of-the-art signal processors. As an example, the application of the LISA model to the TI TMS320C54x signal processor is presented.
Title : LISA: machine description language and generic machine model for HW/SW co-design Journal : VLSI Signal Processing, IX
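An ASAP sequencer driven by precedence and resource constraints, as mentioned in the abstract above, can be approximated by a small scheduling loop. The Python sketch below assumes that every operation takes one cycle and that each resource can serve one operation per cycle. The function name `asap_schedule`, the dictionary format, and the example operations are hypothetical; this is not LISA syntax, and L-charts, hazards, and flushes are not modeled.

```python
def asap_schedule(ops):
    """Assign each operation the earliest cycle allowed by its constraints.

    ops maps an operation name to (set of predecessor names, resource name).
    """
    start = {}
    for cycle in range(len(ops)):          # an acyclic set needs at most len(ops) cycles
        busy = set()                       # resources already claimed in this cycle
        for name, (preds, resource) in ops.items():
            if name in start or resource in busy:
                continue
            # Precedence: every predecessor must have run in an earlier cycle.
            if all(p in start and start[p] < cycle for p in preds):
                start[name] = cycle
                busy.add(resource)
    if len(start) != len(ops):
        raise ValueError("cyclic or unknown dependency")
    return start

# Hypothetical operations: a and b compete for the ALU, c depends on a.
ops = {
    "a": (set(), "alu"),
    "b": (set(), "alu"),
    "c": ({"a"}, "mem"),
}
print(asap_schedule(ops))
```

Here `b` is delayed by a resource conflict and `c` by a precedence constraint, which are the two kinds of parameters the abstract names for the pipeline controller.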