VLSI Signal Processing, IX: latest papers, page 5

A chip set for a ray-casting engine

VLSI Signal Processing, IX

Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558335

G. Hekstra, E. Deprettere

Rendering artificial scenes is an appealing example of a class of problems that lead to complex, data-dependent algorithms, for which efficient software/hardware mapping techniques have to be envisaged. We present one of the ASICs in our rendering system to illustrate our design methodology in more detail. The first step in the algorithm-architecture design is to reformulate an existing naive algorithm so that, as far as possible, only significant operations are performed. The resulting algorithm has a nested loop structure with non-manifest, data-dependent loop bounds, which renders classical parallelisation techniques useless. The second step is to greatly reduce the overall computation time by reducing the computational complexity of the innermost loop operation. The third and last step is to map this algorithm onto a pipelined architecture, in which the pipeline stages (functional units within an ASIC) implement different loop levels. Because of this data dependence, the functional units that implement the parts of the loops vary over time in both their execution time and the amount of data they produce for the following pipeline stages. Since the execution times of the pipeline stages change, the location of the bottleneck also shifts over time. Hence the goal is not to keep all pipeline stages continually busy, but to keep the throughput of the most critical innermost loop operation as high as possible.

Citations: 2

Scalability of 2-D wavelet transform algorithms: analytical and experimental results on coarse-grained parallel computers

VLSI Signal Processing, IX

Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558370

Jamshed N. Patel, Ashfaq A. Khokhar, Leah H. Jamieson

We present analytical and experimental results for the scalability of 2-D discrete wavelet transform (DWT) algorithms on coarse-grained parallel architectures. The principal operation in the 2-D DWT is the filtering operation used to implement the filter banks of the 2-D subband decomposition. We derive analytical results comparing time-domain and frequency-domain parallel algorithms for realizing the filter banks. Experiments on the Intel Paragon validate the analytical results. We demonstrate that there exist combinations of machine size, image size, and wavelet size for which the time-domain algorithms outperform the frequency-domain algorithms, and vice versa.
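
The distinction between time-domain and frequency-domain filtering can be illustrated with a minimal sequential sketch. It is not the authors' parallel algorithm: it assumes 1-D circular filtering of a single image row, a power-of-two row length, and a hand-written radix-2 FFT, and it omits the downsampling step of a real DWT. The function names are hypothetical.

```python
import cmath

def circular_filter_time(x, h):
    """Time-domain circular filtering: about len(x) * len(h) multiplications."""
    n = len(x)
    return [sum(h[k] * x[(i - k) % n] for k in range(len(h))) for i in range(n)]

def fft(a, invert=False):
    """Recursive radix-2 FFT (unscaled); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def circular_filter_freq(x, h):
    """Frequency-domain filtering: transform, multiply pointwise, transform back."""
    n = len(x)
    h_padded = list(h) + [0.0] * (n - len(h))  # zero-pad the filter to the row length
    spectrum = [a * b for a, b in zip(fft(x), fft(h_padded))]
    return [v.real / n for v in fft(spectrum, invert=True)]  # scale the inverse by 1/n
```

Both functions return the same filtered row up to rounding error. The direct form costs roughly N·L multiplications for a row of length N and a filter of length L, while the FFT form costs on the order of N·log N regardless of L, which is one reason the faster choice depends on image size and wavelet size.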

Citations: 32

New motion estimation using low-resolution quantization for MPEG2 video encoding

VLSI Signal Processing, IX

Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558375

Seongsoo Lee, Jeong-Min Kim, S. Chae

We propose a new real-time motion estimation algorithm for MPEG2 video encoding. It reduces the computational cost by using low bit-resolution quantization and a new matching criterion. To maintain performance, we employ a low-resolution search followed by a full-resolution search. Simulation results show that the proposed algorithm requires 1/17.4 of the computational cost of the full search algorithm while keeping the performance degradation below 0.37 dB, for a search range of -32.0 to +31.5 in the CCIR601 image. The architecture of a real-time MPEG2 motion estimator using this algorithm is also explained. It searches two prediction modes concurrently over the -32.0 to +31.5 search range. Its hardware complexity is estimated at about 100,000 gates of random logic and 90 Kbits of SRAM. A VLSI design of the proposed architecture is in progress using a 0.5 µm triple-metal CMOS standard-cell technology.
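
The two-stage idea (a cheap low-resolution search followed by a full-resolution search) can be sketched as follows. This is only an illustration under stated assumptions, not the paper's algorithm: the abstract does not specify the matching criterion or the quantizer, so the sketch assumes a plain sum of absolute differences (SAD), quantization by dropping low-order bits, integer-pixel displacements, and a hypothetical `keep` parameter for the number of candidates passed to the second stage.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b) for a, b in zip(ra, rb))

def crop(frame, top, left, size):
    """Extract a size x size block whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]

def quantize(frame, shift):
    """Assumed quantizer: keep only the high-order bits of each 8-bit pixel."""
    return [[p >> shift for p in row] for row in frame]

def two_stage_search(cur, ref, top, left, size, radius, shift=6, keep=4):
    """Return the (dy, dx) displacement of the best match in the reference frame."""
    cur_block = crop(cur, top, left, size)
    cur_q = quantize(cur_block, shift)
    ref_q = quantize(ref, shift)

    # Stage 1: exhaustive search on the low bit-resolution data.
    candidates = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            t, l = top + dy, left + dx
            if 0 <= t <= len(ref) - size and 0 <= l <= len(ref[0]) - size:
                candidates.append((sad(cur_q, crop(ref_q, t, l, size)), dy, dx))
    candidates.sort()

    # Stage 2: full-resolution SAD on the few best low-resolution candidates only.
    best = min(
        (sad(cur_block, crop(ref, top + dy, left + dx, size)), dy, dx)
        for _, dy, dx in candidates[:keep]
    )
    return best[1], best[2]
```

The saving comes from stage 1 operating on a few bits per pixel for every candidate, while the expensive full-resolution comparison runs on only `keep` candidates.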

Citations: 2

Low power parallel multipliers

VLSI Signal Processing, IX

Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558332

Edwin de Angel, Earl E. Swartzlander

This paper presents and compares sign extension techniques used to decrease the switching activity and improve the performance of parallel multipliers. A detailed review of different sign extension schemes is presented and an improved scheme for reducing the power dissipation is proposed. Four parallel CMOS multipliers designed in 0.6 µm technology are used to implement and compare the sign extension schemes.
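
The abstract does not describe the individual schemes, so the following is only a sketch of one widely used way to avoid full sign extension when summing two's-complement terms; it should not be read as the scheme proposed in the paper. Instead of replicating each term's sign bit across all higher bit positions (where those bits would toggle together), the sign bit is inverted and a single precomputed constant is added. The sketch ignores the relative shifts that partial products have in a real multiplier, and all names are hypothetical.

```python
def sign_extend(value, n, w):
    """Extend an n-bit two's-complement value to w bits by replicating its sign bit."""
    value &= (1 << n) - 1
    sign = value >> (n - 1)
    extension = ((1 << w) - (1 << n)) if sign else 0  # ones in bits n .. w-1
    return value | extension

def sum_with_full_extension(terms, n, w):
    """Reference sum: every term is sign-extended to the full width w."""
    return sum(sign_extend(t, n, w) for t in terms) & ((1 << w) - 1)

def sum_with_constant(terms, n, w):
    """Same sum using inverted sign bits plus one correction constant."""
    mask = (1 << w) - 1
    flip = 1 << (n - 1)
    # Inverting the sign bit adds 2^(n-1) to each term's signed value.
    total = sum((t & ((1 << n) - 1)) ^ flip for t in terms)
    # One constant removes the accumulated offset of len(terms) * 2^(n-1).
    correction = (-len(terms) * flip) & mask
    return (total + correction) & mask
```

Both functions give the same w-bit result for any list of n-bit terms, because an n-bit term with sign bit s and remaining bits L has the value L - s·2^(n-1), and inverting s yields exactly that value plus 2^(n-1).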

Citations: 58

ComBox: library-based generation of VHDL modules

VLSI Signal Processing, IX

Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558362

M. Vaupel, T. Grotker, H. Meyr

We describe the automated generation of components for high-throughput, data-flow-dominated VLSI systems in digital communications. By means of a hierarchically organized library, both behavioural models with high simulation efficiency and corresponding hardware generators that produce sophisticated VHDL descriptions are made easily accessible to the system designer. The structured approach allows the trade-offs between alternatives to be evaluated at each design step and guarantees a fast and reliable design flow towards hardware. The design environment ComBox enhances reusability and enables rapid implementation of complex systems starting from a system-level description.

Citations: 6