VLIW DSP vs. superscalar implementation of a baseline H.263 video encoder

Conference Record of the Thirty-Fourth Asilomar Conference on Signals, Systems and Computers (Cat. No.00CH37154) Pub Date : 2000-12-01 DOI:10.1109/ACSSC.2000.911272

S. Banerjee, H. Sheikh, L. John, B. Evans, A. Bovik


Citations: 4

Abstract

A Very Long Instruction Word (VLIW) processor and a superscalar processor can both execute multiple instructions simultaneously. A VLIW processor depends on the compiler and programmer to find the parallelism in the instructions, whereas a superscalar processor determines the parallelism at runtime. This paper compares TI TMS320C6700 VLIW digital signal processor (DSP) and SimpleScalar superscalar implementations of a baseline H.263 video encoder written in C. With level-two C compiler optimization, a one-way issue superscalar processor is 7.5 times faster than the VLIW DSP at the same processor clock speed. The superscalar speedup from one-way to four-way issue is 2.88:1, and from four-way to 256-way issue is 2.43:1. To reduce the execution time on the C6700, we write assembly routines for sum-of-absolute-difference, interpolation, and reconstruction, and place frequently used code and data into on-chip memory. We use TI's discrete cosine transform assembly routines. The hand-optimized VLIW DSP implementation is 61 times faster than the C version compiled with level-two optimization. Most of the improvement is due to the efficient placement of data and programs in memory. The hand-optimized VLIW implementation is 14% faster than a 256-way superscalar implementation without hand optimizations.
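For readers unfamiliar with the sum-of-absolute-difference kernel named in the abstract, the following is a minimal plain-C sketch of the computation. It is not the authors' C6700 assembly routine. The function name `sad_16x16`, the `MB_SIZE` constant, and the row-stride parameter are illustrative assumptions, and the 16x16 block size is assumed from the macroblock size H.263 uses for motion estimation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MB_SIZE 16  /* assumed macroblock width and height in pixels */

/* Sum of absolute differences between a current macroblock and a
 * candidate reference block (hypothetical helper, not from the paper).
 * Both pointers address the top-left pixel of their block; stride is
 * the number of bytes between the starts of successive rows. */
unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    assert(stride >= MB_SIZE);
    unsigned sum = 0;
    for (int y = 0; y < MB_SIZE; ++y) {
        for (int x = 0; x < MB_SIZE; ++x) {
            /* Widen to int before subtracting so the difference can be negative. */
            sum += (unsigned)abs((int)cur[x] - (int)ref[x]);
        }
        cur += stride;
        ref += stride;
    }
    return sum;
}
```

A motion search would call this once per candidate displacement and keep the displacement with the smallest result. The inner loop is a long run of independent subtract, absolute-value, and accumulate operations, which is the kind of regular parallelism a VLIW compiler or assembly programmer has to schedule explicitly.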



