Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors: Latest Articles


A model-based methodology for application specific energy efficient data path design using FPGAs

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors

Pub Date : 2002-07-17 DOI: 10.1109/ASAP.2002.1030706

Sumit Mohanty, S. Choi, Ju-wook Jang, V. Prasanna

This paper presents a methodology for designing energy-efficient data paths using FPGAs. Our methodology integrates domain-specific modeling, coarse-grained performance evaluation, design space exploration, and low-level simulation to understand the tradeoffs among energy, latency, and area. The domain-specific modeling technique defines a high-level model by identifying the components and parameters specific to a domain that affect system-wide energy dissipation. A domain is a family of architectures and corresponding algorithms for a given application kernel. The high-level model also includes functions for estimating energy, latency, and area that facilitate tradeoff analysis. Design space exploration (DSE) analyzes the design space defined by the domain and selects a set of designs. Low-level simulations are used for accurate performance estimation of the designs selected by the DSE and for final design selection. We illustrate our methodology using a family of architectures and algorithms for matrix multiplication. The designs identified by our methodology demonstrate tradeoffs among energy, latency, and area.
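The design space exploration step described above can be sketched as a Pareto filter over estimated energy, latency, and area. This is a minimal illustration, not the authors' model: the `estimate` function, its coefficients, and the choice of the number of processing elements as the only design parameter are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    pes: int        # number of processing elements (hypothetical parameter)
    energy: float   # arbitrary units
    latency: float  # cycles
    area: float     # arbitrary units


def estimate(n: int, pes: int) -> Design:
    """Hypothetical high-level estimates for an n x n matrix multiplication."""
    macs = n ** 3
    latency = macs / pes                  # work divided evenly across PEs
    area = float(pes)                     # one area unit per PE
    # Made-up energy model: compute + interconnect + static energy over time.
    energy = macs * 1.0 + pes * n * n * 0.05 + 20.0 * latency
    return Design(pes, energy, latency, area)


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b on every metric and better on at least one."""
    no_worse = a.energy <= b.energy and a.latency <= b.latency and a.area <= b.area
    better = a.energy < b.energy or a.latency < b.latency or a.area < b.area
    return no_worse and better


def pareto_front(designs: list) -> list:
    """Keep only the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


designs = [estimate(16, p) for p in (1, 2, 4, 8, 16)]
front = pareto_front(designs)
for d in front:
    print(d)
```

In a flow like the one the abstract describes, the surviving designs would then be passed to low-level simulation for more accurate estimates before the final choice.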



Citations: 15

High-radix logarithm with selection by rounding

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors

Pub Date : 2002-07-17 DOI: 10.1109/ASAP.2002.1030708

José-Alejandro Piñeiro, M. Ercegovac, J. Bruguera

A high-radix digit-recurrence algorithm for the computation of the logarithm is presented in this paper. Selection by rounding is used in iterations j ≥ 2, and selection by table in the first iteration is combined with a restricted digit set for the second one, in order to guarantee the convergence of the algorithm. A sequential architecture is proposed, and the execution time and hardware requirements of this architecture are estimated for a target precision of n = 32 bits and a radix r = 256. These estimates are obtained from a rough model of the delay and area cost of the main logic blocks employed, and show a speed-up of more than 4 times over a conventional radix-2 implementation with redundant arithmetic.
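The idea of a digit recurrence with selection by rounding can be illustrated with a simplified floating-point model. This is a sketch under stated assumptions, not the paper's hardware algorithm: it uses multiplicative normalization (drive x times a product of factors 1 + e_j r^(-j) toward 1), picks the first digit from a rounded reciprocal as a stand-in for the table lookup, and does not model the restricted digit set of the second iteration or redundant arithmetic. The function name and parameters are hypothetical.

```python
import math


def high_radix_log(x: float, radix: int = 256, iterations: int = 5) -> float:
    """Approximate ln(x) for x in [1, 2) by multiplicative normalization."""
    if not 1.0 <= x < 2.0:
        raise ValueError("x must lie in [1, 2)")
    e = x            # running product E[j]; it is driven toward 1
    log_sum = 0.0    # accumulates -ln(1 + digit * radix**-j)
    for j in range(1, iterations + 1):
        scale = float(radix) ** j
        if j == 1:
            # Stand-in for the first-iteration table: a rounded reciprocal.
            digit = round(radix * (1.0 / x - 1.0))
        else:
            # Selection by rounding of the scaled residual radix**j * (1 - E[j]).
            digit = round(scale * (1.0 - e))
        e *= 1.0 + digit / scale
        log_sum -= math.log1p(digit / scale)
    return log_sum


print(high_radix_log(1.7), math.log(1.7))
```

Each iteration retires roughly log2(radix) bits of the result, which is why a radix of 256 needs only about four iterations for a 32-bit target in this model.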


Citations: 24

Implications of programmable general purpose processors for compression/encryption applications

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors

Pub Date : 2002-07-17 DOI: 10.1109/ASAP.2002.1030722

Byeong Kil Lee, L. John

With the growth of the Internet and the mobile communication industry, multimedia applications form a dominant computer workload. Media workloads are typically executed on application specific integrated circuits (ASICs), application specific processors (ASPs), or general purpose processors (GPPs). GPPs are flexible and accommodate changes in applications and algorithms better than ASICs and ASPs. However, executing these applications on GPPs comes at a high cost. In this paper, we analyze media compression/decompression algorithms from the perspective of the overhead of executing them on a programmable general purpose processor versus ASPs. We choose nine encode/decode programs from audio, image/video, and encryption applications. The instruction mix, memory access, and parallelism aspects during the execution of these programs are analyzed. Memory access latency is observed to be the main factor influencing the execution time on general purpose processors. Most of these compression/decompression algorithms process the data through execution phases (e.g., quantization, encoding, etc.), and temporary results are stored and retrieved between these phases. A metric called overhead memory-access bandwidth per input/output byte is defined to characterize the temporary memory activity of each application. We observe that more than 90% of the memory accesses made by these programs are temporary data stores and loads arising from the general purpose nature of the execution platform. We also study the data parallelism in these applications, indicating the ability of instruction-level and data-level parallel processors to exploit it. The parallelism ranges from 6 to 529 in encode processes and from 18 to 558 in decode processes.
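The abstract names the metric but does not give its formula. One plausible reading, offered here purely as an assumption, is the memory traffic beyond the essential input and output bytes, normalized by the number of input and output bytes. The function name and the byte counts in the example are hypothetical and are not measurements from the paper.

```python
def overhead_bandwidth_per_io_byte(bytes_loaded: int, bytes_stored: int,
                                   input_bytes: int, output_bytes: int) -> float:
    """Assumed definition: temporary memory traffic per byte of real input/output."""
    io_bytes = input_bytes + output_bytes
    if io_bytes <= 0:
        raise ValueError("the program must consume or produce at least one byte")
    total_traffic = bytes_loaded + bytes_stored
    # Traffic not accounted for by reading the input once and writing the output once.
    overhead = max(total_traffic - io_bytes, 0)
    return overhead / io_bytes


# Hypothetical counts for one encoder run (illustrative only).
print(overhead_bandwidth_per_io_byte(1_000_000, 400_000, 60_000, 40_000))
```

Under this reading, a value of 13 would mean that every byte of real input or output causes thirteen additional bytes of temporary loads and stores.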



Citations: 11

Efficient conversion from binary to multi-digit multi-dimensional logarithmic number systems using arrays of range addressable look-up tables

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors

Pub Date : 2002-07-17 DOI: 10.1109/ASAP.2002.1030711

R. Muscedere, V. Dimitrov, G. Jullien, W. Miller

The multi-dimensional logarithmic number system (MDLNS), which has properties similar to those of the logarithmic number system (LNS), provides more degrees of freedom than the LNS by virtue of having two orthogonal bases and the ability to use multiple digits. Unlike the LNS, there is no direct functional relationship between the binary/floating-point representation and the MDLNS representation. Traditionally, look-up tables (LUTs) were used to move from the binary domain to the MDLNS domain. This method can be unrealistic for hardware implementation when large binary ranges or multiple digits are used. This paper introduces a range addressable technique for table look-up arrays that allows efficient conversion from binary to single-digit or multi-digit MDLNS.
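A software analogue of a range addressable look-up can be sketched for the single-digit, two-base case. The sketch below assumes bases 2 and 3 and a positive input, and it replaces the hardware's range matching with a binary search over range boundaries. The function names, the exponent limit `max_b`, and the choice of 3 as the second base are assumptions for illustration; signs, zero, and multiple digits are not handled.

```python
import bisect
import math


def build_table(max_b: int = 8):
    """Entries (mantissa, a_offset, b) with mantissa == 2**a_offset * 3**b in [1, 2]."""
    entries = []
    for b in range(-max_b, max_b + 1):
        frac, exp = math.frexp(3.0 ** b)          # 3**b == frac * 2**exp, frac in [0.5, 1)
        entries.append((2.0 * frac, 1 - exp, b))  # normalize the mantissa to [1, 2)
    entries.append((2.0, 1, 0))                   # wrap-around entry: 2 == 2**1 * 3**0
    entries.sort()
    # Each entry answers for a range of mantissas; boundaries are the midpoints.
    bounds = [(entries[i][0] + entries[i + 1][0]) / 2.0
              for i in range(len(entries) - 1)]
    return entries, bounds


def to_2dlns(x: float, entries, bounds):
    """Return (a, b) such that 2**a * 3**b approximates the positive value x."""
    if x <= 0.0:
        raise ValueError("this sketch handles positive inputs only")
    frac, exp = math.frexp(x)
    mantissa, k = 2.0 * frac, exp - 1             # x == mantissa * 2**k
    _, a_offset, b = entries[bisect.bisect_right(bounds, mantissa)]
    return k + a_offset, b


entries, bounds = build_table()
a, b = to_2dlns(10.0, entries, bounds)
print(a, b, 2.0 ** a * 3.0 ** b)
```

The table holds one entry per range rather than one per input value, so its size depends on the number of second-base exponents and not on the binary input range.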


Citations: 10

Refining instruction set architecture for high-performance multimedia processing in constrained environments

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors

Pub Date : 2002-07-17 DOI: 10.1109/ASAP.2002.1030724

R. Lee, A. M. Fiskiran, Z. Shi, Xiao Yang

Multimedia processing in software has been significantly accelerated by the addition of subword-parallel instructions to the instruction set architectures (ISAs) of modern microprocessors. While some of these multimedia instructions are simple and effective, others are very complex, requiring large, special-purpose functional units that are not practical for constrained environments such as handheld multimedia information appliances. For such environments, low power and low cost are as important as the high performance required for real-time multimedia processing and the general-purpose programmability required to support an ever-growing range of applications. In this paper, we introduce PLX, a concise ISA that selects the most useful features from the first two generations of multimedia instructions added to microprocessors, and explores new ISA features for high-performance yet low-cost multimedia processing with small-footprint processors. PLX is unique in that it is designed from scratch as a fully subword-parallel architecture with novel features such as datapath scalability from 32-bit to 128-bit words, and a new definition of predication for reducing conditional branches. We illustrate the use of PLX's architectural features with four frequently used multimedia kernels: discrete cosine transform, pixel padding, clip test, and median filter. Our performance results show that a 64-bit PLX implementation achieves significant speedups compared to a basic 64-bit RISC processor and to IA-32 processors with MMX and SSE multimedia extensions. PLX's datapath scalability feature often provides an additional 2x speedup in a cost-effective way.
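What a subword-parallel add does can be modeled in a few lines. This is a behavioral sketch only: `parallel_add` is a hypothetical helper, not a PLX mnemonic, and the wrap-around (modular) overflow behavior is an assumption chosen for simplicity.

```python
def parallel_add(a: int, b: int, subword_bits: int = 16, word_bits: int = 64) -> int:
    """Add corresponding subwords of a and b independently, with wrap-around."""
    mask = (1 << subword_bits) - 1
    result = 0
    for shift in range(0, word_bits, subword_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift   # drop the carry out of each lane
    return result


# Four 16-bit lanes in a 64-bit word; the 0xFFFF lane wraps to 0x0000.
print(hex(parallel_add(0x0001_FFFF_0003_0004, 0x0001_0001_0001_0001)))
```

Calling the same helper with `word_bits=128` processes twice as many lanes per operation, which is the intuition behind the datapath scalability mentioned in the abstract.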



Citations: 17

A VLSI architecture for object recognition using tree matching

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors

Pub Date : 2002-07-17 DOI: 10.1109/ASAP.2002.1030731

K. Sitaraman, N. Ranganathan, A. Ejnioui

The problem of tree pattern matching for object recognition in images is computationally intensive. In two-dimensional images, objects can be represented through multiscale decomposition as tree structures. The pattern tree representing an object can be matched with a subject tree representing an image in order to detect the objects within the image. In this paper, we describe a new systolic algorithm and its realization as a VLSI chip for tree pattern matching. The hardware algorithm is based on a linear array of processing elements (PEs) in which the pattern matching is done in a pipelined fashion relying on nearest-neighbor communication between the PEs, and subject and pattern trees of arbitrary length can be processed using a fixed-size PE array. The algorithm has an improved execution time of O(⌈m/a⌉n) to perform the matching, where m, a, and n are the sizes of the pattern tree, processor array, and subject tree, respectively. A prototype CMOS VLSI chip implementing the proposed algorithm has been designed and verified. It is shown that the hardware algorithm proposed in this work represents a significant improvement in terms of computational complexity, data flow, and architecture over the ones previously proposed for this problem.
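The stated execution-time bound can be read as a simple pass count, taking m, a, and n as the sizes of the pattern tree, the PE array, and the subject tree: a pattern larger than the array is split into ⌈m/a⌉ partitions, and the subject tree is streamed through the array once per partition. The helper below is a back-of-the-envelope sketch of that reading; it ignores constant factors and pipeline fill and drain, and the function name is hypothetical.

```python
def systolic_match_steps(m: int, a: int, n: int) -> int:
    """Leading-order step count ceil(m / a) * n for pattern size m, array size a, subject size n."""
    if min(m, a, n) <= 0:
        raise ValueError("all sizes must be positive")
    passes = -(-m // a)    # integer ceiling of m / a
    return passes * n


# A 10-node pattern on a 4-PE array needs 3 passes over a 100-node subject tree.
print(systolic_match_steps(10, 4, 100))
```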



Citations: 4

A component architecture for FPGA-based, DSP system design

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors

Pub Date : 2002-07-17 DOI: 10.1109/ASAP.2002.1030703

G. Spivey, S. Bhattacharyya, K. Nakajima

Introducing FPGA components into DSP system implementations creates an assortment of challenges across system architecture and logic design. Recognizing that some of the greatest challenges occur in the integration of the various components, we have developed a component architecture and an associated set of software tools, collectively called the Logic Foundry. Using the Logic Foundry, an FPGA-based DSP system can be easily constructed from pre-built components and implemented on a variety of back-end FPGA platforms. The resulting implementation can then be encapsulated and integrated into a variety of front-end software application environments. This paper develops the component architecture and integration capabilities of the Logic Foundry, and examines a number of application case studies that we have experimented with using the Logic Foundry.


Citations: 7

On the propagation of faults and their detection in a hardware implementation of the Advanced Encryption Standard

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors

Pub Date : 2002-07-17 DOI: 10.1109/ASAP.2002.1030729

G. Bertoni, L. Breveglieri, I. Koren, P. Maistri, V. Piuri

High reliability is a desirable property of any implementation of the Advanced Encryption Standard (AES). To achieve high reliability, all possible faults must be detected to avoid the use and transmission of erroneous encrypted/decrypted data. In this paper, we first study the behavior of faults that may occur during the encryption and decryption procedures of AES, and the way such faults eventually propagate to the final result. We then describe an appropriate detection technique for these faults. This work extends our preliminary results (G. Bertoni et al., MPCS 2002) by considering more general fault models (e.g., permanent and multiple transient faults) and the possibility of fault masking.
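Fault propagation inside AES can be made concrete with the standard MixColumns transformation: a single-bit fault in one input byte of a column corrupts all four output bytes. The sketch below implements only MixColumns on one column as defined in the AES standard. It is not the detection technique of the paper, and the single bit-flip fault model is an assumption chosen for illustration.

```python
def xtime(byte: int) -> int:
    """Multiply by x (i.e., by 2) in GF(2^8) modulo the AES polynomial 0x11B."""
    byte <<= 1
    if byte & 0x100:
        byte ^= 0x11B
    return byte & 0xFF


def mix_column(col):
    """Apply the AES MixColumns matrix to one 4-byte column."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


good = [0xDB, 0x13, 0x53, 0x45]
faulty = [good[0] ^ 0x01] + good[1:]       # inject a single-bit fault in byte 0
corrupted = sum(g != f for g, f in zip(mix_column(good), mix_column(faulty)))
print(corrupted)                           # number of output bytes that differ
```

Because later rounds apply ShiftRows and MixColumns again, a fault like this one spreads across the state, which is why detecting it before the result is used matters.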


Citations: 32

A combined interval and floating-point comparator/selector

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors

Pub Date : 2002-07-17 DOI: 10.1109/ASAP.2002.1030720

A. Akkas

Interval arithmetic provides a robust method for automatically monitoring numerical errors and can be used to solve problems that cannot be efficiently solved with floating-point arithmetic. This paper presents the design and implementation of a combined interval and floating-point comparator/selector, which performs interval intersection, hull, mignitude, magnitude, minimum, maximum, and comparisons, as well as floating-point minimum, maximum, and comparisons. Area and delay estimates indicate that the combined interval and floating-point comparator/selector has 98% more area and a worst-case delay that is 42% greater than a conventional floating-point comparator/selector. The combined interval and floating-point comparator/selector greatly improves the performance of interval selection operations.
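The interval selection operations listed above have short textbook definitions, sketched here on closed intervals [lo, hi]. This is a software illustration of what the operations compute, not of the hardware design; the type and function names are hypothetical, and empty intervals are represented as `None` for simplicity.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    lo: float
    hi: float


def intersection(x: Interval, y: Interval) -> Optional[Interval]:
    """Largest interval contained in both x and y, or None if they are disjoint."""
    lo, hi = max(x.lo, y.lo), min(x.hi, y.hi)
    return Interval(lo, hi) if lo <= hi else None


def hull(x: Interval, y: Interval) -> Interval:
    """Smallest interval containing both x and y."""
    return Interval(min(x.lo, y.lo), max(x.hi, y.hi))


def mignitude(x: Interval) -> float:
    """Smallest absolute value of any point in x (0 if x contains 0)."""
    return 0.0 if x.lo <= 0.0 <= x.hi else min(abs(x.lo), abs(x.hi))


def magnitude(x: Interval) -> float:
    """Largest absolute value of any point in x."""
    return max(abs(x.lo), abs(x.hi))


def interval_min(x: Interval, y: Interval) -> Interval:
    """Range of min(p, q) for p in x and q in y."""
    return Interval(min(x.lo, y.lo), min(x.hi, y.hi))


def interval_max(x: Interval, y: Interval) -> Interval:
    """Range of max(p, q) for p in x and q in y."""
    return Interval(max(x.lo, y.lo), max(x.hi, y.hi))


a, b = Interval(-1.0, 2.0), Interval(1.0, 5.0)
print(intersection(a, b), hull(a, b), mignitude(a), magnitude(a))
```

Every operation reduces to a handful of endpoint comparisons and selections, which suggests why such operations can share hardware with a floating-point comparator/selector.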


Citations: 13

Design and evaluation of a multimedia computing architecture based on a 3D graphics pipeline

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors

Pub Date : 2002-07-17 DOI: 10.1109/ASAP.2002.1030723

C. Y. Chung, R. Managuli, Yongmin Kim

With the innovation and integration of media objects in multimedia applications, the importance of architectural support for different types of media objects, e.g., image, video, and graphics, in one platform has significantly increased. While several approaches based on vector or VLIW (very long instruction word) architectures, e.g., Vector-IRAM and Imagine, have been pursued, they are not as effective as dedicated graphics pipelines for high-performance 3D graphics. We have explored a new programmable computing architecture based on a 3D graphics pipeline, which utilizes dedicated hardware resources in the 3D graphics pipeline for other types of multimedia computing. Adding programmable flexibility to a graphics pipeline for texture mapping has proven to be effective, e.g., the pixel shader. However, due to the diversity of imaging and video processing applications, there are several challenges associated with converting a fixed graphics pipeline to a flexible multimedia computing engine. In this paper, we identify the additional architectural requirements, introduce the proposed architecture with extension details, and present the results of the performance evaluation. With cycle-accurate simulation of several benchmark functions, we have verified that the proposed architecture outperforms a modern, powerful media processor in imaging and video processing by a factor of 1.3 to 7.5. The 3D graphics performance would not change much because the additional pipeline stages for the extension result in longer pipeline latency but similar throughput.



Citations: 0
