Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004. Latest articles

Optimized data-reuse in processor arrays

Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004.

Pub Date : 2004-09-27 DOI: 10.1109/ASAP.2004.10024

Sebastian Siegel, R. Merker

We present a method for co-partitioning affine indexed algorithms that results in a processor array with optimized data reuse. From this method, a memory hierarchy with optimized data transfer is derived, which allows a significant reduction of the power consumption caused by memory accesses. Unlike former design flows, which begin with a space-time transformation, we start with the co-partitioning of the iteration space. This allows the resulting processor array to be adapted to the constraints of the target architecture at the beginning of the design. We illustrate our method with the full search motion estimation algorithm, which has a high potential for data reuse.
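
As a rough illustration of why full search motion estimation offers so much data reuse, the sketch below implements a plain software version of it. Neighbouring candidate displacements read largely overlapping reference pixels, which is the reuse a memory hierarchy can exploit. This is not the co-partitioning method of the paper; the function names, the sum-of-absolute-differences cost, and the frame layout (lists of rows) are assumptions made for the example.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy). Hypothetical helper."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n, search_range):
    """Try every displacement in [-search_range, search_range]^2 and return
    (dx, dy, cost) for the candidate with the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates whose block would leave the reference frame.
            if by + dy < 0 or by + dy + n > height:
                continue
            if bx + dx < 0 or bx + dx + n > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# Reference frame with a linear ramp; the current frame is the same content
# shifted by (dx, dy) = (2, 1), so the search should recover that displacement.
ref = [[y * 8 + x for x in range(8)] for y in range(8)]
cur = [[(y + 1) * 8 + (x + 2) for x in range(8)] for y in range(8)]
result = full_search(cur, ref, bx=2, by=2, n=4, search_range=2)
```

Each call to `sad` for adjacent displacements touches almost the same reference rows and columns, so a hardware mapping that keeps those pixels in local storage avoids most of the repeated memory accesses.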

Citations: 6

A hierarchical classification scheme to derive interprocess communication in process networks

Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004.

Pub Date : 2004-09-27 DOI: 10.1109/ASAP.2004.10025

A. Turjan, B. Kienhuis, E. Deprettere

The Compaan compiler automatically derives a process network (PN) description from an application written in Matlab. The basic element of a PN is a producer/consumer (P/C) pair. Four different communication patterns for a P/C pair have been identified, and the complexity of the communication structure differs depending on the communication pattern involved. Therefore, in order to obtain cost-efficient process networks, our compiler automatically identifies the communication pattern of each P/C pair. This problem is equivalent to integer linear programming (ILP) and thus, in general, cannot be solved efficiently. In this paper we present simpler techniques that allow classifying the interprocess communication in a PN. However, in some cases those techniques cannot find an answer, and an ILP test still has to be applied. Thus, we introduce a hierarchical classification scheme that correctly classifies the interprocess communication but uses dramatically less integer linear programming: we still rely on integer linear programming in only 5% of the cases to classify; in the remaining 95%, the techniques presented are able to classify a case correctly.

Citations: 3

Architectural support for arithmetic in optimal extension fields

Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004.

Pub Date : 2004-09-27 DOI: 10.1109/ASAP.2004.10004

J. Großschädl, Sandeep S. Kumar, C. Paar

Public-key cryptosystems generally involve computation-intensive arithmetic operations, making them impractical for software implementation on constrained devices such as smart cards. We investigate the potential of architectural enhancements and instruction set extensions for low-level arithmetic used in public-key cryptography, most notably multiplication in finite fields of large order. The focus of the present work is directed towards a special type of finite fields, the so-called optimal extension fields GF(p^m), where p is a pseudo-Mersenne (PM) prime of the form p = 2^n - c that fits into a single register. Based on the MIPS32 instruction set architecture, we introduce two custom instructions to accelerate the reduction modulo a PM prime. Moreover, we show that the multiplication in an optimal extension field can take advantage of a multiply/accumulate unit with a wide accumulator so that a certain number of 64-bit products can be summed up without overflow. The proposed extensions support a wide range of PM primes and allow a reduction modulo 2^n - c to complete in only four clock cycles when n ≤ 32.
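
The reason pseudo-Mersenne primes make reduction cheap can be sketched in a few lines: because 2^n ≡ c (mod 2^n - c), the high part of a product can be folded back into the low part with one multiplication by the small constant c. The code below is a software illustration of that identity only, not the custom instructions proposed in the paper; `pm_reduce` and the example modulus 2^32 - 5 are assumptions chosen for the example.

```python
def pm_reduce(x, n, c):
    """Reduce a non-negative integer x modulo p = 2**n - c.

    Uses the identity 2**n ≡ c (mod p): split x into a high and a low part at
    bit n and replace x by high * c + low until it fits into n bits.
    Assumes c is small compared with 2**n (hypothetical helper)."""
    p = (1 << n) - c
    mask = (1 << n) - 1
    while x >> n:
        x = (x >> n) * c + (x & mask)  # fold the high part back in
    if x >= p:  # at most one final subtraction is needed when c <= p
        x -= p
    return x


# Example with p = 2**32 - 5: reduce the 64-bit product of two field elements.
N, C = 32, 5
P = (1 << N) - C
a, b = 0xFFFFFFF0, 0x12345678
product_mod_p = pm_reduce(a * b, N, C)
```

For a 64-bit product and a small c, the loop runs only a couple of times, which is consistent with the short, fixed latency the abstract reports for the hardware version.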

Citations: 12

A novel highly reliable low-power nano architecture when von Neumann augments Kolmogorov

Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004.

Pub Date : 2004-09-27 DOI: 10.1109/ASAP.2004.10021

Valeriu Beiu

This work presents a novel architecture, which is both device and circuit independent. The starting idea is that computations can be performed in three fundamentally different ways: entirely digital (using Boolean gates), entirely analog (using analog circuits), or mixed (using both digital and analog circuits). The boundaries between these are sometimes very thin. As an example, a threshold logic gate is already mixed: even if the inputs and the output are Boolean, the weighted sum of inputs is a multiple-valued logic signal, i.e., a low-precision analog signal. It has already been suggested that, at least for CMOS, a mixed analog/digital approach is the most power-efficient solution. Still, the main disadvantages of using analog circuits are: (i) their more complex (handcrafted) design, and (ii) their (expected) lower reliability (signal-to-noise or precision), which will be exacerbated by scaling. Here, we will show how both these disadvantages could be tackled. A constructive solution for Kolmogorov's superposition and (multi-threshold) threshold logic synthesis could be used for automating the design. Digital or threshold logic circuits will compensate for the accumulation of noise in the cascaded (very) low precision analog circuits. These digital circuits will also contribute to a von Neumann multiplexing scheme used to augment the defect- and fault-tolerance of the architecture. A few examples will show how this architectural approach could be mapped on top of a given (nano) technology.
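
The remark that a threshold logic gate is "already mixed" can be made concrete with a small behavioural model: the inputs and the output are Boolean, while the intermediate weighted sum takes several values. The sketch below is only such a model, with hypothetical names and integer weights chosen for illustration; it says nothing about the circuits or the synthesis procedure of the paper.

```python
def weighted_sum(inputs, weights):
    """Multiple-valued intermediate signal of a threshold gate."""
    return sum(w * x for w, x in zip(weights, inputs))


def threshold_gate(inputs, weights, threshold):
    """Boolean output: 1 when the weighted sum reaches the threshold."""
    return 1 if weighted_sum(inputs, weights) >= threshold else 0


def majority3(a, b, c):
    """3-input majority as a threshold gate with unit weights and threshold 2."""
    return threshold_gate((a, b, c), (1, 1, 1), 2)


# The intermediate sum of the majority gate takes the values 0..3, although
# every input and the output are single bits.
sums_seen = sorted({weighted_sum((a, b, c), (1, 1, 1))
                    for a in (0, 1) for b in (0, 1) for c in (0, 1)})
```

With unit weights the same gate structure also yields AND (threshold 3) and OR (threshold 1), which is part of what makes threshold logic attractive as a building block.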

Citations: 29

Design of the QBIC wearable computing platform

Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004.

Pub Date : 2004-09-27 DOI: 10.1109/ASAP.2004.10001

O. Amft, M. Lauffer, Stijn Ossevoort, Fabrizio Macaluso, P. Lukowicz, G. Tröster

Wearable computing systems can be broadly defined as mobile electronic devices that can be unobtrusively embedded in a user's outfit as part of a garment or an accessory. Unlike conventional mobile devices, such systems should be virtually invisible, not hinder physical activity, and be always active, running without the user's attention. We present our wearability-driven design approach and the philosophy for a novel wearable computing system integrated into a fully functional belt. This system integrates the main electronics in the buckle of a belt and uses the belt itself as an extension bus and mechanical support for add-ons. The system runs the GNU/Linux operating system and has sufficient resources to address a variety of applications in the field of wearable computing. Considerations regarding ergonomic design, system architecture, first implementation results, and applications are presented.

Citations: 74

Optimizing the memory bandwidth with loop morphing

Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004.

Pub Date : 2004-09-27 DOI: 10.1109/ASAP.2004.10020

J. I. Gómez, P. Marchal, Sven Verdoolaege, L. Piñuel, F. Catthoor

The memory bandwidth largely determines the performance of embedded systems. However, compilers very often ignore the actual behavior of the memory architecture, causing large performance loss. To better utilize the memory bandwidth, several researchers have introduced instruction scheduling/data assignment techniques. Because they only optimize the bandwidth inside each basic block, they often fail to use all available bandwidth. Loop fusion is an interesting alternative that optimizes the memory access schedule more globally. By fusing loops we increase the number of independent memory operations inside each basic block. The compiler can then better exploit the available bandwidth and increase the system's performance. However, existing fusion techniques can only combine loops with a conformable header. To overcome this limitation we present loop morphing, which combines fusion with strip mining and loop splitting. We also introduce a technique to steer loop morphing such that we find a compact memory access schedule. Experimental results show that with our approach we can decrease the execution time by up to 88%.
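
A minimal sketch of the idea of combining loop splitting with fusion, assuming two independent loops whose trip counts differ (so their headers are not conformable). The loops are split at the common trip count, the common parts are fused into one body, and the remainders run separately. The function names and loop bodies are invented for illustration, and the sketch does not model the steering technique or any memory schedule.

```python
def run_separately(a, b):
    """Baseline: two loops with different trip counts."""
    doubled = [0] * len(a)
    for i in range(len(a)):
        doubled[i] = 2 * a[i]
    shifted = [0] * len(b)
    for j in range(len(b)):
        shifted[j] = b[j] + 1
    return doubled, shifted


def run_morphed(a, b):
    """Split both loops at the common trip count, fuse the common part, and
    keep the leftover iterations in remainder loops."""
    common = min(len(a), len(b))
    doubled = [0] * len(a)
    shifted = [0] * len(b)
    # Fused body: two independent memory streams in one basic block.
    for i in range(common):
        doubled[i] = 2 * a[i]
        shifted[i] = b[i] + 1
    # Remainder loops produced by the split (at most one of them runs).
    for i in range(common, len(a)):
        doubled[i] = 2 * a[i]
    for j in range(common, len(b)):
        shifted[j] = b[j] + 1
    return doubled, shifted


a = list(range(10))
b = list(range(100, 106))
```

In a compiled language the fused body gives the scheduler independent loads and stores from both arrays in the same basic block, which is the effect the abstract attributes to fusion.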

Citations: 16

Modeling and scheduling parallel data flow systems using structured systems of recurrence equations

Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004.

Pub Date : 2004-09-27 DOI: 10.1109/ASAP.2004.10032

François Charot, Madeleine Nyamsi, P. Quinton, Charles Wagner

Many multimedia and telecommunications applications are modeled as multi-rate, parallel data flow systems. We present techniques to model and schedule such applications using structured systems of recurrence equations. We show that the schedule can be obtained first by computing the period of each component of the system, then by applying structured scheduling to the entire system. This method is implemented in the MMAlpha software, and it is applied to model a WCDMA uplink receiver.

Citations: 10

Evaluating instruction set extensions for fast arithmetic on binary finite fields

Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004.

Pub Date : 2004-09-27 DOI: 10.1109/ASAP.2004.10003

A. M. Fiskiran, R. Lee

Binary finite fields GF(2^n) are very commonly used in cryptography, particularly in public-key algorithms such as elliptic curve cryptography (ECC). On word-oriented programmable processors, field elements are generally represented as polynomials with coefficients from {0, 1}. Key arithmetic operations on these polynomials, such as squaring and multiplication, are not supported by integer-oriented processor architectures. Instead, these are implemented in software, causing a very large fraction of the cryptography execution time to be dominated by a few elementary operations. For example, more than 90% of the execution time of 163-bit ECC may be consumed by two simple field operations: squaring and multiplication. A few processor architectures have been proposed recently that include instructions for binary field arithmetic. However, these have only considered processors with small wordsizes and in-order, single-issue execution. The first contribution of this paper is to validate these new arithmetic instructions for processors with wider wordsizes and multiple-issue (e.g. superscalar) execution. We also consider the effects of varying the number of functional units and load/store pipes. We demonstrate that the combination of microarchitecture and new instructions provides speedups up to 22.4x for ECC point multiplication. Second, we show that if a bit-level reverse instruction is included in the instruction set, the size of the multiplier can be reduced by half without significant performance degradation. Third, we compare the benefits of superscalar execution with wordsize scaling. The latter has been used in recent processor architectures such as PLX and PAX as a new way to extract parallelism. We show that 2x wordsize scaling provides 70% better performance than 2-way superscalar execution. Finally, we suggest a low-cost method, which we call multi-word result execution, to realize some of the benefits of wordsize scaling in existing processors with fixed wordsizes.
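
To show what "polynomial" multiplication and squaring in GF(2^n) involve when done in software, here is a small sketch. Field elements are stored as integers whose bits are the polynomial coefficients; multiplication is a carry-less (XOR-based) shift-and-add followed by reduction modulo an irreducible polynomial. The helper names are hypothetical, and the example uses the small field GF(2^8) with the polynomial x^8 + x^4 + x^3 + x + 1 purely because it is easy to check by hand; it is not the 163-bit field mentioned in the abstract.

```python
def clmul(a, b):
    """Carry-less multiplication: multiply two polynomials over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition of coefficients is XOR
        a <<= 1
        b >>= 1
    return result


def gf2_reduce(x, poly):
    """Reduce polynomial x modulo `poly` (an integer with its top bit at x^n)."""
    n = poly.bit_length() - 1
    while x.bit_length() > n:
        # Align the modulus with the leading term of x and cancel it.
        x ^= poly << (x.bit_length() - 1 - n)
    return x


def gf2_mul(a, b, poly):
    """Field multiplication in GF(2^n) defined by `poly`."""
    return gf2_reduce(clmul(a, b), poly)


def gf2_sqr(a, poly):
    """Field squaring; done here as a plain multiplication."""
    return gf2_mul(a, a, poly)


POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1
example = gf2_mul(0x57, 0x83, POLY)
```

The bit-by-bit loop in `clmul` is what a dedicated polynomial-multiply instruction replaces with a single operation per word, which is why these two operations dominate software running times.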

Citations: 19

Families of FPGA-based algorithms for approximate string matching

Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004.

Pub Date : 2004-09-27 DOI: 10.1109/ASAP.2004.10013

T. Court, M. Herbordt

Dynamic programming for approximate string matching is a large family of different algorithms, which vary significantly in purpose, complexity, and hardware utilization. Many implementations have reported impressive speed-ups, but have typically been point solutions: highly specialized and addressing only one or a few of the many possible options. The problem to be solved is creating a hardware description that implements a broad range of behavioral options without losing efficiency due to feature bloat. We report a set of three component types that address different parts of the DP string matching problem. Multiple, interchangeable implementations are available for each component type. This allows each application to choose the feature set required, then make maximum use of the FPGA fabric according to that application's specific resource requirements. Synthesis estimates show a 4:1 improvement in time-space performance, depending on the options chosen for a specific matching task.
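
For readers unfamiliar with the underlying recurrence, the sketch below shows one simple member of this family in software: edit distance computed row by row, with the substitution and insertion/deletion costs exposed as parameters to hint at how behavioural options vary. It is a generic textbook formulation, not one of the FPGA components described in the paper, and the function name and parameters are assumptions.

```python
def edit_distance(s, t, sub_cost=1, indel_cost=1):
    """Dynamic-programming edit distance between strings s and t.

    Each cell depends only on its left, upper, and upper-left neighbours,
    so only the previous row has to be kept."""
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, cs in enumerate(s, 1):
        cur = [i * indel_cost]
        for j, ct in enumerate(t, 1):
            cur.append(min(
                prev[j] + indel_cost,                        # delete from s
                cur[j - 1] + indel_cost,                     # insert into s
                prev[j - 1] + (0 if cs == ct else sub_cost)  # match or substitute
            ))
        prev = cur
    return prev[-1]


distance = edit_distance("kitten", "sitting")
```

Because every cell on an anti-diagonal of the table depends only on earlier anti-diagonals, those cells can be computed in parallel, which is what makes this recurrence a natural fit for hardware arrays.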

Citations: 52

Decimal floating-point division using Newton-Raphson iteration

Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004.

Pub Date : 2004-09-27 DOI: 10.1109/ASAP.2004.10005

Liang-Kai Wang, M. Schulte

Decreasing feature sizes allow additional functionality to be added to future microprocessors to improve the performance of important application domains. As a result of rapid growth in financial, commercial, and Internet-based applications, hardware support for decimal floating-point arithmetic is now being considered by various computer manufacturers and specifications for decimal floating-point arithmetic have been added to the draft revision of the IEEE-754 Standard for Floating-Point Arithmetic (IEEE-754R). This work presents an efficient arithmetic algorithm and hardware design for decimal floating-point division. The design uses an optimized piecewise linear approximation, a modified Newton-Raphson iteration, a specialized rounding technique, and a simplified combined decimal incrementer/decrementer. Synthesis results show that a 64-bit (16-digit) implementation of the decimal divider, which is compliant with IEEE-754R, has an estimated critical path delay of 0.69 ns when implemented using LSI Logic's 0.11 micron gflx-p standard cell library.
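
The core of division by Newton-Raphson iteration is the reciprocal recurrence x_{k+1} = x_k (2 - d x_k), which roughly doubles the number of correct digits per step. The sketch below runs the unmodified recurrence in software with Python's `decimal` module. It uses a crude constant initial guess instead of the paper's piecewise linear approximation, handles only positive divisors, and does not reproduce the paper's rounding technique, so the last digit is not guaranteed to be correctly rounded; the function names and the guard-digit count are assumptions.

```python
from decimal import Decimal, localcontext


def nr_reciprocal(d, digits=16, iterations=10):
    """Approximate 1/d for d > 0 using x <- x * (2 - m * x)."""
    with localcontext() as ctx:
        ctx.prec = digits + 10           # guard digits (assumed margin)
        d = Decimal(d)
        e = d.adjusted()                 # exponent of the leading digit
        m = d.scaleb(-e)                 # significand scaled into [1, 10)
        x = Decimal("0.1")               # crude start: 0 < x < 2/m always holds
        for _ in range(iterations):
            x = x * (2 - m * x)          # error 1 - m*x is squared each step
        return x.scaleb(-e)              # 1/d = (1/m) * 10**(-e)


def nr_divide(n, d, digits=16):
    """Approximate n/d as n * (1/d), rounded to `digits` significant digits."""
    r = nr_reciprocal(d, digits)
    with localcontext() as ctx:
        ctx.prec = digits
        return Decimal(n) * r


quotient = nr_divide(1, 3)
```

With the starting value 0.1 the initial error is at most 0.9, so ten squarings are ample for 16 digits; a better initial approximation reduces the iteration count, which is the role the abstract assigns to the optimized piecewise linear approximation.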

Citations: 42
