
Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors: Latest Papers

PAPA - packed arithmetic on a prefix adder for multimedia applications
N. Burgess
This paper introduces PAPA: packed arithmetic on a prefix adder, a new approach to parallel prefix adder design that supports a wide variety of packed arithmetic computations, including packed add and subtract with saturation, packed rounded average, and packed absolute difference. The approach consists of altering the prefix adder cell logic equations to take advantage of a previously unused "don't care" state. The principle of logical effort is employed to assess the delay of the new adder architecture by establishing the extra effort needed to select and drive the appropriate carry signal to the requisite sum sub-word. This adder will find applications in video processors and other multimedia-orientated processor chips that implement packed arithmetic operations.
DOI: https://doi.org/10.1109/ASAP.2002.1030719 · Published: 2002-07-17
Citations: 9
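The packed operations named in the PAPA abstract can be illustrated with a small reference model. This is only a software sketch of what the operations compute on unsigned sub-words, not the prefix adder design from the paper; the lane count, lane width, and all function names are assumptions made for the example.

```python
LANES = 4                 # assumed number of sub-words per packed word
BITS = 8                  # assumed width of each sub-word
MASK = (1 << BITS) - 1    # largest unsigned value a lane can hold


def unpack(word):
    """Split a packed word into unsigned sub-words, lowest lane first."""
    return [(word >> (BITS * i)) & MASK for i in range(LANES)]


def pack(lanes):
    """Pack unsigned sub-words back into one word, lowest lane first."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & MASK) << (BITS * i)
    return word


def packed_add_sat(a, b):
    """Lane-wise add; sums above the lane maximum clamp to the maximum."""
    return pack([min(x + y, MASK) for x, y in zip(unpack(a), unpack(b))])


def packed_sub_sat(a, b):
    """Lane-wise subtract; negative differences clamp to zero."""
    return pack([max(x - y, 0) for x, y in zip(unpack(a), unpack(b))])


def packed_avg_round(a, b):
    """Lane-wise average, rounding halves upward."""
    return pack([(x + y + 1) >> 1 for x, y in zip(unpack(a), unpack(b))])


def packed_abs_diff(a, b):
    """Lane-wise absolute difference."""
    return pack([abs(x - y) for x, y in zip(unpack(a), unpack(b))])
```

No carry crosses a lane boundary in any of these operations, which is the property a packed adder has to enforce in hardware.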
Matrix engine for signal processing applications using the logarithmic number system
E. Chester, J. N. Coleman
An architecture design is presented for a device based upon the logarithmic number system (LNS) that is capable of performing general matrix and complex arithmetic, with features useful for DSP system-on-chip applications. A modified LNS addition/subtraction unit is employed in multiple execution units to achieve a maximum single-precision floating-point (FP) equivalent throughput of 3.2 Gflop/s at a clock frequency of 200 MHz. Each execution unit is capable of computing functions of the form $(ab + cd)^e$ for $e \in \{\pm 0.5, \pm 1, \pm 2\}$ in a 5-stage arithmetic pipeline and returning a result every cycle, yielding a considerable per-cycle improvement over both floating- and fixed-point systems. Comparisons with existing devices and a single floating-point unit are given.
DOI: https://doi.org/10.1109/ASAP.2002.1030730 · Published: 2002-07-17
Citations: 19
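To see why a form such as (ab + cd) raised to a small power suits a logarithmic number system, the following sketch evaluates it on base-2 logarithms. It is a floating-point illustration of general LNS arithmetic restricted to positive operands, not the execution unit described in the paper; the function names are hypothetical and sign handling is deliberately left out.

```python
import math


def to_lns(x):
    """Represent a positive real by its base-2 logarithm."""
    return math.log2(x)


def from_lns(log_value):
    """Convert an LNS value back to an ordinary real."""
    return 2.0 ** log_value


def lns_mul(la, lb):
    """Multiplication becomes addition of logarithms."""
    return la + lb


def lns_add(la, lb):
    """Addition needs the nonlinear function log2(1 + 2**d) with d <= 0."""
    hi, lo = max(la, lb), min(la, lb)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


def lns_pow(la, e):
    """Raising to a power becomes multiplication of the logarithm by e."""
    return la * e


def sum_of_products_pow(a, b, c, d, e):
    """Evaluate (a*b + c*d)**e entirely on logarithms (positive inputs only)."""
    log_sum = lns_add(lns_mul(to_lns(a), to_lns(b)),
                      lns_mul(to_lns(c), to_lns(d)))
    return from_lns(lns_pow(log_sum, e))
```

For the exponents listed in the abstract, the power step is a shift or negation of the logarithm, so the one genuinely nonlinear operation in this sketch is the addition.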
Evaluating products of nonlinear functions by indirect bipartite table lookup
D. Matula, A. Fit-Florea, L. McFearin
Many function approximation procedures can obtain enhanced accuracy by an efficient table lookup of a product z=f(x)g(y). Both x and y are represented by indices of i leading bits (typically 7<i<16) for arguments normalized to [0, 1] or [1, 2]. Direct bipartite lookup employs i/2 bits each of x and y yielding roughly an i/2 bit result which can lose 2 to 3 bits of accuracy when f and g are nonlinear. Indirect bipartite lookup first generates i/2 bit interval index values for f(x) and g(y) using separate j-bits-in, i/2-bits-out tables for f(x) and g(y), where i/2<j<i and j is chosen large enough to substantially reduce the effect of nonlinearity in f(x) and g(y). The separate tables readily compensate for the high nonlinearity in f and/or g and generate interval index values representing intervals that can be tailored to minimize the maximum error of the product z=f(x)g(y) determined by an interval product table with the concatenated interval indices as the i bit input. We describe several variations in interval index generation methodology and in the design of the interval product table lookup architecture so as to obtain accuracy of i/2 bits (or better) in output in 2-3 cycles of table lookup latency.
DOI: https://doi.org/10.1109/ASAP.2002.1030710 · Published: 2002-07-17
Citations: 1
Reduced power consumption for MPEG decoding with LNS
M. Arnold
By reducing the accuracy of the logarithmic number system (LNS) it is possible to achieve lower power consumption for multimedia applications, such as MPEG, without significantly lowering the visual quality of the output. An LNS wordsize of 8 to 10 bits produces MPEG output comparable to that of a fixed-point wordsize of 14 to 16 bits. The switching activity of an LNS ALU that computes the inverse discrete cosine transform (IDCT) is one quarter that of fixed point, implying lower power consumption. By skipping inputs that are zero (which MPEG can do naturally with its run-length coding and zigzag ordering) the switching activity of LNS MPEG becomes one-tenth that of fixed point, in contrast to the minimal impact zero skipping has on fixed-point power consumption.
DOI: https://doi.org/10.1109/ASAP.2002.1030705 · Published: 2002-07-17
Citations: 33
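Zero skipping in the IDCT, as mentioned in the abstract, can be sketched with a one-dimensional transform that counts multiply-accumulate steps. The sketch uses ordinary floating point rather than LNS and says nothing about switching activity; it only shows that skipped coefficients leave the output unchanged while reducing the work. The function name and the `skip_zeros` flag are hypothetical.

```python
import math


def idct_1d(coeffs, skip_zeros=True):
    """Orthonormal 1-D inverse DCT; returns (samples, multiply-accumulate count)."""
    n = len(coeffs)
    out = [0.0] * n
    macs = 0
    for k, c in enumerate(coeffs):
        if skip_zeros and c == 0:
            continue  # a zero coefficient contributes nothing to any sample
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        for x in range(n):
            out[x] += scale * c * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
            macs += 1
    return out, macs
```

With a block that holds only a DC coefficient, for example `idct_1d([8, 0, 0, 0, 0, 0, 0, 0])`, the skipping variant performs 8 multiply-accumulates instead of 64.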
Instruction stream mutation for non-deterministic processors
J. Irwin, D. Page, N. Smart
Differential power analysis (DPA) has become a real-world threat to the security of cryptographic hardware devices such as smart-cards. By using cheap and readily available equipment, attacks can easily compromise algorithms running on these devices in a non-invasive manner. Adding non-determinism to the execution of cryptographic algorithms has been proposed as a defence against these attacks. One way of achieving this non-determinism is to introduce random additional operations to the algorithm which produce noise in the power profile of the device. We describe the addition of a specialised processor pipeline stage which increases the level of potential non-determinism and hence guards against the revelation of secret information.
DOI: https://doi.org/10.1109/ASAP.2002.1030727 · Published: 2002-07-17
Citations: 55
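The general idea of adding random extra operations, which the abstract cites as one way to obtain non-determinism, can be sketched in software. This is not the pipeline stage the paper describes; it is a toy model, with hypothetical names, in which dummy work is interleaved at random while the computed result stays the same.

```python
import random


def run_with_dummies(ops, state, rng, dummy_prob=0.5):
    """Apply ops in order, inserting a random number of dummy steps before each.

    Returns the final state and a trace of which steps were real or dummy.
    """
    trace = []
    for op in ops:
        while rng.random() < dummy_prob:
            # Dummy work: the value is computed and then discarded.
            _ = rng.getrandbits(8) ^ rng.getrandbits(8)
            trace.append("dummy")
        state = op(state)
        trace.append("real")
    return state, trace
```

Running the same operation list with different `random.Random` seeds gives the same final state but traces of different lengths, so the position in time of each real operation varies from run to run.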
New results on array contraction [memory optimization]
A. Darte, Guillaume Huard
Array contraction is an optimization that transforms array variables into scalar variables within a loop. While the opposite transformation, scalar expansion, is used for enabling parallelism (with a penalty in memory size), array contraction is used to save memory by removing temporary arrays and to increase locality. Several heuristics have already been proposed to perform array contraction through loop fusion and/or loop shifting, but thus far, the complexity of the problem was unknown, and no exact approach was available. In this paper, we prove two NP-complete results that characterize precisely the problem and we give a practical integer linear programming formulation to solve the problem exactly.
DOI: https://doi.org/10.1109/ASAP.2002.1030735 · Published: 2002-07-17
Citations: 9
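A small before-and-after pair shows what array contraction does once two loops have been fused. The example is a hand-written illustration with hypothetical function names, not output of the method in the paper.

```python
def before_contraction(a):
    """Two loops communicate through a temporary array of the same size as a."""
    n = len(a)
    tmp = [0] * n
    for i in range(n):
        tmp[i] = a[i] * 2
    out = [0] * n
    for i in range(n):
        out[i] = tmp[i] + 1
    return out


def after_contraction(a):
    """After loop fusion, tmp[i] is live only within one iteration,
    so the temporary array contracts to a scalar."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        t = a[i] * 2      # scalar replaces tmp[i]
        out[i] = t + 1
    return out
```

The contraction is legal here only because the fused loop consumes each temporary value in the same iteration that produces it; when the consumer reads `tmp[i - 1]`, for instance, loop shifting is needed first.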
Nanocomputing with delays
J. Fortes
The push to obtain smaller and denser circuits solely based on lithography and silicon technology is quickly reaching limits imposed by device physics and processing technology. It is anticipated that these limits will invalidate Moore's law and lead to unacceptable manufacturing costs, unreliable devices, and hard-to-manage power dissipation and interconnect problems. Nanotechnologies that rely on self-assembly, biomolecular components, and nanoelectronics are promising alternatives to silicon-based microelectronics. They will eventually enable levels of integration that exceed that of today's silicon-based microelectronics by three orders of magnitude. These nascent technologies present intriguing challenges and exciting opportunities to use biologically inspired solutions to address system architecture questions. This paper discusses recent results of an ongoing collaborative research effort by nanotechnologists, neurocomputing experts, and computer and circuit designers to explore novel architectures for nanoscale neuromorphic systems. The focus is placed on implementations whose behavior depends on how propagation delays affect communication among system components. The components under consideration are reminiscent of spiking neurons and, unlike in classical systems, interconnect is used for computation as well as communication purposes. Hybrid systems are also briefly discussed.
DOI: https://doi.org/10.1109/ASAP.2002.1030699 · Published: 2002-07-17
Citations: 1
Polynomial evaluation on multimedia processors
J. Villalba, G. Bandera, Mario A. González, J. Hormigo, E. Zapata
In this paper we deal with polynomial evaluation based on new processor architectures for multimedia applications. We introduce some algorithms to take advantage of the new attributes of multimedia processors, such as VLIW (very long instruction word) and SIMD (single instruction, multiple data) architectures. Algorithms that support polynomial evaluation based only on addition/shift operations, and other algorithms that use MAC (multiply-and-add) instructions, are analyzed and tailored to the subword parallelism units of the new processors. Both potential instruction-level and machine-level parallelism are fully exploited through concurrent use of all functional units.
DOI: https://doi.org/10.1109/ASAP.2002.1030725 · Published: 2002-07-17
Citations: 9
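Two textbook schemes show how polynomial evaluation maps onto multiply-and-add steps. Horner's rule uses one multiply-add per coefficient but is strictly sequential; Estrin's scheme is a well-known alternative whose multiply-adds within each level are independent. Neither is taken from the paper, which does not name its algorithms in the abstract, and the function names are hypothetical. Coefficients are given lowest degree first, and the list is assumed non-empty.

```python
def horner(coeffs, x):
    """Evaluate a polynomial with one multiply-add per coefficient."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c     # a single MAC-style step
    return acc


def estrin(coeffs, x):
    """Evaluate the same polynomial by pairing terms level by level.

    All multiply-adds inside one level are independent of each other,
    so they could be issued to parallel functional units or SIMD lanes.
    """
    terms = list(coeffs)
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0)   # pad so that terms pair up evenly
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power *= power        # x, x**2, x**4, ...
    return terms[0]
```

For integer inputs both functions return identical values; with floating point they can differ in the last bits because the operations are grouped differently.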
A mathematical model of trace cache
A. Hossain, D. Pease, James S. Burns, N. Parveen
Wide-issue superscalar processors have capabilities to execute several basic blocks in a cycle. A regular instruction cache fetch mechanism is not capable of supporting this high fetch throughput requirement. Several improvements of the fetch mechanism are currently in use. One of the most successful of these improvements is the addition of an instruction memory structure known as a trace cache. In this paper an analytical model of instruction fetch performance of a trace cache microarchitecture is presented. Parameters that affect trace cache instruction fetch performance are explored and several analytical expressions are presented. The presented model can be used to understand performance tradeoffs in trace cache design. Results from the validation of the model are presented. The instruction fetch rates predicted by the model differ by seven percent from the simulated fetch rates for SPEC2000 benchmark programs. The model is implemented in a computer program named Tulip. To show how different parameters influence performance, results from Tulip are also presented.
DOI: https://doi.org/10.1109/ASAP.2002.1030715 · Published: 2002-07-17
Citations: 8
A mid-texturing pixel rasterization pipeline architecture for 3D rendering processors
W. Park, Kilwhan Lee, Il-San Kim, T. Han, Sung-Bong Yang
As a 3D scene becomes increasingly complex and the screen resolution increases, the design of effective memory architecture is one of the most important issues for 3D rendering processors. We propose a pixel rasterization architecture, which performs a depth test operation twice, before and after texture mapping. The proposed architecture eliminates memory bandwidth waste caused by fetching unnecessary obscured texture data, by performing the depth test before texture mapping. The proposed architecture reduces the miss penalties of the pixel cache by using a pre-fetch scheme - that is, a frame memory access, due to a cache miss at the first depth test, is done simultaneously with texture mapping. The proposed pixel rasterization architecture achieves memory bandwidth effectiveness and reduces power consumption, producing high-performance gains.
DOI: https://doi.org/10.1109/ASAP.2002.1030717 · Published: 2002-07-17
Citations: 5
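The ordering described in the abstract, a depth test both before and after texture mapping, can be sketched as a sequential loop over fragments. This is a simplified software model with hypothetical names; it does not model the pixel cache or the pre-fetch scheme, and it only counts how many texture fetches the early test avoids.

```python
def rasterize(fragments, depth_buffer, color_buffer, fetch_texel):
    """Process (pixel, depth, uv) fragments; returns the number of texture fetches.

    Smaller depth means closer to the viewer.
    """
    fetches = 0
    for pixel, depth, uv in fragments:
        # First depth test: drop obscured fragments before any texture access.
        if depth >= depth_buffer[pixel]:
            continue
        texel = fetch_texel(uv)
        fetches += 1
        # Second depth test: in a real pipeline other fragments may have
        # updated the depth buffer while this one was being textured.
        # In this sequential model the test always passes.
        if depth < depth_buffer[pixel]:
            depth_buffer[pixel] = depth
            color_buffer[pixel] = texel
    return fetches
```

In the sequential model the saving comes entirely from the first test: a fragment that is already hidden never reaches `fetch_texel`.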