
Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors: Latest Publications

An empirical study of datapath, memory hierarchy, and network in SIMD array architectures
M. Herbordt, C. Weems
Although SIMD arrays have been built for 30 years, they have as a class been the subject of few empirical design studies. Using ENPASSANT, a simulation environment developed for that purpose, we analyze several aspects of SIMD array architecture with respect to a test suite of spatially mapped applications. Several surprising results are obtained. With respect to memory hierarchy, we find that adding a level of cache to current PE designs is likely to be advantageous, but that such a cache will look quite different than expected. In particular, we find that associativity has unusual significance and that performance varies inversely with block size. Router network results indicate the importance of support for local transfers, broadcast, and reduction even at the expense of arbitrary permutations. Other communication results point to the appropriate dimensionality of k-ary n-cube networks (2 or 3), and the criticality of supporting bidirectional transfers, even if the overall bandwidth remains unchanged.
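The abstract's finding on k-ary n-cube dimensionality is easier to picture with the topology in hand: each node of a k-ary n-cube has 2n neighbors, reached by stepping ±1 modulo k in one of the n dimensions. A minimal sketch (function name and layout are illustrative, not from the paper):

```python
def neighbors(node, k, n):
    """Return the 2*n neighbors of `node` (a tuple of n digits, base k)
    in a k-ary n-cube, wrapping around modulo k in each dimension."""
    result = []
    for dim in range(n):
        for step in (1, -1):
            coord = list(node)
            coord[dim] = (coord[dim] + step) % k
            result.append(tuple(coord))
    return result

# A 4-ary 2-cube is a 4x4 torus: node (0, 0) has 4 neighbors.
print(neighbors((0, 0), k=4, n=2))  # [(1, 0), (3, 0), (0, 1), (0, 3)]
```

The 2n neighbor count is why dimensionality trades off against bandwidth per link, the trade-off the paper's results (favoring n = 2 or 3) speak to.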
{"title":"An empirical study of datapath, memory hierarchy, and network in SIMD array architectures","authors":"M. Herbordt, C. Weems","doi":"10.1109/ICCD.1995.528921","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528921","url":null,"abstract":"Although SIMD arrays have been built for 30 years, they have as a class been the subject of few empirical design studies. Using ENPASSANT, a simulation environment developed for that purpose, we analyze several aspects of SIMD array architecture with respect to a test suite of spatially mapped applications. Several surprising results are obtained. With respect to memory hierarchy, we find that adding a level of cache to current PE designs is likely to be advantageous, but that such a cache will look quite different than expected. In particular, we find that associativity has unusual significance and that performance varies inversely with block size. Router network results indicate the importance of support for local transfers, broadcast, and reduction even at the expense of arbitrary permutations. Other communication results point to the appropriate dimensionality of k-ary n-cube networks (2 or 3), and the criticality of supporting bidirectional transfers, even if the overall bandwidth remains unchanged.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123139497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Performance estimation for real-time distributed embedded systems
Ti-Yen Yen, W. Wolf
Many embedded computing systems are distributed systems: communicating processes executing on several CPUs/ASICs connected by communication links. This paper describes a new, efficient analysis algorithm to derive tight bounds on the execution time required for an application task executing on a distributed system. Tight bounds are essential to cosynthesis algorithms. Our bounding algorithms are valid for a general problem model: the system can contain several tasks with different periods; each task is partitioned into a set of processes related by data dependencies; the periods and the computation times of processes are bounded but not necessarily constant. Experimental results show that our algorithm can find tight bounds in small amounts of CPU time.
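As a toy illustration of the kind of bounds the paper computes (not the authors' algorithm): when each process has bounded-but-not-constant computation time, a longest-path pass over the data-dependency DAG yields lower and upper bounds on task completion time:

```python
def time_bounds(deps, comp):
    """Lower/upper bounds on completion time of a process DAG.
    deps: process -> list of predecessor processes (assumed acyclic)
    comp: process -> (min_time, max_time)
    A crude longest-path illustration, not the paper's algorithm."""
    lo, hi = {}, {}
    remaining = dict(deps)
    while remaining:
        for p, preds in list(remaining.items()):
            if all(q in lo for q in preds):  # all predecessors resolved
                lo[p] = max((lo[q] for q in preds), default=0) + comp[p][0]
                hi[p] = max((hi[q] for q in preds), default=0) + comp[p][1]
                del remaining[p]
    return max(lo.values()), max(hi.values())

# Hypothetical three-process pipeline with bounded computation times.
deps = {"read": [], "filter": ["read"], "send": ["filter"]}
comp = {"read": (2, 3), "filter": (5, 8), "send": (1, 2)}
print(time_bounds(deps, comp))  # (8, 13)
```

Real cosynthesis needs the bounds to be tight (close to the true extremes), which is exactly what the paper's analysis targets.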
{"title":"Performance estimation for real-time distributed embedded systems","authors":"Ti-Yen Yen, W. Wolf","doi":"10.1109/ICCD.1995.528792","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528792","url":null,"abstract":"Many embedded computing systems are distributed systems: communicating processes executing on several CPUs/ASICs connected by communication links. This paper describes a new, efficient analysis algorithm to derive tight bounds on the execution time required for an application task executing on a distributed system. Tight bounds are essential to cosynthesis algorithms. Our bounding algorithms are valid for a general problem model: the system can contain several tasks with different periods; each task is partitioned into a set of processes related by data dependencies; the periods and the computation times of processes are bounded but not necessarily constant. Experimental results show that our algorithm can find tight bounds in small amounts of CPU time.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"61 13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114954478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 139
Write buffer design for cache-coherent shared-memory multiprocessors
F. Mounes-Toussi, D. Lilja
We evaluate the performance impact of two different write-buffer configurations (one word per buffer entry and one block per buffer entry) and two different write policies (write-through and write-back), when using the partial block invalidation coherence mechanism in a shared-memory multiprocessor. Using an execution-driven simulator, we find that the one word per entry buffer configuration with a write-back policy is preferred for small write-buffer sizes when both buffers have an equal number of data words, and when they have equal hardware cost. Furthermore, when partial block invalidation is supported, we find that a write-through policy is preferred over a write-back policy due to its simpler cache hit detection mechanism, its elimination of write-back transactions, and its competitive-performance when the write-buffer is relatively large.
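The difference between the two buffer configurations the paper compares can be illustrated by counting entries: one entry per word versus one entry per block, with writes to the same block merging into a single entry. A hedged sketch (addresses and sizes are made up):

```python
def buffer_entries(writes, block_words, per_block):
    """Entries consumed by a stream of word-address writes under the
    two configurations the paper compares: one entry per word, or one
    entry per block (writes to the same block merge)."""
    if not per_block:
        return len(writes)
    return len({addr // block_words for addr in writes})

# Six word writes covering two 4-word blocks.
stream = [0, 1, 2, 3, 16, 17]
print(buffer_entries(stream, 4, per_block=False))  # 6
print(buffer_entries(stream, 4, per_block=True))   # 2
```

The block-per-entry form holds more data per entry, which is why the paper's comparison is only meaningful at equal data capacity or equal hardware cost.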
{"title":"Write buffer design for cache-coherent shared-memory multiprocessors","authors":"F. Mounes-Toussi, D. Lilja","doi":"10.1109/ICCD.1995.528915","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528915","url":null,"abstract":"We evaluate the performance impact of two different write-buffer configurations (one word per buffer entry and one block per buffer entry) and two different write policies (write-through and write-back), when using the partial block invalidation coherence mechanism in a shared-memory multiprocessor. Using an execution-driven simulator, we find that the one word per entry buffer configuration with a write-back policy is preferred for small write-buffer sizes when both buffers have an equal number of data words, and when they have equal hardware cost. Furthermore, when partial block invalidation is supported, we find that a write-through policy is preferred over a write-back policy due to its simpler cache hit detection mechanism, its elimination of write-back transactions, and its competitive-performance when the write-buffer is relatively large.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115072042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Minimal self-correcting shift counters
A.M. Tokarnia, A. Peterson
In some applications of shift counters, self initialization is an advantage. It eliminates the need for complex initialization and guarantees the return to the original state sequence after a temporary failure. The low operating frequencies and large areas of the available self correcting shift counters, however, impose severe limitations on their use. This poor performance is partially due to a widely used design method, which consists of modifying the state diagram of a counter with the desired modulus until a single cycle is left. Due to the additional hardware required to change state transitions, the final circuit tends to be slow and large. The paper presents a technique for determining self correcting shift counters by selecting the feedback functions from a large set of functions. The set is searched for functions satisfying a minimization criterion. Self correcting shift counters with up to 10 stages have been determined. These counters are faster and smaller than the self correcting shift counters available from the literature. A table of self correcting shift counters with 6 stages is included in the paper.
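The search the paper describes can be mimicked in miniature: treat an n-stage shift counter as a state machine whose next state is a shift plus a feedback bit, and check whether every one of the 2^n states flows into the single main cycle. A small sketch (feedback functions chosen for illustration only):

```python
from itertools import product

def is_self_correcting(n, feedback):
    """An n-stage shift counter shifts its state and inserts
    feedback(state) at the end. It is self-correcting if every one of
    the 2**n states eventually reaches the single main cycle."""
    def step(s):  # s is a tuple of n bits
        return s[1:] + (feedback(s),)
    # Find the cycle reached from the all-zeros (power-up) state.
    s, seen = (0,) * n, set()
    while s not in seen:
        seen.add(s)
        s = step(s)
    cycle, t = {s}, step(s)  # walk the cycle itself
    while t != s:
        cycle.add(t)
        t = step(t)
    # Every state must flow into that cycle within 2**n steps.
    for start in product((0, 1), repeat=n):
        s, steps = start, 0
        while s not in cycle and steps <= 2 ** n:
            s, steps = step(s), steps + 1
        if s not in cycle:
            return False
    return True

# A plain ring-counter feedback (recirculate the first bit) is NOT
# self-correcting: disjoint cycles of rotations exist. A NAND feedback
# on the first two stages, by contrast, pulls every state into one
# 5-state cycle.
print(is_self_correcting(3, lambda s: s[0]))            # False
print(is_self_correcting(3, lambda s: 1 - (s[0] & s[1])))  # True
```

The paper's contribution is finding such feedback functions that are also minimal in delay and area; this sketch only checks the self-correction property.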
{"title":"Minimal self-correcting shift counters","authors":"A.M. Tokarnia, A. Peterson","doi":"10.1109/ICCD.1995.528925","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528925","url":null,"abstract":"In some applications of shift counters, self initialization is an advantage. It eliminates the need for complex initialization and guarantees the return to the original state sequence after a temporary failure. The low operating frequencies and large areas of the available self correcting shift counters, however, impose severe limitations to their use. This poor performance is partially due to a widely used design method. It consists of modifying the state diagram of a counter with the desired modulus until a single cycle is left. Due to the additional hardware required to change state transitions, the final circuit tends to be slow and large. The paper presents a technique for determining self correcting shift counters by selecting the feedback functions from a large set of functions. The set is searched for functions satisfying a minimization criterion. Self correcting shift counters with up to 10 stages have been determined. These counters are faster and smaller than the self correcting shift counters available from the literature. A table of self correcting shift counters with 6 stages is included in the paper.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122634417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Logic synthesis for a single large look-up table
R. Murgai, M. Fujita, F. Hirose
Logic synthesis for look-up tables (LUTs) has received much attention in the past few years, since Xilinx introduced its LUT-based field-programmable gate array (FPGA) architectures. An m-input LUT can implement any Boolean function of up to m inputs. So the goal of synthesis for such architectures has been to synthesize a circuit in which each function can be implemented by one m-LUT such that either the total number of functions or the number of levels of the circuit is minimized. In this work, we focus on a different though related problem: synthesize the given circuit on a single memory or LUT L, which has a capacity of M bits. In addition to satisfying the memory constraint M, we also wish to minimize the total number of functions to be implemented. The main motivation for the problem comes from the problem of minimizing the simulation time on a hardware accelerator for logic simulation. This accelerator uses memory as a logic primitive. In fact, the problem is also relevant in the context of compile-code or software simulation. Another situation where the problem arises is in synthesis for the FPGA architectures being proposed that have on-chip memory for storing programs and data. The unused memory locations can be used to store logic functions. We show that the existing LUT synthesis methods are inadequate to solve this problem. We propose techniques to solve the problem and present experimental evidence to demonstrate their effectiveness.
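The premise that an m-input LUT implements any Boolean function of up to m inputs is just truth-table storage: pack the 2^m output bits into memory and index by the inputs. A hedged sketch of that primitive (not the paper's synthesis method):

```python
def make_lut(func, m):
    """Build the 2**m-bit truth table of an m-input Boolean function,
    packed into an integer, and return an evaluator that just indexes
    it -- the LUT-as-memory primitive the paper builds on."""
    table = 0
    for addr in range(2 ** m):
        bits = [(addr >> i) & 1 for i in range(m)]  # input i = addr bit i
        if func(*bits):
            table |= 1 << addr
    return lambda *bits: (table >> sum(b << i for i, b in enumerate(bits))) & 1

# A 3-input LUT holding a full-adder carry-out: majority(a, b, cin).
carry = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)
print(carry(1, 1, 0), carry(1, 0, 0))  # 1 0
```

An M-bit memory can hold many such tables at different base addresses, which is the resource the paper's memory constraint M budgets across the functions of the circuit.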
{"title":"Logic synthesis for a single large look-up table","authors":"R. Murgai, M. Fujita, F. Hirose","doi":"10.1109/ICCD.1995.528842","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528842","url":null,"abstract":"Logic synthesis for look-up tables (LUTs) has received much attention in the past few years, since Xilinx introduced its LUT-based field-programmable gate array (FPGA) architectures. An m-input LUT can implement any Boolean function of up to m inputs. So the goal of synthesis for such architectures has been to synthesize a circuit in which each function can be implemented by one m-LUT such that either the total number of functions or the number of levels of the circuit is minimized. In this work, we focus on a different though related problem: synthesize the given circuit on a single memory or LUT L, which has a capacity of M bits. In addition to satisfying the memory constraint M, we also wish to minimize the total number of functions to be implemented. The main motivation for the problem comes from the problem of minimizing the simulation time on a hardware accelerator for logic simulation. This accelerator uses memory as a logic primitive. In fact, the problem is also relevant in the context of compile-code or software simulation. Another situation where the problem arises is in synthesis for the FPGA architectures being proposed that have on-chip memory for storing programs and data. The unused memory locations can be used to store logic functions. We show that the existing LUT synthesis methods are inadequate to solve this problem. We propose techniques to solve the problem and present experimental evidence to demonstrate their effectiveness.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. 
VLSI in Computers and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128516578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
Interrupt-based hardware support for profiling memory system performance
A. Goldberg, J. Trotter
Fueled by higher clock rates and superscalar technologies, growth in processor speed continues to outpace improvement in memory system performance. Reflecting this trend, architects are developing increasingly complex memory hierarchies to mask the speed gap, compiler writers are adding locality enhancing transformations to better utilize complex memory hierarchies, and applications programmers are recoding their algorithms to exploit memory systems. All of these groups need empirical data on memory system behavior to guide their optimizations. This paper describes how to combine simple hardware support and sampling techniques to obtain such data without appreciably perturbing system performance. The idea is implemented in the Mprof prototype that profiles data stall cycles, first level cache misses, and second level misses on the Sun Sparc 10/41.
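The sampling idea is statistical: rather than counting every event, count only a periodic sample and scale up, trading a little accuracy for far fewer interrupts. A toy model of that trade-off (unrelated to Mprof's actual hardware interface):

```python
import random

def sampled_estimate(events, period):
    """Count only every `period`-th position of an event trace, as a
    sampling interrupt would, then scale up. Far fewer interrupts than
    exact counting, at the cost of statistical error."""
    hits = sum(1 for i, miss in enumerate(events) if miss and i % period == 0)
    return hits * period

random.seed(0)
# Synthetic trace: each access misses the cache with probability ~0.1.
trace = [random.random() < 0.1 for _ in range(100_000)]
true_count = sum(trace)
est = sampled_estimate(trace, period=100)
print(true_count, est)  # estimate lands within a few percent of the truth
```

With 1,000 sampled positions the relative error is on the order of 10%, usually plenty for guiding locality optimizations, while the perturbation of the running system shrinks by the sampling factor.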
{"title":"Interrupt-based hardware support for profiling memory system performance","authors":"A. Goldberg, J. Trotter","doi":"10.1109/ICCD.1995.528917","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528917","url":null,"abstract":"Fueled by higher clock rates and superscalar technologies, growth in processor speed continues to outpace improvement in memory system performance. Reflecting this trend, architects are developing increasingly complex memory hierarchies to mask the speed gap, compiler writers are adding locality enhancing transformations to better utilize complex memory hierarchies, and applications programmers are recoding their algorithms to exploit memory systems. All of these groups need empirical data on memory system behavior to guide their optimizations. This paper describes how to combine simple hardware support and sampling techniques to obtain such data without appreciably perturbing system performance. The idea is implemented in the Mprof prototype that profiles data stall cycles, first level cache misses, and second level misses on the Sun Sparc 10/41.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128606823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
DART: delay and routability driven technology mapping for LUT based FPGAs
A. Lu, E. Dagless, J. Saul
A two-phased approach for routability-directed delay-optimal mapping of LUT based FPGAs is presented, based on the results of stochastic routability analysis. First, delay-optimal mapping is performed which simultaneously minimizes area and delay. Then, the mapped circuits are restructured to alleviate potential routing congestion. Experimental results indicate that the first phase creates designs which require 17% fewer levels and 40% fewer LUTs than MIS-pga (delay), 11% fewer levels and 37% fewer LUTs than FlowMap-r, and 5% fewer levels and 39% fewer LUTs than TechMap-D. The success of the second phase is confirmed by running a vendor's layout tool, APR: when placed and routed, the restructured circuits are more routable and have smaller final delays than those produced by other mappers.
{"title":"DART: delay and routability driven technology mapping for LUT based FPGAs","authors":"A. Lu, E. Dagless, J. Saul","doi":"10.1109/ICCD.1995.528841","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528841","url":null,"abstract":"A two-phased approach for routability directed delay-optimal mapping of LUT based FPGAs is presented based on the results of stochastic routability analysis. First, delay-optimal mapping is performed which simultaneously minimizes area and delay. Then, the mapped circuits are restructured to alleviate the potential routing congestions. Experimental results indicate that the first phase creates designs which require 17% fewer levels and 40% fewer LUTs than MIS-pga (delay), 11% fewer levels and 37% fewer LUTs than FlowMap-r, and 5% fewer levels and 39% fewer LUTs than TechMap-D. The success of the second phase is confirmed by running a vendor's layout tool APR. It is observed that they are more routable and have less final delays than those produced by other mappers if they are placed and routed.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124960093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Multiprocessor design verification for the PowerPC 620 microprocessor
C. Montemayor, M. Sullivan, Jen-Tien Yen, P. Wilson, R. Evers
Multiprocessor design verification for the PowerPC 620 microprocessor was challenging due to the 620 bus protocol complexity, the highly concurrent bus and level 2 (L2) cache interfaces, and the extensive system configurability. In order to verify this functionality, a combination of random and deterministic approaches was used. The Random Test Program Generator (RTPG) and the newly developed Stochastic Concurrent Program Generator (SCPG) tools were used for random verification. On the deterministic front, testcases in C were written to verify specific scenarios. In creating SCPG, we dealt with the design complexity and frequent design changes by abstracting areas of concern as simple languages, writing tools to generate tests, and executing these in the standard verification environment. The added value of these tests is that they exercise true data sharing among processors, are self-checking, and resemble commercial multiprocessor code.
{"title":"Multiprocessor design verification for the PowerPC 620 microprocessor","authors":"C. Montemayor, M. Sullivan, Jen-Tien Yen, P. Wilson, R. Evers","doi":"10.1109/ICCD.1995.528809","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528809","url":null,"abstract":"Multiprocessor design verification for the PowerPC 620 microprocessor was challenging due to the 620 Bus protocol complexity. The highly concurrent bus and level 2 (LS) cache interfaces, and the extensive system configurability. In order to verify this functionality, a combination of random and deterministic approaches were used. The Random Test Program Generator (RTPG) and the newly developed Stochastic Concurrent Program Generator (SCPG) tools were used for random verification. In the deterministic front, testcases in C were written to verify specific scenarios. In creating SCPG, we dealt with the design complexity and frequent design changes by abstracting areas of concern as simple languages, writing tools to generate tests, and executing these in the standard verification environment. The added value of these tests is that they exercise true data sharing among processors, are self-checking and resemble commercial multiprocessor code.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"46 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114116798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
The PowerPC 603e microprocessor: an enhanced, low-power, superscalar microprocessor
C. Montemayor, M. Sullivan, Jen-Tien Yen, P. Wilson, R. Evers, K. R. Kishore
The PowerPC 603e microprocessor is a high performance, low cost, low power microprocessor designed for use in portable computers. The 603e is an enhanced version of the PowerPC 603 microprocessor and extends the performance range of the PowerPC microprocessor family of portable products. The enhancements include increasing the frequency to 100 MHz, doubling the on-chip instruction and data caches to 16 Kbytes each, increasing the cache associativity to 4-way set-associative, adding an extra integer unit, and increasing the throughput of stores and misaligned accesses. Three new bus modes are added to allow for more flexibility in system design. The estimated performance of the 603e at 100 MHz is 120 SPECint92 and 105 SPECfp92. The 603e is fabricated in the same 3.3 volt, 0.5 micron, four-level metal technology as the 603 and contains 2.6 million transistors. The die size is 98 mm². The typical power consumption of the 603e at 100 MHz is 3 watts. Like the 603, the 603e provides three software controllable power-down modes to further extend power saving capability.
{"title":"The PowerPC 603e microprocessor: an enhanced, low-power, superscalar microprocessor","authors":"C. Montemayor, M. Sullivan, Jen-Tien Yen, P. Wilson, R. Evers, K. R. Kishore","doi":"10.1109/ICCD.1995.528810","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528810","url":null,"abstract":"The PowerPC 603e microprocessor is a high performance, low cost, low power microprocessor designed for use in portable computers. The 603e is an enhanced version of the PowerPC 603 microprocessor and extends the performance range of the PowerPC microprocessor family of portable products. The enhancements include increasing the frequency to 100 MHZ doubling the on-chip instruction and data caches to 16 Kbytes each, increasing the cache associativity to 4-way set-associative, adding an extra integer unit, and increasing the throughput of stores and misaligned accesses. Three new bus modes are added to allow for more flexibility in system design. The estimated performance of the 603e at 100 MHz is 120 SPECint92 and 105 SPECfp92. The 603e is fabricated in the same 3.3 volt, 0.5 micron, four-level metal technology as the 603 and contains 2.6 million transistors. The die size is 98 mm/sup 2/. The typical power consumption of the 603e at 100 MHz is 3 watts. Like the 603, the 603e provides three software controllable power-down modes to further extend power saving capability.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. 
VLSI in Computers and Processors","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125252690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
SSM-MP: more scalability in shared-memory multi-processor
Shigeaki Iwasa, Shu Shing, Hisashi Mogi, Hiroshi Nozuwe, Hiroo Hayashi, Osamu Wakamori, Takashi Ohmizo, Kuninori Tanaka, H. Sakai, M. Saito
Bus-based shared-memory multi-processors (SM-MP) have successfully been used commercially, since implementation requires no drastic changes to the programming paradigm. In this paper we propose the memory structure called SSM-MP (scalable shared-memory multi-processors), aimed at shortening the cache refill latency and relaxing the bus bottleneck problem. In this machine, main memory consists of local memories dedicated to each of the processors and something called MTag. MTag is a small piece of hardware that filters out bus traffic headed to the system bus and maintains cache coherency. A popular UNIX (SVR4 ES/MP) was ported. Original OS code works well due to its natural locality. Furthermore, by allocating tasks to the local memory, we were able to reduce the system bus traffic to nearly a quarter. SSM-MP is an effective approach in building a multi-processor system with a medium number (4-32) of processors.
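How MTag can filter traffic is easy to sketch: if it tracks which local-memory blocks might be cached remotely, a local transaction needs the system bus only when a remote copy may exist. The following is my own toy model of that snoop-filter idea, not the paper's design:

```python
class MTag:
    """Toy snoop filter in the spirit of SSM-MP's MTag (details are an
    assumption, not the paper's design): track which local-memory
    blocks may be cached by a remote node, and forward a transaction
    to the system bus only when a remote copy may exist."""
    def __init__(self):
        self.remote = set()  # blocks possibly cached remotely

    def remote_fetch(self, block):
        """A remote node read this block from our local memory."""
        self.remote.add(block)

    def needs_system_bus(self, block):
        """A local write: must we snoop the system bus for this block?"""
        return block in self.remote

    def invalidated(self, block):
        """All remote copies of this block were invalidated."""
        self.remote.discard(block)

m = MTag()
m.remote_fetch(0x40)
print(m.needs_system_bus(0x40), m.needs_system_bus(0x80))  # True False
```

Most accesses touch only task-local data, so most transactions stay off the system bus, which matches the paper's observation that bus traffic drops to roughly a quarter when tasks are allocated to local memory.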
{"title":"SSM-MP: more scalability in shared-memory multi-processor","authors":"Shigeaki Iwasa, Shu Shing, Hisashi Mogi, Hiroshi Nozuwe, Hiroo Hayashi, Osamu Wakamori, Takashi Ohmizo, Kuninori Tanaka, H. Sakai, M. Saito","doi":"10.1109/ICCD.1995.528923","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528923","url":null,"abstract":"Bus-based shared-memory multi-processors (SM-MP) have successfully been used commercially, since implementation requires no drastic changes to the programming paradigm. In this paper we propose the memory structure called SSM-MP (Scalable shared-memory multi-processors), aimed to shorten the cache refill latency and to relax the bus bottle neck problem. In this machine, main memory consists of local memories dedicated to each of the processors and something called MTag. MTag is a small piece of hardware that filters out bus traffic headed to the system bus and maintains cache coherency. A popular UNIX (SVR4 ES/MP) was ported. Original OS code works well due to its natural locality. Furthermore, by allocating tasks to the local memory, we were able to reduce the system bus traffic to nearly a quarter. SSM-MP is an effective approach in building a multi-processor system with a medium number (4-32) of processors.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132507700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1