
ASPLOS III Latest Publications

The design of nectar: a network backplane for heterogeneous multicomputers
Pub Date : 1989-04-01 DOI: 10.1145/70082.68202
E. Arnould, F. Bitz, Eric C. Cooper, H. T. Kung, Robert D. Sansom, P. Steenkiste
Nectar is a “network backplane” for use in heterogeneous multicomputers. The initial system consists of a star-shaped fiber-optic network with an aggregate bandwidth of 1.6 gigabits/second and a switching latency of 700 nanoseconds. The system can be scaled up by connecting hundreds of these networks together. The Nectar architecture provides a flexible way to handle heterogeneity and task-level parallelism. A wide variety of machines can be connected as Nectar nodes and the Nectar system software allows applications to communicate at a high level. Protocol processing is off-loaded to powerful communication processors so that nodes do not have to support a suite of network protocols. We have designed and built a prototype Nectar system that has been operational since November 1988. This paper presents the motivation and goals for Nectar and describes its hardware and software. The presentation emphasizes how the goals influenced the design decisions and led to the novel aspects of Nectar.
Citations: 167
Program optimization for instruction caches
Pub Date : 1989-04-01 DOI: 10.1145/70082.68200
S. McFarling
This paper presents an optimization algorithm for reducing instruction cache misses. The algorithm uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and full knowledge of the future. For best results, the cache should have a mechanism for excluding certain instructions designated by the compiler. This paper first presents a reduced form of the algorithm. This form is shown to produce an optimal miss rate for programs without conditionals and with a tree call graph, assuming basic blocks can be reordered at will. If conditionals are allowed, but there are no loops within conditionals, the algorithm does as well as an optimal cache for the worst-case execution of the program consistent with the profile information. Next, the algorithm is extended with heuristics for general programs. The effectiveness of these heuristics is demonstrated with empirical results for a set of 10 programs at various cache sizes. The improvement depends on cache size. For a 512-word cache, miss rates for a direct-mapped instruction cache are halved. For an 8K-word cache, miss rates fall by over 75%. Over a wide range of cache sizes the algorithm is as effective as increasing the cache size by a factor of three. For 512 words, the algorithm generates only 32% more misses than an optimal cache. Optimized programs on a direct-mapped cache have lower miss rates than unoptimized programs on set-associative caches of the same size.
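A toy, hypothetical illustration (not McFarling's actual algorithm) of the effect this abstract describes: in a direct-mapped instruction cache, two code regions that alternate inside a loop thrash if they map to the same cache lines, but hit after warm-up if the code is repositioned onto disjoint lines.

```python
CACHE_LINES = 4  # direct-mapped, one instruction word per line

def misses(trace, cache_lines=CACHE_LINES):
    """Count misses for an instruction-address trace in a direct-mapped cache."""
    cache = [None] * cache_lines
    count = 0
    for addr in trace:
        line = addr % cache_lines
        if cache[line] != addr:   # tag mismatch: miss, fill the line
            cache[line] = addr
            count += 1
    return count

# Bad layout: routines at addresses {0,1} and {4,5} both map to lines 0-1.
thrashing = [0, 1, 4, 5] * 10
# Repositioned layout: the second routine moved to addresses {2,3}, so the
# loop's working set fits the cache with no conflicts after the cold misses.
repositioned = [0, 1, 2, 3] * 10

print(misses(thrashing), misses(repositioned))  # 40 vs 4
```

The repositioned layout keeps only the four cold misses; the conflicting layout misses on every access, which is the thrashing behavior profile-guided placement is meant to avoid.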
Citations: 257
Reference history, page size, and migration daemons in local/remote architectures
Pub Date : 1989-04-01 DOI: 10.1145/70082.68192
M. A. Holliday
We address the problem of paged main memory management in the local/remote architecture subclass of shared memory multiprocessors. We consider the case where the operating system has primary responsibility and uses page migration as its main tool. We identify some of the key issues with respect to architectural support (reference history maintenance, and page size) and operating system mechanism (duration between daemon passes, and number of migration daemons). The experiments were conducted using software-implemented page tables on a 32-node BBN Butterfly Plus™. Several numerical programs with both synthetic and real data were used as the workload. The primary conclusion is that for the cases considered, migration was at best marginally effective. On the other hand, practical migration mechanisms were robust and never significantly degraded performance. The specific results include: 1) referenced bits with aging can closely approximate Usage fields, 2) larger page sizes are beneficial except when the page is large enough to include the locality sets of two processes, and 3) multiple migration daemons can be useful. Only small regions of the space of architectural, system, and workload parameters were explored. Further investigation of other parameter combinations is clearly warranted.
Citations: 32
An analysis of 8086 instruction set usage in MS DOS programs
Pub Date : 1989-04-01 DOI: 10.1145/70082.68197
T. L. Adams, R. E. Zimmerman
1. Introduction. An architectural evaluation must be based upon real programs in an actual operating environment. The ubiquitous IBM personal computer running MS DOS represents an excellent test bed for architectural evaluation of Intel 8086 systems. There are many programs and tools available to evaluate the performance of IBM Personal Computers and compatibles; these evaluation tools are intended to relate the performance of one machine to another. Very little data is available on dynamic instruction traces in systems using an 8086. This paper reports on dynamic traces of 8086/88 programs obtained using software tracing tools (described below). The objective of this work is to analyze instruction usage and addressing modes used in actual software. The system used to obtain the dynamic instruction frequencies was a compatible running MS DOS 3.1 with a Softpatch BIOS. To illustrate the RISC argument that only a few instruction types are sufficient, the 8086 results are compared with similar studies on the Motorola 68000 and the Digital Equipment VAX-11.
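A minimal sketch of the kind of dynamic analysis the abstract describes: tallying opcode frequencies over an instruction trace. The trace below is a made-up fragment for illustration, not data from the paper.

```python
from collections import Counter

# Hypothetical dynamic trace: the opcodes executed, in order.
trace = ["mov", "mov", "cmp", "jnz", "mov", "add", "mov", "cmp", "jnz"]

freq = Counter(trace)  # dynamic count per opcode
percent = {op: round(100 * n / len(trace), 1) for op, n in freq.most_common()}
print(percent)  # a few opcodes dominate the dynamic count
```

Even on this tiny fragment, a handful of instruction types account for most executions, which is the shape of result such studies use to support the RISC argument.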
Citations: 36
Using registers to optimize cross-domain call performance
Pub Date : 1989-04-01 DOI: 10.1145/70082.68201
P. Karger
This paper describes a new technique to improve the performance of cross-domain calls and returns in a capability-based computer system. Using register optimization information obtained from the compiler, a trusted linker can minimize the number of registers that must be saved, restored, or cleared when changing from one protection domain to another. The size of the performance gain depends on the level of trust between the calling and called protection domains. The paper presents alternate implementations for an extended VAX architecture and for a RISC architecture and reports performance measurements done on a re-microprogrammed VAX-11/730 processor.
Citations: 26
The effect of sharing on the cache and bus performance of parallel programs
Pub Date : 1989-04-01 DOI: 10.1145/70082.68206
S. Eggers, R. Katz
Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. In this study, we use traces of parallel programs to evaluate the cache and bus performance of shared memory multiprocessors, in which coherency is maintained by a write-invalidate protocol. In particular, we analyze the effect of sharing overhead on cache miss ratio and bus utilization. Our studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of these metrics increases proportionally with both cache and block size, and for some cache configurations determines both their magnitude and trend. The amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit good per-processor locality perform better than those with fine-grain sharing. This suggests that parallel software writers and better compiler technology can improve program performance through better memory organization of shared data.
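A minimal, hypothetical model (names ours, not the paper's) of the sharing overhead analyzed above: under a write-invalidate protocol, a write by one processor removes the block from every other private cache, so fine-grain sharing turns almost every access into an invalidation miss, while per-processor locality incurs only cold misses.

```python
def run(trace, n_cpus=2):
    """trace: list of (cpu, addr, is_write); returns the total miss count."""
    caches = [set() for _ in range(n_cpus)]  # per-CPU set of cached addresses
    miss_count = 0
    for cpu, addr, is_write in trace:
        if addr not in caches[cpu]:
            miss_count += 1
        caches[cpu].add(addr)
        if is_write:  # write-invalidate: purge every other processor's copy
            for other in range(n_cpus):
                if other != cpu:
                    caches[other].discard(addr)
    return miss_count

# Fine-grain sharing: both CPUs write the same word turn by turn.
ping_pong = [(i % 2, 0, True) for i in range(20)]
# Per-processor locality: each CPU writes only its own word.
local = [(i % 2, i % 2, True) for i in range(20)]

print(run(ping_pong), run(local))  # 20 vs 2
```

The ping-pong pattern misses on every one of its 20 accesses; the local pattern takes only the two cold misses, mirroring the abstract's contrast between fine-grain sharing and good per-processor locality.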
Citations: 155
Data buffering: run-time versus compile-time support
Pub Date : 1989-04-01 DOI: 10.1145/70082.68196
Hans M. Mulder
Data-dependency, branch, and memory-access penalties are the main constraints on the performance of high-speed microprocessors. The memory-access penalties concern both penalties imposed by external memory (e.g. cache) and by underutilization of the local processor memory (e.g. registers). This paper focuses solely on methods of increasing the utilization of data memory local to the processor (registers or register-oriented buffers). A utilization increase of local processor memory is possible by means of compile-time software, run-time hardware, or a combination of both. This paper looks at data buffers which perform solely because of compile-time software (single register sets); those which operate mainly through hardware but with possible software assistance (multiple register sets); and those intended to operate transparently with main memory, implying no software assistance whatsoever (stack buffers). This paper shows that hardware buffering schemes cannot replace compile-time effort, but at most can reduce the complexity of this effort. It shows the utility increase of applying register allocation to multiple register sets. The paper also shows a potential utility decrease inherent to stack buffers. The observation that a single register set, allocated by means of interprocedural allocation, performs competitively with both multiple register sets and stack buffers emphasizes the significance of this conclusion.
Citations: 4
Failure correction techniques for large disk arrays
Pub Date : 1989-04-01 DOI: 10.1145/70082.68194
Garth A. Gibson, L. Hellerstein, R. Karp, R. Katz, D. Patterson
The ever increasing need for I/O bandwidth will be met with ever larger arrays of disks. These arrays require redundancy to protect against data loss. This paper examines alternative choices for encodings, or codes, that reliably store information in disk arrays. Codes are selected to maximize mean time to data loss or minimize disks containing redundant data, but all are constrained to minimize the performance penalties associated with updating information or recovering from catastrophic disk failures. We also present codes that give highly reliable data storage with low redundant-data overhead for arrays of 1000 information disks.
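A sketch of the simplest encoding in this design space, a single XOR parity disk in the RAID style (our illustration, not one of the paper's codes): the XOR of the surviving data disks with the parity disk reconstructs any one failed data disk.

```python
from functools import reduce

def xor_parity(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"disk0", b"disk1", b"disk2"]   # equal-size data blocks
parity = xor_parity(data)               # stored on the redundant disk

# Disk 1 fails: XOR the survivors with the parity block to recover it.
recovered = xor_parity([data[0], data[2], parity])
print(recovered)  # b'disk1'
```

The recovery works because XOR is its own inverse: parity = d0 ^ d1 ^ d2, so d0 ^ d2 ^ parity = d1. A single parity disk tolerates one failure; the codes examined in the paper trade more redundancy for protection against multiple failures.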
Citations: 76
A unified vector/scalar floating-point architecture
Pub Date : 1989-04-01 DOI: 10.1145/70082.68195
N. Jouppi, J. Bertoni, D. W. Wall
In this paper we present a unified approach to vector and scalar computation, using a single register file for both scalar operands and vector elements. The goal of this architecture is to yield improved scalar performance while broadening the range of vectorizable applications. For example, reduction operations and recurrences can be expressed in vector form in this architecture. This approach results in greater overall performance for most applications than does the approach of emphasizing peak vector performance. The hardware required to support the enhanced vector capability is insignificant, but allows the execution of two operations per cycle for vectorized code. Moreover, the size of the unified vector/scalar register file required for peak performance is an order of magnitude smaller than traditional vector register files, allowing efficient on-chip VLSI implementation. The results of simulations of the Livermore Loops and Linpack using this architecture are presented.
Citations: 67
Translation lookaside buffer consistency: a software approach
Pub Date : 1989-04-01 DOI: 10.1145/70082.68193
David L. Black, R. Rashid, D. Golub, C. R. Hill, R. Baron
We discuss the translation lookaside buffer (TLB) consistency problem for multiprocessors, and introduce the Mach shootdown algorithm for maintaining TLB consistency in software. This algorithm has been implemented on several multiprocessors, and is in regular production use. Performance evaluations establish the basic costs of the algorithm and show that it has minimal impact on application performance. As a result, TLB consistency does not pose an insurmountable obstacle to multiprocessors with several hundred processors. We also discuss hardware support options for TLB consistency ranging from a minor interrupt structure modification to complete hardware implementations. Features are identified in current hardware that compound the TLB consistency problem; removal or correction of these features can simplify and/or reduce the overhead of maintaining TLB consistency in software.
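A single-threaded sketch (hypothetical structure and names, not Mach's actual code) of the shootdown invariant the abstract describes: the initiating processor must not modify a mapping until every processor that could have cached the translation has flushed its TLB entry and acknowledged.

```python
class Processor:
    def __init__(self):
        self.tlb = {}                  # virtual page -> physical frame

    def on_shootdown_ipi(self, vpage):
        self.tlb.pop(vpage, None)      # flush the stale translation
        return True                    # acknowledge to the initiator

def shootdown(initiator, others, vpage):
    """Invalidate vpage everywhere before the page table entry is changed."""
    initiator.tlb.pop(vpage, None)
    acks = [cpu.on_shootdown_ipi(vpage) for cpu in others]  # simulated IPIs
    assert all(acks)   # only after all acks is the mapping safe to modify

cpus = [Processor() for _ in range(4)]
for cpu in cpus:
    cpu.tlb[0x10] = 0x99               # every CPU caches the same mapping
shootdown(cpus[0], cpus[1:], 0x10)
assert not any(0x10 in cpu.tlb for cpu in cpus)
```

In a real multiprocessor the acknowledgements arrive asynchronously and the initiator spins until all responders have checked in; the cost of that synchronization is what the paper's performance evaluation measures.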
{"title":"Translation lookaside buffer consistency: a software approach","authors":"David L. Black, R. Rashid, D. Golub, C. R. Hill, R. Baron","doi":"10.1145/70082.68193","DOIUrl":"https://doi.org/10.1145/70082.68193","url":null,"abstract":"We discuss the translation lookaside buffer (TLB) consistency problem for multiprocessors, and introduce the Mach shootdown algorithm for maintaining TLB consistency in software. This algorithm has been implemented on several multiprocessors, and is in regular production use. Performance evaluations establish the basic costs of the algorithm and show that it has minimal impact on application performance. As a result, TLB consistency does not pose an insurmountable obstacle to multiprocessors with several hundred processors. We also discuss hardware support options for TLB consistency ranging from a minor interrupt structure modification to complete hardware implementations. Features are identified in current hardware that compound the TLB consistency problem; removal or correction of these features can simplify and/or reduce the overhead of maintaining TLB consistency in software.","PeriodicalId":359206,"journal":{"name":"ASPLOS III","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132004876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited: 103
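The TLB shootdown the abstract describes follows a simple protocol: the initiating processor locks and updates the shared page table, interrupts every other processor that may have cached the stale translation, and waits until each one has flushed its TLB and acknowledged before proceeding. A toy Python simulation of one shootdown round is sketched below, in the spirit of the Mach algorithm; the class names and thread-based "IPIs" are illustrative assumptions, not the Mach kernel's interfaces.

```python
import threading


class Processor:
    """Simulated CPU with a private TLB (a dict of virtual -> physical)."""

    def __init__(self, pid):
        self.pid = pid
        self.tlb = {}


class ShootdownSim:
    """One shootdown round: update page table, notify peers, await acks."""

    def __init__(self, n_cpus):
        self.cpus = [Processor(i) for i in range(n_cpus)]
        self.page_table = {}
        self.lock = threading.Lock()

    def map(self, vaddr, paddr):
        with self.lock:
            self.page_table[vaddr] = paddr
        for cpu in self.cpus:          # every CPU may cache the entry
            cpu.tlb[vaddr] = paddr

    def shootdown(self, initiator, vaddr):
        acks = []

        def responder(cpu):
            cpu.tlb.pop(vaddr, None)   # flush the stale entry
            acks.append(cpu.pid)       # acknowledge the initiator

        with self.lock:
            # 1. Initiator changes the shared mapping under the lock.
            self.page_table.pop(vaddr, None)
            # 2. "IPI" each other processor (a thread stands in for the
            #    interrupt) and 3. wait for all acknowledgements before
            #    releasing the lock, so no CPU can use the stale mapping.
            peers = [threading.Thread(target=responder, args=(c,))
                     for c in self.cpus if c.pid != initiator]
            for t in peers:
                t.start()
            for t in peers:
                t.join()
        self.cpus[initiator].tlb.pop(vaddr, None)
        return acks
```

The cost structure the paper measures falls out of step 3: the initiator stalls for the slowest responder, so the hardware-support options it discusses all aim at shortening or eliminating that wait.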