
[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture: Latest Publications

Dynamic processor allocation in hypercube computers
Po-Jen Chuang, N. Tzeng
Recognizing various subcubes in a hypercube computer fully and efficiently is nontrivial because of the specific structure of the hypercube. The authors propose a method that has much less complexity than the multiple-GC strategy in generating the search space, while achieving complete subcube recognition. This method is referred to as a dynamic processor allocation scheme because the search space generated depends dynamically on the dimension of the requested subcube, instead of being predetermined and fixed. The basic idea of this strategy lies in collapsing the binary tree representations of a hypercube successively so that nodes which form a subcube but are distant are brought close to each other for recognition. The strategy can be implemented efficiently by using shuffle operations on the leaf node addresses of the binary tree representations. Extensive simulation runs are carried out to collect experimental performance measures of interest for different allocation strategies. Analytic and experimental results show that this strategy compares favorably in many situations with any other known allocation scheme capable of achieving complete subcube recognition.
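To make the shuffle idea concrete, here is a minimal sketch, not the authors' exact algorithm: the function names and the simplified prefix test are illustrative. Cyclically rotating the bits of free-node addresses lets a buddy-style prefix search recognize subcubes whose free dimensions are not adjacent, which a single fixed tree would miss.

```python
# Sketch: shuffle-based subcube recognition in an n-cube (illustrative only).

def shuffle(addr: int, n: int) -> int:
    """One perfect-shuffle step: cyclic left rotation of an n-bit address."""
    msb = (addr >> (n - 1)) & 1
    return ((addr << 1) & ((1 << n) - 1)) | msb

def shuffled_views(free_nodes, n):
    """Yield the free-node set under each of the n bit rotations."""
    view = set(free_nodes)
    for _ in range(n):
        yield view
        view = {shuffle(a, n) for a in view}

def find_subcube(free_nodes, n, k):
    """Find 2**k free nodes sharing one (n-k)-bit prefix in some rotated view.
    Note: this simplified test only sees subcubes whose free dimensions become
    the low-order bits under rotation; the paper's collapsing strategy
    generates a richer search space for the general case."""
    for r, view in enumerate(shuffled_views(free_nodes, n)):
        groups = {}
        for a in view:
            groups.setdefault(a >> k, []).append(a)
        for prefix, members in groups.items():
            if len(members) == 2 ** k:
                return r, prefix          # rotations applied, subcube prefix
    return None

# Nodes {000, 001, 100, 101} form a 2-subcube along dimensions 0 and 2:
# the unrotated buddy view misses it, one rotation regroups it.
print(find_subcube({0b000, 0b001, 0b100, 0b101}, n=3, k=2))   # (1, 0)
```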
{"title":"Dynamic processor allocation in hypercube computers","authors":"Po-Jen Chuang, N. Tzeng","doi":"10.1145/325164.325110","DOIUrl":"https://doi.org/10.1145/325164.325110","url":null,"abstract":"Recognizing various subcubes in a hypercube computer fully and efficiently is nontrivial because of the specific structure of the hypercube. The authors propose a method that has much less complexity than the multiple-GC strategy in generating the search space, while achieving complete subcube recognition. This method is referred to as a dynamic processor allocation scheme because the search space generated is dependent upon the dimension of the requested subcube dynamically, instead of being predetermined and fixed. The basic idea of this strategy lies in collapsing the binary tree representations of a hypercube successively so that the nodes which form a subcube but are distant would be brought close to each other for recognition. The strategy can be implemented efficiently by using shuffle operations on the leaf node addresses of binary tree representations. Extensive simulation runs are carried out to collect experimental performance measures of interest of different allocation strategies. It is shown from analytic and experimental results that this strategy compares favorably in many situations with any other known allocation scheme capable of achieving complete subcube recognition.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116311718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
The TLB slice: a low-cost high-speed address translation mechanism
G. Taylor, Peter Davies, M. Farmwald
The MIPS R6000 microprocessor relies on a new type of translation lookaside buffer, called a TLB slice, which is less than one-tenth the size of a conventional TLB and as fast as one multiplexer delay, yet has a high enough hit rate to be practical. The fast translation makes it possible to use a physical cache without adding a translation stage to the processor's pipeline. The small size makes it possible to include address translation on-chip, even in a technology with a limited number of devices. The key idea behind the TLB slice is to have both a virtual tag and a physical tag on a physically indexed cache.
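A rough sketch of the mechanism as described, with all sizes and field widths as assumptions rather than the R6000's actual parameters: the slice is a tiny direct-mapped table that supplies just enough predicted physical-page bits to finish indexing the physical cache, while the cache's tags catch mispredictions.

```python
# Sketch of a TLB slice in front of a physically indexed cache
# (all sizes are assumptions, not the R6000's).

PAGE_BITS = 12          # 4 KB pages
SLICE_ENTRIES = 16      # indexed by the 4 low bits of the virtual page number
SLICE_OUT_BITS = 4      # physical page bits the slice supplies

slice_table = [None] * SLICE_ENTRIES

def slice_lookup(vaddr: int):
    """Predicted low physical-page bits, or None: one table read, no compare."""
    return slice_table[(vaddr >> PAGE_BITS) % SLICE_ENTRIES]

def slice_fill(vaddr: int, paddr: int):
    """Refill after the full (slow-path) TLB has translated the address."""
    ppn = paddr >> PAGE_BITS
    slice_table[(vaddr >> PAGE_BITS) % SLICE_ENTRIES] = ppn & ((1 << SLICE_OUT_BITS) - 1)

def cache_index(vaddr: int, predicted_bits: int) -> int:
    """Physical cache index: untranslated page-offset bits extended by the
    predicted bits; the cache's physical tag (with a virtual tag alongside)
    detects a wrong guess and forces the slow path."""
    return (predicted_bits << PAGE_BITS) | (vaddr & ((1 << PAGE_BITS) - 1))
```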
{"title":"The TLB slice-a low-cost high-speed address translation mechanism","authors":"G. Taylor, Peter Davies, M. Farmwald","doi":"10.1145/325164.325161","DOIUrl":"https://doi.org/10.1145/325164.325161","url":null,"abstract":"The MIPS R6000 microprocessor relies on a new type of translation lookaside buffer, called a TLB slice, which is less than one-tenth the size of a conventional TLB and as fast as one multiplexer delay, yet has a high enough hit rate to be practical. The fast translation makes it possible to use a physical cache without adding a translation stage to the processor's pipeline. The small size makes it possible to include address translation on-chip, even in a technology with a limited number of devices. The key idea behind the TLB slice is to have both a virtual tag and a physical tag on a physically indexed cache.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130988464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 126
An investigation of static versus dynamic scheduling
C. Love, H. Jordan
Two techniques for instruction scheduling, dynamic and static scheduling, are investigated. A decoupled access execute architecture consists of an execution unit and a memory unit with separate program counters and separate instruction memories. The very long instruction word (VLIW) architecture has only one program counter and relies on the compiler to perform static scheduling of multiple units. To idealize the comparison, the VLIW architecture considered had only two units. The instruction sets and execution times for the two architectures were made as nearly the same as possible. The execution times were compared and analyzed to compare the capabilities of static and dynamic instruction scheduling. Both regular and irregular programs were constructed and optimized by hand for each architecture.
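As a toy illustration of the static half of the comparison, the instruction set and greedy packing below are invented for the example rather than taken from the paper: a compiler-style scheduler packs operations into two-wide long words, one execution slot and one memory slot each, honoring data dependences.

```python
# Toy static scheduler: pack ops into two-wide VLIW words, one execute ('X')
# and one memory ('M') slot per word (instruction set invented for the example).

from collections import namedtuple

Instr = namedtuple("Instr", "name unit deps")

prog = [
    Instr("load_a", "M", []),
    Instr("load_b", "M", []),
    Instr("add",    "X", ["load_a", "load_b"]),
    Instr("store",  "M", ["add"]),
]

def vliw_schedule(instrs):
    """Greedy list scheduling: an op issues only after its producers are in
    earlier words and its unit's slot in the current word is free."""
    done, words, pending = set(), [], list(instrs)
    while pending:
        word, used = [], set()
        for i in list(pending):
            if i.unit not in used and all(d in done for d in i.deps):
                word.append(i); used.add(i.unit); pending.remove(i)
        done.update(i.name for i in word)
        words.append([i.name for i in word])
    return words

# The two loads serialize on the single memory slot; a decoupled machine's
# memory unit would instead run ahead under its own program counter.
print(vliw_schedule(prog))   # [['load_a'], ['load_b'], ['add'], ['store']]
```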
{"title":"An investigation of static versus dynamic scheduling","authors":"C. Love, H. Jordan","doi":"10.1145/325164.325140","DOIUrl":"https://doi.org/10.1145/325164.325140","url":null,"abstract":"Two techniques for instruction scheduling, dynamic and static scheduling, are investigated. A decoupled access execute architecture consists of an execution unit and a memory unit with separate program counters and separate instruction memories. The very long instruction word (VLIW) architecture has only one program counter and relies on the compiler to perform static scheduling of multiple units. To idealize the comparison, the VLIW architecture considered had only two units. The instruction sets and execution times for the two architectures were made as nearly the same as possible. The execution times were compared and analyzed to compare the capabilities of static and dynamic instruction scheduling. Both regular and irregular programs were constructed and optimized by hand for each architecture.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122647977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
APRIL: a processor architecture for multiprocessing
A. Agarwal, B. Lim, D. Kranz, J. Kubiatowicz
The architecture of a rapid-context-switching processor called APRIL, with support for fine-grain threads and synchronization, is described. APRIL achieves high single-thread performance and supports virtual dynamic threads. A commercial reduced-instruction-set-computer-(RISC-) based implementation of APRIL and a run-time software system that can switch contexts in about 10 cycles are described. Measurements taken for several parallel applications on an APRIL simulator show that the overhead for supporting parallel tasks based on futures is reduced by a factor of 2 over a corresponding implementation on the Encore Multimax. The scalability of a multiprocessor based on APRIL is explored using a performance model. The authors show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.
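The abstract's utilization figure is consistent with a standard coarse-grain multithreading analysis. The back-of-the-envelope sketch below is not the paper's performance model; the 40-cycle run length between remote misses is an assumed value. It reproduces roughly 80% utilization with three threads, a 10-cycle context switch, and a 55-cycle latency.

```python
# Back-of-the-envelope utilization for switch-on-miss multithreading
# (standard analysis; the 40-cycle run length is an assumed value).

def utilization(n_threads: int, run_length: float,
                latency: float, switch_cost: float) -> float:
    """Each thread computes run_length cycles, then stalls latency cycles;
    switching to another resident thread costs switch_cost cycles."""
    if (n_threads - 1) * (run_length + switch_cost) >= latency:
        # Saturated: the other threads fully hide the latency; only
        # switch cycles are lost.
        return run_length / (run_length + switch_cost)
    # Unsaturated: part of the latency is exposed in each thread's period.
    return n_threads * run_length / (run_length + switch_cost + latency)

print(utilization(3, run_length=40, latency=55, switch_cost=10))   # 0.8
```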
{"title":"APRIL: a processor architecture for multiprocessing","authors":"A. Agarwal, B. Lim, D. Kranz, J. Kubiatowicz","doi":"10.1145/325164.325119","DOIUrl":"https://doi.org/10.1145/325164.325119","url":null,"abstract":"The architecture of a rapid-context-switching processor called APRIL, with support for fine-grain threads and synchronization, is described. APRIL achieves high single-thread performance and supports virtual dynamic threads. A commercial reduced-instruction-set-computer-(RISC-) based implementation of APRIL and a run-time software system that can switch contexts in about 10 cycles are described. Measurements taken for several parallel applications on an APRIL simulator show that the overhead for supporting parallel tasks based on futures is reduced by a factor of 2 over a corresponding implementation on the Encore Multimax. The scalability of a multiprocessor based on APRIL is explored using a performance model. The authors show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125360046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 447
Adaptive software cache management for distributed shared memory architectures
J. Bennett, J. Carter, W. Zwaenepoel
An adaptive cache coherence mechanism exploits semantic information about the expected or observed access behavior of particular data objects. The authors contend that, in distributed shared-memory systems, adaptive cache coherence mechanisms will outperform static cache coherence mechanisms. They have examined the sharing and synchronization behavior of a variety of shared-memory parallel programs. It is found that the access patterns of a large percentage of shared data objects fall into a small number of categories for which efficient software coherence mechanisms exist. In addition, the authors have performed a simulation study that provides two examples of how an adaptive caching mechanism can take advantage of semantic information.
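A minimal sketch of the adaptive idea follows; the category names and thresholds are illustrative, chosen in the spirit of the paper's classification rather than copied from it. Each shared object carries an expected access pattern, and observed behavior can re-classify it onto a cheaper coherence mechanism.

```python
# Sketch: per-object choice of coherence mechanism (category names and
# thresholds are illustrative).

COHERENCE_FOR = {
    "write-once":  "replicate freely once initialized; no further invalidations",
    "read-mostly": "replicate widely; invalidate only on the rare writes",
    "migratory":   "ship the single copy to the accessing processor",
    "general":     "fall back to a conventional write-invalidate protocol",
}

class SharedObject:
    def __init__(self, name: str, expected: str = "general"):
        self.name, self.pattern = name, expected   # expected-behaviour hint
        self.reads = self.writes = 0

    def record(self, is_write: bool):
        if is_write:
            self.writes += 1
        else:
            self.reads += 1

    def adapt(self):
        """Re-classify from observed behaviour."""
        if self.writes <= 1 and self.reads > 0:
            self.pattern = "write-once"
        elif self.reads > 10 * self.writes:
            self.pattern = "read-mostly"

    def mechanism(self) -> str:
        return COHERENCE_FOR[self.pattern]

table = SharedObject("lookup_table")
table.record(is_write=True)                 # initialized once...
for _ in range(50):
    table.record(is_write=False)            # ...then only read
table.adapt()
print(table.pattern, "->", table.mechanism())   # write-once -> replicate freely
```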
{"title":"Adaptive software cache management for distributed shared memory architectures","authors":"J. Bennett, J. Carter, W. Zwaenepoel","doi":"10.1145/325164.325124","DOIUrl":"https://doi.org/10.1145/325164.325124","url":null,"abstract":"An adaptive cache coherence mechanism exploits semantic information about the expected or observed access behavior of particular data objects. The authors contend that, in distributed shared-memory systems, adaptive cache coherence mechanisms will outperform static cache coherence mechanisms. They have examined the sharing and synchronization behavior of a variety of shared-memory parallel programs. It is found that the access patterns of a large percentage of shared data objects fall into a small number of categories for which efficient software coherence mechanisms exist. In addition, the authors have performed a simulation study that provides two examples of how an adaptive caching mechanism can take advantage of semantic information.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122718618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 191
Maximizing performance in a striped disk array
Peter M. Chen, D. Patterson
Improvements in disk speeds have not kept up with improvements in processor and memory speeds. One way to correct the resulting speed mismatch is to stripe data across many disks. The authors address how to stripe data to get maximum performance from the disks. Specifically, they examine how to choose the striping unit, that is, the amount of logically contiguous data on each disk. Rules for determining the best striping unit for a given range of workloads are synthesized. It is shown how the choice of striping unit depends on only two parameters: (1) the number of outstanding requests in the disk system at any given time, and (2) the product of the disks' average positioning time and data transfer rate. The authors derive an equation for the optimal striping unit as a function of these two parameters; they also show how to choose the striping unit without prior knowledge about the workload.
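Below is a hedged reconstruction of the kind of rule the paper synthesizes. Only the two governing parameters come from the abstract; the linear functional form in (concurrency − 1) and the 0.25 coefficient are assumptions made for illustration, not values taken from the paper.

```python
# Hedged reconstruction: striping unit from concurrency and the positioning-
# time * transfer-rate product (form and coefficient are assumptions).

SECTOR_BYTES = 512

def striping_unit_bytes(concurrency: int, avg_positioning_time_s: float,
                        transfer_rate_bytes_per_s: float,
                        coeff: float = 0.25) -> float:
    """At concurrency 1, stripe each request across all disks (unit ~ 1 sector);
    as concurrency grows, larger units keep each request on fewer disks so
    independent requests can proceed in parallel."""
    positioning_bytes = avg_positioning_time_s * transfer_rate_bytes_per_s
    return coeff * positioning_bytes * (concurrency - 1) + SECTOR_BYTES

# 1990-era assumptions: 20 ms average positioning time, 1 MB/s transfer rate.
print(striping_unit_bytes(5, 0.020, 1_000_000))   # 20512.0 bytes, about 20 KB
```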
{"title":"Maximizing performance in a striped disk array","authors":"Peter M. Chen, D. Patterson","doi":"10.1145/325164.325158","DOIUrl":"https://doi.org/10.1145/325164.325158","url":null,"abstract":"Improvements in disk speeds have not kept up with improvements in processor and memory speeds. One way to correct the resulting speed mismatch is to stripe data across many disks. The authors address how to stripe data to get maximum performance from the disks. Specifically, they examine how to choose the striping unit, that is, the amount of logically contiguous data on each disk. Rules for determining the best striping unit for a given range of workloads are synthesized. It is shown how the choice of striping unit depends on only two parameters: (1) the number of outstanding requests in the disk system at any given time, and (2) the average positioning time*data transfer rate of the disks. The authors derive an equation for the optimal striping unit as a function of these two parameters; they also show how to choose the striping unit without prior knowledge about the workload.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125819394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 240
The impact of synchronization and granularity on parallel systems
D. Chen, H. Su, P. Yew
A study is made of the impact of synchronization and granularity on the performance of parallel systems using an execution-driven simulation technique. It is found that, even though there can be a lot of parallelism at the fine-grain level, synchronization and scheduling strategies determine the ultimate performance of the system. Loop-iteration-level parallelism seems to be a more appropriate level when those factors are considered. Barrier synchronization and data synchronization at the loop-iteration level are also studied. It is found that both schemes are needed for a better performance.
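The following toy contrast, with Python threads standing in for processors (it is not the paper's execution-driven simulator), shows the difference between the two schemes at loop-iteration level: data synchronization lets iteration i of the second loop start as soon as element i is produced, where a barrier would make the whole second loop wait for the whole first loop.

```python
# Toy contrast of the two schemes (Python threads stand in for processors).

import threading

N = 8
a = [0] * N
ready = [threading.Event() for _ in range(N)]   # per-element full/empty bit

def producer(i):            # "loop 1", iteration i
    a[i] = i * i
    ready[i].set()          # mark element i full

def consumer(i, out):       # "loop 2", iteration i
    ready[i].wait()         # data synchronization: wait on element i only;
    out[i] = a[i] + 1       # a barrier would instead wait for ALL of loop 1

out = [0] * N
threads = [threading.Thread(target=producer, args=(i,)) for i in range(N)]
threads += [threading.Thread(target=consumer, args=(i, out)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(out)   # [1, 2, 5, 10, 17, 26, 37, 50]
```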
{"title":"The impact of synchronization and granularity on parallel systems","authors":"D. Chen, H. Su, P. Yew","doi":"10.1145/325164.325150","DOIUrl":"https://doi.org/10.1145/325164.325150","url":null,"abstract":"A study is made of the impact of synchronization and granularity on the performance of parallel systems using an execution-driven simulation technique. It is found that, even though there can be a lot of parallelism at the fine-grain level, synchronization and scheduling strategies determine the ultimate performance of the system. Loop-iteration-level parallelism seems to be a more appropriate level when those factors are considered. Barrier synchronization and data synchronization at the loop-iteration level are also studied. It is found that both schemes are needed for a better performance.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124557859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 95
The performance impact of block sizes and fetch strategies
S. Przybylski
The interactions between a cache's block size, fetch size, and fetch policy from the perspective of maximizing system-level performance are explored. It has been previously noted that, given a simple fetch strategy, the performance-optimal block size is almost always four or eight words. If there is even a small cycle time penalty associated with either longer blocks or fetches, then the performance-optimal size is noticeably reduced. In split cache organizations, where the fetch and block sizes of instruction and data caches are all independent design variables, instruction cache block size and fetch size should be the same. For the workload and write-back write policy used in this trace-driven simulation study, the instruction cache block size should be about a factor of 2 greater than the data cache fetch size, which in turn should be equal to or double the data cache block size. The simplest fetch strategy of fetching only on a miss and stalling the CPU until the fetch is complete works well. Complicated fetch strategies do not produce the performance improvements indicated by the accompanying reductions in miss ratios because of limited memory resources and a strong temporal clustering of cache misses. For the environments simulated, the most effective fetch strategy improved performance by between 1.7% and 4.5% over the simplest strategy described above.
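A toy model of the tradeoff being measured, where the miss-ratio curve and memory timings are invented placeholders rather than the paper's trace data: larger blocks amortize latency but raise the miss penalty, and even a small per-word cycle-time penalty visibly shifts the optimum toward smaller blocks.

```python
# Toy mean-access-time model (miss-ratio curve and memory timings invented).

def mean_access_time(block_words: int, cycle_penalty: float = 0.0) -> float:
    miss_ratio = 0.10 / block_words + 0.001 * block_words   # assumed curve
    miss_penalty = 6.0 + 1.0 * block_words                  # latency + transfer
    cycle_time = 1.0 + cycle_penalty * block_words          # cost of long fetches
    return cycle_time + miss_ratio * miss_penalty

for b in (1, 2, 4, 8, 16):
    print(b, round(mean_access_time(b), 3), round(mean_access_time(b, 0.01), 3))
# With no cycle-time penalty the optimum here is 8 words; charging a small
# per-word penalty moves it down to 4 words.
```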
{"title":"The performance impact of block sizes and fetch strategies","authors":"S. Przybylski","doi":"10.1145/325164.325135","DOIUrl":"https://doi.org/10.1145/325164.325135","url":null,"abstract":"The interactions between a cache's block size, fetch size, and fetch policy from the perspective of maximizing system-level performance are explored. It has been previously noted that, given a simple fetch strategy, the performance optimal block size is almost always four or eight words. If there is even a small cycle time penalty associated with either longer blocks or fetches, then the performance optimal size is noticeably reduced. In split cache organizations, where the fetch and block sizes of instruction and data caches are all independent design variables, instruction cache block size and fetch size should be the same. For the workload and write-back write policy used in this trace-driven simulation study, the instruction cache block size should be about a factor of 2 greater than the data cache fetch size, which in turn should be equal to or double the data cache block size. The simplest fetch strategy of fetching only on a miss and stalling the CPU until the fetch is complete works well. Complicated fetch strategies do not produce the performance improvements indicated by the accompanying reductions in miss ratios because of limited memory resources and a strong temporal clustering of cache misses. For the environments simulated, the most effective fetch strategy improved performance by between 1.7% and 4.5% over the simplest strategy described above.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122916996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 108
Trace-driven simulations for a two-level cache design of open bus systems
Hakon O. Bugge, E. Kristiansen, B. O. Bakka
Two-level cache hierarchies will be a design issue in future high-performance CPUs. An evaluation is made of various metrics for data cache designs. A discussion is presented of one- and two-level cache hierarchies. The target is a new 100+ MIPS CPU, but the methods are applicable to any cache design. The basis of this work is a new trace-driven, multiprocess cache simulator. The simulator incorporates a simple priority-based scheduler which controls the execution of the processes. The scheduler blocks a process when a system call is executed. A workload consists of a total of 60 processes, distributed among seven unique programs with about nine instances each. Two open bus systems, Futurebus+ and Scalable Coherent Interface (SCI), that support a coherent memory model, are discussed as interconnect systems for main memory.
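For the flavor of such an evaluation, here is a minimal trace-driven two-level model, direct-mapped with invented sizes and a single process, far simpler than the paper's scheduler-driven multiprocess simulator: feed an address trace through L1, send L1 misses to L2, and report both miss ratios.

```python
# Minimal trace-driven two-level cache model (direct-mapped, invented sizes).

def simulate(trace, l1_lines=64, l2_lines=1024, block=16):
    l1, l2 = [None] * l1_lines, [None] * l2_lines
    l1_miss = l2_miss = 0
    for addr in trace:
        tag = addr // block                 # block address
        if l1[tag % l1_lines] != tag:       # L1 lookup
            l1_miss += 1
            l1[tag % l1_lines] = tag
            if l2[tag % l2_lines] != tag:   # L1 misses go to L2
                l2_miss += 1
                l2[tag % l2_lines] = tag
    n = len(trace)
    return l1_miss / n, l2_miss / n         # global miss ratios

trace = [(i * 4) % 8192 for i in range(100_000)]   # toy sequential sweep
print(simulate(trace))   # L1 thrashes on the sweep; L2 holds the working set
```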
{"title":"Trace-driven simulations for a two-level cache design of open bus systems","authors":"Hakon O. Bugge, E. Kristiansen, B. O. Bakka","doi":"10.1145/325164.325151","DOIUrl":"https://doi.org/10.1145/325164.325151","url":null,"abstract":"Two-level cache hierarchies will be a design issue in future high-performance CPUs. An evaluation is made of various metrics for data cache designs. A discussion is presented of one- and two-level cache hierarchies. The target is a new 100+ MIPS CPU, but the methods are applicable to any cache design. The basis of this work is a new trace-driven, multiprocess cache simulator. The simulator incorporates a simple priority-based scheduler which controls the execution of the processes. The scheduler blocks a process when a system call is executed. A workload consists of a total of 60 processes, distributed among seven unique programs with about nine instances each. Two open bus systems, Futurebus+ and Scalable Coherent Interface (SCI), that support a coherent memory model, are discussed as the interconnect system for main memory.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122148503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 34
Boosting beyond static scheduling in a superscalar processor
Michael D. Smith, M. Lam, M. Horowitz
A superscalar processor that combines the best qualities of static and dynamic instruction scheduling to increase the performance of nonnumerical applications is described. The architecture performs all instruction scheduling statically to take advantage of the compiler's ability to schedule operations across many basic blocks efficiently. Since the conditional branches in nonnumerical code are highly data dependent, the architecture introduces the concept of boosted instructions, that is, instructions that are committed conditionally upon the result of later branch instructions. Boosting effectively removes the dependences caused by branches and makes the scheduling of side-effect instructions as simple as it is for instructions that are side-effect free. For efficiency, boosting is supported in the hardware by shadow structures that temporarily hold the side effects of boosted instructions until the conditional branches that the boosted instructions depend upon are executed.
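A minimal sketch of the commit/squash behavior of boosting follows; the single shadow level and the register names are illustrative assumptions, not the paper's hardware design. A boosted instruction writes shadow state that becomes architectural only if the branch it was moved above resolves along the predicted path.

```python
# Sketch: shadow state for boosted instructions (single shadow level assumed).

class BoostedRegisterFile:
    def __init__(self):
        self.arch = {}      # architectural registers
        self.shadow = {}    # uncommitted side effects of boosted instructions

    def write(self, reg, value, boosted=False):
        (self.shadow if boosted else self.arch)[reg] = value

    def read(self, reg):
        # Later boosted instructions see earlier boosted results.
        return self.shadow.get(reg, self.arch.get(reg))

    def branch_resolved(self, went_predicted_way: bool):
        if went_predicted_way:
            self.arch.update(self.shadow)   # commit the boosted side effects
        self.shadow.clear()                 # otherwise squash them

rf = BoostedRegisterFile()
rf.write("r1", 10)
rf.write("r1", 99, boosted=True)   # moved above a conditional branch
rf.branch_resolved(went_predicted_way=False)
print(rf.read("r1"))               # 10: the boosted write was squashed
```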
{"title":"Boosting beyond static scheduling in a superscalar processor","authors":"Michael D. Smith, M. Lam, M. Horowitz","doi":"10.1145/325164.325160","DOIUrl":"https://doi.org/10.1145/325164.325160","url":null,"abstract":"A superscalar processor that combines the best qualities of static and dynamic instruction scheduling to increase the performance of nonnumerical applications is described. The architecture performs all instruction scheduling statically to take advantage of the compiler's ability to schedule operations across many basic blocks efficiently. Since the conditional branches in nonnumerical code are highly data dependent, the architecture introduces the concept of boosted instructions, that is, instructions that are committed conditionally upon the result of later branch instructions. Boosting effectively removes the dependences caused by branches and makes the scheduling of side-effect instructions as simple as it is for instructions that are side-effect free. For efficiency, boosting is supported in the hardware by shadow structures that temporarily hold the side effects of boosted instructions until the conditional branches that the boosted instructions depend upon are executed.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127724719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 145