
[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture: Latest Publications

Chaos router: architecture and performance
S. Konstantinidou, L. Snyder
The Chaos router is an adaptive, randomized message router for multicomputers. Adaptive routers are superior to oblivious routers, the state of the art, because they can bypass congestion and faults. Unlike other adaptive routers, however, the Chaos router has reduced the complexity along the critical path of the routing decision by using randomization to eliminate livelock protection. The foundational theory for Chaotic routing, proving that this approach is sound, has been previously developed [11]. In this paper we present the complete design of the router together with (simulated) performance figures. The results show that the Chaos router is competitive with the simple and fast oblivious routers for random loads and greatly superior for loads with hot spots.
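The routing decision described in the abstract can be pictured with a small sketch: prefer a free profitable channel, buffer otherwise, and when the buffer is full misroute a randomly chosen message instead of invoking a deterministic livelock-protection mechanism. The C sketch below is illustrative only, on a toy 2-D mesh with assumed names (Message, Node, route_one_step); it is not the published Chaos router design.

/*
 * Illustrative sketch of an adaptive, randomized routing decision.
 * All structure names, port counts and queue sizes are assumptions.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_CHANNELS 4                  /* +X, -X, +Y, -Y on a 2-D mesh   */
#define QUEUE_SLOTS  5

typedef struct { int dx, dy; } Message; /* remaining hops to destination  */

typedef struct {
    Message queue[QUEUE_SLOTS];
    int     queue_len;
    bool    channel_free[NUM_CHANNELS];
} Node;

/* Channel c is "profitable" if it moves the message closer to its target. */
static bool profitable(const Message *m, int c)
{
    switch (c) {
    case 0:  return m->dx > 0;   /* +X */
    case 1:  return m->dx < 0;   /* -X */
    case 2:  return m->dy > 0;   /* +Y */
    default: return m->dy < 0;   /* -Y */
    }
}

static void send(Node *n, const Message *m, int c)
{
    n->channel_free[c] = false;         /* channel busy for this cycle     */
    printf("sent message (dx=%d, dy=%d) on channel %d\n", m->dx, m->dy, c);
}

static void route_one_step(Node *n, Message *incoming)
{
    /* Fast path: any free, profitable channel takes the message. */
    for (int c = 0; c < NUM_CHANNELS; c++)
        if (n->channel_free[c] && profitable(incoming, c)) {
            send(n, incoming, c);
            return;
        }

    if (n->queue_len < QUEUE_SLOTS) {   /* otherwise buffer it and wait    */
        n->queue[n->queue_len++] = *incoming;
        return;
    }

    /* Queue full: misroute a victim chosen uniformly at random on any free
     * channel, then take the incoming message into the freed slot.  The
     * randomized choice of victim is what lets this style of router drop
     * deterministic livelock protection from the critical path. */
    int victim = rand() % QUEUE_SLOTS;
    for (int c = 0; c < NUM_CHANNELS; c++)
        if (n->channel_free[c]) {
            send(n, &n->queue[victim], c);
            n->queue[victim] = *incoming;
            return;
        }
    /* No free channel this cycle: the message stalls on its input link.   */
}

int main(void)
{
    Node n = { .queue_len = 0,
               .channel_free = { true, true, true, true } };
    Message m = { .dx = 2, .dy = -1 };
    route_one_step(&n, &m);
    return 0;
}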
{"title":"Chaos router: architecture and performance","authors":"S. Konstantinidou, L. Snyder","doi":"10.1145/115952.115974","DOIUrl":"https://doi.org/10.1145/115952.115974","url":null,"abstract":"The Chaos router is an adaptive, randomized message router for multlicomputers. Aclaptive routers are superior to oblivious routers, the state-of-the-art, because they can by-pass congestion and faults. unlike other adaptive routers, however, the Chaos router has reduced the complexity along the critical path of the routing decision by using randomization to eliminate livelock protection, The foundational theory for Chaotic routing, proving that, this approach is sound, has been previously de~-eloped [1 1]. In this paper we present, the complete design of the router together with (simulated) performance figures. The results show that, the Chaos t-outer is competitive with the simple and fast obli~ious routers for random loads and greatly superior for loads with hot spots.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"213 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122519081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 120
Exploiting fine-grained parallelism through a combination of hardware and software techniques
S. Melvin, Y. Patt
It has been suggested that non-scientific code has very little parallelism not already exploited by existing processors. In this paper we show that, contrary to this notion, there is actually a significant amount of unexploited parallelism in typical general purpose code. In order to exploit this parallelism, a combination of hardware and software techniques must be applied. We analyze three techniques: dynamic scheduling, speculative execution and basic block enlargement. We will show that indeed for narrow instruction words little is to be gained by applying these techniques. However, as the number of simultaneous operations increases, it becomes possible to achieve speedups of three to six on realistic processors.
{"title":"Exploiting fine-grained parallelism through a combination of hardware and software techniques","authors":"S. Melvin, Y. Patt","doi":"10.1145/115953.115981","DOIUrl":"https://doi.org/10.1145/115953.115981","url":null,"abstract":"It has been suggested that non-scientific code has very little parallelism not already exploited by existing vrocesso~s. In this ABSTRACT It has been suggested that non-scientific code has very little parallelism not already exploited by existing vrocesso~s. In this paper we show that &nt& to this notiOK (here is actually a significant amount of unexploited parallelism in typical general purpose code. In order to exploit this parallelism, a combination of hardware and software techniques must be applied. We analyze three techniques: dynamic scheduling, speculative execution and basic block enlargement. We will show that indeed for narrow instruction words little is tobegainedby applying these techniques. However, as the number of simultaneous operations increases, it becomes possible to achieve speedups of three to six on realistic processors.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125930551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 49
Comparison of hardware and software cache coherence schemes
S. Adve, Vikram S. Adve, M. Hill, M. Vernon
We use mean value analysis models to compare representative hardware and software cache coherence schemes for a large-scale shared-memory system. Our goal is to identify the workloads for which either of the schemes is significantly better. Our methodology improves upon previous analytical studies and complements previous simulation studies by developing a common high-level workload model that is used to derive separate sets of low-level workload parameters for the two schemes. This approach allows an equitable comparison of the two schemes for a specific workload. Our results show that software schemes are comparable (in terms of processor efficiency) to hardware schemes for a wide class of programs. The only cases for which software schemes perform significantly worse than ... Software-based coherence is attractive because the overhead of detecting stale data is transferred from runtime to compile time, and the design complexity is transferred from hardware to software. However, software schemes may perform poorly because compile-time analysis may need to be conservative, leading to unnecessary cache misses and main memory updates. In this paper, we use approximate Mean Value Analysis [U88] to compare the performance of a representative software scheme with a directory-based hardware scheme on a large-scale shared-memory system. In a previous study comparing the performance of hardware and software coherence, Cheong and Veidenbaum used a parallelizing compiler to implement three different software coherence schemes [Che90]. For selected subroutines of seven programs, they show that the hit ratio of their most sophisticated software scheme (version control) ...
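The abstract's methodology rests on mean value analysis of a closed queueing model of the memory system. As a point of reference, the sketch below runs textbook exact MVA for a single-class closed network; the service demands and center names are made-up illustrations, and the paper itself uses a more detailed approximate, multi-class MVA model.

/*
 * Minimal exact Mean Value Analysis for a single-class closed queueing
 * network.  The service demands below are invented numbers, not parameters
 * from the paper.
 */
#include <stdio.h>

#define CENTERS 3   /* e.g. CPU, interconnect, memory module */

int main(void)
{
    /* Service demand D_k = visit ratio * mean service time (in cycles). */
    double D[CENTERS] = { 5.0, 2.0, 8.0 };
    double Q[CENTERS] = { 0.0, 0.0, 0.0 };    /* mean queue lengths        */
    double R[CENTERS];                        /* per-center residence time */
    int N = 16;                               /* number of processors      */

    double X = 0.0;
    for (int n = 1; n <= N; n++) {
        double Rtotal = 0.0;
        for (int k = 0; k < CENTERS; k++) {
            R[k] = D[k] * (1.0 + Q[k]);       /* arrival theorem           */
            Rtotal += R[k];
        }
        X = (double)n / Rtotal;               /* system throughput         */
        for (int k = 0; k < CENTERS; k++)
            Q[k] = X * R[k];                  /* Little's law per center   */
    }

    printf("throughput = %.4f jobs/cycle, cycle time = %.2f cycles\n",
           X, (double)N / X);
    return 0;
}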
{"title":"Comparison of hardware and software cache coherence schemes","authors":"S. Adve, Vikram S. Adve, M. Hill, M. Vernon","doi":"10.1145/115953.115982","DOIUrl":"https://doi.org/10.1145/115953.115982","url":null,"abstract":"We use mean value analysis models to compare representative hardware and software cache coherence schemes for a large-scale shared-memory system. Our goal is to identify the workloads for which either of the schemes is significantly better. Our methodology improves upon previous analytical studies and complements previous simulation studies by developing a common high-level workload model that is used to derive separate sets of lowlevel workload parameters for the two schemes. This approach allows an equitable comparison of the two schemes for a specific workload. is attractive because the overhead of detecting stale data is transferred from runtime to compile time, and the design complexity is transferred from hardware to software. However. software schemes may perform poorly because compile-time analysis may need IO be conservative, leading to unnecessary cache misses and main memory updates. In this paper, we use approximate Mean Value Analysis [U881 to compare the performance of a representative software scheme with a directory-based hardware scheme on a large-scale shared-memory system. In a previous study comparing the performance of hardware and software coherence, Cheong and VeidenOur resuIi, show that software schemes are haum used a parallelizing compiler to implement three difable (in terms of processor efficiency) IO hardware schemes ferent Software coherence schemes [Che90]. For selccted for a wide class of programs. The only cases for which subroutines Of Seven programs, they show that the hit ratio software schemes ,,erform sienificmtlv worse than of their most sophisticated software scheme (version con, ~~~ ~~~~~~ r~","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130799567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 91
Scheduling pipelined communication in distributed memory multiprocessors for real-time applications
S. Shukla, D. Agrawal
This paper investigates communication in distributed memory multiprocessors to support task-level parallelism for real-time applications. It is shown that wormhole routing, used in second generation multicomputers, does not support task-level pipelining because its oblivious contention resolution leads to output inconsistency in which a constant throughput is not guaranteed. We propose scheduled routing, which guarantees constant throughputs by integrating task specifications with flow-control. In this routing technique, communication processors provide explicit flow-control by independently executing switching schedules computed at compile time. It is deadlock-free, contention-free, does not load the intermediate node memory, and makes use of the multiple equivalent paths between non-adjacent nodes. The resource allocation and scheduling problems resulting from such routing are formulated and related implementation issues are analyzed. A comparison with wormhole routing for various generalized hypercubes and tori shows that scheduled routing is effective in providing a constant throughput when wormhole routing does not and enables pipelining at higher input arrival rates.
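The central mechanism is a switching schedule computed at compile time and executed independently by each communication processor, so output contention is resolved offline. A minimal illustration follows; the per-node schedule table, port count and slot count are assumptions for the sketch, not taken from the paper.

/*
 * Illustrative sketch of compile-time scheduled routing: in time slot t the
 * node connects input port `in` to output port schedule[t][in] (-1 = idle).
 * Because no two inputs share an output in any slot, there is no run-time
 * contention, which is what yields a constant throughput.
 */
#include <stdio.h>

#define PORTS 4
#define SLOTS 3

/* schedule[t][in] = output port driven by input `in` during slot t */
static const int schedule[SLOTS][PORTS] = {
    { 1, -1,  3,  0 },
    { 2,  0, -1,  1 },
    {-1,  3,  1,  2 },
};

static void crossbar_connect(int in, int out)
{
    printf("  in %d -> out %d\n", in, out);
}

int main(void)
{
    for (int cycle = 0; cycle < 6; cycle++) {
        int t = cycle % SLOTS;              /* the schedule repeats periodically */
        printf("cycle %d (slot %d):\n", cycle, t);
        for (int in = 0; in < PORTS; in++)
            if (schedule[t][in] >= 0)
                crossbar_connect(in, schedule[t][in]);
    }
    return 0;
}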
{"title":"Scheduling pipelined communication in distributed memory multiprocessors for real-time applications","authors":"S. Shukla, D. Agrawal","doi":"10.1145/115952.115975","DOIUrl":"https://doi.org/10.1145/115952.115975","url":null,"abstract":"This paper investigates communication in distributed memory multiprocessors to support tasklevel parallelism for real-time applications. It is shown that wormhole routing, used in second generation multicomputers, does not support task-level pipelining because its oblivious contention resolution leads to output inconsistency in which a constant throughput is not guaranteed. We propose scheduled routing which guarantees constant throughputs by integrating task specifications with flow-control. In this routing technique, communication processors provide explicit flowcontrol by independently executing switching schedules computed at compile-time. It is deadlock-free, contention-free, does not load the intermediate node memory, and makes use of the multiple equivalent paths between non-adjacent nodes. The resource allocation and scheduling problems resulting from such routing are formulated and related implementation issues are anal yzed. A comparison with wormhole routing for various generalized hyp ercubes and tori shows that scheduled routing is effective in providing a constant throughput when wormhole routing does not and enables pipelining at higher input arrival rates.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"268 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122756124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 35
Performance evaluation of a communication system for transputer-networks based on monitored event traces
C. W. Oehlrich, Andreas Quick
Most parallel applications (e.g. image processing, multigrid algorithms) in a Transputer network require a lot of communication between the processing nodes. For such applications the communication system TRACOS was developed to support data transfer between arbitrary Transputers in the network. To maximize the performance of the parallel system, its dynamic internal behavior has to be analyzed. For this purpose event-driven monitoring is an appropriate technique. It reduces the dynamic behavior of the system to sequences of events. They are recorded by a monitor system and stored as event traces. In this paper the communication system TRACOS and its performance evaluation based on monitored event traces are presented. First a synthetic workload was instrumented and monitored with the distributed hardware monitor ZM4. The results showed that the performance of TRACOS is poor for packets smaller than 4 Kbyte. Therefore, TRACOS itself was instrumented and monitored to get insight into the interactions and interdependencies of all TRACOS processes. Based on the monitoring results, TRACOS could be improved, which led to a performance increase of 25%.
{"title":"Performance evaluation of a communication system for transputer-networks based on monitored event traces","authors":"C. W. Oehlrich, Andreas Quick","doi":"10.1145/115952.115973","DOIUrl":"https://doi.org/10.1145/115952.115973","url":null,"abstract":"Most parallel applications (e.g. image processing, mtdtigrid algorithms) in a Transputer-network require a lot of communication between the processing nodes. For such applications the communication system TRACOS was developed to support data transfer between random Transputers in the network. To marimize the performance of the parallel system, its dynamic internal behavior has to be analyzed. For this purpose event-driven monitoring is an appropriate technique. It reduces the dynamic behavior of the system to sequences of events. They are recor&d by a monitor system and stored as event traces. In this paper the communication system TRACOS and its performance evaluation based on monitored event traces are presented. First a synthetic workload was instrumented and monitored with the distributed hardware monitor ZM4. The results showed that the performance of TRACOS is poor for packets smaller than 4 Kbyte. Therefore, TRACOS itself was instrumented and monitored to get insight into the interactions and interdependencies of all TRACOS processes. Based on the monitoring resuhs, TRACOS could be improved which led to a performance increase of 25T0.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132291776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
High performance interprocessor communication through optical wavelength division multiple access channels
P. Dowd
A multiprocessor system with a large number of nodes can be built at low cost by combining the recent advances in high capacity channels available through optical fiber communication. A highly fault tolerant system is created with good performance characteristics at a reduction in system complexity. The system capitalizes on the optical self-routing characteristic of wavelength division multiple access to improve performance and reduce complexity. This paper examines typical optical multiple access channel implementations and shows that the star-coupled approach is superior due to optical power budget considerations. Star-coupled configurations which exhibit the optical self-routing characteristic are then studied. A hypercube based structure is introduced where optical multiple access channels span the dimensional axes. This severely reduces the required degree since only one I/O port is required per dimension, and performance is maintained through the high capacity characteristics of optical communication.
{"title":"High performance interprocessor communication through optical wavelength division multiple access channels","authors":"P. Dowd","doi":"10.1145/115952.115963","DOIUrl":"https://doi.org/10.1145/115952.115963","url":null,"abstract":"A multiprocessor system with a large number of nodes can be built at low cost by combining the recent advances in high capacity channels available through optical fiber communication. A highly fault tolerant system is created with good performance characteristics at a reduction in system complexity. The system capitalizes on the optical selfrouting characteristic of wavelength division multiple access to improve performance and reduce complexity. This paper examines typical optical multiple access channel implementations and shows that the star-coupled approach is superior due to optical power budget considerations. Star-coupled configurations which exhibit the optical self-routing characteristic are then studied. A hypercube based structure is introduced where optical multiple access channels span the dimensional axes. This severely reduces the required degree since only one 1/0 port is required per dimension, and performance is maintained through the high capacity characteristics of optical communication.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116105426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 94
Single instruction stream parallelism is greater than two
M. Butler, Tse-Yu Yeh, Y. Patt, M. Alsup, H. Scales, M. Shebanow
Recent studies have concluded that little parallelism (less than two operations per cycle) is available in single instruction streams. Since the amount of available parallelism should influence the design of the processor, it is important to verify how much parallelism really exists. In this study we model the execution of the SPEC benchmarks under differing resource constraints. We repeat the work of the previous researchers, and show that under the hardware resource constraints they imposed, we get similar results. On the other hand, when all constraints are removed except those required by the semantics of the program, we have found degrees of parallelism in excess of 17 instructions per cycle. Finally, and perhaps most important for exploiting single instruction stream parallelism now, we show that if the hardware is properly balanced, one can sustain from 2.0 to 5.8 instructions per cycle on a processor that is reasonable to design today.
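A figure such as "parallelism in excess of 17 instructions per cycle" comes from scheduling a dynamic instruction trace with only true data dependences as constraints. The sketch below shows that style of measurement on a made-up three-instruction trace with unit latencies; the trace format and register count are assumptions for illustration, not the study's infrastructure.

/*
 * Dataflow-limit parallelism: each traced instruction issues one cycle after
 * the latest cycle in which any of its source registers was produced.
 */
#include <stdio.h>

#define NUM_REGS 32

typedef struct { int dst, src1, src2; } TraceOp;   /* -1 = no register */

int main(void)
{
    TraceOp trace[] = {
        { 1,  2,  3 },     /* r1 = f(r2, r3)                                */
        { 4,  5, -1 },     /* r4 = f(r5)     -- independent of the first op */
        { 6,  1,  4 },     /* r6 = f(r1, r4) -- depends on both             */
    };
    int n = (int)(sizeof trace / sizeof trace[0]);

    int ready[NUM_REGS] = { 0 };   /* cycle in which each register is ready */
    int last_cycle = 0;

    for (int i = 0; i < n; i++) {
        int issue = 0;
        if (trace[i].src1 >= 0 && ready[trace[i].src1] > issue)
            issue = ready[trace[i].src1];
        if (trace[i].src2 >= 0 && ready[trace[i].src2] > issue)
            issue = ready[trace[i].src2];
        ready[trace[i].dst] = issue + 1;            /* unit latency assumed */
        if (issue + 1 > last_cycle)
            last_cycle = issue + 1;
    }

    printf("%d instructions in %d cycles -> dataflow-limit IPC = %.2f\n",
           n, last_cycle, (double)n / last_cycle);
    return 0;
}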
{"title":"Single instruction stream parallelism is greater than two","authors":"M. Butler, Tse-Yu Yeh, Y. Patt, M. Alsup, H. Scales, M. Shebanow","doi":"10.1145/115952.115980","DOIUrl":"https://doi.org/10.1145/115952.115980","url":null,"abstract":"Recent studies have concluded that little parallelism (less than two operations per cycle) is available in single instruction streams. Since the amount of available parallelism should influence the design of the processor, it is important to verify how much parallelism really exists. In this study we model the execution of the SPEC benchmarks under differing resource constraints. We repeat the work of the previous researchers, and show that under the hardware resource constraints they imposed, we get similar results. On the other hand, when all constraints are removed except those ~equired by the semantics oft he program, we have found degrees of parallelism in excess of 17 instructions per cycle. Finally, and perhaps most important for exploiting single instruction stream parallelism now, we show that if the hardware is properly balanced, one can sustain from 2.0 to 5.8 instructions per cycle on a processor that is reasonable to design today.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125864330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 176
On the validity of trace-driven simulation for multiprocessors
E. J. Koldinger, S. Eggers, H. Levy
Trace-driven simulation is a commonly used technique for evaluating multiprocessor memory systems. However, several open questions exist concerning the validity of multiprocessor traces. One is the extent to which tracing-induced dilation affects the traces and consequently the results of the simulations. A second is whether the traces generated from multiple runs of the same program will yield the same simulation results. This study examines the variation in simulation results caused by both dilation and multiple runs of the same program on a shared-memory multiprocessor. Overall, our results validate the use of trace-driven simulation for these machines: variability due to dilation and multiple runs appears to be small. However, where small differences in simulated results are crucial to design decisions, multiple traces of parallel applications should be examined.
{"title":"On the validity of trace-driven simulation for multiprocessors","authors":"E. J. Koldinger, S. Eggers, H. Levy","doi":"10.1145/115953.115977","DOIUrl":"https://doi.org/10.1145/115953.115977","url":null,"abstract":"Trace-driven simulation is a commonly-used technique for evaluating multiprocessor memory systems. However, several open questions exist concerning the validity of multiprocessor traces. One is the extent to which tracing induced dilation affects the traces and consequently the results of the simulations. A second is whether the traces generated from multiple runs of the same program will yield the same simulation results. This study examines the variation in simulation results caused by both dilation and multiple runs of the same program on a shared-memory multiprocessor. Overall, our results validate the use of trace-driven simulation for these machines: variability due to dilation and multiple runs appears to be small. However, where small differences in simulated results are crucial to design decisions, multiple traces of parallel applications should be examined.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127020618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 50
Pseudo-randomly interleaved memory
B. R. Rau
Interleaved memories are often used to provide the high bandwidth needed by multiprocessors and high performance uniprocessors such as vector and VLIW processors. The manner in which memory locations are distributed across the memory modules has a significant influence on whether, and for which types of reference patterns, the full bandwidth of the memory system is achieved. The most common interleaved memory architecture is the sequentially interleaved memory in which successive memory locations are assigned to successive memory modules. Although such an architecture is the simplest to implement and provides good performance with strides that are odd integers, it can degrade badly in the face of even strides, especially strides that are a power of two. In a pseudo-randomly interleaved memory architecture, memory locations are assigned to the memory modules in some pseudo-random fashion in the hope that those sequences of references, which are likely to occur in practice, will end up being evenly distributed across the memory modules. The notion of polynomial interleaving modulo an irreducible polynomial is introduced as a way of achieving pseudo-random interleaving with certain attractive and provable properties. The theory behind this scheme is developed and the results of simulations are presented. Key words: supercomputer memory, parallel memory, interleaved memory, hashed memory, pseudo-random interleaving, memory buffering.
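The key idea is to treat an address as a polynomial over GF(2) and reduce it modulo an irreducible polynomial whose degree matches the number of bank-select bits, instead of taking the address modulo the number of modules. The sketch below uses the degree-3 irreducible polynomial x^3 + x + 1 (an assumed choice for illustration, not necessarily the one evaluated in the paper) and contrasts how a power-of-two stride maps onto banks under sequential and polynomial interleaving.

/*
 * Polynomial interleaving sketch: the address, viewed as a polynomial over
 * GF(2), is reduced modulo an irreducible polynomial of degree 3; the
 * remainder selects one of 8 banks.
 */
#include <stdio.h>

static unsigned poly_bank(unsigned addr)
{
    const unsigned poly = 0xB;       /* x^3 + x + 1, irreducible over GF(2) */
    const int bank_bits = 3;         /* 2^3 = 8 banks                       */
    unsigned rem = addr;
    for (int bit = 31; bit >= bank_bits; bit--)
        if (rem & (1u << bit))
            rem ^= poly << (bit - bank_bits);   /* GF(2) polynomial division */
    return rem;                      /* remainder in [0, 7]                  */
}

int main(void)
{
    int seq_hist[8] = {0}, poly_hist[8] = {0};
    for (unsigned i = 0; i < 64; i++) {
        unsigned addr = i * 8;                  /* power-of-two stride       */
        seq_hist[addr % 8]++;                   /* sequential interleaving   */
        poly_hist[poly_bank(addr)]++;           /* polynomial interleaving   */
    }
    printf("bank: sequential / polynomial references for stride-8 accesses\n");
    for (int b = 0; b < 8; b++)
        printf("%4d: %10d / %10d\n", b, seq_hist[b], poly_hist[b]);
    return 0;
}

With a stride of 8, sequential interleaving sends every reference to bank 0, while the polynomial mapping spreads the same references evenly across all 8 banks, which is the behavior the abstract claims for pseudo-random interleaving.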
{"title":"Psfudo-randomly interleaved memory","authors":"B. R. Rau","doi":"10.1109/ISCA.1991.1021601","DOIUrl":"https://doi.org/10.1109/ISCA.1991.1021601","url":null,"abstract":"Interleaved memories are often used to provide the high bandwidth needed by multiprocessors and high performance uniprocessors such as vector and VLIW processors. The manner in which memory locations are distributed across the memory modules has a significant influence on whether, and for which types of reference patterns, the full bandwidth of the memory system is achieved. The most common interleaved memory architecture is the sequentially interleaved memory in which successive memory locations are assigned to successive memory modules. Although such an architecture is the simplest to implement and provides good performance with strides that are odd integers, it can degrade badly in the face of even strides, especially strides that are a power of two. In a pseudo-randomly interleaved memory architecture, memory locations are assigned to the memory modules in some pseudo-random fashion in the hope that those sequences of references, which are likely to occur in practice, will end up being evenly distributed across the memory modules. The notion of polynomial interleaving modulo an irreducible polynomial is introduced as a way of achieving pseudo-random interleaving with certain attractive and provable properties. The theory behind this scheme is developed and the results of simulations are presented. Kev words: supercomputer memory, parallel memory, interleaved memory, hashed memory, pseudo-random interleaving, memory buffering.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132163186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 163
The SNAP-1 parallel AI prototype
R. Demara, D. Moldovan
The Semantic Network Array Processor (SNAP) is a parallel architecture for Artificial Intelligence (AI) applications. We have implemented a first-generation hardware/software prototype called SNAP-1 using Digital Signal Processor chips and overlapping groups of multiport memories. The design features 32 processing clusters with four to five functionally dedicated Digital Signal Processors in each cluster. Processors within clusters share a marker-processing memory while communication between clusters is implemented by a buffered message-passing scheme.
{"title":"The SNAP-1 parallel AI prototype","authors":"R. Demara, D. Moldovan","doi":"10.1145/115952.115954","DOIUrl":"https://doi.org/10.1145/115952.115954","url":null,"abstract":"The Semantic Network Array Processor (SNAP) is a parallel architecture for Artificial Intelligence (AI) applications. We haue implemented a first-generation hardware/soflware prototype called SNAP-1 using Digital Signal Processor chips and ouerlapping groups of multiport memories. The design features 32 processing clusters with four to five functionally dedicated Digital Signal Processors in each cluster. Processors within clusters share a marker-processing memo y while communication between clusters is implemented by a buffered messagepassing scheme.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"249 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134218740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 34