
Proceedings of the 20th Annual International Symposium on Computer Architecture: Latest Publications

Design Tradeoffs For Software-Managed TLBs
Pub Date: 1994-08-01 DOI: 10.1109/ISCA.1993.698543
D. Nagle, R. Uhlig, Timothy J. Stanley, S. Sechrest, T. Mudge, Richard B. Brown
An increasing number of architectures provide virtual memory support through software-managed TLBs. However, software management can impose considerable penalties that are highly dependent on the operating system's structure and its use of virtual memory. This work explores software-managed TLB design tradeoffs and their interaction with a range of monolithic and microkernel operating systems. Through hardware monitoring and simulation, we explore TLB performance for benchmarks running on a MIPS R2000-based workstation running Ultrix, OSF/1, and three versions of Mach 3.0.
Citations: 142
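The penalty structure the abstract describes can be captured with a back-of-envelope cost model (the cycle counts below are illustrative assumptions, not measurements from the paper):

```python
def effective_access_cycles(tlb_hit_rate, hit_cycles, sw_miss_handler_cycles):
    """Average per-reference cost when a TLB miss traps to a software handler.

    With hardware-walked page tables the miss penalty is fixed; with
    software-managed TLBs it depends on which OS handler runs, which is
    why OS structure matters so much in the paper's measurements.
    """
    miss_rate = 1.0 - tlb_hit_rate
    return tlb_hit_rate * hit_cycles + miss_rate * (hit_cycles + sw_miss_handler_cycles)

# A 1% miss rate with a 20-cycle software handler inflates a
# 1-cycle access to 1.2 cycles on average.
avg = effective_access_cycles(0.99, 1, 20)
```

Even a small miss rate is amplified by the handler cost, so the OS's use of virtual memory (which determines both miss rate and handler path length) dominates the tradeoff.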
Hierarchical Performance Modeling With MACS: A Case Study Of The Convex C-240
Pub Date: 1993-05-01 DOI: 10.1109/ISCA.1993.698561
E. Boyd, E. Davidson
The MACS performance model introduced here can be applied to a Machine and Application of interest, the Compiler-generated workload, and the Scheduling of the workload by the compiler. The Ma, MAC, and MACS bounds each fix the named subset of M, A, C, and S while freeing the bound from the constraints imposed by the others. A/X performance measurement is used to measure access-only and execute-only code performance. Such hierarchical performance modeling exposes the gaps between the various bounds, the A/X measurements, and the actual performance, thereby focusing performance optimization at the appropriate levels in a systematic and goal-directed manner. A simple, but detailed, case study of the Convex C-240 vector mini-supercomputer illustrates the method.
Citations: 19
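The bounds hierarchy can be made concrete with hypothetical numbers (the runtimes below are invented for illustration, not taken from the paper): each successive bound adds one constraint, so the gap between adjacent bounds attributes lost performance to the component that was added.

```python
# Hypothetical lower-bound runtimes (seconds) for one application,
# in the order the MACS model tightens them.
bounds = [
    ("Ma",       2.0),  # machine peak on the application's essential work
    ("MAC",      2.6),  # + the compiler-generated workload
    ("MACS",     3.1),  # + the compiler's schedule of that workload
    ("measured", 3.9),  # actual runtime
]

def gap_attribution(bounds):
    """Gap between adjacent bounds = time attributed to the added component."""
    return {f"{a}->{b}": round(tb - ta, 2)
            for (a, ta), (b, tb) in zip(bounds, bounds[1:])}
```

With these numbers, the largest gap lies between MACS and the measured run, so optimization effort would be directed below the scheduling level first.
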
The Performance Of Cache-coherent Ring-based Multiprocessors 基于缓存相干环的多处理器性能研究
Pub Date: 1993-05-01 DOI: 10.1109/ISCA.1993.698567
L. Barroso, M. Dubois
Advances in circuit and integration technology are continuously boosting the speed of microprocessors. One of the main challenges presented by such developments is the effective use of powerful microprocessors in shared memory multiprocessor configurations. We believe that the interconnection problem is not solved even for small scale shared memory multiprocessors, since the speed of shared buses is unlikely to keep up with the bandwidth requirements of new microprocessors. In this paper we evaluate the performance of unidirectional slotted ring interconnection for small to medium scale shared memory systems, using a hybrid methodology of analytical models and trace-driven simulations. We evaluate both snooping and directory-based coherence protocols for the ring and compare it to high performance split transaction buses.
Citations: 69
Evaluation Of Release Consistent Software Distributed Shared Memory On Emerging Network Technology
S. Dwarkadas, P. Keleher, A. Cox, W. Zwaenepoel
We evaluate the effect of processor speed, network characteristics, and software overhead on the performance of release-consistent software distributed shared memory. We examine five different protocols for implementing release consistency: eager update, eager invalidate, lazy update, lazy invalidate, and a new protocol called lazy hybrid. This lazy hybrid protocol combines the benefits of both lazy update and lazy invalidate.Our simulations indicate that with the processors and networks that are becoming available, coarse-grained applications such as Jacobi and TSP perform well, more or less independent of the protocol used. Medium-grained applications, such as Water, can achieve good performance, but the choice of protocol is critical. For sixteen processors, the best protocol, lazy hybrid, performed more than three times better than the worst, the eager update. Fine-grained applications such as Cholesky achieve little speedup regardless of the protocol used because of the frequency of synchronization operations and the high latency involved.While the use of relaxed memory models, lazy implementations, and multiple-writer protocols has reduced the impact of false sharing, synchronization latency remains a serious problem for software distributed shared memory systems. These results suggest that the future work on software DSMs should concentrate on reducing the amount of synchronization or its effect.
Citations: 118
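The eager/lazy distinction can be illustrated with a back-of-envelope message count (a simplification I am assuming for illustration, not the paper's simulation model): an eager protocol pushes invalidations to every sharer at each release, while a lazy protocol defers write notices until another processor actually acquires the lock.

```python
def eager_invalidate_msgs(releases, sharers):
    """Eager protocol: invalidate every other cached copy at every release."""
    return releases * sharers

def lazy_invalidate_msgs(acquires_by_others):
    """Lazy protocol: write notices travel with the lock grant, so traffic
    scales with acquires by other nodes, not with releases."""
    return acquires_by_others

# P0 releases a lock 100 times while one other node caches the page,
# but another processor acquires it only twice:
eager = eager_invalidate_msgs(100, 1)
lazy = lazy_invalidate_msgs(2)
```

This is why the protocol choice matters most for medium-grained applications: when releases greatly outnumber remote acquires, laziness removes most coherence traffic.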
A Comparison Of Dynamic Branch Predictors That Use Two Levels Of Branch History
Pub Date: 1993-05-01 DOI: 10.1109/ISCA.1993.698566
Tse-Yu Yeh, Y. Patt
Recent attention to speculative execution as a mechanism for increasing performance of single instruction streams has demanded substantially better branch prediction than what has been previously available. We [1,2] and Pan, So, and Rahmeh [4] have both proposed variations of the same aggressive dynamic branch predictor for handling those needs. We call the basic model Two-Level Adaptive Branch Prediction; Pan, So, and Rahmeh call it Correlation Branch Prediction. In this paper, we adopt the terminology of [2] and show that there are really nine variations of the same basic model. We compare the nine variations with respect to the amount of history information kept. We study the effects of different branch history lengths and pattern history table configurations. Finally, we evaluate the cost effectiveness of the nine variations.
Citations: 419
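A minimal sketch of one of the nine variations, a single global branch history register indexing a single pattern history table of two-bit saturating counters (the GAg-style organization; the history length and initial counter value here are my choices, not the paper's):

```python
class TwoLevelPredictor:
    """GAg-style two-level predictor: one global branch history register
    (BHR) indexing one pattern history table (PHT) of 2-bit counters."""

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.bhr = 0                          # last `history_bits` outcomes
        self.pht = [1] * (1 << history_bits)  # init: weakly not-taken

    def predict(self):
        return self.pht[self.bhr] >= 2        # counter's high bit decides

    def update(self, taken):
        c = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, c + 1) if taken else max(0, c - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask

# After warming up on an alternating taken/not-taken branch, the
# predictor captures the period-2 pattern exactly.
p = TwoLevelPredictor()
for i in range(32):
    p.update(i % 2 == 0)
```

Longer history registers capture longer correlation patterns but enlarge the PHT and lengthen warm-up, which is exactly the cost/accuracy tradeoff the paper quantifies.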
Improving AP1000 Parallel Computer Performance With Message Communication
T. Horie, K. Hayashi, T. Shimizu, H. Ishihata
The performance of message-passing applications depends on cpu speed, communication throughput and latency, and message handling overhead. In this paper we investigate the effect of varying these parameters and applying techniques to reduce message handling overhead on the execution efficiency of ten different applications. Using a message level simulator set up for the architecture of the AP1000, we showed that improving communication performance, especially message handling, improves total performance. If a cpu that is 32 times faster is provided, the total performance increases by less than ten times unless message handling overhead is reduced. Overlapping computation with message reception improves performance significantly. We also discuss how to improve the AP1000 architecture.
Citations: 17
The Chinese Remainder Theorem And The Prime Memory System
Pub Date: 1993-05-01 DOI: 10.1109/ISCA.1993.698573
Qing-Qiang Gao
The conflict problem is an important problem in supercomputer memory systems. There are two kinds of conflict-free memory system approaches: the skewing-scheme approach and the prime memory system approach. Previously published prime memory approaches are complex or waste 1/p of the memory space filling the "holes" [17], where p is the number of memory modules. In this paper, based on the Chinese remainder theorem, we present a perfect prime memory system that needs only to find d mod p, without wasting any memory space and without computing the quotient.
Citations: 43
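The mapping the abstract describes can be sketched directly (the module count and row count below are my illustrative choices): because p is prime and therefore coprime to a power-of-two row count, the Chinese remainder theorem guarantees the pair of remainders is unique for every address in one period, so no quotient ever needs to be computed.

```python
def map_address(a, p=5, rows=8):
    """Map linear address a to a (module, row) pair using only remainders.

    gcd(p, rows) == 1, so by the CRT the pair is distinct for every
    a in [0, p * rows): the whole space is covered with no "holes".
    """
    return (a % p, a % rows)

# Every address in one period of p * rows = 40 gets a distinct pair.
pairs = {map_address(a) for a in range(40)}
```

Strided references with a stride not divisible by p then spread across all p modules, which is the conflict-free property prime memory systems are built for.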
The Architecture Of A Fault-tolerant Cached RAID Controller
Pub Date: 1993-05-01 DOI: 10.1109/ISCA.1993.698547
J. Menon, Jim Cortney
RAID-5 arrays need 4 disk accesses to update a data block—2 to read old data and parity, and 2 to write new data and parity. Schemes previously proposed to improve the update performance of such arrays are the Log-Structured File System [10] and the Floating Parity Approach [6]. Here, we consider a third approach, called Fast Write, which eliminates disk time from the host response time to a write, by using a Non-Volatile Cache in the disk array controller. We examine three alternatives for handling Fast Writes and describe a hierarchy of destage algorithms with increasing robustness to failures. These destage algorithms are compared against those that would be used by a disk controller employing mirroring. We show that array controllers require considerably more (2 to 3 times more) bus bandwidth and memory bandwidth than do disk controllers that employ mirroring. So, array controllers that use parity are likely to be more expensive than controllers that do mirroring, though mirroring is more expensive when both controllers and disks are considered.
Citations: 96
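The four-access small write works because RAID-5 parity is a plain XOR across the stripe, so the new parity can be derived from just the old data, new data, and old parity (a generic RAID-5 identity, sketched here with integers standing in for blocks):

```python
def small_write_parity(old_data, new_data, old_parity):
    """RAID-5 read-modify-write: new parity from the three values the
    controller already must read or hold (2 reads + 2 writes total)."""
    return old_parity ^ old_data ^ new_data

# Sanity check against full-stripe recomputation:
stripe = [0b1010, 0b0110, 0b1100]           # three data blocks
parity = stripe[0] ^ stripe[1] ^ stripe[2]
new_d1 = 0b0001
parity = small_write_parity(stripe[1], new_d1, parity)
stripe[1] = new_d1
assert parity == stripe[0] ^ stripe[1] ^ stripe[2]
```

Fast Write does not remove these accesses; it moves them off the host's critical path by acknowledging the write once the new data sits in the controller's non-volatile cache and destaging later.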
Transactional Memory: Architectural Support For Lock-free Data Structures
Pub Date: 1993-05-01 DOI: 10.1109/ISCA.1993.698569
Maurice Herlihy, J. E. B. Moss
A shared data structure is lock-free if its operations do not require mutual exclusion. If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object. In highly concurrent systems, lock-free data structures avoid common problems associated with conventional locking techniques, including priority inversion, convoying, and difficulty of avoiding deadlock. This paper introduces transactional memory, a new multiprocessor architecture intended to make lock-free synchronization as efficient (and easy to use) as conventional techniques based on mutual exclusion. Transactional memory allows programmers to define customized read-modify-write operations that apply to multiple, independently-chosen words of memory. It is implemented by straightforward extensions to any multiprocessor cache-coherence protocol. Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.
Citations: 2560
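The lock-free style the paper targets can be illustrated with an optimistic compare-and-swap retry loop (a single-word software analogue I am using for illustration; the paper's actual proposal is hardware transactional primitives layered on the cache-coherence protocol, which extend this idea to multiple independently chosen words):

```python
import threading

class Word:
    """One memory word with an atomic compare-and-swap; the lock merely
    simulates the atomicity the hardware would provide."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def cas(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False          # someone else updated it first
            self._value = new
            return True

def atomic_add(word, delta):
    """Lock-free read-modify-write: retry on conflict instead of blocking,
    analogous to re-executing an aborted transaction."""
    while True:
        old = word.load()
        if word.cas(old, old + delta):
            return

counter = Word(0)
threads = [threading.Thread(target=lambda: [atomic_add(counter, 1) for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because no thread ever holds a lock across the read-modify-write, a preempted thread cannot block the others, which is what eliminates priority inversion and convoying.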
Adaptive Cache Coherency For Detecting Migratory Shared Data
Pub Date: 1993-05-01 DOI: 10.1109/ISCA.1993.698549
A. Cox, R. Fowler
Parallel programs exhibit a small number of distinct data-sharing patterns. A common data-sharing pattern, migratory access, is characterized by exclusive read and write access by one processor at a time to a shared datum. We describe a family of adaptive cache coherency protocols that dynamically identify migratory shared data in order to reduce the cost of moving them. The protocols use a standard memory model and processor-cache interface. They do not require any compile-time or run-time software support. We describe implementations for bus-based multiprocessors and for shared-memory multiprocessors that use directory-based caches. These implementations are simple and would not significantly increase hardware cost. We use trace- and execution-driven simulation to compare the performance of the adaptive protocols to standard write-invalidate protocols. These simulations indicate that, compared to conventional protocols, the use of the adaptive protocol can almost halve the number of inter-node messages on some applications. Since cache coherency traffic represents a larger part of the total communication as cache size increases, the relative benefit of using the adaptive protocol also increases.
Citations: 194
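The detection idea can be sketched as a tiny per-line state machine (a simplification whose exact trigger condition is my assumption; the paper's protocols implement this inside the coherence controller): once writes to a line are observed from a succession of different processors, the line is tagged migratory, and later read misses are granted exclusive ownership up front, eliminating the invalidation/upgrade messages of the write that will follow.

```python
class LineState:
    """Per-cache-line migratory-sharing detector."""
    def __init__(self):
        self.last_writer = None
        self.migratory = False

    def on_write(self, cpu):
        # A write from a different processor than the last writer means
        # the datum is migrating from one processor to the next.
        if self.last_writer is not None and cpu != self.last_writer:
            self.migratory = True
        self.last_writer = cpu

    def grant_for_read_miss(self, cpu):
        # Migratory lines are handed over exclusively, saving the later
        # upgrade traffic when the reader inevitably writes.
        return "exclusive" if self.migratory else "shared"

line = LineState()
line.on_write(0)   # P0 updates the datum
line.on_write(1)   # P1 updates it next: migratory pattern detected
```

Since the heuristic needs only the last-writer identity, it fits in a few directory bits per line, consistent with the paper's claim of low hardware cost.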