
Workshop on Memory System Performance and Correctness: Latest Publications

Supporting virtual memory in GPGPU without supporting precise exceptions
Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247698
Hyesoon Kim
Supporting precise exceptions has been one of the essential components of designing modern out-of-order processors: it enables exception-handling routines, including virtual memory support, and also underpins debugging features. However, GPGPU, one of the recently popular scientific computing platforms, does not support precise exceptions. In this paper, we argue that supporting precise exceptions is not essential for GPGPUs, and we propose an alternative solution that provides virtual memory support without supporting precise exceptions.
Citations: 11
Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis
Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247687
Meng-Ju Wu, D. Yeung
Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. In today's hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform extensive simulations to study these interactions, but this can be costly and not very insightful. An alternative is multicore reuse distance (RD) analysis, which can provide extremely rich information about multicore memory behavior. In this paper, we apply multicore RD analysis to better understand cache system design. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. We propose a novel framework to identify optimal multicore cache hierarchies, and extract several new insights. We also characterize how the optimal cache hierarchies vary with core count and problem size.
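The key quantity behind RD analysis is simple to state: the reuse distance of an access is the number of distinct data elements touched since the previous access to the same element, and a fully associative LRU cache of C blocks hits exactly when that distance is less than C. A minimal sketch of the definition (the naive O(n·w) form; production tools use tree-based O(n log n) algorithms, and the trace and addresses here are illustrative):

```python
def reuse_distances(trace):
    """Reuse distance of each access in an address trace.

    Distance = number of distinct addresses touched between two
    consecutive accesses to the same address; infinity on a cold miss.
    """
    last_pos = {}        # address -> index of its previous access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            # distinct addresses in the window since the last access
            distances.append(len(set(trace[last_pos[addr] + 1 : i])))
        else:
            distances.append(float("inf"))  # first touch: cold miss
        last_pos[addr] = i
    return distances
```

With a histogram of these distances, the miss ratio of any LRU cache size can be read off without re-simulating, which is what makes the analysis cheaper and more general than per-configuration simulation.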
Citations: 18
Defensive loop tiling for multi-core processor
Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247701
Bin Bao, Xiaoya Xiang
Loop tiling is a compiler transformation that tailors an application's working set to fit in a cache hierarchy. On today's multicore processors, part of the hierarchy, especially the last level cache (LLC) is shared. In this paper, we show that cache sharing requires special types of tiling depending on the co-run programs. We analyze the reasons for the performance difference and give a defensive strategy that performs consistently the best or near the best. For example, when compared with conservative tiling, which tiles for private cache, the performance of defensive tiling is similar in solo-runs but up to 20% higher in program co-runs, when tested on an Intel multicore processor.
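The mechanics of tiling are independent of the tile-size policy: the loop nest is blocked so each block's working set fits the cache budget, and the "conservative vs. defensive" distinction is only in how large that budget is assumed to be. A sketch using a blocked transpose (function name and flat-array layout are illustrative, not from the paper):

```python
def tiled_transpose(a, n, tile):
    """Cache-blocked transpose of an n x n matrix stored as a flat list.

    `tile` encodes the tiling policy: a conservative tiler sizes the
    block for the full private cache, while a defensive tiler assumes
    co-running programs consume part of the shared LLC and picks a
    smaller working set.
    """
    out = [0] * (n * n)
    for ii in range(0, n, tile):          # iterate over tiles
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):   # within one tile
                for j in range(jj, min(jj + tile, n)):
                    out[j * n + i] = a[i * n + j]
    return out
```

Any tile size produces the same result; only the cache behavior under co-running programs differs, which is the effect the paper measures.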
Citations: 1
A study towards optimal data layout for GPU computing
Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247699
E. Zhang, Han Li, Xipeng Shen
The performance of Graphic Processing Units (GPU) is sensitive to irregular memory references. A recent study shows the promise of eliminating irregular references through runtime thread-data remapping. However, how to efficiently determine the optimal mapping is yet an open question. This paper presents some initial exploration to the question, especially in the dimension of data layout optimization. It describes three algorithms to compute or approximate optimal data layouts for GPU. These algorithms exhibit a spectrum of tradeoff among the space cost, time cost, and quality of the resulting data layouts.
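The classic instance of the layout problem is array-of-structs versus struct-of-arrays: with AoS, adjacent GPU threads reading the same field touch strided addresses, while SoA makes those reads contiguous and coalesced. A minimal host-side sketch of the transformation (the record shape and field names are illustrative; the paper's algorithms search a much larger layout space):

```python
def aos_to_soa(records, fields):
    """Convert an array-of-structs into a struct-of-arrays layout.

    In AoS, thread t reading records[t][f] accesses stride-sized-apart
    addresses; in SoA, the per-field arrays are contiguous, so a warp's
    loads of one field coalesce into few memory transactions.
    """
    return {f: [r[f] for r in records] for f in fields}
```

Remapping which thread handles which element is the other half of the paper's design space; the layout change above is the part that can be decided per data structure.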
Citations: 7
Let there be light!: the future of memory systems is photonics and 3D stacking
Pub Date : 2011-06-05 DOI: 10.1145/1988915.1988926
K. Bergman, G. Hendry, Paul H. Hargrove, J. Shalf, B. Jacob, K. Hemmert, Arun Rodrigues, D. Resnick
Energy consumption is the fundamental barrier to exascale supercomputing and it is dominated by the cost of moving data from one point to another, not computation. Similarly, performance is dominated by data movement, not computation. The solution to this problem requires three critical technologies: 3D integration, optical chip-to-chip communication, and a new communication model. A memory system based on these technologies has the potential to lower the cost of local memory accesses by orders of magnitude and provide substantially more bandwidth. To reach the goals of exascale computing with a manageable power budget, the industry will have to adopt these technologies. Doing so will enable exascale computing, and will have a major worldwide economic impact.
Citations: 14
Deferred gratification: engineering for high performance garbage collection from the get go
Pub Date : 2011-06-05 DOI: 10.1145/1988915.1988930
Ivan Jibaja, S. Blackburn, M. Haghighat, K. McKinley
Implementing a new programming language system is a daunting task. A common trap is to punt on the design and engineering of exact garbage collection and instead opt for reference counting or conservative garbage collection (GC). For example, AppleScript™, Perl, Python, and PHP implementers chose reference counting (RC) and Ruby chose conservative GC. Although easier to get working, reference counting has terrible performance and conservative GC is inflexible and performs poorly when allocation rates are high. However, high performance GC is central to performance for managed languages and only becoming more critical due to relatively lower memory bandwidth and higher memory latency of modern architectures. Unfortunately, retrofitting support for high performance collectors later is a formidable software engineering task due to their exact nature. Whether they realize it or not, implementers have three routes: (1) forge ahead with reference counting or conservative GC, and worry about the consequences later; (2) build the language on top of an existing managed runtime with exact GC, and tune the GC to scripting language workloads; or (3) engineer exact GC from the ground up and enjoy the correctness and performance benefits sooner rather than later.
We explore this conundrum using PHP, the most popular server side scripting language. PHP implements reference counting, mirroring scripting languages before it. Because reference counting is incomplete, the implementors must (a) also implement tracing to detect cyclic garbage, or (b) prohibit cyclic data structures, or (c) never reclaim cyclic garbage. PHP chose (a), AppleScript chose (b), and Perl chose (c). We characterize the memory behavior of five typical PHP programs to determine whether their implementation choice was a good one in light of the growing demand for high performance PHP.
The memory behavior of these PHP programs is similar to other managed languages, such as Java™ --- they allocate many short lived objects, a large variety of object sizes, and the average allocated object size is small. These characteristics suggest copying generational GC will attain high performance. Language implementers who are serious about correctness and performance need to understand deferred gratification: paying the software engineering cost of exact GC up front will deliver correctness and memory system performance later.
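The incompleteness of reference counting that drives choices (a)-(c) is easy to demonstrate: a two-object cycle keeps each object's count above zero even after both become unreachable, so plain RC never frees it. CPython, mentioned above as an RC implementation, takes route (a), pairing refcounts with a backup tracing cycle detector:

```python
import gc


class Node:
    def __init__(self):
        self.next = None


def make_cycle():
    """Create two mutually referencing nodes and drop all external refs."""
    a, b = Node(), Node()
    a.next, b.next = b, a   # each holds a reference to the other
    # On return, a and b are unreachable, but their refcounts stay at 1,
    # so reference counting alone can never reclaim them.


make_cycle()
# Route (a): a tracing cycle detector finds and frees the garbage cycle.
found = gc.collect()        # returns the number of unreachable objects found
```

Routes (b) and (c) avoid the detector's cost by prohibiting such structures outright or by leaking them, which is exactly the trade-off the abstract attributes to AppleScript and Perl.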
Citations: 11
How to fit program footprint curves
Pub Date : 2011-06-05 DOI: 10.1145/1988915.1988920
Xiaoya Xiang, Bin Bao
A footprint is the volume of data accessed in a time window. A complete characterization requires summarizing all footprints in all execution windows. A concise summary is the footprint curve, which gives the average footprint in windows of different lengths. The footprint curve contains information from all footprints. It can be measured in time O(n) for a trace of length n, which is fast enough for most benchmarks. In this paper, we outline a study on footprint curves. We propose four curve fitting methods based on the real data observed in SPEC benchmark programs.
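The definition being fitted can be pinned down with a brute-force evaluation: for each window length w, average the number of distinct elements over every length-w window. This direct form is O(n·w) per point (the paper's cited algorithm computes the whole curve in O(n); this sketch only illustrates what the curve measures):

```python
def avg_footprint(trace, w):
    """Average footprint over all length-w windows of the trace."""
    n = len(trace)
    if w > n:
        return float(len(set(trace)))
    windows = n - w + 1
    total = sum(len(set(trace[i:i + w])) for i in range(windows))
    return total / windows


def footprint_curve(trace):
    """Footprint curve: average footprint for every window length 1..n."""
    return [avg_footprint(trace, w) for w in range(1, len(trace) + 1)]
```

The curve is monotone and concave-ish in practice, which is what makes the parametric fits the paper proposes plausible.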
Citations: 0
A programming model for deterministic task parallelism
Pub Date : 2011-06-05 DOI: 10.1145/1988915.1988918
Polyvios Pratikakis, H. Vandierendonck, Spyros Lyberis, Dimitrios S. Nikolopoulos
The currently dominant programming models to write software for multicore processors use threads that run over shared memory. However, as the core count increases, cache coherency protocols get very complex and ineffective, and maintaining a shared memory abstraction becomes expensive and impractical. Moreover, writing multithreaded programs is notoriously difficult, as the programmer needs to reason about all the possible thread interleavings and interactions, including the myriad of implicit, non-obvious, and often unpredictable thread interactions through shared memory. Overall, as processors get more cores and parallel software becomes mainstream, the shared memory model reaches its limits regarding ease of programming and efficiency. This position paper presents two ideas aiming to solve the problem. First, we restrict the way the programmer expresses parallelism: The program is a collection of possibly recursive tasks, where each task is atomic and cannot communicate with any other task during its execution. Second, we relax the requirement for coherent shared memory: Each task defines its memory footprint, and is guaranteed to have exclusive access to that memory during its execution. Using this model, we can then define a runtime system that transparently performs the data transfers required among cores without cache coherency, and also produces a deterministic execution of the program, provably equivalent to its sequential elision.
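The scheduling consequence of declared footprints can be sketched in a few lines: tasks with disjoint footprints may run concurrently, while a task that overlaps an earlier unfinished task (or any earlier deferred task) must wait, so conflicting tasks always execute in program order and the schedule is the same on every run. This is an illustrative toy scheduler, not the paper's runtime:

```python
def schedule(tasks):
    """Group (name, footprint) tasks into deterministic parallel waves.

    Within a wave, footprints are pairwise disjoint, so each task has
    exclusive access to its data. A task that conflicts with the current
    wave, or with any already-deferred task, is pushed to a later wave,
    preserving program order among conflicting tasks.
    """
    waves = []
    pending = [(name, set(fp)) for name, fp in tasks]
    while pending:
        wave, busy, blocked, rest = [], set(), set(), []
        for name, fp in pending:
            if fp & (busy | blocked):
                rest.append((name, fp))  # conflict: run in a later wave
                blocked |= fp            # keep later conflicting tasks behind it
            else:
                wave.append(name)
                busy |= fp
        waves.append(wave)
        pending = rest
    return waves
```

Because conflicting tasks are serialized in program order and non-conflicting tasks cannot observe each other, the result is provably equivalent to running the tasks sequentially, which is the determinism guarantee the abstract claims.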
Citations: 21
The impact of diverse memory architectures on multicore consumer software: an industrial perspective from the video games domain
Pub Date : 2011-06-05 DOI: 10.1145/1988915.1988925
G. Russell, C. Riley, Neil Henning, Uwe Dolinsky, A. Richards, A. Donaldson, A. V. Amesfoort
Memory architectures need to adapt in order for performance and scalability to be achieved in software for multicore systems. In this paper, we discuss the impact of techniques for scalable memory architectures, especially the use of multiple, non-cache-coherent memory spaces, on the implementation and performance of consumer software. Primarily, we report extensive real-world experience in this area gained by Codeplay Software Ltd., a software tools company working in the area of compilers for video games and GPU software. We discuss the solutions we use to handle variations in memory architecture in consumer software, and the impact such variations have on software development effort and, consequently, development cost. This paper introduces preliminary findings regarding impact on software, in advance of a larger-scale analysis planned over the next few years. The techniques discussed have been employed successfully in the development and optimisation of a shipping AAA cross-platform video game.
Citations: 0
Approximating inclusion-based points-to analysis
Pub Date : 2011-06-05 DOI: 10.1145/1988915.1988931
R. Nasre
It has been established that achieving a points-to analysis that is scalable in terms of analysis time typically involves trading off analysis precision and/or memory. In this paper, we propose a novel technique to approximate the solution of an inclusion-based points-to analysis. The technique is based on intelligently approximating pointer- and location-equivalence across variables in the program. We develop a simple approximation algorithm based on the technique. By exploiting various behavioral properties of the solution, we develop another improved algorithm which implements various optimizations related to the merging order, proximity search, lazy merging and identification frequency. The improved algorithm provides a strong control to the client to trade off analysis time and precision as per its requirements. Using a large suite of programs including SPEC 2000 benchmarks and five large open source programs, we show how our algorithm helps achieve a scalable solution.
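The baseline being approximated is Andersen-style inclusion analysis: each constraint induces a subset relation between points-to sets, solved to a fixed point. A tiny sketch with only address-of and copy constraints (the constraint syntax is illustrative; real solvers also handle loads/stores and, as this paper does, merge pointer- and location-equivalent variables to shrink the constraint graph):

```python
def andersen(constraints):
    """Minimal inclusion-based points-to solver.

    Constraint forms (illustrative):
      ("addr", p, x)  means p = &x      -> x in pts(p)
      ("copy", p, q)  means p = q       -> pts(q) subset of pts(p)
    Iterates until no points-to set grows.
    """
    pts = {}
    changed = True
    while changed:
        changed = False
        for kind, lhs, rhs in constraints:
            target = pts.setdefault(lhs, set())
            new = {rhs} if kind == "addr" else pts.get(rhs, set())
            if not new <= target:
                target |= new
                changed = True
    return pts
```

The cost the paper attacks is visible even here: every constraint is revisited until nothing changes, so collapsing variables whose sets are provably (or approximately) equal directly shrinks both the iteration count and the sets themselves, at some cost in precision.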
Citations: 5