
Workshop on Memory System Performance and Correctness: Latest Publications

Concurrency control with data coloring
Pub Date: 2008-03-02 DOI: 10.1145/1353522.1353525
L. Ceze, C. V. Praun, Calin Cascaval, Pablo Montesinos, J. Torrellas
Concurrency control is one of the main sources of error and complexity in shared memory parallel programming. While there are several techniques to handle concurrency control such as locks and transactional memory, simplifying concurrency control has proved elusive. In this paper we introduce the Data Coloring programming model, based on the principles of our previous work on architecture support for data-centric synchronization. The main idea is to group data structures into consistency domains and mark places in the control flow where data should be consistent. Based on these annotations, the system dynamically infers transaction boundaries. An important aspect of data coloring is that the occurrence of a synchronization defect is typically determinate and leads to a violation of liveness rather than to a safety violation. Finally, this paper includes empirical data that shows that most of the critical sections in large applications are used in a data-centric manner.
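The grouping-plus-marking idea can be made concrete with a small sketch. The Python below is purely illustrative (the paper proposes architecture and runtime support, not a library); `ConsistencyDomain`, `write`, and `consistent` are hypothetical names standing in for a color annotation, an access to colored data, and a programmer-marked consistency point at which the runtime closes the inferred transaction.

```python
import threading

class ConsistencyDomain:
    """One 'color': a group of data structures that must stay mutually
    consistent. (Hypothetical API sketch; the paper proposes hardware and
    runtime support rather than a library.)"""
    def __init__(self, name):
        self.name = name
        self._lock = threading.RLock()
        self._open = False

    def write(self, obj, field, value):
        # The first touch of colored data opens an inferred transaction,
        # modeled here with a per-domain lock.
        if not self._open:
            self._lock.acquire()
            self._open = True
        setattr(obj, field, value)

    def consistent(self):
        # Programmer-marked point in the control flow where the domain's
        # data must be consistent: the inferred transaction ends here.
        if self._open:
            self._open = False
            self._lock.release()
```

In this model, forgetting the `consistent()` call leaves the domain's lock held, so the defect manifests as a liveness problem (other threads block) rather than a safety violation, mirroring the determinacy argument in the abstract.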
Citations: 21
The potential for variable-granularity access tracking for optimistic parallelism
Pub Date: 2008-03-02 DOI: 10.1145/1353522.1353527
Mihai Burcea, J. Gregory Steffan, C. Amza
Support for optimistic parallelism such as thread-level speculation (TLS) and transactional memory (TM) has been proposed to ease the task of parallelizing software to exploit the new abundance of multicores. A key requirement for such support is the mechanism for tracking memory accesses so that conflicts between speculative threads or transactions can be detected; existing schemes mainly track accesses at a single fixed granularity---i.e., at the word level, cache-line level, or page level. In this paper we demonstrate, for a hardware implementation of TLS and corresponding speculatively-parallelized SpecINT benchmarks, that the coarsest access tracking granularity that does not incur false violations varies significantly across applications, within applications, and across ranges of memory---from word size to page size. These results motivate a variable-granularity approach to access tracking, and we show that such an approach can reduce the number of memory ranges that must be tracked and compared to detect conflicts by an order of magnitude relative to word-level tracking, without increasing false violations. We are currently developing variable-granularity implementations of both a hardware-based TLS system and an STM system.
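To illustrate why coarser granularity shrinks tracking state, here is a small sketch (our illustration, not the paper's hardware design): accessed addresses are rounded down to a chosen granularity, adjacent blocks are merged into ranges, and conflict detection becomes range overlap.

```python
def coalesce(addresses, granularity):
    """Round each accessed address down to a granularity-aligned block and
    merge adjacent blocks into the ranges to be tracked. (Illustrative.)"""
    blocks = sorted({a - a % granularity for a in addresses})
    ranges = []
    for b in blocks:
        if ranges and ranges[-1][1] == b:
            ranges[-1][1] = b + granularity   # extend the previous range
        else:
            ranges.append([b, b + granularity])
    return [tuple(r) for r in ranges]

def conflict(ranges_a, ranges_b):
    """Two speculative threads conflict if any of their tracked ranges
    overlap; at coarse granularity an overlap may be a false violation."""
    return any(s1 < e2 and s2 < e1
               for s1, e1 in ranges_a for s2, e2 in ranges_b)
```

Five word-sized accesses spanning two pages collapse to two ranges at word granularity and to one page-sized range at page granularity; the page-sized entry is cheaper to track but can trigger false violations, which is exactly the trade-off variable granularity navigates.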
Citations: 6
General and efficient locking without blocking
Pub Date: 2008-03-02 DOI: 10.1145/1353522.1353524
Y. Smaragdakis, Anthony Kay, R. Behrends, M. Young
Standard concurrency control mechanisms offer a trade-off: Transactional memory approaches maximize concurrency, but suffer high overheads and cost for retrying in the case of actual contention. Locking offers lower overheads, but typically reduces concurrency due to the difficulty of associating locks with the exact data that need to be accessed. Moreover, locking allows irreversible operations, is ubiquitous in legacy software, and seems unlikely to ever be completely supplanted. We believe that the trade-off between transactions and (blocking) locks has not been sufficiently exploited to obtain a "best of both worlds" mechanism, although the main components have been identified. Mechanisms for converting locks to atomic sections (which can abort and retry) have already been proposed in the literature: Rajwar and Goodman's "lock elision" (at the hardware level) and Welc et al.'s hybrid monitors (at the software level) are the best known representatives. Nevertheless, these approaches admit improvements on both the generality and the performance front. In this position paper we present two ideas. First, we discuss an adaptive criterion for switching from a locking to a transactional implementation, and back to a locking implementation if the transactional one appears to be introducing overhead for no gain in concurrency. Second, we discuss the issues arising when locks are nested. Contrary to assertions in past work, transforming locks into transactions can be incorrect in the presence of nesting. We explain the problem and provide a precise condition for safety.
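A minimal sketch of what such an adaptive criterion could look like (the counters and thresholds below are our invention; the position paper argues for the policy rather than specifying one):

```python
class AdaptiveMode:
    """Adaptive lock/transaction switching sketch: run critical sections
    transactionally while lock waiters are common, and fall back to
    blocking locks when transactions keep aborting, i.e. when we pay
    transactional overhead for no gain in concurrency. (Hypothetical
    thresholds, not the paper's algorithm.)"""
    LOCK, TXN = "lock", "txn"

    def __init__(self, abort_threshold=0.2, contention_threshold=0.1):
        self.mode = self.LOCK
        self.abort_threshold = abort_threshold
        self.contention_threshold = contention_threshold

    def observe(self, acquisitions, contended, commits, aborts):
        txn_attempts = commits + aborts
        if self.mode == self.LOCK and acquisitions:
            # Waiters queueing on the lock suggest concurrency to reclaim.
            if contended / acquisitions > self.contention_threshold:
                self.mode = self.TXN
        elif self.mode == self.TXN and txn_attempts:
            # Frequent aborts mean transactional overhead with no payoff.
            if aborts / txn_attempts > self.abort_threshold:
                self.mode = self.LOCK
        return self.mode
```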
Citations: 6
Reliability-aware data placement for partial memory protection in embedded processors
Pub Date: 2006-10-22 DOI: 10.1145/1178597.1178600
M. Mehrara, T. Austin
Low-cost protection of embedded systems against soft errors has recently become a major concern. This issue is even more critical in memory elements, which are inherently more prone to transient faults. In this paper, we propose a reliability-aware data placement technique that partially protects embedded memory systems. We show that by adopting this method instead of traditional placement schemes with complete memory protection, an acceptable level of fault tolerance can be achieved while incurring less area and power overhead. In this approach, each variable in the program is placed in either a protected or an unprotected memory area according to a profile-driven liveness analysis of all memory variables. In order to measure the level of fault coverage, we inject faults into the memory during program execution in a Monte Carlo simulation framework. Subsequently, we calculate the coverage of the partial protection scheme from the number of protected, failed, and crashed runs in the fault injection experiment.
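The placement step can be viewed as a capacity-constrained selection problem. The greedy ranking below by "live cycles per byte" is our simplification of the paper's profile-driven analysis; the field names are hypothetical.

```python
def place_variables(variables, protected_capacity):
    """Greedy placement sketch: variables whose values stay live longest
    per byte are most exposed to soft errors, so they go into the limited
    protected region first. (Our simplification of the paper's
    profile-driven liveness analysis; not its published algorithm.)"""
    ranked = sorted(variables,
                    key=lambda v: v["live_cycles"] / v["size"],
                    reverse=True)
    protected, unprotected, used = [], [], 0
    for v in ranked:
        if used + v["size"] <= protected_capacity:
            protected.append(v["name"])
            used += v["size"]
        else:
            unprotected.append(v["name"])
    return protected, unprotected
```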
Citations: 3
What do high-level memory models mean for transactions?
Pub Date: 2006-10-22 DOI: 10.1145/1178597.1178609
D. Grossman, Jeremy Manson, W. Pugh
Many people have proposed adding transactions, or atomic blocks, to type-safe high-level programming languages. However, researchers have not considered the semantics of transactions with respect to a memory model weaker than sequential consistency. The details of such semantics are more subtle than many people realize, and the interaction between compiler transformations and transactions could produce behaviors that many people find surprising. A language's memory model, which determines these interactions, must clearly indicate which behaviors are legal, and which are not. These design decisions affect both the idioms that are useful for designing concurrent software and the compiler transformations that are legal within the language. Cases where semantics are more subtle than people expect include the actual meaning of both strong and weak atomicity; correct idioms for thread-safe lazy initialization; compiler transformations of transactions that touch only thread-local memory; and whether there is a well-defined notion for transactions that corresponds to the notion of correct and incorrect use of synchronization in Java. Open questions for a high-level memory model that includes transactions involve both issues of isolation and ordering.
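One of the idioms the paper calls subtle, thread-safe lazy initialization, can be written safely with a lock, as in this Python sketch. (Python's GIL hides the reordering problem; in a language with a weak memory model, the unsynchronized fast-path read is exactly where the subtlety lives, which is why the ready flag here is only published inside the lock.)

```python
import threading

class Lazy:
    """Thread-safe lazy initialization via double-checked locking.
    The lock provides both mutual exclusion and, in languages with weak
    memory models, the visibility ordering that a naive
    'check-then-publish' version lacks. (Illustrative sketch.)"""
    def __init__(self, factory):
        self._factory = factory
        self._lock = threading.Lock()
        self._value = None
        self._ready = False

    def get(self):
        if self._ready:            # fast path: safe only because _ready
            return self._value     # is published inside the lock below
        with self._lock:
            if not self._ready:    # re-check under the lock
                self._value = self._factory()
                self._ready = True
            return self._value
```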
Citations: 63
Keynote talk: challenges in chip multiprocessor memory systems
Pub Date: 2006-10-22 DOI: 10.1145/1178597.1178607
D. Wood
The semiconductor industry appears on the brink of an arms race, competing to see which company can cram the most cores on a single die. Yet early entries are hardly well-balanced, general-purpose computers: Sun's Niagara has feeble floating-point performance and IBM's Cell processor is a thinly disguised GPU. Worse, no company has announced anything that will help address the real problem: programming the multithreaded applications needed to exploit the ample computational resources. This talk will discuss several challenges facing the memory system designers of emerging computation-rich chip multiprocessors.
Citations: 0
A comprehensive study of hardware/software approaches to improve TLB performance for java applications on embedded systems
Pub Date: 2006-10-22 DOI: 10.1145/1178597.1178614
Jinzhan Peng, Guei-Yuan Lueh, Gansha Wu, Xiaogang Gou, R. Rakvic
The working set size of Java applications on embedded systems has recently been increasing, causing the Translation Lookaside Buffer (TLB) to become a serious performance bottleneck. From a thorough analysis of the SPECjvm98 benchmark suite executing on a commodity embedded system, we find that TLB misses account for 24% to 50% of the total execution time. We explore and evaluate a wide spectrum of TLB-enhancing techniques with different combinations of software/hardware approaches: superpages for reducing TLB miss rates, a two-level TLB and TLB prefetching for reducing both TLB miss rates and TLB miss latency, and even a no-TLB design for removing TLB overhead completely. We adapt and then, in a novel way, extend these approaches to fit the design space of embedded systems executing Java code. We compare these approaches, discussing their performance behavior, software/hardware complexity, and constraints, especially the design implications for the application, runtime, and OS. We first conclude that even with the aggressive approaches presented, the TLB remains a performance bottleneck. Second, in addition to facing very different design considerations and constraints on embedded systems, proven hardware techniques such as TLB prefetching have different performance implications. Third, software-based solutions, the no-TLB design and superpaging, appear to be more effective in improving Java application performance on embedded systems. Finally, beyond performance, these approaches have their respective pros and cons; it is left to the system designer to make the appropriate engineering tradeoff.
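A toy model makes the superpage effect concrete. The fully associative LRU TLB simulator below is our illustration (the entry count and page sizes are arbitrary), not the paper's methodology.

```python
from collections import OrderedDict

def tlb_miss_rate(addresses, entries, page_size):
    """Tiny fully-associative LRU TLB model (illustrative only) showing
    why superpages help: a larger page maps more of the working set per
    TLB entry, so fewer entries are needed to avoid thrashing."""
    tlb = OrderedDict()
    misses = 0
    for a in addresses:
        vpn = a // page_size
        if vpn in tlb:
            tlb.move_to_end(vpn)       # refresh LRU position on a hit
        else:
            misses += 1
            tlb[vpn] = True
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used
    return misses / len(addresses)
```

Cycling over a 16-page working set thrashes an 8-entry TLB at 4 KiB pages (every access misses), while a single 64 KiB superpage mapping covers the same addresses with one entry.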
Citations: 8
Implicit and explicit optimizations for stencil computations
Pub Date: 2006-10-22 DOI: 10.1145/1178597.1178605
S. Kamil, K. Datta, Samuel Williams, L. Oliker, J. Shalf, K. Yelick
Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, including both an implicit cache-oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache-oblivious approach and that the explicitly managed memory on Cell is more efficient: relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster.
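The cache-aware variant can be sketched as an ordinary Jacobi-style sweep with loop blocking. This pure-Python 5-point stencil is illustrative only (a real implementation tunes the block size to the cache and vectorizes); blocking changes the traversal order for locality, not the results.

```python
def stencil_blocked(grid, n, block=64):
    """Cache-blocked 5-point stencil sweep over a flat list representing
    an n x n grid; boundary cells are kept fixed. Reads come from `grid`
    and writes go to `out`, so blocking cannot change the answer."""
    out = grid[:]
    for bi in range(1, n - 1, block):
        for bj in range(1, n - 1, block):
            for i in range(bi, min(bi + block, n - 1)):
                for j in range(bj, min(bj + block, n - 1)):
                    out[i * n + j] = 0.25 * (grid[(i - 1) * n + j] +
                                             grid[(i + 1) * n + j] +
                                             grid[i * n + j - 1] +
                                             grid[i * n + j + 1])
    return out
```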
Citations: 153
Atomicity via source-to-source translation
Pub Date: 2006-10-22 DOI: 10.1145/1178597.1178611
Benjamin Hindman, D. Grossman
We present an implementation and evaluation of atomicity (also known as software transactions) for a dialect of Java. Our implementation is fundamentally different from prior work in three respects: (1) It is entirely a source-to-source translation, producing Java source code that can be compiled by any Java compiler and run on any Java Virtual Machine. (2) It can enforce "strong" atomicity without assuming special hardware or a uniprocessor. (3) The implementation uses locks rather than optimistic concurrency, but it cannot deadlock and requires inter-thread communication only when there is data contention.
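A rough sketch of the kind of code a lock-based translation of an atomic block might emit (our illustration, not the paper's actual output; here the caller must name the touched objects up front, a simplification of the paper's scheme): acquire the locks of all touched objects in one canonical order, so that no two translated blocks can deadlock.

```python
import threading

_locks = {}   # one lock per object, keyed by identity (sketch only; a
              # real translation would attach the lock to the object)

def run_atomic(objects, body):
    """Run `body` with the locks of every touched object held, acquiring
    them in a canonical (id-based) order to rule out deadlock."""
    locks = [_locks.setdefault(id(o), threading.Lock())
             for o in sorted(objects, key=id)]
    for lock in locks:
        lock.acquire()
    try:
        return body()
    finally:
        for lock in reversed(locks):
            lock.release()
```

Because threads touching disjoint object sets take disjoint locks, inter-thread communication happens only under actual data contention, matching property (3) of the abstract.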
Citations: 87
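The abstract above describes compiling atomic blocks down to plain lock-based Java. As an illustrative sketch only (the names and the single-global-lock scheme are assumptions, not the paper's actual translation, which uses finer-grained locking and communicates between threads only under contention), an `atomic { counter += 1; }` block in the source dialect could be rewritten into ordinary Java that any compiler and JVM accepts:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of a lock-based source-to-source translation.
// A single global lock trivially gives "strong" atomicity (no special
// hardware, no uniprocessor assumption) and cannot deadlock, at the cost
// of all concurrency between atomic blocks.
public class AtomicTranslationSketch {
    static final ReentrantLock GLOBAL = new ReentrantLock();
    static int counter = 0;

    // Translation of the source-dialect block: atomic { counter += 1; }
    static void increment() {
        GLOBAL.lock();
        try {
            counter += 1;
        } finally {
            GLOBAL.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) increment();
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(counter); // 4 threads x 10,000 increments each
    }
}
```

Without the lock, the unsynchronized `counter += 1` would lose updates under contention; with it, the final count is deterministic.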
Memory models for open-nested transactions
Pub Date: 2006-10-22 DOI: 10.1145/1178597.1178610
Kunal Agrawal, C. Leiserson, Jim Sukha
Open nesting provides a loophole in the strict model of atomic transactions. Moss and Hosking suggested adapting open nesting for transactional memory, and Moss and a group at Stanford have proposed hardware schemes to support open nesting. Since these researchers have described their schemes using only operational definitions, however, the semantics of these systems have not been specified in an implementation-independent way. This paper offers a framework for defining and exploring the memory semantics of open nesting in a transactional-memory setting. Our framework allows us to define the traditional model of serializability and two new transactional-memory models, race freedom and prefix race freedom. The weakest of these memory models, prefix race freedom, closely resembles the Stanford open-nesting model. We prove that these three memory models are equivalent for transactional-memory systems that support only closed nesting, as long as aborted transactions are "ignored." We prove, however, that for systems that support open nesting, the models of serializability, race freedom, and prefix race freedom are distinct. We show that the Stanford TM system implements a model at least as strong as prefix race freedom and strictly weaker than race freedom. Thus, their model compromises serializability, the property traditionally used to reason about the correctness of transactions.
Citations: 30
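The distinction the abstract draws between closed and open nesting can be made concrete with a toy write-log simulation. This is an illustrative sketch under assumed names, not the paper's formalism or the Stanford TM implementation: a closed-nested child merges its writes into the parent's speculative log (so they vanish if the parent aborts), while an open-nested child commits straight to shared state, and its effects survive a parent abort — the "loophole" in strict atomicity.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation (hypothetical, for illustration only) of closed vs. open
// nesting. Globally visible state is a committed-writes list; a parent
// transaction buffers its writes in a speculative log until commit.
public class OpenNestingSketch {
    static final List<String> committed = new ArrayList<>(); // shared state

    public static void main(String[] args) {
        List<String> outerLog = new ArrayList<>(); // parent's speculative writes
        outerLog.add("outer-write");

        // Closed-nested child: on commit, its writes merge into the
        // parent's log, not into shared state.
        List<String> closedLog = new ArrayList<>();
        closedLog.add("closed-write");
        outerLog.addAll(closedLog);

        // Open-nested child: on commit, its writes go straight to shared
        // state, bypassing the parent.
        committed.add("open-write");

        // Parent aborts: its speculative log (including the closed child's
        // writes) is discarded, but the open-nested commit remains.
        outerLog.clear();

        System.out.println(committed); // only the open-nested effect survives
    }
}
```

An execution like this one is why serializability breaks down under open nesting: the committed state reflects a child of a transaction that, as far as the parent is concerned, never happened.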
Journal
Workshop on Memory System Performance and Correctness