
Latest publications from the ACM International Conference on Computing Frontiers

Maintaining real-time synchrony on SpiNNaker
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016622
Sergio Davies, Alexander D. Rast, F. Galluppi, S. Furber
As an asynchronous universal multiprocessor for real-time neural simulation, SpiNNaker presents timing concerns not present in synchronous systems. In this paper we present a series of tools that solve the problem of synchronising a multichip distributed simulation containing multiple independent time domains. These tools hint at an important neural modelling capability of the SpiNNaker system: the ability to decouple the system time from the model time, leading to an abstract-time neural modelling platform.
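The key capability mentioned here, decoupling system time from model time, can be pictured with a minimal sketch in C (an illustration under assumed semantics, not SpiNNaker's actual mechanism): model time is a counter advanced in fixed timesteps, and wall-clock time is consulted only when real-time operation is requested.

```c
/* Minimal sketch, assuming a discrete-timestep simulator; not SpiNNaker's
 * implementation. Model time advances in fixed 1 ms steps regardless of how
 * long each step takes to compute; an optional real-time mode stalls to keep
 * model time aligned with wall-clock (system) time. */
#include <stdio.h>
#include <time.h>

static void simulate_timestep(long model_time_ms) {
    (void)model_time_ms;   /* placeholder for the neural-state update of this step */
}

int main(void) {
    const long timestep_ms = 1;
    const int  real_time   = 0;   /* 0: abstract (decoupled) time, 1: real time */

    for (long model_ms = 0; model_ms < 100; model_ms += timestep_ms) {
        simulate_timestep(model_ms);
        if (real_time) {
            /* crude re-alignment: stall one wall-clock timestep per model timestep */
            struct timespec pause = { 0, timestep_ms * 1000000L };
            nanosleep(&pause, NULL);
        }
    }
    printf("advanced model time by 100 ms, independently of system time\n");
    return 0;
}
```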
Citations: 0
Dynamic co-management of persistent RAM main memory and storage resources
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016620
J. Jung, Sangyeun Cho
This paper proposes Memorage, a novel system architecture that synergistically manages persistent RAM (PRAM) main memory and a PRAM storage device. Memorage leverages the existing OS virtual memory manager to globally manage PRAM resources and to enhance the utilization of the available PRAM resources. Preliminary experimental and analytical evaluation suggests that Memorage can improve the performance of memory-intensive workloads (by 4.6X on average and up to 9.4X under the examined configuration). It also increases PRAM utilization and significantly extends the longevity of the PRAM main memory (by 8X).
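As a rough illustration of the co-management idea, the sketch below (hypothetical interface and watermarks, not Memorage's actual implementation) lends free pages of the PRAM storage pool to the main-memory pool under memory pressure and returns them once pressure subsides.

```c
/* Hedged sketch: a page-level co-manager for a unified PRAM device.  The pool
 * structure, watermarks, and rebalance policy are assumptions for illustration. */
#include <stdio.h>

typedef struct {
    long mem_free;       /* free pages currently assigned to main memory   */
    long storage_free;   /* free pages currently assigned to storage       */
    long borrowed;       /* storage pages on loan to the main-memory pool  */
} pram_pools_t;

#define LOW_WATERMARK  1024   /* below this, borrow pages from storage  */
#define HIGH_WATERMARK 4096   /* above this, return borrowed pages      */

static void rebalance(pram_pools_t *p, long chunk) {
    if (p->mem_free < LOW_WATERMARK && p->storage_free >= chunk) {
        p->storage_free -= chunk;    /* grow the main-memory pool */
        p->mem_free     += chunk;
        p->borrowed     += chunk;
    } else if (p->mem_free > HIGH_WATERMARK && p->borrowed >= chunk) {
        p->mem_free     -= chunk;    /* shrink it back */
        p->storage_free += chunk;
        p->borrowed     -= chunk;
    }
}

int main(void) {
    pram_pools_t pools = { .mem_free = 512, .storage_free = 100000, .borrowed = 0 };
    rebalance(&pools, 2048);         /* memory pressure: borrow 2048 pages */
    printf("mem_free=%ld storage_free=%ld borrowed=%ld\n",
           pools.mem_free, pools.storage_free, pools.borrowed);
    return 0;
}
```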
Citations: 10
Bounding the effect of partition camping in GPU kernels
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016637
Ashwin M. Aji, Mayank Daga, Wu-chun Feng
Current GPU tools and performance models provide some common architectural insights that guide the programmers to write optimal code. We challenge and complement these performance models and tools, by modeling and analyzing a lesser known, but very severe performance pitfall, called Partition Camping, in NVIDIA GPUs. Partition Camping is caused by memory accesses that are skewed towards a subset of the available memory partitions, which may degrade the performance of GPU kernels by up to seven-fold. There is no existing tool that can detect the partition camping effect in GPU kernels. Unlike the traditional performance modeling approaches, we predict a performance range that bounds the partition camping effect in the GPU kernel. Our idea of predicting a performance range, instead of the exact performance, is more realistic due to the large performance variations induced by partition camping. We design and develop the prediction model by first characterizing the effects of partition camping with an indigenous suite of micro-benchmarks. We then apply rigorous statistical regression techniques over the micro-benchmark data to predict the performance bounds of real GPU kernels, with and without the partition camping effect. We test the accuracy of our performance model by analyzing three real applications with known memory access patterns and partition camping effects. Our results show that the geometric mean of errors in our performance range prediction model is within 12% of the actual execution times. We also develop and present a very easy-to-use spreadsheet based tool called CampProf, which is a visual front-end to our performance range prediction model and can be used to gain insights into the degree of partition camping in GPU kernels. Lastly, we demonstrate how CampProf can be used to visually monitor the performance improvements in the kernels, as the partition camping effect is being removed.
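To make the camping effect concrete, the sketch below (plain C, with an assumed 256-byte partition width and 8 partitions modelled on GT200-era GPUs; not the paper's tool or model) maps the starting address of each concurrently active block to a partition. With a 2048-element row stride, every block lands on the same partition, which is exactly the skew described above.

```c
/* Hedged sketch: counting which memory partitions a set of concurrently active
 * blocks would hit.  Partition width and count are assumptions, not taken from
 * the paper. */
#include <stdio.h>

#define PARTITION_BYTES 256
#define NUM_PARTITIONS  8

static int partition_of(size_t byte_addr) {
    return (int)((byte_addr / PARTITION_BYTES) % NUM_PARTITIONS);
}

int main(void) {
    const size_t elem  = sizeof(float);
    const size_t width = 2048;                /* elements per matrix row */
    int hits[NUM_PARTITIONS] = {0};

    /* "Camped" pattern: block b starts reading at row b, i.e. at byte offset
     * b * width * elem.  Because that stride is a multiple of
     * PARTITION_BYTES * NUM_PARTITIONS, every block hits partition 0. */
    for (int block = 0; block < 64; block++)
        hits[partition_of((size_t)block * width * elem)]++;

    for (int p = 0; p < NUM_PARTITIONS; p++)
        printf("partition %d: %d concurrent accesses\n", p, hits[p]);
    return 0;
}
```

Offsetting each block's starting address by block * PARTITION_BYTES (i.e. padding the rows) would spread the accesses across all eight partitions and remove the serialization.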
Citations: 27
MPOpt-Cell: a high-performance data-flow programming environment for the CELL BE processor
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016618
A. Franceschelli, P. Burgio, Giuseppe Tagliavini, A. Marongiu, M. Ruggiero, M. Lombardi, Alessio Bonfietti, M. Milano, L. Benini
We present MPOpt-Cell, an architecture-aware framework for high-productivity development and efficient execution of stream applications on the CELL BE Processor. It enables developers to quickly build Synchronous Data Flow (SDF) applications using a simple and intuitive programming interface based on a set of compiler directives that capture the key abstractions of SDF. The compiler backend and system runtime efficiently manage hardware resources.
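For readers unfamiliar with the model, the following plain-C sketch illustrates the synchronous data flow semantics the framework targets: actors with fixed per-firing token rates executed under a static schedule. The actors, rates, and schedule are invented for illustration; the abstract does not spell out MPOpt-Cell's directive syntax, so none is shown here.

```c
/* Hedged sketch of SDF semantics: a producer that emits 2 tokens per firing
 * and a consumer that absorbs 4, so the balanced static schedule fires the
 * producer twice for every consumer firing. */
#include <stdio.h>

static int fifo[8];
static int fill = 0;

static void producer(void) {             /* rate: produces 2 tokens */
    for (int i = 0; i < 2; i++) { fifo[fill] = fill; fill++; }
}

static void consumer(void) {             /* rate: consumes 4 tokens */
    for (int i = 0; i < 4; i++) printf("%d ", fifo[i]);
    printf("\n");
    fill = 0;
}

int main(void) {
    for (int iter = 0; iter < 3; iter++) {   /* repeat the static schedule */
        producer();
        producer();
        consumer();
    }
    return 0;
}
```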
Citations: 9
On-the-fly detection of precise loop nests across procedures on a dynamic binary translation system
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016634
Yukinori Sato, Y. Inoguchi, Tadao Nakamura
Loop structures in programs have been regarded as a primary source of parallelism in sequential code. In this paper, we present a new technique that dynamically detects precise loop structures, together with their inter-procedural nests, on a dynamic binary translation system. Using precompiled application binary code as input, our mechanism generates simple but precise markers when the binary code image is loaded, and at runtime monitors loop structures with inter-procedural nesting on the fly using a Loop-Call Context Graph. We implement our mechanism and evaluate it using the SPEC CPU benchmark suite. The results show that our mechanism successfully reveals precise loop structures with inter-procedural loop nesting. They also show that it reduces loop-analysis overheads compared with existing approaches. These results indicate that our mechanism can be applied to runtime optimization and parallelization, and can also provide hints for performance tuning.
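As a simplified picture of what on-the-fly loop detection can look like inside a binary translator, the sketch below flags a loop head whenever a taken backward branch is observed and counts its trips. It is a minimal, assumption-laden illustration; the paper's marker generation and Loop-Call Context Graph are not reproduced here.

```c
/* Hedged sketch: back-edge-based loop detection driven by a branch callback
 * that a (hypothetical) dynamic binary translator would invoke. */
#include <stdio.h>
#include <stdint.h>

typedef struct { uint64_t head; uint64_t back_branch; long trips; } loop_t;

static loop_t loops[128];
static int nloops = 0;

/* Invoked for every taken branch the translator executes. */
static void on_taken_branch(uint64_t branch_pc, uint64_t target_pc) {
    if (target_pc > branch_pc)
        return;                              /* forward branch: not a back-edge */
    for (int i = 0; i < nloops; i++)
        if (loops[i].head == target_pc) { loops[i].trips++; return; }
    if (nloops < 128)
        loops[nloops++] = (loop_t){ target_pc, branch_pc, 1 };
}

int main(void) {
    /* Simulated trace: a loop headed at 0x400100 whose back-edge sits at 0x400180. */
    for (int i = 0; i < 10; i++)
        on_taken_branch(0x400180, 0x400100);
    for (int i = 0; i < nloops; i++)
        printf("loop head 0x%llx: %ld trips\n",
               (unsigned long long)loops[i].head, loops[i].trips);
    return 0;
}
```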
Citations: 16
CnC-Hadoop: a graphical coordination language for distributed multiscale parallelism
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016626
Riyaz Haque, David M. Peixotto, Vivek Sarkar
The information-technology platform is being radically transformed with the widespread adoption of the cloud computing model supported by data centers containing large numbers of multicore servers. While cloud computing platforms can potentially enable a rich variety of distributed applications, the need to exploit multiscale parallelism at the inter-node and intra-node level poses significant new challenges for software. Recent advances in the Google MapReduce and Hadoop frameworks have led to simplified programming models for a restricted class of distributed batch-processing applications. However, these frameworks do not support richer distributed application structures beyond map-reduce, and do not offer any solutions for exploiting shared-memory multicore parallelism at the intra-node level.
Citations: 0
Performance analysis and optimization of molecular dynamics simulation on Godson-T many-core processor
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016643
Liu Peng, A. Nakano, Guangming Tan, P. Vashishta, Dongrui Fan, Hao Zhang, R. Kalia, Fenglong Song
Molecular dynamics (MD) simulation has broad applications, but its irregular memory-access pattern makes performance optimization a challenge. This paper presents a joint application/architecture study to enhance on-chip parallelism of MD on a Godson-T-like many-core architecture. First, a preprocessing step leveraging an adaptive divide-and-conquer framework is designed to exploit locality through the memory hierarchy with software-controlled memory. Then we propose three incremental optimization strategies: (1) a novel data layout to re-organize linked-list cell data structures to improve data locality; (2) an on-chip locality-aware parallel algorithm to enhance data reuse; and (3) a pipelining algorithm to hide latency to shared memory. Experiments on the Godson-T simulator exhibit a strong-scaling parallel efficiency of 0.99 on 64 cores, which is confirmed by an FPGA emulator. Detailed analysis shows that optimizations utilizing architectural features to maximize data locality and to enhance data reuse benefit scalability most. Furthermore, a simple performance model suggests that the optimization scheme is likely to scale well toward exascale. Certain architectural features are found essential for these optimizations, which could guide future hardware developments.
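The first optimization, re-organizing linked-list cell data, can be pictured with the sketch below: the classic linked-cell chains are packed into contiguous per-cell arrays so that neighbour traversal becomes sequential in memory. The layout shown is an assumed illustration of the general technique, not the paper's exact data structure.

```c
/* Hedged sketch: packing an MD linked-cell structure (head/next chains) into
 * contiguous per-cell position arrays to improve data locality. */
#include <stdio.h>

typedef struct { double x, y, z; } pos_t;

static void pack_cells(int ncells, const int *head, const int *next,
                       const pos_t *pos,
                       pos_t *packed_pos, int *cell_start, int *cell_count) {
    int out = 0;
    for (int c = 0; c < ncells; c++) {
        cell_start[c] = out;
        for (int i = head[c]; i != -1; i = next[i])   /* walk the chain */
            packed_pos[out++] = pos[i];               /* copy atoms contiguously */
        cell_count[c] = out - cell_start[c];
    }
}

int main(void) {
    /* Two cells, three atoms: cell 0 holds atoms 0 and 2, cell 1 holds atom 1. */
    int   head[2] = { 0, 1 };
    int   next[3] = { 2, -1, -1 };
    pos_t pos[3]  = { {0,0,0}, {1,1,1}, {2,2,2} };
    pos_t packed[3];
    int   start[2], count[2];

    pack_cells(2, head, next, pos, packed, start, count);
    printf("cell 0: %d atoms starting at index %d\n", count[0], start[0]);
    return 0;
}
```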
Citations: 1
AstroLIT: enabling simulation-based microarchitecture comparison between Intel® and Transmeta designs
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016629
Guilherme Ottoni, G. Chinya, Gerolf Hoflehner, Jamison D. Collins, Amit Kumar, E. Schuchman, D. Ditzel, Ronak Singhal, Hong Wang
While the out-of-order engine has been the mainstream micro-architecture-design paradigm to achieve high performance, Transmeta took a different approach using dynamic binary translation (BT). To enable detailed comparison of these two radically different processor-design approaches, it is natural to leverage well-established simulation-based methodologies. However, BT-based processor designs pose new challenges to standard sampling-based simulation methodologies. This paper describes these challenges, and it also introduces the AstroLIT methodology to address them.
Citations: 4
An MPSoC design approach for multiple use-cases of throughput constrainted applications
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016628
A. Shabbir, S. Stuijk, Akash Kumar, H. Corporaal, B. Mesman
Modern multimedia systems must support a variety of different use-cases. Multi-processors Systems-on-Chip (MPSoCs) are used to realize these systems. A system designer has to dimension the size of an MPSoC such that the performance constraints of the applications are satisfied in all use-cases. In this paper, we present an approach to design MPSoCs that can meet the throughput constraints of a set of applications while minimizing the resource requirements.
Citations: 3
Multi- and many-core data mining with adaptive sparse grids
Pub Date : 2011-05-03 DOI: 10.1145/2016604.2016640
A. Heinecke, D. Pflüger
Gaining knowledge out of vast datasets is a main challenge in data-driven applications nowadays. Sparse grids provide a numerical method for both classification and regression in data mining which scales only linearly in the number of data points and is thus well-suited for huge amounts of data. Due to the recursive nature of sparse grid algorithms, they impose a challenge for the parallelization on modern hardware architectures such as accelerators. In this paper, we present the parallelization on several current task- and data-parallel platforms, covering multi-core CPUs with vector units, GPUs, and hybrid systems. Furthermore, we analyze the suitability of parallel programming languages for the implementation. Considering hardware, we restrict ourselves to the x86 platform with SSE and AVX vector extensions and to NVIDIA's Fermi architecture for GPUs. We consider both multi-core CPU and GPU architectures independently, as well as hybrid systems with up to 12 cores and 2 Fermi GPUs. With respect to parallel programming, we examine both the open standard OpenCL and Intel Array Building Blocks, a recently introduced high-level programming approach. As the baseline, we use the best results obtained with classically parallelized sparse grid algorithms and their OpenMP-parallelized intrinsics counterpart (SSE and AVX instructions), reporting both single and double precision measurements. The huge data sets we use are a real-life dataset stemming from astrophysics and an artificial one which exhibits challenging properties. In all settings, we achieve excellent results, obtaining speedups of more than 60 using single precision on a hybrid system.
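To show where the linear scaling in the number of data points comes from, the sketch below evaluates a sparse-grid function with the standard piecewise-linear (hat) basis and parallelizes the outer loop over data points with OpenMP. It is a plain reference formulation under those assumptions, not the vectorized SSE/AVX, OpenCL, or Array Building Blocks kernels the paper benchmarks.

```c
/* Hedged sketch: sparse-grid evaluation f(x) = sum_g alpha_g * prod_d
 * phi_{l_gd, i_gd}(x_d), with the 1-D hat basis phi_{l,i}(x) = max(1 - |2^l x - i|, 0).
 * The OpenMP pragma parallelizes over data points, giving the linear-in-#points cost. */
#include <stdio.h>
#include <math.h>

static double hat(int level, int idx, double x) {
    double v = 1.0 - fabs(ldexp(x, level) - (double)idx);   /* ldexp(x, l) = x * 2^l */
    return v > 0.0 ? v : 0.0;
}

static void eval_sparse_grid(int npoints, int ngrid, int dim,
                             const double *data, const int *level, const int *idx,
                             const double *alpha, double *result) {
    #pragma omp parallel for
    for (int p = 0; p < npoints; p++) {
        double sum = 0.0;
        for (int g = 0; g < ngrid; g++) {
            double phi = 1.0;
            for (int d = 0; d < dim; d++)
                phi *= hat(level[g * dim + d], idx[g * dim + d], data[p * dim + d]);
            sum += alpha[g] * phi;
        }
        result[p] = sum;
    }
}

int main(void) {
    /* One 2-D grid point (level 1, index 1 in both dimensions) with surplus 1.0. */
    int    level[2] = {1, 1}, idx[2] = {1, 1};
    double alpha[1] = {1.0};
    double data[2]  = {0.5, 0.5};
    double result[1];

    eval_sparse_grid(1, 1, 2, data, level, idx, alpha, result);
    printf("f(0.5, 0.5) = %.3f\n", result[0]);   /* expected: 1.000 */
    return 0;
}
```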
Citations: 31