Data access latency, a limiting factor in the performance of chip multiprocessors, grows significantly with the number of cores in non-uniform cache architectures with distributed cache banks. To mitigate this effect, it is necessary to exploit data access locality and choose an optimal data placement. Achieving this is especially challenging when other constraints such as cache capacity, coherence messages and runtime overhead must also be considered. This paper presents a compiler-based approach for analyzing data access behavior in multi-threaded applications. The proposed experimental compiler framework employs novel compilation techniques to discover and represent multi-threaded memory access patterns (MMAPs). At run time, symbolic MMAPs are resolved and used by a partitioning algorithm to choose a partition of allocated memory blocks among the forked threads in the analyzed application. This partition is used to enforce data ownership by associating the data with the core that executes the thread owning the data. We demonstrate how this information can be used in an experimental architecture to accelerate applications. In particular, our compiler-assisted approach shows a 20% speedup over shared caching and a 5% speedup over the closest runtime approximation, “first touch”.
{"title":"Compiler-assisted data distribution for chip multiprocessors","authors":"Yong Li, Ahmed Abousamra, R. Melhem, A. Jones","doi":"10.1145/1854273.1854335","DOIUrl":"https://doi.org/10.1145/1854273.1854335","url":null,"abstract":"Data access latency, a limiting factor in the performance of chip multiprocessors, grows significantly with the number of cores in non-uniform cache architectures with distributed cache banks. To mitigate this effect, it is necessary to leverage the data access locality and choose an optimum data placement. Achieving this is especially challenging when other constraints such as cache capacity, coherence messages and runtime overhead need to be considered. This paper presents a compiler-based approach used for analyzing data access behavior in multi-threaded applications. The proposed experimental compiler framework employs novel compilation techniques to discover and represent multi-threaded memory access patterns (MMAPs). At run time, symbolic MMAPs are resolved and used by a partitioning algorithm to choose a partition of allocated memory blocks among the forked threads in the analyzed application. This partition is used to enforce data ownership by associating the data with the core that executes the thread owning the data. We demonstrate how this information can be used in an experimental architecture to accelerate applications. In particular, our compiler assisted approach shows a 20% speedup over shared caching and 5% speedup over the closest runtime approximation, “first touch”.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133773183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tanausú Ramírez, Alex Pajuelo, Oliverio J. Santana, O. Mutlu, M. Valero
Runahead Threads (RaT) is a promising solution that enables a thread to speculatively run ahead and prefetch data instead of stalling on a long-latency load in a simultaneous multithreading processor. With this capability, RaT reduces resource monopolization by memory-intensive threads and exploits memory-level parallelism, improving both system performance and single-thread performance. Unfortunately, the benefits of RaT come at the expense of an increased number of executed instructions, which adversely affects its energy efficiency. In this paper, we propose Runahead Distance Prediction (RDP), a simple technique to improve the efficiency of Runahead Threads. The main idea of the RDP mechanism is to predict how far a thread should run ahead speculatively such that the speculative execution is useful. By limiting the runahead distance of a thread, we generate efficient runahead threads that avoid unnecessary speculative execution and enhance RaT energy efficiency. By reducing runahead-based speculation when it is predicted to be not useful, RDP also allows shared resources to be used efficiently by non-speculative threads. Our results show that RDP significantly reduces power consumption while maintaining the performance of RaT, providing a better performance and energy balance than previous proposals in the field.
{"title":"Efficient Runahead Threads","authors":"Tanausú Ramírez, Alex Pajuelo, Oliverio J. Santana, O. Mutlu, M. Valero","doi":"10.1145/1854273.1854328","DOIUrl":"https://doi.org/10.1145/1854273.1854328","url":null,"abstract":"Runahead Threads (RaT) is a promising solution that enables a thread to speculatively run ahead and prefetch data instead of stalling for a long-latency load in a simultaneous multithreading processor. With this capability, RaT can reduces resource monopolization due to memory-intensive threads and exploits memory-level parallelism, improving both system performance and single-thread performance. Unfortunately,the benefits of RaT come at the expense of increasing the number of executed instructions, which adversely affects its energy efficiency. In this paper, we propose Runahead Distance Prediction (RDP), a simple technique to improve the efficiency of Runahead Threads. The main idea of the RDP mechanism is to predict how far a thread should run ahead speculatively such that speculative execution is useful. By limiting the runahead distance of a thread, we generate efficient runahead threads that avoid unnecessary speculative execution and enhance RaT energy efficiency. By reducing runahead-based speculation when it is predicted to be not useful, RDP also allows shared resources to be efficiently used by non-speculative threads. Our results show that RDP significantly reduces power consumption while maintaining the performance of RaT, providing better performance and energy balance than previous proposals in the field.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124837086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Santhosh Sharma Ananthramu, Deepak Majeti, S. Aggarwal, Mainak Chaudhuri
Speculative parallelization is a powerful technique to parallelize loops with irregular data dependencies. In this poster, we present a value-based selective squash protocol and an optimistic speculation reuse technique that leverages an extended notion of silent stores. These optimizations focus on reducing the number of squashes due to dependency violations. Our proposed optimizations, when applied to loops selected from standard benchmark suites, demonstrate an average (geometric mean) 2.5x performance improvement. This improvement is attributed to a 94% success rate in speculation reuse and a 77% reduction in the number of squashed threads compared to an implementation that, on a violation, would have squashed all successor threads starting from the oldest offending one.
{"title":"Improving speculative loop parallelization via selective squash and speculation reuse","authors":"Santhosh Sharma Ananthramu, Deepak Majeti, S. Aggarwal, Mainak Chaudhuri","doi":"10.1145/1854273.1854343","DOIUrl":"https://doi.org/10.1145/1854273.1854343","url":null,"abstract":"Speculative parallelization is a powerful technique to parallelize loops with irregular data dependencies. In this poster, we present a value-based selective squash protocol and an optimistic speculation reuse technique that leverages an extended notion of silent stores. These optimizations focus on reducing the number of squashes due to dependency violations. Our proposed optimizations, when applied to loops selected from standard benchmark suites, demonstrate an average (geometric mean) 2.5x performance improvement. This improvement is attributed to a 94% success in speculation reuse and a 77% reduction in the number of squashed threads compared to an implementation that, in such cases of squashes, would have squashed all the successors starting from the oldest offending one.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127056222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Cadambi, Abhinandan Majumdar, M. Becchi, S. Chakradhar, H. Graf
For learning and classification workloads that operate on large amounts of unstructured data with stringent performance constraints, general purpose processor performance scales poorly with data size. In this paper, we present a programmable accelerator for this workload domain. To architect the accelerator, we profile five representative workloads, and find that their computationally intensive portions can be formulated as matrix or vector operations generating large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding max/min and aggregation. The proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses in-memory processing where on-chip memory blocks perform the secondary reduction operations. By doing so, the intermediate data are dynamically processed and never stored or sent off-chip. Second, MAPLE uses banked off-chip memory, and organizes its PEs into independent groups each with its own off-chip memory bank. These two features together allow MAPLE to scale its performance with data size. This paper describes the MAPLE architecture, explores its design space with a simulator, and illustrates how to automatically map application kernels to the hardware. We also implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5–10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz.
{"title":"A programmable parallel accelerator for learning and classification","authors":"S. Cadambi, Abhinandan Majumdar, M. Becchi, S. Chakradhar, H. Graf","doi":"10.1145/1854273.1854309","DOIUrl":"https://doi.org/10.1145/1854273.1854309","url":null,"abstract":"For learning and classification workloads that operate on large amounts of unstructured data with stringent performance constraints, general purpose processor performance scales poorly with data size. In this paper, we present a programmable accelerator for this workload domain. To architect the accelerator, we profile five representative workloads, and find that their computationally intensive portions can be formulated as matrix or vector operations generating large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding max/min and aggregation. The proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses in-memory processing where on-chip memory blocks perform the secondary reduction operations. By doing so, the intermediate data are dynamically processed and never stored or sent off-chip. Second, MAPLE uses banked off-chip memory, and organizes its PEs into independent groups each with its own off-chip memory bank. These two features together allow MAPLE to scale its performance with data size. This paper describes the MAPLE architecture, explores its design space with a simulator, and illustrates how to automatically map application kernels to the hardware. We also implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5–10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116629159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the number of cores integrated on a single chip continues to increase, communication has the potential to become a severe bottleneck to overall system performance. The presence of thread sharing and the distribution of data across cache banks on the chip can result in long distance communication. Long distance communication incurs substantial latency that impacts performance; furthermore, this communication consumes significant dynamic power when packets are switched over many Network-on-Chip (NoC) links and routers. Thread migration can mitigate problems created by long distance communication. We present Moths, an efficient run-time algorithm that responds automatically to dynamic NoC traffic patterns, providing beneficial thread migration to decrease overall traffic volume and average packet latency. Moths reduces on-chip network latency by up to 28.4% (18.0% on average) and traffic volume by up to 24.9% (20.6% on average) across a variety of commercial and scientific benchmarks.
{"title":"Moths: Mobile threads for On-Chip Networks","authors":"Matthew Misler, Natalie D. Enright Jerger","doi":"10.1145/1854273.1854342","DOIUrl":"https://doi.org/10.1145/1854273.1854342","url":null,"abstract":"As the number of cores integrated on a single chip continues to increase, communication has the potential to become a severe bottleneck to overall system performance. The presence of thread sharing and the distribution of data across cache banks on the chip can result in long distance communication. Long distance communication incurs substantial latency that impacts performance; furthermore, this communication consumes significant dynamic power when packets are switched over many Network-on-Chip (NoC) links and routers. Thread migration can mitigate problems created by long distance communication. We present Moths, an efficient run-time algorithm that responds automatically to dynamic NoC traffic patterns, providing beneficial thread migration to decrease overall traffic volume and average packet latency. Moths reduces on-chip network latency by up to 28.4% (18.0% on average) and traffic volume by up to 24.9% (20.6% on average) across a variety of commercial and scientific benchmarks.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130671351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
George Kurian, Jason E. Miller, James Psota, J. Eastep, Jifeng Liu, J. Michel, L. Kimerling, A. Agarwal
Based on current trends, multicore processors will have 1000 cores or more within the next decade. However, their promise of increased performance will only be realized if their inherent scaling and programming challenges are overcome. Fortunately, recent advances in nanophotonic device manufacturing are making CMOS-integrated optics a reality: interconnect technology that can provide significantly more bandwidth at lower power than conventional electrical signaling. Optical interconnect has the potential to enable massive scaling and preserve familiar programming models in future multicore chips. This paper presents ATAC, a new multicore architecture with integrated optics, and ACKwise, a novel cache coherence protocol designed to leverage ATAC's strengths. ATAC uses nanophotonic technology to implement a fast, efficient global broadcast network which helps address a number of the challenges that future multicores will face. ACKwise is a new directory-based cache coherence protocol that uses this broadcast mechanism to provide high performance and scalability. Based on 64-core and 1024-core simulations with Splash2, Parsec, and synthetic benchmarks, we show that ATAC with ACKwise outperforms a chip with conventional interconnect and cache coherence protocols. In the 1024-core evaluations, the ACKwise protocol on ATAC outperforms the best conventional cache coherence protocol on an electrical mesh network by 2.5x with Splash2 benchmarks and by 61% with synthetic benchmarks.
{"title":"ATAC: A 1000-core cache-coherent processor with on-chip optical network","authors":"George Kurian, Jason E. Miller, James Psota, J. Eastep, Jifeng Liu, J. Michel, L. Kimerling, A. Agarwal","doi":"10.1145/1854273.1854332","DOIUrl":"https://doi.org/10.1145/1854273.1854332","url":null,"abstract":"Based on current trends, multicore processors will have 1000 cores or more within the next decade. However, their promise of increased performance will only be realized if their inherent scaling and programming challenges are overcome. Fortunately, recent advances in nanophotonic device manufacturing are making CMOS-integrated optics a reality—interconnect technology which can provide significantly more bandwidth at lower power than conventional electrical signaling. Optical interconnect has the potential to enable massive scaling and preserve familiar programming models in future multicore chips. This paper presents ATAC, a new multicore architecture with integrated optics, and ACKwise, a novel cache coherence protocol designed to leverage ATAC's strengths. ATAC uses nanophotonic technology to implement a fast, efficient global broadcast network which helps address a number of the challenges that future multicores will face. ACKwise is a new directory-based cache coherence protocol that uses this broadcast mechanism to provide high performance and scalability. Based on 64-core and 1024-core simulations with Splash2, Parsec, and synthetic benchmarks, we show that ATAC with ACKwise out-performs a chip with conventional interconnect and cache coherence protocols. On 1024-core evaluations, ACKwise protocol on ATAC outperforms the best conventional cache coherence protocol on an electrical mesh network by 2.5x with Splash2 benchmarks and by 61% with synthetic benchmarks.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122217084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jaejin Lee, Jungwon Kim, Sangmin Seo, Seungkyun Kim, Jungho Park, Hong-Seok Kim, Thanh Tuan Dao, Yongjin Cho, Sungsok Seo, Seung Hak Lee, Seung Mo Cho, H. Song, Sang-Bum Suh, Jong-Deok Choi
In this paper, we present the design and implementation of an Open Computing Language (OpenCL) framework that targets heterogeneous accelerator multicore architectures with local memory. The architecture consists of a general-purpose processor core and multiple accelerator cores that typically do not have any cache. Each accelerator core, instead, has a small internal local memory. Our OpenCL runtime is based on software-managed caches and coherence protocols that guarantee OpenCL memory consistency to overcome the limited size of the local memory. To boost performance, the runtime relies on three source-code transformation techniques, work-item coalescing, web-based variable expansion and preload-poststore buffering, performed by our OpenCL C source-to-source translator. Work-item coalescing is a procedure to serialize multiple SPMD-like tasks that execute concurrently in the presence of barriers and to sequentially run them on a single accelerator core. It requires the web-based variable expansion technique to allocate local memory for private variables. Preload-poststore buffering is a buffering technique that eliminates the overhead of software cache accesses. Together with work-item coalescing, it has a synergistic effect on boosting performance. We show the effectiveness of our OpenCL framework, evaluating its performance with a system that consists of two Cell BE processors. The experimental result shows that our approach is promising.
{"title":"An OpenCL framework for heterogeneous multicores with local memory","authors":"Jaejin Lee, Jungwon Kim, Sangmin Seo, Seungkyun Kim, Jungho Park, Hong-Seok Kim, Thanh Tuan Dao, Yongjin Cho, Sungsok Seo, Seung Hak Lee, Seung Mo Cho, H. Song, Sang-Bum Suh, Jong-Deok Choi","doi":"10.1145/1854273.1854301","DOIUrl":"https://doi.org/10.1145/1854273.1854301","url":null,"abstract":"In this paper, we present the design and implementation of an Open Computing Language (OpenCL) framework that targets heterogeneous accelerator multicore architectures with local memory. The architecture consists of a general-purpose processor core and multiple accelerator cores that typically do not have any cache. Each accelerator core, instead, has a small internal local memory. Our OpenCL runtime is based on software-managed caches and coherence protocols that guarantee OpenCL memory consistency to overcome the limited size of the local memory. To boost performance, the runtime relies on three source-code transformation techniques, work-item coalescing, web-based variable expansion and preload-poststore buffering, performed by our OpenCL C source-to-source translator. Work-item coalescing is a procedure to serialize multiple SPMD-like tasks that execute concurrently in the presence of barriers and to sequentially run them on a single accelerator core. It requires the web-based variable expansion technique to allocate local memory for private variables. Preload-poststore buffering is a buffering technique that eliminates the overhead of software cache accesses. Together with work-item coalescing, it has a synergistic effect on boosting performance. We show the effectiveness of our OpenCL framework, evaluating its performance with a system that consists of two Cell BE processors. The experimental result shows that our approach is promising.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116102756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computer systems that can directly and accurately answer people's questions over a broad domain of human knowledge have been envisioned by scientists and writers since the advent of computers themselves. Open domain question answering holds tremendous promise for facilitating informed decision making over vast volumes of natural language content. Applications in business intelligence, healthcare, customer support, enterprise knowledge management, social computing, science and government would all benefit from deep language processing. The DeepQA project (www.ibm.com/deepqa) is aimed at illustrating how the advancement and integration of Natural Language Processing (NLP), Information Retrieval (IR), Machine Learning (ML), massively parallel computation and Knowledge Representation and Reasoning (KR&R) can greatly advance open-domain automatic Question Answering. An exciting proof-point in this challenge is to develop a computer system that can successfully compete against top human players at the Jeopardy! quiz show (www.jeopardy.com). Attaining champion-level performance at Jeopardy! requires a computer to rapidly answer rich open-domain questions and to predict its own performance on any given category/question. The system must deliver high degrees of precision and confidence over a very broad range of knowledge and natural language content, with a 3-second response time. To do this, DeepQA generates, evidences and evaluates many competing hypotheses. A key to success is automatically learning and combining accurate confidences across an array of complex algorithms and over different dimensions of evidence. Accurate confidences are needed to know when to “buzz in” against your competitors and how much to bet. High precision and accurate confidence computations, critical for winning at Jeopardy!, are just as critical for providing real value in business settings, where helping users focus on the right content sooner and with greater confidence can make all the difference. The need for speed and high precision demands a massively parallel compute platform capable of generating, evaluating and combining thousands of hypotheses and their associated evidence. In this talk I will introduce the audience to the Jeopardy! Challenge and describe our technical approach and our progress on this grand-challenge problem.
{"title":"Build watson: An overview of DeepQA for the Jeopardy! Challenge","authors":"D. Ferrucci","doi":"10.1145/1854273.1854275","DOIUrl":"https://doi.org/10.1145/1854273.1854275","url":null,"abstract":"Computer systems that can directly and accurately answer peoples' questions over a broad domain of human knowledge have been envisioned by scientists and writers since the advent of computers themselves. Open domain question answering holds tremendous promise for facilitating informed decision making over vast volumes of natural language content. Applications in business intelligence, healthcare, customer support, enterprise knowledge management, social computing, science and government would all benefit from deep language processing. The DeepQA project (www.ibm.com/deepqa) is aimed at illustrating how the advancement and integration of Natural Language Processing (NLP), Information Retrieval (IR), Machine Learning (ML), massively parallel computation and Knowledge Representation and Reasoning (KR&R) can greatly advance open-domain automatic Question Answering. An exciting proof-point in this challenge is to develop a computer system that can successfully compete against top human players at the Jeopardy! quiz show (www.jeopardy.com). Attaining champion-level performance Jeopardy! requires a computer to rapidly answer rich open-domain questions, and to predict its own performance on any given category/question. The system must deliver high degrees of precision and confidence over a very broad range of knowledge and natural language content and with a 3-second response time. To do this DeepQA generates, evidences and evaluates many competing hypotheses. A key to success is automatically learning and combining accurate confidences across an array of complex algorithms and over different dimensions of evidence. Accurate confidences are needed to know when to “buzz in” against your competitors and how much to bet. Critical for winning at Jeopardy!, High precision and accurate confidence computations are just as critical for providing real value in business settings where helping users focus on the right content sooner and with greater confidence can make all the difference. The need for speed and high precision demands a massively parallel compute platform capable of generating, evaluating and combing 1000's of hypotheses and their associated evidence. In this talk I will introduce the audience to the Jeopardy! Challenge and describe our technical approach and our progress on this grand-challenge problem.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124414682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reuse distance analysis is a well-established tool for predicting cache performance, driving compiler optimizations, and assisting visualization and manual optimization of programs. Existing reuse distance analysis methods either do not account for the effects of multithreading, or suffer severe performance penalties. This paper presents a sampled, parallelized method of measuring reuse distance profiles for multithreaded programs, modeling private and shared cache configurations. The sampling technique allows it to spend much of its execution in a fast low-overhead mode, and allows the use of a new measurement method since sampled analysis does not need to consider the full state of the reuse stack. This measurement method uses O(1) data structures that may be made thread-private, allowing parallelization to reduce overhead in analysis mode. The resulting system is evaluated on a diverse set of parallel benchmarks; it generates output that closely matches non-sampled full analysis, performs well on the common application of locating low-locality code in the benchmarks, and incurs a performance overhead comparable to the best single-threaded analysis techniques.
{"title":"Accelerating multicore reuse distance analysis with sampling and parallelization","authors":"Derek L. Schuff, Milind Kulkarni, Vijay S. Pai","doi":"10.1145/1854273.1854286","DOIUrl":"https://doi.org/10.1145/1854273.1854286","url":null,"abstract":"Reuse distance analysis is a well-established tool for predicting cache performance, driving compiler optimizations, and assisting visualization and manual optimization of programs. Existing reuse distance analysis methods either do not account for the effects of multithreading, or suffer severe performance penalties. This paper presents a sampled, parallelized method of measuring reuse distance proiles for multithreaded programs, modeling private and shared cache configurations. The sampling technique allows it to spend much of its execution in a fast low-overhead mode, and allows the use of a new measurement method since sampled analysis does not need to consider the full state of the reuse stack. This measurement method uses O(1) data structures that may be made thread-private, allowing parallelization to reduce overhead in analysis mode. The performance of the resulting system is analyzed for a diverse set of parallel benchmarks and shown to generate accurate output compared to non-sampled full analysis as well as good results for the common application of locating low-locality code in the benchmarks, all with a performance overhead comparable to the best single-threaded analysis techniques.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130430598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There has been little work investigating the overall performance impact of on-chip communication in manycore compute accelerators. In this paper we evaluate the performance of a GPU-like compute accelerator running CUDA workloads and consisting of compute nodes, an interconnection network and the graphics DRAM memory system, using detailed cycle-level simulation. First, we study the performance of a baseline architecture employing a scalable mesh network. We then propose several microarchitectural techniques to exploit the communication characteristics of these applications while providing a cost-effective (i.e., low area) on-chip network. Instead of increasing costly bisection bandwidth, we increase the number of injection ports at the memory controller router nodes to increase terminal bandwidth at these few nodes. In addition, we propose a novel “checkerboard” on-chip network which alternates between conventional full routers and half routers with limited connectivity. This network is enabled by the limited communication of the many-to-few traffic pattern. We describe a minimal routing algorithm for the checkerboard network that does not increase the hop count.
{"title":"On-chip network design considerations for compute accelerators","authors":"A. Bakhoda, John Kim, Tor M. Aamodt","doi":"10.1145/1854273.1854339","DOIUrl":"https://doi.org/10.1145/1854273.1854339","url":null,"abstract":"There has been little work investigating the overall performance impact of on-chip communication in manycore compute accelerators. In this paper we evaluate performance of a GPU-like compute accelerator running CUDA workloads and consisting of compute nodes, interconnection network and the graphics DRAM memory system using detailed cycle-level simulation. First, we study performance of a baseline architecture employing a scalable mesh network. We then propose several microarchitectural techniques to exploit the communication characteristics of these applications while providing a cost-effective (i.e., low area) on-chip network. Instead of increasing costly bisection bandwidth, we increase the the number of injection ports at the memory controller router nodes to increase terminal bandwidth at the few nodes. In addition, we propose a novel “checkerboard” on-chip network which alternates between conventional, full-routers and half -routers with limited connectivity. This network is enabled by limited communication of the many-to-few traffic pattern. We describe a minimal routing algorithm for the checkerboard network that does not increase the hop count.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"429 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132168684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}