"Proximity coherence for chip multiprocessors"
Nick Barrow-Williams
PACT 2010. DOI: 10.1145/1854273.1854293

Many-core architectures provide an efficient way of harnessing the increasing numbers of transistors available in modern fabrication processes. While they are similar to multi-node systems, they exhibit different communication latency and storage characteristics, opening design opportunities that were previously not feasible. Traditional cache coherence protocols, although often used in many-core designs, were developed in the context of multi-node systems, and as such they seldom take advantage of the new possibilities that many-core architectures offer. We propose Proximity Coherence, a scheme in which L1 load misses are optimistically forwarded to nearby caches via new dedicated links rather than always being indirected via a directory structure. The optimization is made possible because a local cache access costs about as much as a traversal of the on-chip network. Coherence is maintained using lightweight graph structures embedded in the L1 caches. We compare our Proximity Coherence protocol to an existing directory-based MESI protocol using full-system simulations of a 32-core system. Our extension lowers the latency of L1 cache load misses by up to 32% and reduces the bytes transferred on the global on-chip interconnect by up to 19% for a range of parallel benchmarks. Employing Proximity Coherence improves execution time by up to 13%, reduces cache hierarchy energy consumption by up to 30%, and delivers a more efficient solution to the challenge of coherence in chip multiprocessors.
{"title":"Proximity coherence for chip multiprocessors","authors":"Nick Barrow-Williams","doi":"10.1145/1854273.1854293","DOIUrl":"https://doi.org/10.1145/1854273.1854293","url":null,"abstract":"Many-core architectures provide an efficient way of harnessing the increasing numbers of transistors available in modern fabrication processes. While they are similar to multi-node systems, they exhibit different communication latency and storage characteristics, providing new design opportunities that were previously not feasible. Traditional cache coherence protocols, although often used in many-core designs, have been developed in the context of multi-node systems. As such, they seldom take advantage of the new possibilities that many-core architectures offer. We propose Proximity Coherence, a scheme in which L1 load misses are optimistically forwarded to nearby caches via new dedicated links rather than always being indirected via a directory structure. Such an optimization is made possible by the comparable cost of local cache accesses with the use of on-chip network resources. Coherency is maintained using lightweight graph structures embedded in the L1 caches. We compare our Proximity Coherence protocol to an existing directory-based MESI protocol using full-system simulations of a 32 core system. Our extension lowers the latency of L1 cache load misses by up to 32% while reducing the bytes transferred on the global on-chip interconnect by up to 19% for a range of parallel benchmarks. Employing Proximity Coherence provides execution time improvements of up to 13%, reduces cache hierarchy energy consumption by up to 30% and delivers a more efficient solution to the challenge of coherence in chip multiprocessors.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126123494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Believe it or not! multi-core CPUs can match GPU performance for a FLOP-intensive application!"
R. Bordawekar, Uday Bondhugula, R. Rao
PACT 2010. DOI: 10.1145/1854273.1854340

In this paper, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference image. We implement the algorithm on an NVIDIA GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized versions, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The best-performing versions on the Power7, Nehalem, and GTX 285 run in 1.02 s, 1.82 s, and 1.22 s, respectively. The algorithm's performance on the GPU suffers from (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. These results demonstrate that, under certain conditions, a FLOP-intensive structured application running on a multi-core processor can match or even beat the performance of an equivalent GPU version.
{"title":"Believe it or not! multi-core CPUs can match GPU performance for a FLOP-intensive application!","authors":"R. Bordawekar, Uday Bondhugula, R. Rao","doi":"10.1145/1854273.1854340","DOIUrl":"https://doi.org/10.1145/1854273.1854340","url":null,"abstract":"In this paper, we evaluate performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. We implement this algorithm on a nVidia GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.22s, respectively. The performance of this algorithm on the nVidia GPU suffers from: (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. These results conclusively demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122331598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Using memory mapping to support cactus stacks in work-stealing runtime systems"
I. Lee, Silas Boyd-Wickizer, Zhiyi Huang, C. Leiserson
PACT 2010. DOI: 10.1145/1854273.1854324

Many multithreaded concurrency platforms that use a work-stealing runtime system incorporate a "cactus stack," wherein a function's accesses to stack variables properly respect the function's calling ancestry, even when many of the functions operate in parallel. Unfortunately, existing concurrency platforms fail to satisfy at least one of the following three desirable criteria: (1) full interoperability with legacy or third-party serial binaries that have been compiled to use an ordinary linear stack; (2) a scheduler that provides near-perfect linear speedup on applications with sufficient parallelism; and (3) bounded and efficient use of memory for the cactus stack. We have addressed this cactus-stack problem by modifying the Linux operating system kernel to provide support for thread-local memory mapping (TLMM), and we have used TLMM to reimplement the cactus stack in the open-source Cilk-5 runtime system. The resulting Cilk-M runtime system removes the linguistic distinction imposed by Cilk-5 between serial code and parallel code, erases Cilk-5's limitation that serial code cannot call parallel code, and provides full compatibility with existing serial calling conventions. The Cilk-M runtime system provides strong guarantees on scheduler performance and stack space. Benchmark results indicate that the performance of the prototype Cilk-M 1.0 is comparable to the Cilk 5.4.6 system and that its consumption of stack space is modest.
{"title":"Using memory mapping to support cactus stacks in work-stealing runtime systems","authors":"I. Lee, Silas Boyd-Wickizer, Zhiyi Huang, C. Leiserson","doi":"10.1145/1854273.1854324","DOIUrl":"https://doi.org/10.1145/1854273.1854324","url":null,"abstract":"Many multithreaded concurrency platforms that use a work-stealing runtime system incorporate a “cactus stack,” wherein a function's accesses to stack variables properly respect the function's calling ancestry, even when many of the functions operate in parallel. Unfortunately, such existing concurrency platforms fail to satisfy at least one of the following three desirable criteria: † full interoperability with legacy or third-party serial binaries that have been compiled to use an ordinary linear stack, † a scheduler that provides near-perfect linear speedup on applications with sufficient parallelism, and † bounded and efficient use of memory for the cactus stack. We have addressed this cactus-stack problem by modifying the Linux operating system kernel to provide support for thread-local memory mapping (TLMM). We have used TLMM to reimplement the cactus stack in the open-source Cilk-5 runtime system. The Cilk-M runtime system removes the linguistic distinction imposed by Cilk-5 between serial code and parallel code, erases Cilk-5's limitation that serial code cannot call parallel code, and provides full compatibility with existing serial calling conventions. The Cilk-M runtime system provides strong guarantees on scheduler performance and stack space. Benchmark results indicate that the performance of the prototype Cilk-M 1.0 is comparable to the Cilk 5.4.6 system, and the consumption of stack space is modest.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126313047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Handling the problems and opportunities posed by multiple on-chip memory controllers"
M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, A. Davis
PACT 2010. DOI: 10.1145/1854273.1854314

Modern processors such as Tilera's Tile64, Intel's Nehalem, and AMD's Opteron are migrating memory controllers (MCs) on-chip while maintaining a large, flat memory address space. This trend toward multiple MCs will likely continue, and a core or socket will consequently need to route memory requests to the appropriate MC via an inter- or intra-socket interconnect fabric similar to AMD's HyperTransport™ or Intel's QuickPath Interconnect™. Such systems are therefore subject to non-uniform memory access (NUMA) latencies because of the time spent traveling to remote MCs. Each MC acts as the gateway to a particular slice of the physical memory, so data placement becomes increasingly critical in minimizing memory access latencies. To date, no prior work has examined the effects of data placement among multiple MCs in such systems. Future chip multiprocessors are likely to comprise multiple MCs and an even larger number of cores, which will increase the memory access latency variation in these systems. Proper allocation of workload data to the appropriate MC will be important in reducing the latency of memory service requests, and the allocation strategy will need to be aware of queuing delays, on-chip latencies, and row-buffer hit rates at each MC. In this paper, we propose dynamic mechanisms that take these factors into account when placing data in appropriate slices of the physical memory. We introduce adaptive first-touch page placement and dynamic page migration to reduce DRAM access delays in multi-MC systems. These policies yield average performance improvements of 17% for adaptive first-touch page placement and 35% for dynamic page migration.
{"title":"Handling the problems and opportunities posed by multiple on-chip memory controllers","authors":"M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, A. Davis","doi":"10.1145/1854273.1854314","DOIUrl":"https://doi.org/10.1145/1854273.1854314","url":null,"abstract":"Modern processors such as Tilera's Tile64, Intel's Nehalem, and AMD's Opteron are migrating memory controllers (MCs) on-chip, while maintaining a large, at memory address space. This trend to utilize multiple MCs will likely continue and a core or socket will consequently need to route memory requests to the appropriate MC via an inter- or intra-socket interconnect fabric similar to AMD's HyperTransport™, or Intel's Quick-Path Interconnect™. Such systems are therefore subject to non-uniform memory access (NUMA) latencies because of the time spent traveling to remote MCs. Each MC will act as the gateway to a particular piece of the physical memory. Data placement will therefore become increasingly critical in minimizing memory access latencies. To date, no prior work has examined the effects of data placement among multiple MCs in such systems. Future chip-multiprocessors are likely to comprise multiple MCs and an even larger number of cores. This trend will increase the memory access latency variation in these systems. Proper allocation of workload data to the appropriate MC will be important in reducing the latency of memory service requests. The allocation strategy will need to be aware of queuing delays, on-chip latencies, and row-buffer hit-rates for each MC. In this paper, we propose dynamic mechanisms that take these factors into account when placing data in appropriate slices of the physical memory. We introduce adaptive first-touch page placement, and dynamic page-migration mechanisms to reduce DRAM access delays for multi-MC systems. These policies yield average performance improvements of 17% for adaptive first-touch page-placement, and 35% for a dynamic page-migration policy.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131490986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Subspace snooping: Filtering snoops with operating system support"
Daehoon Kim, Jeongseob Ahn, Jaehong Kim, Jaehyuk Huh
PACT 2010. DOI: 10.1145/1854273.1854292

Although snoop-based coherence protocols provide fast cache-to-cache transfers with a simple and robust coherence mechanism, scaling these protocols has been difficult due to the overheads of broadcast snooping. In this paper, we propose a coherence filtering technique called subspace snooping, which stores the potential sharers of each memory page in its page table entry. Using the sharer information in the page table entry, coherence transactions for a page generate snoop requests only to that subset of nodes in the system (its subspace). However, the coherence subspace of a page may evolve as application phases change or as the operating system migrates threads to different nodes. To adjust subspaces dynamically, subspace snooping supports a shrinking mechanism that removes obsolete nodes from subspaces. Subspace snooping can be integrated into any type of coherence protocol and network topology; because it guarantees that a subspace always contains the precise sharers of a page, it does not restrict the design of coherence protocols or networks. We evaluate subspace snooping with Token Coherence on unordered mesh networks. For scientific and server applications on a 16-core system, subspace snooping eliminates 44% of snoops on average.
{"title":"Subspace snooping: Filtering snoops with operating system support","authors":"Daehoon Kim, Jeongseob Ahn, Jaehong Kim, Jaehyuk Huh","doi":"10.1145/1854273.1854292","DOIUrl":"https://doi.org/10.1145/1854273.1854292","url":null,"abstract":"Although snoop-based coherence protocols provide fast cache-to-cache transfers with a simple and robust coherence mechanism, scaling the protocols has been difficult due to the overheads of broadcast snooping. In this paper, we propose a coherence filtering technique called subspace snooping, which stores the potential sharers of each memory page in the page table entry. By using the sharer information in the page table entry, coherence transactions for a page generate snoop requests only to the subset of nodes in the system (subspace). However, the coherence subspace of a page may evolve, as the phases of applications may change or the operating system may migrate threads to different nodes. To adjust subspaces dynamically, subspace snooping supports a shrinking mechanism, which removes obsolete nodes from subspaces. Subspace snooping can be integrated to any type of coherence protocols and network topologies. As subspace snooping guarantees that a subspace always contains the precise sharers of a page, it does not restrict the designs of coherence protocols and networks. We evaluate subspace snooping with Token Coherence on un-ordered mesh networks. For scientific and server applications on a 16-core system, subspace snooping reduces 44% of snoops on average.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"161 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131658238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Criticality-driven superscalar design space exploration"
Sandeep Navada, N. Choudhary, E. Rotenberg
PACT 2010. DOI: 10.1145/1854273.1854308

It has become increasingly difficult to perform design space exploration (DSE) of computer systems with a short turnaround time because of exploding design spaces, increasing design complexity, and long-running workloads. Researchers have used classical search/optimization techniques such as simulated annealing and genetic algorithms to accelerate DSE. While these techniques are better than an exhaustive search, a substantial amount of time must still be dedicated to DSE, which is a serious bottleneck in reducing research and development time. These techniques do not perform DSE quickly enough primarily because they treat the computer system as a "black box," leveraging no insight into how its design parameters interact to increase or degrade performance at a given design point.
{"title":"Criticality-driven superscalar design space exploration","authors":"Sandeep Navada, N. Choudhary, E. Rotenberg","doi":"10.1145/1854273.1854308","DOIUrl":"https://doi.org/10.1145/1854273.1854308","url":null,"abstract":"It has become increasingly difficult to perform design space exploration (DSE) of computer systems with a short turnaround time because of exploding design spaces, increasing design complexity and long-running workloads. Researchers have used classical search/optimization techniques like simulated annealing, genetic algorithms, etc., to accelerate the DSE. While these techniques are better than an exhaustive search, a substantial amount of time must still be dedicated to DSE. This is a serious bottleneck in reducing research/development time. These techniques do not perform the DSE quickly enough, primarily because they do not leverage any insight as to how the different design parameters of a computer system interact to increase or degrade performance at a design point and treat the computer system as a “black-box”.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134633542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Avoiding deadlock avoidance"
Hari K. Pyla, S. Varadarajan
PACT 2010. DOI: 10.1145/1854273.1854288

The evolution of processor architectures from single-core designs with increasing clock frequencies to multi-core designs with relatively stable clock frequencies has fundamentally altered application design. Since application programmers can no longer rely on clock frequency increases to boost performance, the last several years have seen significant emphasis on application-level threading to achieve performance gains. A core problem with concurrent programming using threads is the potential for deadlocks. Even well-written codes that spend an inordinate amount of effort on deadlock avoidance cannot always avoid deadlocks, particularly when the order of lock acquisitions is not known a priori. Furthermore, arbitrarily composing lock-based codes may result in deadlock - one of the primary motivations for transactional memory. In this paper, we present a language-independent runtime system called Sammati that provides automatic deadlock detection and recovery for threaded applications that use the POSIX threads (pthreads) interface - the de facto standard for UNIX systems. The runtime is implemented as a preloadable library and requires neither the application source code nor recompiling/relinking phases, enabling its use with existing applications under arbitrary multi-threading models. Performance evaluation of the runtime with unmodified SPLASH, Phoenix, and synthetic benchmark suites shows that it is scalable, with speedup comparable to baseline execution and modest memory overhead.
{"title":"Avoiding deadlock avoidance","authors":"Hari K. Pyla, S. Varadarajan","doi":"10.1145/1854273.1854288","DOIUrl":"https://doi.org/10.1145/1854273.1854288","url":null,"abstract":"The evolution of processor architectures from single core designs with increasing clock frequencies to multi-core designs with relatively stable clock frequencies has fundamentally altered application design. Since application programmers can no longer rely on clock frequency increases to boost performance, over the last several years, there has been significant emphasis on application level threading to achieve performance gains. A core problem with concurrent programming using threads is the potential for deadlocks. Even well-written codes that spend an inordinate amount of effort in deadlock avoidance cannot always avoid deadlocks, particularly when the order of lock acquisitions is not known a priori. Furthermore, arbitrarily composing lock based codes may result in deadlock - one of the primary motivations for transactional memory. In this paper, we present a language independent runtime system called Sammati that provides automatic deadlock detection and recovery for threaded applications that use the POSIX threads (pthreads) interface - the de facto standard for UNIX systems. The runtime is implemented as a pre-loadable library and does not require either the application source code or recompiling/relinking phases, enabling its use for existing applications with arbitrary multi-threading models. Performance evaluation of the runtime with unmodified SPLASH, Phoenix and synthetic benchmark suites shows that it is scalable, with speedup comparable to baseline execution with modest memory overhead.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"58 32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126781434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"NoC-aware cache design for chip multiprocessors"
Ahmed Abousamra, R. Melhem, A. Jones
PACT 2010. DOI: 10.1145/1854273.1854354

The performance of chip multiprocessors (CMPs) depends on data access latency, which in turn depends heavily on the design of the on-chip interconnect (NoC) and the organization of the memory caches. However, prior research has mostly attempted to optimize the performance of the NoC and the cache in isolation from each other. In this work we present a NoC-aware cache design that focuses on communication locality, a property that both the cache and the NoC affect and can exploit.
{"title":"NoC-aware cache design for chip multiprocessors","authors":"Ahmed Abousamra, R. Melhem, A. Jones","doi":"10.1145/1854273.1854354","DOIUrl":"https://doi.org/10.1145/1854273.1854354","url":null,"abstract":"The performance of chip multiprocessors (CMPs) is dependent on the data access latency, which is highly dependent on the design of the on-chip interconnect (NoC) and the organization of the memory caches. However, prior research attempts to optimize the performance of the NoC and cache mostly in isolation of each other. In this work we present a NoC-aware cache design that focuses on communication locality; a property both the cache and NoC affect and can exploit.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130901602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"MEDICS: Ultra-portable processing for medical image reconstruction"
Ganesh S. Dasika, Ankit Sethia, Vincentius Robby, T. Mudge, S. Mahlke
PACT 2010. DOI: 10.1145/1854273.1854299

Medical imaging provides physicians with the ability to generate 3D images of the human body in order to detect and diagnose a wide variety of ailments. Making medical imaging portable and more accessible poses a unique set of challenges. To increase portability, the power consumed in image acquisition - currently the most power-consuming activity in an imaging device - must be dramatically reduced. This can only be done, however, by using complex image reconstruction algorithms to correct artifacts introduced by low-power acquisition, which makes image processing the dominant power-consuming task. Current solutions use combinations of digital signal processors, general-purpose processors, and, more recently, general-purpose graphics processing units for medical image processing. These solutions fall short for various reasons, including high power consumption and an inability to execute the next generation of image reconstruction algorithms. This paper presents MEDICS, a domain-specific multicore architecture designed specifically for medical imaging applications yet general enough to be programmable. The goal is to achieve 100 GFLOPS of performance while consuming orders of magnitude less power than existing solutions. MEDICS sustains a throughput of 128 GFLOPS while consuming as little as 1.6 W of power on advanced CT reconstruction applications, representing up to a 20X increase in computation efficiency over current designs.
{"title":"MEDICS: Ultra-portable processing for medical image reconstruction","authors":"Ganesh S. Dasika, Ankit Sethia, Vincentius Robby, T. Mudge, S. Mahlke","doi":"10.1145/1854273.1854299","DOIUrl":"https://doi.org/10.1145/1854273.1854299","url":null,"abstract":"Medical imaging provides physicians with the ability to generate 3D images of the human body in order to detect and diagnose a wide variety of ailments. Making medical imaging portable and more accessible provides a unique set of challenges. In order to increase portability, the power consumed in image acquisition - currently the most power-consuming activity in an imaging device - must be dramatically reduced. This can only be done, however, by using complex image reconstruction algorithms to correct artifacts introduced by low-power acquisition, resulting in image processing becoming the dominant power-consuming task. Current solutions use combinations of digital signal processors, general-purpose processors and, more recently, general-purpose graphics processing units for medical image processing. These solutions fall short for various reasons including high power consumption and an inability to execute the next generation of image reconstruction algorithms. This paper presents the MEDICS architecture - a domain-specific multicore architecture designed specifically for medical imaging applications, but with sufficient generality tomake it programmable. The goal is to achieve 100 GFLOPs of performance while consuming orders of magnitude less power than the existing solutions. MEDICS has a throughput of 128 GFLOPs while consuming as little as 1.6W of power on advanced CT reconstruction applications. This represents up to a 20X increase in computation efficiency over current designs.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131174015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"NUcache: A multicore cache organization based on Next-Use distance"
R. Manikantan, K. Rajan, R. Govindarajan
PACT 2010. DOI: 10.1145/1854273.1854356

In this work, we propose a new organization for the last-level shared cache of a multicore system. Our design is based on the observation that the Next-Use distance of a line - measured as the number of intervening misses between the line's eviction and its next use - falls within a predictable range of values for lines brought in by a given delinquent PC. We exploit this correlation to improve the performance of shared caches in multi-core architectures with the proposed NUcache organization.
{"title":"NUcache: A multicore cache organization based on Next-Use distance","authors":"R. Manikantan, K. Rajan, R. Govindarajan","doi":"10.1145/1854273.1854356","DOIUrl":"https://doi.org/10.1145/1854273.1854356","url":null,"abstract":"In this work, we propose a new organization for the last level shared cache of a multicore system. Our design is based on the observation that the Next-Use distance, measured in terms of intervening misses between the eviction of a line and its next use, for lines brought in by a given delinquent PC falls within a predictable range of values. We exploit this correlation to improve the performance of shared caches in multi-core architectures by proposing the NUcache organization.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114730042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}