"Proximity coherence for chip multiprocessors"
Nick Barrow-Williams
PACT 2010. DOI: 10.1145/1854273.1854293

Many-core architectures provide an efficient way of harnessing the increasing numbers of transistors available in modern fabrication processes. While they are similar to multi-node systems, they exhibit different communication latency and storage characteristics, opening design opportunities that were previously not feasible. Traditional cache coherence protocols, although often used in many-core designs, were developed in the context of multi-node systems, and as such they seldom take advantage of the new possibilities that many-core architectures offer. We propose Proximity Coherence, a scheme in which L1 load misses are optimistically forwarded to nearby caches via new dedicated links rather than always being indirected via a directory structure. The optimization is made possible because a local cache access costs about as much as a traversal of the on-chip network. Coherence is maintained using lightweight graph structures embedded in the L1 caches. We compare our Proximity Coherence protocol to an existing directory-based MESI protocol using full-system simulations of a 32-core system. Our extension lowers the latency of L1 cache load misses by up to 32% and reduces the bytes transferred on the global on-chip interconnect by up to 19% for a range of parallel benchmarks. Employing Proximity Coherence improves execution time by up to 13%, reduces cache hierarchy energy consumption by up to 30%, and delivers a more efficient solution to the challenge of coherence in chip multiprocessors.
{"title":"Proximity coherence for chip multiprocessors","authors":"Nick Barrow-Williams","doi":"10.1145/1854273.1854293","DOIUrl":"https://doi.org/10.1145/1854273.1854293","url":null,"abstract":"Many-core architectures provide an efficient way of harnessing the increasing numbers of transistors available in modern fabrication processes. While they are similar to multi-node systems, they exhibit different communication latency and storage characteristics, providing new design opportunities that were previously not feasible. Traditional cache coherence protocols, although often used in many-core designs, have been developed in the context of multi-node systems. As such, they seldom take advantage of the new possibilities that many-core architectures offer. We propose Proximity Coherence, a scheme in which L1 load misses are optimistically forwarded to nearby caches via new dedicated links rather than always being indirected via a directory structure. Such an optimization is made possible by the comparable cost of local cache accesses with the use of on-chip network resources. Coherency is maintained using lightweight graph structures embedded in the L1 caches. We compare our Proximity Coherence protocol to an existing directory-based MESI protocol using full-system simulations of a 32 core system. Our extension lowers the latency of L1 cache load misses by up to 32% while reducing the bytes transferred on the global on-chip interconnect by up to 19% for a range of parallel benchmarks. Employing Proximity Coherence provides execution time improvements of up to 13%, reduces cache hierarchy energy consumption by up to 30% and delivers a more efficient solution to the challenge of coherence in chip multiprocessors.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126123494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Believe it or not! multi-core CPUs can match GPU performance for a FLOP-intensive application!"
R. Bordawekar, Uday Bondhugula, R. Rao
PACT 2010. DOI: 10.1145/1854273.1854340

In this paper, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference image. We implement the algorithm on an NVIDIA GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized versions, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The best-performing versions on the Power7, Nehalem, and GTX 285 run in 1.02 s, 1.82 s, and 1.22 s, respectively. The algorithm's performance on the GPU suffers from (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. These results demonstrate that, under certain conditions, a FLOP-intensive structured application running on a multi-core processor can match or even beat the performance of an equivalent GPU version.
{"title":"Believe it or not! multi-core CPUs can match GPU performance for a FLOP-intensive application!","authors":"R. Bordawekar, Uday Bondhugula, R. Rao","doi":"10.1145/1854273.1854340","DOIUrl":"https://doi.org/10.1145/1854273.1854340","url":null,"abstract":"In this paper, we evaluate performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. We implement this algorithm on a nVidia GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.22s, respectively. The performance of this algorithm on the nVidia GPU suffers from: (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. These results conclusively demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122331598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Using memory mapping to support cactus stacks in work-stealing runtime systems"
I. Lee, Silas Boyd-Wickizer, Zhiyi Huang, C. Leiserson
PACT 2010. DOI: 10.1145/1854273.1854324

Many multithreaded concurrency platforms that use a work-stealing runtime system incorporate a "cactus stack," wherein a function's accesses to stack variables properly respect the function's calling ancestry, even when many of the functions operate in parallel. Unfortunately, existing concurrency platforms fail to satisfy at least one of the following three desirable criteria: (1) full interoperability with legacy or third-party serial binaries that have been compiled to use an ordinary linear stack; (2) a scheduler that provides near-perfect linear speedup on applications with sufficient parallelism; and (3) bounded and efficient use of memory for the cactus stack. We have addressed this cactus-stack problem by modifying the Linux operating system kernel to provide support for thread-local memory mapping (TLMM), and we have used TLMM to reimplement the cactus stack in the open-source Cilk-5 runtime system. The resulting Cilk-M runtime system removes the linguistic distinction imposed by Cilk-5 between serial code and parallel code, erases Cilk-5's limitation that serial code cannot call parallel code, and provides full compatibility with existing serial calling conventions. The Cilk-M runtime system provides strong guarantees on scheduler performance and stack space. Benchmark results indicate that the performance of the prototype Cilk-M 1.0 is comparable to the Cilk 5.4.6 system and that its consumption of stack space is modest.
{"title":"Using memory mapping to support cactus stacks in work-stealing runtime systems","authors":"I. Lee, Silas Boyd-Wickizer, Zhiyi Huang, C. Leiserson","doi":"10.1145/1854273.1854324","DOIUrl":"https://doi.org/10.1145/1854273.1854324","url":null,"abstract":"Many multithreaded concurrency platforms that use a work-stealing runtime system incorporate a “cactus stack,” wherein a function's accesses to stack variables properly respect the function's calling ancestry, even when many of the functions operate in parallel. Unfortunately, such existing concurrency platforms fail to satisfy at least one of the following three desirable criteria: † full interoperability with legacy or third-party serial binaries that have been compiled to use an ordinary linear stack, † a scheduler that provides near-perfect linear speedup on applications with sufficient parallelism, and † bounded and efficient use of memory for the cactus stack. We have addressed this cactus-stack problem by modifying the Linux operating system kernel to provide support for thread-local memory mapping (TLMM). We have used TLMM to reimplement the cactus stack in the open-source Cilk-5 runtime system. The Cilk-M runtime system removes the linguistic distinction imposed by Cilk-5 between serial code and parallel code, erases Cilk-5's limitation that serial code cannot call parallel code, and provides full compatibility with existing serial calling conventions. The Cilk-M runtime system provides strong guarantees on scheduler performance and stack space. Benchmark results indicate that the performance of the prototype Cilk-M 1.0 is comparable to the Cilk 5.4.6 system, and the consumption of stack space is modest.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126313047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Handling the problems and opportunities posed by multiple on-chip memory controllers"
M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, A. Davis
PACT 2010. DOI: 10.1145/1854273.1854314

Modern processors such as Tilera's Tile64, Intel's Nehalem, and AMD's Opteron are migrating memory controllers (MCs) on-chip while maintaining a large, flat memory address space. This trend toward multiple MCs will likely continue, and a core or socket will consequently need to route memory requests to the appropriate MC via an inter- or intra-socket interconnect fabric similar to AMD's HyperTransport™ or Intel's QuickPath Interconnect™. Such systems are therefore subject to non-uniform memory access (NUMA) latencies because of the time spent traveling to remote MCs. Each MC acts as the gateway to a particular slice of the physical memory, so data placement becomes increasingly critical in minimizing memory access latencies. To date, no prior work has examined the effects of data placement among multiple MCs in such systems. Future chip multiprocessors are likely to comprise multiple MCs and an even larger number of cores, which will increase the memory access latency variation in these systems. Proper allocation of workload data to the appropriate MC will be important in reducing the latency of memory service requests, and the allocation strategy will need to be aware of queuing delays, on-chip latencies, and row-buffer hit rates at each MC. In this paper, we propose dynamic mechanisms that take these factors into account when placing data in appropriate slices of the physical memory. We introduce adaptive first-touch page placement and dynamic page migration to reduce DRAM access delays in multi-MC systems. These policies yield average performance improvements of 17% for adaptive first-touch page placement and 35% for dynamic page migration.
{"title":"Handling the problems and opportunities posed by multiple on-chip memory controllers","authors":"M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, A. Davis","doi":"10.1145/1854273.1854314","DOIUrl":"https://doi.org/10.1145/1854273.1854314","url":null,"abstract":"Modern processors such as Tilera's Tile64, Intel's Nehalem, and AMD's Opteron are migrating memory controllers (MCs) on-chip, while maintaining a large, at memory address space. This trend to utilize multiple MCs will likely continue and a core or socket will consequently need to route memory requests to the appropriate MC via an inter- or intra-socket interconnect fabric similar to AMD's HyperTransport™, or Intel's Quick-Path Interconnect™. Such systems are therefore subject to non-uniform memory access (NUMA) latencies because of the time spent traveling to remote MCs. Each MC will act as the gateway to a particular piece of the physical memory. Data placement will therefore become increasingly critical in minimizing memory access latencies. To date, no prior work has examined the effects of data placement among multiple MCs in such systems. Future chip-multiprocessors are likely to comprise multiple MCs and an even larger number of cores. This trend will increase the memory access latency variation in these systems. Proper allocation of workload data to the appropriate MC will be important in reducing the latency of memory service requests. The allocation strategy will need to be aware of queuing delays, on-chip latencies, and row-buffer hit-rates for each MC. In this paper, we propose dynamic mechanisms that take these factors into account when placing data in appropriate slices of the physical memory. We introduce adaptive first-touch page placement, and dynamic page-migration mechanisms to reduce DRAM access delays for multi-MC systems. These policies yield average performance improvements of 17% for adaptive first-touch page-placement, and 35% for a dynamic page-migration policy.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131490986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Subspace snooping: Filtering snoops with operating system support"
Daehoon Kim, Jeongseob Ahn, Jaehong Kim, Jaehyuk Huh
PACT 2010. DOI: 10.1145/1854273.1854292

Although snoop-based coherence protocols provide fast cache-to-cache transfers with a simple and robust coherence mechanism, scaling these protocols has been difficult due to the overheads of broadcast snooping. In this paper, we propose a coherence filtering technique called subspace snooping, which stores the potential sharers of each memory page in its page table entry. Using the sharer information in the page table entry, coherence transactions for a page generate snoop requests only to that subset of nodes in the system (its subspace). However, the coherence subspace of a page may evolve as application phases change or as the operating system migrates threads to different nodes. To adjust subspaces dynamically, subspace snooping supports a shrinking mechanism that removes obsolete nodes from subspaces. Subspace snooping can be integrated into any type of coherence protocol and network topology; because it guarantees that a subspace always contains the precise sharers of a page, it does not restrict the design of coherence protocols or networks. We evaluate subspace snooping with Token Coherence on unordered mesh networks. For scientific and server applications on a 16-core system, subspace snooping eliminates 44% of snoops on average.
{"title":"Subspace snooping: Filtering snoops with operating system support","authors":"Daehoon Kim, Jeongseob Ahn, Jaehong Kim, Jaehyuk Huh","doi":"10.1145/1854273.1854292","DOIUrl":"https://doi.org/10.1145/1854273.1854292","url":null,"abstract":"Although snoop-based coherence protocols provide fast cache-to-cache transfers with a simple and robust coherence mechanism, scaling the protocols has been difficult due to the overheads of broadcast snooping. In this paper, we propose a coherence filtering technique called subspace snooping, which stores the potential sharers of each memory page in the page table entry. By using the sharer information in the page table entry, coherence transactions for a page generate snoop requests only to the subset of nodes in the system (subspace). However, the coherence subspace of a page may evolve, as the phases of applications may change or the operating system may migrate threads to different nodes. To adjust subspaces dynamically, subspace snooping supports a shrinking mechanism, which removes obsolete nodes from subspaces. Subspace snooping can be integrated to any type of coherence protocols and network topologies. As subspace snooping guarantees that a subspace always contains the precise sharers of a page, it does not restrict the designs of coherence protocols and networks. We evaluate subspace snooping with Token Coherence on un-ordered mesh networks. For scientific and server applications on a 16-core system, subspace snooping reduces 44% of snoops on average.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"161 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131658238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Criticality-driven superscalar design space exploration"
Sandeep Navada, N. Choudhary, E. Rotenberg
PACT 2010. DOI: 10.1145/1854273.1854308

It has become increasingly difficult to perform design space exploration (DSE) of computer systems with a short turnaround time because of exploding design spaces, increasing design complexity, and long-running workloads. Researchers have used classical search/optimization techniques such as simulated annealing and genetic algorithms to accelerate DSE. While these techniques are better than an exhaustive search, a substantial amount of time must still be dedicated to DSE, which is a serious bottleneck in reducing research and development time. These techniques do not perform DSE quickly enough primarily because they treat the computer system as a "black box," leveraging no insight into how its design parameters interact to increase or degrade performance at a given design point.
{"title":"Criticality-driven superscalar design space exploration","authors":"Sandeep Navada, N. Choudhary, E. Rotenberg","doi":"10.1145/1854273.1854308","DOIUrl":"https://doi.org/10.1145/1854273.1854308","url":null,"abstract":"It has become increasingly difficult to perform design space exploration (DSE) of computer systems with a short turnaround time because of exploding design spaces, increasing design complexity and long-running workloads. Researchers have used classical search/optimization techniques like simulated annealing, genetic algorithms, etc., to accelerate the DSE. While these techniques are better than an exhaustive search, a substantial amount of time must still be dedicated to DSE. This is a serious bottleneck in reducing research/development time. These techniques do not perform the DSE quickly enough, primarily because they do not leverage any insight as to how the different design parameters of a computer system interact to increase or degrade performance at a design point and treat the computer system as a “black-box”.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134633542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Avoiding deadlock avoidance"
Hari K. Pyla, S. Varadarajan
PACT 2010. DOI: 10.1145/1854273.1854288

The evolution of processor architectures from single-core designs with increasing clock frequencies to multi-core designs with relatively stable clock frequencies has fundamentally altered application design. Since application programmers can no longer rely on clock frequency increases to boost performance, the last several years have seen significant emphasis on application-level threading to achieve performance gains. A core problem with concurrent programming using threads is the potential for deadlocks. Even well-written codes that spend an inordinate amount of effort on deadlock avoidance cannot always avoid deadlocks, particularly when the order of lock acquisitions is not known a priori. Furthermore, arbitrarily composing lock-based codes may result in deadlock - one of the primary motivations for transactional memory. In this paper, we present a language-independent runtime system called Sammati that provides automatic deadlock detection and recovery for threaded applications that use the POSIX threads (pthreads) interface - the de facto standard for UNIX systems. The runtime is implemented as a preloadable library and requires neither the application source code nor recompiling/relinking phases, enabling its use with existing applications under arbitrary multi-threading models. Performance evaluation of the runtime with unmodified SPLASH, Phoenix, and synthetic benchmark suites shows that it is scalable, with speedup comparable to baseline execution and modest memory overhead.
{"title":"Avoiding deadlock avoidance","authors":"Hari K. Pyla, S. Varadarajan","doi":"10.1145/1854273.1854288","DOIUrl":"https://doi.org/10.1145/1854273.1854288","url":null,"abstract":"The evolution of processor architectures from single core designs with increasing clock frequencies to multi-core designs with relatively stable clock frequencies has fundamentally altered application design. Since application programmers can no longer rely on clock frequency increases to boost performance, over the last several years, there has been significant emphasis on application level threading to achieve performance gains. A core problem with concurrent programming using threads is the potential for deadlocks. Even well-written codes that spend an inordinate amount of effort in deadlock avoidance cannot always avoid deadlocks, particularly when the order of lock acquisitions is not known a priori. Furthermore, arbitrarily composing lock based codes may result in deadlock - one of the primary motivations for transactional memory. In this paper, we present a language independent runtime system called Sammati that provides automatic deadlock detection and recovery for threaded applications that use the POSIX threads (pthreads) interface - the de facto standard for UNIX systems. The runtime is implemented as a pre-loadable library and does not require either the application source code or recompiling/relinking phases, enabling its use for existing applications with arbitrary multi-threading models. Performance evaluation of the runtime with unmodified SPLASH, Phoenix and synthetic benchmark suites shows that it is scalable, with speedup comparable to baseline execution with modest memory overhead.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"58 32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126781434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"NoC-aware cache design for chip multiprocessors"
Ahmed Abousamra, R. Melhem, A. Jones
PACT 2010. DOI: 10.1145/1854273.1854354

The performance of chip multiprocessors (CMPs) depends on data access latency, which in turn depends heavily on the design of the on-chip interconnect (NoC) and the organization of the memory caches. However, prior research has mostly attempted to optimize the performance of the NoC and the cache in isolation from each other. In this work we present a NoC-aware cache design that focuses on communication locality, a property that both the cache and the NoC affect and can exploit.
{"title":"NoC-aware cache design for chip multiprocessors","authors":"Ahmed Abousamra, R. Melhem, A. Jones","doi":"10.1145/1854273.1854354","DOIUrl":"https://doi.org/10.1145/1854273.1854354","url":null,"abstract":"The performance of chip multiprocessors (CMPs) is dependent on the data access latency, which is highly dependent on the design of the on-chip interconnect (NoC) and the organization of the memory caches. However, prior research attempts to optimize the performance of the NoC and cache mostly in isolation of each other. In this work we present a NoC-aware cache design that focuses on communication locality; a property both the cache and NoC affect and can exploit.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130901602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"MEDICS: Ultra-portable processing for medical image reconstruction"
Ganesh S. Dasika, Ankit Sethia, Vincentius Robby, T. Mudge, S. Mahlke
PACT 2010. DOI: 10.1145/1854273.1854299

Medical imaging provides physicians with the ability to generate 3D images of the human body in order to detect and diagnose a wide variety of ailments. Making medical imaging portable and more accessible poses a unique set of challenges. To increase portability, the power consumed in image acquisition - currently the most power-consuming activity in an imaging device - must be dramatically reduced. This can only be done, however, by using complex image reconstruction algorithms to correct artifacts introduced by low-power acquisition, which makes image processing the dominant power-consuming task. Current solutions use combinations of digital signal processors, general-purpose processors, and, more recently, general-purpose graphics processing units for medical image processing. These solutions fall short for various reasons, including high power consumption and an inability to execute the next generation of image reconstruction algorithms. This paper presents MEDICS, a domain-specific multicore architecture designed specifically for medical imaging applications yet general enough to be programmable. The goal is to achieve 100 GFLOPS of performance while consuming orders of magnitude less power than existing solutions. MEDICS sustains a throughput of 128 GFLOPS while consuming as little as 1.6 W of power on advanced CT reconstruction applications, representing up to a 20X increase in computation efficiency over current designs.
{"title":"MEDICS: Ultra-portable processing for medical image reconstruction","authors":"Ganesh S. Dasika, Ankit Sethia, Vincentius Robby, T. Mudge, S. Mahlke","doi":"10.1145/1854273.1854299","DOIUrl":"https://doi.org/10.1145/1854273.1854299","url":null,"abstract":"Medical imaging provides physicians with the ability to generate 3D images of the human body in order to detect and diagnose a wide variety of ailments. Making medical imaging portable and more accessible provides a unique set of challenges. In order to increase portability, the power consumed in image acquisition - currently the most power-consuming activity in an imaging device - must be dramatically reduced. This can only be done, however, by using complex image reconstruction algorithms to correct artifacts introduced by low-power acquisition, resulting in image processing becoming the dominant power-consuming task. Current solutions use combinations of digital signal processors, general-purpose processors and, more recently, general-purpose graphics processing units for medical image processing. These solutions fall short for various reasons including high power consumption and an inability to execute the next generation of image reconstruction algorithms. This paper presents the MEDICS architecture - a domain-specific multicore architecture designed specifically for medical imaging applications, but with sufficient generality tomake it programmable. The goal is to achieve 100 GFLOPs of performance while consuming orders of magnitude less power than the existing solutions. MEDICS has a throughput of 128 GFLOPs while consuming as little as 1.6W of power on advanced CT reconstruction applications. This represents up to a 20X increase in computation efficiency over current designs.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131174015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"NUcache: A multicore cache organization based on Next-Use distance"
R. Manikantan, K. Rajan, R. Govindarajan
PACT 2010. DOI: 10.1145/1854273.1854356

In this work, we propose a new organization for the last-level shared cache of a multicore system. Our design is based on the observation that the Next-Use distance of a line - measured as the number of intervening misses between the line's eviction and its next use - falls within a predictable range of values for lines brought in by a given delinquent PC. We exploit this correlation to improve the performance of shared caches in multi-core architectures with the proposed NUcache organization.
{"title":"NUcache: A multicore cache organization based on Next-Use distance","authors":"R. Manikantan, K. Rajan, R. Govindarajan","doi":"10.1145/1854273.1854356","DOIUrl":"https://doi.org/10.1145/1854273.1854356","url":null,"abstract":"In this work, we propose a new organization for the last level shared cache of a multicore system. Our design is based on the observation that the Next-Use distance, measured in terms of intervening misses between the eviction of a line and its next use, for lines brought in by a given delinquent PC falls within a predictable range of values. We exploit this correlation to improve the performance of shared caches in multi-core architectures by proposing the NUcache organization.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114730042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}