POSTER: Exploiting Approximations for Energy/Quality Tradeoffs in Service-Based Applications (doi: 10.1109/PACT.2017.57)
L. Liu, Sibren Isaacman, A. Bhattacharjee, U. Kremer
Approximations and redundancies allow mobile and distributed applications to produce answers or outcomes of lesser quality at lower costs. This paper introduces RAPID, a new programming framework and methodology for service-based applications with approximations and redundancies. Finding the best service configuration under a given resource budget becomes a constrained, dual-weight graph optimization problem.
{"title":"POSTER: Exploiting Approximations for Energy/Quality Tradeoffs in Service-Based Applications","authors":"L. Liu, Sibren Isaacman, A. Bhattacharjee, U. Kremer","doi":"10.1109/PACT.2017.57","DOIUrl":"https://doi.org/10.1109/PACT.2017.57","url":null,"abstract":"Approximations and redundancies allow mobile and distributed applications to produce answers or outcomes of lesser quality at lower costs. This paper introduces RAPID, a new programming framework and methodology for service-based applications with approximations and redundancies. Finding the best service configuration under a given resource budget becomes a constrained, dual-weight graph optimization problem.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"31 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121274389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
End-to-End Deep Learning of Optimization Heuristics (doi: 10.1109/PACT.2017.24)
Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather
Accurate automatic optimization heuristics are necessary for dealing with the complexity and diversity of modern hardware and software. Machine learning is a proven technique for learning such heuristics, but its success is bound by the quality of the features used. These features must be hand-crafted by developers through a combination of expert domain knowledge and trial and error. This makes the quality of the final model directly dependent on the skill and available time of the system architect. Our work introduces a better way for building heuristics. We develop a deep neural network that learns heuristics over raw code, entirely without using code features. The neural network simultaneously constructs appropriate representations of the code and learns how best to optimize, removing the need for manual feature creation. Further, we show that our neural nets can transfer learning from one optimization problem to another, improving the accuracy of new models, without the help of human experts. We compare the effectiveness of our automatically generated heuristics against ones with features hand-picked by experts. We examine two challenging tasks: predicting optimal mapping for heterogeneous parallelism and GPU thread coarsening factors. In 89% of the cases, the quality of our fully automatic heuristics matches or surpasses that of state-of-the-art predictive models using hand-crafted features, providing on average 14% and 12% more performance with no human effort expended on designing features.
{"title":"End-to-End Deep Learning of Optimization Heuristics","authors":"Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather","doi":"10.1109/PACT.2017.24","DOIUrl":"https://doi.org/10.1109/PACT.2017.24","url":null,"abstract":"Accurate automatic optimization heuristics are necessary for dealing with thecomplexity and diversity of modern hardware and software. Machine learning is aproven technique for learning such heuristics, but its success is bound by thequality of the features used. These features must be hand crafted by developersthrough a combination of expert domain knowledge and trial and error. This makesthe quality of the final model directly dependent on the skill and availabletime of the system architect.Our work introduces a better way for building heuristics. We develop a deepneural network that learns heuristics over raw code, entirely without using codefeatures. The neural network simultaneously constructs appropriaterepresentations of the code and learns how best to optimize, removing the needfor manual feature creation. Further, we show that our neural nets can transferlearning from one optimization problem to another, improving the accuracy of newmodels, without the help of human experts.We compare the effectiveness of our automatically generated heuristics againstones with features hand-picked by experts. We examine two challenging tasks:predicting optimal mapping for heterogeneous parallelism and GPU threadcoarsening factors. In 89% of the cases, the quality of our fully automaticheuristics matches or surpasses that of state-of-the-art predictive models usinghand-crafted features, providing on average 14% and 12% more performance withno human effort expended on designing features.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133793982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large Scale Data Clustering Using Memristive k-Median Computation (doi: 10.1109/PACT.2017.52)
Yomi Karthik Rupesh, M. N. Bojnordi
Clustering is a crucial tool for analyzing data in virtually every scientific and engineering discipline. The U.S. National Academy of Sciences (NAS) has recently announced "the seven giants of statistical data analysis," in which data clustering plays a central role [1]. This research also emphasizes that more scalable solutions are required to enable time and space clustering for future large-scale data analyses. Therefore, hardware and software innovations are necessary to make future large-scale data analysis practical. This project proposes a novel mechanism for computing bit-serial medians within resistive RAM (RRAM) arrays with no need to read out the operands from the memory cells.
{"title":"Large Scale Data Clustering Using Memristive k-Median Computation","authors":"Yomi Karthik Rupesh, M. N. Bojnordi","doi":"10.1109/PACT.2017.52","DOIUrl":"https://doi.org/10.1109/PACT.2017.52","url":null,"abstract":"Clustering is a crucial tool for analyzing data in virtually every scientific and engineering discipline. The U.S. National Academy of Sciences (NAS) has recently announced \"the seven giants of statistical data analysis\" in which data clustering plays a central role [1]. This research also emphasizes that more scalable solutions are required to enable time and space clustering for the future large-scale data analyses. Therefore, hardware and software innovations are necessary to make the future large scale data analysis practical.This project proposes a novel mechanism for computing bit serial medians within resistive RAM (RRAM) arrays with no need to read out the operands from memory cells.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115003457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
POSTER: Bridging the Gap Between Deep Learning and Sparse Matrix Format Selection (doi: 10.1109/PACT.2017.33)
Yue Zhao, Jiajia Li, C. Liao, Xipeng Shen
In this work, we conduct a systematic exploration of the promise and challenges of deep learning for sparse matrix format selection. We propose a set of novel techniques to address the special challenges this problem poses to deep learning, including input matrix representations, a late-merging deep neural network structure design, and the use of transfer learning to alleviate cross-architecture portability issues.
{"title":"POSTER: Bridging the Gap Between Deep Learning and Sparse Matrix Format Selection","authors":"Yue Zhao, Jiajia Li, C. Liao, Xipeng Shen","doi":"10.1109/PACT.2017.33","DOIUrl":"https://doi.org/10.1109/PACT.2017.33","url":null,"abstract":"In this work, we conduct a systematic exploration on the promise and challenges of deep learning for the sparse matrix format selection. We propose a set of novel techniques to solve special challenges to deep learning, including input matrix representations, a late-merging deep neural network structure design, and the use of transfer learning to alleviate cross-architecture portability issues.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116023521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transparent Dual Memory Compression Architecture (doi: 10.1109/PACT.2017.12)
Seikwon Kim, Seonyoung Lee, Taehoon Kim, Jaehyuk Huh
The increasing memory requirements of big data applications have been driving the precipitous growth of memory capacity in server systems. To maximize the efficiency of external memory, HW-based memory compression techniques have been proposed to increase effective memory capacity. Although such memory compression techniques can improve memory efficiency significantly, a critical trade-off exists in HW-based compression. As memory blocks need to be decompressed as quickly as possible to serve cache misses, latency-optimized techniques apply compression at the cacheline granularity, achieving decompression latencies of less than a few cycles. However, such latency-optimized techniques can lose the potentially high compression ratios of capacity-optimized techniques, which compress larger memory blocks with longer-latency algorithms. Considering this fundamental trade-off in memory compression, this paper proposes a transparent dual memory compression (DMC) architecture, which selectively uses two compression algorithms with distinct latency and compression characteristics. Exploiting the locality of memory accesses, the proposed architecture compresses less frequently accessed blocks with a capacity-optimized compression algorithm, while keeping recently accessed blocks compressed with a latency-optimized one. Furthermore, instead of relying on support from the virtual memory system to locate compressed memory blocks, the study advocates a HW-based translation between the uncompressed address space and the compressed physical space. This OS-transparent approach eliminates conflicts between compression efficiency and the large-page support adopted to reduce TLB misses. The proposed compression architecture is applied to the Hybrid Memory Cube (HMC) with a logic layer under the stacked DRAMs. The experimental results show that the proposed compression architecture provides a 54% higher compression ratio than the state-of-the-art latency-optimized technique, with no performance degradation over the baseline system without compression.
{"title":"Transparent Dual Memory Compression Architecture","authors":"Seikwon Kim, Seonyoung Lee, Taehoon Kim, Jaehyuk Huh","doi":"10.1109/PACT.2017.12","DOIUrl":"https://doi.org/10.1109/PACT.2017.12","url":null,"abstract":"The increasing memory requirements of big data applications have been driving the precipitous growth of memory capacity in server systems. To maximize the efficiency of external memory, HW-based memory compression techniques have been proposed to increase effective memory capacity. Although such memory compression techniques can improve the memory efficiency significantly, a critical trade-off exists in the HW-based compression techniques. As the memory blocks need to be decompressed as quickly as possible to serve cache misses, latency-optimized techniques apply compression at the cacheline granularity, achieving the decompression latency of less than a few cycles. However, such latency-optimized techniques can lose the potential high compression ratios of capacity-optimized techniques, which compress larger memory blocks with longer latency algorithms.Considering the fundamental trade-off in the memory compression, this paper proposes a transparent dual memory compression (DMC) architecture, which selectively uses two compression algorithms with distinct latency and compression characteristics. Exploiting the locality of memory accesses, the proposed architecture compresses less frequently accessed blocks with a capacity-optimized compression algorithm, while keeping recently accessed blocks compressed with a latency-optimized one. Furthermore, instead of relying on the support from the virtual memory system to locate compressed memory blocks, the study advocates a HW-based translation between the uncompressed address space and compressed physical space. This OS-transparent approach eliminates conflicts between compression efficiency and large page support adopted to reduce TLB misses. The proposed compression architecture is applied to the Hybrid Memory Cube (HMC) with a logic layer under the stacked DRAMs. The experimental results show that the proposed compression architecture provides 54% higher compression ratio than the state-of-the-art latency-optimized technique, with no performance degradation over the baseline system without compression.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133082264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
POSTER: Bridge the Gap Between Neural Networks and Neuromorphic Hardware (doi: 10.1109/PACT.2017.59)
Yu Ji, Youhui Zhang, Wenguang Chen, Yuan Xie
Unlike training common neural networks (NNs) for inference on general-purpose processors, developing NNs for neuromorphic chips usually faces a number of hardware-specific restrictions. This paper proposes a systematic methodology to address this challenge. It can transform an existing trained, unrestricted NN (usually targeting a software execution substrate) into an equivalent network that meets the given hardware constraints, which decouples NN applications from the target hardware. We have built such a software tool that supports both spiking neural networks (SNNs) and traditional artificial neural networks (ANNs). Its effectiveness has been demonstrated with a real neuromorphic chip and a processor-in-memory (PIM) design. Tests show that the extra inference error caused by this solution is very limited and the transformation time is much less than the retraining time.
{"title":"POSTER: Bridge the Gap Between Neural Networks and Neuromorphic Hardware","authors":"Yu Ji, Youhui Zhang, Wenguang Chen, Yuan Xie","doi":"10.1109/PACT.2017.59","DOIUrl":"https://doi.org/10.1109/PACT.2017.59","url":null,"abstract":"Different from training common neural networks (NNs) for inference on general-purpose processors, the development of NNs for neuromorphic chips is usually faced with a number of hardware-specific restrictions. This paper proposes a systematic methodology to address the challenge. It can transform an existing trained, unrestricted NN (usually for software execution substrate) into an equivalent network that meets the given hardware constraints, which decouples NN applications from target hardware. We have built such a software tool that supports both spiking neural networks (SNNs) and traditional artificial neural networks (ANNs). Its effectiveness has been demonstrated with a real neuromorphic chip and a processor-in-memory(PIM) design. Tests show that the extra inference error caused by this solution is very limited and the transformation time is much less than the retraining time.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"311 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133199667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Redesigning Go's Built-In Map to Support Concurrent Operations (doi: 10.1109/PACT.2017.45)
Louis Jenkins, Tingzhe Zhou, Michael F. Spear
The Go language lacks built-in data structures that allow fine-grained concurrent access. In particular, its map data type, one of only two generic collections in Go, limits concurrency to the case where all operations are read-only; any mutation (insert, update, or remove) requires exclusive access to the entire map. The tight integration of this map into the Go language and runtime precludes its replacement with known scalable map implementations. This paper introduces the Interlocked Hash Table (IHT). The IHT is the result of language-driven data structure design: it requires minimal changes to the Go map API, supports the full range of operations available on the sequential Go map, and provides a path for the language to evolve to become more amenable to scalable computation over shared data structures. The IHT employs a novel optimistic locking protocol to avoid the risk of deadlock, allows large critical sections that access a single IHT element, and can easily support multikey atomic operations. These features come at the cost of relaxed, though still straightforward, iteration semantics. In experiments in both Java and Go, the IHT performs well, reaching up to 7× the performance of the state of the art in Go at 24 threads. In Java, the IHT performs on par with the best Java maps in the research literature, while providing iteration and other features absent from other maps.
{"title":"Redesigning Go’s Built-In Map to Support Concurrent Operations","authors":"Louis Jenkins, Tingzhe Zhou, Michael F. Spear","doi":"10.1109/PACT.2017.45","DOIUrl":"https://doi.org/10.1109/PACT.2017.45","url":null,"abstract":"The Go language lacks built-in data structures that allow fine-grained concurrent access. In particular, its map data type, one of only two generic collections in Go, limits concurrency to the case where all operations are read-only; any mutation (insert, update, or remove) requires exclusive access to the entire map. The tight integration of this map into the Go language and runtime precludes its replacement with known scalable map implementations.This paper introduces the Interlocked Hash Table (IHT). The IHT is the result of language-driven data structure design: it requires minimal changes to the Go map API, supports the full range of operations available on the sequential Go map, and provides a path for the language to evolve to become more amenable to scalable computation over shared data structures. The IHT employs a novel optimistic locking protocol to avoid the risk of deadlock, and allows large critical sections that access a single IHT element, and can easily support multikey atomic operations. These features come at the cost of relaxed, though still straightforward, iteration semantics. In experimentation in both Java and Go, the IHT performs well, reaching up to 7× the performance of the state of the art in Go at 24 threads. In Java, the IHT performs on par with the best Java maps in the research literature, while providing iteration and other features absent from other maps.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129333419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In-memory Data Flow Processor (doi: 10.1109/PACT.2017.53)
Daichi Fujiki, S. Mahlke, R. Das
The recent development of Non-Volatile Memories (NVMs) has opened up a new horizon for in-memory computing. By re-purposing memory structures, certain NVMs have been shown to have in-situ analog computation capability. For example, resistive memories (ReRAMs) store data in the form of the resistance of titanium oxides, and by injecting voltages into the word lines and sensing the resultant currents on the bit-lines, we obtain the dot-product of the input voltages and cell conductances using Kirchhoff's law. Recent works have explored the design space of ReRAM-based accelerators for machine learning algorithms by leveraging this dot-product functionality [2]. These ReRAM-based accelerators exploit the massive parallelism and relaxed precision requirements to provide orders-of-magnitude improvements over current CPU/GPU architectures and custom ASICs, in spite of their high read/write latency. Despite the significant performance gains offered by computational NVMs, previous works have relied on manual mapping of workloads to the memory arrays, making it difficult to configure them for new workloads. We combat this problem by proposing a programmable in-memory processor architecture and programming framework. The architecture consists of memory arrays grouped in tiles, and a custom interconnect to facilitate communication between the arrays. Each array acts as a unit of storage as well as a processing element. The proposed in-memory processor architecture is simple. The key challenge is developing a programming framework and a rich ISA that allow diverse data-parallel programs to leverage the underlying computational efficiency. The efficiency of the proposed in-memory processor comes from two sources. First, massive parallelism: NVMs are composed of several thousands of arrays, and each of these arrays is transformed into an ALU that can compute concurrently. Second, reduction in data movement, by avoiding the shuffling of data between memory and processor cores. Our goal is to establish the programming semantics and execution models to expose the above benefits of ReRAM computing to general-purpose data-parallel programs. The proposed programming framework seeks to expose the underlying parallelism in the hardware by merging the concepts of data-flow and vector processing (or SIMD). Data-flow explicitly exposes the Instruction Level Parallelism (ILP) in programs, while vector processing exposes the Data Level Parallelism (DLP). Google's TensorFlow [1] is a popular programming model for machine learning. We observe that TensorFlow's programming semantics is a perfect marriage of data-flow and vector processing. Thus, our proposed programming framework starts by requiring programmers to write programs in TensorFlow, and we develop a TensorFlow compiler that generates binary code for our in-memory data-flow processor. TensorFlow (TF) programs are essentially Data Flow Graphs (DFGs) in which each operator node can have tensors as operands.
{"title":"In-memory Data Flow Processor","authors":"Daichi Fujiki, S. Mahlke, R. Das","doi":"10.1109/PACT.2017.53","DOIUrl":"https://doi.org/10.1109/PACT.2017.53","url":null,"abstract":"Recent development of Non-Volatile Memories (NVMs) has opened up a new horizon for in-memory computing. By re-purposing memory structures, certain NVMs have been shown to have in-situ analog computation capability. For example, resistive memories (ReRAMs) store the data in the form of resistance of titanium oxides, and by injecting voltage into the word line and sensing the resultant current on the bit-line, we obtain the dot-product of the input voltages and cell conductances using Kirchhoff's law. Recent works have explored the design space of ReRAM based accelerators for machine learning algorithms by leveraging this dot-product functionality [2]. These ReRAM based accelerators exploit the massive parallelism and relaxed precision requirements, to provide orders of magnitude improvement when compared to current CPU/GPU architectures and custom ASICs, inspite of their high read/write latency. Despite the significant performance gain offered by computational NVMs, previous works have relied on manual mapping of workloads to the memory arrays, making it difficult to configure it for new workloads. We combat this problem by proposing a programmable inmemory processor architecture and programming framework. The architecture consists of memory arrays grouped in tiles, and a custom interconnect to facilitate communication between the arrays. Each array acts as unit of storage as well as processing element. The proposed in-memory processor architecture is simple. The key challenge is developing a programming framework and a rich ISA which can allow diverse data-parallel programs to leverage the underlying computational efficiency. The efficiency of the proposed in-memory processor comes from two sources. First, massive parallelism. NVMs are composed of several thousands of arrays. Each of these arrays are transformed into ALUs which can compute concurrently. Second, reduction in data movement, by avoiding shuffling of data between memory and processor cores. Our goal is to establish the programming semantics and execution models to expose the above benefits of ReRAM computing to general purpose data parallel programs. The proposed programming framework seeks to expose the underling parallelism in the hardware by merging the concepts of data-flow and vector processing (or SIMD). Data-flow explicitly exposes the Instruction Level Parallelism (ILP) in the programs, while vector processing exposes the Data Level Parallelism (DLP) in programs. Google's TensorFlow [1] is a popular programming model for machine learning. We observe that TensorFlow's programming semantics is a perfect marriage of data-flow and vector-processing. Thus, our proposed programming framework starts by requiring the programmers to write programs in TensorFlow. We develop a TensorFlow compiler that generates binary code for our in-memory data-flow processor. The TensorFlow (TF) programs are essentially Data Flow Graphs (DFG) where each operator node can have tensors as operands. 
A DF","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130462285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
POSTER: Location-Aware Computation Mapping for Manycore Processors (doi: 10.1109/PACT.2017.20)
Orhan Kislal, Jagadish B. Kotra, Xulong Tang, M. Kandemir, Myoungsoo Jung
Employing an on-chip network in a manycore system (to improve scalability) makes the latencies of data accesses issued by a core non-uniform, which significantly impacts application performance. This paper presents a compiler strategy that exposes architecture information to the compiler to enable optimized computation-to-core mapping. Our scheme takes into account the relative positions of (and distances between) cores, last-level caches (LLCs), and memory controllers (MCs) in a manycore system, and generates a mapping of computations to cores with the goal of minimizing on-chip network traffic. Our experiments with 12 multi-threaded applications reveal that, on average, our approach reduces the on-chip network latency in a 6x6 manycore system by 49.5% in the case of private LLCs and 52.7% in the case of shared LLCs. These improvements translate into execution time improvements of 14.8% and 15.2% for the private-LLC and shared-LLC based systems, respectively.
{"title":"POSTER: Location-Aware Computation Mapping for Manycore Processors","authors":"Orhan Kislal, Jagadish B. Kotra, Xulong Tang, M. Kandemir, Myoungsoo Jung","doi":"10.1109/PACT.2017.20","DOIUrl":"https://doi.org/10.1109/PACT.2017.20","url":null,"abstract":"Employing an on-chip network in a manycore system (to improve scalability) makes the latencies of data accesses issued by a core non-uniform, which significant impact application performance. This paper presents a compiler strategy which involves exposing architecture information to the compiler to enable optimized computation-to-core mapping. Our scheme takes into account the relative positions of (and distances between) cores, last-level caches (LLCs) and memory controllers (MCs) in a manycore system, and generates a mapping of computations to cores with the goal of minimizing the on-chip network traffic. Our experiments of 12 multi-threaded applications reveal that, on average, our approach reduces the on-chip network latency in a 6x6 manycore system by 49.5% in the case of private LLCs and 52.7% in the case of shared LLCs. These improvements translate to the corresponding execution time improvements of 14.8% and 15.2% for the private LLC and shared LLC based systems.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123886208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
POSTER: DaQueue: A Data-Aware Work-Queue Design for GPGPUs (doi: 10.1109/PACT.2017.22)
Yashuai Lü, Libo Huang, Li Shen
The work-queue is an effective approach for mapping irregular-parallel workloads to GPGPUs. It can improve the utilization of SIMD units by processing only useful work items, which are dynamically generated during execution. As current GPGPUs lack the necessary support for work-queues, software-based work-queue implementations often suffer from memory contention and load-balancing issues. We present a novel hardware work-queue design named DaQueue, which incorporates data-aware features to improve the efficiency of work-queues on GPGPUs. We evaluate our proposal on irregular-parallel workloads with a cycle-level simulator. Experimental results show that DaQueue significantly improves performance over a software-based implementation for these workloads. Compared with an idealized hardware worklist approach, which is the state-of-the-art prior work, DaQueue achieves an average of 29.54% additional speedup.
{"title":"POSTER: DaQueue: A Data-Aware Work-Queue Design for GPGPUs","authors":"Yashuai Lü, Libo Huang, Li Shen","doi":"10.1109/PACT.2017.22","DOIUrl":"https://doi.org/10.1109/PACT.2017.22","url":null,"abstract":"Work-queue is an effective approach for mapping irregular-parallel workloads to GPGPUs. It can improve the utilization of SIMD units by only processing useful works which are dynamically generated during execution. As current GPGPUs lack necessary supports for work-queues, a software-based work-queue implementation often suffers from memory contention and load balancing issues. We present a novel hardware work-queue design named DaQueue, which incorporates data-aware features to improve the efficiency of work-queues on GPGPUs. We evaluate our proposal on irregular-parallel workloads with a cycle-level simulator. Experimental results show that the DaQueue significantly improves the performance over software-based implementation for these workloads. Compared with an idealized hardware worklist approach which is the state-of-the-art prior work, the DaQueue can achieve an average of 29.54% extra speedup.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"433 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122803206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}