POSTER: Exploiting Approximations for Energy/Quality Tradeoffs in Service-Based Applications (doi: 10.1109/PACT.2017.57)
L. Liu, Sibren Isaacman, A. Bhattacharjee, U. Kremer
Approximations and redundancies allow mobile and distributed applications to produce answers or outcomes of lesser quality at lower costs. This paper introduces RAPID, a new programming framework and methodology for service-based applications with approximations and redundancies. Finding the best service configuration under a given resource budget becomes a constrained, dual-weight graph optimization problem.
{"title":"POSTER: Exploiting Approximations for Energy/Quality Tradeoffs in Service-Based Applications","authors":"L. Liu, Sibren Isaacman, A. Bhattacharjee, U. Kremer","doi":"10.1109/PACT.2017.57","DOIUrl":"https://doi.org/10.1109/PACT.2017.57","url":null,"abstract":"Approximations and redundancies allow mobile and distributed applications to produce answers or outcomes of lesser quality at lower costs. This paper introduces RAPID, a new programming framework and methodology for service-based applications with approximations and redundancies. Finding the best service configuration under a given resource budget becomes a constrained, dual-weight graph optimization problem.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"31 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121274389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
End-to-End Deep Learning of Optimization Heuristics (doi: 10.1109/PACT.2017.24)
Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather
Accurate automatic optimization heuristics are necessary for dealing with the complexity and diversity of modern hardware and software. Machine learning is a proven technique for learning such heuristics, but its success is bound by the quality of the features used. These features must be hand-crafted by developers through a combination of expert domain knowledge and trial and error. This makes the quality of the final model directly dependent on the skill and available time of the system architect. Our work introduces a better way for building heuristics. We develop a deep neural network that learns heuristics over raw code, entirely without using code features. The neural network simultaneously constructs appropriate representations of the code and learns how best to optimize, removing the need for manual feature creation. Further, we show that our neural nets can transfer learning from one optimization problem to another, improving the accuracy of new models, without the help of human experts. We compare the effectiveness of our automatically generated heuristics against ones with features hand-picked by experts. We examine two challenging tasks: predicting optimal mapping for heterogeneous parallelism and GPU thread coarsening factors. In 89% of the cases, the quality of our fully automatic heuristics matches or surpasses that of state-of-the-art predictive models using hand-crafted features, providing on average 14% and 12% more performance with no human effort expended on designing features.
{"title":"End-to-End Deep Learning of Optimization Heuristics","authors":"Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather","doi":"10.1109/PACT.2017.24","DOIUrl":"https://doi.org/10.1109/PACT.2017.24","url":null,"abstract":"Accurate automatic optimization heuristics are necessary for dealing with thecomplexity and diversity of modern hardware and software. Machine learning is aproven technique for learning such heuristics, but its success is bound by thequality of the features used. These features must be hand crafted by developersthrough a combination of expert domain knowledge and trial and error. This makesthe quality of the final model directly dependent on the skill and availabletime of the system architect.Our work introduces a better way for building heuristics. We develop a deepneural network that learns heuristics over raw code, entirely without using codefeatures. The neural network simultaneously constructs appropriaterepresentations of the code and learns how best to optimize, removing the needfor manual feature creation. Further, we show that our neural nets can transferlearning from one optimization problem to another, improving the accuracy of newmodels, without the help of human experts.We compare the effectiveness of our automatically generated heuristics againstones with features hand-picked by experts. We examine two challenging tasks:predicting optimal mapping for heterogeneous parallelism and GPU threadcoarsening factors. In 89% of the cases, the quality of our fully automaticheuristics matches or surpasses that of state-of-the-art predictive models usinghand-crafted features, providing on average 14% and 12% more performance withno human effort expended on designing features.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133793982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large Scale Data Clustering Using Memristive k-Median Computation (doi: 10.1109/PACT.2017.52)
Yomi Karthik Rupesh, M. N. Bojnordi
Clustering is a crucial tool for analyzing data in virtually every scientific and engineering discipline. The U.S. National Academy of Sciences (NAS) has recently announced "the seven giants of statistical data analysis," in which data clustering plays a central role [1]. This research also emphasizes that more scalable solutions are required to enable time and space clustering for future large-scale data analyses. Therefore, hardware and software innovations are necessary to make future large-scale data analysis practical. This project proposes a novel mechanism for computing bit-serial medians within resistive RAM (RRAM) arrays with no need to read out the operands from the memory cells.
{"title":"Large Scale Data Clustering Using Memristive k-Median Computation","authors":"Yomi Karthik Rupesh, M. N. Bojnordi","doi":"10.1109/PACT.2017.52","DOIUrl":"https://doi.org/10.1109/PACT.2017.52","url":null,"abstract":"Clustering is a crucial tool for analyzing data in virtually every scientific and engineering discipline. The U.S. National Academy of Sciences (NAS) has recently announced \"the seven giants of statistical data analysis\" in which data clustering plays a central role [1]. This research also emphasizes that more scalable solutions are required to enable time and space clustering for the future large-scale data analyses. Therefore, hardware and software innovations are necessary to make the future large scale data analysis practical.This project proposes a novel mechanism for computing bit serial medians within resistive RAM (RRAM) arrays with no need to read out the operands from memory cells.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115003457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
POSTER: Bridging the Gap Between Deep Learning and Sparse Matrix Format Selection (doi: 10.1109/PACT.2017.33)
Yue Zhao, Jiajia Li, C. Liao, Xipeng Shen
In this work, we conduct a systematic exploration of the promise and challenges of deep learning for sparse matrix format selection. We propose a set of novel techniques to address the special challenges this problem poses to deep learning, including input matrix representations, a late-merging deep neural network structure design, and the use of transfer learning to alleviate cross-architecture portability issues.
{"title":"POSTER: Bridging the Gap Between Deep Learning and Sparse Matrix Format Selection","authors":"Yue Zhao, Jiajia Li, C. Liao, Xipeng Shen","doi":"10.1109/PACT.2017.33","DOIUrl":"https://doi.org/10.1109/PACT.2017.33","url":null,"abstract":"In this work, we conduct a systematic exploration on the promise and challenges of deep learning for the sparse matrix format selection. We propose a set of novel techniques to solve special challenges to deep learning, including input matrix representations, a late-merging deep neural network structure design, and the use of transfer learning to alleviate cross-architecture portability issues.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116023521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transparent Dual Memory Compression Architecture (doi: 10.1109/PACT.2017.12)
Seikwon Kim, Seonyoung Lee, Taehoon Kim, Jaehyuk Huh
The increasing memory requirements of big data applications have been driving the precipitous growth of memory capacity in server systems. To maximize the efficiency of external memory, HW-based memory compression techniques have been proposed to increase effective memory capacity. Although such memory compression techniques can improve memory efficiency significantly, a critical trade-off exists in HW-based compression. As memory blocks need to be decompressed as quickly as possible to serve cache misses, latency-optimized techniques apply compression at the cacheline granularity, achieving decompression latencies of less than a few cycles. However, such latency-optimized techniques can lose the potentially high compression ratios of capacity-optimized techniques, which compress larger memory blocks with longer-latency algorithms. Considering this fundamental trade-off in memory compression, this paper proposes a transparent dual memory compression (DMC) architecture, which selectively uses two compression algorithms with distinct latency and compression characteristics. Exploiting the locality of memory accesses, the proposed architecture compresses less frequently accessed blocks with a capacity-optimized compression algorithm, while keeping recently accessed blocks compressed with a latency-optimized one. Furthermore, instead of relying on support from the virtual memory system to locate compressed memory blocks, the study advocates a HW-based translation between the uncompressed address space and the compressed physical space. This OS-transparent approach eliminates conflicts between compression efficiency and the large-page support adopted to reduce TLB misses. The proposed compression architecture is applied to the Hybrid Memory Cube (HMC) with a logic layer under the stacked DRAMs. The experimental results show that the proposed compression architecture provides a 54% higher compression ratio than the state-of-the-art latency-optimized technique, with no performance degradation over the baseline system without compression.
{"title":"Transparent Dual Memory Compression Architecture","authors":"Seikwon Kim, Seonyoung Lee, Taehoon Kim, Jaehyuk Huh","doi":"10.1109/PACT.2017.12","DOIUrl":"https://doi.org/10.1109/PACT.2017.12","url":null,"abstract":"The increasing memory requirements of big data applications have been driving the precipitous growth of memory capacity in server systems. To maximize the efficiency of external memory, HW-based memory compression techniques have been proposed to increase effective memory capacity. Although such memory compression techniques can improve the memory efficiency significantly, a critical trade-off exists in the HW-based compression techniques. As the memory blocks need to be decompressed as quickly as possible to serve cache misses, latency-optimized techniques apply compression at the cacheline granularity, achieving the decompression latency of less than a few cycles. However, such latency-optimized techniques can lose the potential high compression ratios of capacity-optimized techniques, which compress larger memory blocks with longer latency algorithms.Considering the fundamental trade-off in the memory compression, this paper proposes a transparent dual memory compression (DMC) architecture, which selectively uses two compression algorithms with distinct latency and compression characteristics. Exploiting the locality of memory accesses, the proposed architecture compresses less frequently accessed blocks with a capacity-optimized compression algorithm, while keeping recently accessed blocks compressed with a latency-optimized one. Furthermore, instead of relying on the support from the virtual memory system to locate compressed memory blocks, the study advocates a HW-based translation between the uncompressed address space and compressed physical space. This OS-transparent approach eliminates conflicts between compression efficiency and large page support adopted to reduce TLB misses. The proposed compression architecture is applied to the Hybrid Memory Cube (HMC) with a logic layer under the stacked DRAMs. The experimental results show that the proposed compression architecture provides 54% higher compression ratio than the state-of-the-art latency-optimized technique, with no performance degradation over the baseline system without compression.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133082264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
POSTER: Bridge the Gap Between Neural Networks and Neuromorphic Hardware (doi: 10.1109/PACT.2017.59)
Yu Ji, Youhui Zhang, Wenguang Chen, Yuan Xie
Unlike training common neural networks (NNs) for inference on general-purpose processors, developing NNs for neuromorphic chips usually faces a number of hardware-specific restrictions. This paper proposes a systematic methodology to address this challenge. It can transform an existing trained, unrestricted NN (usually targeting a software execution substrate) into an equivalent network that meets the given hardware constraints, which decouples NN applications from the target hardware. We have built such a software tool that supports both spiking neural networks (SNNs) and traditional artificial neural networks (ANNs). Its effectiveness has been demonstrated with a real neuromorphic chip and a processor-in-memory (PIM) design. Tests show that the extra inference error caused by this solution is very limited and the transformation time is much less than the retraining time.
{"title":"POSTER: Bridge the Gap Between Neural Networks and Neuromorphic Hardware","authors":"Yu Ji, Youhui Zhang, Wenguang Chen, Yuan Xie","doi":"10.1109/PACT.2017.59","DOIUrl":"https://doi.org/10.1109/PACT.2017.59","url":null,"abstract":"Different from training common neural networks (NNs) for inference on general-purpose processors, the development of NNs for neuromorphic chips is usually faced with a number of hardware-specific restrictions. This paper proposes a systematic methodology to address the challenge. It can transform an existing trained, unrestricted NN (usually for software execution substrate) into an equivalent network that meets the given hardware constraints, which decouples NN applications from target hardware. We have built such a software tool that supports both spiking neural networks (SNNs) and traditional artificial neural networks (ANNs). Its effectiveness has been demonstrated with a real neuromorphic chip and a processor-in-memory(PIM) design. Tests show that the extra inference error caused by this solution is very limited and the transformation time is much less than the retraining time.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"311 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133199667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Redesigning Go's Built-In Map to Support Concurrent Operations (doi: 10.1109/PACT.2017.45)
Louis Jenkins, Tingzhe Zhou, Michael F. Spear
The Go language lacks built-in data structures that allow fine-grained concurrent access. In particular, its map data type, one of only two generic collections in Go, limits concurrency to the case where all operations are read-only; any mutation (insert, update, or remove) requires exclusive access to the entire map. The tight integration of this map into the Go language and runtime precludes its replacement with known scalable map implementations. This paper introduces the Interlocked Hash Table (IHT). The IHT is the result of language-driven data structure design: it requires minimal changes to the Go map API, supports the full range of operations available on the sequential Go map, and provides a path for the language to evolve to become more amenable to scalable computation over shared data structures. The IHT employs a novel optimistic locking protocol to avoid the risk of deadlock, allows large critical sections that access a single IHT element, and can easily support multikey atomic operations. These features come at the cost of relaxed, though still straightforward, iteration semantics. In experiments in both Java and Go, the IHT performs well, reaching up to 7× the performance of the state of the art in Go at 24 threads. In Java, the IHT performs on par with the best Java maps in the research literature, while providing iteration and other features absent from other maps.
{"title":"Redesigning Go’s Built-In Map to Support Concurrent Operations","authors":"Louis Jenkins, Tingzhe Zhou, Michael F. Spear","doi":"10.1109/PACT.2017.45","DOIUrl":"https://doi.org/10.1109/PACT.2017.45","url":null,"abstract":"The Go language lacks built-in data structures that allow fine-grained concurrent access. In particular, its map data type, one of only two generic collections in Go, limits concurrency to the case where all operations are read-only; any mutation (insert, update, or remove) requires exclusive access to the entire map. The tight integration of this map into the Go language and runtime precludes its replacement with known scalable map implementations.This paper introduces the Interlocked Hash Table (IHT). The IHT is the result of language-driven data structure design: it requires minimal changes to the Go map API, supports the full range of operations available on the sequential Go map, and provides a path for the language to evolve to become more amenable to scalable computation over shared data structures. The IHT employs a novel optimistic locking protocol to avoid the risk of deadlock, and allows large critical sections that access a single IHT element, and can easily support multikey atomic operations. These features come at the cost of relaxed, though still straightforward, iteration semantics. In experimentation in both Java and Go, the IHT performs well, reaching up to 7× the performance of the state of the art in Go at 24 threads. In Java, the IHT performs on par with the best Java maps in the research literature, while providing iteration and other features absent from other maps.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129333419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In-memory Data Flow Processor (doi: 10.1109/PACT.2017.53)
Daichi Fujiki, S. Mahlke, R. Das
The recent development of Non-Volatile Memories (NVMs) has opened up a new horizon for in-memory computing. By re-purposing memory structures, certain NVMs have been shown to have in-situ analog computation capability. For example, resistive memories (ReRAMs) store data in the form of the resistance of titanium oxides, and by injecting voltages into the word lines and sensing the resultant currents on the bit-lines, we obtain the dot-product of the input voltages and cell conductances using Kirchhoff's law. Recent works have explored the design space of ReRAM-based accelerators for machine learning algorithms by leveraging this dot-product functionality [2]. These ReRAM-based accelerators exploit the massive parallelism and relaxed precision requirements to provide orders-of-magnitude improvements over current CPU/GPU architectures and custom ASICs, in spite of their high read/write latency. Despite the significant performance gains offered by computational NVMs, previous works have relied on manual mapping of workloads to the memory arrays, making it difficult to configure them for new workloads. We combat this problem by proposing a programmable in-memory processor architecture and programming framework. The architecture consists of memory arrays grouped in tiles, and a custom interconnect to facilitate communication between the arrays. Each array acts as a unit of storage as well as a processing element. The proposed in-memory processor architecture is simple. The key challenge is developing a programming framework and a rich ISA that allow diverse data-parallel programs to leverage the underlying computational efficiency. The efficiency of the proposed in-memory processor comes from two sources. First, massive parallelism: NVMs are composed of several thousands of arrays, and each of these arrays is transformed into an ALU that can compute concurrently. Second, reduction in data movement, by avoiding the shuffling of data between memory and processor cores. Our goal is to establish the programming semantics and execution models to expose the above benefits of ReRAM computing to general-purpose data-parallel programs. The proposed programming framework seeks to expose the underlying parallelism in the hardware by merging the concepts of data-flow and vector processing (or SIMD). Data-flow explicitly exposes the Instruction Level Parallelism (ILP) in programs, while vector processing exposes the Data Level Parallelism (DLP). Google's TensorFlow [1] is a popular programming model for machine learning. We observe that TensorFlow's programming semantics is a perfect marriage of data-flow and vector processing. Thus, our proposed programming framework starts by requiring programmers to write programs in TensorFlow, and we develop a TensorFlow compiler that generates binary code for our in-memory data-flow processor. TensorFlow (TF) programs are essentially Data Flow Graphs (DFGs) in which each operator node can have tensors as operands.
{"title":"In-memory Data Flow Processor","authors":"Daichi Fujiki, S. Mahlke, R. Das","doi":"10.1109/PACT.2017.53","DOIUrl":"https://doi.org/10.1109/PACT.2017.53","url":null,"abstract":"Recent development of Non-Volatile Memories (NVMs) has opened up a new horizon for in-memory computing. By re-purposing memory structures, certain NVMs have been shown to have in-situ analog computation capability. For example, resistive memories (ReRAMs) store the data in the form of resistance of titanium oxides, and by injecting voltage into the word line and sensing the resultant current on the bit-line, we obtain the dot-product of the input voltages and cell conductances using Kirchhoff's law. Recent works have explored the design space of ReRAM based accelerators for machine learning algorithms by leveraging this dot-product functionality [2]. These ReRAM based accelerators exploit the massive parallelism and relaxed precision requirements, to provide orders of magnitude improvement when compared to current CPU/GPU architectures and custom ASICs, inspite of their high read/write latency. Despite the significant performance gain offered by computational NVMs, previous works have relied on manual mapping of workloads to the memory arrays, making it difficult to configure it for new workloads. We combat this problem by proposing a programmable inmemory processor architecture and programming framework. The architecture consists of memory arrays grouped in tiles, and a custom interconnect to facilitate communication between the arrays. Each array acts as unit of storage as well as processing element. The proposed in-memory processor architecture is simple. The key challenge is developing a programming framework and a rich ISA which can allow diverse data-parallel programs to leverage the underlying computational efficiency. The efficiency of the proposed in-memory processor comes from two sources. First, massive parallelism. NVMs are composed of several thousands of arrays. Each of these arrays are transformed into ALUs which can compute concurrently. Second, reduction in data movement, by avoiding shuffling of data between memory and processor cores. Our goal is to establish the programming semantics and execution models to expose the above benefits of ReRAM computing to general purpose data parallel programs. The proposed programming framework seeks to expose the underling parallelism in the hardware by merging the concepts of data-flow and vector processing (or SIMD). Data-flow explicitly exposes the Instruction Level Parallelism (ILP) in the programs, while vector processing exposes the Data Level Parallelism (DLP) in programs. Google's TensorFlow [1] is a popular programming model for machine learning. We observe that TensorFlow's programming semantics is a perfect marriage of data-flow and vector-processing. Thus, our proposed programming framework starts by requiring the programmers to write programs in TensorFlow. We develop a TensorFlow compiler that generates binary code for our in-memory data-flow processor. The TensorFlow (TF) programs are essentially Data Flow Graphs (DFG) where each operator node can have tensors as operands. 
A DF","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130462285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
POSTER: Location-Aware Computation Mapping for Manycore Processors (doi: 10.1109/PACT.2017.20)
Orhan Kislal, Jagadish B. Kotra, Xulong Tang, M. Kandemir, Myoungsoo Jung
Employing an on-chip network in a manycore system (to improve scalability) makes the latencies of data accesses issued by a core non-uniform, which significantly impacts application performance. This paper presents a compiler strategy that exposes architecture information to the compiler to enable optimized computation-to-core mapping. Our scheme takes into account the relative positions of (and distances between) cores, last-level caches (LLCs), and memory controllers (MCs) in a manycore system, and generates a mapping of computations to cores with the goal of minimizing on-chip network traffic. Our experiments with 12 multi-threaded applications reveal that, on average, our approach reduces the on-chip network latency in a 6x6 manycore system by 49.5% in the case of private LLCs and 52.7% in the case of shared LLCs. These improvements translate into execution time improvements of 14.8% and 15.2% for the private-LLC and shared-LLC based systems, respectively.
{"title":"POSTER: Location-Aware Computation Mapping for Manycore Processors","authors":"Orhan Kislal, Jagadish B. Kotra, Xulong Tang, M. Kandemir, Myoungsoo Jung","doi":"10.1109/PACT.2017.20","DOIUrl":"https://doi.org/10.1109/PACT.2017.20","url":null,"abstract":"Employing an on-chip network in a manycore system (to improve scalability) makes the latencies of data accesses issued by a core non-uniform, which significant impact application performance. This paper presents a compiler strategy which involves exposing architecture information to the compiler to enable optimized computation-to-core mapping. Our scheme takes into account the relative positions of (and distances between) cores, last-level caches (LLCs) and memory controllers (MCs) in a manycore system, and generates a mapping of computations to cores with the goal of minimizing the on-chip network traffic. Our experiments of 12 multi-threaded applications reveal that, on average, our approach reduces the on-chip network latency in a 6x6 manycore system by 49.5% in the case of private LLCs and 52.7% in the case of shared LLCs. These improvements translate to the corresponding execution time improvements of 14.8% and 15.2% for the private LLC and shared LLC based systems.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123886208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
POSTER: DaQueue: A Data-Aware Work-Queue Design for GPGPUs (doi: 10.1109/PACT.2017.22)
Yashuai Lü, Libo Huang, Li Shen
The work-queue is an effective approach for mapping irregular-parallel workloads to GPGPUs. It can improve the utilization of SIMD units by processing only useful work items, which are dynamically generated during execution. As current GPGPUs lack the necessary support for work-queues, software-based work-queue implementations often suffer from memory contention and load-balancing issues. We present a novel hardware work-queue design named DaQueue, which incorporates data-aware features to improve the efficiency of work-queues on GPGPUs. We evaluate our proposal on irregular-parallel workloads with a cycle-level simulator. Experimental results show that DaQueue significantly improves performance over a software-based implementation for these workloads. Compared with an idealized hardware worklist approach, which is the state-of-the-art prior work, DaQueue achieves an average of 29.54% additional speedup.
{"title":"POSTER: DaQueue: A Data-Aware Work-Queue Design for GPGPUs","authors":"Yashuai Lü, Libo Huang, Li Shen","doi":"10.1109/PACT.2017.22","DOIUrl":"https://doi.org/10.1109/PACT.2017.22","url":null,"abstract":"Work-queue is an effective approach for mapping irregular-parallel workloads to GPGPUs. It can improve the utilization of SIMD units by only processing useful works which are dynamically generated during execution. As current GPGPUs lack necessary supports for work-queues, a software-based work-queue implementation often suffers from memory contention and load balancing issues. We present a novel hardware work-queue design named DaQueue, which incorporates data-aware features to improve the efficiency of work-queues on GPGPUs. We evaluate our proposal on irregular-parallel workloads with a cycle-level simulator. Experimental results show that the DaQueue significantly improves the performance over software-based implementation for these workloads. Compared with an idealized hardware worklist approach which is the state-of-the-art prior work, the DaQueue can achieve an average of 29.54% extra speedup.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"433 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122803206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}