{"title":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018, Williamsburg, VA, USA, March 24-28, 2018","authors":"","doi":"10.1145/3296957","DOIUrl":"https://doi.org/10.1145/3296957","url":null,"abstract":"","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"240 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89053592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MAERI. Hyoukjun Kwon, A. Samajdar, T. Krishna. DOI: 10.1145/3296957.3173176

Deep neural networks (DNNs) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and the need for high energy efficiency have led to a surge in research on hardware accelerators. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PEs) operating in parallel and communicating with each other directly. DNNs are evolving at a rapid rate, and it is common to have convolution, recurrent, pooling, and fully-connected layers with varying input and filter sizes in the most recent topologies. They may be dense or sparse. They can also be partitioned in myriad ways (within and across layers) to exploit data reuse (weights and intermediate outputs). All of the above can lead to different dataflow patterns within the accelerator substrate. Unfortunately, most DNN accelerators support only fixed dataflow patterns internally, as they perform a careful co-design of the PEs and the network-on-chip (NoC). In fact, the majority of them are optimized only for traffic within a convolutional layer. This makes it challenging to map arbitrary dataflows onto the fabric efficiently, and can lead to underutilization of the available compute resources. DNN accelerators need to be programmable to enable mass deployment. For them to be programmable, they need to be configurable internally to support the various dataflow patterns that could be mapped over them. To address this need, we present MAERI, a DNN accelerator built with a set of modular and configurable building blocks that can easily support myriad DNN partitions and mappings by appropriately configuring tiny switches. MAERI provides 8-459% better utilization across multiple dataflow mappings over baselines with rigid NoC fabrics.
{"title":"MAERI","authors":"Hyoukjun Kwon, A. Samajdar, T. Krishna","doi":"10.1145/3296957.3173176","DOIUrl":"https://doi.org/10.1145/3296957.3173176","url":null,"abstract":"Deep neural networks (DNN) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and a need for high energy-efficiency has led to a surge in research on hardware accelerators. % for this paradigm. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PE) operating in parallel and communicating with each other directly. DNNs are evolving at a rapid rate, and it is common to have convolution, recurrent, pooling, and fully-connected layers with varying input and filter sizes in the most recent topologies.They may be dense or sparse. They can also be partitioned in myriad ways (within and across layers) to exploit data reuse (weights and intermediate outputs). All of the above can lead to different dataflow patterns within the accelerator substrate. Unfortunately, most DNN accelerators support only fixed dataflow patterns internally as they perform a careful co-design of the PEs and the network-on-chip (NoC). In fact, the majority of them are only optimized for traffic within a convolutional layer. This makes it challenging to map arbitrary dataflows on the fabric efficiently, and can lead to underutilization of the available compute resources. DNN accelerators need to be programmable to enable mass deployment. For them to be programmable, they need to be configurable internally to support the various dataflow patterns that could be mapped over them. To address this need, we present MAERI, which is a DNN accelerator built with a set of modular and configurable building blocks that can easily support myriad DNN partitions and mappings by appropriately configuring tiny switches. MAERI provides 8-459% better utilization across multiple dataflow mappings over baselines with rigid NoC fabrics.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"83 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83269720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DLibOS. S. Mallon, V. Gramoli, Guillaume Jourjon. DOI: 10.1145/3296957.3173209

A long body of research has led to the conjecture that highly efficient I/O processing at user level must necessarily violate protection. In this paper, we debunk this myth by introducing DLibOS, a new paradigm that distributes a library OS across specialized cores to achieve both performance and protection at user level. Its main novelty is leveraging the network-on-chip to allow hardware message passing, rather than context switches, for communication between different address spaces. To demonstrate the feasibility of our approach, we implement a driver and a network stack at user level on a Tilera many-core machine. We define a novel asynchronous socket interface and partition the memory such that reception, transmission, and the application modify isolated regions. Our high-performance results of 4.2 and 3.1 million requests per second, obtained on a webserver and on Memcached, respectively, confirm the relevance of our design decisions. Finally, we compare DLibOS against a non-protected user-level network stack and show that protection comes at a negligible cost.
{"title":"DLibOS","authors":"S. Mallon, V. Gramoli, Guillaume Jourjon","doi":"10.1145/3296957.3173209","DOIUrl":"https://doi.org/10.1145/3296957.3173209","url":null,"abstract":"\u0000 A long body of research work has led to the conjecture that highly efficient IO processing at user-level would necessarily violate protection. In this paper, we debunk this myth by introducing\u0000 DLibOS\u0000 a new paradigm that consists of distributing a library OS on specialized cores to achieve performance and protection at the user-level. Its main novelty consists of leveraging network-on-chip to allow hardware message passing, rather than context switches, for communication between different address spaces. To demonstrate the feasibility of our approach, we implement a driver and a network stack at user-level on a Tilera many-core machine. We define a novel asynchronous socket interface and partition the memory such that the reception, the transmission and the application modify isolated regions. Our high performance results of 4.2 and 3.1 million requests per second obtained on a webserver and the Memcached applications, respectively, confirms the relevance of our design decisions. Finally, we compare DLibOS against a non-protected user-level network stack and show that protection comes at a negligible cost.\u0000","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"46 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89406704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DAMN. Alex Markuze, I. Smolyar, Adam Morrison, Dan Tsafrir. DOI: 10.1145/3296957.3173175
DMA operations can access memory buffers only if they are "mapped" in the IOMMU, so operating systems protect themselves against malicious/errant network DMAs by mapping and unmapping each packet immediately before/after it is DMAed. This approach was recently found to be riskier and less performant than keeping packets non-DMAable and instead copying their content to/from permanently-mapped buffers. Still, the extra copy hampers the performance of multi-gigabit networking. We observe that achieving protection at the DMA (un)map boundary is needlessly constraining, as devices need to be prevented from changing the data only until the kernel has read it. There is thus no real need to switch ownership of buffers between kernel and device at the DMA (un)mapping layer, as all existing IOMMU protection schemes do. We therefore eliminate the extra copy by (1) implementing a new allocator called DMA-Aware Malloc for Networking (DAMN), which (de)allocates packet buffers from a memory pool permanently mapped in the IOMMU; (2) modifying the network stack to use this allocator; and (3) copying packet data only when the kernel needs it, which usually morphs the aforementioned extra copy into the kernel's standard copy operation performed at the user-kernel boundary. DAMN thus provides full IOMMU protection with performance comparable to that of an unprotected system.
{"title":"DAMN","authors":"Alex Markuze, I. Smolyar, Adam Morrison, Dan Tsafrir","doi":"10.1145/3296957.3173175","DOIUrl":"https://doi.org/10.1145/3296957.3173175","url":null,"abstract":"DMA operations can access memory buffers only if they are \"mapped\" in the IOMMU, so operating systems protect themselves against malicious/errant network DMAs by mapping and unmapping each packet immediately before/after it is DMAed. This approach was recently found to be riskier and less performant than keeping packets non-DMAable and instead copying their content to/from permanently-mapped buffers. Still, the extra copy hampers performance of multi-gigabit networking. We observe that achieving protection at the DMA (un)map boundary is needlessly constraining, as devices must be prevented from changing the data only after the kernel reads it. So there is no real need to switch ownership of buffers between kernel and device at the DMA (un)mapping layer, as opposed to the approach taken by all existing IOMMU protection schemes. We thus eliminate the extra copy by (1)~implementing a new allocator called DMA-Aware Malloc for Networking (DAMN), which (de)allocates packet buffers from a memory pool permanently mapped in the IOMMU; (2)~modifying the network stack to use this allocator; and (3)~copying packet data only when the kernel needs it, which usually morphs the aforementioned extra copy into the kernel's standard copy operation performed at the user-kernel boundary. DAMN thus provides full IOMMU protection with performance comparable to that of an unprotected system.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84355680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SOFRITAS. Christian DeLozier, Ariel Eizenberg, Brandon Lucia, Joseph Devietti. DOI: 10.1145/3296957.3173192
Correctly synchronizing multithreaded programs is challenging, and errors can lead to program failures such as atomicity violations. Existing strong memory consistency models rule out some possible failures, but are limited by their dependence on programmer-defined locking code. We present the new Ordering-Free Region (OFR) serializability consistency model, which ensures atomicity for OFRs, spans of dynamic instructions between consecutive ordering constructs (e.g., barriers), without breaking atomicity at lock operations. Our platform, Serializable Ordering-Free Regions for Increasing Thread Atomicity Scalably (SOFRITAS), ensures that a C/C++ program's execution is equivalent to a serialization of OFRs by default. We build two systems that realize the SOFRITAS idea: a concurrency-bug-finding tool for testing called SOFRITEST, and a production runtime system called SOPRO. SOFRITEST uses OFRs to find concurrency bugs, including a multi-critical-section atomicity violation in memcached that weaker consistency models would miss. If OFRs are too coarse-grained, SOFRITEST suggests refinement annotations automatically. Our software-only SOPRO implementation has high performance, scales well with increased parallelism, and prevents failures despite bugs in locking code. SOFRITAS has an average overhead of just 1.59x on single-threaded execution and 1.51x on sixteen threads, despite pthreads' much weaker memory model.
{"title":"SOFRITAS","authors":"Christian DeLozier, Ariel Eizenberg, Brandon Lucia, Joseph Devietti","doi":"10.1145/3296957.3173192","DOIUrl":"https://doi.org/10.1145/3296957.3173192","url":null,"abstract":"Correctly synchronizing multithreaded programs is challenging and errors can lead to program failures such as atomicity violations. Existing strong memory consistency models rule out some possible failures, but are limited by depending on programmer-defined locking code. We present the new Ordering-Free Region (OFR) serializability consistency model that ensures atomicity for OFRs, which are spans of dynamic instructions between consecutive ordering constructs (e.g., barriers), without breaking atomicity at lock operations. Our platform, Serializable Ordering-Free Regions for Increasing Thread Atomicity Scalably (SOFRITAS), ensures a C/C++ program's execution is equivalent to a serialization of OFRs by default. We build two systems that realize the SOFRITAS idea: a concurrency bug finding tool for testing called SOFRITEST, and a production runtime system called SOPRO. SOFRITEST uses OFRs to find concurrency bugs, including a multi-critical-section atomicity violation in memcached that weaker consistency models will miss. If OFR's are too coarse-grained, SOFRITEST suggests refinement annotations automatically. Our software-only SOPRO implementation has high performance, scales well with increased parallelism, and prevents failures despite bugs in locking code. SOFRITAS has an average overhead of just 1.59x on a single-threaded execution and 1.51x on sixteen threads, despite pthreads' much weaker memory model.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87509934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
vbench. A. Lottarini, À. Ramírez, Joel Coburn, Martha A. Kim, Parthasarathy Ranganathan, Daniel Stodolsky, Mark Wachsler. DOI: 10.1145/3296957.3173207
This paper presents vbench, a publicly available benchmark for cloud video services. To the best of our knowledge, ours is the first study to characterize the emerging video-as-a-service workload. Unlike prior video processing benchmarks, vbench's videos are algorithmically selected to represent a large commercial corpus of millions of videos. Reflecting the complex infrastructure that processes and hosts these videos, vbench includes carefully constructed metrics and baselines. The combination of validated corpus, baselines, and metrics reveals nuanced tradeoffs between speed, quality, and compression. We demonstrate the importance of video selection with a microarchitectural study of cache, branch, and SIMD behavior. vbench reveals trends from the commercial corpus that are not visible in other video corpora. Our experiments with GPUs under vbench's scoring scenarios reveal that context is critical: GPUs are well suited for live streaming, while for video-on-demand they shift costs from compute to storage and network. Counterintuitively, they are not viable for popular videos, for which highly compressed, high-quality copies are required. We instead find that popular videos are currently well served by the current trajectory of software encoders.
{"title":"vbench","authors":"A. Lottarini, À. Ramírez, Joel Coburn, Martha A. Kim, Parthasarathy Ranganathan, Daniel Stodolsky, Mark Wachsler","doi":"10.1145/3296957.3173207","DOIUrl":"https://doi.org/10.1145/3296957.3173207","url":null,"abstract":"This paper presents vbench, a publicly available benchmark for cloud video services. We are the first study, to the best of our knowledge, to characterize the emerging video-as-a-service workload. Unlike prior video processing benchmarks, vbench's videos are algorithmically selected to represent a large commercial corpus of millions of videos. Reflecting the complex infrastructure that processes and hosts these videos, vbench includes carefully constructed metrics and baselines. The combination of validated corpus, baselines, and metrics reveal nuanced tradeoffs between speed, quality, and compression. We demonstrate the importance of video selection with a microarchitectural study of cache, branch, and SIMD behavior. vbench reveals trends from the commercial corpus that are not visible in other video corpuses. Our experiments with GPUs under vbench's scoring scenarios reveal that context is critical: GPUs are well suited for live-streaming, while for video-on-demand shift costs from compute to storage and network. Counterintuitively, they are not viable for popular videos, for which highly compressed, high quality copies are required. We instead find that popular videos are currently well-served by the current trajectory of software encoders.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"301 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73597693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wonderland. Mingxing Zhang, Yongwei Wu, Youwei Zhuo, Xuehai Qian, Chengying Huan, Kang Chen. DOI: 10.1145/3296957.3173208

Many important graph applications are iterative algorithms that repeatedly process the input graph until convergence. For such algorithms, graph abstraction is an important technique: although much smaller than the original graph, an abstraction can bootstrap an initial result that significantly accelerates final convergence, leading to better overall performance. However, existing graph abstraction techniques typically assume either a fully in-memory or a distributed environment, which creates many obstacles to applying them in an out-of-core graph processing system. In this paper, we propose Wonderland, a novel out-of-core graph processing system based on abstraction. Wonderland has three unique features: 1) a simple method, applicable to out-of-core systems, that lets users extract effective abstractions from the original graph at acceptable cost and within a specific memory limit; 2) abstraction-enabled information propagation, where an abstraction serves as a bridge over the disjoint on-disk graph partitions; 3) abstraction-guided priority scheduling, where an abstraction can infer a better priority-based order for processing the on-disk graph partitions. Wonderland is a significant advance over the state of the art because it not only makes graph abstraction feasible for out-of-core systems, but also broadens the applications of the concept in important ways. Evaluation results show that Wonderland achieves a drastic speedup over other state-of-the-art systems, up to two orders of magnitude in certain cases.
{"title":"Wonderland","authors":"Mingxing Zhang, Yongwei Wu, Youwei Zhuo, Xuehai Qian, Chengying Huan, Kang Chen","doi":"10.1145/3296957.3173208","DOIUrl":"https://doi.org/10.1145/3296957.3173208","url":null,"abstract":"Many important graph applications are iterative algorithms that repeatedly process the input graph until convergence. For such algorithms, graph abstraction is an important technique: although much smaller than the original graph, it can bootstrap an initial result that can significantly accelerate the final convergence speed, leading to a better overall performance. However, existing graph abstraction techniques typically assume either fully in-memory or distributed environment, which leads to many obstacles preventing the application to an out-of-core graph processing system. In this paper, we propose Wonderland, a novel out-of-core graph processing system based on abstraction. Wonderland has three unique features: 1) A simple method applicable to out-of-core systems allowing users to extract effective abstractions from the original graph with acceptable cost and a specific memory limit; 2) Abstraction-enabled information propagation, where an abstraction can be used as a bridge over the disjoint on-disk graph partitions; 3) Abstraction guided priority scheduling, where an abstraction can infer the better priority-based order in processing on-disk graph partitions. Wonderland is a significant advance over the state-of-the-art because it not only makes graph abstraction feasible to out-of-core systems, but also broadens the applications of the concept in important ways. Evaluation results of Wonderland reveal that Wonderland achieves a drastic speedup over the other state-of-the-art systems, up to two orders of magnitude for certain cases.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81684350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tigr. Amir Hossein Nodehi Sabet, Junqiao Qiu, Zhijia Zhao. DOI: 10.1145/3296957.3173180
Graph analytics delivers deep knowledge by processing large volumes of highly connected data. In real-world graphs, the degree distribution tends to follow the power law: a small portion of nodes own a large number of neighbors. This high irregularity of degree distribution is a major barrier to efficient processing on GPU architectures, which are primarily designed to accelerate computations on regular data with SIMD execution. Existing solutions to the inefficiency of GPU-based graph analytics either modify the graph programming abstraction or rely on changes to the low-level thread execution model. The former requires more programming effort for designing and maintaining graph analytics, while the latter is coupled to the underlying architecture, making it difficult to adapt as architectures quickly evolve. Unlike prior efforts, this work addresses the above fundamental problem at its origin: the irregular graph data itself. It raises a critical question in irregular graph processing: is it possible to transform irregular graphs into more regular ones such that the graphs can be processed more efficiently on GPU-like architectures, yet still produce the same results? Inspired by this question, this work introduces Tigr, a graph transformation framework that can effectively reduce the irregularity of real-world graphs with correctness guarantees for a wide range of graph analytics. To make the transformations practical, Tigr features a lightweight virtual transformation scheme, which can substantially reduce the cost of graph transformations while preserving the benefits of reduced irregularity. Evaluation of Tigr-based GPU graph processing shows significant and consistent speedups over state-of-the-art GPU graph processing frameworks for a spectrum of irregular graphs.
{"title":"Tigr","authors":"Amir Hossein Nodehi Sabet, Junqiao Qiu, Zhijia Zhao","doi":"10.1145/3296957.3173180","DOIUrl":"https://doi.org/10.1145/3296957.3173180","url":null,"abstract":"Graph analytics delivers deep knowledge by processing large volumes of highly connected data. In real-world graphs, the degree distribution tends to follow the power law -- a small portion of nodes own a large number of neighbors. The high irregularity of degree distribution acts as a major barrier to their efficient processing on GPU architectures, which are primarily designed for accelerating computations on regular data with SIMD executions. Existing solutions to the inefficiency of GPU-based graph analytics either modify the graph programming abstraction or rely on changes to the low-level thread execution models. The former requires more programming efforts for designing and maintaining graph analytics; while the latter couples with the underlying architectures, making it difficult to adapt as architectures quickly evolve. Unlike prior efforts, this work proposes to address the above fundamental problem at its origin -- the irregular graph data itself. It raises a critical question in irregular graph processing: Is it possible to transform irregular graphs into more regular ones such that the graphs can be processed more efficiently on GPU-like architectures, yet still producing the same results? Inspired by the question, this work introduces Tigr -- a graph transformation framework that can effectively reduce the irregularity of real-world graphs with correctness guarantees for a wide range of graph analytics. To make the transformations practical, Tigr features a lightweight virtual transformation scheme, which can substantially reduce the costs of graph transformations, while preserving the benefits of reduced irregularity. Evaluation on Tigr-based GPU graph processing shows significant and consistent speedup over the state-of-the-art GPU graph processing frameworks for a spectrum of irregular graphs.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78639832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BranchScope. Dmitry Evtyushkin, Ryan D. Riley, Nael Abu-Ghazaleh, D. Ponomarev. DOI: 10.1145/3296957.3173204
We present BranchScope, a new side-channel attack in which the attacker infers the direction of an arbitrary conditional branch instruction in a victim program by manipulating the shared directional branch predictor. The directional component of the branch predictor stores the prediction for a given branch (taken or not-taken) and is a different component from the branch target buffer (BTB) attacked by previous work. BranchScope is the first fine-grained attack on the directional branch predictor, expanding our understanding of the side-channel vulnerability of the branch prediction unit. Our attack targets complex hybrid branch predictors with unknown organization. We demonstrate how an attacker can force these predictors to switch to a simple one-level mode to simplify direction recovery. We carry out BranchScope on several recent Intel CPUs and also demonstrate the attack against an SGX enclave.
{"title":"BranchScope","authors":"Dmitry Evtyushkin, Ryan D. Riley, Nael Abu-Ghazaleh, D. Ponomarev","doi":"10.1145/3296957.3173204","DOIUrl":"https://doi.org/10.1145/3296957.3173204","url":null,"abstract":"We present BranchScope - a new side-channel attack where the attacker infers the direction of an arbitrary conditional branch instruction in a victim program by manipulating the shared directional branch predictor. The directional component of the branch predictor stores the prediction on a given branch (taken or not-taken) and is a different component from the branch target buffer (BTB) attacked by previous work. BranchScope is the first fine-grained attack on the directional branch predictor, expanding our understanding of the side channel vulnerability of the branch prediction unit. Our attack targets complex hybrid branch predictors with unknown organization. We demonstrate how an attacker can force these predictors to switch to a simple 1-level mode to simplify the direction recovery. We carry out BranchScope on several recent Intel CPUs and also demonstrate the attack against an SGX enclave.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86868793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A practical unification of multi-stage programming and macros. Nicolas Stucki, Aggelos Biboudis, Martin Odersky. DOI: 10.1145/3393934.3278139

Program generation is indispensable. We propose a novel unification of two existing metaprogramming techniques: multi-stage programming and hygienic generative macros. The former supports runtime c...
{"title":"A practical unification of multi-stage programming and macros","authors":"StuckiNicolas, BiboudisAggelos, OderskyMartin","doi":"10.1145/3393934.3278139","DOIUrl":"https://doi.org/10.1145/3393934.3278139","url":null,"abstract":"Program generation is indispensable. We propose a novel unification of two existing metaprogramming techniques: multi-stage programming and hygienic generative macros. The former supports runtime c...","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3393934.3278139","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41620464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}