{"title":"Session details: Session 2B: Performance Management","authors":"J. Larus","doi":"10.1145/3252955","DOIUrl":"https://doi.org/10.1145/3252955","url":null,"abstract":"","PeriodicalId":302876,"journal":{"name":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132793370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Filtering Translation Bandwidth with Virtual Caching","authors":"Hongil Yoon, Jason Lowe-Power, G. Sohi","doi":"10.1145/3173162.3173195","DOIUrl":"https://doi.org/10.1145/3173162.3173195","url":null,"abstract":"Heterogeneous computing with GPUs integrated on the same chip as CPUs is ubiquitous, and to increase programmability many of these systems support virtual address accesses from GPU hardware. However, this entails address translation on every memory access. We observe that future GPUs and workloads show very high bandwidth demands (up to 4 accesses per cycle in some cases) for shared address translation hardware due to frequent private TLB misses. This greatly impacts performance (32% average performance degradation relative to an ideal MMU). To mitigate this overhead, we propose a software-agnostic, practical, GPU virtual cache hierarchy. We use the virtual cache hierarchy as an effective address translation bandwidth filter. We observe many requests that miss in private TLBs find corresponding valid data in the GPU cache hierarchy. With a GPU virtual cache hierarchy, these TLB misses can be filtered (i.e., virtual cache hits), significantly reducing bandwidth demands for the shared address translation hardware. In addition, accelerator-specific attributes (e.g., less likelihood of synonyms) of GPUs reduce the design complexity of virtual caches, making a whole virtual cache hierarchy (including a shared L2 cache) practical for GPUs. Our evaluation shows that the entire GPU virtual cache hierarchy effectively filters the high address translation bandwidth, achieving almost the same performance as an ideal MMU. We also evaluate L1-only virtual cache designs and show that using a whole virtual cache hierarchy obtains additional performance benefits (1.31× speedup on average).","PeriodicalId":302876,"journal":{"name":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133111845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Architectural Implications of Autonomous Driving: Constraints and Acceleration","authors":"Shi-Chieh Lin, Yunqi Zhang, Chang-Hong Hsu, Matt Skach, Md E. Haque, Lingjia Tang, Jason Mars","doi":"10.1145/3173162.3173191","DOIUrl":"https://doi.org/10.1145/3173162.3173191","url":null,"abstract":"Autonomous driving systems have attracted a significant amount of interest recently, and many industry leaders, such as Google, Uber, Tesla, and Mobileye, have invested a large amount of capital and engineering power on developing such systems. Building autonomous driving systems is particularly challenging due to stringent performance requirements in terms of both making the safe operational decisions and finishing processing at real-time. Despite the recent advancements in technology, such systems are still largely under experimentation and architecting end-to-end autonomous driving systems remains an open research question. To investigate this question, we first present and formalize the design constraints for building an autonomous driving system in terms of performance, predictability, storage, thermal and power. We then build an end-to-end autonomous driving system using state-of-the-art award-winning algorithms to understand the design trade-offs for building such systems. In our real-system characterization, we identify three computational bottlenecks, which conventional multicore CPUs are incapable of processing under the identified design constraints. To meet these constraints, we accelerate these algorithms using three accelerator platforms including GPUs, FPGAs, and ASICs, which can reduce the tail latency of the system by 169x, 10x, and 93x respectively. With accelerator-based designs, we are able to build an end-to-end autonomous driving system that meets all the design constraints, and explore the trade-offs among performance, power and the higher accuracy enabled by higher resolution cameras.","PeriodicalId":302876,"journal":{"name":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130675556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks","authors":"Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, A. Knies, Parthasarathy Ranganathan, O. Mutlu","doi":"10.1145/3173162.3173177","DOIUrl":"https://doi.org/10.1145/3173162.3173177","url":null,"abstract":"We are experiencing an explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in consumer devices. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for consumer devices. In this work, we comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads: (1) the Chrome web browser; (2) TensorFlow Mobile, Google's machine learning framework; (3) video playback, and (4) video capture, both of which are used in many video services such as YouTube and Google Hangouts. We find that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads, by performing part of the computation close to memory. Each workload contains simple primitives and functions that contribute to a significant amount of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power constraints of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement, and significantly reduces total system energy (by an average of 55.4% across the workloads) and execution time (by an average of 54.2%).","PeriodicalId":302876,"journal":{"name":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"191 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115483296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VAULT: Reducing Paging Overheads in SGX with Efficient Integrity Verification Structures","authors":"Meysam Taassori, Ali Shafiee, R. Balasubramonian","doi":"10.1145/3173162.3177155","DOIUrl":"https://doi.org/10.1145/3173162.3177155","url":null,"abstract":"Intel's SGX offers state-of-the-art security features, including confidentiality, integrity, and authentication (CIA) when accessing sensitive pages in memory. Sensitive pages are placed in an Enclave Page Cache (EPC) within the physical memory before they can be accessed by the processor. To control the overheads imposed by CIA guarantees, the EPC operates with a limited capacity (currently 128 MB). Because of this limited EPC size, sensitive pages must be frequently swapped between EPC and non-EPC regions in memory. A page swap is expensive (about 40K cycles) because it requires an OS system call, page copying, updates to integrity trees and metadata, etc. Our analysis shows that the paging overhead can slow the system on average by 5×, and other studies have reported even higher slowdowns for memory-intensive workloads. The paging overhead can be reduced by growing the size of the EPC to match the size of physical memory, while allowing the EPC to also accommodate non-sensitive pages. However, at least two important problems must be addressed to enable this growth in EPC: (i) the depth of the integrity tree and its cacheability must be improved to keep memory bandwidth overheads in check, (ii) the space overheads of integrity verification (tree and MACs) must be reduced. We achieve both goals by introducing a variable arity unified tree (VAULT) organization that is more compact and has lower depth. We further reduce the space overheads with techniques that combine MAC sharing and compression. With simulations, we show that the combination of our techniques can address most inefficiencies in SGX memory access and improve overall performance by 3.7×, relative to an SGX baseline, while incurring a memory capacity over-head of only 4.7%.","PeriodicalId":302876,"journal":{"name":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115668301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DLibOS: Performance and Protection with a Network-on-Chip","authors":"S. Mallon, V. Gramoli, Guillaume Jourjon","doi":"10.1145/3173162.3173209","DOIUrl":"https://doi.org/10.1145/3173162.3173209","url":null,"abstract":"A long body of research work has led to the conjecture that highly efficient IO processing at user-level would necessarily violate protection. In this paper, we debunk this myth by introducing DLibOS a new paradigm that consists of distributing a library OS on specialized cores to achieve performance and protection at the user-level. Its main novelty consists of leveraging network-on-chip to allow hardware message passing, rather than context switches, for communication between different address spaces. To demonstrate the feasibility of our approach, we implement a driver and a network stack at user-level on a Tilera many-core machine. We define a novel asynchronous socket interface and partition the memory such that the reception, the transmission and the application modify isolated regions. Our high performance results of 4.2 and 3.1 million requests per second obtained on a webserver and the Memcached applications, respectively, confirms the relevance of our design decisions. Finally, we compare DLibOS against a non-protected user-level network stack and show that protection comes at a negligible cost.","PeriodicalId":302876,"journal":{"name":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"267 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122659996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 5B Neural Networks","authors":"Adrian Sampson","doi":"10.1145/3252961","DOIUrl":"https://doi.org/10.1145/3252961","url":null,"abstract":"","PeriodicalId":302876,"journal":{"name":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125117604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DAMN: Overhead-Free IOMMU Protection for Networking","authors":"Alex Markuze, I. Smolyar, Adam Morrison, Dan Tsafrir","doi":"10.1145/3173162.3173175","DOIUrl":"https://doi.org/10.1145/3173162.3173175","url":null,"abstract":"DMA operations can access memory buffers only if they are \"mapped\" in the IOMMU, so operating systems protect themselves against malicious/errant network DMAs by mapping and unmapping each packet immediately before/after it is DMAed. This approach was recently found to be riskier and less performant than keeping packets non-DMAable and instead copying their content to/from permanently-mapped buffers. Still, the extra copy hampers performance of multi-gigabit networking. We observe that achieving protection at the DMA (un)map boundary is needlessly constraining, as devices must be prevented from changing the data only after the kernel reads it. So there is no real need to switch ownership of buffers between kernel and device at the DMA (un)mapping layer, as opposed to the approach taken by all existing IOMMU protection schemes. We thus eliminate the extra copy by (1)~implementing a new allocator called DMA-Aware Malloc for Networking (DAMN), which (de)allocates packet buffers from a memory pool permanently mapped in the IOMMU; (2)~modifying the network stack to use this allocator; and (3)~copying packet data only when the kernel needs it, which usually morphs the aforementioned extra copy into the kernel's standard copy operation performed at the user-kernel boundary. DAMN thus provides full IOMMU protection with performance comparable to that of an unprotected system.","PeriodicalId":302876,"journal":{"name":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131931567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sugar: Secure GPU Acceleration in Web Browsers","authors":"Zhihao Yao, Zongheng Ma, Yingtong Liu, A. A. Sani, Aparna Chandramowlishwaran","doi":"10.1145/3173162.3173186","DOIUrl":"https://doi.org/10.1145/3173162.3173186","url":null,"abstract":"Modern personal computers have embraced increasingly powerful Graphics Processing Units (GPUs). Recently, GPU-based graphics acceleration in web apps (i.e., applications running inside a web browser) has become popular. WebGL is the main effort to provide OpenGL-like graphics for web apps and it is currently used in 53% of the top-100 websites. Unfortunately, WebGL has posed serious security concerns as several attack vectors have been demonstrated through WebGL. Web browsers» solutions to these attacks have been reactive: discovered vulnerabilities have been patched and new runtime security checks have been added. Unfortunately, this approach leaves the system vulnerable to zero-day vulnerability exploits, especially given the large size of the Trusted Computing Base of the graphics plane. We present Sugar, a novel operating system solution that enhances the security of GPU acceleration for web apps by design. The key idea behind Sugar is using a dedicated virtual graphics plane for a web app by leveraging modern GPU virtualization solutions. A virtual graphics plane consists of a dedicated virtual GPU (or vGPU) as well as all the software graphics stack (including the device driver). Sugar enhances the system security since a virtual graphics plane is fully isolated from the rest of the system. Despite GPU virtualization overhead, we show that Sugar achieves high performance. Moreover, unlike current systems, Sugar is able to use two underlying physical GPUs, when available, to co-render the User Interface (UI): one GPU is used to provide virtual graphics planes for web apps and the other to provide the primary graphics plane for the rest of the system. Such a design not only provides strong security guarantees, it also provides enhanced performance isolation.","PeriodicalId":302876,"journal":{"name":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126176508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VIBNN: Hardware Acceleration of Bayesian Neural Networks","authors":"R. Cai, Ao Ren, Ning Liu, Caiwen Ding, Luhao Wang, Xuehai Qian, Massoud Pedram, Yanzhi Wang","doi":"10.1145/3173162.3173212","DOIUrl":"https://doi.org/10.1145/3173162.3173212","url":null,"abstract":"Bayesian Neural Networks (BNNs) have been proposed to address the problem of model uncertainty in training and inference. By introducing weights associated with conditioned probability distributions, BNNs are capable of resolving the overfitting issue commonly seen in conventional neural networks and allow for small-data training, through the variational inference process. Frequent usage of Gaussian random variables in this process requires a properly optimized Gaussian Random Number Generator (GRNG). The high hardware cost of conventional GRNG makes the hardware implementation of BNNs challenging. In this paper, we propose VIBNN, an FPGA-based hardware accelerator design for variational inference on BNNs. We explore the design space for massive amount of Gaussian variable sampling tasks in BNNs. Specifically, we introduce two high performance Gaussian (pseudo) random number generators: 1) the RAM-based Linear Feedback Gaussian Random Number Generator (RLF-GRNG), which is inspired by the properties of binomial distribution and linear feedback logics; and 2) the Bayesian Neural Network-oriented Wallace Gaussian Random Number Generator. To achieve high scalability and efficient memory access, we propose a deep pipelined accelerator architecture with fast execution and good hardware utilization. Experimental results demonstrate that the proposed VIBNN implementations on an FPGA can achieve throughput of 321,543.4 Images/s and energy efficiency upto 52,694.8 Images/J while maintaining similar accuracy as its software counterpart.","PeriodicalId":302876,"journal":{"name":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114479895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}