
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems: Latest Publications

Devirtualizing Memory in Heterogeneous Systems
Swapnil Haria, M. Hill, M. Swift
Accelerators are increasingly recognized as one of the major drivers of future computational growth. For accelerators, shared virtual memory (VM) promises to simplify programming and provide safe data sharing with CPUs. Unfortunately, the overheads of virtual memory, which are high for general-purpose processors, are even higher for accelerators. Giving accelerators direct access to physical memory (PM), in contrast, provides high performance but is both unsafe and more difficult to program. We propose Devirtualized Memory (DVM) to combine the protection of VM with direct access to PM. By allocating memory such that physical and virtual addresses are almost always identical (VA==PA), DVM mostly replaces page-level address translation with faster region-level Devirtualized Access Validation (DAV). On read accesses, DAV can optionally be overlapped with the data fetch to hide VM overheads. DVM requires modest OS and IOMMU changes and is transparent to the application. Implemented in Linux 4.10, DVM reduces VM overheads in a graph-processing accelerator to just 1.6% on average. DVM also improves performance by 2.1X over an optimized conventional VM implementation, while consuming 3.9X less dynamic energy for memory management. We further discuss DVM's potential to extend beyond accelerators to CPUs, where it reduces VM overheads to 5% on average, down from 29% for conventional VM.
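To make the DAV mechanism concrete, here is a minimal C sketch of a region-level permission check under the VA==PA assumption. The 2 MiB granularity, table layout, and all names are illustrative assumptions, not the paper's implementation.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define REGION_SHIFT 21          /* assumed 2 MiB validation regions */
#define NUM_REGIONS  512

/* One permission record per region, indexed directly by the identity-
 * mapped address -- no page-table walk on the common (VA==PA) path. */
typedef struct {
    bool devirtualized;          /* VA == PA holds for this region */
    bool readable;
    bool writable;
} region_perm_t;

static region_perm_t region_table[NUM_REGIONS];   /* set up at mmap time */

/* Devirtualized Access Validation: a cheap region-level permission check
 * in place of full page-level translation. Returning false signals a
 * fallback to the conventional (IOMMU) page walk. */
static bool dav_check(uint64_t va, bool is_write)
{
    region_perm_t p = region_table[(va >> REGION_SHIFT) % NUM_REGIONS];
    if (!p.devirtualized)
        return false;            /* not identity-mapped: take slow path */
    return is_write ? p.writable : p.readable;
}

int main(void)
{
    region_table[1] = (region_perm_t){ true, true, false }; /* read-only */
    uint64_t va = (uint64_t)1 << REGION_SHIFT;   /* address in region 1 */
    printf("read ok: %d, write ok: %d\n",
           dav_check(va, false), dav_check(va, true));
    return 0;
}
```

Because the check is a flat indexed lookup rather than a multi-level walk, it can be overlapped with the data fetch on reads, as the abstract describes.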
DOI: 10.1145/3173162.3173194 · Published: 2018-03-19 · Citations: 51
CALOREE: Learning Control for Predictable Latency and Low Energy
Nikita Mishra, Connor Imes, J. Lafferty, H. Hoffmann
Many modern computing systems must provide reliable latency with minimal energy. Two central challenges arise when allocating system resources to meet these conflicting goals: (1) complexity: modern hardware exposes diverse resources with complicated interactions, and (2) dynamics: latency must be maintained despite unpredictable changes in the operating environment or input. Machine learning accurately models the latency of complex, interacting resources, but does not address system dynamics; control theory adjusts to dynamic changes, but struggles with complex resource interactions. We therefore propose CALOREE, a resource manager that learns key control parameters to meet latency requirements with minimal energy in complex, dynamic environments. CALOREE breaks resource allocation into two sub-tasks: learning how interacting resources affect speedup, and controlling speedup to meet latency requirements with minimal energy. CALOREE defines a general control system whose parameters are customized by a learning framework while maintaining control-theoretic formal guarantees that the latency goal will be met. We test CALOREE's ability to deliver reliable latency on heterogeneous ARM big.LITTLE architectures in both single- and multi-application scenarios. Compared to the best prior learning and control solutions, CALOREE reduces deadline misses by 60% and energy consumption by 13%.
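As a rough illustration of the control step (not CALOREE's actual hierarchical learning model or formal controller), the C sketch below computes the speedup needed to hit the latency target and picks the lowest-power configuration predicted to deliver it; the model values and the selection rule are simplified assumptions.

```c
#include <stddef.h>
#include <stdio.h>

/* One entry of a learned performance/power model: configuration i is
 * predicted to give speedup at a given power. In CALOREE these estimates
 * come from the learning framework; here they are assumed inputs. */
typedef struct {
    double speedup;   /* relative to a baseline configuration */
    double power;     /* watts */
} config_t;

/* One simplified control step: from the measured latency, compute the
 * speedup needed to hit the target, then choose the minimal-power
 * configuration predicted to deliver at least that speedup. The real
 * controller also provides formal guarantees and handles model error.
 * Falls back to configuration 0 if nothing qualifies. */
static size_t control_step(double measured, double target,
                           double cur_speedup,
                           const config_t *cfg, size_t n)
{
    double needed = cur_speedup * (measured / target);
    size_t best = 0;
    double best_power = -1.0;
    for (size_t i = 0; i < n; i++) {
        if (cfg[i].speedup >= needed &&
            (best_power < 0.0 || cfg[i].power < best_power)) {
            best = i;
            best_power = cfg[i].power;
        }
    }
    return best;
}

int main(void)
{
    config_t cfgs[] = { {1.0, 1.0}, {1.8, 2.2}, {2.5, 4.0}, {3.1, 6.5} };
    /* Running 40% too slow at speedup 1.0, so we need at least 1.4x. */
    size_t pick = control_step(1.4e-3, 1.0e-3, 1.0, cfgs, 4);
    printf("chose config %zu (speedup %.1f, %.1f W)\n",
           pick, cfgs[pick].speedup, cfgs[pick].power);
    return 0;
}
```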
DOI: 10.1145/3173162.3173184 · Published: 2018-03-19 · Citations: 101
Hardware Multithreaded Transactions
Jordan Fix, N. P. Nagendra, Sotiris Apostolakis, Hansen Zhang, Sophie Qiu, David I. August
Speculation with transactional memory systems helps programmers and compilers produce profitable thread-level parallel programs. Prior work shows that supporting transactions that can span multiple threads, rather than requiring transactions to be contained within a single thread, enables new types of speculative parallelization techniques for both programmers and parallelizing compilers. Unfortunately, software support for multi-threaded transactions (MTXs) comes with significant additional inter-thread communication overhead for speculation validation. This overhead can make otherwise good parallelization unprofitable for programs with sizeable read and write sets. Some programs using these prior software MTXs overcame this problem through significant efforts by expert programmers to minimize these sets and optimize communication, capabilities which compiler technology has been unable to equivalently achieve. Instead, this paper makes speculative parallelization less laborious and more feasible through low-overhead speculation validation, presenting the first complete design, implementation, and evaluation of hardware MTXs. Even with maximal speculation validation of every load and store inside transactions of tens to hundreds of millions of instructions, profitable parallelization of complex programs can be achieved. Across 8 benchmarks, this system achieves a geomean speedup of 99% over sequential execution on a multicore machine with 4 cores.
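The following toy C program illustrates only the programming model: a transaction whose participants run on different threads and commit or abort as one unit. The API names are invented, and the mutex and abort flag stand in for the hardware speculation and validation machinery the paper actually proposes.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy rendering of a multi-threaded transaction (MTX): one handle
 * shared by several threads, whose work commits or aborts as a unit. */
typedef struct {
    pthread_mutex_t lock;
    bool abort_requested;
} mtx_t;

static mtx_t tx;

static void mtx_init(mtx_t *t)
{
    pthread_mutex_init(&t->lock, NULL);
    t->abort_requested = false;
}

/* Any participant that detects a conflict aborts the whole transaction. */
static void mtx_abort(mtx_t *t)
{
    pthread_mutex_lock(&t->lock);
    t->abort_requested = true;
    pthread_mutex_unlock(&t->lock);
}

/* Commit succeeds only if no participant on any thread aborted. */
static bool mtx_commit(mtx_t *t)
{
    pthread_mutex_lock(&t->lock);
    bool ok = !t->abort_requested;
    pthread_mutex_unlock(&t->lock);
    return ok;
}

static void *worker(void *arg)
{
    /* Stage 2 of a speculatively parallelized loop iteration would run
     * here, in a different thread but inside the same transaction; on a
     * detected conflict it would call mtx_abort. */
    if (arg)
        mtx_abort(&tx);
    return NULL;
}

int main(void)
{
    mtx_init(&tx);
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    /* Stage 1 of the same iteration runs here concurrently. */
    pthread_join(t, NULL);
    printf("transaction %s\n", mtx_commit(&tx) ? "committed" : "aborted");
    return 0;
}
```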
DOI: 10.1145/3173162.3173172 · Published: 2018-03-19 · Citations: 6
Session details: Session 1A: New Architectures
J. Torrellas
DOI: 10.1145/3252952 · Published: 2018-03-19 · Citations: 0
Time Dilation and Contraction for Programmable Analog Devices with Jaunt
Sara Achour, M. Rinard
Programmable analog devices are a powerful new computing substrate that is especially appropriate for performing computationally intensive simulations of neuromorphic and cytomorphic models. Current state-of-the-art techniques for configuring analog devices to simulate dynamical systems do not consider the current and voltage operating ranges of analog device components or the sampling limitations of the device's digital interface. We present Jaunt, a new solver that scales the values that configure the analog device to ensure the resulting analog computation executes within the operating constraints of the device, preserves the recoverable dynamics of the original simulation, and executes slowly enough to observe these dynamics at the sampled digital outputs. Our results show that, on a set of benchmark biological simulations, 1) unscaled configurations produce incorrect simulations because they violate the operating ranges of the device, and 2) Jaunt delivers scaled configurations that respect the operating ranges to produce correct simulations with observable dynamics.
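For intuition about the constraint problem Jaunt solves (its real solver scales an entire device configuration jointly), here is a back-of-envelope C sketch that scales one state variable's amplitude into an assumed voltage range and dilates time until the dynamics fit under the sampling limit; every constant is a made-up example.

```c
#include <stdio.h>

int main(void)
{
    double x_max  = 250.0;   /* largest signal value, simulation units (assumed) */
    double k      = 4.0e6;   /* fastest dynamics in the model, 1/s (assumed) */
    double v_max  = 1.2;     /* device operating range, volts (assumed) */
    double f_samp = 1.0e6;   /* sampling rate of the digital interface, Hz */

    /* Amplitude scale: map the simulation range onto the voltage range. */
    double a = v_max / x_max;

    /* Time dilation: slow the computation until its fastest dynamics
     * stay below the Nyquist rate, so they are observable at the
     * sampled outputs. */
    double dilation = k / (f_samp / 2.0);

    printf("amplitude scale: %g V per simulation unit\n", a);
    printf("time dilation: run the device %gx slower\n", dilation);
    return 0;
}
```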
DOI: 10.1145/3173162.3173179 · Published: 2018-03-19 · Citations: 5
Session details: Session 6B: Datacenters
John B. Carter
DOI: 10.1145/3252963 · Published: 2018-03-19 · Citations: 0
LATR: Lazy Translation Coherence
Mohan Kumar, Steffen Maass, Sanidhya Kashyap, J. Veselý, Zi Yan, Taesoo Kim, A. Bhattacharjee, T. Krishna
We propose LATR (lazy TLB coherence), a software-based TLB shootdown mechanism that can alleviate the overhead of the synchronous TLB shootdown mechanism in existing operating systems. By handling TLB coherence in a lazy fashion, LATR avoids the expensive IPIs required to deliver a shootdown signal to remote cores, as well as the performance overhead of the associated interrupt handlers. Therefore, virtual memory operations, such as free and page-migration operations, can benefit significantly from LATR's mechanism. For example, LATR improves the latency of munmap() by 70.8% on a 2-socket machine, a widely used configuration in modern data centers. Real-world, performance-critical applications such as web servers can also benefit from LATR: without any application-level changes, LATR improves Apache by 59.9% compared to Linux, and by 37.9% compared to ABIS, a highly optimized, state-of-the-art TLB coherence technique.
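A schematic C sketch of the lazy scheme follows: the unmapping core logs stale ranges into per-core queues instead of sending IPIs, and each core flushes its own queue at its next timer tick or context switch. The queue layout and names are invented, and a real kernel would additionally need synchronization and must defer physical-page reuse until every core has drained.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_CORES 4
#define QUEUE_LEN 16

/* Per-core queues of stale translation ranges awaiting a local flush. */
struct lazy_entry { uint64_t start, end; bool valid; };
static struct lazy_entry lazy_queue[MAX_CORES][QUEUE_LEN];

/* Stand-in for the architecture-specific local TLB range flush. */
static void flush_tlb_range(uint64_t s, uint64_t e)
{
    printf("flush [%#lx, %#lx)\n", (unsigned long)s, (unsigned long)e);
}

/* munmap()/page-migration path: log the range on every core.
 * O(cores) bookkeeping, no IPIs, no waiting on remote cores. */
static void latr_log_shootdown(uint64_t start, uint64_t end)
{
    for (int c = 0; c < MAX_CORES; c++)
        for (int i = 0; i < QUEUE_LEN; i++)
            if (!lazy_queue[c][i].valid) {
                lazy_queue[c][i] = (struct lazy_entry){ start, end, true };
                break;
            }
}

/* Run by each core at its next timer tick or context switch. */
static void latr_drain(int core)
{
    for (int i = 0; i < QUEUE_LEN; i++)
        if (lazy_queue[core][i].valid) {
            flush_tlb_range(lazy_queue[core][i].start,
                            lazy_queue[core][i].end);
            lazy_queue[core][i].valid = false;
        }
}

int main(void)
{
    latr_log_shootdown(0x7f0000000000, 0x7f0000200000); /* as if munmap'd */
    for (int c = 0; c < MAX_CORES; c++)
        latr_drain(c);                                  /* later, per core */
    return 0;
}
```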
DOI: 10.1145/3173162.3173198 · Published: 2018-03-19 · Citations: 35
Session details: Session 3A: Programmable Devices and Co-processors
S. Narayanasamy
DOI: 10.1145/3252956 · Published: 2018-03-19 · Citations: 0
Exploiting Dynamic Thermal Energy Harvesting for Reusing in Smartphone with Mobile Applications
Yuting Dai, Tao Li, Benyong Liu, Mingcong Song, Huixiang Chen
Recently, mobile applications have gradually become performance- and resource-intensive, which results in massive battery power drain and high surface temperature, and further degrades the user experience. Thus, high power consumption and surface overheating have been considered a severe challenge in smartphone design. In this paper, we propose DTEHR, a mobile Dynamic Thermal Energy Harvesting Reusing framework, to tackle this challenge. The approach is sustainable in that it generates energy using dynamic Thermoelectric Generators (TEGs). The generated energy not only powers Thermoelectric Coolers (TECs) for cooling down hot-spots, but also recharges micro-supercapacitors (MSCs) for extended smartphone usage. To analyze thermal characteristics and evaluate DTEHR across real-world applications, we build MPPTAT (Multi-comPonent Power and Thermal Analysis Tool), a power and thermal analysis tool for Android. The results show that DTEHR reduces the temperature differences between hot and cold areas by up to 15.4°C (internal) and 7°C (surface). With TEC-based hot-spot cooling, DTEHR reduces the temperature of surface and internal hot-spots by an average of 8° and 12.8mW, respectively. With dynamic TEGs, DTEHR generates 2.7-15mW of power, hundreds of times more than the TECs need to cool down hot-spots. Thus, the extra generated power can be stored in MSCs to prolong battery life.
DOI: 10.1145/3173162.3173188 · Published: 2018-03-19 · Citations: 10
Liquid Silicon-Monona: A Reconfigurable Memory-Oriented Computing Fabric with Scalable Multi-Context Support
Yue Zha, J. Li
With the recent trend of promoting Field-Programmable Gate Arrays (FPGAs) to first-class citizens in accelerating compute-intensive applications in networking, cloud services, and artificial intelligence, FPGAs face two major challenges in sustaining competitive advantages in performance and energy efficiency for diverse cloud workloads: 1) limited configuration capability for supporting light-weight computation/on-chip data storage to accelerate emerging search-/data-intensive applications, and 2) lack of architectural support to hide reconfiguration overhead for assisting virtualization in a cloud computing environment. In this paper, we propose a reconfigurable memory-oriented computing fabric, namely Liquid Silicon-Monona (L-Si), enabled by emerging non-volatile memory technology, i.e. RRAM, to address these two challenges. Specifically, L-Si addresses the first challenge by virtue of a new architecture comprising a 2D array of physically identical but functionally-configurable building blocks. It, for the first time, extends the configuration capabilities of existing FPGAs from computation to the whole spectrum ranging from computation to data storage. It allows users to better customize hardware by flexibly partitioning hardware resources between computation and memory, greatly benefiting emerging search- and data-intensive applications. To address the second challenge, L-Si provides scalable multi-context architectural support to minimize reconfiguration overhead for assisting virtualization. In addition, we provide compiler support to facilitate the programming of applications written in high-level programming languages (e.g. OpenCL) and frameworks (e.g. TensorFlow, MapReduce) while fully exploiting the unique architectural capability of L-Si. Our evaluation results show that L-Si achieves 99.6% area reduction, 1.43× throughput improvement, and 94.0% power reduction on search-intensive benchmarks, as compared with the FPGA baseline. For neural network benchmarks, on average, L-Si achieves 52.3× speedup, 113.9× energy reduction, and 81% area reduction over the FPGA baseline. In addition, the multi-context architecture of L-Si reduces the context-switching time to ∼10ns, compared with an off-the-shelf FPGA (∼100ms), greatly facilitating virtualization.
DOI: 10.1145/3173162.3173167 · Published: 2018-03-19 · Citations: 8