
Proceedings of the 2018 International Symposium on Code Generation and Optimization: Latest Publications

Loop transformations leveraging hardware prefetching
Savvas Sioutas, S. Stuijk, H. Corporaal, T. Basten, L. Somers
Memory-bound applications heavily depend on the bandwidth of the system in order to achieve high performance. Improving temporal and/or spatial locality through loop transformations is a common way of mitigating this dependency. However, choosing the right combination of optimizations is not a trivial task, because most of them alter the memory access pattern of the application and as a result interfere with the efficiency of the hardware prefetching mechanisms present in modern architectures. We propose an optimization algorithm that analytically classifies an algorithmic description of a loop nest in order to decide whether it should be optimized for temporal or for spatial locality, while also taking hardware prefetching into account. We implement our technique as a tool to be used with the Halide compiler and test it on a variety of benchmarks. We find an average performance improvement of over 40% compared to previous analytical models targeting the Halide language and compiler.
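As a minimal sketch of the kind of locality-improving transformation the paper's model chooses among (this is loop tiling in general, not the paper's algorithm), consider a column-major walk over a row-major array: tiling it keeps a small working set cache-resident.

```python
# Illustrative sketch: loop tiling. The naive version walks a row-major
# array in column order (poor spatial locality); the tiled version visits
# t x t blocks so each group of rows is reused while still cache-resident.

def sum_transposed_naive(a, n):
    # Column-major traversal of a row-major array.
    return sum(a[i * n + j] for j in range(n) for i in range(n))

def sum_transposed_tiled(a, n, t=4):
    # Same reduction, restructured into t x t tiles.
    total = 0
    for jj in range(0, n, t):
        for ii in range(0, n, t):
            for j in range(jj, min(jj + t, n)):
                for i in range(ii, min(ii + t, n)):
                    total += a[i * n + j]
    return total

n = 8
a = list(range(n * n))
assert sum_transposed_naive(a, n) == sum_transposed_tiled(a, n)
```

Both loops compute the same reduction; only the visit order (and hence the cache and prefetcher behavior) differs.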
DOI: 10.1145/3168823 · Published: 2018-02-24
Citations: 12
Local memory-aware kernel perforation
Daniel Maier, Biagio Cosenza, B. Juurlink
Many applications provide inherent resilience to some amount of error and can potentially trade accuracy for performance by using approximate computing. Applications running on GPUs often use local memory to minimize the number of global memory accesses and to speed up execution. Local memory can also be very useful to improve the way approximate computation is performed, e.g., by improving the quality of approximation with data reconstruction techniques. This paper introduces local memory-aware perforation techniques specifically designed for the acceleration and approximation of GPU kernels. We propose a local memory-aware kernel perforation technique that first skips the loading of parts of the input data from global memory, and later uses reconstruction techniques on local memory to reach higher accuracy while having performance similar to state-of-the-art techniques. Experiments show that our approach is able to accelerate the execution of a variety of applications from 1.6× to 3× while introducing an average error of 6%, which is much smaller than that of other approaches. Results further show how much the error depends on the input data and application scenario, the impact of local memory tuning and different parameter configurations.
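The core idea can be sketched in a few lines (a conceptual stand-in, not the authors' GPU implementation): skip loading part of the input, then reconstruct the skipped values from their loaded neighbours before computing.

```python
# Conceptual sketch of kernel perforation with reconstruction: only
# even-indexed elements are "loaded"; odd indices are reconstructed by
# linear interpolation of neighbours (standing in for local-memory
# reconstruction on a GPU).

def perforated_sum(data):
    # Perforation: skip half of the global-memory loads.
    loaded = {i: data[i] for i in range(0, len(data), 2)}
    # Reconstruction: interpolate the skipped values.
    recon = []
    for i in range(len(data)):
        if i in loaded:
            recon.append(loaded[i])
        else:
            left = loaded.get(i - 1, 0)
            right = loaded.get(i + 1, left)
            recon.append((left + right) / 2)
    return sum(recon)

data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
approx = perforated_sum(data)
exact = sum(data)
error = abs(approx - exact) / exact   # small relative error, half the loads
```

On linear data the interpolation is exact except at the boundary, so the approximation error stays small while half the loads are skipped.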
DOI: 10.1145/3168814 · Published: 2018-02-24
Citations: 12
Automating efficient variable-grained resiliency for low-power IoT systems
Sara S. Baghsorkhi, Christos Margiolas
New trends in edge computing encourage pushing more of the compute and analytics to the outer edge and processing most of the data locally. We explore how to transparently provide resiliency for heavy duty edge applications running on low-power devices that must deal with frequent and unpredictable power disruptions. Complicating this process further are (a) memory usage restrictions in tiny low-power devices, which affect not only performance but efficacy of the resiliency techniques, and (b) differing resiliency requirements across deployment environments. Nevertheless, an application developer wants the ability to write an application once, and have it be reusable across all low-power platforms and across all different deployment settings. In response to these challenges, we have devised a transparent roll-back recovery mechanism that performs incremental checkpoints with minimal execution time overhead and at variable granularities. Our solution includes the co-design of firmware, runtime and compiler transformations for providing seamless fault-tolerance, along with an auto-tuning layer that automatically generates multiple resilient variants of an application. Each variant spreads application’s execution over atomic transactional regions of a certain granularity. Variants with smaller regions provide better resiliency, but incur higher overhead; thus, there is no single best option, but rather a Pareto optimal set of configurations. We apply these strategies across a variety of edge device applications and measure the execution time overhead of the framework on a TI MSP430FR6989. When we restrict uninterrupted atomic intervals to 100ms, our framework keeps geomean overhead below 2.48x.
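A minimal sketch of incremental checkpointing at atomic-region boundaries (an assumed design for illustration, not the paper's firmware/compiler co-design): only values written since the last checkpoint are persisted, and a power failure rolls execution back to the last committed region.

```python
# Sketch of roll-back recovery with incremental checkpoints. Writes go to
# a dirty set; committing a region merges only the dirty entries into the
# stable (conceptually non-volatile) snapshot; a power failure discards
# uncommitted work.

class CheckpointedState:
    def __init__(self):
        self.stable = {}   # last committed snapshot (survives power loss)
        self.dirty = {}    # writes inside the current atomic region

    def write(self, key, value):
        self.dirty[key] = value

    def read(self, key):
        return self.dirty.get(key, self.stable.get(key))

    def commit_region(self):
        # Incremental checkpoint: persist only what changed.
        self.stable.update(self.dirty)
        self.dirty.clear()

    def power_failure(self):
        # Roll back: uncommitted writes are lost, the snapshot survives.
        self.dirty.clear()

s = CheckpointedState()
s.write("x", 1)
s.commit_region()
s.write("x", 99)      # region in progress...
s.power_failure()     # ...interrupted before commit
assert s.read("x") == 1
```

Smaller regions mean less lost work per failure but more commits, which is exactly the resiliency/overhead trade-off the paper's auto-tuner explores.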
DOI: 10.1145/3168816 · Published: 2018-02-24
Citations: 21
SIMD intrinsics on managed language runtimes
A. Stojanov, Ivaylo Toskov, Tiark Rompf, Markus Püschel
Managed language runtimes such as the Java Virtual Machine (JVM) provide adequate performance for a wide range of applications, but at the same time, they lack much of the low-level control that performance-minded programmers appreciate in languages like C/C++. One important example is the intrinsics interface that exposes instructions of SIMD (Single Instruction Multiple Data) vector ISAs (Instruction Set Architectures). In this paper we present an automatic approach for including native intrinsics in the runtime of a managed language. Our implementation consists of two parts. First, for each vector ISA, we automatically generate the intrinsics API from the vendor-provided XML specification. Second, we employ a metaprogramming approach that enables programmers to generate and load native code at runtime. In this setting, programmers can use the entire high-level language as a kind of macro system to define new high-level vector APIs with zero overhead. As an example use case we show a variable precision API. We provide an end-to-end implementation of our approach in the HotSpot VM that supports all 5912 Intel SIMD intrinsics from MMX to AVX-512. Our benchmarks demonstrate that this combination of SIMD and metaprogramming enables developers to write high-performance, vectorized code on an unmodified JVM that outperforms the auto-vectorizing HotSpot just-in-time (JIT) compiler and provides tight integration between vectorized native code and the managed JVM ecosystem.
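The generate-the-API-from-a-spec idea can be illustrated with a toy analogy (a hypothetical spec in Python, not Intel's XML or the HotSpot implementation): derive a small "vector API" at runtime from a machine-readable operation description, and load the generated code on the fly.

```python
# Conceptual analogy: generate vector-operation wrappers from a
# machine-readable spec at runtime, mirroring how the paper derives the
# intrinsics API from the vendor-provided specification. SPEC is a
# hypothetical stand-in for that specification.

SPEC = [
    {"name": "add", "op": "+"},
    {"name": "mul", "op": "*"},
]

def build_vector_api(spec):
    api = {}
    for entry in spec:
        # Emit source for one wrapper, then "load" it at runtime.
        src = (
            f"def v{entry['name']}(a, b):\n"
            f"    return [x {entry['op']} y for x, y in zip(a, b)]\n"
        )
        ns = {}
        exec(src, ns)
        api[f"v{entry['name']}"] = ns[f"v{entry['name']}"]
    return api

api = build_vector_api(SPEC)
assert api["vadd"]([1, 2], [3, 4]) == [4, 6]
assert api["vmul"]([1, 2], [3, 4]) == [3, 8]
```

In the paper the generated code is native SIMD loaded into the JVM; here `exec` merely stands in for that runtime code-generation-and-load step.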
DOI: 10.1145/3168810 · Published: 2018-02-24
Citations: 17
Look-ahead SLP: auto-vectorization in the presence of commutative operations
Vasileios Porpodas, Rodrigo C. O. Rocha, L. F. Góes
Auto-vectorizing compilers automatically generate vector (SIMD) instructions out of scalar code. The state-of-the-art algorithm for straight-line code vectorization is Superword-Level Parallelism (SLP). In this work we identify a major limitation at the core of the SLP algorithm, in the performance-critical step of collecting the vectorization candidate instructions that form the SLP-graph data structure. SLP lacks global knowledge when building its vectorization graph, which negatively affects its local decisions when it encounters commutative instructions. We propose LSLP, an improved algorithm that can plug-in to existing SLP implementations, and can effectively vectorize code with arbitrarily long chains of commutative operations. LSLP relies on short-depth look-ahead for better-informed local decisions. Our evaluation on a real machine shows that LSLP can significantly improve the performance of real-world code with little compilation-time overhead.
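A toy sketch of the look-ahead idea (illustrative only, not the LSLP implementation): when pairing isomorphic expressions across lanes, a direct operand match can fail even though swapping the operands of a commutative operation would make the lanes line up.

```python
# Pairing candidate expressions for SLP, with commutative operand
# reordering. Expressions are (op, left, right) tuples; leaves are strings.

COMMUTATIVE = {"+", "*"}

def opcode(e):
    return e[0] if isinstance(e, tuple) else "leaf"

def match(a, b):
    # Can trees a and b be vectorized as one lane pair? Look ahead into
    # the operands, trying the swapped order for commutative ops.
    if opcode(a) != opcode(b):
        return False
    if opcode(a) == "leaf":
        return True
    if match(a[1], b[1]) and match(a[2], b[2]):
        return True
    if a[0] in COMMUTATIVE:
        return match(a[1], b[2]) and match(a[2], b[1])
    return False

lane0 = ("+", ("*", "a", "b"), "c")   # a*b + c
lane1 = ("+", "f", ("*", "d", "e"))   # f + d*e: matches after a swap
assert match(lane0, lane1)
```

A purely local, order-sensitive pairing would reject `lane0`/`lane1`; the one-level look-ahead through the commutative `+` recovers the match.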
DOI: 10.1145/3168807 · Published: 2018-02-24
Citations: 18
Scalable concurrency debugging with distributed graph processing
Long Zheng, Xiaofei Liao, Hai Jin, Jieshan Zhao, Qinggang Wang
Existing constraint-solving-based techniques enable efficient, high-coverage concurrency debugging. Yet there remains a significant gap between the state of the art and the state of programming practice when scaling to long-running executions of programs. In this paper, we revisit the scalability problem of the state-of-the-art constraint-solving-based technique. Our key insight is that concurrency debugging for many real-world bugs can be turned into a graph traversal problem. We therefore present GraphDebugger, a novel debugging framework that enables scalable concurrency analysis on program graphs via a tailored graph-parallel analysis in a distributed environment. We verify that GraphDebugger is more capable than CLAP in reproducing real-world bugs that involve complex concurrency analysis. Our extensive evaluation on 7 real-world programs shows that GraphDebugger (deployed on an 8-node EC2-like cluster) reproduces concurrency bugs within 1∼8 minutes, whereas CLAP takes 1∼30 hours or may fail to return a solution.
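A simplified sketch of casting a concurrency question as graph traversal (an assumed model for illustration, not GraphDebugger itself): if neither of two conflicting accesses can reach the other in a happens-before graph, they are unordered and may race.

```python
# Race detection as reachability over a happens-before graph, expressed as
# plain BFS. Nodes are events; edges are happens-before orderings.

from collections import deque

def reachable(graph, src, dst):
    seen, work = {src}, deque([src])
    while work:
        node = work.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return False

def may_race(graph, a, b):
    # Unordered in both directions => potentially racy.
    return not reachable(graph, a, b) and not reachable(graph, b, a)

# t1w and t2w both write x; only t1w is ordered before the join event.
hb = {"t1w": ["join"], "t2w": [], "join": []}
assert may_race(hb, "t1w", "t2w")
assert not may_race(hb, "t1w", "join")
```

Framing the analysis this way is what lets it run on off-the-shelf distributed graph-processing engines instead of a monolithic constraint solver.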
DOI: 10.1145/3168817 · Published: 2018-02-24
Citations: 4
SGXElide: enabling enclave code secrecy via self-modification
Erick Bauman, Huibo Wang, Mingwei Zhang, Zhiqiang Lin
Intel SGX provides a secure enclave in which code and data are hidden from the outside world, including privileged code such as the OS or hypervisor. However, by default, enclave code prior to initialization can be disassembled and therefore no secrets can be embedded in the binary. This is a problem for developers wishing to protect code secrets. This paper introduces SGXElide, a nearly-transparent framework that enables enclave code confidentiality. The key idea is to treat program code as data and dynamically restore secrets after an enclave is initialized. SGXElide can be integrated into any enclave, providing a mechanism to securely decrypt or deliver the secret code with the assistance of a developer-controlled trusted remote party. We have implemented SGXElide atop a recently released version of the Linux SGX SDK, and our evaluation with a number of programs shows that SGXElide can be used to protect the code secrecy of practical applications with no overhead after enclave initialization.
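The code-as-data idea can be sketched conceptually (a Python stand-in only; real SGXElide operates on native enclave binaries and a provisioned key, not XOR with a constant): the shipped artifact carries the secret routine encrypted, and after "initialization" the plaintext is restored and executed in place.

```python
# Conceptual sketch: treat program code as data. What ships (and what an
# attacker can disassemble) is the encrypted blob; decryption happens only
# after the trusted initialization step.

KEY = 0x5A  # toy key; in SGXElide keys come from a trusted remote party

def xor_bytes(blob, key=KEY):
    return bytes(b ^ key for b in blob)

secret_src = "def secret(x):\n    return x * 41 + 1\n"
shipped = xor_bytes(secret_src.encode())   # the on-disk form
assert b"41" not in shipped                # secret constant is hidden

# After "enclave initialization": restore the code and run it.
ns = {}
exec(xor_bytes(shipped).decode(), ns)
assert ns["secret"](1) == 42
```

The essential property mirrored here is that no secret appears in the pre-initialization image; everything sensitive is materialized only inside the protected environment.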
DOI: 10.1145/3168833 · Published: 2018-02-24
Citations: 27
Transforming loop chains via macro dataflow graphs
Eddie C. Davis, M. Strout, C. Olschanowsky
This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in the stencil computation. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e. loop chains. The performance of stencil applications can be improved by modifying the execution schedules. However, modern architectures are increasingly constrained by the memory subsystem bandwidth. To fully realize the benefits of the schedule changes for improved locality, temporary storage allocation must also be minimized. We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of transformations that can be performed on the graphs such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark. Our fastest variant outperforms the auto-tuned variants produced by both frameworks.
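Fusion's effect on temporary storage can be shown with a two-stage loop chain (an illustrative sketch, not the paper's IR or code generator): fusing the producer and consumer loops eliminates the intermediate array entirely.

```python
# A loop chain of two stencil stages: b[i] = a[i-1] + a[i+1], then
# c[i] = 2 * b[i]. Unfused, the full temporary b is materialized; fused,
# each value of b is consumed as soon as it is produced.

def unfused(a):
    n = len(a)
    b = [a[i - 1] + a[i + 1] for i in range(1, n - 1)]  # whole temp array
    return [2 * v for v in b]

def fused(a):
    n = len(a)
    out = []
    for i in range(1, n - 1):
        v = a[i - 1] + a[i + 1]   # consumed immediately, no temp array
        out.append(2 * v)
    return out

a = [1, 2, 3, 4, 5]
assert unfused(a) == fused(a) == [8, 12, 16]
```

A cost model over the macro dataflow graph weighs exactly this kind of trade: fused schedules cut memory traffic and storage, at the price of less freedom to reorder the stages.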
DOI: 10.1145/3168832 · Published: 2018-02-24
Citations: 14
A compiler for cyber-physical digital microfluidic biochips
C. Curtis, D. Grissom, P. Brisk
Programmable microfluidic laboratories-on-a-chip (LoCs) offer the benefits of automation and miniaturization to the life sciences. This paper presents an updated version of the BioCoder language and a fully static (offline) compiler that can target an emerging class of LoCs called Digital Microfluidic Biochips (DMFBs), which manipulate discrete droplets of liquid on a 2D electrode grid. The BioCoder language and runtime execution engine leverage advances in sensor integration to enable specification, compilation, and execution of assays (bio-chemical procedures) that feature online decision-making based on sensory data acquired during assay execution. The compiler features a novel hybrid intermediate representation (IR) that interleaves fluidic operations with computations performed on sensor data. The IR extends the traditional notions of liveness and interference to fluidic variables and operations, as needed to target the DMFB, which itself can be viewed as a spatially reconfigurable array. The code generator converts the IR into the following: (1) a set of electrode activation sequences for each basic block in the control flow graph (CFG); (2) a set of computations performed on sensor data, which dynamically determine the result of each control flow operation; and (3) a set of electrode activation sequences for each control flow transfer operation (CFG edge). The compiler is validated using a software simulator which produces animated videos of realistic bioassay execution on a DMFB.
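A toy sketch of what an electrode activation sequence looks like (an assumed grid model for illustration, not the BioCoder compiler's router): a droplet moves one cell per time step by activating the adjacent electrode it should move onto.

```python
# Droplet routing on a 2D electrode grid: emit one electrode activation
# per time step along a simple Manhattan route (x first, then y).

def route(src, dst):
    x, y = src
    seq = []
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        seq.append((x, y))        # activate the neighbouring electrode
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        seq.append((x, y))
    return seq

seq = route((0, 0), (2, 1))
assert seq == [(1, 0), (2, 0), (2, 1)]
assert len(seq) == 3              # one activation per grid step
```

In the real compiler such sequences are produced per basic block and per control-flow edge, and the choice of route must also respect other droplets on the grid.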
{"title":"A compiler for cyber-physical digital microfluidic biochips","authors":"C. Curtis, D. Grissom, P. Brisk","doi":"10.1145/3168826","DOIUrl":"https://doi.org/10.1145/3168826","url":null,"abstract":"Programmable microfluidic laboratories-on-a-chip (LoCs) offer the benefits of automation and miniaturization to the life sciences. This paper presents an updated version of the BioCoder language and a fully static (offline) compiler that can target an emerging class of LoCs called Digital Microfluidic Biochips (DMFBs), which manipulate discrete droplets of liquid on a 2D electrode grid. The BioCoder language and runtime execution engine leverage advances in sensor integration to enable specification, compilation, and execution of assays (bio-chemical procedures) that feature online decision-making based on sensory data acquired during assay execution. The compiler features a novel hybrid intermediate representation (IR) that interleaves fluidic operations with computations performed on sensor data. The IR extends the traditional notions of liveness and interference to fluidic variables and operations, as needed to target the DMFB, which itself can be viewed as a spatially reconfigurable array. The code generator converts the IR into the following: (1) a set of electrode activation sequences for each basic block in the control flow graph (CFG); (2) a set of computations performed on sensor data, which dynamically determine the result of each control flow operation; and (3) a set of electrode activation sequences for each control flow transfer operation (CFG edge). 
The compiler is validated using a software simulator which produces animated videos of realistic bioassay execution on a DMFB.","PeriodicalId":103558,"journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133629719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
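The abstract above lists three code-generation outputs: per-basic-block electrode activation sequences, computations on sensor data that resolve each control-flow decision, and activation sequences for CFG edges. A hypothetical sketch of how those three artifacts might fit together for a tiny mix-and-detect assay (the grid coordinates, threshold, and all names are invented for illustration, not taken from the paper):

```python
# (1) Electrode activation sequences per basic block: each step is the
# set of (row, col) electrode cells energized on the 2D grid.
block_sequences = {
    "mix":    [{(0, 0)}, {(0, 1)}, {(0, 0)}, {(0, 1)}],  # shuttle droplet back and forth
    "detect": [{(1, 1)}],                                # park droplet on the sensor cell
}

# (2) Computation on sensor data that decides the CFG successor,
# e.g. repeat mixing until a reading crosses a threshold.
def branch_after_detect(sensor_reading):
    return "mix" if sensor_reading < 0.5 else "done"

# (3) Electrode activation sequences per CFG edge: droplet transport
# between the grid locations the two blocks expect.
edge_sequences = {
    ("mix", "detect"): [{(0, 1)}, {(1, 1)}],
    ("detect", "mix"): [{(1, 1)}, {(0, 1)}],
}

def run(readings):
    """Interpret the compiled artifacts over a stream of sensor readings."""
    trace, block = [], "mix"
    readings = iter(readings)
    while block != "done":
        trace.extend(block_sequences.get(block, []))
        if block == "detect":
            nxt = branch_after_detect(next(readings))  # online decision
        else:
            nxt = "detect"
        trace.extend(edge_sequences.get((block, nxt), []))
        block = nxt
    return trace

trace = run([0.2, 0.8])  # first reading loops back to mix, second exits
# len(trace) == 16 activation steps in total
```

The separation mirrors the cyber-physical nature of the target: (1) and (3) are fully static electrode schedules, while (2) is the only part evaluated at runtime, from sensor data acquired during assay execution.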
CollectionSwitch: a framework for efficient and dynamic collection selection
D. Costa, A. Andrzejak
Selecting collection data structures for a given application is a crucial aspect of the software development. Inefficient usage of collections has been credited as a major cause of performance bloat in applications written in Java, C++ and C#. Furthermore, a single implementation might not be optimal throughout the entire program execution. This demands an adaptive solution that adjusts at runtime the collection implementations to varying workloads. We present CollectionSwitch, an application-level framework for efficient collection adaptation. It selects at runtime collection implementations in order to optimize the execution and memory performance of an application. Unlike previous works, we use workload data on the level of collection allocation sites to guide the optimization process. Our framework identifies allocation sites which instantiate suboptimal collection variants, and selects optimized variants for future instantiations. As a further contribution we propose adaptive collection implementations which switch their underlying data structures according to the size of the collection. We implement this framework in Java, and demonstrate the improvements in terms of time and memory behavior across a range of benchmarks. To our knowledge, it is the first approach which is capable of runtime performance optimization of Java collections with very low overhead.
{"title":"CollectionSwitch: a framework for efficient and dynamic collection selection","authors":"D. Costa, A. Andrzejak","doi":"10.1145/3168825","DOIUrl":"https://doi.org/10.1145/3168825","url":null,"abstract":"Selecting collection data structures for a given application is a crucial aspect of the software development. Inefficient usage of collections has been credited as a major cause of performance bloat in applications written in Java, C++ and C#. Furthermore, a single implementation might not be optimal throughout the entire program execution. This demands an adaptive solution that adjusts at runtime the collection implementations to varying workloads. We present CollectionSwitch, an application-level framework for efficient collection adaptation. It selects at runtime collection implementations in order to optimize the execution and memory performance of an application. Unlike previous works, we use workload data on the level of collection allocation sites to guide the optimization process. Our framework identifies allocation sites which instantiate suboptimal collection variants, and selects optimized variants for future instantiations. As a further contribution we propose adaptive collection implementations which switch their underlying data structures according to the size of the collection. We implement this framework in Java, and demonstrate the improvements in terms of time and memory behavior across a range of benchmarks. 
To our knowledge, it is the first approach which is capable of runtime performance optimization of Java collections with very low overhead.","PeriodicalId":103558,"journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132357897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
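The abstract above proposes adaptive collection implementations that switch their underlying data structure according to the size of the collection. A minimal sketch of that idea in Python (the paper's framework targets Java collections; the class name and threshold here are invented for illustration):

```python
# Size-adaptive set: small instances use a compact array, which is
# cheap to allocate and fast to scan; past a threshold the backing
# store migrates to a hash set for O(1) membership tests.

class AdaptiveSet:
    SWITCH_THRESHOLD = 16  # invented cutover point for this sketch

    def __init__(self):
        self._small = []    # array-backed while the collection is small
        self._large = None  # hash-backed once it grows

    def add(self, item):
        if self._large is not None:
            self._large.add(item)
        elif item not in self._small:
            self._small.append(item)
            if len(self._small) > self.SWITCH_THRESHOLD:
                self._large = set(self._small)  # one-time migration
                self._small = None

    def __contains__(self, item):
        if self._large is not None:
            return item in self._large
        return item in self._small

    def backend(self):
        return "hash" if self._large is not None else "array"

s = AdaptiveSet()
for i in range(10):
    s.add(i)
print(s.backend())       # array
for i in range(10, 30):
    s.add(i)
print(s.backend())       # hash
print(5 in s, 99 in s)   # True False
```

CollectionSwitch itself goes further than this per-instance trick: it profiles workloads per allocation site and selects optimized variants for future instantiations, so the switching decision amortizes across many objects rather than being re-derived inside each one.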
Copyright © 2023 Book学术 All rights reserved.