Introspection-based memory de-duplication and migration
J. Chiang, Han-Lin Li, T. Chiueh
International Conference on Virtual Execution Environments, 2013. DOI: 10.1145/2451512.2451525

Memory virtualization abstracts a physical machine's memory resource and presents to the virtual machines running on it a piece of physical memory that can be shared, compressed and moved. To optimize memory resource utilization by fully leveraging the flexibility afforded by memory virtualization, it is essential that the hypervisor have some sense of how the guest VMs use their allocated physical memory. One way to do this is virtual machine introspection (VMI), which interprets byte values in a guest memory space as semantically meaningful data structures. However, identifying a guest VM's memory usage information, such as its free memory pool, is non-trivial. This paper describes a bootstrapping VM introspection technique that accurately extracts free memory pool information from multiple versions of Windows and Linux without kernel-version-specific hard-coding, shows how to apply this technique to improve the efficiency of memory de-duplication and memory state migration, and reports the resulting improvement in memory de-duplication speed, the additional memory pages de-duplicated, and the reduction in traffic associated with memory state migration.
{"title":"Introspection-based memory de-duplication and migration","authors":"J. Chiang, Han-Lin Li, T. Chiueh","doi":"10.1145/2451512.2451525","DOIUrl":"https://doi.org/10.1145/2451512.2451525","url":null,"abstract":"Memory virtualization abstracts a physical machine's memory resource and presents to the virtual machines running on it a piece of physical memory that could be shared, compressed and moved. To optimize the memory resource utilization by fully leveraging the flexibility afforded by memory virtualization, it is essential that the hypervisor have some sense of how the guest VMs use their allocated physical memory. One way to do this is virtual machine introspection (VMI), which interprets byte values in a guest memory space into semantically meaningful data structures. However, identifying a guest VM's memory usage information such as free memory pool is non-trivial. This paper describes a bootstrapping VM introspection technique that could accurately extract free memory pool information from multiple versions of Windows and Linux without kernel version-specific hard-coding, how to apply this technique to improve the efficiency of memory de-duplication and memory state migration, and the resulting improvement in memory de-duplication speed, gain in additional memory pages de-duplicated, and reduction in traffic loads associated with memory state migration.","PeriodicalId":202844,"journal":{"name":"International Conference on Virtual Execution Environments","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128848639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving dynamic binary optimization through early-exit guided code region formation
Chun-Chen Hsu, Pangfeng Liu, Jan-Jan Wu, P. Yew, Ding-Yong Hong, W. Hsu, Chien-Min Wang
International Conference on Virtual Execution Environments, 2013. DOI: 10.1145/2451512.2451519

Most dynamic binary translators (DBTs) and optimizers (DBOs) target binary traces, i.e., frequently executed paths, as the code regions to be translated and optimized. Code region formation is the most important first step in all DBTs and DBOs. The quality of the dynamically formed code regions determines the extent and the types of optimization opportunities that can be exposed to DBTs and DBOs, and thus determines the ultimate quality of the final optimized code. The Next-Executing-Tail (NET) trace formation method used in HP Dynamo is an early example of such techniques. Many existing trace formation schemes are variants of NET. They work very well for most binary traces, but they suffer from a major problem: the formed traces may contain a large number of early exits that can be taken during execution. If this happens frequently, the program spends more time in the slow binary interpreter or in unoptimized code regions than in the optimized traces in the code cache, and the benefit of trace optimization is lost. Traces/regions with frequently taken early exits are called delinquent traces/regions. Our empirical study shows that at least 8 of the 12 SPEC CPU2006 integer benchmarks have delinquent traces.

In this paper, we propose a light-weight region formation technique called Early-Exit Guided Region Formation (EEG) to improve the quality of the formed traces/regions. It iteratively identifies and merges delinquent regions into larger code regions. We have implemented our EEG algorithm in two LLVM-based multi-threaded DBTs targeting the ARM and IA32 instruction set architectures (ISAs), respectively. Using the SPEC CPU2006 benchmark suite with reference inputs, our results show that, compared to an NET variant currently used in QEMU, a state-of-the-art retargetable DBT, EEG achieves a significant performance improvement of up to 72% (27% on average) for IA32 and up to 49% (23% on average) for ARM.
{"title":"Improving dynamic binary optimization through early-exit guided code region formation","authors":"Chun-Chen Hsu, Pangfeng Liu, Jan-Jan Wu, P. Yew, Ding-Yong Hong, W. Hsu, Chien-Min Wang","doi":"10.1145/2451512.2451519","DOIUrl":"https://doi.org/10.1145/2451512.2451519","url":null,"abstract":"Most dynamic binary translators (DBT) and optimizers (DBO) target binary traces, i.e. frequently executed paths, as code regions to be translated and optimized. Code region formation is the most important first step in all DBTs and DBOs. The quality of the dynamically formed code regions determines the extent and the types of optimization opportunities that can be exposed to DBTs and DBOs, and thus, determines the ultimate quality of the final optimized code. The Next-Executing-Tail (NET) trace formation method used in HP Dynamo is an early example of such techniques. Many existing trace formation schemes are variants of NET. They work very well for most binary traces, but they also suffer a major problem: the formed traces may contain a large number of early exits that could be branched out during the execution. If this happens frequently, the program execution will spend more time in the slow binary interpreter or in the unoptimized code regions than in the optimized traces in code cache. The benefit of the trace optimization is thus lost. Traces/regions with frequently taken early-exits are called delinquent traces/regions. Our empirical study shows that at least 8 of the 12 SPEC CPU2006 integer benchmarks have delinquent traces.\u0000 In this paper, we propose a light-weight region formation technique called Early-Exit Guided Region Formation (EEG) to improve the quality of the formed traces/regions. It iteratively identifies and merges delinquent regions into larger code regions. We have implemented our EEG algorithm in two LLVM-based multi-threaded DBTs targeting ARM and IA32 instruction set architecture (ISA), respectively. Using SPEC CPU2006 benchmark suite with reference inputs, our results show that compared to an NET-variant currently used in QEMU, a state-of-the-art retargetable DBT, EEG can achieve a significant performance improvement of up to 72% (27% on average), and to 49% (23% on average) for IA32 and ARM, respectively.","PeriodicalId":202844,"journal":{"name":"International Conference on Virtual Execution Environments","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121927814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing virtual machine live storage migration in heterogeneous storage environment
Ruijin Zhou, Fang Liu, Chao Li, Tao Li
International Conference on Virtual Execution Environments, 2013. DOI: 10.1145/2451512.2451529

Virtual machine (VM) live storage migration techniques significantly increase the mobility and manageability of virtual machines in the era of cloud computing. On the other hand, as solid state drives (SSDs) become increasingly popular in data centers, VM live storage migration will inevitably encounter heterogeneous storage environments. Nevertheless, conventional migration mechanisms consider neither the speed discrepancy between devices nor SSDs' wear-out, which not only causes significant performance degradation but also shortens SSD lifetime. This paper, for the first time, addresses the efficiency of VM live storage migration in heterogeneous storage environments from a multi-dimensional perspective: user experience, device wear, and manageability. We derive a flexible metric (migration cost) that captures various design preferences. Based on it, we propose and prototype three new storage migration strategies: 1) Low Redundancy (LR), which generates the least amount of redundant writes; 2) Source-based Low Redundancy (SLR), which balances IO performance and write redundancy; and 3) Asynchronous IO Mirroring, which seeks the highest IO performance. The evaluation of our prototyped system shows that our techniques outperform existing live storage migration by a significant margin. Furthermore, by adaptively mixing the proposed schemes, the cost of massive VM live storage migration can be even lower than that of using only the best individual mechanism.
{"title":"Optimizing virtual machine live storage migration in heterogeneous storage environment","authors":"Ruijin Zhou, Fang Liu, Chao Li, Tao Li","doi":"10.1145/2451512.2451529","DOIUrl":"https://doi.org/10.1145/2451512.2451529","url":null,"abstract":"Virtual machine (VM) live storage migration techniques significantly increase the mobility and manageability of virtual machines in the era of cloud computing. On the other hand, as solid state drives (SSDs) become increasingly popular in data centers, VM live storage migration will inevitably encounter heterogeneous storage environments. Nevertheless, conventional migration mechanisms do not consider the speed discrepancy and SSD's wear-out issue, which not only causes significant performance degradation but also shortens SSD's lifetime. This paper, for the first time, addresses the efficiency of VM live storage migration in heterogeneous storage environments from a multi-dimensional perspective, i.e., user experience, device wearing, and manageability. We derive a flexible metric (migration cost), which captures various design preference. Based on that, we propose and prototype three new storage migration strategies, namely: 1) Low Redundancy (LR), which generates the least amount of redundant writes; 2) Source-based Low Redundancy (SLR), which keeps the balance between IO performance and write redundancy; and 3) Asynchronous IO Mirroring, which seeks the highest IO performance. The evaluation of our prototyped system shows that our techniques outperform existing live storage migration by a significant margin. Furthermore, by adaptively mixing our proposed schemes, the cost of massive VM live storage migration can be even lower than that of only using the best of individual mechanism.","PeriodicalId":202844,"journal":{"name":"International Conference on Virtual Execution Environments","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128454534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient live migration of virtual machines using shared storage
Changyeon Jo, E. Gustafsson, Jeongseok Son, Bernhard Egger
International Conference on Virtual Execution Environments, 2013. DOI: 10.1145/2451512.2451524

Live migration of virtual machines (VMs) across distinct physical hosts is an important feature of virtualization technology for maintenance, load balancing and energy reduction, especially for data center operators and cluster service providers. Several techniques have been proposed to reduce the downtime of the VM being transferred, often at the expense of the total migration time. In this work, we present a technique to reduce the total time required to migrate a running VM from one host to another while keeping the downtime to a minimum. Based on the observation that modern operating systems use the better part of physical memory to cache data from secondary storage, our technique tracks the VM's I/O operations to the network-attached storage device and maintains an updated mapping of memory pages that currently reside in identical form on the storage device. During the iterative pre-copy live migration process, instead of transferring those pages from the source to the target host, the memory-to-disk mapping is sent to the target host, which then fetches the contents directly from the network-attached storage device. We have implemented our approach in the Xen hypervisor and ran a series of experiments with Linux HVM guests. The presented technique reduces the total transfer time by over 30% on average across a series of benchmarks.
{"title":"Efficient live migration of virtual machines using shared storage","authors":"Changyeon Jo, E. Gustafsson, Jeongseok Son, Bernhard Egger","doi":"10.1145/2451512.2451524","DOIUrl":"https://doi.org/10.1145/2451512.2451524","url":null,"abstract":"Live migration of virtual machines (VM) across distinct physical hosts is an important feature of virtualization technology for maintenance, load-balancing and energy reduction, especially so for data centers operators and cluster service providers. Several techniques have been proposed to reduce the downtime of the VM being transferred, often at the expense of the total migration time. In this work, we present a technique to reduce the total time required to migrate a running VM from one host to another while keeping the downtime to a minimum. Based on the observation that modern operating systems use the better part of the physical memory to cache data from secondary storage, our technique tracks the VM's I/O operations to the network-attached storage device and maintains an updated mapping of memory pages that currently reside in identical form on the storage device. During the iterative pre-copy live migration process, instead of transferring those pages from the source to the target host, the memory-to-disk mapping is sent to the target host which then fetches the contents directly from the network-attached storage device. We have implemented our approach into the Xen hypervisor and ran a series of experiments with Linux HVM guests. On average, the presented technique shows a reduction of up over 30% on average of the total transfer time for a series of benchmarks.","PeriodicalId":202844,"journal":{"name":"International Conference on Virtual Execution Environments","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132270464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging phase change memory to achieve efficient virtual machine execution
Ruijin Zhou, Tao Li
International Conference on Virtual Execution Environments, 2013. DOI: 10.1145/2451512.2451547

Virtualization technology is being widely adopted by servers and data centers in the cloud computing era to improve resource utilization and energy efficiency. Nevertheless, the heterogeneous memory demands from multiple virtual machines (VMs) make it more challenging to design efficient memory systems. Even worse, mission-critical VM management activities (e.g. checkpointing) can incur significant runtime overhead due to intensive IO operations. In this paper, we propose to leverage the adaptable and non-volatile features of the emerging phase change memory (PCM) to achieve efficient virtual machine execution. Towards this end, we exploit VM-aware PCM management mechanisms, which 1) smartly tune SLC/MLC page allocation within a single VM and across different VMs and 2) keep critical checkpointing pages in PCM to reduce I/O traffic. Experimental results show that our single-VM design (IntraVM) improves performance by 10% and 20% compared to pure SLC- and MLC-based systems, respectively. Further incorporating VM-aware resource management schemes (IntraVM+InterVM) increases system performance by 15%. In addition, our design shortens checkpoint/restore duration by 46% and reduces the overall IO penalty to the system by 50%.
{"title":"Leveraging phase change memory to achieve efficient virtual machine execution","authors":"Ruijin Zhou, Tao Li","doi":"10.1145/2451512.2451547","DOIUrl":"https://doi.org/10.1145/2451512.2451547","url":null,"abstract":"Virtualization technology is being widely adopted by servers and data centers in the cloud computing era to improve resource utilization and energy efficiency. Nevertheless, the heterogeneous memory demands from multiple virtual machines (VM) make it more challenging to design efficient memory systems. Even worse, mission critical VM management activities (e.g. checkpointing) could incur significant runtime overhead due to intensive IO operations. In this paper, we propose to leverage the adaptable and non-volatile features of the emerging phase change memory (PCM) to achieve efficient virtual machine execution. Towards this end, we exploit VM-aware PCM management mechanisms, which 1) smartly tune SLC/MLC page allocation within a single VM and across different VMs and 2) keep critical checkpointing pages in PCM to reduce I/O traffic. Experimental results show that our single VM design (IntraVM) improves performance by 10% and 20% compared to pure SLC- and MLC- based systems. Further incorporating VM-aware resource management schemes (IntraVM+InterVM) increases system performance by 15%. In addition, our design saves 46% of checkpoint/restore duration and reduces 50% of overall IO penalty to the system.","PeriodicalId":202844,"journal":{"name":"International Conference on Virtual Execution Environments","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131361927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A lightweight VMM on many core for high performance computing
Yue-hua Dai, Yong Qi, Jianbao Ren, Yi Shi, Xiaoguang Wang, Xuan Yu
International Conference on Virtual Execution Environments, 2013. DOI: 10.1145/2451512.2451535

A traditional virtual machine monitor (VMM) virtualizes some devices and instructions, which imposes performance overhead on guest operating systems. Furthermore, the virtualization adds a large amount of code to the VMM, which makes it prone to bugs and vulnerabilities.

On the other hand, in cloud computing, the cloud service provider configures virtual machines based on requirements that customers specify in advance. As the resources in a multi-core server grow to be more than adequate, virtualization is not strictly necessary, although it provides convenience for cloud computing. Based on these observations, this paper presents an alternative way of constructing a VMM: configuring a booting interface instead of using virtualization technology. A lightweight virtual machine monitor, OSV, is proposed based on this idea. OSV can host multiple fully functional Linux kernels with little performance overhead. There are only 6 hyper-calls in OSV. A Linux kernel running on top of OSV is intercepted only for inter-processor interrupts. Resource isolation is implemented with hardware-assisted virtualization, and resource sharing is controlled by distributed protocols embedded in current operating systems.

We implement a prototype of OSV on 32-core AMD Opteron-based servers with SVM and cache-coherent NUMA architectures. OSV can host up to 8 Linux kernels on such a server with fewer than 10 lines of code modifications to the Linux kernel. OSV itself has about 8000 lines of code, which can be easily tuned and debugged. The experimental results show that OSV achieves a 23.7% performance improvement over the Xen VMM.
{"title":"A lightweight VMM on many core for high performance computing","authors":"Yue-hua Dai, Yong Qi, Jianbao Ren, Yi Shi, Xiaoguang Wang, Xuan Yu","doi":"10.1145/2451512.2451535","DOIUrl":"https://doi.org/10.1145/2451512.2451535","url":null,"abstract":"Traditional Virtual Machine Monitor (VMM) virtualizes some devices and instructions, which induces performance overhead to guest operating systems. Furthermore, the virtualization contributes a large amount of codes to VMM, which makes a VMM prone to bugs and vulnerabilities.\u0000 On the other hand, in cloud computing, cloud service provider configures virtual machines based on requirements which are specified by customers in advance. As resources in a multi-core server increase to more than adequate in the future, virtualization is not necessary although it provides convenience for cloud computing. Based on the above observations, this paper presents an alternative way for constructing a VMM: configuring a booting interface instead of virtualization technology. A lightweight virtual machine monitor - OSV is proposed based on this idea. OSV can host multiple full functional Linux kernels with little performance overhead. There are only 6 hyper-calls in OSV. The Linux running on top of OSV is intercepted only for the inter-processor interrupts. The resource isolation is implemented with hardware-assist virtualization. The resource sharing is controlled by distributed protocols embedded in current operating systems.\u0000 We implement a prototype of OSV on AMD Opteron processor based 32-core servers with SVM and cache-coherent NUMA architectures. OSV can host up to 8 Linux kernels on the server with less than 10 lines of code modifications to Linux kernel. OSV has about 8000 lines of code which can be easily tuned and debugged. The experiment results show that OSV VMM has 23.7% performance improvement compared with Xen VMM.","PeriodicalId":202844,"journal":{"name":"International Conference on Virtual Execution Environments","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125516993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traveling forward in time to newer operating systems using ShadowReboot
H. Yamada, K. Kono
International Conference on Virtual Execution Environments, 2013. DOI: 10.1145/2451512.2451536

Operating system (OS) reboots are an essential part of updating kernels and applications on laptops and desktop PCs. Long downtime during OS reboots severely disrupts users' computational activities, and this disruption discourages users from rebooting, which in turn prevents software updates from being applied. This paper presents ShadowReboot, a virtual machine monitor (VMM)-based approach that shortens the downtime of OS reboots in software updates. ShadowReboot conceals OS reboot activity from users' applications by spawning a VM dedicated to the OS reboot and systematically producing the rebooted state in which the updated kernel and applications are ready for use. ShadowReboot gives users the illusion that the guest OS travels forward in time to the rebooted state. ShadowReboot offers the following advantages. First, it can be used to apply kernel patches and even system configuration updates. Second, it does not require any special patch demanding detailed knowledge of the target kernels. Lastly, it does not require any modification of the target kernel. We implemented a prototype in VirtualBox 4.0.10 OSE. Our experimental results show that ShadowReboot successfully updated software on unmodified commodity OS kernels and shortened the downtime of commodity OS reboots on five Linux distributions (Fedora, Ubuntu, Gentoo, CentOS, and SUSE) by 91 to 98%.
{"title":"Traveling forward in time to newer operating systems using ShadowReboot","authors":"H. Yamada, K. Kono","doi":"10.1145/2451512.2451536","DOIUrl":"https://doi.org/10.1145/2451512.2451536","url":null,"abstract":"Operating system (OS) reboots are an essential part of updating kernels and applications on laptops and desktop PCs. Long downtime during OS reboots severely disrupts users' computational activities. This long disruption discourages the users from conducting OS reboots, failing to enforce them to conduct software updates. This paper presents ShadowReboot, a virtual machine monitor (VMM)-based approach that shortens downtime of OS reboots in software updates. ShadowReboot conceals OS reboot activities from user's applications by spawning a VM dedicated to an OS reboot and systematically producing the rebooted state where the updated kernel and applications are ready for use. ShadowReboot provides an illusion to the users that the guest OS travels forward in time to the rebooted state. ShadowReboot offers the following advantages. It can be used to apply patches to the kernels and even system configuration updates. Next, it does not require any special patch requiring detailed knowledge about the target kernels. Lastly, it does not require any target kernel modification. We implemented a prototype in VirtualBox 4.0.10 OSE. Our experimental results show that ShadowReboot successfully updated software on unmodified commodity OS kernels and shortened the downtime of commodity OS reboots on five Linux distributions (Fedora, Ubuntu, Gentoo, Cent, and SUSE) by 91 to 98%.","PeriodicalId":202844,"journal":{"name":"International Conference on Virtual Execution Environments","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133437618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance potential of optimization phase selection during dynamic JIT compilation
Michael R. Jantz, P. Kulkarni
International Conference on Virtual Execution Environments, 2013. DOI: 10.1145/2451512.2451539

Phase selection is the process of customizing the applied set of compiler optimization phases for individual functions or programs to improve the performance of generated code. Researchers have recently developed novel feature-vector-based heuristic techniques to perform phase selection during online JIT compilation. While these heuristics improve program startup speed, steady-state performance was not seen to benefit over the default fixed single-sequence baseline. Unfortunately, it is still not conclusively known whether this lack of steady-state performance gain is due to a failure of existing online phase selection heuristics, or because there is, indeed, little or no speedup to be gained by phase selection in online JIT environments. The goal of this work is to resolve this question, while examining the phase-selection-related behavior of optimizations, and assessing and improving the effectiveness of existing heuristic solutions.

We conduct experiments to find and understand the potency of the factors that can cause the phase selection problem in JIT compilers. Next, using long-running genetic algorithms we determine that program-wide and method-specific phase selection in the HotSpot JIT compiler can produce ideal steady-state performance gains of up to 15% (4.3% on average) and 44% (6.2% on average), respectively. We also find that existing state-of-the-art heuristic solutions are unable to realize these performance gains (in our experimental setup), discuss possible causes, and show that exploiting knowledge of optimization phase behavior can help improve such heuristic solutions. Our work develops a robust open-source production-quality framework using the HotSpot JVM to further explore this problem in the future.
{"title":"Performance potential of optimization phase selection during dynamic JIT compilation","authors":"Michael R. Jantz, P. Kulkarni","doi":"10.1145/2451512.2451539","DOIUrl":"https://doi.org/10.1145/2451512.2451539","url":null,"abstract":"Phase selection is the process of customizing the applied set of compiler optimization phases for individual functions or programs to improve performance of generated code. Researchers have recently developed novel feature-vector based heuristic techniques to perform phase selection during online JIT compilation. While these heuristics improve program startup speed, steady-state performance was not seen to benefit over the default fixed single sequence baseline. Unfortunately, it is still not conclusively known whether this lack of steady-state performance gain is due to a failure of existing online phase selection heuristics, or because there is, indeed, little or no speedup to be gained by phase selection in online JIT environments. The goal of this work is to resolve this question, while examining the phase selection related behavior of optimizations, and assessing and improving the effectiveness of existing heuristic solutions.\u0000 We conduct experiments to find and understand the potency of the factors that can cause the phase selection problem in JIT compilers. Next, using long-running genetic algorithms we determine that program-wide and method-specific phase selection in the HotSpot JIT compiler can produce ideal steady-state performance gains of up to 15% (4.3% average) and 44% (6.2% average) respectively. We also find that existing state-of-the-art heuristic solutions are unable to realize these performance gains (in our experimental setup), discuss possible causes, and show that exploiting knowledge of optimization phase behavior can help improve such heuristic solutions. Our work develops a robust open-source production-quality framework using the HotSpot JVM to further explore this problem in the future.","PeriodicalId":202844,"journal":{"name":"International Conference on Virtual Execution Environments","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124939595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A framework for application guidance in virtual memory systems
Michael R. Jantz, Carl Strickland, Karthik Kumar, Martin Dimitrov, K. Doshi
International Conference on Virtual Execution Environments, 2013. DOI: 10.1145/2451512.2451543

This paper proposes a collaborative approach in which applications provide guidance to the operating system regarding the allocation and recycling of physical memory. The operating system incorporates this guidance to decide which physical page should be used to back a particular virtual page. The key intuition behind this approach is that application software, as the generator of memory accesses, is best equipped to inform the operating system about the relative access rates and overlapping patterns of usage of its own address space. It is also capable of steering its own algorithms in order to keep its dynamic memory footprint in check when there is a need to reduce power or to contain the spillover effects of bursts in demand. Application software, working cooperatively with the operating system, can therefore help the latter schedule memory more effectively and efficiently than when the operating system is forced to act alone without such guidance. It is particularly difficult to achieve power efficiency without application guidance, since the power expended in memory is a function not merely of the intensity with which memory is accessed in time but also of how many physical ranks are affected by an application's memory usage.

Our framework introduces an abstraction called "colors" for the application to communicate its intent to the operating system. We modify the operating system to receive this communication in an efficient way, and to organize physical memory pages into intermediate-level grouping structures called "trays", which capture the physically independent access channels and self-refresh domains, so that it can apply this guidance without entangling the application in lower-level details of power or bandwidth management.

This paper describes how we re-architect the memory management of a recent Linux kernel to realize a three-way collaboration between hardware, supervisory software, and application tasks.
{"title":"A framework for application guidance in virtual memory systems","authors":"Michael R. Jantz, Carl Strickland, Karthik Kumar, Martin Dimitrov, K. Doshi","doi":"10.1145/2451512.2451543","DOIUrl":"https://doi.org/10.1145/2451512.2451543","url":null,"abstract":"This paper proposes a collaborative approach in which applications can provide guidance to the operating system regarding allocation and recycling of physical memory. The operating system incorporates this guidance to decide which physical page should be used to back a particular virtual page. The key intuition behind this approach is that application software, as a generator of memory accesses, is best equipped to inform the operating system about the relative access rates and overlapping patterns of usage of its own address space. It is also capable of steering its own algorithms in order to keep its dynamic memory footprint under check when there is a need to reduce power or to contain the spillover effects from bursts in demand. Application software, working cooperatively with the operating system, can therefore help the latter schedule memory more effectively and efficiently than when the operating system is forced to act alone without such guidance. It is particularly difficult to achieve power efficiency without application guidance since power expended in memory is a function not merely of the intensity with which memory is accessed in time but also how many physical ranks are affected by an application's memory usage.\u0000 Our framework introduces an abstraction called \"colors\" for the application to communicate its intent to the operating system. We modify the operating system to receive this communication in an efficient way, and to organize physical memory pages into intermediate level grouping structures called \"trays\" which capture the physically independent access channels and self-refresh domains, so that it can apply this guidance without entangling the application in lower level details of power or bandwidth management.\u0000 This paper describes how we re-architect the memory management of a recent Linux kernel to realize a three way collaboration between hardware, supervisory software, and application tasks.","PeriodicalId":202844,"journal":{"name":"International Conference on Virtual Execution Environments","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128567401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A modular approach to on-stack replacement in LLVM
Nurudeen Lameed, L. Hendren
International Conference on Virtual Execution Environments, 2013. DOI: 10.1145/2451512.2451541

On-stack replacement (OSR) is a technique that allows a virtual machine to interrupt running code during the execution of a function/method, to re-optimize the function on-the-fly using an optimizing JIT compiler, and then to resume the interrupted function at the point and state at which it was interrupted. OSR is particularly useful for programs with potentially long-running loops, as it allows dynamic optimization of those loops as soon as they become hot.

This paper presents a modular approach to implementing OSR for the LLVM compiler infrastructure. This is an important step forward because LLVM is gaining popular support, and adding the OSR capability allows compiler developers to develop new dynamic techniques. In particular, it will enable more sophisticated LLVM-based JIT compiler approaches. Indeed, other compiler/VM developers can use our approach because it is a clean modular addition to the standard LLVM distribution. Further, our approach is defined completely at the LLVM-IR level and thus does not require any modifications to the target code generation.

The OSR implementation can be used by different compilers to support a variety of dynamic optimizations. As a demonstration of our OSR approach, we have used it to support dynamic inlining in McVM. McVM is a virtual machine for MATLAB which uses an LLVM-based JIT compiler. MATLAB is a popular dynamic language for scientific and engineering applications that typically manipulate large matrices and often contain long-running loops, and is thus an ideal target for dynamic JIT compilation and OSR. Using our McVM example, we demonstrate reasonable overheads for our benchmark set, and performance improvements when using it to perform dynamic inlining.
{"title":"A modular approach to on-stack replacement in LLVM","authors":"Nurudeen Lameed, L. Hendren","doi":"10.1145/2451512.2451541","DOIUrl":"https://doi.org/10.1145/2451512.2451541","url":null,"abstract":"On-stack replacement (OSR) is a technique that allows a virtual machine to interrupt running code during the execution of a function/method, to re-optimize the function on-the-fly using an optimizing JIT compiler, and then to resume the interrupted function at the point and state at which it was interrupted. OSR is particularly useful for programs with potentially long-running loops, as it allows dynamic optimization of those loops as soon as they become hot.\u0000 This paper presents a modular approach to implementing OSR for the LLVM compiler infrastructure. This is an important step forward because LLVM is gaining popular support, and adding the OSR capability allows compiler developers to develop new dynamic techniques. In particular, it will enable more sophisticated LLVM-based JIT compiler approaches. Indeed, other compiler/VM developers can use our approach because it is a clean modular addition to the standard LLVM distribution. Further, our approach is defined completely at the LLVM-IR level and thus does not require any modifications to the target code generation.\u0000 The OSR implementation can be used by different compilers to support a variety of dynamic optimizations. As a demonstration of our OSR approach, we have used it to support dynamic inlining in McVM. McVM is a virtual machine for MATLAB which uses a LLVM-based JIT compiler. MATLAB is a popular dynamic language for scientific and engineering applications that typically manipulate large matrices and often contain long-running loops, and is thus an ideal target for dynamic JIT compilation and OSRs. Using our McVM example, we demonstrate reasonable overheads for our benchmark set, and performance improvements when using it to perform dynamic inlining.","PeriodicalId":202844,"journal":{"name":"International Conference on Virtual Execution Environments","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123122828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}