Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI '96)
Making paths explicit in the Scout operating system
D. Mosberger, L. Peterson (pp. 153-167, doi:10.1145/238721.238771)

This paper makes a case for paths as an explicit abstraction in operating system design. Paths provide a unifying infrastructure for several OS mechanisms that have been introduced in the last several years, including fbufs, integrated layer processing, packet classifiers, code specialization, and migrating threads. This paper articulates the potential advantages of a path-based OS structure, describes the specific path architecture implemented in the Scout OS, and demonstrates the advantages in a particular application domain: receiving, decoding, and displaying MPEG-compressed video.
Studies of Windows NT performance using dynamic execution traces
Sharon E. Perl, R. L. Sites (pp. 169-183, doi:10.1145/238721.238773)

We studied two aspects of the performance of Windows NT: processor bandwidth requirements for memory accesses in a uniprocessor system running benchmark and commercial applications, and locking behavior of a commercial database on a small-scale multiprocessor. Our studies are based on full dynamic execution traces of the systems, which include all instructions executed by the operating system and applications over periods of a few seconds (enough time to allow for significant computation). The traces were obtained on Alpha PCs, using a new software tool called PatchWrx that takes advantage of the Alpha architecture’s PAL-code layer to implement efficient, comprehensive system tracing. Because the Alpha version of Windows NT uses substantially the same code base as other versions, and therefore executes nearly the same sequence of calls, basic blocks, and data structure accesses, we believe our conclusions are relevant for non-Alpha systems as well. This paper describes our performance studies and interesting aspects of PatchWrx. We conclude from our studies that processor bandwidth can be a first-order bottleneck to achieving good performance. This is particularly apparent when studying commercial benchmarks. Operating system code and data structures contribute disproportionately to the memory access load. We also found that operating system software lock contention was a factor preventing the database benchmark from scaling up on the small multiprocessor, and that the cache coherence protocol employed by the machine introduced more cache interference than necessary.
Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems
Yuanyuan Zhou, L. Iftode, Kai Li (pp. 75-88, doi:10.1145/238721.238763)

This paper investigates the performance of shared virtual memory protocols on large-scale multicomputers. Using experiments on a 64-node Paragon, we show that the traditional Lazy Release Consistency (LRC) protocol does not scale well, because of the large number of messages it requires, the large amount of memory it consumes for protocol overhead data, and the difficulty of garbage collecting that data. To achieve more scalable performance, we introduce and evaluate two new protocols. The first, Home-based LRC (HLRC), is based on the Automatic Update Release Consistency (AURC) protocol. Like AURC, HLRC maintains a home for each page to which all updates are propagated and from which all copies are derived. Unlike AURC, HLRC requires no specialized hardware support. We find that the use of homes provides substantial improvements in performance and scalability over LRC. Our second protocol, called Overlapped Home-based LRC (OHLRC), takes advantage of the communication processor found on each node of the Paragon to offload some of the protocol overhead of HLRC from the critical path followed by the compute processor. We find that OHLRC provides modest improvements over HLRC. We also apply overlapping to the base LRC protocol, with similar results. Our experiments were done using five of the Splash-2 benchmarks. We report overall execution times, as well as detailed breakdowns of elapsed time, message traffic, and memory use for each of the protocols.
A trace-driven comparison of algorithms for parallel prefetching and caching
T. Kimbrel, A. Tomkins, R. H. Patterson, B. Bershad, P. Cao, E. Felten, Garth A. Gibson, Anna R. Karlin, Kai Li (pp. 19-34, doi:10.1145/238721.238737)

High-performance I/O systems depend on prefetching and caching in order to deliver good performance to applications. These two techniques have generally been considered in isolation, even though there are significant interactions between them; a block prefetched too early reduces the effectiveness of the cache, while a block cached too long reduces the effectiveness of prefetching. In this paper we study the effects of several combined prefetching and caching strategies for systems with multiple disks. Using disk-accurate trace-driven simulation, we explore the performance characteristics of each of the algorithms in cases in which applications provide full advance knowledge of accesses using hints. Some of the strategies have been published with theoretical performance bounds, and some are components of systems that have been built. One is a new algorithm that combines the desirable characteristics of the others. We find that when performance is limited by I/O stalls, aggressive prefetching helps to alleviate the problem; that more conservative prefetching is appropriate when significant I/O stalls are not present; and that a single, simple strategy is capable of doing both.
Lightweight logging for lazy release consistent distributed shared memory
Manuel Costa, P. Guedes, M. Sequeira, N. Neves, M. Castro (pp. 59-73, doi:10.1145/238721.238762)

This paper presents a new logging and recovery algorithm for lazy release consistent distributed shared memory (DSM). The new algorithm tolerates single node failures by maintaining a distributed log of data dependencies in the volatile memory of processes. The algorithm adds very little overhead to the memory consistency protocol: it sends no additional messages during failure-free periods; it adds only a minimal amount of data to one of the DSM protocol messages; it introduces no forced rollbacks of non-faulty processes; and it performs no communication-induced accesses to stable storage. Furthermore, the algorithm logs only a very small amount of data, because it uses the log of memory accesses already maintained by the memory consistency protocol. The algorithm was implemented in TreadMarks, a state-of-the-art DSM system. Experimental results show that the algorithm has near zero time overhead and very low space overhead during failure-free execution, thus refuting the common belief that logging overhead is necessarily high in recoverable DSM systems.
Safe kernel extensions without run-time checking
G. Necula, Peter Lee (pp. 229-243, doi:10.1145/238721.238781)

This paper describes a mechanism by which an operating system kernel can determine with certainty that it is safe to execute a binary supplied by an untrusted source. The kernel first defines a safety policy and makes it public. Then, using this policy, an application can provide binaries in a special form called proof-carrying code, or simply PCC. Each PCC binary contains, in addition to the native code, a formal proof that the code obeys the safety policy. The kernel can easily validate the proof without using cryptography and without consulting any external trusted entities. If the validation succeeds, the code is guaranteed to respect the safety policy without relying on run-time checks. The main practical difficulty of PCC is in generating the safety proofs. In order to gain some preliminary experience with this, we have written several network packet filters in hand-tuned DEC Alpha assembly language, and then generated PCC binaries for them using a special prototype assembler. The PCC binaries can be executed with no run-time overhead, beyond a one-time cost of 1 to 3 milliseconds for validating the enclosed proofs. The net result is that our packet filters are formally guaranteed to be safe and are faster than packet filters created using Berkeley Packet Filters, Software Fault Isolation, or safe languages such as Modula-3.
A hierarchical CPU scheduler for multimedia operating systems
P. Goyal, Xingang Guo, H. Vin (pp. 107-121, doi:10.1145/238721.238766)

The need to support a variety of hard and soft real-time as well as best-effort applications in a multimedia computing environment requires an operating system framework that: (1) enables different schedulers to be employed for different application classes, and (2) provides protection between the various classes of applications. We argue that these objectives can be achieved by hierarchical partitioning of CPU bandwidth, in which the operating system partitions the CPU bandwidth among various application classes, and each application class, in turn, partitions its allocation (potentially using a different scheduling algorithm) among its sub-classes or applications. We present the Start-time Fair Queuing (SFQ) algorithm, which enables such hierarchical partitioning. We have implemented a hierarchical scheduler in Solaris 2.4. We describe our implementation and demonstrate its suitability for multimedia operating systems.
Efficient cooperative caching using hints
P. Sarkar, J. Hartman (pp. 35-46, doi:10.1145/238721.238741)

We present a very low-overhead decentralized algorithm for cooperative caching that provides performance comparable to that of existing centralized algorithms. Unlike existing algorithms that rely on centralized control of cache functions, our algorithm uses hints (i.e., inexact information) to allow clients to perform these functions in a decentralized fashion. This paper shows that a hint-based system performs as well as a more tightly-coordinated system while requiring less overhead. Simulations show that the block access times of our system are as good as those of the existing tightly-coordinated algorithms, while reducing manager load by more than a factor of 15, block lookup traffic by nearly two-thirds, and replacement traffic by more than a factor of 5.
An implementation of the Hamlyn sender-managed interface architecture
G. Buzzard, D. Jacobson, M. Mackey, Scott B. Marovich, J. Wilkes (pp. 245-259, doi:10.1145/238721.238784)

Keywords: interconnects, clusters, sender-based, Hamlyn, Myrinet

As the latency and bandwidth of multicomputer interconnection fabrics improve, there is a growing need for an interface between them and host processors that does not hide these gains behind software overhead. The Hamlyn interface architecture does this. It uses sender-based memory management to eliminate receiver buffer overruns, provides applications with direct hardware access to minimize latency, supports adaptive routing networks to allow higher throughput, and offers full protection between applications so that it can be used in a general-purpose computing environment. To test these claims we built a prototype Hamlyn interface for a Myrinet network connected to a standard HP workstation and report here on its design and performance. Our interface delivers an application-to-application round-trip time of 28 µs for short messages and a one-way time of 17.4 µs + 32.6 ns/byte (30.7 MB/s) for longer ones, while requiring fewer CPU cycles than an aggressive implementation of Active Messages on the CM-5.
Automatic compiler-inserted I/O prefetching for out-of-core applications
T. Mowry, Angela K. Demke, O. Krieger (pp. 3-17, doi:10.1145/238721.238734)

Current operating systems offer poor performance when a numeric application’s working set does not fit in main memory. As a result, programmers who wish to solve “out-of-core” problems efficiently are typically faced with the onerous task of rewriting an application to use explicit I/O operations (e.g., read/write). In this paper, we propose and evaluate a fully-automatic technique which liberates the programmer from this task, provides high performance, and requires only minimal changes to current operating systems. In our scheme, the compiler provides the crucial information on future access patterns without burdening the programmer, the operating system supports non-binding prefetch and release hints for managing I/O, and the operating system cooperates with a run-time layer to accelerate performance by adapting to dynamic behavior and minimizing prefetch overhead. This approach maintains the abstraction of unlimited virtual memory for the programmer, gives the compiler the flexibility to aggressively move prefetches back ahead of references, and gives the operating system the flexibility to arbitrate between the competing resource demands of multiple applications. We have implemented our scheme using the SUIF compiler and the Hurricane operating system. Our experimental results demonstrate that our fully-automatic scheme effectively hides the I/O latency in out-of-core versions of the entire NAS Parallel benchmark suite, thus resulting in speedups of roughly twofold for five of the eight applications, with one application speeding up by over threefold.