Bringing Engineering Rigor to Deep Learning
Kexin Pei, Shiqi Wang, Yuchi Tian, J. Whitehouse, Carl Vondrick, Yinzhi Cao, Baishakhi Ray, S. Jana, Junfeng Yang
Deep learning (DL) systems are increasingly deployed in safety- and security-critical domains including autonomous driving, robotics, and malware detection, where the correctness and predictability of a system on corner-case inputs are of great importance. Unfortunately, the common practice for validating a deep neural network (DNN), measuring overall accuracy on a randomly selected test set, is not designed to surface corner-case errors. As recent work shows, even DNNs with state-of-the-art accuracy are easily fooled by human-imperceptible, adversarial perturbations to the inputs. Questions such as how to test corner-case behaviors more thoroughly and whether all adversarial samples have been found remain unanswered. In the last few years, we have been working on bringing more engineering rigor into deep learning. Towards this goal, we have built five systems to test DNNs more thoroughly and verify the absence of adversarial samples for given datasets. These systems check a broad spectrum of properties (e.g., rotating an image should never change its classification) and find thousands of error-inducing samples for popular DNNs in critical domains (e.g., ImageNet, autonomous driving, and malware detection). Our DNN verifiers are also orders of magnitude (e.g., 5,000×) faster than similar tools. This article overviews our systems and discusses three open research challenges, in the hope of inspiring more future research on testing and verifying DNNs.
{"title":"Bringing Engineering Rigor to Deep Learning","authors":"Kexin Pei, Shiqi Wang, Yuchi Tian, J. Whitehouse, Carl Vondrick, Yinzhi Cao, Baishakhi Ray, S. Jana, Junfeng Yang","doi":"10.1145/3352020.3352030","DOIUrl":"https://doi.org/10.1145/3352020.3352030","url":null,"abstract":"Deep learning (DL) systems are increasingly deployed in safety- and security-critical domains including autonomous driving, robotics, and malware detection, where the correctness and predictability of a system on corner-case inputs are of great importance. Unfortunately, the common practice to validating a deep neural network (DNN) - measuring overall accuracy on a randomly selected test set - is not designed to surface corner-case errors. As recent work shows, even DNNs with state-of-the-art accuracy are easily fooled by human-imperceptible, adversarial perturbations to the inputs. Questions such as how to test corner-case behaviors more thoroughly and whether all adversarial samples have been found remain unanswered. In the last few years, we have been working on bringing more engineering rigor into deep learning. Towards this goal, we have built five systems to test DNNs more thoroughly and verify the absence of adversarial samples for given datasets. These systems check a broad spectrum of properties (e.g., rotating an image should never change its classification) and find thousands of error-inducing samples for popular DNNs in critical domains (e.g., ImageNet, autonomous driving, and malware detection). Our DNN verifiers are also orders of magnitude (e.g., 5,000×) faster than similar tools. This article overviews our systems and discusses three open research challenges to hopefully inspire more future research towards testing and verifying DNNs.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"53 1","pages":"59 - 67"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3352020.3352030","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47345932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speculative Symbolic Graph Execution of Imperative Deep Learning Programs
Eunji Jeong, Sungwoo Cho, Gyeong-In Yu, Joo Seong Jeong, Dongjin Shin, Taebum Kim, Byung-Gon Chun
The rapid evolution of deep neural networks demands that deep learning (DL) frameworks not only execute large computations quickly, but also support straightforward programming models for quickly implementing and experimenting with complex network structures. However, existing frameworks fail to excel in both areas simultaneously, leading to divergent efforts to optimize performance and to improve usability. This paper presents JANUS, a system that combines the advantages of both sides by transparently converting an imperative DL program written in Python, the de facto scripting language for DL, into an efficiently executable symbolic dataflow graph. JANUS can convert various dynamic features of Python, including dynamic control flow, dynamic types, and impure functions, into elements of a symbolic dataflow graph. Our experiments show that JANUS achieves fast DL training by exploiting the techniques of symbolic graph-based DL frameworks, while maintaining the simple and flexible programmability of imperative DL frameworks.
{"title":"Speculative Symbolic Graph Execution of Imperative Deep Learning Programs","authors":"Eunji Jeong, Sungwoo Cho, Gyeong-In Yu, Joo Seong Jeong, Dongjin Shin, Taebum Kim, Byung-Gon Chun","doi":"10.1145/3352020.3352025","DOIUrl":"https://doi.org/10.1145/3352020.3352025","url":null,"abstract":"The rapid evolution of deep neural networks is demanding deep learning (DL) frameworks not only to satisfy the requirement of quickly executing large computations, but also to support straightforward programming models for quickly implementing and experimenting with complex network structures. However, existing frameworks fail to excel in both departments simultaneously, leading to diverged efforts for optimizing performance and improving usability. This paper presents JANUS, a system that combines the advantages from both sides by transparently converting an imperative DL program written in Python, a de-facto scripting language for DL, into an efficiently executable symbolic dataflow graph. JANUS can convert various dynamic features of Python, including dynamic control flow, dynamic types, and impure functions, into elements of a symbolic dataflow graph. Our experiments show that JANUS can achieve fast DL training by exploiting the techniques imposed by symbolic graph-based DL frameworks, while maintaining the simple and flexible programmability of imperative DL frameworks at the same time.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"53 1","pages":"26 - 33"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3352020.3352025","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43140122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Learned"
Yiying Zhang, Yutong Huang

With operating systems at the core of computer systems, decades of research and engineering effort have gone into the development of OSes. To keep pace with the evolution of modern hardware and applications, we argue that a different approach should be taken in future OS development. Instead of relying solely on human wisdom, we should also leverage AI and machine learning techniques to automatically "learn" how to build and tune an OS. This paper explores the opportunities and challenges of the "learned" OS approach and makes recommendations for future researchers and practitioners on building such an OS.
{"title":"\"Learned\"","authors":"Yiying Zhang, Yutong Huang","doi":"10.1145/3352020.3352027","DOIUrl":"https://doi.org/10.1145/3352020.3352027","url":null,"abstract":"With operating systems being at the core of computer systems, decades of research and engineering efforts have been put into the development of OSes. To keep pace with the speed of modern hardware and application evolvement, we argue that a different approach should be taken in future OS development. Instead of relying solely on human wisdom, we should also leverage AI and machine learning techniques to automatically \"learn\" how to build and tune an OS. This paper explores the opportunities and challenges of the \"learned\" OS approach and makes recommendation for future researchers and practitioners on building such an OS.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"53 1","pages":"40 - 45"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3352020.3352027","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64021780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Artificial Intelligence in Resource-Constrained and Shared Environments
S. Krishnan, Aaron J. Elmore, M. Franklin, John Paparrizos, Zechao Shang, Adam Dziedzic, R. Liu
The computational demands of modern AI techniques are immense, and as the number of practical applications grows, there will be an increasing burden on shared computing infrastructure. We envision a forthcoming era of "AI Systems" research where reducing resource consumption, reasoning about transient resource availability, trading off resource consumption for accuracy, and managing contention on specialized hardware will become the community's main research focus. This paper overviews the history of AI systems research, a vision for the future, and the open challenges ahead.
{"title":"Artificial Intelligence in Resource-Constrained and Shared Environments","authors":"S. Krishnan, Aaron J. Elmore, M. Franklin, John Paparrizos, Zechao Shang, Adam Dziedzic, R. Liu","doi":"10.1145/3352020.3352022","DOIUrl":"https://doi.org/10.1145/3352020.3352022","url":null,"abstract":"The computational demands of modern AI techniques are immense, and as the number of practical applications grows, there will be an increasing burden on shared computing infrastructure. We envision a forthcoming era of \"AI Systems\" research where reducing resource consumption, reasoning about transient resource availability, trading off resource consumption for accuracy, and managing contention on specialized hardware will become the community's main research focus. This paper overviews the history of AI systems research, a vision for the future, and the open challenges ahead.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"53 1","pages":"1 - 6"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3352020.3352022","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44776599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud-Hosted Intelligence for Real-time IoT Applications
K. Birman, B. Hariharan, Christopher De Sa

Deploying machine learning into IoT cloud settings will require an evolution of the cloud infrastructure. In this white paper, we justify this assertion and identify new capabilities needed for real-time intelligent systems. We also outline our initial efforts to create a new edge architecture more suitable for ML. Although the work is still underway, several components exist, and we review them. We then point to open technical problems that will need to be solved as we progress further in this direction.
{"title":"Cloud-Hosted Intelligence for Real-time IoT Applications","authors":"K. Birman, B. Hariharan, Christopher De Sa","doi":"10.1145/3352020.3352023","DOIUrl":"https://doi.org/10.1145/3352020.3352023","url":null,"abstract":"Deploying machine learning into IoT cloud settings will require an evolution of the cloud infrastructure. In this white paper, we justify this assertion and identify new capabilities needed for real-time intelligent systems. We also outline our initial efforts to create a new edge architecture more suitable for ML. Although the work is still underway, several components exist, and we review them. We then point to open technical problems that will need to be solved as we progress further in this direction.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"53 1","pages":"7 - 13"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3352020.3352023","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43805758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging Deep Learning to Improve Performance Predictability in Cloud Microservices with Seer
Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, Christina Delimitrou

Performance unpredictability is a major roadblock to cloud adoption, and has performance, cost, and revenue ramifications. Predictable performance is even more critical as cloud services transition from monolithic designs to microservices. Detecting QoS violations after they occur in systems with microservices results in long recovery times, as hotspots propagate and amplify across dependent services; Seer instead leverages deep learning to anticipate QoS violations before they manifest.
{"title":"Leveraging Deep Learning to Improve Performance Predictability in Cloud Microservices with Seer","authors":"Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, Christina Delimitrou","doi":"10.1145/3352020.3352026","DOIUrl":"https://doi.org/10.1145/3352020.3352026","url":null,"abstract":"Performance unpredictability is a major roadblock towards cloud adoption, and has performance, cost, and revenue ramifications. Predictable performance is even more critical as cloud services transition from monolithic designs to microservices. Detecting UOS violations after they occur in systems with microservices results in long recovery times, as hotspots propagate and amplify across dependent services.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"53 1","pages":"34 - 39"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3352020.3352026","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43082359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Taming Hyper-parameters in Deep Learning Systems
Luo Mai, A. Koliousis, Guo Li, A. Brabete, P. Pietzuch
Deep learning (DL) systems expose many tuning parameters ("hyper-parameters") that affect the performance and accuracy of trained models. Users increasingly struggle to configure hyper-parameters and spend a substantial portion of their time tuning them empirically. We argue that future DL systems should be designed to help manage hyper-parameters. We describe how a distributed DL system can (i) remove the impact of hyper-parameters on both performance and accuracy, thus making it easier to decide on a good setting, and (ii) support more powerful dynamic policies for adapting hyper-parameters, which take monitored training metrics into account. We report results from prototype implementations that show the practicality of hyper-parameter-friendly DL system designs.
{"title":"Taming Hyper-parameters in Deep Learning Systems","authors":"Luo Mai, A. Koliousis, Guo Li, A. Brabete, P. Pietzuch","doi":"10.1145/3352020.3352029","DOIUrl":"https://doi.org/10.1145/3352020.3352029","url":null,"abstract":"Deep learning (DL) systems expose many tuning parameters (\"hyper-parameters\") that affect the performance and accuracy of trained models. Increasingly users struggle to configure hyper-parameters, and a substantial portion of time is spent tuning them empirically. We argue that future DL systems should be designed to help manage hyper-parameters. We describe how a distributed DL system can (i) remove the impact of hyper-parameters on both performance and accuracy, thus making it easier to decide on a good setting, and (ii) support more powerful dynamic policies for adapting hyper-parameters, which take monitored training metrics into account. We report results from prototype implementations that show the practicality of DL system designs that are hyper-parameter-friendly.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"53 1","pages":"52 - 58"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3352020.3352029","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48800132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When the Power of the Crowd Meets the Intelligence of the Middleware
Yifan Du, V. Issarny, F. Sailhan

The data gluttony of AI is well known: data fuels artificial intelligence. Technologies that help gather the needed data are thus essential, the IoT among them. However, deploying IoT solutions raises significant challenges, especially regarding the resource and financial costs at stake. In our view, mobile crowdsensing, a.k.a. phone sensing, has a major role to play because it can contribute massive data at relatively low cost. Still, crowdsensing is useless, and even harmful, if the contributed data are not properly analyzed. This paper surveys our work on systems that face this challenge, which also illustrates the virtuous circle of AI. We specifically focus on how intelligent crowdsensing middleware leverages on-device machine learning to enhance the reported physical observations.

Keywords: Crowdsensing, Middleware, Online learning.
{"title":"When the Power of the Crowd Meets the Intelligence of the Middleware","authors":"Yifan Du, V. Issarny, F. Sailhan","doi":"10.1145/3352020.3352033","DOIUrl":"https://doi.org/10.1145/3352020.3352033","url":null,"abstract":"The data gluttony of AI is well known: Data fuels the artificial intelligence. Technologies that help to gather the needed data are then essential, among which the IoT. However, the deployment of IoT solutions raises significant challenges, especially regarding the resource and financial costs at stake. It is our view that mobile crowdsensing, aka phone sensing, has a major role to play because it potentially contributes massive data at a relatively low cost. Still, crowdsensing is useless, and even harmful, if the contributed data are not properly analyzed. This paper surveys our work on the development of systems facing this challenge, which also illustrates the virtuous circles of AI. We specifically focus on how intelligent crowdsensing middleware leverages on-device machine learning to enhance the reported physical observations. Keywords: Crowdsensing, Middleware, Online learning.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"53 1","pages":"85 - 90"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3352020.3352033","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49280302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Morpheus
Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan, S. Swanson
In modern computing systems, object deserialization can become a surprisingly important bottleneck: in our tests, a set of general-purpose, highly parallelized applications spends 64% of total execution time deserializing data into objects. This paper presents the Morpheus model, which allows applications to move such computations to a storage device and bypass the overhead on the host system. We use this model to deserialize data into application objects inside storage devices, rather than in the host CPU. Using the Morpheus model for object deserialization avoids unnecessary system overheads, frees up scarce CPU and main memory resources for compute-intensive workloads, saves I/O bandwidth, and reduces power consumption. In heterogeneous, coprocessor-equipped systems, Morpheus allows application objects to be sent directly from a storage device to a co-processor (e.g., a GPU) by peer-to-peer transfer, further improving application performance as well as reducing CPU and main memory utilization. This paper implements Morpheus-SSD, an SSD supporting the Morpheus model. Morpheus-SSD improves the performance of object deserialization by 1.66×, reduces power consumption by 7%, uses 42% less energy, and speeds up total execution time by 1.32×. By using NVMe-P2P, which provides peer-to-peer communication between Morpheus-SSD and a GPU, Morpheus-SSD can speed up total execution time by 1.39× on a heterogeneous computing platform.
{"title":"Morpheus","authors":"Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan, S. Swanson","doi":"10.1145/3273982.3273989","DOIUrl":"https://doi.org/10.1145/3273982.3273989","url":null,"abstract":"In modern computing systems, object deserialization can become a surprisingly important bottleneck-in our test, a set of generalpurpose, highly parallelized applications spends 64% of total execution time deserializing data into objects. This paper presents the Morpheus model, which allows applications to move such computations to a storage device and bypass the overhead on the host system. We use this model to deserialize data into application objects inside storage devices, rather than in the host CPU. Using the Morpheus model for object deserialization avoids unnecessary system overheads, frees up scarce CPU and main memory resources for compute-intensive workloads, saves I/O bandwidth, and reduces power consumption. In heterogeneous, coprocessor- equipped systems, Morpheus allows application objects to be sent directly from a storage device to a co-processor (e.g., a GPU) by peer-to-peer transfer, further improving application performance as well as reducing the CPU and main memory utilizations. This paper implements Morpheus-SSD, an SSD supporting the Morpheus model. Morpheus-SSD improves the performance of object deserialization by 1.66x, reduces power consumption by 7%, uses 42% less energy, and speeds up the total execution time by 1.32x. By using NVMe-P2P that realizes peer-to-peer communication between Morpheus-SSD and a GPU, Morpheus-SSD can speed up the total execution time by 1.39x in a heterogeneous computing platform.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3273982.3273989","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64013000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ActivePointers
Sagi Shahar, Shai Bergman, M. Silberstein

Modern discrete GPUs have been the processors of choice for accelerating compute-intensive applications, but using them in large-scale data processing is extremely challenging. Unfortunately, they do not provide important I/O abstractions long established in the CPU context, such as memory-mapped files, which shield programmers from the complexity of buffer and I/O device management. However, implementing these abstractions on GPUs poses a problem: the limited GPU virtual memory system provides no address space management and page fault handling mechanisms to GPU developers, and does not allow modifications to memory mappings for running GPU programs. We implement ActivePointers, a software address translation layer and paging system that introduces native support for page faults and virtual address space management to GPU programs, and enables the implementation of fully functional memory-mapped files on commodity GPUs. Files mapped into GPU memory are accessed using active pointers, which behave like regular pointers but access the GPU page cache under the hood, and trigger page faults which are handled on the GPU. We design and evaluate a number of novel mechanisms, including a translation cache in hardware registers and translation aggregation for deadlock-free page fault handling of threads in a single warp. We extensively evaluate ActivePointers on commodity NVIDIA GPUs using microbenchmarks, and also implement a complex image processing application that constructs a photo collage from a subset of 10 million images stored in a 40GB file. The GPU implementation maps the entire file into GPU memory and accesses it via active pointers. The use of active pointers adds only up to 1% to the application's runtime, while enabling speedups of up to 3.9× over a combined CPU+GPU implementation and 2.6× over a 12-core CPU-only implementation that uses AVX vector instructions.
{"title":"ActivePointers","authors":"Sagi Shahar, Shai Bergman, M. Silberstein","doi":"10.1145/3273982.3273990","DOIUrl":"https://doi.org/10.1145/3273982.3273990","url":null,"abstract":"Modern discrete GPUs have been the processors of choice for accelerating compute-intensive applications, but using them in largescale data processing is extremely challenging. Unfortunately, they do not provide important I/O abstractions long established in the CPU context, such as memory mapped files, which shield programmers from the complexity of buffer and I/O device management. However, implementing these abstractions on GPUs poses a problem: the limited GPU virtual memory system provides no address space management and page fault handling mechanisms to GPU developers, and does not allow modifications to memory mappings for running GPU programs. We implement ActivePointers, a software address translation layer and paging system that introduces native support for page faults and virtual address space management to GPU programs, and enables the implementation of fully functional memory mapped files on commodity GPUs. Files mapped into GPU memory are accessed using active pointers, which behave like regular pointers but access the GPU page cache under the hood, and trigger page faults which are handled on the GPU. We design and evaluate a number of novel mechanisms, including a translation cache in hardware registers and translation aggregation for deadlock-free page fault handling of threads in a single warp. We extensively evaluate ActivePointers on commodity NVIDIA GPUs using microbenchmarks, and also implement a complex image processing application that constructs a photo collage from a subset of 10 million images stored in a 40GB file. The GPU implementation maps the entire file into GPU memory and accesses it via active pointers. The use of active pointers adds only up to 1% to the application's runtime, while enabling speedups of up to 3.9x over a combined CPU+GPU implementation and 2.6x over a 12-core CPU-only implementation which uses AVX vector instructions.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78407299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}