Marco F. Cusumano-Towner, Benjamin Bichsel, Timon Gehr, Martin T. Vechev, Vikash K. Mansinghka
We present a novel approach for approximate sampling in probabilistic programs based on incremental inference. The key idea is to adapt the samples for a program P into samples for a program Q, thereby avoiding the expensive sampling computation for program Q. To enable incremental inference in probabilistic programming, our work: (i) introduces the concept of a trace translator which adapts samples from P into samples of Q, (ii) phrases this translation approach in the context of sequential Monte Carlo (SMC), which gives theoretical guarantees that the adapted samples converge to the distribution induced by Q, and (iii) shows how to obtain a concrete trace translator by establishing a correspondence between the random choices of the two probabilistic programs. We implemented our approach in two different probabilistic programming systems and showed that, compared to methods that sample the program Q from scratch, incremental inference can lead to orders of magnitude increase in efficiency, depending on how closely related P and Q are.
{"title":"Incremental inference for probabilistic programs","authors":"Marco F. Cusumano-Towner, Benjamin Bichsel, Timon Gehr, Martin T. Vechev, Vikash K. Mansinghka","doi":"10.1145/3296979.3192399","DOIUrl":"https://doi.org/10.1145/3296979.3192399","url":null,"abstract":"We present a novel approach for approximate sampling in probabilistic programs based on incremental inference. The key idea is to adapt the samples for a program P into samples for a program Q, thereby avoiding the expensive sampling computation for program Q. To enable incremental inference in probabilistic programming, our work: (i) introduces the concept of a trace translator which adapts samples from P into samples of Q, (ii) phrases this translation approach in the context of sequential Monte Carlo (SMC), which gives theoretical guarantees that the adapted samples converge to the distribution induced by Q, and (iii) shows how to obtain a concrete trace translator by establishing a correspondence between the random choices of the two probabilistic programs. We implemented our approach in two different probabilistic programming systems and showed that, compared to methods that sample the program Q from scratch, incremental inference can lead to orders of magnitude increase in efficiency, depending on how closely related P and Q are.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"28 1","pages":"571 - 585"},"PeriodicalIF":0.0,"publicationDate":"2018-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86667229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chit-Kwan Lin, Andreas Wild, G. Chinya, Tsung-Han Lin, Mike Davies, Hong Wang
We present a compiler for Loihi, a novel manycore neuromorphic processor that features a programmable, on-chip learning engine for training and executing spiking neural networks (SNNs). An SNN is distinguished from other neural networks in that (1) its independent computing units, or "neurons", communicate with others only through spike messages; and (2) each neuron evaluates local learning rules, which are functions of spike arrival and departure timings, to modify its local state. The collective neuronal state dynamics of an SNN form a nonlinear dynamical system that can be cast as an unconventional model of computation. To realize such an SNN on Loihi requires each constituent neuron to locally store and independently update its own spike timing information. However, each Loihi core has limited resources for this purpose and these must be shared by neurons assigned to the same core. In this work, we present a compiler for Loihi that maps the neurons of an SNN onto and across Loihi's cores efficiently. We show that a poor neuron-to-core mapping can incur significant energy costs and address this with a greedy algorithm that compiles SNNs onto Loihi in a power-efficient manner. In so doing, we highlight the need for further development of compilers for this new, emerging class of architectures.
{"title":"Mapping spiking neural networks onto a manycore neuromorphic architecture","authors":"Chit-Kwan Lin, Andreas Wild, G. Chinya, Tsung-Han Lin, Mike Davies, Hong Wang","doi":"10.1145/3296979.3192371","DOIUrl":"https://doi.org/10.1145/3296979.3192371","url":null,"abstract":"We present a compiler for Loihi, a novel manycore neuromorphic processor that features a programmable, on-chip learning engine for training and executing spiking neural networks (SNNs). An SNN is distinguished from other neural networks in that (1) its independent computing units, or \"neurons\", communicate with others only through spike messages; and (2) each neuron evaluates local learning rules, which are functions of spike arrival and departure timings, to modify its local state. The collective neuronal state dynamics of an SNN form a nonlinear dynamical system that can be cast as an unconventional model of computation. To realize such an SNN on Loihi requires each constituent neuron to locally store and independently update its own spike timing information. However, each Loihi core has limited resources for this purpose and these must be shared by neurons assigned to the same core. In this work, we present a compiler for Loihi that maps the neurons of an SNN onto and across Loihi's cores efficiently. We show that a poor neuron-to-core mapping can incur significant energy costs and address this with a greedy algorithm that compiles SNNs onto Loihi in a power-efficient manner. In so doing, we highlight the need for further development of compilers for this new, emerging class of architectures.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"15 1","pages":"78 - 89"},"PeriodicalIF":0.0,"publicationDate":"2018-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80462146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Type systems and syntactic sugar are both valuable to programmers, but sometimes at odds. While sugar is a valuable mechanism for implementing realistic languages, the expansion process obscures program source structure. As a result, type errors can reference terms the programmers did not write (and even constructs they do not know), baffling them. The language developer must also manually construct type rules for the sugars, to give a typed account of the surface language. We address these problems by presenting a process for automatically reconstructing type rules for the surface language using rules for the core. We have implemented this theory, and show several interesting case studies.
{"title":"Inferring type rules for syntactic sugar","authors":"Justin Pombrio, S. Krishnamurthi","doi":"10.1145/3296979.3192398","DOIUrl":"https://doi.org/10.1145/3296979.3192398","url":null,"abstract":"Type systems and syntactic sugar are both valuable to programmers, but sometimes at odds. While sugar is a valuable mechanism for implementing realistic languages, the expansion process obscures program source structure. As a result, type errors can reference terms the programmers did not write (and even constructs they do not know), baffling them. The language developer must also manually construct type rules for the sugars, to give a typed account of the surface language. We address these problems by presenting a process for automatically reconstructing type rules for the surface language using rules for the core. We have implemented this theory, and show several interesting case studies.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"94 1","pages":"812 - 825"},"PeriodicalIF":0.0,"publicationDate":"2018-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84594377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kostas Ferles, Jacob Van Geffen, Işıl Dillig, Y. Smaragdakis
Explicit signaling between threads is a perennial cause of bugs in concurrent programs. While there are several run-time techniques to automatically notify threads upon the availability of some shared resource, such techniques are not widely-adopted due to their run-time overhead. This paper proposes a new solution based on static analysis for automatically generating a performant explicit-signal program from its corresponding implicit-signal implementation. The key idea is to generate verification conditions that allow us to minimize the number of required signals and unnecessary context switches, while guaranteeing semantic equivalence between the source and target programs. We have implemented our method in a tool called Expresso and evaluate it on challenging benchmarks from prior papers and open-source software. Expresso-generated code significantly outperforms past automatic signaling mechanisms (avg. 1.56x speedup) and closely matches the performance of hand-optimized explicit-signal code.
{"title":"Symbolic reasoning for automatic signal placement","authors":"Kostas Ferles, Jacob Van Geffen, Işıl Dillig, Y. Smaragdakis","doi":"10.1145/3296979.3192395","DOIUrl":"https://doi.org/10.1145/3296979.3192395","url":null,"abstract":"Explicit signaling between threads is a perennial cause of bugs in concurrent programs. While there are several run-time techniques to automatically notify threads upon the availability of some shared resource, such techniques are not widely-adopted due to their run-time overhead. This paper proposes a new solution based on static analysis for automatically generating a performant explicit-signal program from its corresponding implicit-signal implementation. The key idea is to generate verification conditions that allow us to minimize the number of required signals and unnecessary context switches, while guaranteeing semantic equivalence between the source and target programs. We have implemented our method in a tool called Expresso and evaluate it on challenging benchmarks from prior papers and open-source software. Expresso-generated code significantly outperforms past automatic signaling mechanisms (avg. 1.56x speedup) and closely matches the performance of hand-optimized explicit-signal code.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"1 1","pages":"120 - 134"},"PeriodicalIF":0.0,"publicationDate":"2018-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86897473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a novel approach for dynamic binary translation (DBT) to automatically learn translation rules from guest and host binaries compiled from the same source code. The learned trans...
{"title":"Enhancing Cross-ISA DBT Through Automatically Learned Translation Rules","authors":"WangWenwen, McCamantStephen, ZhaiAntonia, YewPen-Chung","doi":"10.1145/3296957.3177160","DOIUrl":"https://doi.org/10.1145/3296957.3177160","url":null,"abstract":"This paper presents a novel approach for dynamic binary translation (DBT) to automatically learn translation rules from guest and host binaries compiled from the same source code. The learned trans...","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3296957.3177160","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43825226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LinShih-Chieh, ZhangYunqi, HsuChang-Hong, SkachMatt, E. HaqueMd, TangLingjia, MarsJason
Autonomous driving systems have attracted a significant amount of interest recently, and many industry leaders, such as Google, Uber, Tesla, and Mobileye, have invested a large amount of capital an...
{"title":"The Architectural Implications of Autonomous Driving","authors":"LinShih-Chieh, ZhangYunqi, HsuChang-Hong, SkachMatt, E. HaqueMd, TangLingjia, MarsJason","doi":"10.1145/3296957.3173191","DOIUrl":"https://doi.org/10.1145/3296957.3173191","url":null,"abstract":"Autonomous driving systems have attracted a significant amount of interest recently, and many industry leaders, such as Google, Uber, Tesla, and Mobileye, have invested a large amount of capital an...","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3296957.3173191","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42958270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a static, precise, and scalable technique for finding CVEs (Common Vulnerabilities and Exposures) in stripped firmware images. Our technique is able to efficiently find vulnerabilities in real-world firmware with high accuracy. Given a vulnerable procedure in an executable binary and a firmware image containing multiple stripped binaries, our goal is to detect possible occurrences of the vulnerable procedure in the firmware image. Due to the variety of architectures and unique tool chains used by vendors, as well as the highly customized nature of firmware, identifying procedures in stripped firmware is extremely challenging. Vulnerability detection requires not only pairwise similarity between procedures but also information about the relationships between procedures in the surrounding executable. This observation serves as the foundation for a novel technique that establishes a partial correspondence between procedures in the two binaries. We implemented our technique in a tool called FirmUp and performed an extensive evaluation over 40 million procedures, over 4 different prevalent architectures, crawled from public vendor firmware images. We discovered 373 vulnerabilities affecting publicly available firmware, 147 of them in the latest available firmware version for the device. A thorough comparison of FirmUp to previous methods shows that it accurately and effectively finds vulnerabilities in firmware, while outperforming the detection rate of the state of the art by 45% on average.
{"title":"FirmUp","authors":"Yaniv David, Nimrod Partush, Eran Yahav","doi":"10.1145/3296957.3177157","DOIUrl":"https://doi.org/10.1145/3296957.3177157","url":null,"abstract":"We present a static, precise, and scalable technique for finding CVEs (Common Vulnerabilities and Exposures) in stripped firmware images. Our technique is able to efficiently find vulnerabilities in real-world firmware with high accuracy. Given a vulnerable procedure in an executable binary and a firmware image containing multiple stripped binaries, our goal is to detect possible occurrences of the vulnerable procedure in the firmware image. Due to the variety of architectures and unique tool chains used by vendors, as well as the highly customized nature of firmware, identifying procedures in stripped firmware is extremely challenging. Vulnerability detection requires not only pairwise similarity between procedures but also information about the relationships between procedures in the surrounding executable. This observation serves as the foundation for a novel technique that establishes a partial correspondence between procedures in the two binaries. We implemented our technique in a tool called FirmUp and performed an extensive evaluation over 40 million procedures, over 4 different prevalent architectures, crawled from public vendor firmware images. We discovered 373 vulnerabilities affecting publicly available firmware, 147 of them in the latest available firmware version for the device. A thorough comparison of FirmUp to previous methods shows that it accurately and effectively finds vulnerabilities in firmware, while outperforming the detection rate of the state of the art by 45% on average.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"183 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85617753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Rajadurai, Jeffrey Bosboom, W. Wong, S. Amarasinghe
An important class of applications computes on long-running or infinite streams of data, often with known fixed data rates. The latter is referred to as synchronous data flow ~(SDF) streams. These stream applications need to run on clusters or the cloud due to the high performance requirement. Further, they require live reconfiguration and reoptimization for various reasons such as hardware maintenance, elastic computation, or to respond to fluctuations in resources or application workload. However, reconfiguration and reoptimization without downtime while accurately preserving program state in a distributed environment is difficult. In this paper, we introduce Gloss, a suite of compiler and runtime techniques for live reconfiguration of distributed stream programs. Gloss, for the first time, avoids periods of zero throughput during the reconfiguration of both stateless and stateful SDF based stream programs. Furthermore, unlike other systems, Gloss globally reoptimizes and completely recompiles the program during reconfiguration. This permits it to reoptimize the application for entirely new configurations that it may not have encountered before. All these Gloss operations happen in-situ, requiring no extra hardware resources. We show how Gloss allows stream programs to reconfigure and reoptimize with no downtime and minimal overhead, and demonstrate the wider applicability of it via a variety of experiments.
{"title":"Gloss","authors":"S. Rajadurai, Jeffrey Bosboom, W. Wong, S. Amarasinghe","doi":"10.1145/3296957.3173170","DOIUrl":"https://doi.org/10.1145/3296957.3173170","url":null,"abstract":"\u0000 An important class of applications computes on long-running or infinite streams of data, often with known fixed data rates. The latter is referred to as\u0000 synchronous data flow\u0000 ~(SDF) streams. These stream applications need to run on clusters or the cloud due to the high performance requirement. Further, they require live reconfiguration and reoptimization for various reasons such as hardware maintenance, elastic computation, or to respond to fluctuations in resources or application workload. However, reconfiguration and reoptimization without downtime while accurately preserving program state in a distributed environment is difficult. In this paper, we introduce Gloss, a suite of compiler and runtime techniques for live reconfiguration of distributed stream programs. Gloss, for the first time, avoids periods of zero throughput during the reconfiguration of both stateless and stateful SDF based stream programs. Furthermore, unlike other systems, Gloss globally reoptimizes and completely recompiles the program during reconfiguration. This permits it to reoptimize the application for entirely new configurations that it may not have encountered before. All these Gloss operations happen in-situ, requiring no extra hardware resources. We show how Gloss allows stream programs to reconfigure and reoptimize with no downtime and minimal overhead, and demonstrate the wider applicability of it via a variety of experiments.\u0000","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"344 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77621877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, H. Sarbazi-Azad, M. Drumond, B. Falsafi, Rachata Ausavarungnirun, O. Mutlu
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes hierarchical register file, to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8X larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.
{"title":"LTRF","authors":"Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, H. Sarbazi-Azad, M. Drumond, B. Falsafi, Rachata Ausavarungnirun, O. Mutlu","doi":"10.1145/3296957.3173211","DOIUrl":"https://doi.org/10.1145/3296957.3173211","url":null,"abstract":"Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes hierarchical register file, to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8X larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"80 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86490881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In-Memory cluster Computing (IMC) frameworks (e.g., Spark) have become increasingly important because they typically achieve more than 10× speedups over the traditional On-Disk cluster Computing (O...
{"title":"Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing","authors":"YuZhibin, BeiZhendong, QianXuehai","doi":"10.1145/3296957.3173187","DOIUrl":"https://doi.org/10.1145/3296957.3173187","url":null,"abstract":"In-Memory cluster Computing (IMC) frameworks (e.g., Spark) have become increasingly important because they typically achieve more than 10× speedups over the traditional On-Disk cluster Computing (O...","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3296957.3173187","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45723935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}