Title: Suggesting meaningful variable names for decompiled code: a machine translation approach
Authors: A. Jaffe
DOI: 10.1145/3106237.3121274
Abstract: Decompiled code lacks meaningful variable names. We used statistical machine translation to suggest variable names that are natural given the context. This technique has previously been successfully applied to obfuscated JavaScript code, but decompiled C code poses unique challenges in constructing an aligned corpus and selecting the best translation from among several candidates.
Title: Record and replay for Android: are we there yet in industrial cases?
Authors: Wing Lam, Zhengkai Wu, Dengfeng Li, Wenyu Wang, Haibing Zheng, Hui Luo, Peng Yan, Yuetang Deng, Tao Xie
DOI: 10.1145/3106237.3117769
Abstract: Mobile applications, or apps for short, are gaining popularity. The input sources (e.g., touchscreen, sensors, transmitters) of the smart devices that host these apps enable the apps to offer a rich experience to the users, but these input sources pose testing complications to the developers (e.g., writing tests to accurately utilize multiple input sources together and be able to replay such tests at a later time). To alleviate these complications, researchers and practitioners in recent years have developed a variety of record-and-replay tools to support the testing expressiveness of smart devices. These tools allow developers to easily record and automate the replay of complicated usage scenarios of their app. Due to Android's large share of the smart-device market, numerous record-and-replay tools have been developed using a variety of techniques to test Android apps. To better understand the strengths and weaknesses of these tools, we present a comparison of popular record-and-replay tools from researchers and practitioners, by applying these tools to test three popular industrial apps downloaded from the Google Play store. Our comparison is based on three main metrics: (1) ability to reproduce common usage scenarios, (2) space overhead of traces created by the tools, and (3) robustness of traces created by the tools (when being replayed on devices with different resolutions). The results from our comparison show which record-and-replay tools may be the best for developers and identify future directions for improving these tools to better address testing complications of smart devices.
Title: Practical symbolic verification of regular properties
Authors: Hengbiao Yu
DOI: 10.1145/3106237.3121275
Abstract: It is challenging to verify regular properties of programs. This paper presents symbolic regular verification (SRV), a dynamic symbolic execution based technique for verifying regular properties. The key technique of SRV is a novel synergistic combination of property-oriented path slicing and guiding to mitigate the path explosion problem. Indeed, slicing can prune redundant paths, while guiding can boost the finding of counterexamples. We have implemented SRV for Java and evaluated it on 16 real-world open-source Java programs (totaling 270K lines of code). The experimental results demonstrate the effectiveness and efficiency of SRV.
Title: Reasons and drawbacks of using trivial npm packages: the developers' perspective
Authors: Rabe Abdalkareem
DOI: 10.1145/3106237.3121278
Abstract: Code reuse is traditionally seen as good practice. Recent trends have pushed the idea of code reuse to an extreme by using packages that implement simple and trivial tasks, which we call 'trivial packages'. A recent incident in which a trivial package led to the breakdown of some of the most popular web applications, such as Facebook and Netflix, put the spotlight on whether using trivial packages should be encouraged. Therefore, in this research, we mine more than 230,000 npm packages and 38,000 JavaScript projects to study the prevalence of trivial packages. We found that trivial packages are common, making up 16.8% of the studied npm packages. We surveyed 88 Node.js developers who use trivial packages to understand the reasons for and drawbacks of their use. We found that trivial packages are used because they are perceived to be well-implemented and well-tested pieces of code. However, developers are concerned about the maintenance overhead and the risk of breakage caused by the extra dependencies that trivial packages introduce.
Title: Understanding the impact of refactoring on smells: a longitudinal study of 23 software projects
Authors: Diego Cedrim, Alessandro F. Garcia, Melina Mongiovi, Rohit Gheyi, L. Sousa, R. Mello, B. Neto, Márcio Ribeiro, Alexander Chávez
DOI: 10.1145/3106237.3106259
Abstract: Code smells in a program are indications of structural quality problems that can be addressed by software refactoring. However, refactoring is applied with different goals in practice, and it may not reduce smelly structures; developers may neglect existing smells or end up creating new ones while refactoring. Unfortunately, little has been reported about the beneficial and harmful effects of refactoring on code smells. This paper reports a longitudinal study intended to address this gap. We analyze how often commonly used refactoring types affect the density of 13 types of code smells along the version histories of 23 projects. Our findings are based on the analysis of 16,566 refactorings distributed across 10 different types. Even though 79.4% of the refactorings touched smelly elements, 57% did not reduce the occurrences of smells. Surprisingly, only 9.7% of refactorings removed smells, while 33.3% induced the introduction of new ones. More than 95% of such refactoring-induced smells were not removed in subsequent commits, which suggests that refactorings tend to introduce long-living smells more often than they eliminate existing ones. We also characterized and quantified typical refactoring-smell patterns and observed that harmful patterns are frequent, including: (i) approximately 30% of Move Method and Pull Up Method refactorings induced the emergence of God Class, and (ii) the Extract Superclass refactoring created the Speculative Generality smell in 68% of the cases.
Title: The power of "why" and "why not": enriching scenario exploration with provenance
Authors: Tim Nelson, Natasha Danas, Daniel J. Dougherty, S. Krishnamurthi
DOI: 10.1145/3106237.3106272
Abstract: Scenario-finding tools like the Alloy Analyzer are widely used in numerous concrete domains such as security, network analysis, and UML analysis. They can help verify properties and, more generally, aid in exploring a system's behavior. While scenario finders are valuable for their ability to produce concrete examples, individual scenarios only give insight into what is possible, leaving the user to draw their own conclusions about what might be necessary. This paper enriches scenario finding by allowing users to ask "why?" and "why not?" questions about the examples they are given. We show how to distinguish parts of an example that cannot be consistently removed (or changed) from those that merely reflect underconstraint in the specification. In the former case, we show how to determine which elements of the specification and which other components of the example together explain the presence of such facts. This paper formalizes the act of computing provenance in scenario finding. We present Amalgam, an extension of the popular Alloy scenario finder, which implements these foundations and provides interactive exploration of examples. We also evaluate Amalgam's algorithmics on a variety of both textbook and real-world examples.
Title: Reproducing concurrency failures from crash stacks
Authors: F. A. Bianchi, M. Pezzè, Valerio Terragni
DOI: 10.1145/3106237.3106292
Abstract: Reproducing field failures is the first essential step in understanding, localizing, and removing faults. Reproducing concurrency field failures is hard because it requires synthesizing test code together with a thread interleaving that induces the failure, given only the limited information available from the field. Current techniques for reproducing concurrency failures focus on identifying failure-inducing interleavings, leaving largely open the problem of synthesizing the test code that manifests such interleavings. In this paper, we present ConCrash, a technique that automatically generates test code to reproduce concurrency failures that violate thread safety, starting from crash stacks, which commonly summarize the conditions of field failures. ConCrash efficiently explores the huge space of possible test codes to identify a failure-inducing one by using a suitable set of search-pruning strategies. Combined with existing techniques for exploring interleavings, ConCrash automatically reproduces a given concurrency failure that violates the thread safety of a class by identifying both a failure-inducing test code and a corresponding interleaving. In the paper, we define the ConCrash approach, present a prototype implementation, and discuss the experimental results that we obtained on a known set of ten field failures, which witness the effectiveness of the approach.
Title: Screening heuristics for project gating systems
Authors: Zahy Volf, Edi Shmueli
DOI: 10.1145/3106237.3117766
Abstract: Continuous Integration (CI) is a hot topic, yet little attention is paid to how CI systems work and what affects their behavior. In parallel, bug prediction in software is gaining considerable attention, but mostly in the context of software engineering; its relation to CI and the engineering of CI systems has not yet been established. In this paper we describe how Project Gating systems operate; these are a specific type of CI system used to keep the mainline of development always clean. We propose and evaluate three heuristics for improving Gating performance and demonstrate their trade-offs. The third heuristic, which leverages state-of-the-art bug prediction, achieves the best performance across the entire spectrum of workload conditions.
Title: Measuring neural efficiency of program comprehension
Authors: J. Siegmund, Norman Peitek, Chris Parnin, S. Apel, Johannes C. Hofmeister, Christian Kästner, Andrew Begel, A. Bethmann, A. Brechmann
DOI: 10.1145/3106237.3106268
Abstract: Most modern software programs cannot be understood in their entirety by a single programmer. Instead, programmers must rely on a set of cognitive processes that aid in seeking, filtering, and shaping relevant information for a given programming task. Several theories have been proposed to explain these processes, such as "beacons" for locating relevant code and "plans" for encoding cognitive models. However, these theories are decades old and lack validation with modern cognitive-neuroscience methods. In this paper, we report on a study using functional magnetic resonance imaging (fMRI) with 11 participants who performed program comprehension tasks. We manipulated experimental conditions related to beacons and layout to isolate specific cognitive processes related to bottom-up comprehension and comprehension based on semantic cues. We found evidence of semantic chunking during bottom-up comprehension and lower activation of brain areas during comprehension based on semantic cues, confirming that beacons ease comprehension.
Title: Static analysis for optimizing big data queries
Authors: D. Garbervetsky, Zvonimir Pavlinovic, Mike Barnett, Madan Musuvathi, Todd Mytkowicz, Edgardo Zoppi
DOI: 10.1145/3106237.3117774
Abstract: Query languages for big data analysis provide user extensibility through a mechanism of user-defined operators (UDOs). These operators allow programmers to write proprietary functionality on top of a relational query skeleton. However, achieving effective query optimization for such languages is extremely challenging, because the optimizer needs to understand the data dependencies induced by UDOs. SCOPE, the query language from Microsoft, allows hand-coded declarations of UDO data dependencies. Unfortunately, most programmers avoid using this facility, since writing and maintaining the declarations is tedious and error-prone. In this work, we designed and implemented two sound and robust static analyses for computing UDO data dependencies. The analyses detect which columns of an input table are never used or pass through a UDO unchanged. This information can be used to significantly improve the execution of SCOPE scripts. We evaluate our analyses on thousands of real-world queries and show that we can catch many unused and pass-through columns automatically, without relying on any manually provided declarations.