Pub Date: 2021-05-01 | DOI: 10.1109/ICSE43902.2021.00075
Don’t Do That! Hunting Down Visual Design Smells in Complex UIs Against Design Guidelines
Bo Yang, Zhenchang Xing, Xin Xia, Chunyang Chen, Deheng Ye, Shanping Li
Just like code smells in source code, UI design has visual design smells. We study 93 don't-do-that guidelines in Material Design, a complex design system created by Google. We find that these don't-guidelines go far beyond UI aesthetics, involving seven general design dimensions (layout, typography, iconography, navigation, communication, color, and shape) and four component design aspects (anatomy, placement, behavior, and usage). Violating these guidelines results in visual design smells in UIs (or UI design smells). In a study of 60,756 UIs of 9,286 Android apps, we find that 7,497 UIs of 2,587 apps have at least one violation of some Material Design guideline. This reveals a lack of developer training and tool support for avoiding UI design smells. To fill this gap, we design an automated UI design smell detector (UIS-Hunter) that extracts and validates multi-modal UI information (component metadata, typography, iconography, color, and edge) to detect violations of diverse don't-guidelines in Material Design. The detection accuracy of UIS-Hunter is high (precision=0.81, recall=0.90) on the 60,756 UIs of the 9,286 apps. We build a guideline gallery of real-world UI design smells detected by UIS-Hunter, so that developers can learn Material Design best practices. Our user studies show that UIS-Hunter is more effective than manual detection of UI design smells, and that the UI design smells detected by UIS-Hunter have severe negative impacts on app users.
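To make the idea concrete, here is a minimal sketch of what one automated don't-guideline check might look like. The component-metadata schema, the `check_single_fab` helper, and the single-floating-action-button rule are illustrative assumptions, not UIS-Hunter's actual API or rule set.

```python
# Hypothetical sketch of a single don't-guideline validator in the spirit of
# UIS-Hunter; the metadata schema and rule below are assumptions, not the
# tool's real interface.

def check_single_fab(ui_components):
    """Material Design advises against multiple floating action buttons
    on one screen; flag UIs that appear to violate this."""
    fabs = [c for c in ui_components if c.get("type") == "FloatingActionButton"]
    if len(fabs) > 1:
        return {"guideline": "single-FAB-per-screen", "violations": fabs}
    return None

# Example component metadata, as it might be extracted from a UI screenshot:
ui = [
    {"type": "FloatingActionButton", "bounds": [900, 1500, 1020, 1620]},
    {"type": "FloatingActionButton", "bounds": [60, 1500, 180, 1620]},
    {"type": "TopAppBar", "bounds": [0, 0, 1080, 160]},
]
print(check_single_fab(ui))  # reports a UI design smell
```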
{"title":"Don’t Do That! Hunting Down Visual Design Smells in Complex UIs Against Design Guidelines","authors":"Bo Yang, Zhenchang Xing, Xin Xia, Chunyang Chen, Deheng Ye, Shanping Li","doi":"10.1109/ICSE43902.2021.00075","DOIUrl":"https://doi.org/10.1109/ICSE43902.2021.00075","url":null,"abstract":"Just like code smells in source code, UI design has visual design smells. We study 93 don't-do-that guidelines in the Material Design, a complex design system created by Google. We find that these don't-guidelines go far beyond UI aesthetics, and involve seven general design dimensions (layout, typography, iconography, navigation, communication, color, and shape) and four component design aspects (anatomy, placement, behavior, and usage). Violating these guidelines results in visual design smells in UIs (or UI design smells). In a study of 60,756 UIs of 9,286 Android apps, we find that 7,497 UIs of 2,587 apps have at least one violation of some Material Design guidelines. This reveals the lack of developer training and tool support to avoid UI design smells. To fill this gap, we design an automated UI design smell detector (UIS-Hunter) that extracts and validates multi-modal UI information (component metadata, typography, iconography, color, and edge) for detecting the violation of diverse don't-guidelines in Material Design. The detection accuracy of UIS-Hunter is high (precision=0.81, recall=0.90) on the 60,756 UIs of 9,286 apps. We build a guideline gallery with real-world UI design smells that UIS-Hunter detects for developers to learn the best Material Design practices. Our user studies show that UIS-Hunter is more effective than manual detection of UI design smells, and the UI design smells that are detected by UIS-Hunter have severely negative impacts on app users.","PeriodicalId":305167,"journal":{"name":"2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115314841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-05-01 | DOI: 10.1109/ICSE43902.2021.00033
An Empirical Study of Refactorings and Technical Debt in Machine Learning Systems
Yiming Tang, Raffi Khatchadourian, M. Bagherzadeh, Rhia Singh, Ajani Stewart, A. Raja
Machine Learning (ML) systems, including Deep Learning (DL) systems, i.e., those with ML capabilities, are pervasive in today's data-driven society. Such systems are complex: they comprise ML models and many subsystems that support learning processes. As with other complex systems, ML systems are prone to classic technical debt issues, especially when they are long-lived, but they also exhibit debt specific to ML. Unfortunately, there is a knowledge gap in how ML systems actually evolve and are maintained. In this paper, we fill this gap by studying refactorings, i.e., source-to-source semantics-preserving program transformations, performed in real-world, open-source software, and the technical debt issues they alleviate. We analyzed 26 projects, comprising 4.2 MLOC, along with 327 manually examined code patches. The results indicate that developers refactor these systems for a variety of reasons, both specific and tangential to ML; that some refactorings correspond to established technical debt categories while others do not; and that code duplication is a major cross-cutting theme, particularly involving ML configuration and model code, which was also the most refactored. We also introduce 14 new ML-specific refactorings and 7 new technical debt categories, and put forth several recommendations, best practices, and anti-patterns. The results can potentially assist practitioners, tool developers, and educators in facilitating long-term ML system usefulness.
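As a concrete, hypothetical illustration of the duplication theme (not an example from the paper's corpus), consider two training scripts that repeat the same optimizer configuration; the sketch below, with invented names, shows the kind of extract-function refactoring that removes such ML-configuration duplication.

```python
# Before: the same optimizer hyperparameters duplicated verbatim in
# train_baseline.py and train_large.py, e.g.:
#   optimizer = Adam(lr=1e-3, beta_1=0.9, beta_2=0.999)

# After: duplication removed via a single, parameterized helper.
def make_optimizer_config(lr=1e-3):
    """Single source of truth for the shared optimizer hyperparameters."""
    return {"name": "adam", "lr": lr, "beta_1": 0.9, "beta_2": 0.999}

baseline_cfg = make_optimizer_config()
large_cfg = make_optimizer_config(lr=3e-4)  # only the delta is stated
```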
{"title":"An Empirical Study of Refactorings and Technical Debt in Machine Learning Systems","authors":"Yiming Tang, Raffi Khatchadourian, M. Bagherzadeh, Rhia Singh, Ajani Stewart, A. Raja","doi":"10.1109/ICSE43902.2021.00033","DOIUrl":"https://doi.org/10.1109/ICSE43902.2021.00033","url":null,"abstract":"Machine Learning (ML), including Deep Learning (DL), systems, i.e., those with ML capabilities, are pervasive in today's data-driven society. Such systems are complex; they are comprised of ML models and many subsystems that support learning processes. As with other complex systems, ML systems are prone to classic technical debt issues, especially when such systems are long-lived, but they also exhibit debt specific to these systems. Unfortunately, there is a gap of knowledge in how ML systems actually evolve and are maintained. In this paper, we fill this gap by studying refactorings, i.e., source-to-source semantics-preserving program transformations, performed in real-world, open-source software, and the technical debt issues they alleviate. We analyzed 26 projects, consisting of 4.2 MLOC, along with 327 manually examined code patches. The results indicate that developers refactor these systems for a variety of reasons, both specific and tangential to ML, some refactorings correspond to established technical debt categories, while others do not, and code duplication is a major cross-cutting theme that particularly involved ML configuration and model code, which was also the most refactored. We also introduce 14 and 7 new ML-specific refactorings and technical debt categories, respectively, and put forth several recommendations, best practices, and anti-patterns. The results can potentially assist practitioners, tool developers, and educators in facilitating long-term ML system usefulness.","PeriodicalId":305167,"journal":{"name":"2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129527493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-05-01 | DOI: 10.1109/ICSE43902.2021.00093
What Makes a Great Maintainer of Open Source Projects?
E. Dias, Paulo Meirelles, F. C. Filho, Igor Steinmacher, I. Wiese, G. Pinto
Although Open Source Software (OSS) maintainers devote a significant proportion of their work to coding tasks, great maintainers must excel in many activities beyond coding. Maintainers should care about fostering a community and helping new members find their place, while also saying "no" to patches that, although well-coded and well-tested, do not contribute to the goal of the project. To perform all these activities masterfully, maintainers must exercise attributes that software engineers working on closed-source projects do not always need to master. This paper aims to uncover, relate, and prioritize the unique attributes that great OSS maintainers might have. To achieve this goal, we conducted 33 semi-structured interviews with highly experienced maintainers who are the gatekeepers of notable projects such as the Linux kernel, the Debian operating system, and the GitLab coding platform. After analyzing the interviews and curating a list of attributes, we created a conceptual framework to explain how these attributes are connected. We then conducted a rating survey with 90 OSS contributors. We noted that "technical excellence" and "communication" are the most recurring attributes. When grouped, these attributes fit into four broad categories: management, social, technical, and personality. While "sustaining a long-term vision of the project" and being "extremely careful" seem to form the basis of our framework, our survey showed that the communication attribute was perceived as the most essential one.
{"title":"What Makes a Great Maintainer of Open Source Projects?","authors":"E. Dias, Paulo Meirelles, F. C. Filho, Igor Steinmacher, I. Wiese, G. Pinto","doi":"10.1109/ICSE43902.2021.00093","DOIUrl":"https://doi.org/10.1109/ICSE43902.2021.00093","url":null,"abstract":"Although Open Source Software (OSS) maintainers devote a significant proportion of their work to coding tasks, great maintainers must excel in many other activities beyond coding. Maintainers should care about fostering a community, helping new members to find their place, while also saying \"no\" to patches that although are well-coded and well-tested, do not contribute to the goal of the project. To perform all these activities masterfully, maintainers should exercise attributes that software engineers (working on closed source projects) do not always need to master. This paper aims to uncover, relate, and prioritize the unique attributes that great OSS maintainers might have. To achieve this goal, we conducted 33 semi-structured interviews with well-experienced maintainers that are the gatekeepers of notable projects such as the Linux Kernel, the Debian operating system, and the GitLab coding platform. After we analyzed the interviews and curated a list of attributes, we created a conceptual framework to explain how these attributes are connected. We then conducted a rating survey with 90 OSS contributors. We noted that \"technical excellence\" and \"communication\" are the most recurring attributes. When grouped, these attributes fit into four broad categories: management, social, technical, and personality. While we noted that \"sustain a long term vision of the project\" and being \"extremely careful\" seem to form the basis of our framework, we noted through our survey that the communication attribute was perceived as the most essential one.","PeriodicalId":305167,"journal":{"name":"2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114760125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-05-01 | DOI: 10.1109/ICSE43902.2021.00028
Resource-Guided Configuration Space Reduction for Deep Learning Models
Yanjie Gao, Yonghao Zhu, Hongyu Zhang, Haoxiang Lin, Mao Yang
Deep learning models, like traditional software systems, provide a large number of configuration options: a deep learning model can be configured with different hyperparameters and neural architectures. Recently, AutoML (Automated Machine Learning) has been widely adopted to automate model training by systematically exploring diverse configurations. However, current AutoML approaches do not take into consideration the computational constraints imposed by various resources, such as available memory, the computing power of devices, or execution time. Training with non-conforming configurations can lead to many failed AutoML trial jobs or inappropriate models, which causes significant resource waste and severely slows down development. In this paper, we propose DnnSAT, a resource-guided AutoML approach for deep learning models that helps existing AutoML tools efficiently reduce the configuration space ahead of time. DnnSAT can speed up the search process and achieve equal or even better model learning performance because it excludes trial jobs that do not satisfy the constraints and saves resources for more trials. We formulate resource-guided configuration space reduction as a constraint satisfaction problem. DnnSAT includes a unified analytic cost model to construct common constraints with respect to model weight size, number of floating-point operations, model inference time, and GPU memory consumption. It then utilizes an SMT solver to obtain the satisfiable configurations of hyperparameters and neural architectures. Our evaluation results demonstrate the effectiveness of DnnSAT in accelerating state-of-the-art AutoML methods (hyperparameter optimization and neural architecture search), with an average speedup from 1.19X to 3.95X on public benchmarks. We believe that DnnSAT can make AutoML more practical in real-world environments with constrained resources.
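To illustrate the constraint-satisfaction formulation, here is a minimal sketch using the Z3 SMT solver's Python bindings (the z3-solver package). The parameter bounds and the 2M-parameter memory budget are invented for illustration and stand in for DnnSAT's unified analytic cost model.

```python
# Minimal sketch: encode a configuration space plus a toy resource constraint
# and ask an SMT solver for a satisfiable configuration. The constants here
# are assumptions, not DnnSAT's actual cost model.
from z3 import Int, Solver, sat

hidden = Int("hidden_units")  # hyperparameters under search
layers = Int("num_layers")

s = Solver()
s.add(hidden >= 64, hidden <= 1024, layers >= 1, layers <= 8)

# Toy analytic cost model: total weight count must fit a (made-up) budget of
# 2M parameters, mirroring DnnSAT's model-weight-size constraint.
s.add(hidden * hidden * layers <= 2_000_000)

if s.check() == sat:
    m = s.model()
    print("satisfiable config:", m[hidden], "units,", m[layers], "layers")
```

Non-conforming configurations are pruned before any training job is launched, which is where the reported speedups come from.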
{"title":"Resource-Guided Configuration Space Reduction for Deep Learning Models","authors":"Yanjie Gao, Yonghao Zhu, Hongyu Zhang, Haoxiang Lin, Mao Yang","doi":"10.1109/ICSE43902.2021.00028","DOIUrl":"https://doi.org/10.1109/ICSE43902.2021.00028","url":null,"abstract":"Deep learning models, like traditional software systems, provide a large number of configuration options. A deep learning model can be configured with different hyperparameters and neural architectures. Recently, AutoML (Automated Machine Learning) has been widely adopted to automate model training by systematically exploring diverse configurations. However, current AutoML approaches do not take into consideration the computational constraints imposed by various resources such as available memory, computing power of devices, or execution time. The training with non-conforming configurations could lead to many failed AutoML trial jobs or inappropriate models, which cause significant resource waste and severely slow down development productivity. In this paper, we propose DnnSAT, a resource-guided AutoML approach for deep learning models to help existing AutoML tools efficiently reduce the configuration space ahead of time. DnnSAT can speed up the search process and achieve equal or even better model learning performance because it excludes trial jobs not satisfying the constraints and saves resources for more trials. We formulate the resource-guided configuration space reduction as a constraint satisfaction problem. DnnSAT includes a unified analytic cost model to construct common constraints with respect to the model weight size, number of floating-point operations, model inference time, and GPU memory consumption. It then utilizes an SMT solver to obtain the satisfiable configurations of hyperparameters and neural architectures. Our evaluation results demonstrate the effectiveness of DnnSAT in accelerating state-of-the-art AutoML methods (Hyperparameter Optimization and Neural Architecture Search) with an average speedup from 1.19X to 3.95X on public benchmarks. We believe that DnnSAT can make AutoML more practical in a real-world environment with constrained resources.","PeriodicalId":305167,"journal":{"name":"2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131847543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-05-01 | DOI: 10.1109/ICSE43902.2021.00128
AID: An Automated Detector for Gender-Inclusivity Bugs in OSS Project Pages
Amreeta Chatterjee, M. Guizani, Catherine Stevens, Jillian Emard, Mary Evelyn May, M. Burnett, Iftekhar Ahmed, A. Sarma
The tools and infrastructure used in tech, including Open Source Software (OSS), can embed "inclusivity bugs": features that disproportionately disadvantage particular groups of contributors. To see whether OSS developers have existing practices to ward off such bugs, we surveyed 266 OSS developers. Our results show that a majority (77%) of developers do not use any inclusivity practices, and 92% of respondents cited a lack of concrete resources to enable them to do so. To help fill this gap, this paper introduces AID, a tool that automates the GenderMag method to systematically find gender-inclusivity bugs in software. We then present the results of the tool's evaluation on 20 GitHub projects. The tool achieved a precision of 0.69, a recall of 0.92, and an F-measure of 0.79, and even captured some inclusivity bugs that human GenderMag teams missed.
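As a quick sanity check, the reported F-measure follows from the precision and recall via the harmonic mean:

```python
# F-measure (F1) as the harmonic mean of precision and recall.
precision, recall = 0.69, 0.92
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.79, matching the figure reported above
```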
{"title":"AID: An Automated Detector for Gender-Inclusivity Bugs in OSS Project Pages","authors":"Amreeta Chatterjee, M. Guizani, Catherine Stevens, Jillian Emard, Mary Evelyn May, M. Burnett, Iftekhar Ahmed, A. Sarma","doi":"10.1109/ICSE43902.2021.00128","DOIUrl":"https://doi.org/10.1109/ICSE43902.2021.00128","url":null,"abstract":"The tools and infrastructure used in tech, including Open Source Software (OSS), can embed \"inclusivity bugs\"- features that disproportionately disadvantage particular groups of contributors. To see whether OSS developers have existing practices to ward off such bugs, we surveyed 266 OSS developers. Our results show that a majority (77%) of developers do not use any inclusivity practices, and 92% of respondents cited a lack of concrete resources to enable them to do so. To help fill this gap, this paper introduces AID, a tool that automates the GenderMag method to systematically find gender-inclusivity bugs in software. We then present the results of the tool's evaluation on 20 GitHub projects. The tool achieved precision of 0.69, recall of 0.92, an F-measure of 0.79 and even captured some inclusivity bugs that human GenderMag teams missed.","PeriodicalId":305167,"journal":{"name":"2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131366799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-05-01 | DOI: 10.1109/ICSE43902.2021.00056
Program Comprehension and Code Complexity Metrics: An fMRI Study
Norman Peitek, S. Apel, Chris Parnin, A. Brechmann, J. Siegmund
Background: Researchers and practitioners have used code complexity metrics for decades to predict how developers comprehend a program. While it is plausible and tempting to use code metrics for this purpose, their validity is debated, since they rely on simple code properties and rarely consider particularities of human cognition. Aims: We investigate whether and how code complexity metrics reflect the difficulty of program comprehension. Method: We conducted a functional magnetic resonance imaging (fMRI) study with 19 participants, observing program comprehension of short code snippets at varying complexity levels. We dissected four classes of code complexity metrics and their relationship to neuronal, behavioral, and subjective correlates of program comprehension, analyzing more than 41 metrics overall. Results: While our data corroborate that complexity metrics can, to a limited degree, explain programmers' cognition in program comprehension, fMRI allowed us to gain insights into why some code properties are difficult to process. In particular, a code's textual size drives programmers' attention, and vocabulary size burdens programmers' working memory. Conclusion: Our results provide neuroscientific evidence supporting prior research's warnings questioning the validity of code complexity metrics, and they pin down factors relevant to program comprehension. Future Work: We outline several follow-up experiments investigating fine-grained effects of code complexity and describe possible refinements to code complexity metrics.
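To make the two highlighted properties concrete, the toy sketch below computes a snippet's textual size (token count) and vocabulary size (distinct tokens). Real complexity-metric suites are far more elaborate, and this regex tokenizer is only an approximation.

```python
# Toy illustration of the two code properties the study singles out:
# textual size (token count) and vocabulary size (distinct tokens).
import re

snippet = """
def mid(a, b, c):
    if a < b:
        return b if b < c else max(a, c)
    return a if a < c else max(b, c)
"""

tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", snippet)
print("textual size (tokens):", len(tokens))  # drives attention
print("vocabulary size:", len(set(tokens)))   # burdens working memory
```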
{"title":"Program Comprehension and Code Complexity Metrics: An fMRI Study","authors":"Norman Peitek, S. Apel, Chris Parnin, A. Brechmann, J. Siegmund","doi":"10.1109/ICSE43902.2021.00056","DOIUrl":"https://doi.org/10.1109/ICSE43902.2021.00056","url":null,"abstract":"Background: Researchers and practitioners have been using code complexity metrics for decades to predict how developers comprehend a program. While it is plausible and tempting to use code metrics for this purpose, their validity is debated, since they rely on simple code properties and rarely consider particularities of human cognition. Aims: We investigate whether and how code complexity metrics reflect difficulty of program comprehension. Method: We have conducted a functional magnetic resonance imaging (fMRI) study with 19 participants observing program comprehension of short code snippets at varying complexity levels. We dissected four classes of code complexity metrics and their relationship to neuronal, behavioral, and subjective correlates of program comprehension, overall analyzing more than 41 metrics. Results: While our data corroborate that complexity metrics can-to a limited degree-explain programmers' cognition in program comprehension, fMRI allowed us to gain insights into why some code properties are difficult to process. In particular, a code's textual size drives programmers' attention, and vocabulary size burdens programmers' working memory. Conclusion: Our results provide neuro-scientific evidence supporting warnings of prior research questioning the validity of code complexity metrics and pin down factors relevant to program comprehension. Future Work: We outline several follow-up experiments investigating fine-grained effects of code complexity and describe possible refinements to code complexity metrics.","PeriodicalId":305167,"journal":{"name":"2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116408590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-05-01 | DOI: 10.1109/ICSE43902.2021.00052
How Developers Optimize Virtual Reality Applications: A Study of Optimization Commits in Open Source Unity Projects
Fariha Nusrat, Foyzul Hassan, Hao Zhong, Xiaoyin Wang
Virtual Reality (VR) is an emerging technology that provides an immersive experience for users. Due to the high computational cost of rendering real-time animation twice (once for each eye) and the resource limitations of wearable devices, VR applications often face performance bottlenecks, and performance optimization plays an important role in VR software development. Performance optimizations of VR applications can be very different from those in traditional software, as VR involves more elements such as graphics rendering and real-time animation. In this paper, we present the first empirical study of 183 real-world performance optimizations from 45 VR software projects. In particular, we manually categorized the optimizations into 11 categories and applied static analysis to identify how they affect different life-cycle phases of VR applications. Furthermore, we studied the complexity and design/behavior effects of performance optimizations, and how optimizations differ between large organizational software projects and smaller personal software projects. Our major findings include: (1) graphics simplification (24.0%), rendering optimization (16.9%), language/API optimization (15.3%), heap avoidance (14.8%), and value caching (12.0%) are the most common categories of performance optimization in VR applications; (2) game logic updates (30.4%) and before-scene initialization (20.0%) are the most common life-cycle phases affected by performance issues; (3) 45.9% of the optimizations have behavior and design effects, and 39.3% of the optimizations are systematic changes; (4) the distributions of optimization classes are very different between organizational VR projects and personal VR projects.
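As a language-neutral illustration of one of these categories, value caching, the sketch below hoists a frame-invariant computation out of a per-frame loop. In the studied Unity projects this would be C# inside an Update() method; the Python example and its names are invented here purely for illustration.

```python
# Value caching: compute an expensive, frame-invariant value once and reuse
# it every frame, instead of recomputing it inside the per-frame loop.
import math

def expensive_path_length(waypoints):
    return sum(math.dist(p, q) for p, q in zip(waypoints, waypoints[1:]))

waypoints = [(0, 0), (3, 4), (6, 8)]

cached_length = expensive_path_length(waypoints)  # computed once, up front
for frame in range(3):            # stand-in for the per-frame update loop
    print(frame, cached_length)   # reuse the cached value each frame
```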
{"title":"How Developers Optimize Virtual Reality Applications: A Study of Optimization Commits in Open Source Unity Projects","authors":"Fariha Nusrat, Foyzul Hassan, Hao Zhong, Xiaoyin Wang","doi":"10.1109/ICSE43902.2021.00052","DOIUrl":"https://doi.org/10.1109/ICSE43902.2021.00052","url":null,"abstract":"Virtual Reality (VR) is an emerging technique that provides immersive experience for users. Due to the high computation cost of rendering real-time animation twice (for both eyes) and the resource limitation of wearable devices, VR applications often face performance bottlenecks and performanceoptimization plays an important role in VR software develop-ment. Performance optimizations of VR applications can be very different from those in traditional software as VR involves more elements such as graphics rendering and real-time animation. In this paper, we present the first empirical study on 183 real-world performance optimizations from 45 VR software projects. In particular, we manually categorized the optimizations in to 11 categories, and applied static analysis to identify how they affect different life-cycle phases of VR applications. Furthermore, we studied the complexity and design / behavior effects of performance optimizations, and how optimizations are different between large organizational software projects and smaller personal software projects. Our major findings include: (1) graphics simplification (24.0%), rendering optimization (16.9%), language / API optimization (15.3%), heap avoidance (14.8%), and valuecaching (12.0%) are the most common categories of performance optimization in VR applications; (2) game logic updates (30.4%) and before-scene initialization (20.0%) are the most common life-cycle phases affected by performance issues; (3) 45.9% of the optimizations have behavior and design effects and 39.3% of the optimizations are systematic changes; (4) the distributionsof optimization classes are very different between organizational VR projects and personal VR projects.","PeriodicalId":305167,"journal":{"name":"2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124500204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Neural Network (DNN) testing is one of the most widely used ways to guarantee the quality of DNNs. However, labeling test inputs to check the correctness of DNN predictions is very costly, which can largely affect the efficiency of DNN testing and even the whole process of DNN development. To relieve this labeling-cost problem, we propose a novel test input prioritization approach for DNNs (called PRIMA), based on intelligent mutation analysis, that labels more bug-revealing test inputs earlier within a limited time budget, thereby improving the efficiency of DNN testing. PRIMA is based on a key insight: a test input that kills many mutated models and produces different prediction results with many mutated inputs is more likely to reveal DNN bugs, and thus should be prioritized higher. After obtaining a number of mutation results from a series of our designed model and input mutation rules for each test input, PRIMA further incorporates learning-to-rank (a kind of supervised machine learning for solving ranking problems) to intelligently combine these mutation results for effective test input prioritization. We conducted an extensive study on 36 popular subjects, carefully considering their diversity along five dimensions (i.e., different domains of test inputs, different DNN tasks, different network structures, different types of test inputs, and different training scenarios). Our experimental results demonstrate the effectiveness of PRIMA, which significantly outperforms state-of-the-art approaches (with an average improvement of 8.50%~131.01% in terms of prioritization effectiveness). In particular, we have applied PRIMA to practical autonomous-vehicle testing at a large motor company, and the results on 4 real-world scene-recognition models in autonomous vehicles further confirm the practicality of PRIMA.
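The following is a simplified sketch of that insight, not PRIMA's actual implementation: it scores each test input by mutated-model kills plus mutated-input prediction disagreements, with a fixed sum standing in for the learned ranking model.

```python
# Simplified mutation-based prioritization: rank test inputs by how many
# mutated models they kill plus how many mutated inputs flip the prediction.
def kill_count(model_mutants, x, base_pred):
    return sum(m(x) != base_pred for m in model_mutants)

def disagreement_count(mutated_inputs, model, base_pred):
    return sum(model(mx) != base_pred for mx in mutated_inputs)

def prioritize(tests, model, model_mutants, input_mutator):
    scored = []
    for x in tests:
        base = model(x)
        k = kill_count(model_mutants, x, base)
        d = disagreement_count(input_mutator(x), model, base)
        scored.append((k + d, x))  # stand-in for the learned ranker
    return [x for _, x in sorted(scored, key=lambda t: t[0], reverse=True)]

# Toy usage with threshold "models" over numeric inputs:
model = lambda x: x > 0.5
model_mutants = [lambda x: x > 0.4, lambda x: x > 0.6, lambda x: x > 0.55]
input_mutator = lambda x: [x + d for d in (-0.05, 0.0, 0.05)]
print(prioritize([0.1, 0.52, 0.9], model, model_mutants, input_mutator))
# The borderline input 0.52 is prioritized first.
```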
{"title":"Prioritizing Test Inputs for Deep Neural Networks via Mutation Analysis","authors":"Zan Wang, Hanmo You, Junjie Chen, Yingyi Zhang, Xuyuan Dong, Wenbin Zhang","doi":"10.1109/ICSE43902.2021.00046","DOIUrl":"https://doi.org/10.1109/ICSE43902.2021.00046","url":null,"abstract":"Deep Neural Network (DNN) testing is one of the most widely-used ways to guarantee the quality of DNNs. However, labeling test inputs to check the correctness of DNN prediction is very costly, which could largely affect the efficiency of DNN testing, even the whole process of DNN development. To relieve the labeling-cost problem, we propose a novel test input prioritization approach (called PRIMA) for DNNs via intelligent mutation analysis in order to label more bug-revealing test inputs earlier for a limited time, which facilitates to improve the efficiency of DNN testing. PRIMA is based on the key insight: a test input that is able to kill many mutated models and produce different prediction results with many mutated inputs, is more likely to reveal DNN bugs, and thus it should be prioritized higher. After obtaining a number of mutation results from a series of our designed model and input mutation rules for each test input, PRIMA further incorporates learning-to-rank (a kind of supervised machine learning to solve ranking problems) to intelligently combine these mutation results for effective test input prioritization. We conducted an extensive study based on 36 popular subjects by carefully considering their diversity from five dimensions (i.e., different domains of test inputs, different DNN tasks, different network structures, different types of test inputs, and different training scenarios). Our experimental results demonstrate the effectiveness of PRIMA, significantly outperforming the state-of-the-art approaches (with the average improvement of 8.50%~131.01% in terms of prioritization effectiveness). In particular, we have applied PRIMA to the practical autonomous-vehicle testing in a large motor company, and the results on 4 real-world scene-recognition models in autonomous vehicles further confirm the practicability of PRIMA.","PeriodicalId":305167,"journal":{"name":"2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125770756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-05-01 | DOI: 10.1109/ICSE43902.2021.00059
IdBench: Evaluating Semantic Representations of Identifier Names in Source Code
Yaza Wainakh, Moiz Rauf, Michael Pradel
Identifier names convey useful information about the intended semantics of code. Name-based program analyses use this information, e.g., to detect bugs, to predict types, and to improve the readability of code. At the core of name-based analyses are semantic representations of identifiers, e.g., in the form of learned embeddings. The high-level goal of such a representation is to encode whether two identifiers, e.g., len and size, are semantically similar. Unfortunately, it is currently unclear to what extent semantic representations match the semantic relatedness and similarity perceived by developers. This paper presents IdBench, the first benchmark for evaluating semantic representations against a ground truth created from thousands of ratings by 500 software developers. We use IdBench to study state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions. Our results show that the effectiveness of semantic representations varies significantly and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing technique provides a satisfactory representation of semantic similarities, in part because identifiers with opposing meanings are incorrectly considered similar, which may lead to fatal mistakes, e.g., in a refactoring tool. Studying the strengths and weaknesses of the different techniques shows that they complement each other. As a first step toward exploiting this complementarity, we present an ensemble model that combines existing techniques and clearly outperforms the best available semantic representation.
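A hedged sketch of the kind of comparison such a benchmark enables: cosine similarity between identifier embeddings, including the pitfall where identifiers with opposing meanings embed close together. The vectors below are invented toy values, not real learned embeddings.

```python
# Cosine similarity between (toy) identifier embeddings.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

embeddings = {  # invented 3-d vectors for illustration only
    "len":   np.array([0.9, 0.1, 0.0]),
    "size":  np.array([0.8, 0.3, 0.1]),
    "open":  np.array([0.1, 0.9, 0.0]),
    "close": np.array([0.2, 0.8, 0.1]),
}

print(cosine(embeddings["len"], embeddings["size"]))    # high: truly similar
# The pitfall noted above: opposites often embed close together.
print(cosine(embeddings["open"], embeddings["close"]))  # high, but misleading
```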
{"title":"IdBench: Evaluating Semantic Representations of Identifier Names in Source Code","authors":"Yaza Wainakh, Moiz Rauf, Michael Pradel","doi":"10.1109/ICSE43902.2021.00059","DOIUrl":"https://doi.org/10.1109/ICSE43902.2021.00059","url":null,"abstract":"Identifier names convey useful information about the intended semantics of code. Name-based program analyses use this information, e.g., to detect bugs, to predict types, and to improve the readability of code. At the core of name-based analyses are semantic representations of identifiers, e.g., in the form of learned embeddings. The high-level goal of such a representation is to encode whether two identifiers, e.g., len and size, are semantically similar. Unfortunately, it is currently unclear to what extent semantic representations match the semantic relatedness and similarity perceived by developers. This paper presents IdBench, the first benchmark for evaluating semantic representations against a ground truth created from thousands of ratings by 500 software developers. We use IdBench to study state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions. Our results show that the effectiveness of semantic representations varies significantly and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing technique provides a satisfactory representation of semantic similarities, among other reasons because identifiers with opposing meanings are incorrectly considered to be similar, which may lead to fatal mistakes, e.g., in a refactoring tool. Studying the strengths and weaknesses of the different techniques shows that they complement each other. As a first step toward exploiting this complementarity, we present an ensemble model that combines existing techniques and that clearly outperforms the best available semantic representation.","PeriodicalId":305167,"journal":{"name":"2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129869044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-05-01 | DOI: 10.1109/ICSE43902.2021.00014
Playing Planning Poker in Crowds: Human Computation of Software Effort Estimates
Mohammed Alhamed, Tim Storer
Reliable, cost-effective effort estimation remains a considerable challenge for software projects. Recent work has demonstrated that the popular Planning Poker practice can produce reliable estimates when undertaken within a software team of knowledgeable domain experts. However, the process depends on the availability of experts and can be time-consuming to perform, making it impractical for large-scale or open-source projects that may curate many thousands of outstanding tasks. This paper reports on a full study investigating the feasibility of having crowd workers, supplied with limited information about a task, provide comparably accurate estimates using Planning Poker. We describe the design of a Crowd Planning Poker (CPP) process implemented on Amazon Mechanical Turk and the results of a substantial set of trials involving more than 5,000 crowd workers and 39 diverse software tasks. Our results show that a carefully organised and selected crowd of workers can produce effort estimates of similar accuracy to those of a single expert.
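As an illustration of aggregating crowd estimates, the sketch below takes the median of a discussion round and snaps it to the nearest Planning Poker card. The card deck and the median-based consensus rule are assumptions for illustration, not necessarily the paper's exact CPP protocol.

```python
# Aggregate one task's crowd estimates into a single story-point value.
from statistics import median

DECK = [1, 2, 3, 5, 8, 13, 21]  # a common story-point deck (assumed)

def nearest_card(value):
    return min(DECK, key=lambda c: abs(c - value))

round1 = [3, 5, 8, 5, 13, 5, 8]      # independent first estimates
round2 = [5, 5, 8, 5, 8, 5, 8]       # re-estimates after discussion
print(nearest_card(median(round2)))  # crowd estimate: 5 story points
```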
{"title":"Playing Planning Poker in Crowds: Human Computation of Software Effort Estimates","authors":"Mohammed Alhamed, Tim Storer","doi":"10.1109/ICSE43902.2021.00014","DOIUrl":"https://doi.org/10.1109/ICSE43902.2021.00014","url":null,"abstract":"Reliable cost effective effort estimation remains a considerable challenge for software projects. Recent work has demonstrated that the popular Planning Poker practice can produce reliable estimates when undertaken within a software team of knowledgeable domain experts. However, the process depends on the availability of experts and can be time-consuming to perform, making it impractical for large scale or open source projects that may curate many thousands of outstanding tasks. This paper reports on a full study to investigate the feasibility of using crowd workers supplied with limited information about a task to provide comparably accurate estimates using Planning Poker. We describe the design of a Crowd Planning Poker (CPP) process implemented on Amazon Mechanical Turk and the results of a substantial set of trials, involving more than 5000 crowd workers and 39 diverse software tasks. Our results show that a carefully organised and selected crowd of workers can produce effort estimates that are of similar accuracy to those of a single expert.","PeriodicalId":305167,"journal":{"name":"2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132037494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}