A Tale of Two Comprehensions? Analyzing Student Programmer Attention during Code Summarization
Zachary Karas, Aakash Bansal, Yifan Zhang, Toby Li, Collin McMillan, Yu Huang
Code summarization is the task of creating short, natural-language descriptions of source code. It is an important part of code comprehension and a powerful method of documentation. Previous work has made progress in identifying where programmers focus in code as they write their own summaries (i.e., writing). However, there is a gap in studying programmers' attention as they read code with pre-written summaries (i.e., reading). As a result, it is currently unknown how these two forms of code comprehension, reading and writing, compare. Also, programmer attention in code summarization is not well understood with respect to program semantics. We address these gaps with a human eye-tracking study (n = 27) comparing reading and writing. We examined programmer attention with respect to fine-grained program semantics, including the attention sequence (i.e., scan path). We find distinctions in programmer attention between the two comprehension tasks, similarities in reading patterns between them, and differences mediated by expertise. Furthermore, we mapped programmers' gaze data onto the Abstract Syntax Tree (AST) to explore another representation of human attention. Some significant differences in programmer attention on the raw code are not significant on the AST, while others become more pronounced.
ACM Transactions on Software Engineering and Methodology, published 2024-05-15. https://doi.org/10.1145/3664808
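The mapping of gaze data onto the AST that the abstract describes can be illustrated with a small, hedged sketch. Assuming fixations have already been converted to (line, column) positions in the source file, the Python snippet below finds the innermost AST node containing each fixation using the standard ast module; the study's stimuli and tooling are not published here, so this is only an illustration of the node-containment idea, not the authors' pipeline.

```python
import ast

def innermost_node(tree: ast.AST, line: int, col: int):
    """Return the innermost AST node whose source span contains (line, col)."""
    best = None
    for node in ast.walk(tree):
        if not hasattr(node, "lineno") or node.end_lineno is None:
            continue
        start = (node.lineno, node.col_offset)
        end = (node.end_lineno, node.end_col_offset)
        if start <= (line, col) <= end:
            # Among containing nodes, prefer the one that starts latest (most nested).
            if best is None or (best.lineno, best.col_offset) <= start:
                best = node
    return best

source = "def area(w, h):\n    return w * h\n"
tree = ast.parse(source)
# A hypothetical fixation on line 2, column 11 (the 'w' in the return expression).
node = innermost_node(tree, 2, 11)
print(type(node).__name__)  # prints: Name
```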
A Formal Explainer for Just-In-Time Defect Predictions
Jinqiang Yu, Michael Fu, Alexey Ignatiev, Chakkrit Tantithamthavorn, Peter Stuckey
Just-In-Time (JIT) defect prediction has been proposed to help teams prioritize their limited resources on the riskiest commits (or pull requests), yet it remains largely a black box whose predictions are neither explainable nor actionable for practitioners. Thus, prior studies have applied various model-agnostic techniques to explain the predictions of JIT models. Yet, explanations generated by existing model-agnostic techniques are still not formally sound, robust, or actionable. In this paper, we propose FoX, a Formal eXplainer for JIT Defect Prediction, which builds on formal reasoning about the behaviour of JIT defect prediction models and is hence able to provide provably correct explanations, which are additionally guaranteed to be minimal. Our experimental results show that FoX efficiently generates provably correct, robust, and actionable explanations, whereas existing model-agnostic techniques cannot. Our survey study with 54 software practitioners provides valuable insights into the usefulness and trustworthiness of FoX: 86% of participants agreed that our approach is useful, while 74% found it trustworthy. This paper thus serves as an important stepping stone towards trustworthy explanations for JIT models, helping domain experts and practitioners better understand why a commit is predicted as defective and what to do to mitigate the risk.
ACM Transactions on Software Engineering and Methodology, published 2024-05-14. https://doi.org/10.1145/3664809
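FoX's formal reasoning is not reproduced in the abstract. As background, one standard building block in formal explainability is a deletion-based search for a subset-minimal explanation: start from all features of a commit, tentatively drop each one, and keep it dropped only if an oracle can still prove the prediction unchanged. The sketch below illustrates that generic loop with a hypothetical prediction_is_invariant oracle standing in for a formal encoding of the JIT model; it is not FoX's actual algorithm.

```python
from typing import Callable, FrozenSet

def minimal_explanation(
    features: FrozenSet[str],
    prediction_is_invariant: Callable[[FrozenSet[str]], bool],
) -> FrozenSet[str]:
    """Deletion-based search for a subset-minimal explanation.

    prediction_is_invariant(S) must return True iff fixing only the features in S
    (and letting all others vary) provably keeps the model's prediction unchanged.
    """
    assert prediction_is_invariant(features), "the full feature set must explain the prediction"
    explanation = set(features)
    for f in sorted(features):
        candidate = frozenset(explanation - {f})
        if prediction_is_invariant(candidate):
            explanation.remove(f)  # f is not needed to justify the prediction
    return frozenset(explanation)

# Hypothetical oracle: the prediction holds whenever these two features are fixed.
oracle = lambda s: {"lines_added", "developer_experience"} <= s
commit_features = frozenset({"lines_added", "developer_experience", "num_files", "entropy"})
print(sorted(minimal_explanation(commit_features, oracle)))
# ['developer_experience', 'lines_added']
```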
Mobile Application Online Cross-Project Just-in-Time Software Defect Prediction Framework
Siyu Jiang, Zhenhang He, Yuwen Chen, Mingrong Zhang, Le Ma
As mobile applications evolve rapidly, their fast, iterative update cycle leads to an increase in software defects. Just-In-Time Software Defect Prediction (JIT-SDP) offers immediate feedback on code changes. For new applications without historical data, researchers have proposed Cross-Project JIT-SDP (CP JIT-SDP). Existing CP JIT-SDP approaches are designed for offline scenarios where target data is available in advance. However, target data in real-world applications usually arrives online in a streaming manner, so online CP JIT-SDP must cope with cross-project distribution differences and concept drift in the target project's data. These challenges often co-exist during application development, and their interaction causes model performance to degrade. To address these issues, we propose an online CP JIT-SDP framework called COTL. Specifically, COTL consists of two stages: offline and online. In the offline stage, a cross-domain structure-preserving projection algorithm is used to reduce cross-project distribution differences. In the online stage, target data arrives sequentially over time. By reducing the differences in marginal and conditional distributions between offline and online data for the target project, concept drift is mitigated and classifier weights are updated online. Experimental results on 15 mobile application benchmark datasets show that COTL outperforms 13 benchmark methods on four performance metrics.
ACM Transactions on Software Engineering and Methodology, published 2024-05-14. https://doi.org/10.1145/3664607
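COTL's cross-domain structure-preserving projection and its distribution-alignment updates cannot be reconstructed from the abstract alone. As a much simpler point of reference, the sketch below shows the bare online half of a cross-project JIT-SDP pipeline: a classifier is pre-trained on source-project commits (offline stage) and then updated in a prequential, predict-then-train loop as target-project commits arrive one by one (online stage). The synthetic data and all names are illustrative assumptions, not part of COTL.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: source-project commits (offline) and a target-project stream (online).
X_source = rng.normal(size=(500, 8))
y_source = (X_source[:, 0] + 0.5 * X_source[:, 1] > 0).astype(int)
X_target_stream = rng.normal(loc=0.3, size=(200, 8))   # slightly shifted distribution
y_target_stream = (X_target_stream[:, 0] > 0.3).astype(int)

clf = SGDClassifier(loss="log_loss", random_state=0)
clf.partial_fit(X_source, y_source, classes=np.array([0, 1]))  # offline stage

correct = 0
for x, y in zip(X_target_stream, y_target_stream):  # online stage
    pred = clf.predict(x.reshape(1, -1))[0]          # predict before the label is known
    correct += int(pred == y)
    clf.partial_fit(x.reshape(1, -1), [y])           # then update the model online
print(f"prequential accuracy on the target stream: {correct / len(y_target_stream):.2f}")
```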
LLMEffiChecker: Understanding and Testing Efficiency Degradation of Large Language Models
Xiaoning Feng, Xiaohong Han, Simin Chen, Wei Yang
Large Language Models (LLMs) have received much recent attention due to their human-level accuracy. While existing work mostly focuses on improving accuracy or testing accuracy robustness, the computation efficiency of LLMs, which is of paramount importance given their often vast generation demands and real-time requirements, has surprisingly received little attention. In this paper, we make the first attempt to understand and test potential computation efficiency robustness in state-of-the-art LLMs. By analyzing the working mechanism and implementation of 20,543 publicly accessible LLMs, we observe a fundamental property in LLMs that could be manipulated in an adversarial manner to reduce computation efficiency significantly. Our key observation is that the output length, rather than the input, determines the computation efficiency of LLMs, and the output length depends on two factors: an often sufficiently large yet pessimistic pre-configured threshold controlling the maximum number of iterations, and a runtime-generated end-of-sentence (EOS) token. Our key motivation is therefore to generate test inputs that sufficiently delay the generation of EOS, so that LLMs have to go through enough iterations to hit the pre-configured threshold. We present LLMEffiChecker, which works in both white-box and black-box settings. In the white-box scenario, LLMEffiChecker develops a gradient-guided technique that searches for a minimal and unnoticeable perturbation at the character, token, and structure levels. In the black-box scenario, LLMEffiChecker employs a causal-inference-based approach to find critical tokens and similarly applies the three levels of imperceptible perturbation to them. Both settings effectively delay the appearance of EOS, compelling the inputs to reach the otherwise-unreachable threshold. To demonstrate the effectiveness of LLMEffiChecker, we conduct a systematic evaluation on nine publicly available LLMs: Google T5, AllenAI WMT14, Helsinki-NLP translator, Facebook FairSeq, UNICAMP-DL translator, MarianMT, Google FLAN-T5, MBZUAI LaMini-GPT, and Salesforce CodeGen. Experimental results show that LLMEffiChecker can increase LLMs' response latency and energy consumption by, on average, 325% to 3,244% and 344% to 3,616%, respectively, by perturbing just one character or token in the input sentence. Our case study shows that inputs generated by LLMEffiChecker significantly affect battery power in real-world mobile devices (i.e., draining more than 30 times the battery power of normal inputs).
ACM Transactions on Software Engineering and Methodology, published 2024-05-13. https://doi.org/10.1145/3664812
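The paper's central observation, that cost is driven by the number of decoding iterations until the EOS token or the pre-configured generation threshold is reached, can be checked directly with an off-the-shelf seq2seq model. The sketch below assumes the Hugging Face transformers library and the public t5-small checkpoint; it simply times generation and reports how many output tokens were produced, and is not the LLMEffiChecker tool itself.

```python
import time
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # small public checkpoint used purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def timed_generate(text: str, max_new_tokens: int = 200):
    inputs = tokenizer(text, return_tensors="pt")
    start = time.perf_counter()
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    # Output length (number of decoding iterations) is what drives the cost, not input length.
    return output_ids.shape[1], elapsed

for sentence in [
    "translate English to German: The cat sat on the mat.",
    "translate English to German: The cat sat on the mat",  # one-character difference
]:
    n_tokens, seconds = timed_generate(sentence)
    print(f"{n_tokens:4d} output tokens  {seconds:.2f}s  <- {sentence!r}")
```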
Testing Updated Apps by Adapting Learned Models
Chanh Duc Ngo, Fabrizio Pastore, Lionel Briand
Although App updates are frequent and software engineers would like to verify only the updated features, automated testing techniques verify entire Apps and thus waste resources.
We present Continuous Adaptation of Learned Models (CALM), an automated App testing approach that efficiently tests App updates by adapting the App models learned when automatically testing previous App versions. CALM focuses on functional testing. Since functional correctness can mainly be verified through the visual inspection of App screens, CALM minimizes the number of App screens to be visualized by software testers while maximizing the percentage of updated methods and instructions exercised.
Our empirical evaluation shows that CALM exercises a significantly higher proportion of updated methods and instructions than six state-of-the-art approaches, for the same maximum number of App screens to be visually inspected. Further, in common update scenarios, where only a small fraction of methods are updated, CALM outperforms all competing approaches even more quickly and by a wider margin.
ACM Transactions on Software Engineering and Methodology, published 2024-05-13. https://doi.org/10.1145/3664601
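CALM's model-adaptation logic is not described in enough detail in the abstract to reproduce. Its optimisation target, exercising as many updated methods as possible while keeping the number of screens a tester must inspect within a budget, can however be illustrated with a plain greedy budgeted-coverage sketch. The screen-to-method map and the budget below are made-up examples, and greedy selection is a generic stand-in rather than CALM's algorithm.

```python
def select_screens(coverage: dict[str, set[str]], budget: int) -> list[str]:
    """Greedily pick up to `budget` screens that cover the most not-yet-covered updated methods."""
    chosen, covered = [], set()
    for _ in range(budget):
        best = max(coverage, key=lambda s: len(coverage[s] - covered), default=None)
        if best is None or not coverage[best] - covered:
            break  # nothing new left to cover
        chosen.append(best)
        covered |= coverage.pop(best)
    return chosen

# Hypothetical map: which updated methods each App screen exercises.
screen_coverage = {
    "LoginScreen":    {"AuthService.refresh", "Session.renew"},
    "SettingsScreen": {"Prefs.save"},
    "CartScreen":     {"Cart.total", "Session.renew", "Prefs.save"},
}
print(select_screens(screen_coverage, budget=2))  # ['CartScreen', 'LoginScreen']
```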
What Makes a Good TODO Comment?
Haoye Wang, Zhipeng Gao, Tingting Bi, John Grundy, Xinyu Wang, Minghui Wu, Xiaohu Yang
Software development is a collaborative process that involves various interactions among individuals and teams. TODO comments in source code play a critical role in managing and coordinating diverse tasks during this process. However, this study finds that a large proportion of open-source project TODO comments are left unresolved or take a long time to be resolved. About 46.7% of TODO comments in open-source repositories are of low quality (e.g., TODOs that are ambiguous, lack information, or are useless to developers). This highlights the need for better TODO practices. In this study, we investigate four aspects of TODO comment quality in open-source projects: (1) the prevalence of low-quality TODO comments; (2) the key characteristics of high-quality TODO comments; (3) how TODO comments of different quality are managed in practice; and (4) the feasibility of automatically assessing TODO comment quality. Examining 2,863 TODO comments from the top 100 GitHub Java repositories, we propose criteria to identify high-quality TODO comments and provide insights into their optimal composition. We discuss the lifecycle of TODO comments of varying quality. To assist developers, we construct deep learning-based methods that show promising performance in identifying the quality of TODO comments, potentially enhancing development efficiency and code quality.
ACM Transactions on Software Engineering and Methodology, published 2024-05-13. https://doi.org/10.1145/3664811
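The study's quality criteria and deep-learning classifiers are not reproduced here. As a rough illustration of the raw material such a study works with, the sketch below extracts TODO comments from Java source with a regular expression and flags obviously thin ones with a naive length-and-ticket-reference heuristic; the heuristic is an assumption for illustration, not the authors' criteria.

```python
import re

TODO_RE = re.compile(r"//\s*TODO[:\s]*(.*)|/\*\s*TODO[:\s]*(.*?)\*/", re.IGNORECASE)

def extract_todos(java_source: str) -> list[str]:
    """Return the text of each single-line or block TODO comment."""
    todos = []
    for match in TODO_RE.finditer(java_source):
        text = (match.group(1) or match.group(2) or "").strip()
        todos.append(text)
    return todos

def looks_low_quality(todo: str) -> bool:
    # Naive heuristic: empty or very short TODOs with no issue/ticket reference.
    return len(todo.split()) < 3 and not re.search(r"#\d+|[A-Z]+-\d+", todo)

sample = """
class Cache {
    // TODO: fix this
    // TODO evict stale entries when memory pressure is high, see JIRA-1234
    void evict() { /* TODO */ }
}
"""
for todo in extract_todos(sample):
    print(f"{'LOW ' if looks_low_quality(todo) else 'OK  '}{todo!r}")
```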
Meta-Learning for Multi-Family Android Malware Classification
Yao Li, Dawei Yuan, Tao Zhang, Haipeng Cai, David Lo, Cuiyun Gao, Xiapu Luo, He Jiang
With the emergence of smartphones, Android has become a widely used mobile operating system. However, it is vulnerable to various types of attacks: every day, new malware threatens the security of users' devices and private data. Many methods have been proposed to classify malicious applications using static or dynamic analysis. However, previous methods still suffer from unsatisfactory performance due to two challenges. First, they are unable to address the imbalanced data distribution problem, leading to poor performance for malware families with few members. Second, they are unable to address the zero-day malware classification problem (zero-day malware refers to malicious applications that exploit unknown vulnerabilities). In this paper, we introduce an innovative meta-learning approach for multi-family Android malware classification named Meta-MAMC, which uses meta-learning to learn meta-knowledge (i.e., the similarities and differences among different malware families) from samples of few-member families and combines new sampling algorithms to solve the above challenges. Meta-MAMC integrates (i) the meta-knowledge contained within the dataset to guide models in learning to identify unknown malware, and (ii) more accurate and diverse tasks based on novel sampling strategies, and it directly adapts meta-learning to new few-sample and zero-sample tasks to classify families. We have evaluated Meta-MAMC on two popular datasets and a corpus of real-world Android applications. The results demonstrate its efficacy in accurately classifying malicious applications belonging to certain malware families, even achieving 100% classification in some families.
ACM Transactions on Software Engineering and Methodology, published 2024-05-13. https://doi.org/10.1145/3664806
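Meta-MAMC's sampling strategies are not detailed in the abstract. The sketch below only illustrates the episodic setup that meta-learning approaches of this kind rely on: sampling N-way K-shot tasks from a family-labelled corpus so the model repeatedly faces "new" families with only a few samples. The family names and APK identifiers are invented for illustration.

```python
import random

def sample_episode(samples_by_family: dict[str, list[str]],
                   n_way: int = 3, k_shot: int = 2, n_query: int = 1):
    """Sample one N-way K-shot episode: a support set and a query set."""
    eligible = [f for f, s in samples_by_family.items() if len(s) >= k_shot + n_query]
    families = random.sample(eligible, n_way)
    support, query = [], []
    for family in families:
        picked = random.sample(samples_by_family[family], k_shot + n_query)
        support += [(x, family) for x in picked[:k_shot]]
        query += [(x, family) for x in picked[k_shot:]]
    return support, query

# Invented family-labelled corpus of APK identifiers.
corpus = {
    "DroidKungFu":   ["apk_01", "apk_02", "apk_03", "apk_04"],
    "Plankton":      ["apk_11", "apk_12", "apk_13"],
    "GinMaster":     ["apk_21", "apk_22", "apk_23", "apk_24"],
    "FakeInstaller": ["apk_31", "apk_32", "apk_33"],
}
random.seed(0)
support_set, query_set = sample_episode(corpus)
print(len(support_set), "support samples,", len(query_set), "query samples")
```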
On the Model Update Strategies for Supervised Learning in AIOps Solutions
Yingzhe Lyu, Heng Li, Zhen Ming (Jack) Jiang, Ahmed Hassan
AIOps (Artificial Intelligence for IT Operations) solutions leverage the massive data produced during the operation of large-scale systems and machine learning models to assist software engineers in their system operations. As operation data produced in the field is constantly evolving due to factors such as the changing operational environment and user base, the models in AIOps solutions need to be constantly maintained after deployment. While prior work focuses on innovative modeling techniques to improve the performance of AIOps models before releasing them into the field, when and how to update AIOps models remain under-investigated. In this work, we performed a case study on three large-scale public operation datasets: two trace datasets from the cloud computing platforms of Google and Alibaba, and one disk-stats dataset from the BackBlaze cloud storage data center. We empirically assessed five different types of model update strategies for supervised learning with respect to their performance, updating cost, and stability. We observed that active model update strategies (e.g., periodical retraining, concept-drift-guided retraining, time-based model ensembles, and online learning) achieve better and more stable performance than a stationary model. In particular, applying sophisticated model update strategies (e.g., concept drift detection, time-based ensembles, and online learning) can provide better performance, efficiency, and stability than simply retraining AIOps models periodically. In addition, we observed that, although some update strategies (e.g., time-based ensembles and online learning) can save model training time, they significantly sacrifice model testing time, which could hinder their application in AIOps solutions where operation data arrives at high pace and volume and where immediate inferences are required. Our findings highlight that practitioners should consider the evolution of operation data and actively maintain AIOps models over time. Our observations can also guide researchers and practitioners in investigating more efficient and effective model update strategies that fit the context of AIOps.
ACM Transactions on Software Engineering and Methodology, published 2024-05-13. https://doi.org/10.1145/3664599
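The study's datasets and full set of strategies are not reproduced here. The sketch below only contrasts the two simplest options it discusses, a stationary model trained once versus periodic retraining on a sliding window, over a synthetic and gradually drifting stream; the window size, retraining period, and drift are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
periods = 12          # e.g. monthly batches of operation data
batch = 300

def make_batch(t):
    """Synthetic drifting stream: the decision boundary slowly shifts over time."""
    X = rng.normal(size=(batch, 5))
    w = np.array([1.0, 1.0 - 0.15 * t, 0.0, 0.0, 0.0])
    y = (X @ w > 0).astype(int)
    return X, y

batches = [make_batch(t) for t in range(periods)]

stationary = RandomForestClassifier(random_state=0).fit(*batches[0])  # trained once
for t in range(1, periods):
    X_test, y_test = batches[t]
    # Periodic retraining: refit each period on a sliding window of recent batches.
    window = batches[max(0, t - 3):t]
    X_train = np.vstack([b[0] for b in window])
    y_train = np.concatenate([b[1] for b in window])
    retrained = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print(f"t={t:2d}  stationary={accuracy_score(y_test, stationary.predict(X_test)):.2f}  "
          f"retrained={accuracy_score(y_test, retrained.predict(X_test)):.2f}")
```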
Enhancing GUI Exploration Coverage of Android Apps with Deep Link-Integrated Monkey
Han Hu, Han Wang, Ruiqi Dong, Xiao Chen, Chunyang Chen
Mobile apps are ubiquitous in our daily lives, supporting tasks such as reading and chatting. Despite the availability of many GUI testing tools, app testers still struggle with low testing code coverage because tools frequently get stuck in loops or overlook activities with concealed entries. This results in a significant amount of testing time being spent on redundant, repetitive exploration of a few GUI pages. To address this, we utilize Android's deep links, which trigger Android intents to lead users to specific pages, and introduce a deep link-enhanced exploration method. This approach, integrated into the testing tool Monkey, gives rise to Delm (Deep Link-enhanced Monkey). Delm oversees the dynamic exploration process, guiding the tool out of meaningless testing loops to unexplored GUI pages. We provide a rigorous activity context mock-up approach for triggering existing Android intents to discover more activities with hidden entrances. We conduct experiments to evaluate Delm's effectiveness on activity context mock-up, activity coverage, method coverage, and crash detection. The findings reveal that Delm can mock up more complex activity contexts and significantly outperforms state-of-the-art baselines with 27.2% activity coverage, 21.13% method coverage, and 23.81% crash detection.
ACM Transactions on Software Engineering and Methodology, published 2024-05-13. https://doi.org/10.1145/3664810
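Delm itself extends Monkey and is not shown here. The sketch below only illustrates the mechanism the abstract relies on: firing an Android deep link so the system routes testing to a specific Activity, using the standard adb shell am start command. The package name and URI are hypothetical, and a connected device or emulator is assumed.

```python
import subprocess

def open_deep_link(uri: str, package: str) -> int:
    """Launch a deep link on a connected device via adb (VIEW intent), waiting for completion."""
    cmd = [
        "adb", "shell", "am", "start",
        "-W",                                  # wait for the launch to complete
        "-a", "android.intent.action.VIEW",    # standard VIEW action used by deep links
        "-d", uri,                             # the deep-link data URI
        package,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout.strip() or result.stderr.strip())
    return result.returncode

# Hypothetical deep link into an otherwise hard-to-reach settings page.
open_deep_link("myapp://settings/notifications", "com.example.myapp")
```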
Focused Test Generation for Autonomous Driving Systems
Tahereh Zohdinasab, Vincenzo Riccio, Paolo Tonella
Testing Autonomous Driving Systems (ADSs) is crucial to ensure their reliability when navigating complex environments. ADSs may exhibit unexpected behaviours when presented, during operation, with driving scenarios containing features inadequately represented in the training dataset. To address this shift from development to operation, developers must acquire new data with the newly observed features. This data can then be utilised to fine-tune the ADS so as to reach the desired level of reliability in performing driving tasks. However, the resource-intensive nature of testing ADSs requires efficient methodologies for generating targeted and diverse tests.
In this work, we introduce a novel approach, DeepAtash-LR, that incorporates a surrogate model into the focused test generation process. This integration significantly improves the effectiveness and applicability of focused testing in resource-intensive scenarios. Experimental results show that the integration of the surrogate model is fundamental to the success of DeepAtash-LR. Our approach was able to generate on average up to 60× more targeted, failure-inducing inputs than the baseline approach. Moreover, the inputs generated by DeepAtash-LR were useful for significantly improving the quality of the original ADS through fine-tuning.
ACM Transactions on Software Engineering and Methodology, published 2024-05-13. https://doi.org/10.1145/3664605
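DeepAtash-LR's focused test generation and its actual surrogate are not reconstructable from the abstract. The sketch below only illustrates the general surrogate-assisted pattern it builds on: evaluate a few scenarios on the expensive simulator, fit a cheap regressor on the results, and use it to rank new candidates so that only the most promising ones reach the simulator. The fitness function and mutation operator are toy placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)

def expensive_simulation(x):
    """Placeholder for running a scenario in the ADS simulator (lower = closer to failure)."""
    return float(np.sum((x - 0.7) ** 2))

def mutate(x):
    return np.clip(x + rng.normal(scale=0.1, size=x.shape), 0, 1)

# Seed the archive with a handful of simulator-evaluated scenarios.
archive_X = rng.uniform(size=(10, 4))
archive_y = np.array([expensive_simulation(x) for x in archive_X])

for generation in range(5):
    surrogate = Ridge().fit(archive_X, archive_y)             # cheap model of the fitness landscape
    candidates = np.array([mutate(x) for x in archive_X for _ in range(5)])
    ranked = candidates[np.argsort(surrogate.predict(candidates))]
    promising = ranked[:5]                                    # only these hit the real simulator
    scores = np.array([expensive_simulation(x) for x in promising])
    archive_X = np.vstack([archive_X, promising])
    archive_y = np.concatenate([archive_y, scores])
    print(f"generation {generation}: best fitness so far = {archive_y.min():.4f}")
```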