Software applications (apps) have been playing an increasingly important role in various aspects of society. In particular, mobile apps and web apps are the most prevalent among all applications and are widely used in various industries as well as in people’s daily lives. To help ensure mobile and web app quality, many approaches have been introduced to improve app GUI testing via automated exploration, including random testing, model-based testing, learning-based testing, etc. Despite the extensive effort, existing approaches are still limited in reaching high code coverage, constructing high-quality models, and being generally applicable. Reinforcement learning-based approaches, as a group of representative and advanced approaches for automated GUI exploration testing, are faced with difficult challenges, including effective app state abstraction, reward function design, etc. Moreover, they heavily depend on the specific execution platforms (i.e., Android or Web), thus leading to poor generalizability and being unable to adapt to different platforms.
This work tackles these challenges based on the high-level observation that apps from distinct platforms share commonalities in GUI design. We propose PIRLTEST, an effective platform-independent approach for app testing. Specifically, PIRLTEST combines computer vision and reinforcement learning techniques in a novel, synergistic manner for automated testing. It extracts the GUI widgets from GUI pages and characterizes the corresponding GUI layouts, embedding the GUI pages as states. The app GUI state combines the macroscopic perspective (app GUI layout) and the microscopic perspective (app GUI widget), and attaches the critical semantic information from GUI images. This makes PIRLTEST platform-independent and generally applicable across platforms. PIRLTEST explores apps under the guidance of a curiosity-driven strategy, which uses a Q-network to estimate the values of specific state-action pairs and thereby encourages exploration of uncovered pages without platform dependency. Every action taken during exploration is assigned a reward, designed considering both the app GUI states and the concrete widgets, to help the framework explore more uncovered pages. We conduct an empirical study on 20 mobile apps and 5 web apps. The results show that PIRLTEST can be adapted to different platforms at zero cost and outperforms the baselines, covering 6.3–41.4% more code on mobile apps and 1.5–51.1% more code on web apps. PIRLTEST detects 128 unique bugs on mobile and web apps, including 100 bugs that the baselines cannot detect.
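The curiosity-driven exploration described above can be illustrated with a minimal sketch. It is not PIRLTEST's implementation: it swaps the Q-network for a tabular Q-function, replaces image-based state embedding with plain page names, and uses a count-based curiosity reward. The toy app graph `APP`, the function `explore`, and all hyperparameters are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy "app": page -> {widget: next page}. Not taken from the paper.
APP = {
    "home": {"menu": "settings", "list": "detail"},
    "settings": {"back": "home", "about": "about"},
    "detail": {"back": "home"},
    "about": {"back": "settings"},
}

def explore(steps=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning with a count-based curiosity reward (assumed design)."""
    rng = random.Random(seed)
    q = defaultdict(float)     # Q-value per (page, widget) pair
    visits = defaultdict(int)  # how often each page has been reached
    page = "home"
    visits[page] += 1
    for _ in range(steps):
        widgets = list(APP[page])
        if rng.random() < epsilon:
            widget = rng.choice(widgets)  # occasional random action
        else:
            widget = max(widgets, key=lambda w: q[(page, w)])  # greedy action
        nxt = APP[page][widget]
        visits[nxt] += 1
        # Curiosity: rarely visited pages yield a larger reward.
        reward = 1.0 / visits[nxt] ** 0.5
        best_next = max(q[(nxt, w)] for w in APP[nxt])
        q[(page, widget)] += alpha * (reward + gamma * best_next - q[(page, widget)])
        page = nxt
    return set(visits), q

if __name__ == "__main__":
    pages, _ = explore()
    print(sorted(pages))
```

Because the reward shrinks as a page is revisited, actions leading to familiar pages gradually lose value relative to actions leading to less familiar ones, which is the intuition behind rewarding exploration of uncovered pages.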
Effective, Platform-Independent GUI Testing via Image Embedding and Reinforcement Learning. Shengcheng Yu, Chunrong Fang, Xin Li, Yuchen Ling, Zhenyu Chen, Zhendong Su. ACM Transactions on Software Engineering and Methodology. Published 2024-06-21. DOI: https://doi.org/10.1145/3674728
Anni Peng, Dongliang Fang, Le Guan, Erik van der Kouwe, Yin Li, Wenwen Wang, Limin Sun, Yuqing Zhang
Deeply embedded systems powered by microcontrollers are becoming popular with the emergence of Internet of Things (IoT) technology. However, these devices primarily run C/C++ code and are susceptible to memory bugs, which can lead to both control data attacks and non-control data attacks. Existing defense mechanisms, such as control flow integrity (CFI), data flow integrity (DFI), and write integrity testing (WIT), consume a massive amount of resources, making them less practical in real products. To make such defenses lightweight, we design a bitmap-based allowlist mechanism that unifies the storage of the runtime data for protecting both control data and non-control data. The memory requirements are constant and small, regardless of the number of deployed defense mechanisms. We store the allowlist in the TrustZone to ensure its integrity and confidentiality. Meanwhile, we perform an offline analysis to detect potential collisions and make corresponding adjustments when they occur. We have implemented our idea on an ARM Cortex-M based development board. Our evaluation results show a substantial reduction in memory consumption when deploying the proposed CFI and DFI mechanisms, without compromising runtime performance. Specifically, our prototype enforces CFI and DFI at a cost of just 2.09% performance overhead and 32.56% memory overhead on average.
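To make the idea of a bitmap-based allowlist with offline collision detection concrete, here is a small sketch. It is only an illustration under assumptions: the class `BitmapAllowlist`, the address-to-bit mapping, the bitmap size, and the example addresses are all hypothetical, and a real deployment would be written in C for a Cortex-M device with the bitmap kept in TrustZone-protected memory.

```python
class BitmapAllowlist:
    """Fixed-size bitmap: memory use is constant however many entries are added."""

    def __init__(self, n_bits=1024):
        self.n_bits = n_bits                # assumed to be a multiple of 8
        self.bits = bytearray(n_bits // 8)

    def _index(self, addr):
        # Hypothetical mapping: drop the two alignment bits, then wrap around.
        return (addr >> 2) % self.n_bits

    def allow(self, addr):
        i = self._index(addr)
        self.bits[i // 8] |= 1 << (i % 8)

    def is_allowed(self, addr):
        i = self._index(addr)
        return bool(self.bits[i // 8] & (1 << (i % 8)))


def find_collisions(allowlist, allowed, candidates):
    """Offline check: addresses that are not allowed but still hit a set bit."""
    allowed = set(allowed)
    return [a for a in candidates if a not in allowed and allowlist.is_allowed(a)]


if __name__ == "__main__":
    targets = [0x08000100, 0x08000204]      # made-up legitimate targets
    al = BitmapAllowlist()
    for t in targets:
        al.allow(t)
    print(al.is_allowed(0x08000100), al.is_allowed(0x08000104))
    print([hex(a) for a in find_collisions(al, targets, [0x08000104, 0x08001100])])
```

In this sketch, `0x08001100` maps to the same bit as `0x08000100`, so the offline pass reports it. An adjustment step (for example, relocating one of the two colliding targets) would then be needed; the sketch does not model that step.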
Bitmap-Based Security Monitoring for Deeply Embedded Systems. Anni Peng, Dongliang Fang, Le Guan, Erik van der Kouwe, Yin Li, Wenwen Wang, Limin Sun, Yuqing Zhang. ACM Transactions on Software Engineering and Methodology. Published 2024-06-18. DOI: https://doi.org/10.1145/3672460
Elijah Zolduoarrati, Sherlock A. Licorish, Nigel Stanger
The need for collective intelligence in technology means that online Q&A platforms, such as Stack Overflow and Reddit, have become invaluable in building the global knowledge ecosystem. Despite literature demonstrating a prevalence of inclusion and contribution disparities in online communities, studies investigating the underlying reasons behind such fluctuations remain scarce. The current study examines Stack Overflow users’ contribution profiles, both in isolation and relative to various diversity metrics, including GDP and access to electricity. This study also examines whether such profiles propagate to the city and state levels, supplemented by granular data such as per capita income and education, before validating quantitative findings using content analysis. We selected 143 countries and compared the profiles of their respective users to assess implicit diversity-related complications that impact how users contribute. Results show that countries with high GDP, prominent R&D presence, less wealth inequality, and sufficient access to infrastructure tend to have more users, regardless of their development status. Similarly, cities and states where technology is more prevalent (e.g., San Francisco and New York) have more users who tend to contribute more often. Qualitative analysis reveals distinct communication styles based on users’ locations. Urban users exhibited assertive, solution-oriented behaviour, actively sharing information. Conversely, rural users engaged through inquiries and discussions, incorporating personal anecdotes, gratitude, and conciliatory language. Findings from this study may benefit scholars and practitioners, allowing them to develop sustainable mechanisms to bridge the inclusion and diversity gaps.
Harmonising Contributions: Exploring Diversity in Software Engineering through CQA Mining on Stack Overflow. Elijah Zolduoarrati, Sherlock A. Licorish, Nigel Stanger. ACM Transactions on Software Engineering and Methodology. Published 2024-06-18. DOI: https://doi.org/10.1145/3672453
To foster the verifiability and testability of Deep Neural Networks (DNNs), an increasing number of test case generation techniques are being developed.
When confronted with testing DNN models, users can apply any existing test generation technique. However, they need to do so for each technique and each DNN model under test, which can be expensive. Therefore, a paradigm shift could benefit this testing process: rather than regenerating the test set independently for each DNN model under test, we could transfer test sets from existing DNN models.
This paper introduces GIST (Generated Inputs Sets Transferability), a novel approach for the efficient transfer of test sets. Given a property selected by a user (e.g., neurons covered, faults), GIST enables the selection of test sets that are good with respect to this property from among the available test sets. This allows users to recover, on the transferred test sets, properties similar to those they would have obtained by generating the test set from scratch with a test case generation technique. Experimental results show that GIST can select effective test sets to transfer for the given property. Moreover, GIST scales better than reapplying test case generation techniques from scratch on DNN models under test.
GIST: Generated Inputs Sets Transferability in Deep Learning. Florian Tambon, Foutse Khomh, Giuliano Antoniol. ACM Transactions on Software Engineering and Methodology. Published 2024-06-13. DOI: https://doi.org/10.1145/3672457
Database-backed applications rely on database access code to interact with the underlying database management systems (DBMSs). Although many prior studies target database access issues such as SQL anti-patterns or SQL code smells, few examine the database access bugs that arise during the maintenance of database-backed applications. In this paper, we empirically investigate 423 database access bugs collected from seven large-scale Java open source applications that use relational database management systems (e.g., MySQL or PostgreSQL). We study the characteristics (e.g., occurrence and root causes) of the bugs by manually examining the bug reports and commit histories. We find that the numbers of reported database access bugs and non-database access bugs follow a similar trend, but the files modified in their bug-fixing commits differ. Additionally, we generalize the root causes of database access bugs into five main categories (SQL queries, Schema, API, Configuration, SQL query result) comprising 25 unique root causes. We find that bugs pertaining to SQL queries, Schema, and API account for 84.2% of database access bugs across all studied applications. In particular, SQL query bugs (54%) and API bugs (38.7%) are the most frequent issues when using JDBC and Hibernate, respectively. Finally, we discuss the implications of our findings for developers and researchers.
An Empirical Study on the Characteristics of Database Access Bugs in Java Applications. Wei Liu, Shouvick Mondal, Tse-Hsun (Peter) Chen. ACM Transactions on Software Engineering and Methodology. Published 2024-06-13. DOI: https://doi.org/10.1145/3672449
Although large language models (LLMs) have demonstrated impressive ability in code generation, they still struggle to address complicated intent provided by humans. It is widely acknowledged that humans typically employ planning to decompose complex problems and schedule solution steps prior to implementation. To this end, we introduce planning into code generation to help the model understand complex intent and reduce the difficulty of problem-solving. This paper proposes a self-planning code generation approach with large language models, which consists of two phases: a planning phase and an implementation phase. Specifically, in the planning phase, the LLM plans out concise solution steps from the intent, combined with few-shot prompting. Subsequently, in the implementation phase, the model generates code step by step, guided by the preceding solution steps. We conduct extensive experiments on various code-generation benchmarks across multiple programming languages. Experimental results show that self-planning code generation achieves a relative improvement of up to 25.4% in Pass@1 compared to direct code generation, and up to 11.9% compared to Chain-of-Thought code generation. Moreover, our self-planning approach also enhances the quality of the generated code with respect to correctness, readability, and robustness, as assessed by humans.
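The two-phase structure described above can be sketched as follows. This is a hedged illustration, not the authors' prompts or code: the function `self_planning_generate`, the few-shot example in `FEW_SHOT`, and the prompt wording are hypothetical, and the model is passed in as a plain callable so that no particular LLM API is assumed.

```python
from typing import Callable, Tuple

# Hypothetical few-shot example showing the expected plan format.
FEW_SHOT = (
    "Intent: Return the largest even number in a list.\n"
    "Plan:\n"
    "1. Keep only the even numbers.\n"
    "2. Return the maximum of the remaining numbers.\n"
)

def self_planning_generate(intent: str, llm: Callable[[str], str]) -> Tuple[str, str]:
    """Run a planning phase, then an implementation phase guided by the plan."""
    # Planning phase: ask for concise solution steps, using few-shot prompting.
    plan_prompt = f"{FEW_SHOT}\nIntent: {intent}\nPlan:\n"
    plan = llm(plan_prompt)
    # Implementation phase: generate code that follows the plan step by step.
    code_prompt = (
        f"Intent: {intent}\nPlan:\n{plan}\n"
        "Implement the plan step by step.\nCode:\n"
    )
    code = llm(code_prompt)
    return plan, code
```

Keeping the plan as an explicit intermediate string makes it possible to inspect or edit the steps before any code is generated.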
Self-planning Code Generation with Large Language Models. Xue Jiang, Yihong Dong, Lecheng Wang, Fang Zheng, Qiwei Shang, Ge Li, Zhi Jin, Wenpin Jiao. ACM Transactions on Software Engineering and Methodology. Published 2024-06-13. DOI: https://doi.org/10.1145/3672456
Dong Huang, Qingwen Bu, Yichao Fu, Yuhao Qing, Xiaofei Xie, Junjie Chen, Heming Cui
Deep Neural Networks (DNNs) have been widely deployed in software to address various tasks (e.g., autonomous driving, medical diagnosis). However, they can also produce incorrect behaviors that result in financial losses and even threaten human safety. To reveal and repair incorrect behaviors in DNNs, developers often collect rich, unlabeled datasets from the natural world and label them to test DNN models. However, properly labeling a large number of datasets is a highly expensive and time-consuming task.
To address this problem, we propose NSS, Neuron Sensitivity Guided Test Case Selection, which reduces labeling time by selecting valuable test cases from unlabeled datasets. NSS leverages the internal neuron information induced by the test cases to select valuable test cases, namely those that are highly likely to cause the model to behave incorrectly. We evaluated NSS on four widely used datasets and four well-designed DNN models, comparing it against state-of-the-art (SOTA) baseline methods. The results show that NSS performs well both in assessing the probability that a test case triggers a failure and in improving the model. Specifically, compared to the baseline approaches, NSS achieves a higher fault detection rate (e.g., when selecting 5% of the test cases from the unlabeled dataset in the MNIST&LeNet1 experiment, NSS can obtain an 81.8% fault detection rate, which is a 20% increase compared with SOTA baseline strategies).
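One way to picture selection guided by neuron behavior is the sketch below. It is an assumption-laden illustration, not the paper's definition of neuron sensitivity: here sensitivity is taken to be the average change in one ReLU layer's activations under small random input perturbations, and the names `hidden`, `sensitivity`, and `select` are hypothetical.

```python
import random

def hidden(x, weights):
    """Activations of one ReLU layer; each row of `weights` is one neuron."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def sensitivity(x, weights, rng, eps=0.05, trials=8):
    """Average absolute activation change under small random input noise."""
    base = hidden(x, weights)
    total = 0.0
    for _ in range(trials):
        noisy = [xi + rng.uniform(-eps, eps) for xi in x]
        total += sum(abs(a - b) for a, b in zip(hidden(noisy, weights), base))
    return total / trials

def select(inputs, weights, budget, seed=0):
    """Return indices of the `budget` inputs with the highest sensitivity."""
    rng = random.Random(seed)
    scores = [sensitivity(x, weights, rng) for x in inputs]
    ranked = sorted(range(len(inputs)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]
```

Only the selected inputs would then be sent for manual labeling, which is where the saving in labeling time comes from.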
Neuron Sensitivity Guided Test Case Selection. Dong Huang, Qingwen Bu, Yichao Fu, Yuhao Qing, Xiaofei Xie, Junjie Chen, Heming Cui. ACM Transactions on Software Engineering and Methodology. Published 2024-06-12. DOI: https://doi.org/10.1145/3672454
Although Large Language Models (LLMs) have demonstrated remarkable code-generation ability, they still struggle with complex tasks. In real-world software development, humans usually tackle complex tasks through collaborative teamwork, a strategy that significantly controls development complexity and enhances software quality. Inspired by this, we present a self-collaboration framework for code generation employing LLMs, exemplified by ChatGPT. Specifically, through role instructions, (1) multiple LLM agents act as distinct ‘experts’, each responsible for a specific subtask within a complex task; and (2) the way they collaborate and interact is specified, so that the different roles form a virtual team and facilitate each other’s work. Ultimately, the virtual team addresses code generation tasks collaboratively without the need for human intervention. To effectively organize and manage this virtual team, we incorporate software-development methodology into the framework. Thus, we assemble an elementary team consisting of three LLM roles (i.e., analyst, coder, and tester) responsible for the analysis, coding, and testing stages of software development. We conduct comprehensive experiments on various code-generation benchmarks. Experimental results indicate that self-collaboration code generation achieves a relative improvement of 29.9%–47.1% in Pass@1 compared to the base LLM agent. Moreover, we showcase that self-collaboration could potentially enable LLMs to efficiently handle complex repository-level tasks that are not readily solved by a single LLM agent.
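The analyst, coder, and tester roles described above can be wired together roughly as follows. This is a sketch under assumptions rather than the framework's actual code: the role instructions in `ROLES`, the `PASS` convention, the round limit, and the function `collaborate` are hypothetical, and the model is supplied as a callable taking a role instruction and a message.

```python
from typing import Callable

# Hypothetical role instructions; the real framework's wording is not shown in the abstract.
ROLES = {
    "analyst": "You are a requirements analyst. Break the task into a plan.",
    "coder": "You are a developer. Write code for the plan, applying any feedback.",
    "tester": "You are a tester. Reply PASS or describe the defects.",
}

def collaborate(task: str, llm: Callable[[str, str], str], max_rounds: int = 3) -> str:
    """Analyst plans once; coder and tester iterate until the tester passes the code."""
    plan = llm(ROLES["analyst"], task)
    feedback = ""
    code = ""
    for _ in range(max_rounds):
        code = llm(ROLES["coder"], f"Task: {task}\nPlan: {plan}\nFeedback: {feedback}")
        feedback = llm(ROLES["tester"], code)
        if feedback.strip() == "PASS":
            break  # the tester found no defects
    return code
```

The round limit is a design choice of this sketch: it bounds cost when the tester never accepts the code, at the price of possibly returning code that still has reported defects.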
Self-collaboration Code Generation via ChatGPT. Yihong Dong, Xue Jiang, Zhi Jin, Ge Li. ACM Transactions on Software Engineering and Methodology. Published 2024-06-12. DOI: https://doi.org/10.1145/3672459
Nicolas Dejon, Chrystel Gaber, Gilles Grimaud, Narjes Jomaa
Despite growing efforts and encouraging successes in the last decades, fully formally verified projects are still rare in the industrial landscape. The industry often lacks the tools and methodologies to efficiently scale the proof development process. In this work, we give a comprehensible overview of the proof development process for proof developers and project managers. The goal is to support proof developers by rationalizing the proof development process, which currently relies heavily on their intuition and expertise, and by facilitating communication with the management line. To this end, we concentrate on the aspect of proof manufacturing and highlight the most significant sources of proof effort. We propose means to mitigate the latter through proof practices (proof structuring, proof strategies, and proof planning), proof metrics, and tools. Our approach is project-agnostic and independent of specific proof expertise, and its computed estimates do not assume prior similar developments. We evaluate our guidelines on a separation kernel undergoing formal verification, using them to drive the proof process in an optimised way. Feedback from a project manager unfamiliar with proof development confirms the benefits of detailed planning of the proof development steps, clear progress communication to the management line, and alignment with established practices in the software industry.
Code to Qed, the Project Manager's Guide to Proof Engineering. Nicolas Dejon, Chrystel Gaber, Gilles Grimaud, Narjes Jomaa. ACM Transactions on Software Engineering and Methodology, published 2024-06-04. https://doi.org/10.1145/3664807
Suwichak Fungprasertkul, Rami Bahsoon, Rick Kazman
Technical Debt Management (TDM) can suffer from unpredictability, communication gaps and the inaccessibility of relevant information, which hamper the effectiveness of its decision making. These issues can stem from division among decision-makers, which is rooted in decisions having unfair consequences for different decision-makers. One mitigation route is Skin in the Game thinking, which enforces transparency, fairness and shared responsibility during collective decision-making under uncertainty. This paper illustrates the characteristics that call for Skin in the Game thinking in Technical Debt (TD) identification, measurement, prioritisation and monitoring. We point out crucial problems in TD monitoring rooted in asymmetric information and asymmetric payoff between different factions of decision-makers. A systematic TD monitoring method is presented to mitigate these problems. The method leverages Replicator Dynamics and Behavioural Learning. It supports decision-makers with automated TD monitoring decisions and informs them when human intervention is required. Two publicly available industrial projects with a non-trivial number of TD items and timestamps are used to evaluate the application of our method. Mann-Whitney U hypothesis tests are conducted on samples of decisions from our method and the baseline. The statistical evidence indicates that our method can produce cost-effective and contextual TD monitoring decisions.
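For readers unfamiliar with the two named techniques, the sketch below shows their textbook forms only; it does not reproduce the paper's method. It assumes the standard replicator equation, in which a strategy's share grows in proportion to how far its payoff exceeds the population average, and the pairwise-count definition of the Mann-Whitney U statistic. The function names, the two strategies ("monitor" and "ignore"), and all numbers are hypothetical.

```python
from typing import List, Sequence


def replicator_step(shares: Sequence[float],
                    payoffs: Sequence[float],
                    dt: float = 0.1) -> List[float]:
    """One Euler step of dx_i/dt = x_i * (f_i - average payoff)."""
    average = sum(x * f for x, f in zip(shares, payoffs))
    updated = [x + dt * x * (f - average) for x, f in zip(shares, payoffs)]
    total = sum(updated)  # renormalise to guard against rounding drift
    return [x / total for x in updated]


def mann_whitney_u(sample_a: Sequence[float],
                   sample_b: Sequence[float]) -> float:
    """U statistic of sample_a: pairs where a > b count 1, ties count 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


if __name__ == "__main__":
    # Hypothetical population: half "monitor", half "ignore" strategies.
    shares = [0.5, 0.5]
    payoffs = [1.0, 2.0]  # illustrative payoffs, not taken from the paper
    for _ in range(3):
        shares = replicator_step(shares, payoffs)
    print(shares)  # the better-paid strategy gains share each step

    # Hypothetical cost samples for a method and a baseline.
    print(mann_whitney_u([4, 5, 6], [1, 2, 3]))
```

The U statistic alone is not a hypothesis test. A p-value additionally needs the null distribution of U, which statistical libraries provide.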
Technical Debt Monitoring Decision Making with Skin in the Game. Suwichak Fungprasertkul, Rami Bahsoon, Rick Kazman. ACM Transactions on Software Engineering and Methodology, published 2024-06-01. https://doi.org/10.1145/3664805