ACM Transactions on Software Engineering and Methodology: Latest Articles (Page 8)

Octopus: Scaling Value-Flow Analysis via Parallel Collection of Realizable Path Conditions

IF 4.4 | Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Software Engineering and Methodology

Pub Date : 2024-01-24 DOI: 10.1145/3632743

Wensheng Tang, Dejun Dong, Shijie Li, Chengpeng Wang, Peisen Yao, Jinguo Zhou, Charles Zhang

Value-flow analysis is a fundamental technique in program analysis, benefiting various clients, such as memory corruption detection and taint analysis. However, existing approaches suffer from low potential speedup, which limits their scalability. In this work, we present Octopus, a parallel algorithm that efficiently collects path conditions for realizable paths. Octopus builds on realizability decomposition to collect the intraprocedural path conditions of different functions simultaneously and on demand, and obtains realizable path conditions by concatenation, which achieves a high potential speedup in parallelization. We implement Octopus as a tool and evaluate it on 15 real-world programs. The experiments show that Octopus significantly outperforms state-of-the-art algorithms. In particular, it detects NPD bugs in the llvm project (6.3 MLoC) within 6.9 minutes under the 40-thread setting. We also state and prove several theorems to demonstrate the soundness, completeness, and high potential speedup of Octopus. Our empirical and theoretical results demonstrate the great potential of Octopus in supporting various program analysis clients. The implementation has been officially deployed at Ant Group, scaling the nightly code scan for massive FinTech applications.

Citations: 0

Enablers and Barriers of Empathy in Software Developer and User Interactions: A Mixed Methods Case Study

Pub Date : 2024-01-23 DOI: 10.1145/3641849

Hashini Gunatilake, John Grundy, Rashina Hoda, Ingo Mueller

Software engineering (SE) requires developers to collaborate with stakeholders, and understanding their emotions and perspectives is often vital. Empathy is a concept characterising a person’s ability to understand and share the feelings of another. However, empathy continues to be an under-researched human aspect in SE. We studied how empathy is practised between developers and end users using a mixed methods case study. We used an empathy test, observations and interviews to collect data, and socio–technical grounded theory and descriptive statistics to analyse the data. We identified the nature of the awareness required to trigger empathy, as well as enablers of empathy. We discovered barriers to empathy and a set of potential strategies to overcome these barriers. We report insights on emerging relationships and present a set of recommendations and directions for future work on empathy and SE for software practitioners and SE researchers.

Citations: 0

Learning Failure-Inducing Models for Testing Software-Defined Networks

Pub Date : 2024-01-23 DOI: 10.1145/3641541

Raphaël Ollando, Seung Yeob Shin, Lionel C. Briand

Software-defined networks (SDNs) enable flexible and effective communication systems that are managed by centralized software controllers. However, such a controller can undermine the underlying communication network of an SDN-based system and thus must be carefully tested. When an SDN-based system fails, engineers need to understand precisely the conditions under which the failure occurs in order to address it. In this article, we introduce a machine learning-guided fuzzing method, named FuzzSDN, that aims to both (1) generate effective test data leading to failures in SDN-based systems and (2) learn accurate failure-inducing models that characterize the conditions under which such systems fail. To our knowledge, no existing work simultaneously addresses these two objectives for SDNs. We evaluate FuzzSDN by applying it to systems controlled by two open-source SDN controllers. Further, we compare FuzzSDN with two state-of-the-art methods for fuzzing SDNs and two baselines for learning failure-inducing models. Our results show that (1) compared to the state-of-the-art methods, FuzzSDN generates at least 12 times more failures, within the same time budget, with a controller that is fairly robust to fuzzing and (2) our failure-inducing models have, on average, a precision of 98% and a recall of 86%, significantly outperforming the baselines.

Citations: 0

Mapping APIs in Dynamic-typed Programs by Leveraging Transfer Learning

Pub Date : 2024-01-22 DOI: 10.1145/3641848

Zhenfei Huang, Junjie Chen, Jiajun Jiang, Yihua Liang, Hanmo You, Fengjie Li

Application Programming Interface (API) migration is a common task for adapting software across different programming languages and platforms, where manually constructing the mapping relations between APIs is indeed time-consuming and error-prone. To facilitate this process, many automated API mapping approaches have been proposed. However, existing approaches were mainly designed and evaluated for mapping APIs of statically-typed languages, while their performance on dynamically-typed languages remains unexplored.

In this paper, we conduct the first extensive study of how existing API mapping approaches perform on APIs in dynamically-typed languages, for which we manually constructed a high-quality dataset. From the empirical results, we summarize several insights. In particular, the source code implementations of APIs can significantly improve the effectiveness of API mapping. However, due to confidentiality policies, they may not be available in practice. To overcome this, we propose a novel API mapping approach, named Matl, which leverages transfer learning to learn semantic embeddings of source code implementations from large-scale open-source repositories and then transfers the learned model to facilitate the mapping of APIs. In this way, Matl can produce API embeddings that capture functionality more accurately, enabling more effective mapping without access to the source code of the APIs. To evaluate Matl, we conducted an extensive study comparing it with state-of-the-art approaches. The results demonstrate that Matl is effective: it improves on the state-of-the-art approach by at least 18.36% for mapping APIs of dynamically-typed languages and by 30.77% for mapping APIs of the statically-typed language.

Citations: 0

Battling against Protocol Fuzzing: Protecting Networked Embedded Devices from Dynamic Fuzzers

Pub Date : 2024-01-22 DOI: 10.1145/3641847

Puzhuo Liu, Yaowen Zheng, Chengnian Sun, Hong Li, Zhi Li, Limin Sun

Networked Embedded Devices (NEDs) are increasingly targeted by cyberattacks, mainly due to their widespread use in our daily lives. Vulnerabilities in NEDs are the root causes of these cyberattacks. Although deployed NEDs go through thorough code audits, there can still be considerable exploitable vulnerabilities. Existing mitigation measures like code encryption and obfuscation adopted by vendors can resist static analysis on deployed NEDs, but are ineffective against protocol fuzzing. Attackers can easily apply protocol fuzzing to discover vulnerabilities and compromise deployed NEDs. Unfortunately, prior anti-fuzzing techniques are impractical as they significantly slow down NEDs, hampering NED availability.

To address this issue, we propose Armor, the first anti-fuzzing technique specifically designed for NEDs. First, we design three adversarial primitives (delay, fake coverage, and forged exception) to break the fundamental mechanisms on which fuzzing relies to find vulnerabilities effectively. Second, based on our observation that inputs from normal users are consistent with the protocol specification and that certain program paths are rarely executed with normal inputs, we design static and dynamic strategies to decide whether to activate the adversarial primitives. Extensive evaluations show that Armor incurs negligible time overhead and effectively reduces the code coverage achieved by fuzzing (e.g., line coverage by 22%-61%), significantly outperforming the state of the art.

Citations: 0

Bug Analysis in Jupyter Notebook Projects: An Empirical Study

Pub Date : 2024-01-22 DOI: 10.1145/3641539

Taijara Loiola de Santana, Paulo Anselmo da Mota Silveira Neto, Eduardo Santana de Almeida, Iftekhar Ahmed

Computational notebooks, such as Jupyter, have been widely adopted by data scientists to write code for analyzing and visualizing data. Despite their growing adoption and popularity, few studies have examined Jupyter development challenges from the practitioners’ point of view. This paper presents a systematic study of the bugs and challenges that Jupyter practitioners face, based on a large-scale empirical investigation. We mined 14,740 commits from 105 GitHub open-source projects with Jupyter notebook code. We then analyzed 30,416 Stack Overflow posts, which gave us insights into the bugs that practitioners face when developing Jupyter notebook projects. Next, we conducted nineteen interviews with data scientists to uncover more details about Jupyter bugs and to gain insight into Jupyter developers’ challenges. Finally, to validate the study results and the proposed taxonomy, we conducted a survey with 91 data scientists. We also highlight bug categories, their root causes, and the challenges that Jupyter practitioners face.

Citations: 0

Rapid: Zero-shot Domain Adaptation for Code Search with Pre-trained Models

Pub Date : 2024-01-18 DOI: 10.1145/3641542

Guodong Fan, Shizhan Chen, Cuiyun Gao, Jianmao Xiao, Tao Zhang, Zhiyong Feng

Code search, which refers to the process of identifying the most relevant code snippets for a given natural language query, plays a crucial role in software maintenance. However, current approaches rely heavily on labeled data for training, so their performance drops in cross-domain scenarios, including domain-specific or project-specific situations. This decline can be attributed to their limited ability to capture the semantics associated with such scenarios. To tackle this problem, we propose a zeRo-shot domAin adaPtion with pre-traIned moDels framework for code search named RAPID. The framework first generates synthetic data by pseudo labeling, then trains CodeBERT with sampled synthetic data. To avoid the influence of noisy synthetic data and enhance model performance, we propose a mixture sampling strategy to obtain hard negative samples during training. Specifically, the mixture sampling strategy considers both relevancy and diversity to select the data that are hard for the models to distinguish. To validate the effectiveness of our approach in zero-shot settings, we conduct extensive experiments and find that RAPID outperforms the CoCoSoDa and UniXcoder models by an average of 15.7% and 10%, respectively, as measured by the MRR metric. When trained on full data, our approach results in an average improvement of 7.5% under the MRR metric using CodeBERT. We observe that as the model’s performance in zero-shot tasks improves, the impact of hard negatives diminishes. Our observations also indicate that fine-tuning CodeT5 for generating pseudo labels can enhance the performance of the code search model, and that using only 100-shot samples can yield results comparable to the supervised baseline. Furthermore, we evaluate the effectiveness of RAPID in real-world code search tasks in three GitHub projects through both human and automated assessments. Our findings reveal that RAPID exhibits superior performance, e.g., an average improvement of 18% under the MRR metric over the top-performing model.
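
The abstract above reports its results in terms of MRR (mean reciprocal rank). As a reading aid, the following is a minimal sketch of how that metric is commonly computed for code search, assuming each query has exactly one relevant snippet. It is not taken from the RAPID paper; the function name `mean_reciprocal_rank` and the toy data are illustrative assumptions.

```python
from typing import Sequence


def mean_reciprocal_rank(
    ranked_results: Sequence[Sequence[str]],
    relevant: Sequence[str],
) -> float:
    """Mean over queries of 1/rank of the relevant item (0 if it is absent).

    ranked_results[i] is the ranked candidate list returned for query i,
    relevant[i] is the single correct snippet id for query i (an assumption
    of this sketch; real benchmarks may allow several relevant items).
    """
    if len(ranked_results) != len(relevant):
        raise ValueError("one relevant item is required per query")
    if not ranked_results:
        return 0.0

    total = 0.0
    for candidates, target in zip(ranked_results, relevant):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == target:
                total += 1.0 / rank  # reciprocal of the 1-based rank
                break  # only the first hit counts
    return total / len(ranked_results)


# Hypothetical toy data: three queries with their ranked snippet ids.
toy_rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
toy_targets = ["a", "y", "missing"]
toy_mrr = mean_reciprocal_rank(toy_rankings, toy_targets)
```

In the toy data the relevant snippet is ranked first for the first query, second for the second query, and not retrieved for the third, so the three reciprocal ranks are 1, 1/2 and 0.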

Citations: 0

On the Reliability and Explainability of Language Models for Program Generation

Pub Date : 2024-01-18 DOI: 10.1145/3641540

Yue Liu, Chakkrit Tantithamthavorn, Yonghui Liu, Li Li

Recent studies have adopted pre-trained language models, such as CodeT5 and CodeGPT, for automated program generation tasks like code generation, repair, and translation. Numerous language model-based approaches have been proposed and evaluated on various benchmark datasets, demonstrating promising performance. However, there is still uncertainty about the reliability of these models, particularly their realistic ability to consistently transform code sequences. This raises the question: are these techniques sufficiently trustworthy for automated program generation? Consequently, further research is needed to understand model logic and assess reliability and explainability. To bridge these research gaps, we conduct a thorough empirical study of eight popular language models on five representative datasets to determine the capabilities and limitations of automated program generation approaches. We further employ advanced explainable AI approaches to highlight the tokens that significantly contribute to the code transformation. We discover that state-of-the-art approaches suffer from inappropriate performance evaluation stemming from severe data duplication, causing over-optimistic results. Our explainability analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences. Overall, more rigorous evaluation approaches and benchmarks are critical to enhance the reliability and explainability of automated program generation moving forward. Our findings provide important guidelines for this goal.

Citations: 0

Rigorous Assessment of Model Inference Accuracy using Language Cardinality

Pub Date : 2024-01-16 DOI: 10.1145/3640332

Donato Clun, Donghwan Shin, Antonio Filieri, Domenico Bianculli

Models such as finite state automata are widely used to abstract the behavior of software systems by capturing the sequences of events observable during their execution. Nevertheless, models rarely exist in practice and, when they do, get easily outdated; moreover, manually building and maintaining models is costly and error-prone. As a result, a variety of model inference methods that automatically construct models from execution traces have been proposed to address these issues.

However, performing a systematic and reliable accuracy assessment of inferred models remains an open problem. Even when a reference model is given, most existing model accuracy assessment methods may return misleading and biased results. This is mainly due to their reliance on statistical estimators over a finite number of randomly generated traces, introducing avoidable uncertainty about the estimation and being sensitive to the parameters of the random trace generative process.

This paper addresses this problem by developing a systematic approach based on analytic combinatorics that minimizes bias and uncertainty in model accuracy assessment by replacing statistical estimation with deterministic accuracy measures. We experimentally demonstrate the consistency and applicability of our approach by assessing the accuracy of models inferred by state-of-the-art inference tools against reference models from established specification mining benchmarks.
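To make the idea of a deterministic, cardinality-based measure concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the reference and inferred models are complete DFAs over the same alphabet, and it bounds word length with `max_len` (a hypothetical parameter) so that the language sizes are finite. Precision and recall are then exact ratios of counted words, obtained by dynamic programming over the product automaton instead of by sampling random traces.

```python
from fractions import Fraction


class DFA:
    """Complete DFA; `delta` maps (state, symbol) -> state."""

    def __init__(self, alphabet, delta, start, accepting):
        self.alphabet = alphabet
        self.delta = delta
        self.start = start
        self.accepting = set(accepting)


def count_words(dfa_a, dfa_b, max_len):
    """Count words of length <= max_len accepted by A, by B, and by both.

    `dist` maps each product state to the number of words of the current
    length that reach it, so every word is counted exactly once.
    """
    dist = {(dfa_a.start, dfa_b.start): 1}
    in_a = in_b = in_both = 0
    for _ in range(max_len + 1):
        for (p, q), n in dist.items():
            a_ok = p in dfa_a.accepting
            b_ok = q in dfa_b.accepting
            in_a += n * a_ok
            in_b += n * b_ok
            in_both += n * (a_ok and b_ok)
        nxt = {}
        for (p, q), n in dist.items():
            for s in dfa_a.alphabet:
                key = (dfa_a.delta[(p, s)], dfa_b.delta[(q, s)])
                nxt[key] = nxt.get(key, 0) + n
        dist = nxt
    return in_a, in_b, in_both


def precision_recall(reference, inferred, max_len):
    """Exact precision and recall of `inferred` with respect to `reference`."""
    ref_n, inf_n, both = count_words(reference, inferred, max_len)
    precision = Fraction(both, inf_n) if inf_n else Fraction(1)
    recall = Fraction(both, ref_n) if ref_n else Fraction(1)
    return precision, recall


# Hypothetical toy models: the reference accepts words with an even number
# of 'a'; the inferred model over-generalizes and accepts every word.
reference = DFA(
    "ab",
    {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1},
    start=0,
    accepting={0},
)
inferred = DFA("ab", {(0, "a"): 0, (0, "b"): 0}, start=0, accepting={0})
precision, recall = precision_recall(reference, inferred, max_len=2)  # 4/7 and 1
```

Because the counts are exact, repeated runs give the same result, with no dependence on a random trace generator or its parameters.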

Citations: 0

Method-Level Bug Prediction: Problems and Promises

Pub Date : 2024-01-13 DOI: 10.1145/3640331

Shaiful Chowdhury, Gias Uddin, Hadi Hemmati, Reid Holmes

Fixing software bugs can be colossally expensive, especially if they are discovered in the later phases of the software development life cycle. As such, bug prediction has been a classic problem for the research community. As of now, the Google Scholar site generates ~113,000 hits if searched with the “bug prediction” phrase. Despite this staggering effort by the research community, bug prediction research is criticized for not being decisively adopted in practice. A significant problem of the existing research is the granularity level (i.e., class/file level) at which bug prediction is historically studied. Practitioners find it difficult and time-consuming to locate bugs at the class/file level granularity. Consequently, method-level bug prediction has become popular in the last decade. We ask, are these method-level bug prediction models ready for industry use? Unfortunately, the answer is no. The reported high accuracies of these models dwindle significantly if we evaluate them in different realistic time-sensitive contexts. It may seem hopeless at first, but encouragingly, we show that future method-level bug prediction can be improved significantly. In general, we show how to reliably evaluate future method-level bug prediction models, and how to improve them by focusing on four different improvement avenues: building noise-free bug data, addressing concept drift, selecting similar training projects, and developing a mixture of models. Our findings are based on three publicly available method-level bug datasets, and a newly built bug dataset of 774,051 Java methods originating from 49 open-source software projects.
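One way to read "time-sensitive" evaluation is that a model should be trained only on data that was available before the point in time at which it is tested. The sketch below illustrates that reading. It is an assumption rather than the paper's protocol, which the abstract does not describe, and `MethodSample`, `observed_on`, and `time_aware_split` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MethodSample:
    method_id: str
    observed_on: date  # assumed: when the method's features and label became known
    buggy: bool


def time_aware_split(samples, cutoff):
    """Train on samples observed before `cutoff`; test on the rest.

    Unlike a random split, no information from the future leaks into training.
    """
    ordered = sorted(samples, key=lambda s: s.observed_on)
    train = [s for s in ordered if s.observed_on < cutoff]
    test = [s for s in ordered if s.observed_on >= cutoff]
    return train, test


# Hypothetical toy history of three methods.
history = [
    MethodSample("Parser.parse", date(2021, 6, 1), True),
    MethodSample("Cache.get", date(2020, 3, 15), False),
    MethodSample("Lexer.next", date(2022, 1, 10), False),
]
train, test = time_aware_split(history, cutoff=date(2021, 1, 1))
# train holds Cache.get; test holds Parser.parse and Lexer.next
```

A random split over the same samples could place `Lexer.next` in training and `Cache.get` in testing, which is the kind of unrealistic setting that can inflate reported accuracy.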

Citations: 0