Value-flow analysis is a fundamental technique in program analysis, benefiting various clients, such as memory corruption detection and taint analysis. However, existing efforts suffer from a low potential speedup, which limits their scalability. In this work, we present Octopus, a parallel algorithm that efficiently collects path conditions for realizable paths. Octopus builds on the realizability decomposition to collect the intraprocedural path conditions of different functions simultaneously and on demand, and obtains realizable path conditions by concatenation, which achieves a high potential speedup in parallelization. We implement Octopus as a tool and evaluate it on 15 real-world programs. The experiments show that Octopus significantly outperforms the state-of-the-art algorithms. In particular, it detects NPD bugs in the llvm project (6.3 MLoC) within 6.9 minutes under the 40-thread setting. We also state and prove several theorems to demonstrate the soundness, completeness, and high potential speedup of Octopus. Our empirical and theoretical results demonstrate the great potential of Octopus in supporting various program analysis clients. The implementation has been officially deployed at Ant Group, scaling the nightly code scan for massive FinTech applications.
{"title":"Octopus: Scaling Value-Flow Analysis via Parallel Collection of Realizable Path Conditions","authors":"Wensheng Tang, Dejun Dong, Shijie Li, Chengpeng Wang, Peisen Yao, Jinguo Zhou, Charles Zhang","doi":"10.1145/3632743","DOIUrl":"https://doi.org/10.1145/3632743","journal":{"name":"ACM Transactions on Software Engineering and Methodology"},"publicationDate":"2024-01-24","publicationTypes":"Journal Article"}
Hashini Gunatilake, John Grundy, Rashina Hoda, Ingo Mueller
Software engineering (SE) requires developers to collaborate with stakeholders, and understanding their emotions and perspectives is often vital. Empathy is a concept characterising a person’s ability to understand and share the feelings of another. However, empathy continues to be an under-researched human aspect in SE. We studied how empathy is practised between developers and end users using a mixed methods case study. We used an empathy test, observations and interviews to collect data, and socio–technical grounded theory and descriptive statistics to analyse data. We identified the nature of awareness required to trigger empathy and enablers of empathy. We discovered barriers to empathy and a set of potential strategies to overcome these barriers. We report insights on emerging relationships and present a set of recommendations and potential future works on empathy and SE for software practitioners and SE researchers.
{"title":"Enablers and Barriers of Empathy in Software Developer and User Interactions: A Mixed Methods Case Study","authors":"Hashini Gunatilake, John Grundy, Rashina Hoda, Ingo Mueller","doi":"10.1145/3641849","DOIUrl":"https://doi.org/10.1145/3641849","journal":{"name":"ACM Transactions on Software Engineering and Methodology"},"publicationDate":"2024-01-23","publicationTypes":"Journal Article"}
Raphaël Ollando, Seung Yeob Shin, Lionel C. Briand
Software-defined networks (SDN) enable flexible and effective communication systems that are managed by centralized software controllers. However, such a controller can undermine the underlying communication network of an SDN-based system and thus must be carefully tested. When an SDN-based system fails, engineers need to precisely understand the conditions under which the failure occurs in order to address it. In this article, we introduce a machine learning-guided fuzzing method, named FuzzSDN, aiming at both (1) generating effective test data leading to failures in SDN-based systems and (2) learning accurate failure-inducing models that characterize the conditions under which such a system fails. To our knowledge, no existing work simultaneously addresses these two objectives for SDNs. We evaluate FuzzSDN by applying it to systems controlled by two open-source SDN controllers. Further, we compare FuzzSDN with two state-of-the-art methods for fuzzing SDNs and two baselines for learning failure-inducing models. Our results show that (1) compared to the state-of-the-art methods, FuzzSDN generates at least 12 times more failures, within the same time budget, with a controller that is fairly robust to fuzzing and (2) our failure-inducing models have, on average, a precision of 98% and a recall of 86%, significantly outperforming the baselines.
{"title":"Learning Failure-Inducing Models for Testing Software-Defined Networks","authors":"Raphaël Ollando, Seung Yeob Shin, Lionel C. Briand","doi":"10.1145/3641541","DOIUrl":"https://doi.org/10.1145/3641541","journal":{"name":"ACM Transactions on Software Engineering and Methodology"},"publicationDate":"2024-01-23","publicationTypes":"Journal Article"}
Application Programming Interface (API) migration is a common task for adapting software across different programming languages and platforms, where manually constructing the mapping relations between APIs is indeed time-consuming and error-prone. To facilitate this process, many automated API mapping approaches have been proposed. However, existing approaches were mainly designed and evaluated for mapping APIs of statically-typed languages, while their performance on dynamically-typed languages remains unexplored.
In this paper, we conduct the first extensive study to explore existing API mapping approaches’ performance for mapping APIs in dynamically-typed languages, for which we have manually constructed a high-quality dataset. According to the empirical results, we have summarized several insights. In particular, the source code implementations of APIs can significantly improve the effectiveness of API mapping. However, due to the confidentiality policy, they may not be available in practice. To overcome this, we propose a novel API mapping approach, named Matl, which leverages the transfer learning technique to learn the semantic embeddings of source code implementations from large-scale open-source repositories and then transfers the learned model to facilitate the mapping of APIs. In this way, Matl can produce more accurate API embedding of its functionality for more effective mapping without knowing the source code of the APIs. To evaluate the performance of Matl, we have conducted an extensive study by comparing Matl with state-of-the-art approaches. The results demonstrate that Matl is indeed effective as it improves the state-of-the-art approach by at least 18.36% for mapping APIs of dynamically-typed language and by 30.77% for mapping APIs of the statically-typed language.
{"title":"Mapping APIs in Dynamic-typed Programs by Leveraging Transfer Learning","authors":"Zhenfei Huang, Junjie Chen, Jiajun Jiang, Yihua Liang, Hanmo You, Fengjie Li","doi":"10.1145/3641848","DOIUrl":"https://doi.org/10.1145/3641848","journal":{"name":"ACM Transactions on Software Engineering and Methodology"},"publicationDate":"2024-01-22","publicationTypes":"Journal Article"}
Puzhuo Liu, Yaowen Zheng, Chengnian Sun, Hong Li, Zhi Li, Limin Sun
Networked Embedded Devices (NEDs) are increasingly targeted by cyberattacks, mainly due to their widespread use in our daily lives. Vulnerabilities in NEDs are the root causes of these cyberattacks. Although deployed NEDs go through thorough code audits, there can still be considerable exploitable vulnerabilities. Existing mitigation measures like code encryption and obfuscation adopted by vendors can resist static analysis on deployed NEDs, but are ineffective against protocol fuzzing. Attackers can easily apply protocol fuzzing to discover vulnerabilities and compromise deployed NEDs. Unfortunately, prior anti-fuzzing techniques are impractical as they significantly slow down NEDs, hampering NED availability.
To address this issue, we propose Armor, the first anti-fuzzing technique specifically designed for NEDs. First, we design three adversarial primitives (delay, fake coverage, and forged exception) to break the fundamental mechanisms on which fuzzing relies to effectively find vulnerabilities. Second, based on our observations that inputs from normal users are consistent with the protocol specification and that certain program paths are rarely executed with normal inputs, we design static and dynamic strategies to decide whether to activate the adversarial primitives. Extensive evaluations show that Armor incurs negligible time overhead and effectively reduces the code coverage achieved by fuzzing (e.g., line coverage by 22%-61%), significantly outperforming the state-of-the-art.
{"title":"Battling against Protocol Fuzzing: Protecting Networked Embedded Devices from Dynamic Fuzzers","authors":"Puzhuo Liu, Yaowen Zheng, Chengnian Sun, Hong Li, Zhi Li, Limin Sun","doi":"10.1145/3641847","DOIUrl":"https://doi.org/10.1145/3641847","journal":{"name":"ACM Transactions on Software Engineering and Methodology"},"publicationDate":"2024-01-22","publicationTypes":"Journal Article"}
Taijara Loiola de Santana, Paulo Anselmo da Mota Silveira Neto, Eduardo Santana de Almeida, Iftekhar Ahmed
Computational notebooks, such as Jupyter, have been widely adopted by data scientists to write code for analyzing and visualizing data. Despite their growing adoption and popularity, few studies have examined Jupyter development challenges from the practitioners’ point of view. This paper presents a systematic study of bugs and challenges that Jupyter practitioners face through a large-scale empirical investigation. We mined 14,740 commits from 105 GitHub open-source projects with Jupyter notebook code. We then analyzed 30,416 Stack Overflow posts, which gave us insights into bugs that practitioners face when developing Jupyter notebook projects. Next, we conducted nineteen interviews with data scientists to uncover more details about Jupyter bugs and to gain insight into Jupyter developers’ challenges. Finally, to validate the study results and proposed taxonomy, we conducted a survey with 91 data scientists. We also highlight bug categories, their root causes, and the challenges that Jupyter practitioners face.
{"title":"Bug Analysis in Jupyter Notebook Projects: An Empirical Study","authors":"Taijara Loiola de Santana, Paulo Anselmo da Mota Silveira Neto, Eduardo Santana de Almeida, Iftekhar Ahmed","doi":"10.1145/3641539","DOIUrl":"https://doi.org/10.1145/3641539","journal":{"name":"ACM Transactions on Software Engineering and Methodology"},"publicationDate":"2024-01-22","publicationTypes":"Journal Article"}
Code search, which refers to the process of identifying the most relevant code snippets for a given natural language query, plays a crucial role in software maintenance. However, current approaches rely heavily on labeled data for training, so their performance decreases in cross-domain scenarios, including domain-specific or project-specific situations. This decline can be attributed to their limited ability to effectively capture the semantics associated with such scenarios. To tackle this problem, we propose a zeRo-shot domAin adaPtion with pre-traIned moDels framework for code search named RAPID. The framework first generates synthetic data by pseudo labeling, then trains CodeBERT with sampled synthetic data. To avoid the influence of noisy synthetic data and enhance the model performance, we propose a mixture sampling strategy to obtain hard negative samples during training. Specifically, the mixture sampling strategy considers both relevancy and diversity to select the data that are hard for the models to distinguish. To validate the effectiveness of our approach in zero-shot settings, we conduct extensive experiments and find that RAPID outperforms the CoCoSoDa and UniXcoder models by an average of 15.7% and 10%, respectively, as measured by the MRR metric. When trained on full data, our approach results in an average improvement of 7.5% under the MRR metric using CodeBERT. We observe that as the model’s performance in zero-shot tasks improves, the impact of hard negatives diminishes. Our observation also indicates that fine-tuning CodeT5 for generating pseudo labels can enhance the performance of the code search model, and using only 100-shot samples can yield comparable results to the supervised baseline. Furthermore, we evaluate the effectiveness of RAPID in real-world code search tasks in three GitHub projects through both human and automated assessments. Our findings reveal that RAPID exhibits superior performance, e.g., an average improvement of 18% under the MRR metric over the top-performing model.
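The results above are reported under the MRR (mean reciprocal rank) metric. As a reminder of what that number measures, here is a minimal Python sketch of the standard definition: for each query, take the reciprocal of the rank at which the correct snippet appears (0 if it is absent), then average over queries. The function name, the snippet identifiers, and the toy data are illustrative assumptions and are not taken from the RAPID evaluation.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    ranked_results: Sequence[Sequence[Hashable]],
    relevant: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the correct item per query (0 when it is missing).

    ranked_results[i] is the ranked candidate list returned for query i;
    relevant[i] is the identifier of the correct snippet for query i.
    """
    if not ranked_results:
        return 0.0
    total = 0.0
    for candidates, target in zip(ranked_results, relevant):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == target:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(ranked_results)


# Hypothetical toy data: hit at rank 1, hit at rank 2, and a miss.
ranked = [["s1", "s2", "s3"], ["s4", "s5"], ["s6", "s7"]]
gold = ["s1", "s5", "s9"]
mrr = mean_reciprocal_rank(ranked, gold)  # (1 + 1/2 + 0) / 3
```

Because every query contributes at most 1, MRR lies in [0, 1], and an improvement under this metric means that correct snippets tend to appear closer to the top of the ranked list.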
{"title":"Rapid: Zero-shot Domain Adaptation for Code Search with Pre-trained Models","authors":"Guodong Fan, Shizhan Chen, Cuiyun Gao, Jianmao Xiao, Tao Zhang, Zhiyong Feng","doi":"10.1145/3641542","DOIUrl":"https://doi.org/10.1145/3641542","journal":{"name":"ACM Transactions on Software Engineering and Methodology"},"publicationDate":"2024-01-18","publicationTypes":"Journal Article"}
Yue Liu, Chakkrit Tantithamthavorn, Yonghui Liu, Li Li
Recent studies have adopted pre-trained language models, such as CodeT5 and CodeGPT, for automated program generation tasks like code generation, repair, and translation. Numerous language model-based approaches have been proposed and evaluated on various benchmark datasets, demonstrating promising performance. However, there is still uncertainty about the reliability of these models, particularly their realistic ability to consistently transform code sequences. This raises the question: are these techniques sufficiently trustworthy for automated program generation? Consequently, further research is needed to understand model logic and assess reliability and explainability. To bridge these research gaps, we conduct a thorough empirical study of eight popular language models on five representative datasets to determine the capabilities and limitations of automated program generation approaches. We further employ advanced explainable AI approaches to highlight the tokens that significantly contribute to the code transformation. We discover that state-of-the-art approaches suffer from inappropriate performance evaluation stemming from severe data duplication, causing over-optimistic results. Our explainability analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences. Overall, more rigorous evaluation approaches and benchmarks are critical to enhance the reliability and explainability of automated program generation moving forward. Our findings provide important guidelines for this goal.
{"title":"On the Reliability and Explainability of Language Models for Program Generation","authors":"Yue Liu, Chakkrit Tantithamthavorn, Yonghui Liu, Li Li","doi":"10.1145/3641540","DOIUrl":"https://doi.org/10.1145/3641540","journal":{"name":"ACM Transactions on Software Engineering and Methodology"},"publicationDate":"2024-01-18","publicationTypes":"Journal Article"}
Donato Clun, Donghwan Shin, Antonio Filieri, Domenico Bianculli
Models such as finite state automata are widely used to abstract the behavior of software systems by capturing the sequences of events observable during their execution. Nevertheless, models rarely exist in practice and, when they do, get easily outdated; moreover, manually building and maintaining models is costly and error-prone. As a result, a variety of model inference methods that automatically construct models from execution traces have been proposed to address these issues.
However, performing a systematic and reliable accuracy assessment of inferred models remains an open problem. Even when a reference model is given, most existing model accuracy assessment methods may return misleading and biased results. This is mainly due to their reliance on statistical estimators over a finite number of randomly generated traces, introducing avoidable uncertainty about the estimation and being sensitive to the parameters of the random trace generative process.
This paper addresses this problem by developing a systematic approach based on analytic combinatorics that minimizes bias and uncertainty in model accuracy assessment by replacing statistical estimation with deterministic accuracy measures. We experimentally demonstrate the consistency and applicability of our approach by assessing the accuracy of models inferred by state-of-the-art inference tools against reference models from established specification mining benchmarks.
{"title":"Rigorous Assessment of Model Inference Accuracy using Language Cardinality","authors":"Donato Clun, Donghwan Shin, Antonio Filieri, Domenico Bianculli","doi":"10.1145/3640332","DOIUrl":"https://doi.org/10.1145/3640332","url":null,"abstract":"<p>Models such as finite state automata are widely used to abstract the behavior of software systems by capturing the sequences of events observable during their execution. Nevertheless, models rarely exist in practice and, when they do, get easily outdated; moreover, manually building and maintaining models is costly and error-prone. As a result, a variety of model inference methods that automatically construct models from execution traces have been proposed to address these issues. </p><p>However, performing a systematic and reliable accuracy assessment of inferred models remains an open problem. Even when a reference model is given, most existing model accuracy assessment methods may return misleading and biased results. This is mainly due to their reliance on statistical estimators over a finite number of randomly generated traces, introducing avoidable uncertainty about the estimation and being sensitive to the parameters of the random trace generative process. </p><p>This paper addresses this problem by developing a systematic approach based on analytic combinatorics that minimizes bias and uncertainty in model accuracy assessment by replacing statistical estimation with deterministic accuracy measures. 
We experimentally demonstrate the consistency and applicability of our approach by assessing the accuracy of models inferred by state-of-the-art inference tools against reference models from established specification mining benchmarks.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"44 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139474727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shaiful Chowdhury, Gias Uddin, Hadi Hemmati, Reid Holmes
Fixing software bugs can be colossally expensive, especially if they are discovered in the later phases of the software development life cycle. As such, bug prediction has been a classic problem for the research community. As of now, the Google Scholar site generates ∼ 113,000 hits if searched with the “bug prediction” phrase. Despite this staggering effort by the research community, bug prediction research is criticized for not being decisively adopted in practice. A significant problem of the existing research is the granularity level (i.e., class/file level) at which bug prediction is historically studied. Practitioners find it difficult and time-consuming to locate bugs at the class/file level granularity. Consequently, method-level bug prediction has become popular in the last decade. We ask, are these method-level bug prediction models ready for industry use? Unfortunately, the answer is no. The reported high accuracies of these models dwindle significantly if we evaluate them in different realistic time-sensitive contexts. It may seem hopeless at first, but encouragingly, we show that future method-level bug prediction can be improved significantly. In general, we show how to reliably evaluate future method-level bug prediction models, and how to improve them by focusing on four different improvement avenues: building noise-free bug data, addressing concept drift, selecting similar training projects, and developing a mixture of models. Our findings are based on three publicly available method-level bug datasets, and a newly built bug dataset of 774,051 Java methods originating from 49 open-source software projects.
{"title":"Method-Level Bug Prediction: Problems and Promises","authors":"Shaiful Chowdhury, Gias Uddin, Hadi Hemmati, Reid Holmes","doi":"10.1145/3640331","DOIUrl":"https://doi.org/10.1145/3640331","url":null,"abstract":"<p>Fixing software bugs can be colossally expensive, especially if they are discovered in the later phases of the software development life cycle. As such, bug prediction has been a classic problem for the research community. As of now, the Google Scholar site generates ∼ 113,000 hits if searched with the “bug prediction” phrase. Despite this staggering effort by the research community, bug prediction research is criticized for not being decisively adopted in practice. A significant problem of the existing research is the granularity level (i.e., class/file level) at which bug prediction is historically studied. Practitioners find it difficult and time-consuming to locate bugs at the class/file level granularity. Consequently, method-level bug prediction has become popular in the last decade. We ask, <i>are these method-level bug prediction models ready for industry use?</i> Unfortunately, the answer is <i>no</i>. The reported high accuracies of these models dwindle significantly if we evaluate them in different realistic time-sensitive contexts. It may seem hopeless at first, but encouragingly, we show that future method-level bug prediction can be improved significantly. In general, we show how to reliably evaluate future method-level bug prediction models, and how to improve them by focusing on four different improvement avenues: building noise-free bug data, addressing concept drift, selecting similar training projects, and developing a mixture of models. 
Our findings are based on three publicly available method-level bug datasets, and a newly built bug dataset of 774,051 Java methods originating from 49 open-source software projects.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"110 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139464999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}