Empirical Software Engineering: latest articles, page 9

Hunting bugs: Towards an automated approach to identifying which change caused a bug through regression testing

IF 4.1 Zone 2 Computer Science Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-05-04 DOI: 10.1007/s10664-024-10479-z

Michel Maes-Bermejo, Alexander Serebrenik, Micael Gallego, Francisco Gortázar, Gregorio Robles, Jesús María González Barahona

Context

Finding code changes that introduced bugs is important both for practitioners and researchers, but doing it precisely is a manual, effort-intensive process. The perfect test method is a theoretical construct aimed at detecting Bug-Introducing Changes (BIC) through a theoretical perfect test. This perfect test always fails if the bug is present, and passes otherwise.

Objective

To explore a possible automatic operationalization of the perfect test method.

Method

To use regression tests as substitutes for the perfect test. For this, we transplant the regression tests to past snapshots of the code, and use them to identify the BIC, on a well-known collection of bugs from the Defects4J dataset.
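The search described in the Method paragraph can be sketched as follows. This is a minimal illustration, not the authors' tooling: `find_bic_candidates`, the three outcome labels, and the `run_test` callback are hypothetical names. It assumes a linear history and that the bug is absent from every snapshot up to the newest one on which the test passes.

```python
from typing import Callable, List, Sequence

# Hypothetical outcome labels for running the transplanted test on one snapshot.
PASS, FAIL, UNKNOWN = "pass", "fail", "unknown"  # UNKNOWN: test could not be built or run


def find_bic_candidates(
    commits: Sequence[str], run_test: Callable[[str], str]
) -> List[str]:
    """Return the commits that may have introduced the bug.

    `commits` is ordered oldest to newest and stops before the fix.
    One candidate means the BIC is identified precisely; several mean
    the test could not be run on some snapshots in between.
    """
    results = [run_test(c) for c in commits]
    # Newest snapshot on which the test still passes (bug assumed absent).
    last_pass = max((i for i, r in enumerate(results) if r == PASS), default=-1)
    # Oldest snapshot after that point on which the test fails (bug present).
    first_fail = next(
        (i for i in range(last_pass + 1, len(results)) if results[i] == FAIL),
        None,
    )
    if first_fail is None:
        return []  # the test never fails after the last pass: nothing to report
    return list(commits[last_pass + 1 : first_fail + 1])


# Usage with a made-up history: the test cannot run on c3 and fails from c4 on.
outcomes = {"c1": PASS, "c2": PASS, "c3": UNKNOWN, "c4": FAIL, "c5": FAIL}
candidates = find_bic_candidates(list(outcomes), outcomes.get)
print(candidates)  # ['c3', 'c4']
```

A single returned commit corresponds to the "identified precisely" case; a longer list corresponds to the case where only a set of candidates including the BIC can be reported.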

Results

From 809 bugs in the dataset, when running our operationalization of the perfect test method, for 95 of them the BIC was identified precisely and in the remaining 4 cases, a list of candidates including the BIC was provided.

Conclusions

We demonstrate that operationalizing the perfect test method through regression tests is feasible and can be completely automated in practice when tests can be transplanted to and run on past snapshots of the code. Implementing regression tests when a bug is fixed is considered good practice, so developers who follow it can effortlessly detect bug-introducing changes by using our operationalization of the perfect test method.


Citations: 0

VioDroid-Finder: automated evaluation of compliance and consistency for Android apps

IF 4.1 Zone 2 Computer Science Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-05-03 DOI: 10.1007/s10664-024-10470-8

Junren Chen, Cheng Huang, Jiaxuan Han

Rapid growth in the variety and quantity of apps makes it difficult for users to protect their privacy. Although regulations have been introduced and the Android ecosystem is constantly being improved, violations persist: privacy policies may not fully comply with regulations, and app behavior may not be fully consistent with privacy policies. To address these issues, this paper proposes an automated method called VioDroid-Finder for evaluating the compliance and consistency of Android apps. We first study existing common regulations and organize privacy policy content into 7 aspects (i.e., privacy categories); each privacy category imposes its own compliance rules on privacy policies. Second, we present a policy structure parser model based on a structure extraction/rebuilding method (which converts the unstructured text to an XML tree) and a subtitle similarity calculation algorithm. Third, we propose a violation analyzer that uses the BERT model to classify each sentence in the privacy policy; we collect existing issues and combine them with manual observations to define 6 types of violations, which are detected from the classification results. Then, we propose an inconsistency analyzer that uses static analysis to convert permissions, APIs, and GUI into a set of personal information; inconsistencies are detected by comparing that set with the personal information declared in the privacy policy. Finally, we evaluate 600 Chinese apps using the proposed method and detect many violations and inconsistencies, reflecting how widespread privacy violations currently are.
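The comparison step of the inconsistency analyzer can be pictured with a small set-based sketch. It is only an illustration under stated assumptions: the permission names are real Android permissions, but the mapping table, the information categories, and the function names `collected_info` and `find_inconsistencies` are hypothetical and do not come from the paper.

```python
from typing import Dict, Iterable, Set

# Illustrative mapping from Android permissions to personal-information types.
# The categories on the right are invented for this example.
PERMISSION_TO_INFO: Dict[str, str] = {
    "android.permission.ACCESS_FINE_LOCATION": "location",
    "android.permission.READ_CONTACTS": "contacts",
    "android.permission.CAMERA": "camera",
}


def collected_info(permissions: Iterable[str]) -> Set[str]:
    """Map the permissions an app requests to the personal information they expose."""
    return {PERMISSION_TO_INFO[p] for p in permissions if p in PERMISSION_TO_INFO}


def find_inconsistencies(collected: Set[str], declared: Set[str]) -> Dict[str, Set[str]]:
    """Compare what the app can access with what its privacy policy declares."""
    return {
        "undeclared": collected - declared,    # accessed but missing from the policy
        "not_observed": declared - collected,  # declared but not seen in the app
    }


# Made-up app: requests location and contacts, but the policy mentions location and camera.
app_permissions = [
    "android.permission.ACCESS_FINE_LOCATION",
    "android.permission.READ_CONTACTS",
]
policy_declares = {"location", "camera"}
report = find_inconsistencies(collected_info(app_permissions), policy_declares)
print(report)
```

In this sketch only permissions are mapped; the abstract states that APIs and GUI elements are also converted into personal information, which would extend the mapping step rather than the comparison.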


Citations: 0

How do annotations affect Java code readability?

IF 4.1 Zone 2 Computer Science Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-05-03 DOI: 10.1007/s10664-024-10460-w

Eduardo Guerra, Everaldo Gomes, Jeferson Ferreira, Igor Wiese, Phyllipe Lima, Marco Gerosa, Paulo Meirelles

Context

Code annotations have gained widespread popularity in programming languages, offering developers the ability to attach metadata to code elements to define custom behaviors. Many modern frameworks and APIs use annotations to keep integration less verbose and located nearer to the corresponding code element. Despite these advantages, practitioners’ anecdotal evidence suggests that annotations might negatively affect code readability.

Objective

To better understand this effect, this paper systematically investigates the relationship between code annotations and code readability.

Method

In a survey with software developers (n=332), we present 15 pairs of Java code snippets with and without code annotations. These pairs were designed considering five categories of annotation used in real-world Java frameworks and APIs. Survey participants selected the code snippet they considered more readable for each pair and answered an open question about how annotations affect the code’s readability.

Results

Preferences were scattered for all categories of annotation usage, revealing no consensus among participants. The answers remained spread out even when segregated by participants’ programming or annotation-related experience. Nevertheless, some participants showed a consistent preference for or against annotations across all categories, which may indicate a personal preference. Our qualitative analysis of the answers to the open-ended question revealed that participants often praised the impact of annotations on design, maintainability, and productivity but expressed contrasting views on understandability and code clarity.

Conclusions

Software developers and API designers can consider our results when deciding whether to use annotations, equipped with the insight that developers express contrasting views of the annotations’ impact on code readability.


Citations: 0

Patterns of multi-container composition for service orchestration with Docker Compose

IF 4.1 Zone 2 Computer Science Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-05-03 DOI: 10.1007/s10664-024-10462-8

Kalvin Eng, Abram Hindle, Eleni Stroulia

Software design patterns present general code solutions to common software design problems. Modern software systems rely heavily on containers for running their constituent service components. Yet, despite the prevalence of ready-to-use Docker service images that can participate in multi-container service compositions of applications, developers do not have much guidance on how to compose their own Docker service orchestrations. Thus, in this work, we curate a dataset of successful projects that employ Docker Compose as an orchestration tool to run multiple service containers; then, we engage in qualitative and quantitative analysis of Docker Compose configurations. The collection of data and analysis enables the identification and naming of repeating multi-container composition patterns that are used in numerous successful open-source projects, much like software design patterns. These patterns highlight how software systems are orchestrated in the real world and can serve as examples for anybody wishing to compose their own service orchestrations. These contributions also advance empirical research in software engineering patterns by providing evidence about how Docker Compose is used.


Citations: 0

Automatic bi-modal question title generation for Stack Overflow with prompt learning

IF 4.1 Zone 2 Computer Science Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-05-03 DOI: 10.1007/s10664-024-10466-4

Shaoyu Yang, Xiang Chen, Ke Liu, Guang Yang, Chi Yu

When drafting question posts for Stack Overflow, developers may not accurately summarize the core problems in the question titles, which can keep these questions from getting timely help. Therefore, improving the quality of question titles has attracted wide attention from researchers. An initial study aimed to automatically generate the titles by analyzing only the code snippets in the question body. However, this study ignored the helpful information in the corresponding problem descriptions. Therefore, we propose an approach, SOTitle+, that considers bi-modal information (i.e., the code snippets and the problem descriptions) in the question body. We formalize title generation for different programming languages as separate but related tasks and utilize multi-task learning to solve these tasks. We then fine-tune the pre-trained language model CodeT5 to automatically generate the titles. Unfortunately, the inconsistent inputs and optimization objectives between the pre-training task and our investigated task may make it hard for fine-tuning to fully exploit the knowledge of the pre-trained model. To solve this issue, SOTitle+ further prompt-tunes CodeT5 with hybrid prompts (i.e., a mixture of hard and soft prompts). To verify the effectiveness of SOTitle+, we construct a large-scale high-quality corpus from recent data dumps shared by Stack Overflow. Our corpus includes 179,119 high-quality question posts for six popular programming languages. Experimental results show that SOTitle+ can significantly outperform four state-of-the-art baselines in both automatic evaluation and human evaluation. In addition, our ablation studies confirm the effectiveness of the component settings of SOTitle+ (such as bi-modal information, prompt learning, hybrid prompts, and multi-task learning). Our work indicates that considering bi-modal information and prompt learning in Stack Overflow title generation is a promising exploration direction.


Citations: 0

An empirical study into the effects of transpilation on quantum circuit smells

IF 4.1 Zone 2 Computer Science Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-05-02 DOI: 10.1007/s10664-024-10461-9

Manuel De Stefano, Dario Di Nucci, Fabio Palomba, Andrea De Lucia

Quantum computing is a promising field that can solve complex problems beyond traditional computers’ capabilities. Developing high-quality quantum software applications, called quantum software engineering, has recently gained attention. However, quantum software development faces challenges related to code quality. A recent study found that many open-source quantum programs are affected by quantum-specific code smells, with long circuit being the most common. While the study provided relevant insights into the prevalence of code smells in quantum circuits, it did not explore the potential effect of transpilation, a necessary step for executing quantum computer programs, on the emergence of code smells. Indeed, transpilation might alter the characteristics used to detect the presence of a smell in a circuit. To address this limitation, we present a new study investigating the impact of transpilation on quantum-specific code smells and how different target gate sets affect the results. We conducted experiments on 17 open-source quantum programs alongside a set of 100 synthetic circuits. We found that transpilation can significantly alter the metrics that are used to detect code smells, even introducing smells into previously smell-free circuits, with the long circuit smell being the most susceptible to transpilation. Furthermore, the choice of the gate set significantly influences the presence and severity of code smells in transpiled circuits, highlighting the need for careful gate set selection to mitigate their impact. These findings have implications for circuit optimization and high-quality quantum software development. Further research is needed to understand the consequences of code smells and their potential impact on quantum computations, considering the characteristics and constraints of different gate sets and hardware platforms.
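Why transpilation can change a length-based smell metric is easy to see in a toy model. The sketch below is not the study's tooling and uses no quantum SDK: a circuit is just a list of gate names, the smell metric is simplified to a plain gate count, and `LONG_CIRCUIT_THRESHOLD` is a made-up value. The two decomposition identities themselves are standard (H equals RZ, SX, RZ up to a global phase; SWAP equals three CNOTs).

```python
from typing import Dict, List

# Decomposition rules for a target gate set {rz, sx, cx}.
RULES_RZ_SX_CX: Dict[str, List[str]] = {
    "h": ["rz", "sx", "rz"],     # Hadamard, up to a global phase
    "swap": ["cx", "cx", "cx"],  # SWAP as three CNOTs
}
# A second, hypothetical target gate set that supports h and swap natively.
RULES_NATIVE: Dict[str, List[str]] = {}

LONG_CIRCUIT_THRESHOLD = 5  # made-up threshold, for illustration only


def transpile(circuit: List[str], rules: Dict[str, List[str]]) -> List[str]:
    """Rewrite each gate with its decomposition; gates without a rule are kept."""
    out: List[str] = []
    for gate in circuit:
        out.extend(rules.get(gate, [gate]))
    return out


def is_long_circuit(circuit: List[str]) -> bool:
    """Simplified stand-in for a 'long circuit' check: count gates."""
    return len(circuit) > LONG_CIRCUIT_THRESHOLD


circuit = ["h", "cx", "swap", "h"]
for name, rules in [("rz/sx/cx", RULES_RZ_SX_CX), ("native", RULES_NATIVE)]:
    result = transpile(circuit, rules)
    print(name, len(circuit), "->", len(result), "long:", is_long_circuit(result))
```

With the first gate set the four-gate circuit grows to ten gates and crosses the illustrative threshold, while with the second it is unchanged, which mirrors the abstract's point that the choice of gate set influences whether a smell appears.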


Citations: 0

Comparative analysis of real issues in open-source machine learning projects

IF 4.1 Zone 2 Computer Science Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-05-02 DOI: 10.1007/s10664-024-10467-3

Tuan Dung Lai, Anj Simmons, Scott Barnett, Jean-Guy Schneider, Rajesh Vasa

Context

In the last decade of data-driven decision-making, Machine Learning (ML) systems reign supreme. Because of the different characteristics between ML and traditional Software Engineering systems, we do not know to what extent the issue-reporting needs are different, and to what extent these differences impact the issue resolution process.

Objective

We aim to compare the differences between ML and non-ML issues in open-source applied AI projects in terms of resolution time and size of fix. This research aims to enhance the predictability of maintenance tasks by providing valuable insights for issue reporting and task scheduling activities.

Method

We collect issue reports from GitHub repositories of open-source ML projects using an automatic approach, filter them using ML keywords and libraries, manually categorize them using an adapted deep learning bug taxonomy, and compare resolution time and fix size for ML and non-ML issues in a controlled sample.

Result

A total of 147 ML issues and 147 non-ML issues were collected for analysis. We found that ML issues take longer to resolve than non-ML issues; the median difference is 14 days. There is no significant difference in size of fix between ML and non-ML issues. No significant differences were found between ML issue categories in terms of resolution time or size of fix.
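The comparison of resolution times reported above can be illustrated with a minimal sketch. It assumes that "median difference" means the difference between the two group medians, which the abstract does not state explicitly. The function name `median_resolution_gap` and the sample values are hypothetical and are not data from the study.

```python
from statistics import median


def median_resolution_gap(ml_days, non_ml_days):
    """Return median(ML) - median(non-ML) resolution time, in days.

    Assumption: the gap is a difference of group medians; the study
    may have computed it differently.
    """
    return median(ml_days) - median(non_ml_days)


# Toy resolution times in days (hypothetical, NOT taken from the paper).
toy_ml_days = [2, 9, 20, 33, 60]
toy_non_ml_days = [1, 4, 8, 15, 30]

toy_gap = median_resolution_gap(toy_ml_days, toy_non_ml_days)
print(f"Median gap on toy data: {toy_gap} days")
```

A median is used rather than a mean because issue resolution times are typically skewed by a few very long-lived issues.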

Conclusion

Our study provided evidence that the life cycle for ML issues is stretched, and thus further work is required to identify the reason. The results also highlighted the need for future work to design custom tooling to support faster resolution of ML issues.



Citations: 0

CoRT: Transformer-based code representations with self-supervision by predicting reserved words for code smell detection

IF 4.1 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-04-08 DOI: 10.1007/s10664-024-10445-9

Amal Alazba, Hamoud Aljamaan, Mohammad Alshayeb

Context

Code smell detection is the process of identifying poorly designed and implemented code pieces. Machine learning-based approaches require enormous amounts of manually labeled data, which are costly and difficult to scale. Unsupervised semantic feature learning, or learning without manual annotation, is vital for effectively harvesting an enormous amount of available data.

Objective

The objective of this study is to propose a new code smell detection approach that utilizes self-supervised learning to learn intermediate representations without the need for labels and then fine-tune these representations on multiple tasks.

Method

We propose a Code Representation with Transformers (CoRT) to learn the semantic and structural features of the source code by training transformers to recognize masked reserved words that are applied to the code given as input. We empirically demonstrated that the defined proxy task provides a powerful method for learning semantic and structural features. We exhaustively evaluated our approach on four downstream tasks: detection of the Data Class, God Class, Feature Envy, and Long Method code smells. Moreover, we compare our results with those of two paradigms: supervised learning and a feature-based approach. Finally, we conducted a cross-project experiment to evaluate the generalizability of our method to unseen labeled data.
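The proxy task described above, predicting masked reserved words, can be sketched as a data-preparation step. This is only an illustration under stated assumptions: the `RESERVED` set is a small hypothetical subset of Java keywords, the regex tokenizer is a simplification, and every reserved word is masked, whereas the paper may mask only a sample. None of these names come from the CoRT implementation.

```python
import re

# Hypothetical subset of Java reserved words, for illustration only.
RESERVED = {
    "public", "private", "class", "static", "void", "int",
    "return", "if", "else", "for", "while", "new",
}
MASK = "[MASK]"

# Simplified tokenizer: identifiers/keywords, integers, or single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def mask_reserved_words(code):
    """Replace reserved words with MASK and keep the originals as labels.

    Returns (masked_tokens, labels); labels[i] holds the original word
    at masked positions and None everywhere else.
    """
    masked, labels = [], []
    for token in TOKEN_RE.findall(code):
        if token in RESERVED:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


snippet = "public int add(int a, int b) { return a + b; }"
masked_tokens, target_labels = mask_reserved_words(snippet)
print(" ".join(masked_tokens))
```

Because the labels are derived from the code itself, no manual annotation is needed to build training pairs. A real pipeline would use a proper lexer so that keywords inside string literals and comments are not masked.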

Results

The results indicate that the proposed method achieves high detection performance for code smells. For instance, on Data Class, CoRT achieved an F1 score of 88.08–99.4, an Area Under Curve (AUC) of 89.62–99.88, and a Matthews Correlation Coefficient (MCC) of 75.28–98.8; on God Class, it achieved an F1 score of 86.32–99.03, an AUC of 92.1–99.85, and an MCC of 76.15–98.09. Compared with the baseline model and the feature-based approach, CoRT achieved better detection performance and was highly capable of detecting code smells in unseen datasets.
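For readers unfamiliar with two of the reported metrics, the following sketch computes F1 and MCC from binary confusion-matrix counts using their standard definitions. The counts are hypothetical and do not reproduce the paper's results. The abstract's values appear to be on a 0–100 scale, so the fractions below would be multiplied by 100 to compare.

```python
from math import sqrt


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; returns 0.0 when undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for a smelly/non-smelly classifier (not from the paper).
tp, tn, fp, fn = 90, 85, 15, 10
print(f"F1  = {f1_score(tp, fp, fn):.4f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.4f}")
```

MCC uses all four cells of the confusion matrix, which makes it more informative than F1 when smelly and non-smelly classes are imbalanced.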

Conclusions

The proposed method has been shown to be effective in detecting class-level and method-level code smells.



Citations: 0

Taxonomy of inline code comment smells

IF 4.1 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-04-03 DOI: 10.1007/s10664-023-10425-5

Abstract

Code comments play a vital role in source code comprehension and software maintainability. It is common for developers to write comments to explain a code snippet, and commenting code is generally considered a good practice in software engineering. However, low-quality comments can have a detrimental effect on software quality or be ineffective for code understanding. This study aims to create a taxonomy of inline code comment smells and determine how frequently each smell type occurs in software projects. We conducted a multivocal literature review to define the initial taxonomy of inline comment smells. Afterward, we manually labeled 2447 inline comments from eight open-source projects, half of them Java projects and the other half Python projects. We created a taxonomy of 11 inline code comment smell types and found that the smells exist in both Java and Python projects to varying degrees. Moreover, we conducted an online survey with 41 software practitioners to learn their opinions on these smells and their impact on code comprehension and software maintainability. The survey respondents generally agreed with the taxonomy; however, they reported that some smell types might have a positive effect on code comprehension in certain scenarios. We also opened pull requests and issues fixing the comment smells in the sampled projects, of which 27% were accepted. We share our manually labeled dataset online and provide implications for software engineering practitioners, researchers, and educators.
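As an illustration of how inline comments might be gathered from Python projects before manual labeling, the sketch below uses the standard-library `tokenize` module. The abstract does not say which tool or definition of "inline comment" the authors used, so this is an assumption. The helper name `extract_inline_comments` is hypothetical, and it collects every `#` comment while ignoring docstrings.

```python
import io
import tokenize


def extract_inline_comments(source):
    """Return (line_number, comment_text) for each '#' comment in Python source.

    Docstrings are string literals, not COMMENT tokens, so they are skipped.
    """
    comments = []
    reader = io.StringIO(source).readline
    for token in tokenize.generate_tokens(reader):
        if token.type == tokenize.COMMENT:
            comments.append((token.start[0], token.string))
    return comments


sample = "x = 1  # set x\n# TODO: remove\ny = x + 1\n"
for line_number, text in extract_inline_comments(sample):
    print(line_number, text)
```

Using the tokenizer rather than a regular expression avoids mistaking a `#` inside a string literal for a comment.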



Citations: 0

ROBUST: 221 bugs in the Robot Operating System

IF 4.1 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-03-23 DOI: 10.1007/s10664-024-10440-0

Christopher S. Timperley, Gijs van der Hoorn, André Santos, Harshavardhan Deshpande, Andrzej Wąsowski

As robotic systems such as autonomous cars and delivery drones assume greater roles and responsibilities within society, the likelihood and impact of catastrophic software failure within those systems is increased. To aid researchers in the development of new methods to measure and assure the safety and quality of robotics software, we systematically curated a dataset of 221 bugs across 7 popular and diverse software systems implemented via the Robot Operating System (ROS). We produce historically accurate recreations of each of the 221 defective software versions in the form of Docker images, and use a grounded theory approach to examine and categorize their corresponding faults, failures, and fixes. Finally, we reflect on the implications of our findings and outline future research directions for the community.


Citations: 0