Empirical Software Engineering: latest articles, page 8

Cross-project defect prediction via semantic and syntactic encoding

IF 4.1 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-06-04 DOI: 10.1007/s10664-024-10495-z

Siyu Jiang, Yuwen Chen, Zhenhang He, Yunpeng Shang, Le Ma

Cross-Project Defect Prediction (CPDP) is a promising research field that focuses on detecting defects in projects with limited labeled data by utilizing prediction models trained on projects with abundant data. However, previous CPDP approaches based on Abstract Syntax Tree (AST) have often encountered challenges in effectively acquiring semantic and syntactic information, resulting in their limited ability to combine both productively. This issue arises primarily from the practice of flattening the AST into a linear sequence in many AST-based methods, leading to the loss of hierarchical syntactic structure and structural information within the code. Besides, other AST-based methods use a recursive way to traverse the tree-structured AST, which is susceptible to gradient vanishing. To alleviate these concerns, we introduce a novel CPDP method named defect prediction via Semantic and Syntactic Encoding (SSE) that enhances Zhang’s approach by encoding semantic and syntactic information while retaining and considering AST structure. Specifically, we perform pre-training on a large corpus using a language model to learn semantic information. Next, we present a new rule for splitting AST into subtrees to avoid vanishing gradients. Then, the absolute paths originating from the root node and leading to the leaf nodes are encoded as hierarchical syntactic information. Finally, we design an encoder to integrate syntactic information into semantic information and leverage Bi-directional Long-Short Term Memory to learn the entire tree representation for prediction. Experimental results on 12 benchmark projects illustrate that the SSE method we proposed surpasses current state-of-the-art methods.
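The root-to-leaf path idea in the abstract above can be illustrated with a minimal sketch. It uses Python's standard `ast` module purely for convenience; the abstract does not say which language or parser the SSE authors used, and the function name `root_to_leaf_paths` is hypothetical. The sketch only enumerates paths as lists of node-type names and does not reproduce the paper's subtree-splitting rule or its encoder.

```python
import ast


def root_to_leaf_paths(source):
    """Return the node-type names along every root-to-leaf path of the AST.

    Illustrative only: this is not the SSE encoder, just the path enumeration
    that such an encoder could take as input.
    """
    tree = ast.parse(source)
    paths = []

    def walk(node, prefix):
        current = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:
            # A node without children is a leaf, so the path is complete.
            paths.append(current)
        for child in children:
            walk(child, current)

    walk(tree, [])
    return paths


paths = root_to_leaf_paths("x = 1")
for path in paths:
    print(" -> ".join(path))
```

Each path keeps the full chain of ancestors, which is the hierarchical information that is lost when an AST is flattened into a single token sequence.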



Citations: 0

Demystifying code snippets in code reviews: a study of the OpenStack and Qt communities and a practitioner survey

IF 4.1 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-06-03 DOI: 10.1007/s10664-024-10484-2

Beiqi Zhang, Liming Fu, Peng Liang, Jiaxin Yu, Chong Wang

Code review is widely known as one of the best practices for software quality assurance in software development. In a typical code review process, reviewers check the code committed by developers to ensure its quality, and reviewers and developers communicate with each other in review comments to exchange necessary information. As a result, understanding the information in review comments is a prerequisite for reviewers and developers to conduct an effective code review. Code snippets, as a special form of code, can be used to convey necessary information in code reviews. For example, reviewers can use code snippets to make suggestions or elaborate their ideas to meet developers’ information needs in code reviews. However, little research has focused on the practices of providing code snippets in code reviews. To bridge this gap, we conduct a mixed-methods study to mine information and knowledge related to code snippets in code reviews, which can help practitioners and researchers better understand the use of code snippets in code review. Specifically, our study includes two phases: mining code review data and conducting a practitioner survey. In Phase 1, we conducted an exploratory study to mine code review data from two popular developer communities (i.e., OpenStack and Qt). We manually labelled 69,604 review comments and finally identified 3,213 review comments that contain code snippets. Based on the code review data collected, we analyzed the extent of using code snippets, the reviewers’ purposes of providing code snippets, the developers’ acceptance of code snippet suggestions, and the reasons that developers do not accept code snippet suggestions in code reviews. In Phase 2, we used an online questionnaire to survey practitioners from industry. By analyzing the 63 valid responses we received, we explored the scenarios in which reviewers provide code snippets, the developers’ attitudes towards code snippets, and the characteristics of code snippets developers expect reviewers to provide in code reviews. Our results show that: (1) code snippets are not frequently used in code reviews, and most of the code snippets are provided by reviewers rather than developers; (2) the purposes of reviewers providing code snippets in code reviews are Suggestion and Citation, of which Suggestion is the main purpose; (3) most developers would accept reviewers’ code snippet suggestions; (4) the most common reasons that developers do not accept reviewers’ code snippet suggestions in code reviews are a difference of opinion between developers and reviewers and a flawed suggestion from the reviewer; (5) reviewers often provide code snippets in code reviews when code is more illustrative than words; (6) most developers hold positive attitudes towards code snippet comments; and (7) most developers expect that code snippets in review comments are understandable and fit into the existing code. Our findings indicate that reviewers can provide code snippets in appropriate scenarios to meet developers’ specific information needs in code reviews, which facilitates and accelerates the code review process.


Citations: 0

Transformers and meta-tokenization in sentiment analysis for software engineering

IF 4.1 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-06-03 DOI: 10.1007/s10664-024-10468-2

Nathan Cassee, Andrei Agaronian, Eleni Constantinou, Nicole Novielli, Alexander Serebrenik

Sentiment analysis has been used to study aspects of software engineering, such as issue resolution, toxicity, and self-admitted technical debt. To address the peculiarities of software engineering texts, sentiment analysis tools often consider the specific technical lingo practitioners use. To further improve the application of sentiment analysis, there have been two recommendations: Using pre-trained transformer models to classify sentiment and replacing non-natural language elements with meta-tokens. In this work, we benchmark five different sentiment analysis tools (two pre-trained transformer models and three machine learning tools) on 2 gold-standard sentiment analysis datasets. We find that pre-trained transformers outperform the best machine learning tool on only one of the two datasets, and that even on that dataset the performance difference is a few percentage points. Therefore, we recommend that software engineering researchers should not just consider predictive performance when selecting a sentiment analysis tool because the best-performing sentiment analysis tools perform very similarly to each other (within 4 percentage points). Meanwhile, we find that meta-tokenization does not improve the predictive performance of sentiment analysis tools. Both of our findings can be used by software engineering researchers who seek to apply sentiment analysis tools to software engineering data.
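Meta-tokenization, as described in the abstract above, replaces non-natural-language elements with placeholder tokens before the text reaches a sentiment classifier. The sketch below is one possible reading of that preprocessing step. The regular expressions, the token names (`<CODE_BLOCK>`, `<INLINE_CODE>`, `<URL>`, `<MENTION>`) and the function `meta_tokenize` are all assumptions made for illustration and are not taken from the paper.

```python
import re

# Hypothetical patterns and meta-token names; the study's actual rules may differ.
# Order matters: fenced code is replaced before inline code, URLs before mentions.
_PATTERNS = [
    (re.compile(r"```.*?```", re.DOTALL), "<CODE_BLOCK>"),
    (re.compile(r"`[^`\n]+`"), "<INLINE_CODE>"),
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"@\w[\w-]*"), "<MENTION>"),
]


def meta_tokenize(text):
    """Replace code, URLs and user mentions with fixed placeholder tokens."""
    for pattern, token in _PATTERNS:
        text = pattern.sub(token, text)
    return text


example = meta_tokenize(
    "Thanks @alice, `foo()` crashes, see https://example.org/issue/1"
)
print(example)  # Thanks <MENTION>, <INLINE_CODE> crashes, see <URL>
```

Since the abstract reports that meta-tokenization did not improve predictive performance, a step like this is best treated as optional and evaluated on the target dataset.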



Citations: 0

Towards graph-anonymization of software analytics data: empirical study on JIT defect prediction

IF 4.1 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-06-01 DOI: 10.1007/s10664-024-10464-6

Akshat Malik, Bram Adams, Ahmed Hassan

As the usage of software analytics for understanding different organizational practices becomes prevalent, it is important that data for these practices is shared across different organizations to build a common understanding of software systems and processes. Yet, organizations are hesitant to share this data and trained models with one another due to concerns around privacy, e.g., because of the risk of reverse engineering the training data of the models. To facilitate data sharing, tabular anonymization techniques like MORPH, LACE and LACE2 have been proposed to provide privacy to defect prediction data. However, said techniques treat data points as individual elements, and lose the context between different features when performing anonymization. We study the effect of four graph anonymization techniques, i.e., Random Add/Delete, Random Switch, k-DA and Generalization, on the privacy score and performance in six large, long-lived projects. To measure privacy, we use the IPR metric, which is a measure of the inability of an attacker to extract information about sensitive attributes from the anonymized data. We find that all four graph anonymization techniques are able to provide privacy scores higher than 65% in all the datasets, while Random Add/Delete and Random Switch are even able to achieve privacy scores of 80% and greater in all datasets. For techniques achieving privacy scores of 65%, the AUC and Recall decreased by a median of 1.45% and 5.35%, respectively. For techniques with privacy scores of 80% or greater, the AUC and Recall of privatized models decreased by a median of 6.44% and 20.29%, respectively. The state-of-the-art tabular techniques like MORPH, LACE and LACE2 provide high privacy scores (89%-99%); however, they have a higher impact on performance with a median decrease of 21.15% in AUC and 80.34% in Recall. 
Furthermore, since privacy scores of 65% or greater are adequate for sharing, the graph anonymization techniques are able to provide more configurable results where one can make trade-offs between privacy and performance. When compared to unsupervised techniques like a JIT variant of ManualDown, the graph anonymization (GA) techniques perform comparably or significantly better for AUC, G-Mean and FPR metrics. Our work shows that graph anonymization can be an effective way of providing privacy while preserving model performance.
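To make one of the four techniques concrete, the sketch below shows a common textbook form of Random Add/Delete on an undirected graph: delete k randomly chosen edges and add k randomly chosen edges that did not exist before, so the edge count is preserved. This is an assumption about the general technique, not the study's implementation; the abstract does not describe how the authors build the graph from JIT defect data, and the function name, the toy graph and the parameter `k` are hypothetical.

```python
import random
from itertools import combinations


def random_add_delete(nodes, edges, k, seed=0):
    """Perturb an undirected graph by deleting k edges and adding k new ones.

    Edges are stored as frozensets so that (a, b) and (b, a) are the same edge.
    """
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    non_edges = [
        frozenset(pair)
        for pair in combinations(nodes, 2)
        if frozenset(pair) not in edge_set
    ]
    # Never try to remove or add more edges than are available.
    k = min(k, len(edge_set), len(non_edges))
    # Sort before sampling so the result is reproducible for a given seed.
    removed = rng.sample(sorted(edge_set, key=sorted), k)
    added = rng.sample(non_edges, k)
    return (edge_set - set(removed)) | set(added)


# Toy graph, purely illustrative.
nodes = ["a", "b", "c", "d", "e"]
original = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]}
perturbed = random_add_delete(nodes, original, k=2)
print(len(perturbed), len(perturbed & original))  # 4 2
```

A larger `k` hides more of the original structure but also distorts it more, which mirrors the privacy versus performance trade-off reported in the abstract.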



Citations: 0

Machine learning experiment management tools: a mixed-methods empirical study

IF 4.1 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-05-29 DOI: 10.1007/s10664-024-10444-w

Samuel Idowu, Osman Osman, Daniel Strüber, Thorsten Berger

Machine Learning (ML) experiment management tools support ML practitioners and software engineers when building intelligent software systems. By managing large numbers of ML experiments comprising many different ML assets, they facilitate not only engineering ML models and ML-enabled systems but also managing their evolution, for instance, tracing system behavior to concrete experiments when the model performance drifts. However, while ML experiment management tools have become increasingly popular, little is known about their effectiveness in practice, as well as their actual benefits and challenges. We present a mixed-methods empirical study of experiment management tools and the support they provide to users. First, our survey of 81 ML practitioners sought to determine the benefits and challenges of ML experiment management and of the existing tool landscape. Second, a controlled experiment with 15 student developers investigated the effectiveness of ML experiment management tools. We learned that 70% of our survey respondents perform ML experiments using specialized tools, while out of those who do not use such tools, 52% are unaware of experiment management tools or of their benefits. The controlled experiment showed that experiment management tools offer valuable support to users to systematically track and retrieve ML assets. Using ML experiment management tools reduced error rates and increased completion rates. By presenting a user’s perspective on experiment management tools, and the first controlled experiment in this area, we hope that our results foster the adoption of these tools in practice and direct tool builders and researchers to improve the tool landscape overall.



Citations: 0

Do Agile scaling approaches make a difference? an empirical comparison of team effectiveness across popular scaling approaches

IF 4.1 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-05-29 DOI: 10.1007/s10664-024-10481-5

Christiaan Verwijs, Daniel Russo

With the prevalent use of Agile methodologies, organizations are grappling with the challenge of scaling development across numerous teams. This has led to the emergence of diverse scaling strategies, from complex ones such as “SAFe”, to simpler methods such as “LeSS”, with some organizations devising their own approaches. While there have been multiple studies exploring the organizational challenges associated with different scaling approaches, so far, no one has compared these strategies based on empirical data derived from a uniform measure. This makes it hard to draw robust conclusions about how different scaling approaches affect Agile team effectiveness. Thus, the objective of this study is to assess the effectiveness of Agile teams across various scaling approaches, including “SAFe”, “LeSS”, “Scrum of Scrums”, and custom methods, as well as those not using scaling. This study focuses initially on responsiveness, stakeholder concern, continuous improvement, team autonomy, management approach, and overall team effectiveness, followed by an evaluation based on stakeholder satisfaction regarding value, responsiveness, and release frequency. To achieve this, we performed a comprehensive survey involving 15,078 members of 4,013 Agile teams to measure their effectiveness, combined with satisfaction surveys from 1,841 stakeholders of 529 of those teams. We conducted a series of inferential statistical analyses, including Analysis of Variance and multiple linear regression, to identify any significant differences, while controlling for team experience and organizational size. The findings of the study revealed some significant differences, but their magnitude and effect size were considered too negligible to have practical significance. 
In conclusion, the choice of Agile scaling strategy does not markedly influence team effectiveness, and organizations are advised to choose a method that best aligns with their previous experiences with Agile, organizational culture, and management style.
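The distinction the abstract draws between statistical significance and practical significance can be shown with a small worked example. The sketch below computes a one-way ANOVA F statistic and the eta-squared effect size by hand on synthetic scores. The three groups, their values and the function name `one_way_anova` are made up for illustration and are not the study's data or analysis code; the study also used multiple linear regression with controls, which is not reproduced here.

```python
from statistics import fmean


def one_way_anova(groups):
    """Return (F statistic, eta squared) for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    means = [fmean(g) for g in groups]
    # Variation explained by group membership.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation left inside each group.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, eta_sq


# Synthetic scores: three large groups whose means differ by only 0.05.
base = [4.0, 5.0, 6.0] * 1000
groups = [base, [v + 0.05 for v in base], [v + 0.10 for v in base]]

f_stat, eta_sq = one_way_anova(groups)
print(round(f_stat, 2), round(eta_sq, 4))  # 11.25 0.0025
```

With 2 and 8,997 degrees of freedom, an F value of about 11.25 is well above the critical value of roughly 3.0 at the 0.05 level, so the difference counts as statistically significant. Eta squared is about 0.0025, below the conventional 0.01 threshold for a small effect, so group membership explains almost none of the variance. Large samples make this pattern likely, which is consistent with the abstract's conclusion that the detected differences lack practical significance.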



Citations: 0

The broken windows theory applies to technical debt

IF 4.1, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-05-24 DOI: 10.1007/s10664-024-10456-6

William Levén, Hampus Broman, Terese Besker, Richard Torkar

Context:

The term technical debt (TD) describes the aggregation of sub-optimal solutions that serve to impede the evolution and maintenance of a system. Some claim that the broken windows theory (BWT), a concept borrowed from criminology, also applies to software development projects. The theory states that the presence of indications of previous crime (such as a broken window) will increase the likelihood of further criminal activity; TD could be considered the broken windows of software systems.

Objective:

To empirically investigate the causal relationship between the TD density of a system and the propensity of developers to introduce new TD during the extension of that system.

Method:

The study used a mixed-methods research strategy consisting of a controlled experiment with an accompanying survey and follow-up interviews. The experiment had a total of 29 developers of varying experience levels completing system extension tasks in already existing systems with high or low TD density.

Results:

The analysis revealed significant effects of TD level on the subjects’ tendency to re-implement (rather than reuse) functionality, choose non-descriptive variable names, and introduce other code smells identified by the software tool SonarQube, all with at least 95% credible intervals.

Conclusions:

Three separate significant results along with a validating qualitative result combine to form substantial evidence of the BWT’s existence in software engineering contexts. This study finds that existing TD can have a major impact on developers’ propensity to introduce new TD of various types during development.



Citations: 0

Two is better than one: digital siblings to improve autonomous driving testing

IF 4.1, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-05-17 DOI: 10.1007/s10664-024-10458-4

Matteo Biagiola, Andrea Stocco, Vincenzo Riccio, Paolo Tonella

Simulation-based testing represents an important step to ensure the reliability of autonomous driving software. In practice, when companies rely on third-party general-purpose simulators, either for in-house or outsourced testing, the generalizability of testing results to real autonomous vehicles is at stake. In this paper, we enhance simulation-based testing by introducing the notion of digital siblings—a multi-simulator approach that tests a given autonomous vehicle on multiple general-purpose simulators built with different technologies, that operate collectively as an ensemble in the testing process. We exemplify our approach on a case study focused on testing the lane-keeping component of an autonomous vehicle. We use two open-source simulators as digital siblings, and we empirically compare such a multi-simulator approach against a digital twin of a physical scaled autonomous vehicle on a large set of test cases. Our approach requires generating and running test cases for each individual simulator, in the form of sequences of road points. Then, test cases are migrated between simulators, using feature maps to characterize the exercised driving conditions. Finally, the joint predicted failure probability is computed, and a failure is reported only in cases of agreement among the siblings. Our empirical evaluation shows that the ensemble failure predictor by the digital siblings is superior to each individual simulator at predicting the failures of the digital twin. We discuss the findings of our case study and detail how our approach can help researchers interested in automated testing of autonomous driving software.
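The final step described above, combining the siblings' predictions and reporting a failure only when they agree, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the abstract does not say how the joint probability is computed, so the sketch assumes a simple product of per-simulator failure probabilities, and the function names and the 0.5 threshold are hypothetical.

```python
def joint_failure_probability(probabilities):
    """Combine per-simulator failure probabilities into one value.

    Assumption of this sketch: the joint probability is the product of
    the individual probabilities (i.e. the simulators are treated as
    independent). The paper may use a different combination rule.
    """
    joint = 1.0
    for p in probabilities:
        joint *= p
    return joint


def ensemble_reports_failure(probabilities, threshold=0.5):
    """Report a failure only if every sibling predicts one.

    `threshold` is a hypothetical cut-off above which a single
    simulator is considered to predict a failure.
    """
    return all(p >= threshold for p in probabilities)


# Hypothetical failure probabilities for one test case on two simulators.
agreeing = [0.9, 0.8]
disagreeing = [0.9, 0.2]

print(joint_failure_probability(agreeing))
print(ensemble_reports_failure(agreeing))
print(ensemble_reports_failure(disagreeing))
```

Requiring agreement trades recall for precision: a failure seen in only one simulator is treated as a possible simulator artifact rather than as evidence about the real vehicle.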



Citations: 0

Semantic matching in GUI test reuse

IF 4.1, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-05-09 DOI: 10.1007/s10664-023-10406-8

Farideh Khalili, Leonardo Mariani, Ali Mohebbi, Mauro Pezzè, Valerio Terragni

Reusing test cases across apps that share similar functionalities reduces both the effort required to produce useful test cases and the time needed to bring reliable apps to market. The main approaches to reusing test cases across apps combine different semantic matching and test generation algorithms to migrate test cases across Android apps. In this paper we define a general framework to evaluate the impact and effectiveness of different choices of semantic matching with Test Reuse approaches on migrating test cases across Android apps. We offer a thorough comparative evaluation of the many possible choices for the components of test migration processes. We propose an approach that combines the most effective choice for each component of the test migration process. We report the results of an experimental evaluation on 8,099 GUI events from 337 test configurations. The results attest to the prominent impact of semantic matching on test reuse. They indicate that sentence-level embedding techniques perform better than word-level ones. Surprisingly, they suggest a negligible impact of the corpus of documents used for building the word embedding model for the Semantic Matching Algorithm. They provide evidence that semantic matching of events of selected types performs better than semantic matching of events of all types. They show that the effectiveness of the overall Test Reuse approach depends on the characteristics of the test suites and apps. The replication package that we make publicly available online (https://star.inf.usi.ch/#/software-data/11) allows researchers and practitioners to refine the results with additional experiments and evaluate other choices for test reuse components.
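To make the idea of semantic matching concrete, the sketch below maps a GUI event label from a source app to the most similar widget label in a target app using cosine similarity over averaged word vectors (a word-level technique). It is only an illustration: the three-dimensional vectors in `TOY_VECTORS` are invented, and the function names `embed`, `cosine`, and `best_match` are hypothetical and not taken from the paper or its replication package.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# A real setup would load vectors from a trained embedding model.
TOY_VECTORS = {
    "login": (1.0, 0.0, 0.0),
    "sign": (0.9, 0.1, 0.0),
    "in": (0.8, 0.0, 0.2),
    "search": (0.0, 1.0, 0.0),
    "find": (0.1, 0.9, 0.0),
    "settings": (0.0, 0.0, 1.0),
}


def embed(label):
    """Average the vectors of the known words in a label."""
    vectors = [TOY_VECTORS[w] for w in label.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return (0.0, 0.0, 0.0)
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(source_label, target_labels):
    """Return the target label most similar to the source label."""
    source_vector = embed(source_label)
    return max(target_labels, key=lambda t: cosine(source_vector, embed(t)))


# Hypothetical widget labels in the target app.
targets = ["Sign in", "Find", "Settings"]
print(best_match("login", targets))
print(best_match("search", targets))
```

A sentence-level technique would replace `embed` with a model that encodes the whole label at once instead of averaging per-word vectors; the matching step itself stays the same.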



Citations: 0

Just-in-Time crash prediction for mobile apps

IF 4.1, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering

Pub Date : 2024-05-08 DOI: 10.1007/s10664-024-10455-7

Chathrie Wimalasooriya, Sherlock A. Licorish, Daniel Alencar da Costa, Stephen G. MacDonell

Just-In-Time (JIT) defect prediction aims to identify defects early, at commit time. Hence, developers can take precautions to avoid defects when the code changes are still fresh in their minds. However, the utility of JIT defect prediction has not been investigated in relation to crashes of mobile apps. We therefore conducted a multi-case study employing both quantitative and qualitative analysis. In the quantitative analysis, we used machine learning techniques for prediction. We collected 113 reliability-related metrics for about 30,000 commits from 14 Android apps and selected 14 important metrics for prediction. We found that both standard JIT metrics and static analysis warnings are important for JIT prediction of mobile app crashes. We further optimized prediction performance, comparing seven state-of-the-art defect prediction techniques with hyperparameter optimization. Our results showed that Random Forest is the best performing model with an AUC-ROC of 0.83. In our qualitative analysis, we manually analysed a sample of 642 commits and identified different types of changes that are common in crash-inducing commits. We explored whether different aspects of changes can be used as metrics in JIT models to improve prediction performance. We found these metrics improve the prediction performance significantly. Hence, we suggest considering static analysis warnings and Android-specific metrics to adapt standard JIT defect prediction models for a mobile context to predict crashes. Finally, we provide recommendations to bridge the gap between research and practice and point to opportunities for future research.
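The AUC-ROC figure reported above has a direct interpretation: it is the probability that a randomly chosen crash-inducing commit receives a higher predicted score than a randomly chosen clean commit. The sketch below computes it from that definition. The function name `auc_roc` and the example labels and scores are hypothetical; the study's own models and tooling are not reproduced here.

```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores.

    `labels` holds 1 for crash-inducing commits and 0 for clean ones;
    `scores` holds the model's predicted crash scores. Ties between a
    positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions for four commits (invented data).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc_roc(labels, scores))  # 3 of 4 positive/negative pairs are ranked correctly: 0.75
```

The pairwise loop is quadratic in the number of commits, which is acceptable for an illustration; a rank-based formulation gives the same value more efficiently on large data sets.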



Citations: 0