
Empirical Software Engineering: Latest Articles

Transformers and meta-tokenization in sentiment analysis for software engineering
IF 4.1; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING; Pub Date: 2024-06-03; DOI: 10.1007/s10664-024-10468-2
Nathan Cassee, Andrei Agaronian, Eleni Constantinou, Nicole Novielli, Alexander Serebrenik

Sentiment analysis has been used to study aspects of software engineering, such as issue resolution, toxicity, and self-admitted technical debt. To address the peculiarities of software engineering texts, sentiment analysis tools often consider the specific technical lingo practitioners use. To further improve the application of sentiment analysis, there have been two recommendations: using pre-trained transformer models to classify sentiment, and replacing non-natural language elements with meta-tokens. In this work, we benchmark five different sentiment analysis tools (two pre-trained transformer models and three machine learning tools) on two gold-standard sentiment analysis datasets. We find that pre-trained transformers outperform the best machine learning tool on only one of the two datasets, and that even on that dataset the performance difference is a few percentage points. Therefore, we recommend that software engineering researchers not consider predictive performance alone when selecting a sentiment analysis tool, because the best-performing tools perform very similarly to each other (within 4 percentage points). Meanwhile, we find that meta-tokenization does not improve the predictive performance of sentiment analysis tools. Both findings can be used by software engineering researchers who seek to apply sentiment analysis tools to software engineering data.
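
To make the idea of meta-tokenization concrete, here is a minimal Python sketch of replacing non-natural language elements with placeholder tokens before sentiment classification. The patterns and token names (`<URL>`, `<MENTION>`, and so on) are illustrative assumptions, not the set used in the paper.

```python
import re

# Hypothetical meta-token patterns (not taken from the paper).
# Order matters: code spans are replaced before URLs, mentions, and issue refs.
META_PATTERNS = [
    (re.compile(r"```.*?```", re.DOTALL), "<CODE_BLOCK>"),
    (re.compile(r"`[^`]+`"), "<INLINE_CODE>"),
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"(?<!\w)@\w+"), "<MENTION>"),
    (re.compile(r"(?<!\w)#\d+\b"), "<ISSUE_REF>"),
]


def meta_tokenize(text: str) -> str:
    """Replace non-natural-language elements with placeholder meta-tokens."""
    for pattern, token in META_PATTERNS:
        text = pattern.sub(token, text)
    return text


print(meta_tokenize("Thanks @alice, see https://example.com/x and #42: `foo()` fails"))
```

The output of `meta_tokenize` would then be passed to whichever sentiment analysis tool is being evaluated, in place of the raw text.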

Citations: 0
Towards graph-anonymization of software analytics data: empirical study on JIT defect prediction
IF 4.1; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING; Pub Date: 2024-06-01; DOI: 10.1007/s10664-024-10464-6
Akshat Malik, Bram Adams, Ahmed Hassan

As the usage of software analytics for understanding different organizational practices becomes prevalent, it is important that data for these practices is shared across different organizations to build a common understanding of software systems and processes. Yet, organizations are hesitant to share this data and trained models with one another due to concerns around privacy, e.g., because of the risk of reverse engineering the training data of the models. To facilitate data sharing, tabular anonymization techniques like MORPH, LACE and LACE2 have been proposed to provide privacy to defect prediction data. However, said techniques treat data points as individual elements, and lose the context between different features when performing anonymization. We study the effect of four graph anonymization techniques, i.e., Random Add/Delete, Random Switch, k-DA and Generalization, on the privacy score and performance in six large, long-lived projects. To measure privacy, we use the IPR metric, which is a measure of the inability of an attacker to extract information about sensitive attributes from the anonymized data. We find that all four graph anonymization techniques are able to provide privacy scores higher than 65% in all the datasets, while Random Add/Delete and Random Switch are even able to achieve privacy scores of 80% and greater in all datasets. For techniques achieving privacy scores of 65%, the AUC and Recall decreased by a median of 1.45% and 5.35%, respectively. For techniques with privacy scores of 80% or greater, the AUC and Recall of privatized models decreased by a median of 6.44% and 20.29%, respectively. The state-of-the-art tabular techniques like MORPH, LACE and LACE2 provide high privacy scores (89%-99%); however, they have a higher impact on performance, with a median decrease of 21.15% in AUC and 80.34% in Recall. Furthermore, since privacy scores of 65% or greater are adequate for sharing, the graph anonymization techniques are able to provide more configurable results where one can make trade-offs between privacy and performance. When compared to unsupervised techniques like a JIT variant of ManualDown, the graph anonymization (GA) techniques perform comparably or significantly better on the AUC, G-Mean and FPR metrics. Our work shows that graph anonymization can be an effective way of providing privacy while preserving model performance.
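
As an illustration of what a Random Add/Delete perturbation can look like, the Python sketch below removes k existing edges from an undirected graph and adds k edges that were not present, so the edge count stays the same. This follows the common formulation in the graph anonymization literature; how the paper builds its graphs from defect prediction data, and its exact variant of the technique, are not stated in the abstract, so the function and its parameters are assumptions.

```python
import random
from itertools import combinations


def random_add_delete(nodes, edges, k, seed=None):
    """Remove k random existing edges and add k random previously absent edges.

    Illustrative sketch only: edges are undirected and stored as frozensets.
    """
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    # Candidate "false" edges: node pairs that are not connected in the original graph.
    non_edges = [
        frozenset(pair)
        for pair in combinations(nodes, 2)
        if frozenset(pair) not in edge_set
    ]
    if k > len(edge_set) or k > len(non_edges):
        raise ValueError("k exceeds the number of edges or non-edges")
    # Sort before sampling so a fixed seed gives a reproducible result.
    removed = rng.sample(sorted(edge_set, key=sorted), k)
    added = rng.sample(non_edges, k)
    return (edge_set - set(removed)) | set(added)


nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
perturbed = random_add_delete(nodes, edges, k=2, seed=7)
print(len(perturbed))  # edge count is preserved
```

Larger values of k perturb more of the graph, which is one way the privacy versus performance trade-off mentioned above could be configured.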

Citations: 0
Machine learning experiment management tools: a mixed-methods empirical study
IF 4.1; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING; Pub Date: 2024-05-29; DOI: 10.1007/s10664-024-10444-w
Samuel Idowu, Osman Osman, Daniel Strüber, Thorsten Berger

Machine Learning (ML) experiment management tools support ML practitioners and software engineers when building intelligent software systems. By managing large numbers of ML experiments comprising many different ML assets, they not only facilitate engineering ML models and ML-enabled systems, but also help manage their evolution, for instance by tracing system behavior to concrete experiments when the model performance drifts. However, while ML experiment management tools have become increasingly popular, little is known about their effectiveness in practice, or about their actual benefits and challenges. We present a mixed-methods empirical study of experiment management tools and the support they provide to users. First, our survey of 81 ML practitioners sought to determine the benefits and challenges of ML experiment management and of the existing tool landscape. Second, a controlled experiment with 15 student developers investigated the effectiveness of ML experiment management tools. We learned that 70% of our survey respondents perform ML experiments using specialized tools, while out of those who do not use such tools, 52% are unaware of experiment management tools or of their benefits. The controlled experiment showed that experiment management tools offer valuable support to users to systematically track and retrieve ML assets. Using ML experiment management tools reduced error rates and increased completion rates. By presenting a user’s perspective on experiment management tools, and the first controlled experiment in this area, we hope that our results foster the adoption of these tools in practice and direct tool builders and researchers to improve the tool landscape overall.

Citations: 0
Do Agile scaling approaches make a difference? An empirical comparison of team effectiveness across popular scaling approaches
IF 4.1; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING; Pub Date: 2024-05-29; DOI: 10.1007/s10664-024-10481-5
Christiaan Verwijs, Daniel Russo

With the prevalent use of Agile methodologies, organizations are grappling with the challenge of scaling development across numerous teams. This has led to the emergence of diverse scaling strategies, from complex ones such as "SAFe" to simpler methods such as "LeSS", with some organizations devising their own approaches. While multiple studies have explored the organizational challenges associated with different scaling approaches, so far no one has compared these strategies based on empirical data derived from a uniform measure. This makes it hard to draw robust conclusions about how different scaling approaches affect Agile team effectiveness. Thus, the objective of this study is to assess the effectiveness of Agile teams across various scaling approaches, including "SAFe", "LeSS", "Scrum of Scrums", and custom methods, as well as those not using scaling. This study focuses initially on responsiveness, stakeholder concern, continuous improvement, team autonomy, management approach, and overall team effectiveness, followed by an evaluation based on stakeholder satisfaction regarding value, responsiveness, and release frequency. To achieve this, we performed a comprehensive survey involving 15,078 members of 4,013 Agile teams to measure their effectiveness, combined with satisfaction surveys from 1,841 stakeholders of 529 of those teams. We conducted a series of inferential statistical analyses, including Analysis of Variance and multiple linear regression, to identify any significant differences, while controlling for team experience and organizational size. The findings of the study revealed some statistically significant differences, but their magnitude and effect size were considered too small to have practical significance. In conclusion, the choice of Agile scaling strategy does not markedly influence team effectiveness, and organizations are advised to choose a method that best aligns with their previous experiences with Agile, organizational culture, and management style.
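
The distinction between statistical significance and effect size can be illustrated with eta squared, the share of total variance explained by group membership in a one-way layout. The Python sketch below is only an illustration: the abstract does not say which effect size measure the authors used, and the group labels and scores are made up.

```python
from statistics import mean


def eta_squared(groups):
    """Eta squared = SS_between / SS_total for a one-way layout.

    `groups` is a list of lists of scores, one inner list per group.
    Assumes the scores are not all identical (otherwise SS_total is zero).
    """
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_between / ss_total


# Made-up effectiveness scores for three hypothetical scaling approaches.
scores = [
    [5.0, 5.5, 6.0, 5.2],
    [5.1, 5.6, 6.1, 5.3],
    [5.0, 5.4, 6.0, 5.3],
]
print(round(eta_squared(scores), 4))
```

With samples as large as those in the study, even a very small eta squared can accompany a significant test result, which is why effect size is reported alongside significance.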

Citations: 0
The broken windows theory applies to technical debt
IF 4.1; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING; Pub Date: 2024-05-24; DOI: 10.1007/s10664-024-10456-6
William Levén, Hampus Broman, Terese Besker, Richard Torkar

Context:

The term technical debt (TD) describes the aggregation of sub-optimal solutions that serve to impede the evolution and maintenance of a system. Some claim that the broken windows theory (BWT), a concept borrowed from criminology, also applies to software development projects. The theory states that the presence of indications of previous crime (such as a broken window) will increase the likelihood of further criminal activity; TD could be considered the broken windows of software systems.

Objective:

To empirically investigate the causal relationship between the TD density of a system and the propensity of developers to introduce new TD during the extension of that system.

Method:

The study used a mixed-methods research strategy consisting of a controlled experiment with an accompanying survey and follow-up interviews. The experiment had a total of 29 developers of varying experience levels completing system extension tasks in already existing systems with high or low TD density.

Results:

The analysis revealed significant effects of TD level on the subjects’ tendency to re-implement (rather than reuse) functionality, choose non-descriptive variable names, and introduce other code smells identified by the software tool SonarQube, all with at least 95% credible intervals.

Conclusions:

Three separate significant results along with a validating qualitative result combine to form substantial evidence of the BWT’s existence in software engineering contexts. This study finds that existing TD can have a major impact on developers’ propensity to introduce new TD of various types during development.

Citations: 0
Two is better than one: digital siblings to improve autonomous driving testing
IF 4.1; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING; Pub Date: 2024-05-17; DOI: 10.1007/s10664-024-10458-4
Matteo Biagiola, Andrea Stocco, Vincenzo Riccio, Paolo Tonella

Simulation-based testing represents an important step to ensure the reliability of autonomous driving software. In practice, when companies rely on third-party general-purpose simulators, either for in-house or outsourced testing, the generalizability of testing results to real autonomous vehicles is at stake. In this paper, we enhance simulation-based testing by introducing the notion of digital siblings: a multi-simulator approach that tests a given autonomous vehicle on multiple general-purpose simulators built with different technologies, which operate collectively as an ensemble in the testing process. We exemplify our approach in a case study focused on testing the lane-keeping component of an autonomous vehicle. We use two open-source simulators as digital siblings, and we empirically compare such a multi-simulator approach against a digital twin of a physical scaled autonomous vehicle on a large set of test cases. Our approach requires generating and running test cases for each individual simulator, in the form of sequences of road points. Then, test cases are migrated between simulators, using feature maps to characterize the exercised driving conditions. Finally, the joint predicted failure probability is computed, and a failure is reported only in cases of agreement among the siblings. Our empirical evaluation shows that the ensemble failure predictor by the digital siblings is superior to each individual simulator at predicting the failures of the digital twin. We discuss the findings of our case study and detail how our approach can help researchers interested in automated testing of autonomous driving software.
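
The final step, reporting a failure only when the siblings agree, can be sketched in a few lines of Python. The function name, the 0.5 threshold, and the use of a simple product as the joint probability (which treats the two simulators' predictions as independent) are all assumptions made for illustration; the abstract does not specify how the joint probability is computed.

```python
def ensemble_failure(p_a: float, p_b: float, threshold: float = 0.5):
    """Combine failure probabilities predicted on two simulators.

    Returns (joint_probability, failure_reported). A failure is reported
    only if both siblings individually predict one (agreement).
    """
    joint = p_a * p_b  # assumption: predictions treated as independent
    agree_on_failure = p_a >= threshold and p_b >= threshold
    return joint, agree_on_failure


# Both siblings predict a failure, so it is reported.
print(ensemble_failure(0.9, 0.8))
# The siblings disagree, so no failure is reported.
print(ensemble_failure(0.9, 0.2))
```

Requiring agreement trades some sensitivity for fewer simulator-specific false alarms, which matches the motivation of not relying on a single general-purpose simulator.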

Citations: 0
Semantic matching in GUI test reuse
IF 4.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-05-09 DOI: 10.1007/s10664-023-10406-8
Farideh Khalili, Leonardo Mariani, Ali Mohebbi, Mauro Pezzè, Valerio Terragni

Reusing test cases across apps that share similar functionalities reduces both the effort required to produce useful test cases and the time needed to bring reliable apps to market. The main approaches to reusing test cases across apps combine different semantic matching and test generation algorithms to migrate test cases across Android apps. In this paper we define a general framework to evaluate the impact and effectiveness of different choices of semantic matching with Test Reuse approaches on migrating test cases across Android apps. We offer a thorough comparative evaluation of the many possible choices for the components of test migration processes. We propose an approach that combines the most effective choice for each component of the test migration process. We report the results of an experimental evaluation on 8,099 GUI events from 337 test configurations. The results attest to the prominent impact of semantic matching on test reuse. They indicate that sentence-level embedding techniques perform better than word-level ones. They surprisingly suggest a negligible impact of the corpus of documents used for building the word embedding model for the Semantic Matching Algorithm. They provide evidence that semantic matching of events of selected types performs better than semantic matching of events of all types. They show that the effectiveness of the overall Test Reuse approach depends on the characteristics of the test suites and apps. The replication package that we make publicly available online (https://star.inf.usi.ch/#/software-data/11) allows researchers and practitioners to refine the results with additional experiments and evaluate other choices for test reuse components.
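
As a rough illustration of what semantic matching of GUI events involves, the sketch below pairs a source-app event label with the most similar label in a target app using cosine similarity. It is not the authors' implementation: the function names (`embed`, `cosine`, `best_match`), the threshold value, and the example labels are all hypothetical, and a bag-of-words count vector stands in for the word- or sentence-level embedding models that the abstract compares.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(source_event, target_events, threshold=0.5):
    """Return the target label most similar to source_event, or None
    when no candidate reaches the (assumed) similarity threshold."""
    src = embed(source_event)
    scored = [(cosine(src, embed(t)), t) for t in target_events]
    score, target = max(scored, default=(0.0, None))
    return target if score >= threshold else None


if __name__ == "__main__":
    # Hypothetical labels from a source test case and a target app.
    print(best_match("add new task", ["create task", "open settings", "add task"]))
```

Replacing `embed` with a real word- or sentence-embedding model is the kind of component choice the evaluation framework described above is meant to compare.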

Citations: 0
Just-in-Time crash prediction for mobile apps
IF 4.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-05-08 DOI: 10.1007/s10664-024-10455-7
Chathrie Wimalasooriya, Sherlock A. Licorish, Daniel Alencar da Costa, Stephen G. MacDonell

Just-In-Time (JIT) defect prediction aims to identify defects early, at commit time. Hence, developers can take precautions to avoid defects when the code changes are still fresh in their minds. However, the utility of JIT defect prediction has not been investigated in relation to crashes of mobile apps. We therefore conducted a multi-case study employing both quantitative and qualitative analysis. In the quantitative analysis, we used machine learning techniques for prediction. We collected 113 reliability-related metrics for about 30,000 commits from 14 Android apps and selected 14 important metrics for prediction. We found that both standard JIT metrics and static analysis warnings are important for JIT prediction of mobile app crashes. We further optimized prediction performance, comparing seven state-of-the-art defect prediction techniques with hyperparameter optimization. Our results showed that Random Forest is the best performing model with an AUC-ROC of 0.83. In our qualitative analysis, we manually analysed a sample of 642 commits and identified different types of changes that are common in crash-inducing commits. We explored whether different aspects of changes can be used as metrics in JIT models to improve prediction performance. We found these metrics improve the prediction performance significantly. Hence, we suggest considering static analysis warnings and Android-specific metrics to adapt standard JIT defect prediction models for a mobile context to predict crashes. Finally, we provide recommendations to bridge the gap between research and practice and point to opportunities for future research.
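
For readers unfamiliar with the AUC-ROC metric quoted above, it can be read as the probability that a randomly chosen positive example (here, a crash-inducing commit) receives a higher predicted score than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition. It is a minimal illustration rather than the study's evaluation code; the function name `auc_roc` and the toy labels and scores are made up for the example.

```python
def auc_roc(labels, scores):
    """AUC-ROC from its pairwise definition: the fraction of
    (positive, negative) pairs in which the positive example is
    scored higher, with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: 1 = crash-inducing commit, 0 = clean commit.
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.4, 0.1]
    print(auc_roc(labels, scores))  # 3 of the 4 pairs are ranked correctly: 0.75
```

The quadratic loop is fine for small samples; for tens of thousands of commits a rank-based formulation or a library routine would be the more practical choice.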

Citations: 0
Analyzing and revivifying function signature inference using deep learning
IF 4.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-05-08 DOI: 10.1007/s10664-024-10453-9
Yan Lin, Trisha Singhal, Debin Gao, David Lo

Function signatures play an important role in binary analysis and security enhancement, with typical examples in bug finding and control-flow integrity enforcement. However, recovering function signatures by static binary analysis is challenging because information vital for such recovery is stripped off during compilation. Although function signature recovery using deep learning (DL) has been proposed to handle such challenges, the reported accuracy is low for binaries compiled with optimizations. In this paper, we first perform a systematic study to quantify the extent to which compiler optimizations (negatively) impact the accuracy of existing DL techniques based on Recurrent Neural Networks (RNNs) for function signature recovery. Our experiments show that the accuracy of the state-of-the-art DL technique drops from 98.7% to 87.7% when it is trained and tested on optimized binaries. With the help of saliency maps, we further investigate the types of instructions that the existing RNN model deems most important in inferring function signatures. The results show that the existing RNN model mistakenly relies on non-argument-accessing instructions to infer the number of arguments, especially when dealing with optimized binaries. Finally, we identify specific weaknesses in such existing approaches and propose an enhanced DL approach named ReSIL that incorporates compiler-optimization-specific domain knowledge into the learning process. Our experimental results show that ReSIL significantly improves the accuracy and F1 score in inferring function signatures; for example, accuracy in inferring the number of arguments for callees compiled with optimization flag O1 rises from 84.83% to 92.68%. Meanwhile, ReSIL correctly treats the argument-accessing instructions as the most important ones for the inference. We also demonstrate the security implications of ReSIL for Control-Flow Integrity enforcement in stopping potential Counterfeit Object-Oriented Programming (COOP) attacks.

Citations: 0
Ethics in AI through the practitioner’s view: a grounded theory literature review
IF 4.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-05-06 DOI: 10.1007/s10664-024-10465-5
Aastha Pant, Rashina Hoda, Chakkrit Tantithamthavorn, Burak Turhan

The term ethics is widely used, explored, and debated in the context of developing Artificial Intelligence (AI) based software systems. In recent years, numerous incidents have raised the profile of ethical issues in AI development and led to public concerns about the proliferation of AI technology in our everyday lives. But what do we know about the views and experiences of those who develop these systems – the AI practitioners? We conducted a grounded theory literature review (GTLR) of 38 primary empirical studies that included AI practitioners’ views on ethics in AI and analysed them to derive five categories: practitioner awareness, perception, need, challenge, and approach. These are underpinned by multiple codes and concepts that we explain with evidence from the included studies. We present a taxonomy of ethics in AI from practitioners’ viewpoints to assist AI practitioners in identifying and understanding the different aspects of AI ethics. The taxonomy provides a landscape view of the key aspects that concern AI practitioners when it comes to ethics in AI. We also share an agenda for future research studies and recommendations for practitioners, managers, and organisations to help in their efforts to better consider and implement ethics in AI.

Citations: 0