Simplifying software compliance: AI technologies in drafting technical documentation for the AI Act.
Pub Date: 2025-01-01 | Epub Date: 2025-04-02 | DOI: 10.1007/s10664-025-10645-x
Francesco Sovrano, Emmie Hine, Stefano Anzolut, Alberto Bacchelli
The European AI Act has introduced specific technical documentation requirements for AI systems. Compliance with them is challenging due to the need for advanced knowledge of both legal and technical aspects, which is rare among software developers and legal professionals. Consequently, small and medium-sized enterprises may face high costs in meeting these requirements. In this study, we explore how contemporary AI technologies, including ChatGPT and an existing compliance tool (DoXpert), can aid software developers in creating technical documentation that complies with the AI Act. We specifically demonstrate how these AI tools can identify gaps in existing documentation according to the provisions of the AI Act. Using open-source high-risk AI systems as case studies, we collaborated with legal experts to evaluate how closely tool-generated assessments align with expert opinions. Findings show partial alignment, important issues with ChatGPT (3.5 and 4), and a moderate (and statistically significant) correlation between DoXpert and expert judgments, according to the Rank Biserial Correlation analysis. Nonetheless, these findings underscore the potential of AI to combine with human analysis and alleviate the compliance burden, supporting the broader goal of fostering responsible and transparent AI development under emerging regulatory frameworks.
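The abstract reports a moderate rank-biserial correlation between DoXpert's assessments and expert judgments. As an illustration of that statistic only (not the authors' actual analysis pipeline), the sketch below assumes hypothetical per-requirement data: expert severity ratings split by whether the tool flagged a documentation gap, with the rank-biserial correlation derived from a Mann-Whitney U test using SciPy.

```python
# Illustrative sketch (not the paper's code): rank-biserial correlation between
# a tool's per-requirement gap flags and expert severity ratings.
# Hypothetical data; requires scipy (pip install scipy).
from scipy.stats import mannwhitneyu

# Expert severity ratings for requirements the tool flagged vs. did not flag (hypothetical).
expert_when_tool_flagged = [3, 4, 2, 5, 4, 3]
expert_when_tool_silent = [1, 2, 1, 3, 2, 1, 2]

u_stat, p_value = mannwhitneyu(expert_when_tool_flagged,
                               expert_when_tool_silent,
                               alternative="two-sided")

n1, n2 = len(expert_when_tool_flagged), len(expert_when_tool_silent)
# Rank-biserial correlation derived from U (Wendt's formula): r = 2U/(n1*n2) - 1.
rank_biserial = 2 * u_stat / (n1 * n2) - 1

print(f"U = {u_stat:.1f}, p = {p_value:.3f}, rank-biserial r = {rank_biserial:.2f}")
```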
{"title":"Simplifying software compliance: AI technologies in drafting technical documentation for the AI Act.","authors":"Francesco Sovrano, Emmie Hine, Stefano Anzolut, Alberto Bacchelli","doi":"10.1007/s10664-025-10645-x","DOIUrl":"10.1007/s10664-025-10645-x","url":null,"abstract":"<p><p>The European AI Act has introduced specific technical documentation requirements for AI systems. Compliance with them is challenging due to the need for advanced knowledge of both legal and technical aspects, which is rare among software developers and legal professionals. Consequently, small and medium-sized enterprises may face high costs in meeting these requirements. In this study, we explore how contemporary AI technologies, including ChatGPT and an existing compliance tool (DoXpert), can aid software developers in creating technical documentation that complies with the AI Act. We specifically demonstrate how these AI tools can identify gaps in existing documentation according to the provisions of the AI Act. Using open-source high-risk AI systems as case studies, we collaborated with legal experts to evaluate how closely tool-generated assessments align with expert opinions. Findings show partial alignment, important issues with ChatGPT (3.5 and 4), and a moderate (and statistically significant) correlation between DoXpert and expert judgments, according to the Rank Biserial Correlation analysis. Nonetheless, these findings underscore the potential of AI to combine with human analysis and alleviate the compliance burden, supporting the broader goal of fostering responsible and transparent AI development under emerging regulatory frameworks.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"30 3","pages":"91"},"PeriodicalIF":3.5,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11965209/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143794942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the effects of program slicing for vulnerability detection during code inspection.
Pub Date: 2025-01-01 | Epub Date: 2025-04-05 | DOI: 10.1007/s10664-025-10636-y
Aurora Papotti, Katja Tuma, Fabio Massacci
Slicing is a fault localization technique that has been proposed to support debugging and program comprehension. Yet, its empirical effectiveness during code inspection by humans has received limited attention. The goal of our study is two-fold. First, we aim to define what it means for a code reviewer to identify the vulnerable lines correctly. Second, we investigate whether reducing the number of to-be-inspected lines by method-level slicing supports code reviewers in detecting security vulnerabilities. We propose a novel approach based on the notion of a δ-neighborhood (intuitively based on the idea of the context size of the command git diff) to define correctly identified lines. Then, we conducted a multi-year controlled experiment (2017-2023) in which MSc students attending security courses (n = 236) were tasked with identifying vulnerable lines in original or sliced Java files from Apache Tomcat. We provide perfect seed lines for a slicing algorithm to control for confounding factors. Each treatment differs in the pair (Vulnerability, Original/Sliced) with a balanced design with vulnerabilities from the OWASP Top 10 2017: A1 (Injection), A5 (Broken Access Control), A6 (Security Misconfiguration), and A7 (Cross-Site Scripting). To generate smaller slices for human consumption, we used a variant of intra-procedural thin slicing. We report the results for δ = 0, which corresponds to exactly matching the vulnerable ground-truth lines, and δ = 3, which represents the scenario of identifying the vulnerable area. For both cases, we found that slicing helps in 'finding something' (the participant has found at least some vulnerable lines) as opposed to 'finding nothing'. For δ = 0, analyzing a slice and analyzing the original file are statistically equivalent from the perspective of lines found by those who found something. With δ = 3, slicing helps to find more vulnerabilities compared to analyzing an original file, as we would normally expect. Given the type of population, additional experiments are necessary before the results can be generalized to experienced developers.
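To make the δ-neighborhood idea concrete, here is a minimal sketch (my reading of the abstract, not the paper's exact definition): a reviewer-reported line counts as correct if it lies within δ lines of some ground-truth vulnerable line, mirroring the context window of git diff. Function names, variable names, and example line numbers are hypothetical.

```python
# Illustrative sketch of a delta-neighborhood match between reviewer-reported
# lines and ground-truth vulnerable lines (hypothetical, inspired by the abstract).

def correctly_identified(reported_lines, ground_truth_lines, delta=3):
    """Return the reported line numbers lying within +/- delta of any ground-truth line."""
    hits = set()
    for reported in reported_lines:
        if any(abs(reported - truth) <= delta for truth in ground_truth_lines):
            hits.add(reported)
    return hits

# Hypothetical usage: delta = 0 demands exact matches, delta = 3 accepts the vulnerable area.
ground_truth = {120, 121, 145}
reported = {118, 121, 200}
print(correctly_identified(reported, ground_truth, delta=0))  # {121}
print(correctly_identified(reported, ground_truth, delta=3))  # {118, 121}
```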
{"title":"On the effects of program slicing for vulnerability detection during code inspection.","authors":"Aurora Papotti, Katja Tuma, Fabio Massacci","doi":"10.1007/s10664-025-10636-y","DOIUrl":"10.1007/s10664-025-10636-y","url":null,"abstract":"<p><p>Slicing is a fault localization technique that has been proposed to support debugging and program comprehension. Yet, its empirical effectiveness during code inspection by humans has received limited attention. The goal of our study is two-fold. First, we aim to define what it means for a code reviewer to identify the vulnerable lines correctly. Second, we investigate whether reducing the number of to-be-inspected lines by method-level slicing supports code reviewers in detecting security vulnerabilities. We propose a novel approach based on the notion of a <math><mi>δ</mi></math> -neighborhood (intuitively based on the idea of the context size of the command git diff) to define correctly identified lines. Then, we conducted a multi-year controlled experiment (2017-2023) in which MSc students attending security courses ( <math><mrow><mi>n</mi> <mo>=</mo> <mn>236</mn></mrow> </math> ) were tasked with identifying vulnerable lines in original or sliced Java files from Apache Tomcat. We provide perfect seed lines for a slicing algorithm to control for confounding factors. Each treatment differs in the pair (Vulnerability, Original/Sliced) with a balanced design with vulnerabilities from the OWASP Top 10 2017: A1 (Injection), A5 (Broken Access Control), A6 (Security Misconfiguration), and A7 (Cross-Site Scripting). To generate smaller slices for human consumption, we used a variant of intra-procedural thin slicing. We report the results for <math><mrow><mi>δ</mi> <mo>=</mo> <mn>0</mn></mrow> </math> which corresponds to exactly matching the vulnerable ground truth lines, and <math><mrow><mi>δ</mi> <mo>=</mo> <mn>3</mn></mrow> </math> which represents the scenario of identifying the vulnerable area. For both cases, we found that slicing helps in 'finding something' (the participant has found at least some vulnerable lines) as opposed to 'finding nothing'. For the case of <math><mrow><mi>δ</mi> <mo>=</mo> <mn>0</mn></mrow> </math> analyzing a slice and analyzing the original file are statistically equivalent from the perspective of lines found by those who found something. With <math><mrow><mi>δ</mi> <mo>=</mo> <mn>3</mn></mrow> </math> slicing helps to find more vulnerabilities compared to analyzing an original file, as we would normally expect. Given the type of population, additional experiments are necessary to be generalized to experienced developers.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"30 3","pages":"93"},"PeriodicalIF":3.5,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11972194/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143802692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reinforcement learning for online testing of autonomous driving systems: a replication and extension study.
Pub Date: 2025-01-01 | Epub Date: 2024-11-05 | DOI: 10.1007/s10664-024-10562-5
Luca Giamattei, Matteo Biagiola, Roberto Pietrantuono, Stefano Russo, Paolo Tonella
In a recent study, Reinforcement Learning (RL), used in combination with many-objective search, has been shown to outperform alternative techniques (random search and many-objective search) for online testing of Deep Neural Network-enabled systems. The empirical evaluation of these techniques was conducted on a state-of-the-art Autonomous Driving System (ADS). This work is a replication and extension of that empirical study. Our replication shows that RL does not outperform pure random test generation in a comparison conducted under the same settings of the original study, but with no confounding factor coming from the way collisions are measured. Our extension aims at eliminating some of the possible reasons for the poor performance of RL observed in our replication: (1) the presence of reward components providing contrasting feedback to the RL agent; (2) the usage of an RL algorithm (Q-learning) that requires discretization of an intrinsically continuous state space. Results show that our new RL agent is able to converge to an effective policy that outperforms random search. Results also highlight other possible improvements, which open up further investigations into how to best leverage RL for online ADS testing.
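One extension point named above is Q-learning's need to discretize an intrinsically continuous state space. As a generic illustration of that discretization step (not the authors' agent, state variables, or reward design), the sketch below bins two hypothetical continuous driving features into a discrete key that can index a Q-table.

```python
# Generic illustration of discretizing a continuous state for tabular Q-learning
# (hypothetical features and bin edges; not the paper's ADS state representation).
from collections import defaultdict

import numpy as np

# Hypothetical continuous features: distance to lane center (m) and heading error (rad).
LANE_BINS = np.linspace(-2.0, 2.0, 9)       # 9 bin edges
HEADING_BINS = np.linspace(-0.5, 0.5, 9)    # 9 bin edges

def discretize(lane_offset, heading_error):
    """Map a continuous observation to a discrete (bin, bin) key for the Q-table."""
    return (int(np.digitize(lane_offset, LANE_BINS)),
            int(np.digitize(heading_error, HEADING_BINS)))

q_table = defaultdict(lambda: np.zeros(3))  # 3 hypothetical actions: steer left/none/right

state = discretize(lane_offset=0.4, heading_error=-0.1)
best_action = int(np.argmax(q_table[state]))
print(state, best_action)
```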
{"title":"Reinforcement learning for online testing of autonomous driving systems: a replication and extension study.","authors":"Luca Giamattei, Matteo Biagiola, Roberto Pietrantuono, Stefano Russo, Paolo Tonella","doi":"10.1007/s10664-024-10562-5","DOIUrl":"10.1007/s10664-024-10562-5","url":null,"abstract":"<p><p>In a recent study, Reinforcement Learning (RL) used in combination with many-objective search, has been shown to outperform alternative techniques (random search and many-objective search) for online testing of Deep Neural Network-enabled systems. The empirical evaluation of these techniques was conducted on a state-of-the-art Autonomous Driving System (ADS). This work is a replication and extension of that empirical study. Our replication shows that RL does not outperform pure random test generation in a comparison conducted under the same settings of the original study, but with no confounding factor coming from the way collisions are measured. Our extension aims at eliminating some of the possible reasons for the poor performance of RL observed in our replication: (1) the presence of reward components providing contrasting feedback to the RL agent; (2) the usage of an RL algorithm (Q-learning) which requires discretization of an intrinsically continuous state space. Results show that our new RL agent is able to converge to an effective policy that outperforms random search. Results also highlight other possible improvements, which open to further investigations on how to best leverage RL for online ADS testing.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"30 1","pages":"19"},"PeriodicalIF":3.5,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11538197/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142602130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding security tactics in microservice APIs using annotated software architecture decomposition models - a controlled experiment.
Pub Date: 2025-01-01 | Epub Date: 2025-02-14 | DOI: 10.1007/s10664-024-10601-1
Patric Genfer, Souhaila Serbout, Georg Simhandl, Uwe Zdun, Cesare Pautasso
While microservice architectures have become a widespread option for designing distributed applications, designing secure microservice systems remains challenging. Although various security-related guidelines and practices exist, these systems' sheer size, complex communication structures, and polyglot tech stacks make it difficult to manually validate whether adequate security tactics are applied throughout their architecture. To address these challenges, we have devised a novel solution that involves the automatic generation of security-annotated software decomposition models and the utilization of security-based metrics to guide software architects through the assessment of security tactics employed within microservice systems. To evaluate the effectiveness of our artifacts, we conducted a controlled experiment where we asked 60 students from two universities and ten experts from industry to identify and assess the security features of two microservice reference systems. During the experiment, we tracked the correctness of their answers and the time they needed to solve the given tasks to measure how well they could understand the security tactics applied in the reference systems. Our results indicate that the supplemental material significantly improved the correctness of the participants' answers without requiring them to consult the documentation more. Most participants also stated in a self-assessment that their understanding of the security tactics used in the systems improved significantly because of the provided material, with the additional diagrams considered very helpful. In contrast, the perception of architectural metrics varied widely. We could also show that novice developers benefited most from the supplementary diagrams, whereas senior developers could rely on their experience to compensate for the lack of additional help. Contrary to our expectations, we found no significant correlation between the time spent solving the tasks and the overall correctness score achieved, meaning that participants who took more time to read the documentation did not automatically achieve better results. As far as we know, this empirical study is the first analysis that explores the influence of security annotations in component diagrams to guide software developers when assessing microservice system security.
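As a rough illustration of the kind of security-based metric such annotated decomposition models enable (a hypothetical metric over toy data, not one defined in the paper), the sketch below computes the fraction of externally reachable components in a tiny annotated model that carry an authentication tactic.

```python
# Hypothetical sketch: a simple coverage metric over a security-annotated
# microservice decomposition model (toy data; not the paper's metric suite).

components = [
    {"name": "orders-api",   "external": True,  "tactics": {"authenticate_actors", "validate_input"}},
    {"name": "billing-api",  "external": True,  "tactics": {"encrypt_traffic"}},
    {"name": "audit-worker", "external": False, "tactics": set()},
]

external = [c for c in components if c["external"]]
covered = [c for c in external if "authenticate_actors" in c["tactics"]]

coverage = len(covered) / len(external) if external else 1.0
print(f"Authentication tactic coverage of external components: {coverage:.0%}")
```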
{"title":"Understanding security tactics in microservice APIs using annotated software architecture decomposition models - a controlled experiment.","authors":"Patric Genfer, Souhaila Serbout, Georg Simhandl, Uwe Zdun, Cesare Pautasso","doi":"10.1007/s10664-024-10601-1","DOIUrl":"10.1007/s10664-024-10601-1","url":null,"abstract":"<p><p>While microservice architectures have become a widespread option for designing distributed applications, designing secure microservice systems remains challenging. Although various security-related guidelines and practices exist, these systems' sheer size, complex communication structures, and polyglot tech stacks make it difficult to manually validate whether adequate security tactics are applied throughout their architecture. To address these challenges, we have devised a novel solution that involves the automatic generation of security-annotated software decomposition models and the utilization of security-based metrics to guide software architectures through the assessment of security tactics employed within microservice systems. To evaluate the effectiveness of our artifacts, we conducted a controlled experiment where we asked 60 students from two universities and ten experts from the industry to identify and assess the security features of two microservice reference systems. During the experiment, we tracked the correctness of their answers and the time they needed to solve the given tasks to measure how well they could understand the security tactics applied in the reference systems. Our results indicate that the supplemental material significantly improved the correctness of the participants' answers without requiring them to consult the documentation more. Most participants also stated in a self-assessment that their understanding of the security tactics used in the systems improved significantly because of the provided material, with the additional diagrams considered very helpful. In contrast, the perception of architectural metrics varied widely. We could also show that novice developers benefited most from the supplementary diagrams. In contrast, senior developers could rely on their experience to compensate for the lack of additional help. Contrary to our expectations, we found no significant correlation between the time spent solving the tasks and the overall correctness score achieved, meaning that participants who took more time to read the documentation did not automatically achieve better results. As far as we know, this empirical study is the first analysis that explores the influence of security annotations in component diagrams to guide software developers when assessing microservice system security.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"30 3","pages":"66"},"PeriodicalIF":3.5,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11828814/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143432508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The effect of data complexity on classifier performance.
Pub Date: 2025-01-01 | Epub Date: 2024-10-31 | DOI: 10.1007/s10664-024-10554-5
Jonas Eberlein, Daniel Rodriguez, Rachel Harrison
The research area of Software Defect Prediction (SDP) is both extensive and popular, and is often treated as a classification problem. Improvements in classification, pre-processing, and tuning techniques (together with the many factors that can influence model performance) have encouraged this trend. However, no matter the effort in these areas, it seems that there is a ceiling in the performance of the classification models used in SDP. In this paper, the issue of classifier performance is analysed from the perspective of data complexity. Specifically, data complexity metrics are calculated using the Unified Bug Dataset, a collection of well-known SDP datasets, and then checked for correlation with the defect prediction performance of machine learning classifiers (in particular, C5.0, Naive Bayes, Artificial Neural Networks, Random Forests, and Support Vector Machines). In this work, different domains of competence and incompetence are identified for the classifiers. Similarities and differences between the classifiers and the performance metrics are found, and the Unified Bug Dataset is analysed from the perspective of data complexity. We found that certain classifiers work best in certain situations and that all data complexity metrics can be problematic, although certain classifiers did excel in some situations.
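As a small, self-contained illustration of the study's general setup (hypothetical synthetic data, not the Unified Bug Dataset or the paper's metric suite), the sketch below computes one classical data complexity measure, the maximum Fisher's discriminant ratio (often called F1), and correlates it with a classifier's cross-validated AUC across several generated datasets.

```python
# Illustrative sketch: correlate a data complexity measure with classifier
# performance across several synthetic datasets (not the paper's pipeline).
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def max_fisher_ratio(X, y):
    """Maximum Fisher's discriminant ratio (F1) over features for a binary problem."""
    a, b = X[y == 0], X[y == 1]
    ratios = (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0) + 1e-12)
    return ratios.max()

complexities, performances = [], []
for seed in range(8):  # eight synthetic "projects" of increasing separability
    X, y = make_classification(n_samples=400, n_features=20,
                               class_sep=0.5 + 0.2 * seed, random_state=seed)
    complexities.append(max_fisher_ratio(X, y))
    auc = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                          cv=5, scoring="roc_auc").mean()
    performances.append(auc)

rho, p = spearmanr(complexities, performances)
print(f"Spearman correlation between F1 complexity and AUC: rho={rho:.2f}, p={p:.3f}")
```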
{"title":"The effect of data complexity on classifier performance.","authors":"Jonas Eberlein, Daniel Rodriguez, Rachel Harrison","doi":"10.1007/s10664-024-10554-5","DOIUrl":"10.1007/s10664-024-10554-5","url":null,"abstract":"<p><p>The research area of Software Defect Prediction (SDP) is both extensive and popular, and is often treated as a classification problem. Improvements in classification, pre-processing and tuning techniques, (together with many factors which can influence model performance) have encouraged this trend. However, no matter the effort in these areas, it seems that there is a ceiling in the performance of the classification models used in SDP. In this paper, the issue of classifier performance is analysed from the perspective of data complexity. Specifically, data complexity metrics are calculated using the Unified Bug Dataset, a collection of well-known SDP datasets, and then checked for correlation with the defect prediction performance of machine learning classifiers (in particular, the classifiers C5.0, Naive Bayes, Artificial Neural Networks, Random Forests, and Support Vector Machines). In this work, different domains of competence and incompetence are identified for the classifiers. Similarities and differences between the classifiers and the performance metrics are found and the Unified Bug Dataset is analysed from the perspective of data complexity. We found that certain classifiers work best in certain situations and that all data complexity metrics can be problematic, although certain classifiers did excel in some situations.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"30 1","pages":"16"},"PeriodicalIF":3.5,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11527945/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142570943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Assessing the adoption of security policies by developers in terraform across different cloud providers.
Pub Date: 2025-01-01 | Epub Date: 2025-02-27 | DOI: 10.1007/s10664-024-10610-0
Alexandre Verdet, Mohammad Hamdaqa, Leuson Da Silva, Foutse Khomh
Cloud computing has become popular thanks to the widespread use of Infrastructure as Code (IaC) tools, allowing the community to manage and configure cloud infrastructure using scripts. However, the scripting process does not automatically prevent practitioners from introducing misconfigurations, vulnerabilities, or privacy risks. As a result, ensuring security relies on practitioners' understanding and the adoption of explicit policies. To understand how practitioners deal with this problem, we perform an empirical study analyzing the adoption of scripted security best practices present in Terraform files applied to AWS, Azure, and Google Cloud. We assess the adoption of these practices by analyzing a sample of 812 open-source GitHub projects. We scan each project's configuration files, looking for policy implementation through static analysis (Checkov and Tfsec). The category Access policy emerges as the most widely adopted across all providers, while Encryption at rest contains the most neglected policies. Regarding the cloud providers, we observe that AWS and Azure present similar behavior in terms of attended and neglected policies. Finally, we provide guidelines for cloud practitioners to limit infrastructure vulnerability and discuss further aspects associated with policies that have yet to be extensively embraced within the industry.
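As a hedged sketch of how such a scan can be scripted (the paper's exact pipeline and aggregation rules may differ), the example below runs Checkov over a directory of Terraform files via its CLI and tallies passed and failed checks from the JSON output. The directory path is a placeholder, and the snippet assumes Checkov is installed and that its `-d`/`-o json` options and JSON report layout behave as documented.

```python
# Hedged sketch: run Checkov over a Terraform project and count passed/failed checks.
# Assumes `checkov` is installed (pip install checkov); the path below is a placeholder.
import json
import subprocess

result = subprocess.run(
    ["checkov", "-d", "path/to/terraform/project", "-o", "json"],
    capture_output=True, text=True,
)
report = json.loads(result.stdout)

# Checkov may emit a list of reports (one per framework); normalize to a list.
reports = report if isinstance(report, list) else [report]
passed = sum(len(r.get("results", {}).get("passed_checks", [])) for r in reports)
failed = sum(len(r.get("results", {}).get("failed_checks", [])) for r in reports)
print(f"passed checks: {passed}, failed checks: {failed}")
```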
{"title":"Assessing the adoption of security policies by developers in terraform across different cloud providers.","authors":"Alexandre Verdet, Mohammad Hamdaqa, Leuson Da Silva, Foutse Khomh","doi":"10.1007/s10664-024-10610-0","DOIUrl":"https://doi.org/10.1007/s10664-024-10610-0","url":null,"abstract":"<p><p>Cloud computing has become popular thanks to the widespread use of Infrastructure as Code (IaC) tools, allowing the community to manage and configure cloud infrastructure using scripts. However, the scripting process does not automatically prevent practitioners from introducing misconfigurations, vulnerabilities, or privacy risks. As a result, ensuring security relies on practitioners' understanding and the adoption of explicit policies. To understand how practitioners deal with this problem, we perform an empirical study analyzing the adoption of scripted security best practices present in Terraform files, applied on AWS, Azure, and Google Cloud. We assess the adoption of these practices by analyzing a sample of 812 open-source GitHub projects. We scan each project's configuration files, looking for policy implementation through static analysis (Checkov and Tfsec). The category <i>Access policy</i> emerges as the most widely adopted in all providers, while <i>Encryption at rest</i> presents the most neglected policies. Regarding the cloud providers, we observe that AWS and Azure present similar behavior regarding attended and neglected policies. Finally, we provide guidelines for cloud practitioners to limit infrastructure vulnerability and discuss further aspects associated with policies that have yet to be extensively embraced within the industry.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"30 3","pages":"74"},"PeriodicalIF":3.5,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11868142/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143540588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An empirical study on developers’ shared conversations with ChatGPT in GitHub pull requests and issues
Pub Date: 2024-09-16 | DOI: 10.1007/s10664-024-10540-x
Huizi Hao, Kazi Amit Hasan, Hong Qin, Marcos Macedo, Yuan Tian, Steven H. H. Ding, Ahmed E. Hassan
ChatGPT has significantly impacted software development practices, providing substantial assistance to developers in various tasks, including coding, testing, and debugging. Despite its widespread adoption, the impact of ChatGPT as an assistant in collaborative coding remains largely unexplored. In this paper, we analyze a dataset of 210 and 370 developers’ shared conversations with ChatGPT in GitHub pull requests (PRs) and issues, respectively. We manually examined the content of the conversations and characterized the dynamics of the sharing behavior, i.e., understanding the rationale behind the sharing, identifying the locations where the conversations were shared, and determining the roles of the developers who shared them. Our main observations are: (1) Developers seek ChatGPT’s assistance across 16 types of software engineering inquiries. In conversations shared in both PRs and issues, the most frequently encountered inquiry categories include code generation, conceptual questions, how-to guides, issue resolution, and code review. (2) Developers frequently engage with ChatGPT via multi-turn conversations where each prompt can fulfill various roles, such as unveiling initial or new tasks, iterative follow-up, and prompt refinement. Multi-turn conversations account for 33.2% of the conversations shared in PRs and 36.9% in issues. (3) In collaborative coding, developers leverage shared conversations with ChatGPT to facilitate their role-specific contributions, whether as authors of PRs or issues, code reviewers, or collaborators on issues. Our work serves as the first step towards understanding the dynamics between developers and ChatGPT in collaborative software development and opens up new directions for future research on the topic.
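Shared ChatGPT conversations appear in PR and issue text as share links. As a small, hypothetical illustration of how such links might be harvested from issue or PR bodies (not the authors' mining pipeline), the sketch below extracts chat.openai.com/share URLs with a regular expression; the example issue body and UUID are made up.

```python
# Hypothetical sketch: extract shared ChatGPT conversation links from PR/issue text
# (illustrative only; the paper's data collection pipeline is not reproduced here).
import re

SHARE_LINK = re.compile(r"https://chat\.openai\.com/share/[0-9a-fA-F-]+")

issue_body = """
I asked ChatGPT how to fix the flaky test, see
https://chat.openai.com/share/00000000-0000-0000-0000-000000000000 for the full conversation.
"""

links = SHARE_LINK.findall(issue_body)
print(links)
```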
{"title":"An empirical study on developers’ shared conversations with ChatGPT in GitHub pull requests and issues","authors":"Huizi Hao, Kazi Amit Hasan, Hong Qin, Marcos Macedo, Yuan Tian, Steven H. H. Ding, Ahmed E. Hassan","doi":"10.1007/s10664-024-10540-x","DOIUrl":"https://doi.org/10.1007/s10664-024-10540-x","url":null,"abstract":"<p>ChatGPT has significantly impacted software development practices, providing substantial assistance to developers in various tasks, including coding, testing, and debugging. Despite its widespread adoption, the impact of ChatGPT as an assistant in collaborative coding remains largely unexplored. In this paper, we analyze a dataset of 210 and 370 developers’ shared conversations with ChatGPT in GitHub pull requests (PRs) and issues. We manually examined the content of the conversations and characterized the dynamics of the sharing behavior, i.e., understanding the rationale behind the sharing, identifying the locations where the conversations were shared, and determining the roles of the developers who shared them. Our main observations are: (1) Developers seek ChatGPT’s assistance across 16 types of software engineering inquiries. In both conversations shared in PRs and issues, the most frequently encountered inquiry categories include code generation, conceptual questions, how-to guides, issue resolution, and code review. (2) Developers frequently engage with ChatGPT via multi-turn conversations where each prompt can fulfill various roles, such as unveiling initial or new tasks, iterative follow-up, and prompt refinement. Multi-turn conversations account for 33.2% of the conversations shared in PRs and 36.9% in issues. (3) In collaborative coding, developers leverage shared conversations with ChatGPT to facilitate their role-specific contributions, whether as authors of PRs or issues, code reviewers, or collaborators on issues. Our work serves as the first step towards understanding the dynamics between developers and ChatGPT in collaborative software development and opens up new directions for future research on the topic.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"24 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quality issues in machine learning software systems
Pub Date: 2024-09-11 | DOI: 10.1007/s10664-024-10536-7
Pierre-Olivier Côté, Amin Nikanjam, Rached Bouchoucha, Ilan Basta, Mouna Abidi, Foutse Khomh
Context
An increasing demand is observed in various domains to employ Machine Learning (ML) for solving complex problems. ML models are implemented as software components and deployed in Machine Learning Software Systems (MLSSs).
Problem
There is a strong need for ensuring the serving quality of MLSSs. False or poor decisions of such systems can lead to malfunction of other systems, significant financial losses, or even threats to human life. The quality assurance of MLSSs is considered a challenging task and currently is a hot research topic.
Objective
This paper aims to investigate the characteristics of real quality issues in MLSSs from the viewpoint of practitioners and to identify a catalog of such quality issues.
Method
We conduct a set of interviews with practitioners/experts to gather insights about their experience and practices when dealing with quality issues. We validate the identified quality issues via a survey with ML practitioners.
Results
Based on the content of 37 interviews, we identified 18 recurring quality issues and 24 strategies to mitigate them. For each identified issue, we describe the causes and consequences according to the practitioners’ experience.
Conclusion
We believe the catalog of issues developed in this study will allow the community to develop efficient quality assurance tools for ML models and MLSSs. A replication package of our study is available on our public GitHub repository.
{"title":"Quality issues in machine learning software systems","authors":"Pierre-Olivier Côté, Amin Nikanjam, Rached Bouchoucha, Ilan Basta, Mouna Abidi, Foutse Khomh","doi":"10.1007/s10664-024-10536-7","DOIUrl":"https://doi.org/10.1007/s10664-024-10536-7","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context</h3><p>An increasing demand is observed in various domains to employ Machine Learning (ML) for solving complex problems. ML models are implemented as software components and deployed in Machine Learning Software Systems (MLSSs).</p><h3 data-test=\"abstract-sub-heading\">Problem</h3><p>There is a strong need for ensuring the serving quality of MLSSs. False or poor decisions of such systems can lead to malfunction of other systems, significant financial losses, or even threats to human life. The quality assurance of MLSSs is considered a challenging task and currently is a hot research topic.</p><h3 data-test=\"abstract-sub-heading\">Objective</h3><p>This paper aims to investigate the characteristics of real quality issues in MLSSs from the viewpoint of practitioners. This empirical study aims to identify a catalog of quality issues in MLSSs.</p><h3 data-test=\"abstract-sub-heading\">Method</h3><p>We conduct a set of interviews with practitioners/experts, to gather insights about their experience and practices when dealing with quality issues. We validate the identified quality issues via a survey with ML practitioners.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>Based on the content of 37 interviews, we identified 18 recurring quality issues and 24 strategies to mitigate them. For each identified issue, we describe the causes and consequences according to the practitioners’ experience.</p><h3 data-test=\"abstract-sub-heading\">Conclusion</h3><p>We believe the catalog of issues developed in this study will allow the community to develop efficient quality assurance tools for ML models and MLSSs. A replication package of our study is available on our public GitHub repository.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"4 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142216782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An empirical study of token-based micro commits
Pub Date: 2024-09-04 | DOI: 10.1007/s10664-024-10527-8
Masanari Kondo, Daniel M. German, Yasutaka Kamei, Naoyasu Ubayashi, Osamu Mizuno
In software development, developers frequently apply maintenance activities to the source code that change only a few lines in a single commit. A good understanding of the characteristics of such small changes can support quality assurance approaches (e.g., automated program repair), as small changes are likely addressing deficiencies in other changes; thus, understanding the reasons for creating small changes can help understand the types of errors introduced. Eventually, these reasons and the types of errors can be used to enhance quality assurance approaches for improving code quality. While prior studies used code churn to characterize and investigate small changes, such a definition has a critical limitation: it loses the information about which tokens changed within a line. For example, this definition fails to distinguish the following two one-line changes: (1) changing a string literal to fix a displayed message and (2) changing a function call and adding a new parameter. Both are maintenance activities, but we deduce that researchers and practitioners are more interested in supporting the latter change. To address this limitation, in this paper, we define micro commits, a type of small change based on changed tokens. Our goal is to quantify small changes using changed tokens, which allow us to identify small changes more precisely. In fact, this token-level definition can distinguish the above example. We investigate micro commits in four OSS projects and understand their characteristics in the first empirical study on token-based micro commits. We find that micro commits mainly replace a single name or literal token, and that micro commits are more likely to be used to fix bugs. Additionally, we propose the use of token-based information to support software engineering approaches in which very small changes significantly affect their effectiveness.
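To illustrate why a token-level view distinguishes changes that line-level churn cannot (a generic sketch using Python's difflib, not the tokenizer or thresholds used in the paper), the example below counts changed tokens between the old and new version of a single line for the two one-line edits mentioned in the abstract; the tokenization regex and example lines are illustrative only.

```python
# Generic sketch: count changed tokens between two versions of a line with difflib
# (illustrative tokenization; not the paper's token definition or micro-commit threshold).
import difflib
import re

def tokens(line):
    """Split a code line into crude tokens: string literals, identifiers/numbers, and symbols."""
    return re.findall(r'"[^"]*"|\w+|[^\s\w]', line)

def changed_token_count(old_line, new_line):
    matcher = difflib.SequenceMatcher(a=tokens(old_line), b=tokens(new_line))
    return sum(max(i2 - i1, j2 - j1)
               for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal")

# Both edits touch one line, but they differ at the token level.
print(changed_token_count('log.info("Started");', 'log.info("Start-up complete");'))  # 1: string literal only
print(changed_token_count('render(page);', 'renderSafely(page, escapeHtml);'))        # 3: call name + new parameter
```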
{"title":"An empirical study of token-based micro commits","authors":"Masanari Kondo, Daniel M. German, Yasutaka Kamei, Naoyasu Ubayashi, Osamu Mizuno","doi":"10.1007/s10664-024-10527-8","DOIUrl":"https://doi.org/10.1007/s10664-024-10527-8","url":null,"abstract":"<p>In software development, developers frequently apply maintenance activities to the source code that change a few lines by a single commit. A good understanding of the characteristics of such small changes can support quality assurance approaches (e.g., automated program repair), as it is likely that small changes are addressing deficiencies in other changes; thus, understanding the reasons for creating small changes can help understand the types of errors introduced. Eventually, these reasons and the types of errors can be used to enhance quality assurance approaches for improving code quality. While prior studies used code churns to characterize and investigate the small changes, such a definition has a critical limitation. Specifically, it loses the information of changed tokens in a line. For example, this definition fails to distinguish the following two one-line changes: (1) changing a string literal to fix a displayed message and (2) changing a function call and adding a new parameter. These are definitely maintenance activities, but we deduce that researchers and practitioners are interested in supporting the latter change. To address this limitation, in this paper, we define <i>micro commits</i>, a type of small change based on changed tokens. Our goal is to quantify small changes using changed tokens. Changed tokens allow us to identify small changes more precisely. In fact, this token-level definition can distinguish the above example. We investigate defined micro commits in four OSS projects and understand their characteristics as the first empirical study on token-based micro commits. We find that micro commits mainly replace a single name or literal token, and micro commits are more likely used to fix bugs. Additionally, we propose the use of token-based information to support software engineering approaches in which very small changes significantly affect their effectiveness.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"26 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142216784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software product line testing: a systematic literature review
Pub Date: 2024-09-02 | DOI: 10.1007/s10664-024-10516-x
Halimeh Agh, Aidin Azamnouri, Stefan Wagner
A Software Product Line (SPL) is a software development paradigm in which a family of software products shares a set of core assets. Testing has a vital role in both single-system development and SPL development in identifying potential faults by examining the behavior of a product or products, but it is especially challenging in SPL. There have been many research contributions in the SPL testing field; therefore, assessing the current state of research and practice is necessary to understand the progress in testing practices and to identify the gap between required techniques and existing approaches. This paper aims to survey existing research on SPL testing to provide researchers and practitioners with up-to-date evidence and issues that enable further development of the field. To this end, we conducted a Systematic Literature Review (SLR) with seven research questions in which we identified and analyzed 118 studies dating from 2003 to 2022. The results indicate that the literature proposes many techniques for specific aspects (e.g., controlling cost/effort in SPL testing); however, other elements (e.g., regression testing and non-functional testing) are not yet fully covered by existing research. Furthermore, most approaches are evaluated with only one empirical method, and most of these evaluations are academic. This may jeopardize the adoption of the approaches in industry. The results of this study can help identify gaps in SPL testing, since specific aspects of SPL engineering have yet to be fully addressed.
{"title":"Software product line testing: a systematic literature review","authors":"Halimeh Agh, Aidin Azamnouri, Stefan Wagner","doi":"10.1007/s10664-024-10516-x","DOIUrl":"https://doi.org/10.1007/s10664-024-10516-x","url":null,"abstract":"<p>A Software Product Line (SPL) is a software development paradigm in which a family of software products shares a set of core assets. Testing has a vital role in both single-system development and SPL development in identifying potential faults by examining the behavior of a product or products, but it is especially challenging in SPL. There have been many research contributions in the SPL testing field; therefore, assessing the current state of research and practice is necessary to understand the progress in testing practices and to identify the gap between required techniques and existing approaches. This paper aims to survey existing research on SPL testing to provide researchers and practitioners with up-to-date evidence and issues that enable further development of the field. To this end, we conducted a Systematic Literature Review (SLR) with seven research questions in which we identified and analyzed 118 studies dating from 2003 to 2022. The results indicate that the literature proposes many techniques for specific aspects (e.g., controlling cost/effort in SPL testing); however, other elements (e.g., regression testing and non-functional testing) still need to be covered by existing research. Furthermore, most approaches are evaluated by only one empirical method, most of which are academic evaluations. This may jeopardize the adoption of approaches in industry. The results of this study can help identify gaps in SPL testing since specific points of SPL Engineering still need to be addressed entirely.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"9 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142216785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}