The role of psychological safety in promoting software quality in agile teams
Pub Date: 2024-07-25 | DOI: 10.1007/s10664-024-10512-1
Adam Alami, Mansooreh Zahedi, Oliver Krancher
Psychological safety continues to pique the interest of scholars across a variety of disciplines. Recent research indicates that psychological safety fosters knowledge sharing and norm clarity and complements agile values. Although software quality remains a concern in the software industry, academics have yet to investigate whether and how psychologically safe teams deliver superior results. In this study, we explore how psychological safety influences agile teams’ quality-related behaviors aimed at enhancing software quality. To widen the empirical coverage and evaluate the results, we chose a two-phase mixed-methods research design with an exploratory qualitative phase (20 interviews) followed by a quantitative phase (survey study, N = 423). Our findings show that, when psychological safety is established in agile software teams, it induces social enablers that advance the teams’ ability to pursue software quality. For example, admitting mistakes and taking initiative both help teams learn and feed that learning into future decisions related to software quality. Past mistakes become points of reference for avoiding similar ones in the future, and individuals become more willing to take initiatives aimed at enhancing quality practices and mitigating software quality issues. We contribute to the ongoing endeavor to understand the circumstances that promote software quality. Sustaining psychological safety requires organizations, their management, agile teams, and individuals to maintain and propagate its principles. Our results also suggest that technological tools and procedures can be used alongside social strategies to promote software quality.
{"title":"The role of psychological safety in promoting software quality in agile teams","authors":"Adam Alami, Mansooreh Zahedi, Oliver Krancher","doi":"10.1007/s10664-024-10512-1","DOIUrl":"https://doi.org/10.1007/s10664-024-10512-1","url":null,"abstract":"<p>Psychological safety continues to pique the interest of scholars in a variety of disciplines of study. Recent research indicates that psychological safety fosters knowledge sharing and norm clarity and complements agile values. Although software quality remains a concern in the software industry, academics have yet to investigate whether and how psychologically safe teams provide superior results. In this study, we explore how psychological safety influences agile teams’ quality-related behaviors aimed at enhancing software quality. To widen the empirical coverage and evaluate the results, we chose a two-phase mixed-methods research design with an exploratory qualitative phase (20 interviews) followed by a quantitative phase (survey study, N = 423). Our findings show that, when psychological safety is established in agile software teams, it induces enablers of a social nature that advance the teams’ ability to pursue software quality. For example, admitting mistakes and taking initiatives equally help teams learn and invest their learning in their future decisions related to software quality. Past mistakes become points of reference for avoiding them in the future. Individuals become more willing to take initiatives aimed at enhancing quality practices and mitigating software quality issues. We contribute to our endeavor to understand the circumstances that promote software quality. Psychological safety requires organizations, their management, agile teams, and individuals to maintain and propagate safety principles. Our results also suggest that technological tools and procedures can be utilized alongside social strategies to promote software quality.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"12 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The impact of concept drift and data leakage on log level prediction models
Pub Date: 2024-07-25 | DOI: 10.1007/s10664-024-10518-9
Youssef Esseddiq Ouatiti, Mohammed Sayagh, Noureddine Kerzazi, Bram Adams, Ahmed E. Hassan
Developers insert logging statements to collect information about the execution of their systems. With a logging framework (e.g., Log4j), practitioners can decide which log statements to print or suppress by tagging each log line with a log level. Since picking the right log level for a new logging statement is not straightforward, prior studies have proposed machine learning models for log level prediction (LLP). While these models show good performance, they remain sensitive to the context in which they are applied, specifically to the way practitioners decide on log levels in different phases of their projects’ development history (e.g., debugging vs. testing). For example, OpenStack developers alternately increased and decreased the verbosity of their logs across the history of the project in response to code changes (e.g., before vs. after fixing a new bug). These shifting log verbosity choices over time can lead to concept drift and data leakage, whose impact on LLP models we quantify in this paper. We empirically quantify the impact of data leakage and concept drift on the performance and interpretability of LLP models in three large open-source systems. Additionally, we compare the performance and interpretability of several time-aware approaches that tackle these time-related issues. We observe that both shallow and deep-learning-based models suffer from both issues. We also observe that training a model on just a window of the historical data (i.e., a contextual model) outperforms models trained on the whole historical data (i.e., all-knowing models) in the case of our shallow LLP model. Finally, we observe that contextual models exhibit different (even contradictory) model interpretability, with a (very) weak correlation between the rankings of important features across the pairs of contextual models we compared. Our findings suggest that data leakage and concept drift should be taken into consideration when building LLP models. We also invite practitioners to treat the size of the historical window as an additional hyperparameter and tune a suitable contextual model instead of relying on all-knowing models.
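To make the contrast between all-knowing and contextual training concrete, the minimal Python sketch below illustrates the two setups; it is not the paper's actual pipeline, and the column names (commit_date, log_level), feature columns, and the random forest standing in for a shallow LLP model are assumptions for illustration. The window length corresponds to the historical-window hyperparameter the authors suggest tuning.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def contextual_split(df, test_start, test_end, window_days):
    """Train only on the window of history immediately preceding the test period."""
    window_start = test_start - pd.Timedelta(days=window_days)
    train = df[(df["commit_date"] >= window_start) & (df["commit_date"] < test_start)]
    test = df[(df["commit_date"] >= test_start) & (df["commit_date"] < test_end)]
    return train, test

def all_knowing_split(df, test_start, test_end):
    """Train on everything outside the test period, including data recorded after it,
    which is the kind of setup that can leak future information into training."""
    in_test = (df["commit_date"] >= test_start) & (df["commit_date"] < test_end)
    return df[~in_test], df[in_test]

def evaluate(train, test, feature_cols, label_col="log_level"):
    """Fit a shallow classifier and report accuracy on the held-out time period."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(train[feature_cols], train[label_col])
    return clf.score(test[feature_cols], test[label_col])
```

Comparing the two evaluate() results for the same test period gives a rough sense of how much an all-knowing setup overstates performance relative to a contextual one.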
{"title":"The impact of concept drift and data leakage on log level prediction models","authors":"Youssef Esseddiq Ouatiti, Mohammed Sayagh, Noureddine Kerzazi, Bram Adams, Ahmed E. Hassan","doi":"10.1007/s10664-024-10518-9","DOIUrl":"https://doi.org/10.1007/s10664-024-10518-9","url":null,"abstract":"<p>Developers insert logging statements to collect information about the execution of their systems. Along with a logging framework (e.g., Log4j), practitioners can decide which log statement to print or suppress by tagging each log line with a log level. Since picking the right log level for a new logging statement is not straightforward, machine learning models for log level prediction (LLP) were proposed by prior studies. While these models show good performances, they are still subject to the context in which they are applied, specifically to the way practitioners decide on log levels in different phases of the development history of their projects (e.g., debugging vs. testing). For example, Openstack developers interchangeably increased/decreased the verbosity of their logs across the history of the project in response to code changes (e.g., before vs after fixing a new bug). Thus, the manifestation of these changing log verbosity choices across time can lead to concept drift and data leakage issues, which we wish to quantify in this paper on LLP models. In this paper, we empirically quantify the impact of data leakage and concept drift on the performance and interpretability of LLP models in three large open-source systems. Additionally, we compare the performance and interpretability of several time-aware approaches to tackle time-related issues. We observe that both shallow and deep-learning-based models suffer from both time-related issues. We also observe that training a model on just a window of the historical data (i.e., contextual model) outperforms models that are trained on the whole historical data (i.e., all-knowing model) in the case of our shallow LLP model. Finally, we observe that contextual models exhibit a different (even contradictory) model interpretability, with a (very) weak correlation between the ranking of important features of the pairs of contextual models we compared. Our findings suggest that data leakage and concept drift should be taken into consideration for LLP models. We also invite practitioners to include the size of the historical window as an additional hyperparameter to tune a suitable contextual model instead of leveraging all-knowing models.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"16 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Trusted Smart Contracts: A Comprehensive Test Suite For Vulnerability Detection
Pub Date: 2024-07-25 | DOI: 10.1007/s10664-024-10509-w
Andrei Arusoaie, Ștefan-Claudiu Susan
The term smart contract was originally used to describe automated legal contracts. Nowadays, it refers to special programs that run on blockchain platforms and are popular in decentralized applications. In recent years, vulnerabilities in smart contracts have caused significant financial losses. Researchers have proposed methods and tools for detecting them and have demonstrated their effectiveness using various test suites. In this paper, we aim to improve the current approach to measuring the effectiveness of vulnerability detectors for smart contracts. First, we identify several traits of existing test suites used to assess tool effectiveness and explain how these traits limit the evaluation and comparison of vulnerability detection tools. Next, we propose a new test suite that prioritizes diversity over quantity, using a comprehensive taxonomy to achieve this. Our organized test suite enables insightful evaluations and more precise comparisons among vulnerability detection tools. We demonstrate its benefits by comparing several vulnerability detection tools using two sets of metrics. Results show that the tools included in our comparison cover less than half of the vulnerabilities in the new test suite. Finally, based on our results, we answer the questions posed in the introduction of the paper about the effectiveness of the compared tools.
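To make the diversity-over-quantity idea concrete, the short Python sketch below computes a category-coverage metric: the share of taxonomy categories for which a tool correctly flags at least one seeded vulnerable contract. The taxonomy categories, contract ids, and tool names are hypothetical, and this is not the authors' metric definition or tooling, only one way such a breadth-oriented measure could be computed.

```python
# Hypothetical data: taxonomy maps each vulnerability category to the ids of the
# seeded contracts in the test suite that exhibit it; detections maps each tool to
# the (contract id, category) findings it reported.
taxonomy = {
    "reentrancy": {"c1", "c2"},
    "integer-overflow": {"c3"},
    "unchecked-call": {"c4", "c5"},
}
detections = {
    "tool_a": {("c1", "reentrancy"), ("c3", "integer-overflow")},
    "tool_b": {("c4", "unchecked-call")},
}

def category_coverage(taxonomy, reported):
    """Share of taxonomy categories with at least one correctly reported finding."""
    covered = {cat for (contract, cat) in reported
               if cat in taxonomy and contract in taxonomy[cat]}
    return len(covered) / len(taxonomy)

for tool, reported in detections.items():
    print(f"{tool}: {category_coverage(taxonomy, reported):.0%} of categories covered")
```

A metric of this kind rewards breadth across the taxonomy rather than raw counts of detected instances, which is the property a diversity-first test suite is designed to expose.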
{"title":"Towards Trusted Smart Contracts: A Comprehensive Test Suite For Vulnerability Detection","authors":"Andrei Arusoaie, Ștefan-Claudiu Susan","doi":"10.1007/s10664-024-10509-w","DOIUrl":"https://doi.org/10.1007/s10664-024-10509-w","url":null,"abstract":"<p>The term <i>smart contract</i> was originally used to describe automated legal contracts. Nowadays, it refers to special programs that run on blockchain platforms and are popular in decentralized applications. In recent years, vulnerabilities in smart contracts caused significant financial losses. Researchers have proposed methods and tools for detecting them and have demonstrated their effectiveness using various test suites. In this paper, we aim to improve the current approach to measuring the effectiveness of vulnerability detectors in smart contracts. First, we identify several traits of existing test suites used to assess tool effectiveness. We explain how these traits limit the evaluation and comparison of vulnerability detection tools. Next, we propose a new test suite that prioritizes diversity over quantity, utilizing a comprehensive taxonomy to achieve this. Our organized test suite enables insightful evaluations and more precise comparisons among vulnerability detection tools. We demonstrate the benefits of our test suite by comparing several vulnerability detection tools using two sets of metrics. Results show that the tools we included in our comparison cover less than half of the vulnerabilities in the new test suite. Finally, based on our results, we answer several questions that we pose in the introduction of the paper about the effectiveness of the compared tools.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"17 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic title completion for Stack Overflow posts and GitHub issues
Pub Date: 2024-07-25 | DOI: 10.1007/s10664-024-10513-0
Xiang Chen, Wenlong Pei, Shaoyu Yang, Yanlin Zhou, Zichen Zhang, Jiahua Pei
Title quality is important for different software engineering communities. For example, on Stack Overflow, posts with low-quality question titles often discourage potential answerers; on GitHub, issues with low-quality titles can make it difficult for developers to grasp the core idea of the problem. In previous studies, researchers mainly focused on generating titles from scratch by analyzing the body contents, such as the post body for Stack Overflow question title generation (SOTG) and the issue body for issue title generation (ISTG). However, the quality of the generated titles is still limited by the information available in the body contents. A more effective way is to provide accurate completion suggestions while developers compose titles. Inspired by this idea, we are the first to study the problem of automatic title completion for software engineering title generation tasks and propose the approach TC4SETG. Specifically, we first preprocess the gathered titles to form incomplete titles (i.e., the tip information provided by developers), simulating the title completion scenario. Then we construct the input by concatenating the incomplete title with the body’s content. Finally, we fine-tune the pre-trained model CodeT5 to learn title completion patterns effectively. To evaluate the effectiveness of TC4SETG, we selected 189,655 high-quality posts from Stack Overflow covering eight popular programming languages for the SOTG task and 333,563 issues from the top-200 starred repositories on GitHub for the ISTG task. Our empirical results show that, compared with approaches that generate titles from scratch, TC4SETG is more practical according to both automatic and human evaluation: it outperforms the corresponding state-of-the-art baselines by at least 25.82% on the SOTG task and by at least 45.48% on the ISTG task in terms of ROUGE-L. Therefore, our study provides a new direction for automatic software engineering title generation, and we call for more researchers to investigate it in the future.
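As an illustration of the title-completion setup described above, the Python sketch below builds a training pair by truncating a gold title into an incomplete title, concatenating it with the post or issue body, and scores a candidate completion with an LCS-based ROUGE-L F1. It is a simplification, not the authors' released code; the separator token, truncation ratio, and example strings are assumptions for illustration.

```python
def build_example(title: str, body: str, keep_ratio: float = 0.5, sep: str = " <body> "):
    """Simulate the title-completion setting: keep a prefix of the gold title as the
    developer-provided tip, and pair it with the body as the model input."""
    words = title.split()
    cut = max(1, int(len(words) * keep_ratio))
    incomplete = " ".join(words[:cut])   # tip information the developer has typed so far
    source = incomplete + sep + body     # model input (incomplete title + body)
    target = title                       # model output (the completed title)
    return source, target

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 based on the longest common subsequence of whitespace tokens."""
    c, r = candidate.split(), reference.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c, 1):
        for j, rw in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cw == rw else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

src, tgt = build_example("How to parse JSON in Python with nested keys",
                         "I have a deeply nested JSON response and ...")
print(src)
print(rouge_l_f1("How to parse JSON in Python safely", tgt))
```

Pairs produced this way would then be fed to a sequence-to-sequence model such as CodeT5 for fine-tuning, with ROUGE-L used to compare the generated completions against the gold titles.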
{"title":"Automatic title completion for Stack Overflow posts and GitHub issues","authors":"Xiang Chen, Wenlong Pei, Shaoyu Yang, Yanlin Zhou, Zichen Zhang, Jiahua Pei","doi":"10.1007/s10664-024-10513-0","DOIUrl":"https://doi.org/10.1007/s10664-024-10513-0","url":null,"abstract":"<p>Title quality is important for different software engineering communities. For example, in Stack Overflow, posts with low-quality question titles often discourage potential answerers. In GitHub, issues with low-quality titles can make it difficult for developers to grasp the core idea of the problem. In previous studies, researchers mainly focused on generating titles from scratch by analyzing the body contents, such as the post body for Stack Overflow question title generation (SOTG) and the issue body for issue title generation (ISTG). However, the quality of the generated titles is still limited by the information available in the body contents. A more effective way is to provide accurate completion suggestions when developers compose titles. Inspired by this idea, we are the first to study the problem of automatic title completion for software engineering title generation tasks and propose the approach <span>TC4SETG</span>. Specifically, we first preprocess the gathered titles to form incomplete titles (i.e., tip information provided by developers) for simulating the title completion scene. Then we construct the input by concatenating the incomplete title with the body’s content. Finally, we fine-tune the pre-trained model CodeT5 to learn the title completion patterns effectively. To evaluate the effectiveness of <span>TC4SETG</span>, we selected 189,655 high-quality posts from Stack Overflow by covering eight popular programming languages for the SOTG task and 333,563 issues in the top-200 starred repositories on GitHub for the ISTG task. Our empirical results show that compared with the approaches of generating question titles from scratch, our proposed approach <span>TC4SETG</span> is more practical in automatic and human evaluation. Our experimental results demonstrate that <span>TC4SETG</span> outperforms corresponding state-of-the-art baselines in the SOTG task by a minimum of 25.82% and in the ISTG task by at least 45.48% in terms of ROUGE-L. Therefore, our study provides a new direction for studying automatic software engineering title generation and calls for more researchers to investigate this direction in the future.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"23 1","pages":""},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}