Bots are software systems designed to support users by automating a specific process, task, or activity. When such systems implement a conversational component to interact with users, they are also known as conversational agents. Bots, particularly in their conversation-oriented and AI-powered versions, have seen increasing adoption for software development and engineering purposes. Despite their exciting potential, further enhanced by the advent of Generative AI and Large Language Models, bots remain difficult to develop and integrate into the development cycle: practitioners report that bots introduce additional challenges that may worsen, rather than improve, the development experience. In this work, we aim to provide a taxonomy for characterizing bots, as well as a series of challenges to their adoption in Software Engineering, together with potential mitigation strategies. To reach our objectives, we conducted a multivocal literature review, covering both research and practitioners' literature. Through such an approach, we hope to contribute to both researchers and practitioners by providing, first, a series of future research directions to follow; second, a list of strategies for improving the use of bots for software engineering purposes; and, third, by fostering technology and knowledge transfer from research to practice, which is one of the primary goals of multivocal literature reviews.
"Motivations, Challenges, Best Practices, and Benefits for Bots and Conversational Agents in Software Engineering: A Multivocal Literature Review". Stefano Lambiase, Gemma Catolino, Fabio Palomba, Filomena Ferrucci. arXiv - CS - Software Engineering, 2024-09-18. https://doi.org/arxiv-2409.11864
In the rapidly evolving field of machine learning, training models with datasets from various locations and organizations presents significant challenges due to privacy and legal concerns. The exploration of effective collaborative training settings capable of leveraging valuable knowledge from distributed and isolated datasets is increasingly crucial. This study investigates key factors that impact the effectiveness of collaborative training methods in code next-token prediction, as well as the correctness and utility of the generated code, demonstrating the promise of such methods. Additionally, we evaluate the memorization of different participant training data across various collaborative training settings, including centralized, federated, and incremental training, highlighting their potential risks in leaking data. Our findings indicate that the size and diversity of code datasets are pivotal factors influencing the success of collaboratively trained code models. We show that federated learning achieves competitive performance compared to centralized training while offering better data protection, as evidenced by lower memorization ratios in the generated code. However, federated learning can still produce verbatim code snippets from hidden training data, potentially violating privacy or copyright. Our study further explores effectiveness and memorization patterns in incremental learning, emphasizing the sequence in which individual participant datasets are introduced. We also identify cross-organizational clones as a prevalent challenge in both centralized and federated learning scenarios. Our findings highlight the persistent risk of data leakage during inference, even when training data remains unseen. We conclude with recommendations for practitioners and researchers to optimize multisource datasets, propelling cross-organizational collaboration forward.
"Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization". Zhi Chen, Lingxiao Jiang. arXiv - CS - Software Engineering, 2024-09-18. https://doi.org/arxiv-2409.12020
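The memorization analysis described above hinges on detecting verbatim overlap between generated code and (possibly hidden) training data. As a rough illustration of the idea only (not the paper's actual metric; the function name and the 6-gram window are assumptions), one could measure the fraction of generated token n-grams that occur verbatim in a training corpus:

```python
def memorization_ratio(generated_snippets, training_corpus, n=6):
    """Fraction of token n-grams in generated code that appear verbatim
    in the training corpus (higher = more memorization)."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    # Index all n-grams from every training document.
    corpus_ngrams = set()
    for doc in training_corpus:
        corpus_ngrams |= ngrams(doc.split())

    total = matched = 0
    for snippet in generated_snippets:
        for gram in ngrams(snippet.split()):
            total += 1
            matched += gram in corpus_ngrams  # bool counts as 0/1
    return matched / total if total else 0.0
```

A generated snippet whose n-grams all appear in the corpus scores 1.0; fully novel output scores 0.0.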
Leadership in agile teams is a collective responsibility where team members share leadership work based on expertise and skills. However, the understanding of leadership in this context is limited. This study explores the under-researched area of prototypical leadership, aiming to understand if and how leaders who are perceived as more representative of the team are more effective leaders. Qualitative interviews were conducted with eleven members of six agile software teams in five Swedish companies from various industries and sizes. In this study, the effectiveness of leadership was perceived as higher when it emerged from within the team or when leaders aligned with the group. In addition, leaders in managerial roles that align with the team's shared values and traits were perceived as more effective, contributing to overall team success.
"Prototypical Leadership in Agile Software Development". Jina Dawood, Lucas Gren. arXiv - CS - Software Engineering, 2024-09-18. https://doi.org/arxiv-2409.11685
App reviews in mobile app stores contain useful information that is used to improve applications and promote software evolution. This information is processed by automatic tools that prioritize reviews. To carry out this prioritization, reviews are decomposed into features such as category and sentiment; a weighted function then assigns a weight to each feature, and a review ranking is calculated. Unfortunately, extracting category and sentiment from reviews requires at least a classifier trained on an annotated corpus, making the task computationally demanding. In this work, we therefore propose Shannon entropy as a simple feature that can replace the standard features. Our results show that a Shannon-entropy-based ranking outperforms a standard ranking according to the NDCG metric. This result remains promising even when fairness with respect to algorithmic bias is required. Finally, we highlight a computational limit that appears in the search for the best ranking.
"Shannon Entropy is better Feature than Category and Sentiment in User Feedback Processing". Andres Rojas Paredes, Brenda Mareco. arXiv - CS - Software Engineering, 2024-09-18. https://doi.org/arxiv-2409.12012
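The appeal of Shannon entropy as a feature is that it needs no trained classifier: it is computed directly from the text. A minimal sketch of entropy-based ranking (the character-level granularity and the function names are illustrative assumptions, not the authors' implementation):

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Character-level Shannon entropy of a string, in bits."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rank_by_entropy(reviews):
    """Rank reviews by entropy, highest (most information-dense) first."""
    return sorted(reviews, key=shannon_entropy, reverse=True)
```

A repetitive review like "aaaa" has entropy 0 and sinks to the bottom, while varied text rises; no annotated corpus or training step is involved.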
From 2019 to 2022, Volvo Cars successfully translated our research discoveries regarding group dynamics within agile teams into widespread industrial practice. We wish to illuminate the insights gained through the process of garnering support, providing training, executing implementation, and sustaining a tool embraced by approximately 700 teams and 9,000 employees. This tool was designed to empower agile teams and propel their internal development. Our experiences underscore the necessity of comprehensive team training, the cultivation of a cadre of trainers across the organization, and the creation of a novel software solution. In essence, we deduce that an automated concise survey tool, coupled with a repository of actionable strategies, holds remarkable potential in fostering the maturation of agile teams, but we also share many of the challenges we encountered during the implementation.
"From Group Psychology to Software Engineering Research to Automotive R&D: Measuring Team Development at Volvo Cars". Lucas Gren, Christian Jacobsson. arXiv - CS - Software Engineering, 2024-09-18. https://doi.org/arxiv-2409.11778
Privacy policies define the terms under which personal data may be collected and processed by data controllers. The General Data Protection Regulation (GDPR) imposes requirements on these policies that are often difficult to implement, in particular due to the heterogeneity of existing systems (e.g., the Internet of Things (IoT), web technology, etc.). In this paper, we propose a method to refine high-level GDPR privacy requirements for informed consent into low-level computational models. The method is aimed at software developers implementing systems that require consent management. We mechanize our models in TLA+, which has been used by software engineers at companies such as Microsoft and Amazon, and use model checking to prove that the low-level computational models implement the high-level privacy requirements. We demonstrate our method in two real-world scenarios: an implementation of cookie banners and an IoT system communicating via Bluetooth Low Energy.
"Model-Checking the Implementation of Consent". Raúl Pardo, Daniel Le Métayer. arXiv - CS - Software Engineering, 2024-09-18. https://doi.org/arxiv-2409.11803
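The paper mechanizes its models in TLA+; purely as a language-agnostic illustration, the kind of consent invariant one would model-check ("personal data may be processed only while consent is granted and not revoked") can be sketched as a small state machine with an exhaustive check over bounded action sequences. All names here are hypothetical, not from the authors' specification:

```python
from enum import Enum, auto
from itertools import product

class Consent(Enum):
    UNSET = auto()
    GRANTED = auto()
    REVOKED = auto()

class ConsentManager:
    """Toy model: personal data may be processed only while consent is GRANTED."""
    def __init__(self):
        self.state = Consent.UNSET

    def grant(self):
        self.state = Consent.GRANTED

    def revoke(self):
        self.state = Consent.REVOKED

    def may_process(self):
        return self.state is Consent.GRANTED

def check_invariant(depth=4):
    """Exhaustively explore all bounded action sequences (a toy stand-in for
    model checking): processing is allowed iff the latest action was 'grant'."""
    for seq in product(["grant", "revoke"], repeat=depth):
        m = ConsentManager()
        for action in seq:
            getattr(m, action)()
            assert m.may_process() == (action == "grant")
    return True
```

A real TLA+ specification would state this as a temporal invariant and let the TLC model checker explore the state space; the bounded enumeration above only conveys the flavor of that verification.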
Federica Pepe, Fiorella Zampetti, Antonio Mastropaolo, Gabriele Bavota, Massimiliano Di Penta
The development of Machine Learning (ML)- and, more recently, Deep Learning (DL)-intensive systems requires suitable choices, e.g., in terms of technology, algorithms, and hyper-parameters. Such choices depend on developers' experience, as well as on proper experimentation. Due to limited time availability, developers may adopt suboptimal, sometimes temporary choices, leading to technical debt (TD) specifically related to the ML code. This paper empirically analyzes the presence of Self-Admitted Technical Debt (SATD) in DL systems. After selecting 100 open-source Python projects using popular DL frameworks, we identified SATD from their source comments and created a stratified sample of 443 SATD comments to analyze manually. We derived a taxonomy of DL-specific SATD through open coding, featuring seven categories and 41 leaves. The identified SATD categories pertain to different aspects of DL models, some of which are technological (e.g., due to hardware or libraries) and some related to suboptimal choices in the DL process, model usage, or configuration. Our findings indicate that DL-specific SATD differs from DL bugs found in previous studies, as it typically pertains to suboptimal solutions rather than functional (e.g., blocking) problems. Last but not least, we found that state-of-the-art static analysis tools do not help developers avoid such problems; therefore, specific support is needed to cope with DL-specific SATD.
"A Taxonomy of Self-Admitted Technical Debt in Deep Learning Systems". Federica Pepe, Fiorella Zampetti, Antonio Mastropaolo, Gabriele Bavota, Massimiliano Di Penta. arXiv - CS - Software Engineering, 2024-09-18. https://doi.org/arxiv-2409.11826
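SATD is typically identified from source comments, as the study above does before its manual open coding. A simplistic keyword-based sketch of that identification step (the pattern list is an illustrative assumption; the paper's procedure and resulting taxonomy are far richer):

```python
import re

# Illustrative keyword list; the paper derives a richer taxonomy via open coding.
SATD_PATTERN = re.compile(
    r"#.*\b(TODO|FIXME|HACK|XXX|workaround|temporary)\b", re.IGNORECASE
)

def find_satd_comments(source):
    """Return (line_number, line) pairs whose comments hint at self-admitted debt."""
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(source.splitlines(), start=1)
        if SATD_PATTERN.search(line)
    ]
```

Running such a scan over a project's `.py` files yields candidate SATD comments, which researchers then classify manually, e.g., into the hardware-, library-, or configuration-related categories the taxonomy describes.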