On the role of search budgets in model-based software refactoring optimization
Pub Date: 2025-10-18 | DOI: 10.1007/s10515-025-00564-y
J. Andres Diaz-Pace, Daniele Di Pompeo, Michele Tucci
Software model optimization is a process that automatically generates design alternatives aimed at improving quantifiable non-functional properties of software systems, such as performance and reliability. Multi-objective evolutionary algorithms effectively help designers identify trade-offs among the desired non-functional properties. To reduce the use of computational resources, this work examines the impact of implementing a search budget to limit the search for design alternatives. In particular, we analyze how time budgets affect the quality of Pareto fronts by utilizing quality indicators and exploring the structural features of the generated design alternatives. This study identifies distinct behavioral differences among evolutionary algorithms when a search budget is implemented. It further reveals that design alternatives generated under a budget are structurally different from those produced without one. Additionally, we offer recommendations for designers on selecting algorithms in relation to time constraints, thereby facilitating the effective application of automated refactoring to improve non-functional properties.
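As a rough illustration of the mechanism studied here (not the authors' implementation), the Python sketch below wraps a generic multi-objective evolutionary loop in a wall-clock search budget and scores the resulting Pareto front with a simple two-objective hypervolume indicator; the function names, objectives, and budget value are hypothetical placeholders.

# Illustrative sketch: a wall-clock search budget around a toy multi-objective
# evolutionary loop, plus a 2-objective hypervolume indicator (both objectives minimized).
import random
import time


def evaluate(solution):
    # Placeholder objectives, standing in for performance and reliability costs.
    return (sum(solution), sum((x - 0.5) ** 2 for x in solution))


def mutate(solution):
    return [min(1.0, max(0.0, x + random.gauss(0, 0.1))) for x in solution]


def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def hypervolume_2d(front, ref):
    # Area dominated by a 2-objective minimization front w.r.t. a reference point.
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in sorted(front):
        hv += max(0.0, ref[0] - f1) * max(0.0, prev_f2 - f2)
        prev_f2 = min(prev_f2, f2)
    return hv


def run_budgeted_search(budget_seconds, pop_size=20, dim=5):
    population = [[random.random() for _ in range(dim)] for _ in range(pop_size)]
    start = time.time()
    while time.time() - start < budget_seconds:      # the search budget
        population.append(mutate(random.choice(population)))
        objs = [evaluate(p) for p in population]
        # Keep only non-dominated solutions (a crude Pareto archive).
        population = [p for p, o in zip(population, objs)
                      if not any(dominates(other, o) for other in objs if other != o)]
    return [evaluate(p) for p in population]


if __name__ == "__main__":
    front = run_budgeted_search(budget_seconds=1.0)
    print("front size:", len(front), "hypervolume:", hypervolume_2d(front, ref=(10.0, 10.0)))

Comparing hypervolumes obtained under different budgets (and different algorithms) is the kind of quality-indicator analysis the abstract refers to.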
Graph neural networks for precise bug localization through structural program analysis
Pub Date: 2025-10-18 | DOI: 10.1007/s10515-025-00556-y
Leila Yousofvand, Seyfollah Soleimani, Vahid Rafe, Amin Nikanjam
Bug localization (BL) is one of the major steps in the program repair process: it seeks to identify the statements that cause a program to crash or fail. As the complexity and scale of modern software development keep growing, locating bugs and their sources quickly by hand has become impractical, so there is strong demand for BL techniques that require minimal human intervention. A graph representation of source code typically encodes valuable information about both the syntactic and semantic structure of programs, and many software bugs are tied to these structures, which makes graphs particularly suitable for BL. Accordingly, the key contributions of this work are labeling graph nodes, classifying those nodes, and handling the class imbalance within the graph data to effectively locate bugs in code. The proposed method first introduces a graph-based bug classifier: program source code is mapped to a graph representation, and since the graph nodes have no labels, the GumTree algorithm is used to label them by comparing buggy graphs with their bug-free counterparts. A supervised node classifier based on a graph neural network (GNN) is then trained to classify nodes as buggy or bug-free. Given the imbalance in the data, accuracy, precision, recall, and F1-score are used for evaluation. Experimental results on identical datasets show that the proposed method outperforms related approaches and localizes a broader spectrum of bug types, such as undefined properties, functional bugs, variable naming errors, and variable misuse issues.
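The node-classification step can be pictured with the minimal sketch below: a two-layer graph neural network labels graph nodes as buggy or bug-free, with class weights compensating for the buggy/bug-free imbalance. The architecture, toy graph, and weights are illustrative assumptions, not the paper's GumTree-based pipeline.

# Minimal sketch: weighted node classification with a tiny two-layer GNN in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyGNN(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes=2):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden_dim)
        self.lin2 = nn.Linear(hidden_dim, n_classes)

    def forward(self, x, adj_norm):
        # Message passing = multiply node features by the normalized adjacency.
        h = F.relu(self.lin1(adj_norm @ x))
        return self.lin2(adj_norm @ h)


def normalize_adj(adj):
    adj = adj + torch.eye(adj.size(0))              # add self-loops
    d_inv_sqrt = torch.diag(adj.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ adj @ d_inv_sqrt


# Toy graph: 6 nodes with 4 features each, node 2 labeled buggy (class 1).
x = torch.randn(6, 4)
adj = torch.zeros(6, 6)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    adj[i, j] = adj[j, i] = 1.0
y = torch.tensor([0, 0, 1, 0, 0, 0])

model = TinyGNN(in_dim=4, hidden_dim=8)
optim = torch.optim.Adam(model.parameters(), lr=0.01)
class_weights = torch.tensor([1.0, 5.0])            # up-weight the rare buggy class

for _ in range(100):
    optim.zero_grad()
    logits = model(x, normalize_adj(adj))
    loss = F.cross_entropy(logits, y, weight=class_weights)
    loss.backward()
    optim.step()

print("predicted buggy nodes:", (logits.argmax(dim=1) == 1).nonzero().flatten().tolist())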
Decomposition then watermarking: Enhancing code traceability with dual-channel code watermarking
Pub Date: 2025-10-10 | DOI: 10.1007/s10515-025-00561-1
Haibo Lin, Zhong Li, Ruihua Ji, Minxue Pan, Tian Zhang, Nan Wu, Xuandong Li
Code watermarking has gained increasing attention for tracing the provenance of code with the rapid growth of the open-source community. Existing work on code watermarking has shown promising results yet still falls short, especially when a multi-bit watermark encoding diverse information is required. In this paper, we propose DWC, a novel code watermarking method with high watermark capacity. The key idea of DWC is to first decompose the code into natural and formal channels, and then embed the watermark separately into each channel based solely on that channel's information. In this way, DWC reduces the mutual interference between the two channels and the impact of irrelevant information within the code, enabling more effective transformations that embed watermarks with higher capacity and robustness. Our extensive experiments on source code snippets in four programming languages (C, C++, Java, and Python) demonstrate the effectiveness, efficiency, and capability of DWC in embedding multi-bit watermarks, as well as the utility and robustness of the watermarked code it generates.
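A much-simplified, hypothetical illustration of the dual-channel idea follows: one bit is carried by the wording of identifiers (natural channel) and one by the choice between two semantically equivalent statement forms (formal channel), so the two embeddings do not interfere. The transformations and the synonym table are invented for illustration and are not DWC's actual scheme.

# Toy dual-channel watermark: identifier wording vs. equivalent statement form.
NATURAL_SYNONYMS = {"count": "total"}          # bit 0 -> "count", bit 1 -> "total"


def embed(code: str, natural_bit: int, formal_bit: int) -> str:
    # Natural channel: choose between synonymous identifiers.
    if natural_bit == 1:
        for original, alt in NATURAL_SYNONYMS.items():
            code = code.replace(original, alt)
    # Formal channel: choose between two semantically equivalent statements.
    if formal_bit == 1:
        code = code.replace("i = i + 1", "i += 1")
    return code


def extract(code: str) -> tuple[int, int]:
    natural_bit = 1 if any(alt in code for alt in NATURAL_SYNONYMS.values()) else 0
    formal_bit = 1 if "+= 1" in code else 0
    return natural_bit, formal_bit


snippet = "count = 0\nfor i in range(n):\n    i = i + 1\n    count = count + i\n"
marked = embed(snippet, natural_bit=1, formal_bit=1)
print(marked)
print("recovered bits:", extract(marked))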
A sign language to SQL query translation system for enhancing database accessibility
Pub Date: 2025-10-07 | DOI: 10.1007/s10515-025-00558-w
Guocang Yang, Dawei Yuan, Tao Zhang, Zhenghan Chen
Structured Query Language (SQL) is a standard language for interacting with relational databases and is widely used across various information systems, either through direct query execution or via object-relational mapping (ORM) frameworks. Recent approaches have focused on converting natural language into SQL to simplify database development for users without programming expertise. However, these methods overlook direct translation from sign language—an essential modality for users such as the deaf community who may lack experience with SQL syntax. In this paper, we present SIGN2SQL, an innovative end-to-end framework that generates SQL queries from signed input. The system first employs a dedicated gesture recognition module to interpret the visual signals, followed by a convolutional neural network (CNN)-based model that produces the corresponding SQL statements. Trained on a well-annotated dataset, SIGN2SQL is evaluated against multiple pipeline-based baselines. Experimental results demonstrate that SIGN2SQL outperforms existing methods in both effectiveness and efficiency, particularly for SELECT statements with WHERE clauses. It achieves an execution accuracy of 89.8%, highlighting its potential as an accessible and inclusive database interaction interface.
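The two-stage pipeline can be sketched as follows, with a stand-in recognition stage that yields sign tokens and a template-based stage that assembles a SELECT statement with a WHERE clause; the real SIGN2SQL replaces both stand-ins with a gesture-recognition module and a CNN-based generator, and the column and table names below are invented.

# Toy two-stage pipeline: sign tokens -> SQL SELECT ... WHERE ...
from typing import List


def recognize_signs(frames: List[bytes]) -> List[str]:
    # Stand-in for the gesture-recognition module: pretend the frames decode to tokens.
    return ["select", "name", "from", "employees", "where", "salary", "greater", "50000"]


def tokens_to_sql(tokens: List[str]) -> str:
    ops = {"greater": ">", "less": "<", "equals": "="}
    # Very small grammar: SELECT <col> FROM <table> [WHERE <col> <op> <value>]
    col, table = tokens[1], tokens[3]
    sql = f"SELECT {col} FROM {table}"
    if "where" in tokens:
        w = tokens.index("where")
        sql += f" WHERE {tokens[w + 1]} {ops[tokens[w + 2]]} {tokens[w + 3]}"
    return sql + ";"


print(tokens_to_sql(recognize_signs(frames=[])))
# -> SELECT name FROM employees WHERE salary > 50000;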
Toward efficient testing of graph neural networks via test input prioritization
Pub Date: 2025-10-07 | DOI: 10.1007/s10515-025-00554-0
Lichen Yang, Qiang Wang, Zhonghao Yang, Daojing He, Yu Li
Graph Neural Networks (GNNs) have demonstrated remarkable efficacy in handling graph-structured data; however, they exhibit failures after deployment, which can cause severe consequences. Hence, conducting thorough testing before deployment becomes imperative to ensure the reliability of GNNs. However, thorough testing requires numerous manually annotated test data. To mitigate the annotation cost, strategically prioritizing and labeling high-quality unlabeled inputs for testing becomes crucial, which facilitates uncovering more model failures with a limited labeling budget. Unfortunately, existing test input prioritization techniques either overlook the valuable information contained in graph structures or are overly reliant on attributes extracted from the target model, i.e., model-aware attributes, whose quality can vary significantly. To address these issues, we propose a novel test input prioritization framework, named GraphRank, for GNNs. GraphRank introduces model-agnostic attributes to compensate for the limitations of the model-aware ones. It also leverages the graph structure information to aggregate attributes from neighboring nodes, thereby enhancing the model-aware and model-agnostic attributes. Furthermore, GraphRank combines the above attributes with a binary classifier, using it as a ranking model to prioritize inputs. This classifier undergoes iterative training, which enables it to learn from each round’s feedback and improve its performance accordingly. Extensive experiments demonstrate GraphRank’s superiority over existing techniques.
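The ranking loop can be illustrated with the hypothetical sketch below: a model-aware attribute (prediction entropy) and a model-agnostic one (node degree) are aggregated over graph neighbors and fed to a binary classifier that is retrained each round on newly labeled inputs. The attribute choices, graph, and batch sizes are assumptions for illustration, not GraphRank's exact feature set.

# Sketch of attribute-based test input prioritization with an iteratively trained ranker.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
adj = (rng.random((n, n)) < 0.05).astype(float)
adj = np.maximum(adj, adj.T)                       # undirected toy graph

probs = rng.dirichlet(np.ones(3), size=n)          # stand-in GNN output per node
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)   # model-aware attribute
degree = adj.sum(axis=1)                                   # model-agnostic attribute
deg_safe = np.maximum(degree, 1.0)

features = np.column_stack([
    entropy, degree,
    adj @ entropy / deg_safe,                      # attributes aggregated over neighbors
    adj @ degree / deg_safe,
])

# Ground truth: which test inputs actually expose a misprediction (revealed on labeling).
exposes_failure = entropy > np.quantile(entropy, 0.7)

labeled = list(rng.choice(n, size=30, replace=False))
ranker = LogisticRegression(max_iter=1000)
for _ in range(3):                                 # retrain on each round's feedback
    ranker.fit(features[labeled], exposes_failure[labeled])
    scores = ranker.predict_proba(features)[:, 1]
    candidates = [i for i in np.argsort(-scores) if i not in labeled]
    labeled += candidates[:20]                     # "annotate" the top-ranked inputs next

print("failures found with 90 labels:", int(exposes_failure[labeled].sum()))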
Graph based transfer learning with orthogonal tunning for functionality size insights
Pub Date: 2025-10-06 | DOI: 10.1007/s10515-025-00562-0
Nevena Ranković, Dragica Ranković, Gonzalo Nápoles, Federico Zamberlan
Function Point Analysis (FPA) is a method in software engineering that focuses on identifying the functions a software system provides to users, such as data input, processing, output, and database management. These functions are classified by complexity to quantify the system’s size in function point units. In this paper, we propose two graph neural networks, a Graph-based Similarity Detection Neural Network (GSDNN) and a Prior-Structural Information Graph Neural Network (PSI-GNN) with a pre-trained layer using transfer learning, to determine the best model for functional size prediction and to uncover patterns and trends in the data. The study focuses on the NESMA (Netherlands Software Metrics Users Association) method from the family of functional size measurement approaches, and uses the ISBSG (International Software Benchmarking Standards Group) dataset, which provides standardized and relevant data for comparing software performance, to analyze 1704 industrial software projects. The goal was to identify the graph architecture that requires the smallest number of experiments and yields the lowest Mean Magnitude of Relative Error (MMRE), using orthogonal-array tuning optimization via Latin Square extraction. In the proposed approach, fewer than eight experiments are needed per dataset, and a minimum MMRE of 0.97% was obtained with PSI-GNN. Additionally, the impact of five input features on the MMRE was analyzed for the top-performing model using the SHAP (SHapley Additive exPlanations) feature importance method, visualized through GraphExplainer. The frequency of user-initiated transactions, quantified technically, emerged as the most significant determinant within the NESMA framework.
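Two of the evaluation ingredients named above can be sketched directly: the MMRE error metric and a Latin-square layout that covers a hyperparameter grid with only a handful of runs. The hyperparameter names and values below are illustrative assumptions, not the paper's actual search space.

# Sketch: MMRE metric and a Latin-square fractional design over three factors.
import numpy as np


def mmre(actual, predicted):
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted) / actual))


def latin_square(k):
    # Row i is a cyclic shift of 0..k-1, so each level appears once per row and column.
    return np.array([[(i + j) % k for j in range(k)] for i in range(k)])


learning_rates = [1e-2, 1e-3, 1e-4]
hidden_sizes = [32, 64, 128]
dropouts = [0.0, 0.2, 0.5]

square = latin_square(3)
# Each row picks one level per factor -> only 3 runs instead of the full 3**3 = 27 grid.
for run, row in enumerate(square):
    cfg = dict(lr=learning_rates[row[0]], hidden=hidden_sizes[row[1]], dropout=dropouts[row[2]])
    print(f"run {run}: {cfg}")

print("MMRE example:", mmre(actual=[100, 250, 400], predicted=[90, 260, 380]))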
Improving anomaly detection in software logs through hybrid language modeling and reduced reliance on parser
Pub Date: 2025-09-29 | DOI: 10.1007/s10515-025-00548-y
Yicheng Sun, Jacky Keung, Zhen Yang, Shuo Liu, Hi Kuen Yu
Anomaly detection in software logs is crucial for development and maintenance, allowing timely identification of system failures and ensuring normal operations. Although recent deep learning advancements in log anomaly detection have shown exceptional performance, the reliance on time-consuming log parsers raises concerns about their necessity for quickly identifying anomalies. Standardized preprocessing methods can mishandle or lose important information. Additionally, the significant imbalance between normal and anomalous log data, along with the scarcity of labeled data, presents a persistent challenge in anomaly detection. We first evaluated the impact of omitting a log parser on anomaly detection models. Subsequently, we propose LogRoBERTa, an innovative anomaly detection model that eliminates the need for a parser. LogRoBERTa creates a stable and diverse labeled training set using the Determinantal Point Process (DPP) method, needing only a small amount of labeled data. The hybrid language model is based on RoBERTa’s architecture, combined with an attention-based BiLSTM. This setup leverages RoBERTa’s strong contextual understanding and BiLSTM’s capability to capture sequential dependencies, enhancing performance in complex log sequences. Experiments on four widely used datasets demonstrate that LogRoBERTa outperforms state-of-the-art benchmark models—including three fully supervised approaches—without relying on a dedicated log parser. Furthermore, its consistently strong performance on low-resource datasets highlights its robustness and generalizability across varying data conditions. These results validate the overall effectiveness of LogRoBERTa’s design and offer a thorough evaluation of the implications of bypassing a log parser. Additionally, our ablation studies and training set construction experiments further confirm the contributions of each individual component to the model’s performance. The study empirically validated that a RoBERTa-based approach effectively handles software log anomaly detection in long and complex log sequences, providing a more efficient and robust solution for omitting a parser compared to existing models.
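The training-set construction step can be approximated with a greedy Determinantal Point Process (DPP) selection over log-sequence embeddings, as in the hedged sketch below; the embeddings and kernel are placeholders rather than LogRoBERTa's actual setup.

# Sketch: greedy MAP selection for a DPP over a similarity kernel of log embeddings.
import numpy as np


def greedy_dpp(kernel, k):
    # Repeatedly add the item that most increases the log-determinant of the
    # selected submatrix, yielding a diverse subset to label.
    n = kernel.shape[0]
    selected = []
    for _ in range(k):
        best_item, best_logdet = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:
            break
        selected.append(best_item)
    return selected


rng = np.random.default_rng(1)
embeddings = rng.normal(size=(100, 16))            # stand-in for log-sequence embeddings
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
kernel = embeddings @ embeddings.T + 1e-6 * np.eye(100)   # similarity kernel (PSD)

print("indices selected for labeling:", greedy_dpp(kernel, k=10))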
BRMDS: an LLM-based multi-dimensional summary generation approach for bug reports
Pub Date: 2025-09-23 | DOI: 10.1007/s10515-025-00553-1
Yayun Zhang, Yuying Li, Minying Fang, Xing Yuan, Junwei Du
Bug report summarization aims to generate concise and accurate descriptions that help developers understand and maintain software. Existing approaches prioritize condensing report content but fail to provide a structured and well-rounded description of bugs, limiting how efficiently developers can understand them. In this paper, we leverage large language models (LLMs) to generate detailed, multi-dimensional summaries. Our intuition rests on two facts: (1) LLMs establish robust semantic connections through extensive pre-training on paired data; (2) real-world bug reports contain multi-dimensional information. We propose the Bug Report Multi-Dimensional Summary (BRMDS) approach, which defines five dimensions (environment, actual behavior, expected behavior, bug category, and solution suggestions) and uses dimension-specific instructions to guide the LLM during parameter-efficient fine-tuning (PEFT). We construct a dataset with multi-dimensional information for PEFT and experimental evaluation, addressing a gap in existing datasets in this domain. The experimental results show that multi-dimensional summaries enhance developers’ understanding of bug reports, and that BRMDS outperforms baseline approaches in both automatic and human evaluations. Our datasets are publicly available at https://github.com/yunjua/bug-reports-multi-dimensional.
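The dimension-specific instruction idea can be illustrated with the small sketch below, which pairs each of the five dimensions with its own instruction template to form prompts; the instruction wording and prompt layout are invented here, and the paper's actual prompts may differ.

# Sketch: building one instruction-style prompt per summary dimension.
DIMENSION_INSTRUCTIONS = {
    "environment": "Summarize the environment (OS, versions, configuration) in which the bug occurs.",
    "actual_behavior": "Describe what the software actually does when the bug is triggered.",
    "expected_behavior": "Describe what the software was expected to do instead.",
    "bug_category": "Classify the kind of bug being reported in a short phrase.",
    "solution_suggestions": "List any fixes or workarounds suggested in the report.",
}


def build_prompts(bug_report: str) -> dict:
    return {
        dim: f"### Instruction:\n{instr}\n\n### Bug report:\n{bug_report}\n\n### Summary:"
        for dim, instr in DIMENSION_INSTRUCTIONS.items()
    }


report = "App crashes on Android 14 when rotating the screen; expected it to keep the form state."
for dim, prompt in build_prompts(report).items():
    print(f"--- {dim} ---\n{prompt}\n")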
PIONEER: improving the robustness of student models when compressing pre-trained models of code
Pub Date: 2025-09-23 | DOI: 10.1007/s10515-025-00560-2
Xiangyue Liu, Xinwei Liu, Lili Bo, Xiaoxue Wu, Yun Yang, Xiaobing Sun, Feng Zhou
Pre-trained models of code have shown significant effectiveness in a variety of software engineering tasks, but their large size makes them difficult to deploy locally. Existing works mainly focus on compressing these large models into small models that achieve similar performance and efficient inference. However, these works overlook that the small models should also be robust enough to handle adversarial examples, which otherwise cause incorrect predictions for users. Knowledge distillation techniques typically cast model compression as a combinatorial optimization over the student architecture space to achieve the best student model performance, but they can only improve the robustness of the student model to a limited extent through traditional adversarial training. This paper proposes PIONEER (ImProvIng the RObustness of StudeNt ModEls WhEn CompRessing Code Models), a novel knowledge distillation technique that enhances the robustness of the student model without requiring adversarial training. PIONEER incorporates robustness evaluation during distillation to guide the optimization of the student model architecture. By using the probability distributions of original examples and adversarial examples as soft labels, the student model learns the features of both during training. We conduct experimental evaluations on two downstream tasks (vulnerability prediction and clone detection) for three models (CodeBERT, GraphCodeBERT, and CodeT5). We use PIONEER to compress six downstream task models into small (3 MB) models that are 206× smaller than the original size. The results show that the compressed models reduce inference latency by 76× and improve model robustness by 87.54%, with a negligible loss of effectiveness (1.67%).
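The soft-label idea can be sketched as a distillation loss in which the student matches the teacher's probability distributions on both original and adversarial inputs; the models, the perturbation, and the temperature below are simplified placeholders rather than PIONEER's actual architecture search.

# Sketch: KL-based distillation using soft labels from original AND adversarial inputs.
import torch
import torch.nn.functional as F

teacher = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
student = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
optim = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0                                            # distillation temperature

x = torch.randn(64, 16)                            # stand-in for code embeddings
x_adv = x + 0.1 * torch.randn_like(x)              # stand-in for adversarial examples

for _ in range(200):
    optim.zero_grad()
    loss = 0.0
    for inputs in (x, x_adv):                      # soft labels from both input sets
        with torch.no_grad():
            teacher_probs = F.softmax(teacher(inputs) / T, dim=1)
        student_logprobs = F.log_softmax(student(inputs) / T, dim=1)
        loss = loss + F.kl_div(student_logprobs, teacher_probs, reduction="batchmean") * (T * T)
    loss.backward()
    optim.step()

print("final distillation loss:", float(loss))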
Investigating the bugs in reinforcement learning programs: Insights from Stack Overflow and GitHub
Pub Date: 2025-09-23 | DOI: 10.1007/s10515-025-00555-z
Jiayin Song, Yike Li, Yunzhe Tian, Haoxuan Ma, Honglei Li, Jie Zuo, Jiqiang Liu, Wenjia Niu
Reinforcement learning (RL) is increasingly applied in areas such as gaming, robotic control, and autonomous driving. Like deep learning systems, RL systems also encounter failures during operation; however, RL differs from deep learning in its error causes and symptom manifestations. What are the differences in error causes and symptoms between RL and deep learning? How are RL errors and their symptoms related? Understanding the symptoms and causes of RL failures can advance research on RL failure detection and repair. In this paper, we conducted a comprehensive empirical study by collecting 1,155 error reports from the popular Q&A forum Stack Overflow and four GitHub repositories: baselines, stable-baselines3, tianshou, and keras-rl. We analyzed the root causes and symptoms of these failures and examined the differences in resolution times across root causes. Additionally, we analyzed the correlations between causes and symptoms. Our study yielded 14 key findings and six implications for developing RL failure detection and repair tools. Our work is the first to integrate LLM-based analysis with manual validation for RL bug studies, providing actionable insights for tool development and testing strategies.