Systematic literature review on software code smell detection approaches
Pub Date: 2026-01-14. DOI: 10.1016/j.jss.2026.112784
Praveen Singh Thakur, Satyendra Singh Chouhan, Santosh Singh Rathore, Jitendra Parmar
Software code smells are subtle indicators of potential design flaws, and detecting them plays a critical role in maintaining software quality and preventing future maintenance issues. Numerous researchers have proposed tools and employed machine learning (ML) and deep learning (DL) techniques to detect software code smells. This survey systematically reviews work on detecting software code smells through tool-based, ML-based, and DL-based approaches published from 2014 to 2024. Imbalanced datasets are another vital issue in this domain: instances of software code smells are often significantly underrepresented, posing a substantial challenge for traditional detection techniques. Therefore, this review also covers efforts to detect software code smells using different imbalance learning techniques. After initial scrutiny and selection, a total of 86 studies are analyzed and reported, providing a comprehensive overview of the field. This work analyzes the intersection between software code smell detection and imbalance learning techniques, highlighting the challenges posed by imbalanced datasets. Furthermore, we identify the best-performing ML techniques (e.g., Random Forest, SVM), the most commonly detected code smells (e.g., God Class, Data Class, Long Method, and Feature Envy), and popular experimental setup techniques (e.g., k-fold cross-validation) used in prior studies. Based on this analysis, several key challenges and research gaps are identified, offering directions for future research.
{"title":"Systematic literature review on software code smell detection approaches","authors":"Praveen Singh Thakur , Satyendra Singh Chouhan , Santosh Singh Rathore , Jitendra Parmar","doi":"10.1016/j.jss.2026.112784","DOIUrl":"10.1016/j.jss.2026.112784","url":null,"abstract":"<div><div>Software code smells, subtle indicators of potential design flaws, play a critical role in maintaining software quality and preventing future maintenance issues. Numerous researchers have proposed various tools and employed different machine learning and deep learning techniques to detect software code smells. This survey systematically reviews the work conducted on detecting software code smells through tool-based, ML-based, and DL-based approaches published from 2014 to 2024. The imbalanced nature of datasets is another vital issue in this domain, where instances of software code smells are often significantly underrepresented, which poses a substantial challenge for traditional detection techniques. Therefore, this review also includes efforts to detect software code smells using different imbalance learning techniques. After initial scrutiny and selection, a total of 86 studies are analyzed and reported in this review work, providing a comprehensive overview of the field. This work comprehensively analyzes the intersection between software code smell detection and imbalance learning techniques, highlighting challenges posed by imbalanced datasets. Furthermore, we identify the best-performing ML techniques (e.g., Random Forest, SVM), the most commonly detected code smells (e.g., God Class, Data Class, Long Method, and feature Envy), and popular experimental setup techniques (e.g., K-fold cross-validation) used in prior studies. Based on the analysis, several key challenges and research gaps are identified, offering directions for future research.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112784"},"PeriodicalIF":4.1,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146038342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ParserHunter: Identify parsing functions in binary code
Pub Date: 2026-01-10. DOI: 10.1016/j.jss.2026.112783
Marco Scapin, Fabio Pinelli, Letterio Galletta
Parsing and validation functions are crucial because they process untrusted data, e.g., user inputs. Due to their complexity, these functions are highly susceptible to bugs, making them a primary target for security audits. However, identifying such functions within a binary is time-intensive and challenging, given the numerous functions typically present and the lack of source code or supporting documentation. This paper presents an AI-based methodology for identifying functions with parser-like behavior and complex processing logic within a binary. Our methodology analyzes each binary by identifying its functions, extracting their Control Flow Graphs (CFGs), and enriching them with features derived from an embedding model that captures both structural and semantic aspects of their behavior. These annotated CFGs are the input to a Graph Neural Network trained to identify parsing functions. We implement this methodology in the tool ParserHunter, which allows users to train the model on labeled data, query the model with unseen binaries, and accommodate a symbolic execution phase on the processed binary through a user interface. Our experiments on ten real-world projects from GitHub show that our tool effectively identifies parsers in binaries.
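As a rough illustration of the core idea (an annotated CFG fed to a graph neural network for function-level classification), the sketch below builds a toy CFG and applies one GCN-style layer in plain PyTorch. The node features, layer sizes, and pooling are assumptions for the example, not ParserHunter's actual embedding model or architecture.

```python
# Toy illustration: classify a function from its CFG with one GCN-style
# layer (normalized adjacency x features x weights) in plain PyTorch.
import torch

# A tiny CFG: 4 basic blocks, edges as (src, dst) pairs.
edges = [(0, 1), (1, 2), (2, 1), (1, 3)]  # the 2->1 back-edge forms a loop
num_nodes = 4

# Adjacency with self-loops, symmetrically normalized:
# A_hat = D^(-1/2) (A + I) D^(-1/2)
A = torch.zeros(num_nodes, num_nodes)
for s, d in edges:
    A[s, d] = A[d, s] = 1.0
A += torch.eye(num_nodes)
deg_inv_sqrt = A.sum(dim=1).pow(-0.5)
A_hat = deg_inv_sqrt[:, None] * A * deg_inv_sqrt[None, :]

# Per-block features (instruction count, branches, calls, ...); in the
# paper these come from an embedding model, here they are random.
x = torch.randn(num_nodes, 8)

conv = torch.nn.Linear(8, 16)    # graph-convolution weights
head = torch.nn.Linear(16, 2)    # parser vs. non-parser

h = torch.relu(conv(A_hat @ x))  # one round of neighborhood aggregation
graph_repr = h.mean(dim=0)       # mean-pool node states to a graph vector
logits = head(graph_repr)
print("class logits (untrained):", logits.detach())
```

A trained version would stack several such layers and learn the weights from labeled parser/non-parser functions.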
{"title":"ParserHunter: Identify parsing functions in binary code","authors":"Marco Scapin, Fabio Pinelli, Letterio Galletta","doi":"10.1016/j.jss.2026.112783","DOIUrl":"10.1016/j.jss.2026.112783","url":null,"abstract":"<div><div>Parsing and validation functions are crucial because they process untrusted data, e.g., user inputs. Due to their complexity, these functions are highly susceptible to bugs, making them a primary target for security audits. However, identifying such functions within a binary is time-intensive and challenging, given the numerous functions typically present and the lack of source code or supporting documentation. This paper presents an AI-based methodology for identifying functions with parser-like behavior and complex processing logic within a binary. Our methodology analyzes each binary by identifying its functions, extracting their Control Flow Graphs (CFGs), and enriching them with features derived from an embedding model that captures both structural and semantic aspects of their behavior. These annotated CFGs are the input to a Graph Neural Network trained to identify parsing functions. We implement this methodology in the tool ParserHunter, which allows users to train the model on labeled data, query the model with unseen binaries, and accommodate a symbolic execution phase on the processed binary through a user interface. Our experiments on ten real-world projects from GitHub show that our tool effectively identifies parsers in binaries.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112783"},"PeriodicalIF":4.1,"publicationDate":"2026-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A white-box prompt injection attack on embodied AI agents driven by large language models
Pub Date: 2026-01-10. DOI: 10.1016/j.jss.2026.112782
Tongcheng Geng , Yubin Qu , W. Eric Wong
With the widespread deployment of embodied AI agents in safety-critical scenarios, LLM-based decision-making systems face unprecedented risks. Existing prompt injection attacks, designed for general conversational systems, lack semantic contextual adaptability for embodied agents and fail to address scenario-specific semantics and safety constraints. This paper proposes SAPIA (Scenario-Adaptive white-box Prompt Injection Attack), which integrates an adaptive context prompt generation module with an enhanced GCG algorithm to dynamically produce scenario-targeted adversarial suffixes. We build a multi-scenario dataset of 40 dangerous instructions across four application domains (autonomous driving, robotic manipulation, drone control, and industrial control), establishing a standardized benchmark for embodied AI safety. Large-scale white-box experiments on three mainstream open-source LLMs show SAPIA substantially outperforms traditional GCG and the improved I-GCG, with notably high effectiveness on extremely high-risk instructions. Transferability analysis reveals distinctive properties in embodied settings: cross-architecture transfer is extremely limited, while high cross-version transferability exists within model series, contrasting with the cross-model transfer observed in conventional adversarial research. Ablation studies confirm that both the adaptive context module and the enhanced GCG are critical and synergistic for optimal attack performance. Robustness analyses indicate SAPIA strongly resists mainstream defenses, effectively evading input perturbation, structured self-examination, and safety prefix prompting. This work exposes serious security vulnerabilities in current embodied AI agents and underscores the urgency of scenario-based protection mechanisms for safety-critical deployments.
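The GCG component is easiest to grasp as a loop that iteratively swaps suffix tokens to lower a loss measuring how far the model is from the attacker's target output. The sketch below is a heavily simplified, self-contained toy: it proposes random single-token swaps against a stand-in loss, whereas real GCG ranks candidates using gradients through the target LLM. The vocabulary, target, and loss are all invented.

```python
# Toy sketch of adversarial-suffix search in the spirit of GCG: propose
# single-token swaps in a suffix and keep whichever swap lowers a loss.
# The vocabulary and loss are stand-ins; real GCG is gradient-guided.
import random

VOCAB = [f"tok{i}" for i in range(50)]   # hypothetical token set
TARGET = ["tok7", "tok3", "tok42", "tok3"]

def loss(suffix):
    # Stand-in objective: distance from a fixed suffix. In the real attack
    # this would be the LLM's loss on the attacker's target output.
    return sum(a != b for a, b in zip(suffix, TARGET))

suffix = random.choices(VOCAB, k=4)      # random initial suffix
for step in range(200):
    pos = random.randrange(len(suffix))  # coordinate to modify
    candidates = random.sample(VOCAB, 8) # candidate replacements
    best = min(candidates,
               key=lambda t: loss(suffix[:pos] + [t] + suffix[pos + 1:]))
    if loss(suffix[:pos] + [best] + suffix[pos + 1:]) <= loss(suffix):
        suffix[pos] = best               # greedy accept
    if loss(suffix) == 0:
        break
print("optimized suffix:", suffix, "loss:", loss(suffix))
```

SAPIA's contribution, per the abstract, is wrapping this kind of suffix optimization in scenario-adaptive context prompts rather than changing the inner loop itself.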
{"title":"A white-box prompt injection attack on embodied AI agents driven by large language models the","authors":"Tongcheng Geng , Yubin Qu , W. Eric Wong","doi":"10.1016/j.jss.2026.112782","DOIUrl":"10.1016/j.jss.2026.112782","url":null,"abstract":"<div><div>With the widespread deployment of embodied AI agents in safety-critical scenarios, LLM-based decision-making systems face unprecedented risks. Existing prompt injection attacks, designed for general conversational systems, lack semantic contextual adaptability for embodied agents and fail to address scenario-specific semantics and safety constraints. This paper proposes <strong>SAPIA</strong> (<strong>S</strong>cenario-<strong>A</strong>daptive white-box <strong>P</strong>rompt <strong>I</strong>njection <strong>A</strong>ttack), integrating an adaptive context prompt generation module with an enhanced GCG algorithm to dynamically produce scenario-targeted adversarial suffixes. We build a multi-scenario dataset of 40 dangerous instructions across four application domains–autonomous driving, robotic manipulation, drone control, and industrial control–establishing a standardized benchmark for embodied AI safety. Large-scale white-box experiments on three mainstream open-source LLMs show SAPIA substantially outperforms traditional GCG and improved I-GCG, with notably high effectiveness on extremely high-risk instructions. Transferability analysis reveals distinctive properties in embodied settings: cross-architecture transfer is extremely limited, while high cross-version transferability exists within model series, contrasting with cross-model transfer observed in conventional adversarial research. Ablation studies confirm both the adaptive context module and enhanced GCG are critical and synergistic for optimal attack performance. Robustness analyses indicate SAPIA strongly resists mainstream defenses, effectively evading input perturbation, structured self-examination, and safety prefix prompting. This work exposes serious security vulnerabilities in current embodied AI agents and underscores the urgency of scenario-based protection mechanisms for safety-critical deployments.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112782"},"PeriodicalIF":4.1,"publicationDate":"2026-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LogMeta: A few-shot model-agnostic meta-learning framework for robust and adaptive log anomaly detection
Pub Date: 2026-01-08. DOI: 10.1016/j.jss.2026.112781
Yicheng Sun, Jacky Wai Keung, Hi Kuen Yu, Wenqiang Luo
Context: Log anomaly detection is critical for maintaining the security, stability, and operational efficiency of modern software systems, especially as they generate vast and diverse log data. However, existing deep learning models struggle with the challenges of heterogeneous log formats across systems and the scarcity of labeled anomaly logs, limiting their real-world deployment and generalization capabilities.
Objective: To address these challenges, we propose LogMeta, a novel semi-supervised framework designed for adaptive and efficient log anomaly detection in diverse and low-resource environments.
Method: LogMeta integrates Model-Agnostic Meta-Learning (MAML) with a hybrid language model to address key challenges. MAML enables LogMeta to rapidly adapt to unseen log systems using few-shot samples, while the hybrid model combines RoBERTa for extracting semantic representations with Bi-LSTM and attention mechanisms to capture sequential dependencies and critical features within log sequences. This design reduces reliance on large-scale labeled datasets and enhances adaptability in heterogeneous environments.
Results: Experimental evaluations on multiple benchmark datasets demonstrate that LogMeta consistently outperforms state-of-the-art supervised and unsupervised methods, achieving up to a 28.3% improvement in F1-scores under low-resource scenarios compared to other models. Furthermore, LogMeta exhibits exceptional domain transfer capabilities, maintaining robust performance across diverse log datasets with minimal fine-tuning. In terms of efficiency, LogMeta achieves competitive training and inference times, making it suitable for real-time anomaly detection in large-scale systems.
Conclusion: LogMeta provides a scalable and practical solution for real-world log anomaly detection, overcoming challenges related to data heterogeneity and label scarcity. Its strong generalization capabilities, minimal supervision requirements, and adaptability to new log systems make it a promising tool for enhancing software system reliability and security.
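The MAML mechanics at the heart of LogMeta can be shown compactly: an inner gradient step adapts a copy of the parameters to one task's support set, and the outer update differentiates through that adaptation. The sketch below is a generic minimal MAML on toy regression tasks in PyTorch, where each task stands in for one log system and a tiny MLP stands in for the RoBERTa/Bi-LSTM hybrid; all sizes and data are illustrative.

```python
# Minimal MAML inner/outer loop in PyTorch on toy sine-regression tasks.
import torch

w1 = torch.randn(32, 1, requires_grad=True)
b1 = torch.zeros(32, requires_grad=True)
w2 = torch.randn(1, 32, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)
params = [w1, b1, w2, b2]

def forward(p, x):
    h = torch.tanh(x @ p[0].t() + p[1])
    return h @ p[2].t() + p[3]

def sample_task():
    amp = torch.rand(1) * 4 + 0.1          # one task = one sine amplitude
    def draw(n):
        x = torch.rand(n, 1) * 6 - 3
        return x, amp * torch.sin(x)
    return draw

meta_opt = torch.optim.Adam(params, lr=1e-3)
inner_lr = 0.01
for step in range(1000):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(4):                      # tasks per meta-batch
        draw = sample_task()
        xs, ys = draw(10)                   # few-shot support set
        xq, yq = draw(10)                   # query set from the same task
        # Inner step: adapt a copy of the params; create_graph=True keeps
        # the adaptation differentiable for the outer update.
        support_loss = ((forward(params, xs) - ys) ** 2).mean()
        grads = torch.autograd.grad(support_loss, params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]
        meta_loss = meta_loss + ((forward(adapted, xq) - yq) ** 2).mean()
    meta_loss.backward()                    # outer (meta) step
    meta_opt.step()
print("final meta-loss:", meta_loss.item())
```

The few-shot adaptation claimed for unseen log systems corresponds to running only the inner step at deployment time, on a handful of labeled samples from the new system.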
{"title":"LogMeta: A few-shot model-agnostic meta-learning framework for robust and adaptive log anomaly detection","authors":"Yicheng Sun, Jacky Wai Keung, Hi Kuen Yu, Wenqiang Luo","doi":"10.1016/j.jss.2026.112781","DOIUrl":"10.1016/j.jss.2026.112781","url":null,"abstract":"<div><div><strong>Context:</strong> Log anomaly detection is critical for maintaining the security, stability, and operational efficiency of modern software systems, especially as they generate vast and diverse log data. However, existing deep learning models struggle with the challenges of heterogeneous log formats across systems and the scarcity of labeled anomaly logs, limiting their real-world deployment and generalization capabilities.</div><div><strong>Objective:</strong> To address these challenges, we propose LogMeta, a novel semi-supervised framework designed for adaptive and efficient log anomaly detection in diverse and low-resource environments.</div><div><strong>Method:</strong> LogMeta integrates Model-Agnostic Meta-Learning (MAML) with a hybrid language model to address key challenges. MAML enables LogMeta to rapidly adapt to unseen log systems using few-shot samples, while the hybrid model combines RoBERTa for extracting semantic representations with Bi-LSTM and attention mechanisms to capture sequential dependencies and critical features within log sequences. This design reduces reliance on large-scale labeled datasets and enhances adaptability in heterogeneous environments.</div><div><strong>Results:</strong> Experimental evaluations on multiple benchmark datasets demonstrate that LogMeta consistently outperforms state-of-the-art supervised and unsupervised methods, achieving up to a 28.3% improvement in F1-scores under low-resource scenarios compared to other models. Furthermore, LogMeta exhibits exceptional domain transfer capabilities, maintaining robust performance across diverse log datasets with minimal fine-tuning. In terms of efficiency, LogMeta achieves competitive training and inference times, making it suitable for real-time anomaly detection in large-scale systems.</div><div><strong>Conclusion:</strong> LogMeta provides a scalable and practical solution for real-world log anomaly detection, overcoming challenges related to data heterogeneity and label scarcity. Its strong generalization capabilities, minimal supervision requirements, and adaptability to new log systems make it a promising tool for enhancing software system reliability and security.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112781"},"PeriodicalIF":4.1,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Development of an automatic class diagram generator using an AI-based GRU classification model and 5W1H heuristic rules
Pub Date: 2026-01-06. DOI: 10.1016/j.jss.2026.112780
Seungmo Jung, Woojin Lee
In software development, software requirements and class diagrams are core artifacts that are closely related to each other. Software requirements specify the system's functionality in natural language, while class diagrams are created using CASE tools to visually represent the system's structure and behavior based on these requirements. Although software requirements and class diagrams are complementary, ensuring consistency between them is challenging due to the ambiguity and vagueness inherent in natural language. To address this issue, research on automatically transforming natural language into class diagrams is being actively conducted; however, most of these studies focus on requirements written in English. In addition, existing research primarily emphasizes the grammatical structure of natural language requirements, which limits its ability to reflect the conceptual structures of specific domains. To overcome these limitations, this paper proposes a method for developing an automatic class diagram generator that utilizes an AI-based GRU classification model and 5W1H-based heuristic rules. The proposed class diagram generator extracts element and class model information from software requirements written in Korean and visualizes class diagrams based on a model interface language. For elements that can be directly extracted from natural language requirements, 5W1H-based heuristic rules considering linguistic characteristics are applied, while domain-specific elements requiring domain knowledge are extracted using the AI-based GRU classification model. Furthermore, when the class diagrams generated by the proposed tool were compared with those manually created by developers, the tool demonstrated high precision, recall, and F1-score.
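A GRU-based classifier of the kind described, which assigns requirement phrases to model-element classes, can be sketched generically in PyTorch. The vocabulary size, label set (e.g., class / attribute / operation), and layer sizes below are assumptions for illustration, not the paper's actual configuration.

```python
# Generic GRU sequence classifier: embed tokens, run a GRU, classify from
# the final hidden state. Sizes and labels are illustrative stand-ins.
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, n_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_labels)  # e.g., class/attribute/operation

    def forward(self, token_ids):
        emb = self.embed(token_ids)          # (batch, seq, embed_dim)
        _, h_n = self.gru(emb)               # h_n: (1, batch, hidden_dim)
        return self.head(h_n.squeeze(0))     # (batch, n_labels)

model = GRUClassifier()
batch = torch.randint(1, 5000, (8, 20))     # 8 dummy requirement phrases
print(model(batch).shape)                   # torch.Size([8, 3])
```

In the paper's division of labor, such a learned classifier handles the domain-specific elements, while directly extractable elements go through the 5W1H heuristic rules.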
{"title":"Development of an automatic class diagram generator using an AI-based GRU classification model and 5W1H heuristic rules","authors":"Seungmo Jung, Woojin Lee","doi":"10.1016/j.jss.2026.112780","DOIUrl":"10.1016/j.jss.2026.112780","url":null,"abstract":"<div><div>In software development, software requirements and class diagrams are core components that are closely related to each other. Software requirements specify the system's functionality in natural language, while class diagrams are created using CASE tools to visually represent the system's structure and behavior based on these requirements. Although software requirements and class diagrams are complementary, ensuring consistency between them is challenging due to the ambiguity and vagueness inherent in natural language. To address this issue, research on automatically transforming natural language into class diagrams is actively being conducted; however, most of these studies focus on requirements written in English. In addition, existing research primarily emphasizes the grammatical structure of natural language requirements, which limits their ability to reflect the conceptual structures of specific domains. To overcome these limitations, this paper proposes a method for developing an automatic class diagram generator that utilizes AI-based GRU classification model and 5W1H-based heuristic rules. The proposed class diagram generator extracts element and class model information from software requirements written in Korean and visualizes class diagrams based on a model interface language. For elements that can be directly extracted from natural language requirements, 5W1H-based heuristic rules considering linguistic characteristics are applied, while domain-specific elements requiring domain knowledge are extracted using an AI-based GRU classification model. Furthermore, when comparing the class diagrams generated by the proposed tool with those manually created by developers, the tool demonstrated high performance in terms of precision, recall, and F1-score.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112780"},"PeriodicalIF":4.1,"publicationDate":"2026-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ISRLNN: A software defect prediction method based on instance similarity reverse loss
Pub Date: 2026-01-03. DOI: 10.1016/j.jss.2025.112766
Yu Tang , Ye Du , Jian-Bo Gao , Ang Li , Ming-Song Yang
Software defect prediction is a crucial technique for ensuring software reliability. However, software defect datasets often exhibit complex feature dependencies, and traditional feature engineering methods have limitations in capturing the non-linear relationships between these features. As deep learning can effectively capture such complex relationships, it has the potential to overcome the shortcomings of traditional feature engineering techniques. In this paper, we propose the concept of the instance image and transform the software defect prediction problem into an image classification task based on instance images, thus fully leveraging the feature extraction capabilities of deep learning. Additionally, to address the limitation of the binary cross-entropy loss used in existing classification models, namely that it cannot account for differences in instance importance, we design an instance similarity reverse loss function. We first design a method to measure instance similarity and dynamically adjust instance weights during loss calculation based on this similarity. Next, we use the normalized instance similarity loss as the active loss in the active-passive loss framework. Finally, we construct a software defect prediction method based on the Instance Similarity Reverse Loss (ISRL). The experimental results show that the proposed method improves performance by 5% to 8% compared to existing works.
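One way to read the loss design: weight each instance's binary cross-entropy term by a factor derived from its similarity to other instances, so atypical instances can be emphasized. The sketch below implements a generic similarity-weighted BCE in PyTorch; the cosine-similarity weighting scheme is an assumption for illustration, not the paper's exact ISRL formula.

```python
# Generic similarity-weighted binary cross-entropy in PyTorch. Each
# instance's weight is derived from its mean cosine similarity to the
# rest of the batch (an illustrative scheme, not the paper's exact ISRL).
import torch
import torch.nn.functional as F

def similarity_weighted_bce(logits, targets, features):
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.t()                        # pairwise cosine similarity
    n = sim.size(0)
    mean_sim = (sim.sum(dim=1) - 1.0) / (n - 1)    # exclude self-similarity
    # "Reverse" weighting: the less typical an instance, the larger its weight.
    weights = (1.0 - mean_sim).clamp(min=0.1)
    weights = weights / weights.mean()             # normalize to mean 1
    return F.binary_cross_entropy_with_logits(logits, targets, weight=weights)

logits = torch.randn(16)
targets = torch.randint(0, 2, (16,)).float()
features = torch.randn(16, 32)                     # stand-in defect metrics
print(similarity_weighted_bce(logits, targets, features))
```

Because the weights are recomputed per batch, the loss adapts dynamically during training, which matches the abstract's description of dynamically adjusted instance weights.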
{"title":"ISRLNN: A software defect prediction method based on instance similarity reverse loss","authors":"Yu Tang , Ye Du , Jian-Bo Gao , Ang Li , Ming-Song Yang","doi":"10.1016/j.jss.2025.112766","DOIUrl":"10.1016/j.jss.2025.112766","url":null,"abstract":"<div><div>Software defect prediction is a crucial technique for ensuring software reliability. However, software defect datasets often exhibit complex feature dependencies and traditional feature engineering methods have limitations in capturing non-linear relationships between these features.As deep learning can effectively capture these complex relationships, they have the potential to overcome the shortcomings of traditional feature engineering techniques. In this paper, we propose the concept of instance image and transform the software defect prediction problem into an image classification task based on instance images, thus fully leveraging the feature extraction capabilities of deep learning. Additionally, to address the limitations of existing binary cross-entropy loss functions in classification models that they cannot account for instance importance differences, we also design an instance similarity reverse loss function. We first design a method to measure instance similarity and dynamically adjust the instance weights during loss calculation based on this similarity. Next, we use normalized instance similarity loss as the active loss in the active-passive loss framework. Finally, we construct a software defect prediction method based on the <u>I</u>nstance <u>S</u>imilarity <u>R</u>everse <u>L</u>oss (ISRL). The experimental results show that the proposed method improves performance by 5% to 8% compared to existing works.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112766"},"PeriodicalIF":4.1,"publicationDate":"2026-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Test case specification techniques and system testing tools in the automotive industry: A review
Pub Date: 2026-01-02. DOI: 10.1016/j.jss.2025.112764
Denesa Zyberaj , Pascal Hirmer , Marco Aiello , Stefan Wagner
The automotive domain is shifting to software-centric development to meet regulation, market pressure, and feature velocity. This shift increases embedded systems’ complexity and strains testing capacity. Despite relevant standards, a coherent system-testing methodology that spans heterogeneous, legacy-constrained toolchains remains elusive, and practice often depends on individual expertise rather than a systematic strategy. We derive challenges and requirements from a systematic literature review (SLR), complemented by industry experience and practice. We map them to test case specification techniques and testing tools, evaluating their suitability for automotive testing using PRISMA. Our contribution is a curated catalog that supports technique/tool selection and can inform future testing frameworks and improvements. We synthesize nine recurring challenge areas across the life cycle, such as requirements quality and traceability, variability management, and toolchain fragmentation. We then provide a prioritized criteria catalog that recommends model-based planning, interoperable and traceable toolchains, requirements uplift, pragmatic automation and virtualization, targeted AI and formal methods, actionable metrics, and lightweight organizational practices.
{"title":"Test case specification techniques and system testing tools in the automotive industry: A review","authors":"Denesa Zyberaj , Pascal Hirmer , Marco Aiello , Stefan Wagner","doi":"10.1016/j.jss.2025.112764","DOIUrl":"10.1016/j.jss.2025.112764","url":null,"abstract":"<div><div>The automotive domain is shifting to software-centric development to meet regulation, market pressure, and feature velocity. This shift increases embedded systems’ complexity and strains testing capacity. Despite relevant standards, a coherent system-testing methodology that spans heterogeneous, legacy-constrained toolchains remains elusive, and practice often depends on individual expertise rather than a systematic strategy. We derive challenges and requirements from a systematic literature review (SLR), complemented by industry experience and practice. We map them to test case specification techniques and testing tools, evaluating their suitability for automotive testing using PRISMA. Our contribution is a curated catalog that supports technique/tool selection and can inform future testing frameworks and improvements. We synthesize nine recurring challenge areas across the life cycle, such as requirements quality and traceability, variability management, and toolchain fragmentation. We then provide a prioritized criteria catalog that recommends model-based planning, interoperable and traceable toolchains, requirements uplift, pragmatic automation and virtualization, targeted AI and formal methods, actionable metrics, and lightweight organizational practices.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112764"},"PeriodicalIF":4.1,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RLV: LLM-based vulnerability detection by retrieving and refining contextual information
Pub Date: 2026-01-02. DOI: 10.1016/j.jss.2025.112756
Fangcheng Qiu , Zhongxin Liu , Bingde Hu , Zhengong Cai , Lingfeng Bao , Xinyu Wang
Vulnerability detection plays a critical role in ensuring software quality during software development and maintenance. Automated vulnerability detection methods have been proposed to reduce the consumption of human and material resources. From traditional machine learning-based approaches to deep learning-based approaches, vulnerability detection techniques have continuously evolved and improved. Recently, Large Language Models (LLMs) have been increasingly applied to vulnerability detection. However, deep learning-based and LLM-based approaches suffer from two main problems: (1) they generalize poorly, which limits their performance in real-world scenarios; and (2) they lack accurate contextual information about the target function, which hinders their ability to correctly understand it. To tackle these problems, we propose a novel vulnerability detection approach named RLV (Retrieving & Refining Contextual Information for LLM-based Vulnerability Detection), an LLM-based approach that enhances vulnerability detection by integrating project-level contextual information into the analysis process. RLV emulates how programmers reason about code, enabling the LLM to retrieve and refine relevant semantic context from the project repository to better understand the target function. Besides, RLV guides the LLM via effective prompts, avoiding task-specific training and enhancing its practicality in real-world scenarios. We conduct experiments on two vulnerability datasets with a total of 30,436 vulnerable functions and 306,269 non-vulnerable functions. The experimental results demonstrate that our approach achieves state-of-the-art performance. Moreover, our approach achieves a 26.83% improvement in F1-score over state-of-the-art baselines when tested on unseen projects.
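To picture the retrieve-then-prompt flow, here is a toy sketch: given a target function, gather the definitions of the functions it calls from a mock repository index, then assemble a detection prompt for an LLM. The repository, retrieval heuristic, and the `ask_llm` stub are hypothetical placeholders, not RLV's actual prompts or retrieval logic.

```python
# Toy sketch of retrieval-augmented vulnerability detection: pull callee
# definitions from a mock repo index and build an LLM prompt around the
# target function. All names and the ask_llm stub are hypothetical.
import re

REPO_INDEX = {  # mock project index: function name -> source
    "copy_input": "void copy_input(char *dst, const char *src) { strcpy(dst, src); }",
    "log_event":  "void log_event(const char *msg) { puts(msg); }",
}

TARGET_FUNCTION = """
void handle_request(const char *req) {
    char buf[64];
    copy_input(buf, req);
    log_event(buf);
}
"""

def retrieve_context(func_src):
    # Naive retrieval: look up every identifier that is called in the target.
    callees = set(re.findall(r"\b(\w+)\s*\(", func_src))
    return [src for name, src in REPO_INDEX.items() if name in callees]

def build_prompt(func_src):
    context = "\n".join(retrieve_context(func_src))
    return (
        "You are a security auditor. Context (callee definitions):\n"
        f"{context}\n\nTarget function:\n{func_src}\n"
        "Is the target function vulnerable? Answer yes/no and explain."
    )

def ask_llm(prompt):          # placeholder for a real LLM API call
    return "yes: copy_input uses strcpy without bounds checking"

print(ask_llm(build_prompt(TARGET_FUNCTION)))
```

The point of the context step is visible even in the toy: without the callee definition, the target function alone gives the model no way to know that `copy_input` performs an unbounded copy.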
{"title":"RLV: LLM-based vulnerability detection by retrieving and refining contextual information","authors":"Fangcheng Qiu , Zhongxin Liu , Bingde Hu , Zhengong Cai , Lingfeng Bao , Xinyu Wang","doi":"10.1016/j.jss.2025.112756","DOIUrl":"10.1016/j.jss.2025.112756","url":null,"abstract":"<div><div>Vulnerability detection plays a critical role in ensuring software quality during the processes of software development and maintenance. Automated vulnerability detection methods have been proposed to reduce the consumption of human and material resources. From traditional machine learning-based approaches to deep learning-based approaches, vulnerability detection techniques have continuously evolved and improved. Recently, Large Language Models (LLMs) have been increasingly applied to vulnerability detection. However, deep learning-based approaches and LLM-based approaches suffer from two main problems: (1) They suffer from poor generalization capabilities, which limit their performance in real-world scenarios. (2) They lack accurate contextual information of the target function, which hinders their ability to correctly understand the target function. To tackle these problems, in this paper, we propose a novel vulnerability detection approach, named <span>RLV</span> (<u><strong>R</strong></u>etrieving&Refining Contextual Information for <u><strong>L</strong></u>LM-based <u><strong>V</strong></u>ulnerability Detection), an LLM-based approach that enhances vulnerability detection by integrating project-level contextual information into the analysis process. RLV emulates how programmers reason about code, enabling the LLM to retrieve and refine relevant semantic context from the project repository to better understand the target function. Besides, RLV guides the LLM via effective prompts, avoiding task-specific training and enhancing its practicality in real-world scenarios. We conduct experiments on two vulnerability datasets with a total of 30,436 vulnerable functions and 306,269 non-vulnerable functions. The experimental results demonstrate that our approach achieves state-of-the-art performance. Moreover, our approach achieves a 26.83% improvement in terms of <em>F</em><sub>1</sub>-score over state-of-the-art baselines when tested on unseen projects.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112756"},"PeriodicalIF":4.1,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MLOps pipeline generation for reinforcement learning: A low-code approach using large language models
Pub Date: 2026-01-01. DOI: 10.1016/j.jss.2025.112760
Stephen John Warnett , Evangelos Ntentos , Uwe Zdun
MLOps (Machine Learning Operations) and its application to Reinforcement Learning (RL) involve various challenges when integrating Machine Learning and RL models into production systems, entailing considerable expertise and manual effort, which can be error-prone and obstruct scalability and rapid deployment. We propose a new approach to address these challenges in generating MLOps pipelines. We present a low-code, template-based approach leveraging Large Language Models (LLMs) to automate RL pipeline generation, validation and deployment. In our approach, the Pipes and Filters pattern allows for the fine-grained generation of MLOps pipeline configuration files. Built-in error detection and correction help maintain high-quality output standards.
To empirically evaluate our solution, we assess the correctness of pipelines generated with seven LLMs for three open-source RL projects. Our initial approach achieved an average error rate of 0.187 across all seven LLMs. OpenAI GPT-4o performed best with an error rate of just 0.09, followed by Qwen2.5 Coder at 0.15. We then made a single round of improvements to our implementation and low-code template, and reevaluated the solution on the best-performing LLM from the initial evaluation, achieving an overall error rate of zero for OpenAI GPT-4o. Our findings indicate that pipelines generated by our approach have low error rates, potentially enabling rapid scaling and deployment of reliable MLOps for RL pipelines, particularly for practitioners lacking advanced software engineering or DevOps skills. Our approach contributes towards demonstrating increased reliability and trustworthiness in LLM-based solutions, despite the uncertainty hitherto associated with LLMs.
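The Pipes and Filters idea here means each pipeline stage is generated and validated as a separate filter before being chained. Below is a toy sketch of that flow: a per-stage generation step with a hypothetical `generate_stage` LLM stub, and a validation filter that rejects malformed YAML. It assumes the PyYAML package; none of this is the paper's actual template or toolchain.

```python
# Toy pipes-and-filters generation of an MLOps pipeline config: each stage
# is produced by a (stubbed) LLM call, then validated before being chained.
# Requires PyYAML. The stages and generate_stage stub are hypothetical.
import yaml

STAGES = ["train", "evaluate", "deploy"]

def generate_stage(stage):            # placeholder for a real LLM call
    return f'{stage}:\n  image: "rl-runner:latest"\n  command: "python {stage}.py"'

def validate_filter(snippet):
    # Filter: reject snippets that are not well-formed YAML mappings.
    parsed = yaml.safe_load(snippet)
    if not isinstance(parsed, dict):
        raise ValueError(f"malformed stage config: {snippet!r}")
    return parsed

pipeline = {}
for stage in STAGES:                  # pipe each generated stage through validation
    pipeline.update(validate_filter(generate_stage(stage)))

print(yaml.safe_dump({"pipeline": pipeline}, sort_keys=False))
```

Generating and checking one small filter at a time, rather than one monolithic config, is what makes the error detection and correction loop described above tractable.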
{"title":"MLOps pipeline generation for reinforcement learning: A low-code approach using large language models","authors":"Stephen John Warnett , Evangelos Ntentos , Uwe Zdun","doi":"10.1016/j.jss.2025.112760","DOIUrl":"10.1016/j.jss.2025.112760","url":null,"abstract":"<div><div>MLOps (Machine Learning Operations) and its application to Reinforcement Learning (RL) involve various challenges when integrating Machine Learning and RL models into production systems, entailing considerable expertise and manual effort, which can be error-prone and obstruct scalability and rapid deployment. We propose a new approach to address these challenges in generating MLOps pipelines. We present a low-code, template-based approach leveraging Large Language Models (LLMs) to automate RL pipeline generation, validation and deployment. In our approach, the Pipes and Filters pattern allows for the fine-grained generation of MLOps pipeline configuration files. Built-in error detection and correction help maintain high-quality output standards.</div><div>To empirically evaluate our solution, we assess the correctness of pipelines generated with seven LLMs for three open-source RL projects. Our initial approach achieved an average error rate of 0.187 across all seven LLMs. OpenAI GPT-4o performed the best with an error rate of just 0.09, followed by Qwen2.5 Coder with an error rate of 0.15. We implemented a single round of improvements to our implementation and low-code template. We reevaluated our solution on the best-performing LLM from the initial evaluation, achieving perfect results with an overall error rate of zero for OpenAI GPT-4o. Our findings indicate that pipelines generated by our approach have low error rates, potentially enabling rapid scaling and deployment of reliable MLOps for RL pipelines, particularly for practitioners lacking advanced software engineering or DevOps skills. Our approach contributes towards demonstrating increased reliability and trustworthiness in LLM-based solutions, despite the uncertainty hitherto associated with LLMs.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112760"},"PeriodicalIF":4.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring challenges in test mocking: Developer questions and insights from StackOverflow
Pub Date: 2025-12-31. DOI: 10.1016/j.jss.2025.112748
Mumtahina Ahmed , Md Nahidul Islam Opu , Chanchal Roy , Sujana Islam Suhi , Shaiful Chowdhury
Mocking is a common unit testing technique used to simplify tests, reduce flakiness, and improve coverage by replacing real dependencies with simplified implementations. Despite its widespread use in Open Source Software (OSS) projects, there is limited understanding of how and why developers use mocks and the challenges they face. In this study, we analyzed 25,302 questions related to Mocking on StackOverflow to identify the challenges developers face. We used Latent Dirichlet Allocation (LDA) for topic modeling, identified 30 key topics, and grouped them into five key categories. We then analyzed the annual and relative probabilities of each category to understand the evolution of mocking-related discussions. Trend analysis reveals that categories such as Mocking Techniques and External Services have remained consistently dominant, highlighting evolving developer priorities and ongoing technical challenges. While questions in the Theoretical category declined after 2010, posts regarding Error Handling grew notably from 2009.
Our findings also show an inverse relationship between a topic's popularity and its difficulty. Popular topics like Framework Selection tend to have lower difficulty and faster resolution times, while complex topics like HTTP Requests and Responses are more likely to remain unanswered and take longer to resolve. Additionally, we evaluated questions based on their answer status (successful, ordinary, or unsuccessful) and found that topics such as Framework Selection have higher success rates, whereas tool setup and Android-related issues more often remain unresolved. A classification of questions into How, Why, What, and Other revealed that over 64% are How questions, particularly in practical domains like file access, APIs, and databases, indicating a strong need for implementation guidance. Why questions are more prevalent in error-handling contexts, reflecting conceptual challenges in debugging, while What questions are rare and mostly tied to theoretical discussions. These insights offer valuable guidance for improving developer support, tooling, and educational content in the context of mocking and unit testing.
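For readers unfamiliar with the practice under study, here is a minimal example of mocking with Python's standard unittest.mock: a real HTTP dependency is replaced with a canned response so the unit test stays fast and deterministic. The `fetch_user` function and URL are invented for illustration; the `requests` package is assumed only for the dependency being mocked.

```python
# Minimal mocking example using Python's stdlib unittest.mock: the real
# HTTP call is replaced with a canned object so the test is fast and
# deterministic. fetch_user and the URL are invented for illustration.
import unittest
from unittest.mock import Mock, patch

import requests  # the real dependency being mocked out below

def fetch_user(user_id):
    resp = requests.get(f"https://api.example.com/users/{user_id}")
    resp.raise_for_status()
    return resp.json()["name"]

class FetchUserTest(unittest.TestCase):
    @patch("requests.get")                      # swap requests.get for a Mock
    def test_returns_user_name(self, mock_get):
        mock_get.return_value = Mock(
            json=lambda: {"name": "Ada"},
            raise_for_status=lambda: None,
        )
        self.assertEqual(fetch_user(42), "Ada")
        mock_get.assert_called_once_with("https://api.example.com/users/42")

if __name__ == "__main__":
    unittest.main()
```

This pattern, replacing an external service behind a seam and asserting on the interaction, is exactly the kind of usage whose pitfalls (patch targets, setup, framework choice) dominate the StackOverflow questions analyzed above.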
{"title":"Exploring challenges in test mocking: Developer questions and insights from StackOverflow","authors":"Mumtahina Ahmed , Md Nahidul Islam Opu , Chanchal Roy , Sujana Islam Suhi , Shaiful Chowdhury","doi":"10.1016/j.jss.2025.112748","DOIUrl":"10.1016/j.jss.2025.112748","url":null,"abstract":"<div><div>Mocking is a common unit testing technique that is used to simplify tests, reduce flakiness, and improve coverage by replacing real dependencies with simplified implementations. Despite its widespread use in Open Source Software (OSS) projects, there is limited understanding of how and why developers use mocks and the challenges they face. In this study, we have analyzed 25,302 questions related to <em>Mocking</em> on StackOverflow to identify the challenges faced by developers. We have used Latent Dirichlet Allocation (LDA) for topic modeling, identified 30 key topics, and grouped the topics into five key categories. Consequently, we analyzed the annual and relative probabilities of each category to understand the evolution of mocking-related discussions. Trend analysis reveals that categories such as <em>Mocking Techniques</em> and <em>External Services</em> have remained consistently dominant, highlighting evolving developer priorities and ongoing technical challenges. While the questions on <em>Theoretical</em> category declined after 2010, posts regarding <em>Error Handling</em> grew notably from 2009.</div><div>Our findings also show an inverse relationship between a topic’s popularity and its difficulty. Popular topics like <em>Framework Selection</em> tend to have lower difficulty and faster resolution times, while complex topics like <em>HTTP Requests and Responses</em> are more likely to remain unanswered and take longer to resolve. Additionally, we evaluated questions based on the answer status- successful, ordinary, or unsuccessful, and found that topics such as <em>Framework Selection</em> have higher success rates, whereas tool setup and Android-related issues are more often unresolved. A classification of questions into <em>How, Why, What</em>, and <em>Other</em> revealed that over 64 % are <em>How</em> questions, particularly in practical domains like file access, APIs, and databases, indicating a strong need for implementation guidance. <em>Why</em> questions are more prevalent in error-handling contexts, reflecting conceptual challenges in debugging, while <em>What</em> questions are rare and mostly tied to theoretical discussions. These insights offer valuable guidance for improving developer support, tooling, and educational content in the context of mocking and unit testing.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112748"},"PeriodicalIF":4.1,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}