
Automated Software Engineering: latest publications

Context-aware prompting for LLM-based program repair
IF 2 · CAS Tier 2 (Computer Science) · Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) · Pub Date: 2025-04-18 · DOI: 10.1007/s10515-025-00512-w
Yingling Li, Muxin Cai, Junjie Chen, Yang Xu, Lei Huang, Jianping Li

Automated program repair (APR) plays a crucial role in ensuring the quality of software code, as manual bug-fixing is extremely time-consuming and labor-intensive. Traditional APR tools (e.g., template-based approaches) face the challenge of generalizing to different bug patterns, while deep learning (DL)-based methods heavily rely on training datasets and struggle to fix unseen bugs. Recently, large language models (LLMs) have shown great potential in APR due to their ability to generate patches, having achieved promising results. However, their effectiveness is still constrained by the casually-determined context (e.g., being unable to adaptively select the specific context according to the situation of each defect). Therefore, a more effective APR approach is highly needed, which provides more precise and comprehensive context for the given defect to enhance the robustness of LLM-based APRs. In this paper, we propose a context-aware APR approach named CodeCorrector, which designs a Chain-of-Thought (CoT) approach to follow developers’ program repair behaviors. Given a failing test and its buggy file, CodeCorrector first analyzes why the test fails based on the failure message to infer repair direction; then selects the relevant context information to this repair direction; finally builds the context-aware repair prompt to guide LLMs for patch generation. Our motivation is to offer a novel perspective for enhancing LLM-based program repair through context-aware prompting, which adaptively selects specific context for a given defect. The evaluation on the widely-used Defects4J (i.e., v1.2 and v2.0) benchmark shows that overall, by executing a small number of repairs (i.e., as few as ten rounds), CodeCorrector outperforms all the state-of-the-art baselines on the more complex defects in Defects4J v2.0 and the defects without fine-grained defect localization information in Defects4J v1.2. Specifically, a total of 38 defects are fixed by only CodeCorrector. We further analyze the contributions of two core components (i.e., repair directions, global context selection) to the performance of CodeCorrector, especially repair directions, which improve CodeCorrector by 112% in correct patches and 78% in plausible patches on Defects4J v1.2. Moreover, CodeCorrector generates more valid and correct patches, achieving a 377% improvement over the base LLM GPT-3.5 and a 268% improvement over GPT-4.
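
The abstract describes a three-step flow: infer a repair direction from the failure message, select context relevant to that direction, then assemble the repair prompt. The sketch below illustrates that flow under stated assumptions; the heuristic rules, function names, and prompt wording are invented for illustration and are not CodeCorrector's implementation.

```python
# Illustrative only: heuristics, names, and prompt wording are assumptions,
# not CodeCorrector's actual code.

def infer_repair_direction(failure_message: str) -> str:
    """Map a test failure message to a coarse repair direction."""
    if "NullPointerException" in failure_message:
        return "add a null check or initialise the dereferenced object"
    if "IndexOutOfBounds" in failure_message:
        return "fix the loop bound or index arithmetic"
    if "AssertionError" in failure_message or "expected:" in failure_message:
        return "adjust the computation so the actual value matches the expected value"
    return "inspect the failing statement and repair its logic"

def select_context(buggy_file: str, suspicious_line: int, window: int = 5) -> str:
    """Slice a window of lines around the suspicious statement.
    A fuller version would also filter this window by the repair direction."""
    lines = buggy_file.splitlines()
    lo = max(0, suspicious_line - window)
    hi = min(len(lines), suspicious_line + window)
    return "\n".join(f"{i}: {line}" for i, line in enumerate(lines[lo:hi], start=lo + 1))

def build_repair_prompt(failure_message: str, buggy_file: str, suspicious_line: int) -> str:
    direction = infer_repair_direction(failure_message)
    context = select_context(buggy_file, suspicious_line)
    return (
        "A test failed with this message:\n"
        f"{failure_message}\n\n"
        f"Likely repair direction: {direction}\n\n"
        "Relevant code context:\n"
        f"{context}\n\n"
        "Produce a patched version of the shown lines."
    )

if __name__ == "__main__":
    demo_file = "int div(int a, int b) {\n    return a / b;\n}"
    print(build_repair_prompt("java.lang.ArithmeticException: / by zero", demo_file, 1))
```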

Citations: 0
UI2HTML: utilizing LLM agents with chain of thought to convert UI into HTML code
IF 2 · CAS Tier 2 (Computer Science) · Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) · Pub Date: 2025-04-17 · DOI: 10.1007/s10515-025-00509-5
Dawei Yuan, Guocang Yang, Tao Zhang

The exponential growth of the internet has led to the creation of over 1.11 billion active websites, with approximately 252,000 new sites emerging daily. This burgeoning landscape underscores a pressing need for rapid and diverse website development, particularly to support advanced functionalities like Web3 interfaces and AI-generated content platforms. Traditional methods that manually convert visual designs into functional code are not only time-consuming but also error-prone, especially challenging for non-experts. In this paper, we introduce “UI2HTML” an innovative system that harnesses the capabilities of Web Real-Time Communication and Large Language Models (LLMs) to convert website layout designs into functional user interface (UI) code. The UI2HTML system employs a sophisticated divide-and-conquer approach, augmented by Chain of Thought reasoning, to enhance the processing and accurate analysis of UI designs. It efficiently captures real-time video and audio inputs from product managers via mobile devices, utilizing advanced image processing algorithms like OpenCV to extract and categorize UI elements. This rich data, complemented by audio descriptions of UI components, is processed by backend cloud services employing Multimodal Large Language Models (MLLMs). These AI agents interpret the multimodal data to generate requirement documents and initial software architecture drafts, effectively automating the translation of webpage designs into executable code. Our comprehensive evaluation demonstrates that UI2HTML significantly outperforms existing methods in terms of visual similarity and functional accuracy through extensive testing across real-world datasets and various MLLM configurations. By offering a robust solution for the automated generation of UI code from screenshots, UI2HTML sets a new benchmark in the field, particularly beneficial in today’s fast-evolving digital environment.
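
As a rough illustration of the UI-element extraction step mentioned above, the sketch below uses OpenCV edge detection and contours to propose bounding boxes for widgets in a screenshot. The thresholds, the minimum-area filter, and the "layout.png" path are assumptions for illustration, not the UI2HTML pipeline itself.

```python
# Rough sketch: proposing candidate UI element regions in a screenshot with OpenCV.
import cv2

def extract_ui_elements(screenshot_path: str, min_area: int = 400):
    """Return bounding boxes (x, y, w, h) of likely UI elements in a screenshot."""
    image = cv2.imread(screenshot_path)
    if image is None:
        raise FileNotFoundError(screenshot_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                       # edge map of widget borders
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    # Keep only regions large enough to plausibly be buttons, inputs, images, etc.
    return [b for b in boxes if b[2] * b[3] >= min_area]

if __name__ == "__main__":
    for x, y, w, h in extract_ui_elements("layout.png"):   # placeholder screenshot path
        print(f"element at ({x}, {y}), size {w}x{h}")
```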

Citations: 0
Navigating bug cold start with contextual multi-armed bandits: an enhanced approach to developer assignment in software bug repositories
IF 2 · CAS Tier 2 (Computer Science) · Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) · Pub Date: 2025-04-16 · DOI: 10.1007/s10515-025-00508-6
Neetu Singh, Sandeep Kumar Singh

Recommending the most suitable developer for new bugs poses a challenge to triagers in software bug repositories. Bugs vary in components, severity, priority, and other significant attributes, making it difficult to address them promptly. This difficulty is further compounded by the lack of background knowledge on new bugs, which impedes traditional recommender systems. In the absence of adequate information about either a developer or a bug, building, training, and testing a conventional machine-learning model becomes arduous. In such scenarios, one potential solution is employing a reinforcement-learning model. Often, triagers resort to simplistic approaches like selecting a random developer (explore strategy) or one who has been assigned frequently (exploit strategy). However, the research presented here demonstrates that these approaches based on multi-armed bandits (MAB) perform inadequately. To address this, we propose a novel improved bandit approach that utilizes contextual or side information to automatically recommend suitable developers for new or cold bugs. Experiments conducted on five publicly available open-source datasets have revealed that contextual MAB approaches outperformed simple MAB approaches. We have additionally evaluated the efficacy of two algorithms from Multi-Armed Bandit (MAB), as well as four algorithms from the Contextual-MAB algorithm. These algorithms were assessed based on four performance metrics, namely rewards, average rewards, regret, and average regret. The experimental results present a thorough framework for developer recommendation. The results indicate that all contextual-MAB approaches consistently outperform MAB approaches.
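
For readers unfamiliar with contextual bandits, the sketch below shows a standard LinUCB-style learner choosing among developer "arms" given a bug-feature context. The feature encoding, reward, and alpha value are toy assumptions; the paper evaluates several contextual-MAB algorithms rather than this exact code.

```python
# Sketch of a LinUCB-style contextual bandit for assigning a new bug to a developer.
import numpy as np

class LinUCB:
    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vectors

    def select(self, context: np.ndarray) -> int:
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            ucb = theta @ context + self.alpha * np.sqrt(context @ A_inv @ context)
            scores.append(ucb)
        return int(np.argmax(scores))                    # exploit + explore via the UCB bonus

    def update(self, arm: int, context: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bandit = LinUCB(n_arms=5, dim=8)                     # 5 candidate developers, 8 bug features
    for _ in range(200):
        bug = rng.normal(size=8)                         # toy encoding of component/severity/priority
        dev = bandit.select(bug)
        true_best = int(np.abs(bug).argmax() % 5)        # toy ground truth: the "right" developer
        bandit.update(dev, bug, 1.0 if dev == true_best else 0.0)
    print("learned weights for developer 0:", np.round(np.linalg.inv(bandit.A[0]) @ bandit.b[0], 2))
```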

Citations: 0
The impact of unsupervised feature selection techniques on the performance and interpretation of defect prediction models
IF 2 · CAS Tier 2 (Computer Science) · Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) · Pub Date: 2025-04-16 · DOI: 10.1007/s10515-025-00510-y
Zhiqiang Li, Wenzhi Zhu, Hongyu Zhang, Yuantian Miao, Jie Ren

The performance and interpretation of a defect prediction model depend on the software metrics utilized in its construction. Feature selection techniques can enhance model performance and interpretation by effectively removing redundant, correlated, and irrelevant metrics from defect datasets. Previous empirical studies have scrutinized the impact of feature selection techniques on the performance and interpretation of defect prediction models. However, most feature selection techniques examined in these studies are primarily supervised. In particular, the impact of unsupervised feature selection (UFS) techniques on defect prediction remains unknown and needs to be explored extensively. To address this gap, we systematically apply 21 UFS techniques to evaluate their impact on the performance and interpretation of unsupervised defect prediction models in binary classification and effort-aware ranking scenarios. Extensive experiments are conducted on the 28 versions from 8 projects using 4 unsupervised models. We observe that: (1) 10–100% of the selected metrics are inconsistent between each pair of UFS techniques. (2) 29–100% of the selected metrics are inconsistent among different software modules. (3) For unsupervised defect prediction models, some UFS techniques (e.g., AutoSpearman, LS, and FMIUFS) exhibit the ability to effectively reduce the number of metrics while maintaining or even improving model performance. (4) UFS techniques alter the ranking of the top 3 groups of metrics in defect models, affecting the interpretation of these models. Based on these findings, we recommend that software practitioners utilize UFS techniques for unsupervised defect prediction. However, caution should be exercised when deriving insights and interpretations from defect prediction models.
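
As one concrete example of an unsupervised filter, the sketch below greedily drops software metrics that are strongly Spearman-correlated with an already-kept metric, in the spirit of AutoSpearman but heavily simplified. The threshold and the toy data are assumptions, not the study's exact procedure.

```python
# Simplified unsupervised metric filter: drop metrics that are redundant with kept ones.
import numpy as np
import pandas as pd

def correlation_filter(metrics: pd.DataFrame, threshold: float = 0.7) -> list:
    """Greedily keep metrics whose |Spearman rho| with every already-kept metric is below threshold."""
    corr = metrics.corr(method="spearman").abs()
    kept = []
    for col in metrics.columns:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    size_loc = rng.integers(20, 500, size=40)
    df = pd.DataFrame({
        "loc": size_loc,
        "complexity": size_loc // 10 + rng.integers(0, 3, size=40),  # nearly redundant with loc
        "churn": rng.integers(0, 20, size=40),                        # independent of size
        "fan_in": rng.integers(0, 15, size=40),                       # independent of size
    })
    print(correlation_filter(df))   # typically ['loc', 'churn', 'fan_in']; complexity is dropped
```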

Citations: 0
SIFT: enhance the performance of vulnerability detection by incorporating structural knowledge and multi-task learning
IF 2 · CAS Tier 2 (Computer Science) · Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) · Pub Date: 2025-04-11 · DOI: 10.1007/s10515-025-00507-7
Liping Wang, Guilong Lu, Xiang Chen, Xiaofeng Dai, Jianlin Qiu

Software vulnerabilities pose significant risks to software systems, leading to security breaches, data loss, operational disruptions, and substantial financial damage. Therefore, accurately detecting these vulnerabilities is of paramount importance. In recent years, pre-trained language models (PLMs) have demonstrated powerful capabilities in code representation and understanding, emerging as a promising method for vulnerability detection. However, integrating code structure knowledge while fine-tuning PLMs remains a significant challenge. To alleviate this limitation, we propose a novel vulnerability detection approach called SIFT. SIFT extracts the code property graph (CPG) to serve as the source of graph structural information. It constructs a code structure matrix from this information and measures the difference between the code structure matrix and the attention matrix using Sinkhorn Divergence to obtain the structural knowledge loss. This structural knowledge loss is then used alongside the cross-entropy loss for vulnerability detection in a multi-task learning framework to enhance overall detection performance. To evaluate the effectiveness of SIFT, we conducted experiments on three vulnerability detection datasets: FFmpeg+Qemu, Chrome+Debian, and Big-Vul. The results demonstrate that SIFT outperforms nine state-of-the-art vulnerability detection baselines, achieving performance improvements of 1.74%, 10.19%, and 2.87% in terms of F1 score, respectively. Our study shows the effectiveness of incorporating structural knowledge and multi-task learning in enhancing the performance of PLMs for vulnerability detection.
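
The sketch below illustrates the kind of auxiliary loss the abstract describes: an entropy-regularised Sinkhorn transport cost between row-normalised attention and structure matrices, added to a cross-entropy term. The cost matrix, epsilon, and normalisation are assumptions, and a full Sinkhorn divergence would additionally subtract the two self-transport terms; this is a sketch, not SIFT's implementation.

```python
# Sketch: entropic Sinkhorn transport cost as an auxiliary "structural" loss in multi-task training.
import torch

def sinkhorn_cost(a, b, C, eps=0.1, n_iters=50):
    """Entropy-regularised optimal-transport cost between 1-D distributions a and b."""
    K = torch.exp(-C / eps)                               # Gibbs kernel
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iters):                              # Sinkhorn scaling iterations
        u = a / (K @ v + 1e-9)
        v = b / (K.T @ u + 1e-9)
    P = torch.diag(u) @ K @ torch.diag(v)                 # transport plan
    return (P * C).sum()

def structural_loss(attention, structure, eps=0.1):
    """Average row-wise Sinkhorn cost between the attention and structure matrices."""
    n = attention.shape[0]
    pos = torch.arange(n, dtype=torch.float32)
    C = (pos[:, None] - pos[None, :]).abs()
    C = C / C.max()                                       # token-distance cost, scaled to [0, 1]
    attn = attention / attention.sum(dim=1, keepdim=True)
    struct = structure / structure.sum(dim=1, keepdim=True)
    return torch.stack([sinkhorn_cost(attn[i], struct[i], C, eps) for i in range(n)]).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    attention = torch.softmax(torch.randn(6, 6), dim=1)   # stand-in self-attention matrix
    structure = torch.rand(6, 6) + 1e-3                   # stand-in CPG-derived structure matrix
    ce_loss = torch.tensor(0.7)                           # stand-in vulnerability-detection loss
    total_loss = ce_loss + 0.1 * structural_loss(attention, structure)
    print(float(total_loss))                              # multi-task objective: CE + structural term
```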

Citations: 0
Synthetic versus real: an analysis of critical scenarios for autonomous vehicle testing
IF 2 · CAS Tier 2 (Computer Science) · Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) · Pub Date: 2025-04-09 · DOI: 10.1007/s10515-025-00499-4
Qunying Song, Avner Bensoussan, Mohammad Reza Mousavi

With the emergence of autonomous vehicles comes the requirement of adequate and rigorous testing, particularly in critical scenarios that are both challenging and potentially hazardous. Generating synthetic simulation-based critical scenarios for testing autonomous vehicles has therefore received considerable interest, yet it is unclear how such scenarios relate to the actual crash or near-crash scenarios in the real world. Consequently, their realism is unknown. In this paper, we define realism as the degree of similarity of synthetic critical scenarios to real-world critical scenarios. We propose a methodology to measure realism using two metrics, namely attribute distribution and Euclidean distance. The methodology extracts various attributes from synthetic and realistic critical scenario datasets and performs a set of statistical tests to compare their distributions and distances. As a proof of concept for our methodology, we compare synthetic collision scenarios from DeepScenario against realistic autonomous vehicle collisions collected by the Department of Motor Vehicles in California, to analyse how well DeepScenario synthetic collision scenarios are aligned with real autonomous vehicle collisions recorded in California. We focus on five key attributes that are extractable from both datasets, and analyse the attribute distribution and distance between scenarios in the two datasets. Further, we derive recommendations to improve the realism of synthetic scenarios based on our analysis. Our study of realism provides a framework that can be replicated and extended for other datasets concerning both real-world and synthetically-generated scenarios.
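
A minimal sketch of the two measurements named above, attribute distributions and Euclidean distance, using a two-sample Kolmogorov-Smirnov test and scaled attribute vectors. The attribute names, values, and the choice of statistical test are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch: comparing an attribute's distribution across datasets, and a per-scenario distance.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Toy "relative speed at collision" attribute (m/s) for synthetic vs. real scenarios.
synthetic_speed = rng.normal(loc=12.0, scale=3.0, size=200)
real_speed = rng.normal(loc=10.5, scale=4.0, size=150)

stat, p_value = ks_2samp(synthetic_speed, real_speed)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3f}")    # small p => the distributions differ

# Euclidean distance between (scaled) attribute vectors of two individual scenarios.
synthetic_scenario = np.array([12.0, 35.0, 1.8, 0.4, 2.0])  # five key attributes of one scenario
real_scenario = np.array([10.0, 40.0, 2.1, 0.3, 1.0])
scale = np.maximum(np.abs(synthetic_scenario), np.abs(real_scenario))
distance = np.linalg.norm(synthetic_scenario / scale - real_scenario / scale)
print(f"scaled Euclidean distance={distance:.3f}")
```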

Citations: 0
Pipe-DBT: enhancing dynamic binary translation simulators to support pipeline-level simulation
IF 2 · CAS Tier 2 (Computer Science) · Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) · Pub Date: 2025-04-05 · DOI: 10.1007/s10515-025-00506-8
Tiancheng Tang, Yi Man, Xinbing Zhou, Duqing Wang

In response to the lack of pipeline behavior modeling in Instruction-Set Simulators (ISS) and the performance limitations of Cycle-Accurate Simulators (CAS), this paper proposes Pipe-DBT, a pipeline simulation framework based on Dynamic Binary Translation (DBT). This method achieves a balance between accuracy and efficiency through two key techniques: (1) the design of a pipeline state descriptor called Pipsdep, which abstracts data hazards and resource contentions in the form of formal rules about resource occupancy and read/write behaviors, thereby avoiding low-level hardware details; (2) the introduction of a coroutine-based instruction execution flow partitioning mechanism that employs dynamic suspension/resumption to realize cycle-accurate scheduling in multi-stage pipelines. Implemented on QEMU, Pipe-DBT supports variable-length pipelines, a Very Long Instruction Word (VLIW) architecture with four-issue capability, and pipeline forwarding. Under typical DSP workloads, it achieves a simulation speed of 400–1100 KIPS, representing a 2.3× improvement over Gem5 in cycle-accurate mode. Experimental results show that only modular extensions to the host DBT framework are required to accommodate heterogeneous pipeline microarchitectures, thereby providing a high-throughput simulation infrastructure for processor design. To the best of our knowledge, this is the first pipeline-level simulation model implemented on a DBT simulator.
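
The coroutine idea can be illustrated with Python generators: each instruction yields the pipeline resource it occupies per cycle, and the scheduler suspends (stalls) younger instructions when a resource is busy. The stage names, latencies, and single-issue policy below are toy assumptions, not Pipe-DBT's QEMU-based model.

```python
# Toy coroutine-style pipeline simulation with structural-hazard stalls.
def instruction(name, stage_sequence):
    """Coroutine-style instruction: yield the pipeline resource it occupies on each cycle."""
    for stage in stage_sequence:
        yield stage

def simulate(program):
    """Advance all in-flight instructions cycle by cycle; return the total cycle count."""
    to_issue = list(program)
    in_flight = []                                    # [name, generator, requested_resource]
    cycle = 0
    while to_issue or in_flight:
        cycle += 1
        if to_issue:                                  # single-issue here; the real model is 4-issue VLIW
            name, stages = to_issue.pop(0)
            gen = instruction(name, stages)
            in_flight.append([name, gen, next(gen)])
        busy, retired = set(), []
        for slot in in_flight:                        # program order: older instructions claim first
            name, gen, request = slot
            if request in busy:
                continue                              # structural hazard: suspend (stall) this cycle
            busy.add(request)                         # occupy the resource for this cycle
            try:
                slot[2] = next(gen)                   # resume: resource needed on the next cycle
            except StopIteration:
                retired.append(slot)
        for slot in retired:
            in_flight.remove(slot)
    return cycle

if __name__ == "__main__":
    four_stage = ["fetch", "decode", "execute", "writeback"]
    slow_mul = ["fetch", "decode", "execute", "execute", "writeback"]   # two-cycle execute
    program = [("mul r1", slow_mul), ("add r2", four_stage), ("add r3", four_stage)]
    print("total cycles:", simulate(program))         # the adds stall behind the long multiply
```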

Citations: 0
An empirical study on the code naturalness modeling capability for LLMs in automated patch correctness assessment
IF 2 · CAS Tier 2 (Computer Science) · Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) · Pub Date: 2025-04-02 · DOI: 10.1007/s10515-025-00502-y
Yuning Li, Wenkang Zhong, Zongwen Shen, Chuanyi Li, Xiang Chen, Jidong Ge, Bin Luo

Just like natural language, code can exhibit naturalness. This property manifests in highly repetitive patterns within specific contexts. Code naturalness can be captured by language models and then applied to various software engineering tasks (such as fault localization and program repair). Recently, Large Language Models (LLMs) based on Transformers have become advantageous tools for modeling code naturalness. However, existing work lacks systematic studies on the code naturalness modeling capability for LLMs. To bridge this gap, this paper explores the code naturalness modeling capability for LLMs, starting with the task of automated patch correctness assessment. Specifically, we investigate whether LLMs with different architectures and scales, under varying context window sizes, (1) can identify buggy code from common code based on naturalness and consider fixed code more natural than buggy code, and (2) can distinguish different degrees of repairs (i.e., complete repairs and incomplete repairs) from automated tools. Then, we propose metrics to assess the above two capabilities of the models. Experimental results indicate that models with different architectures and scales have the code naturalness modeling capability, even models not specifically pre-trained on code. Additionally, smaller models do not necessarily exhibit weaker modeling capability compared to larger models. We also find more contextual information only provides limited benefits. Based on experimental findings, we select the best performing model that has 220 M parameters to develop an Entropy-based Automated Patch Correctness Assessment (E-APCA) approach by calculating code naturalness. On the large-scale dataset PraPatch, E-APCA surpasses traditional methods by over 20% across various evaluation metrics. Compared to the latest APCA method Entropy-delta based on a 6.7B LLM, E-APCA achieves a 17.32% higher correct patch recall and a 6.83% higher F1 score, while the reasoning time is less than 7% of that required by Entropy-delta.
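
A minimal sketch of the underlying measurement: mean token-level cross-entropy of a code snippet under a causal language model, compared between a buggy and a patched version. The "gpt2" checkpoint is only a convenient placeholder, not the 220M code model selected in the study, and the snippets are toy examples.

```python
# Sketch: code "naturalness" as mean token cross-entropy under a causal language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_cross_entropy(code: str) -> float:
    """Average negative log-likelihood per token; lower means the model finds the code more natural."""
    ids = tokenizer(code, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss             # mean cross-entropy over predicted tokens
    return float(loss)

buggy   = "def area(r):\n    return 3.14 * r + r\n"
patched = "def area(r):\n    return 3.14 * r * r\n"
eb, ep = mean_cross_entropy(buggy), mean_cross_entropy(patched)
print(f"buggy entropy={eb:.3f}  patched entropy={ep:.3f}  delta={eb - ep:+.3f}")
```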

Citations: 0
Ladle: a method for unsupervised anomaly detection across log types
IF 2 · CAS Tier 2 (Computer Science) · Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) · Pub Date: 2025-03-24 · DOI: 10.1007/s10515-025-00504-w
Juha Mylläri, Tatu Aalto, Jukka K. Nurminen

Log files can help detect and diagnose erroneous software behaviour, but their utility is limited by the ability of users and developers to sift through large amounts of text. Unsupervised machine learning tools have been developed to automatically find anomalies in logs, but they are usually not designed for situations where a large number of log streams or log files, each with its own characteristics, need to be analyzed and their anomaly scores compared. We propose Ladle, an accurate unsupervised anomaly detection and localization method that can simultaneously learn the characteristics of hundreds of log types and determine which log entries are the most anomalous across these log types. Ladle uses a sentence transformer (a large language model) to embed short overlapping segments of log files and compares new, potentially anomalous, log segments against a collection of reference data. The result of the comparison is re-centered by subtracting a baseline score indicating how much variation tends to occur in each log type, making anomaly scores comparable across log types. Ladle is designed to adapt to data drift and is updated by adding new reference data without the need to retrain the sentence transformer. We demonstrate the accuracy of Ladle on a real-world dataset consisting of logs produced by an endpoint protection platform test suite. We also compare Ladle’s performance on the dataset to that of a state-of-the-art method for single-log anomaly detection, showing that the latter is inadequate for the multi-log task.
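
The sketch below illustrates the scoring idea: embed short overlapping log segments with a sentence transformer, score a new segment by its distance to the nearest reference segment, and re-centre the score with a per-log-type baseline. The model name, window size, and baseline choice are assumptions rather than Ladle's actual configuration.

```python
# Sketch: anomaly scores for log segments, re-centred by a per-log-type baseline.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")        # placeholder sentence-transformer checkpoint

def segments(lines, size=3):
    """Short overlapping windows of consecutive log lines, joined into one string each."""
    return ["\n".join(lines[i:i + size]) for i in range(max(1, len(lines) - size + 1))]

def nearest_distance(query_vecs, ref_vecs):
    """Cosine distance from each query segment to its closest reference segment."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    r = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    return 1.0 - (q @ r.T).max(axis=1)

reference_logs = ["service started", "connection accepted", "request handled in 12ms"] * 5
new_logs = ["service started", "connection accepted", "segfault in worker thread 7"]

ref_vecs = model.encode(segments(reference_logs))
half = len(ref_vecs) // 2
baseline = float(np.median(nearest_distance(ref_vecs[:half], ref_vecs[half:])))  # typical in-type variation
scores = nearest_distance(model.encode(segments(new_logs)), ref_vecs) - baseline
print(scores)   # higher re-centred score => more anomalous segment for this log type
```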

Citations: 0
Requirement falsification for cyber-physical systems using generative models
IF 2 · CAS Tier 2 (Computer Science) · Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) · Pub Date: 2025-03-23 · DOI: 10.1007/s10515-025-00503-x
Jarkko Peltomäki, Ivan Porres

We present the OGAN algorithm for automatic requirement falsification of cyber-physical systems. System inputs and outputs are represented as piecewise constant signals over time while requirements are expressed in signal temporal logic. OGAN can find inputs that are counterexamples for the correctness of a system revealing design, software, or hardware defects before the system is taken into operation. The OGAN algorithm works by training a generative machine learning model to produce such counterexamples. It executes tests offline and does not require any previous model of the system under test. We evaluate OGAN using the ARCH-COMP benchmark problems, and the experimental results show that generative models are a viable method for requirement falsification. OGAN can be applied to new systems with little effort, has few requirements for the system under test, and exhibits state-of-the-art CPS falsification efficiency and effectiveness.
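
A much-simplified sketch of the falsification setting: piecewise-constant input signals, a signal-temporal-logic style robustness value for an "output always stays below a bound" requirement, and a search loop that looks for inputs with negative robustness. Plain random search stands in here for OGAN's trained generative model, and the toy system and requirement are invented for illustration.

```python
# Simplified falsification loop: random search over piecewise-constant inputs (not OGAN's GAN).
import numpy as np

def piecewise_constant(values, steps_per_piece=20):
    """Expand control points into a piecewise-constant input signal over time."""
    return np.repeat(np.asarray(values, dtype=float), steps_per_piece)

def toy_system(u):
    """Stand-in system under test: a leaky integrator driven by input u."""
    y, out = 0.0, []
    for ui in u:
        y = 0.9 * y + 0.2 * ui
        out.append(y)
    return np.asarray(out)

def robustness_always_below(y, bound=1.0):
    """STL-style robustness of 'globally y < bound': margin to violation (negative = falsified)."""
    return float(bound - y.max())

def falsify(n_trials=2000, n_pieces=5, seed=0):
    rng = np.random.default_rng(seed)
    best = (np.inf, None)
    for _ in range(n_trials):
        controls = rng.uniform(-1.0, 1.0, size=n_pieces)
        rob = robustness_always_below(toy_system(piecewise_constant(controls)))
        if rob < best[0]:
            best = (rob, controls)
        if rob < 0:
            break                         # counterexample found: requirement falsified
    return best

if __name__ == "__main__":
    rob, controls = falsify()
    print(f"best robustness {rob:.3f} for control points {np.round(controls, 2)}")
```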

Citations: 0