MirrorFuzz: Leveraging LLM and Shared Bugs for Deep Learning Framework APIs Fuzzing
Pub Date: 2025-10-10 | DOI: 10.1109/TSE.2025.3619966
Shiwen Ou;Yuwei Li;Lu Yu;Chengkun Wei;Tingke Wen;Qiangpu Chen;Yu Chen;Haizhi Tang;Zulie Pan
Deep learning (DL) frameworks serve as the backbone for a wide range of artificial intelligence applications. However, bugs within DL frameworks can cascade into critical issues in higher-level applications, jeopardizing reliability and security. While numerous techniques have been proposed to detect bugs in DL frameworks, research exploring common API patterns across frameworks and the potential risks they entail remains limited. Notably, many DL frameworks expose similar APIs with overlapping input parameters and functionalities, rendering them vulnerable to shared bugs, where a flaw in one API may extend to analogous APIs in other frameworks. To address this challenge, we propose MirrorFuzz, an automated API fuzzing solution to discover shared bugs in DL frameworks. MirrorFuzz operates in three stages: first, it collects historical bug data for each API within a DL framework to identify potentially buggy APIs; second, it matches each buggy API in a specific framework with similar APIs within and across other DL frameworks; third, it employs large language models (LLMs) to synthesize code for the API under test, leveraging the historical bug data of similar APIs to trigger analogous bugs. We implement MirrorFuzz and evaluate it on four popular DL frameworks (TensorFlow, PyTorch, OneFlow, and Jittor). Extensive evaluation demonstrates that MirrorFuzz improves code coverage by 39.92% and 98.20% over state-of-the-art methods on TensorFlow and PyTorch, respectively. Moreover, MirrorFuzz discovers 315 bugs, of which 262 are newly found; 80 bugs have been fixed, and 52 have been assigned CNVD IDs.
{"title":"MirrorFuzz: Leveraging LLM and Shared Bugs for Deep Learning Framework APIs Fuzzing","authors":"Shiwen Ou;Yuwei Li;Lu Yu;Chengkun Wei;Tingke Wen;Qiangpu Chen;Yu Chen;Haizhi Tang;Zulie Pan","doi":"10.1109/TSE.2025.3619966","DOIUrl":"10.1109/TSE.2025.3619966","url":null,"abstract":"Deep learning (DL) frameworks serve as the backbone for a wide range of artificial intelligence applications. However, bugs within DL frameworks can cascade into critical issues in higher-level applications, jeopardizing reliability and security. While numerous techniques have been proposed to detect bugs in DL frameworks, research exploring common API patterns across frameworks and the potential risks they entail remains limited. Notably, many DL frameworks expose similar APIs with overlapping input parameters and functionalities, rendering them vulnerable to shared bugs, where a flaw in one API may extend to analogous APIs in other frameworks. To address this challenge, we propose MirrorFuzz, an automated API fuzzing solution to discover shared bugs in DL frameworks. MirrorFuzz operates in three stages: First, MirrorFuzz collects historical bug data for each API within a DL framework to identify potentially buggy APIs. Second, it matches each buggy API in a specific framework with similar APIs within and across other DL frameworks. Third, it employs large language models (LLMs) to synthesize code for the API under test, leveraging the historical bug data of similar APIs to trigger analogous bugs across APIs. We implement MirrorFuzz and evaluate it on four popular DL frameworks (TensorFlow, PyTorch, OneFlow, and Jittor). Extensive evaluation demonstrates that MirrorFuzz improves code coverage by 39.92% and 98.20% compared to state-of-the-art methods on TensorFlow and PyTorch, respectively. Moreover, MirrorFuzz discovers 315 bugs, 262 of which are newly found, and 80 bugs are fixed, with 52 of these bugs assigned CNVD IDs.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 1","pages":"360-375"},"PeriodicalIF":5.6,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11201027","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145260849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficiently Testing Distributed Systems via Abstract State Space Prioritization
Pub Date: 2025-10-09 | DOI: 10.1109/TSE.2025.3618976
Yu Gao;Dong Wang;Wensheng Dou;Wenhan Feng;Yu Liang;Jun Wei
The last five years have seen a rise in model checking guided testing (MCGT) approaches for systematically testing distributed systems. MCGT approaches generate test cases for distributed systems by traversing their verified abstract state spaces, simultaneously solving the three key problems in testing distributed systems: test input generation, test oracle construction, and execution space enumeration. However, existing MCGT approaches struggle to traverse the huge state space of distributed systems, which can contain billions of system states. This makes finding bugs time-consuming and expensive, often taking several weeks. In this paper, we propose Mosso to speed up model checking guided testing for distributed systems. We observe that the abstract state space of distributed systems contains many redundant test scenarios. Based on the characteristics of these redundant scenarios, we propose three strategies (action independence, node symmetry, and scenario equivalence) to identify and prioritize unique test scenarios when traversing the state space. We have applied Mosso to three real-world distributed systems. By employing the three strategies, our approach achieves an average speedup of 56X (up to 208X) over the state-of-the-art MCGT approach. Additionally, it has uncovered two previously unknown bugs.
{"title":"Efficiently Testing Distributed Systems via Abstract State Space Prioritization","authors":"Yu Gao;Dong Wang;Wensheng Dou;Wenhan Feng;Yu Liang;Jun Wei","doi":"10.1109/TSE.2025.3618976","DOIUrl":"10.1109/TSE.2025.3618976","url":null,"abstract":"The last five years have seen a rise of model checking guided testing (MCGT) approaches for systematically testing distributed systems. MCGT approaches generate test cases for distributed systems by traversing their verified abstract state spaces, simultaneously solving the three key problems faced in testing distributed systems, i.e., test input generation, test oracle construction and execution space enumeration. However, existing MCGT approaches struggle with traversing the huge state space of distributed systems, which can contain billions of system states. This makes the process of finding bugs time-consuming and expensive, often taking several weeks. In this paper, we propose <monospace>Mosso</monospace> to speed up model checking guided testing for distributed systems. We observe that there exist lots of redundant test scenarios in the abstract state space of distributed systems. Considering the characteristics of these redundant test scenarios, we propose three strategies: action independence, node symmetry and scenario equivalence, to identify and prioritize unique test scenarios when traversing the state space. We have applied <monospace>Mosso</monospace> on three real-world distributed systems. By employing the three strategies, our approach has achieved an average speedup of 56X (up to 208X) compared to the state-of-art MCGT approach. Additionally, our approach has successfully uncovered 2 previously-unknown bugs.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 2","pages":"395-410"},"PeriodicalIF":5.6,"publicationDate":"2025-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145255731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Function Orchestration for Large Language Models
Pub Date: 2025-10-09 | DOI: 10.1109/TSE.2025.3619112
Xiaoxia Liu;Peng Di;Cong Li;Jun Sun;Jingyi Wang
Function calling is a fundamental capability of today’s large language models, but sequential function calling poses efficiency problems. Recent studies have proposed requesting function calls with parallelism support to alleviate this issue. However, these approaches either delegate concurrent function calls to users, who in practice execute them sequentially, or overlook the relations among function calls, yielding limited efficiency. This paper introduces LLMOrch, an advanced framework for automated, parallel function calling in large language models. The key principle behind LLMOrch is to identify an available processor to execute each function call while preventing any single processor from becoming overburdened. To this end, LLMOrch models the data relations (i.e., definition-use dependencies) among function calls and coordinates their execution according to their control relations (i.e., mutual exclusion) as well as the working status of the underlying processors. Compared with state-of-the-art techniques, LLMOrch demonstrated comparable efficiency improvements when orchestrating I/O-intensive functions, while significantly outperforming them (by 2×) on compute-intensive functions. LLMOrch’s performance also scaled linearly with the number of allocated processors. We believe these results highlight the potential of LLMOrch as an efficient solution for parallel function orchestration in the context of large language models.
{"title":"Efficient Function Orchestration for Large Language Models","authors":"Xiaoxia Liu;Peng Di;Cong Li;Jun Sun;Jingyi Wang","doi":"10.1109/TSE.2025.3619112","DOIUrl":"10.1109/TSE.2025.3619112","url":null,"abstract":"Function calling is a fundamental capability of today’s large language models, but sequential function calling posed efficiency problems. Recent studies have proposed to request function calls with parallelism support in order to alleviate this issue. However, they either delegate the concurrent function calls to users for execution which are conversely executed sequentially, or overlook the relations among various function calls, rending limited efficiency. This paper introduces <monospace>LLMOrch</monospace>, an advanced framework for automated, parallel function calling in large language models. The key principle behind <monospace>LLMOrch</monospace> is to identify an available processor to execute a function call while preventing any single processor from becoming overburdened. To this end, <monospace>LLMOrch</monospace> models the data relations (i.e., definition-use (def-use) dependencies among different function calls and coordinates their executions by their contro l relations (i.e., mutual-exclusion) as well as the working status of the underlying processors. When comparing with state-of-the-art techniques, <monospace>LLMOrch</monospace> demonstrated comparable efficiency improvements in orchestrating I/O-intensive functions, while significantly outperforming (2<inline-formula><tex-math>$times$</tex-math></inline-formula>) them with compute-intensive functions. <monospace>LLMOrch</monospace>’s performance even showed a linear correlation to the number of allocated processors. We believe that these results highlight the potential of <monospace>LLMOrch</monospace> as an efficient solution for parallel function orchestration in the context of large language models.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 2","pages":"411-427"},"PeriodicalIF":5.6,"publicationDate":"2025-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145255682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Secure Code Generation with LLMs: A Study on Common Weakness Enumeration
Pub Date: 2025-10-09 | DOI: 10.1109/tse.2025.3619281
Jianguo Zhao, Yuqiang Sun, Cheng Huang, Chengwei Liu, YaoHui Guan, Yutong Zeng, Yang Liu
{"title":"Towards Secure Code Generation with LLMs: A Study on Common Weakness Enumeration","authors":"Jianguo Zhao, Yuqiang Sun, Cheng Huang, Chengwei Liu, YaoHui Guan, Yutong Zeng, Yang Liu","doi":"10.1109/tse.2025.3619281","DOIUrl":"https://doi.org/10.1109/tse.2025.3619281","url":null,"abstract":"","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"18 1","pages":""},"PeriodicalIF":7.4,"publicationDate":"2025-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145255730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Manifestations of Empathy in Software Engineering: How, Why, and When It Matters
Pub Date: 2025-10-09 | DOI: 10.1109/tse.2025.3612888
Hashini Gunatilake, John Grundy, Rashina Hoda, Ingo Mueller
{"title":"Manifestations of Empathy in Software Engineering: How, Why, and When It Matters","authors":"Hashini Gunatilake, John Grundy, Rashina Hoda, Ingo Mueller","doi":"10.1109/tse.2025.3612888","DOIUrl":"https://doi.org/10.1109/tse.2025.3612888","url":null,"abstract":"","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":7.4,"publicationDate":"2025-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145255630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aging-Related Bug Prediction Based on Multi-View Graph Feature Learning and Graph-Transformer
Pub Date: 2025-10-08 | DOI: 10.1109/TSE.2025.3618113
Chen Zhang;Jianwen Xiang;Rui Hao;Kai Jia;Jing Tian;Roberto Natella;Roberto Pietrantuono;Domenico Cotroneo
Software aging, characterized by an increasing failure rate or performance degradation in long-running software systems, poses significant risks, including substantial financial losses and potential threats to human lives. This phenomenon is primarily driven by the accumulation of runtime errors, commonly referred to as aging-related bugs (ARBs). Aging-related bug prediction (ARBP) has been proposed to facilitate the detection and remediation of ARBs prior to software release. However, ARBP’s effectiveness heavily depends on the quality of the dataset features used. Previous research has largely relied on a standard set of manually designed metrics, often overlooking that these metrics may fail to distinguish between code segments with different semantics, even when they exhibit identical metric values. While some studies have attempted to develop models that learn semantic features from source code, they typically focus on token-level or graph-level features, neglecting a comprehensive exploration of ARB characteristics within the source code. Specifically, there is insufficient discussion on whether deep semantic features can adequately capture the essential traits that trigger aging phenomena. In this paper, we propose a novel multi-view graph feature learning framework based on Graph-Transformer, which integrates newly proposed ARB features extracted from Abstract Syntax Trees with Code Property Graphs for feature learning. Our approach effectively captures hierarchical structures and variable dependencies, facilitating the identification of complex interactions that contribute to ARBs. Additionally, we implement sub-graph sampling and class imbalance strategies to enhance model performance. Experimental results across three datasets demonstrate that our method surpasses a state-of-the-art code property graph-based feature extraction method (SGT), achieving precision improvements of 8.2 percentage points on Linux, 15.4 on MySQL, and 2.5 on NetBSD, thereby establishing a new benchmark for ARB prediction.
{"title":"Aging-Related Bug Prediction Based on Multi-View Graph Feature Learning and Graph-Transformer","authors":"Chen Zhang;Jianwen Xiang;Rui Hao;Kai Jia;Jing Tian;Roberto Natella;Roberto Pietrantuono;Domenico Cotroneo","doi":"10.1109/TSE.2025.3618113","DOIUrl":"10.1109/TSE.2025.3618113","url":null,"abstract":"Software aging, characterized by an increasing failure rate or performance degradation in long-running software systems, poses significant risks, including substantial financial losses and potential threats to human lives. This phenomenon is primarily driven by the accumulation of runtime errors, commonly referred to as aging-related bugs (ARBs). Aging-related bug prediction (ARBP) has been proposed to facilitate the detection and remediation of ARBs prior to software release. However, ARBP’s effectiveness heavily depends on the quality of dataset features used. Previous research has largely relied on a standard set of manually designed metrics, often overlooking that these metrics may fail to distinguish between code segments with different semantics, even when they exhibit identical metric values. While some studies have attempted to develop models that learn semantic features from source code, they typically focus on token-level or graph-level features, neglecting a comprehensive exploration of ARB characteristics within the source code. Specifically, there is insufficient discussion on whether deep semantic features can adequately capture the essential traits that trigger aging phenomena. In this paper, we propose a novel multi-view graph feature learning framework based on Graph-Transformer, which integrates newly proposed ARB features extracted from Abstract Syntax Trees with Code Property Graphs for feature learning. Our approach effectively captures hierarchical structures and variable dependencies, facilitating the identification of complex interactions that contribute to ARBs. Additionally, we implement sub-graph sampling and class imbalance strategies to enhance model performance. Experimental results across three datasets demonstrate that our method surpasses state-of-the-art approaches, a code property graph-based feature extraction method (specifically SGT), achieving precision improvements of 8.2 percentage points on Linux, 15.4 percentage points on MySQL, and 2.5 percentage points on NetBSD, thereby establishing a new benchmark for ARB prediction.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 1","pages":"221-245"},"PeriodicalIF":5.6,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145247017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CEDAR: Silent Control Flow Error Detection via Heterogeneous Relation Learning
Pub Date: 2025-10-08 | DOI: 10.1109/tse.2025.3618552
Yang Liu, Jingjing Gu, Jingxuan Zhang, Bao Wen, Yi Zhuang
{"title":"CEDAR: Silent Control Flow Error Detection via Heterogeneous Relation Learning","authors":"Yang Liu, Jingjing Gu, Jingxuan Zhang, Bao Wen, Yi Zhuang","doi":"10.1109/tse.2025.3618552","DOIUrl":"https://doi.org/10.1109/tse.2025.3618552","url":null,"abstract":"","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"11 1","pages":""},"PeriodicalIF":7.4,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145247014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Do Automated Fixes Truly Mitigate Smart Contract Exploits?
Pub Date: 2025-10-08 | DOI: 10.1109/TSE.2025.3618123
Sofia Bobadilla;Monica Jin;Martin Monperrus
Automated Program Repair (APR) for smart contract security promises to automatically mitigate smart contract vulnerabilities responsible for billions in financial losses. However, the true effectiveness of this research in addressing smart contract exploits remains uncharted territory. This paper bridges that critical gap by introducing a novel and systematic experimental framework for evaluating the exploit mitigation of program repair tools for smart contracts. We qualitatively and quantitatively analyze 20 state-of-the-art APR tools using a dataset of 143 vulnerable smart contracts, for which we manually craft 91 executable exploits. We are the first to define and measure the essential “exploit mitigation rate”, giving researchers and practitioners a real sense of effectiveness. Our findings reveal substantial disparities in the state of the art, with exploit mitigation rates ranging from a low of 29% to a high of 74%. Our study identifies systemic limitations, such as inconsistent functionality preservation, that must be addressed in future research on program repair for smart contracts.
{"title":"Do Automated Fixes Truly Mitigate Smart Contract Exploits?","authors":"Sofia Bobadilla;Monica Jin;Martin Monperrus","doi":"10.1109/TSE.2025.3618123","DOIUrl":"10.1109/TSE.2025.3618123","url":null,"abstract":"Automated Program Repair (APR) for smart contract security promises to automatically mitigate smart contract vulnerabilities responsible for billions in financial losses. However, the true effectiveness of this research in addressing smart contract exploits remains uncharted territory. This paper bridges this critical gap by introducing a novel and systematic experimental framework for evaluating exploit mitigation of program repair tools for smart contracts. We qualitatively and quantitatively analyze 20 state-of-the-art APR tools using a dataset of 143 vulnerable smart contracts, for which we manually craft 91 executable exploits. We are the very first to define and measure the essential “exploit mitigation rate”, giving researchers and practitioners a real sense of effectiveness. Our findings reveal substantial disparities in the state of the art, with an exploit mitigation rate ranging from a low of 29% to a high of 74%. Our study identifies systemic limitations, such as inconsistent functionality preservation, that must be addressed in future research on program repair for smart contracts.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 1","pages":"100-115"},"PeriodicalIF":5.6,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11197044","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145247015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}