Pub Date : 2025-10-20 DOI: 10.1109/tse.2025.3623625
Yanjie Jiang, Chenxu Li, Zixiao Zhao, Fu Fan, Lu Zhang, Hui Liu
{"title":"Evaluating and Improving GPT-Based Expansion of Abbreviations","authors":"Yanjie Jiang, Chenxu Li, Zixiao Zhao, Fu Fan, Lu Zhang, Hui Liu","doi":"10.1109/tse.2025.3623625","DOIUrl":"https://doi.org/10.1109/tse.2025.3623625","url":null,"abstract":"","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"6 1","pages":""},"PeriodicalIF":7.4,"publicationDate":"2025-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145397753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-10-17 DOI: 10.1109/TSE.2025.3607625
Simone Corbo;Luca Bancale;Valeria De Gennaro;Livia Lestingi;Vincenzo Scotti;Matteo Camilli
Language is a deep-rooted means of perpetrating stereotypes and discrimination. Large Language Models, now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The standard way to address this issue is to align the LLM, which, however, dampens the problem without constituting a definitive solution. Therefore, testing LLMs even after alignment efforts remains crucial for detecting any residual deviations from ethical standards. We present EvoTox, an automated testing framework for LLMs’ inclination to toxicity, which quantitatively assesses how far LLMs can be pushed towards toxic responses even in the presence of alignment. The framework adopts an iterative evolution strategy that exploits the interplay between two LLMs: the System Under Test (SUT) and the Prompt Generator, which steers SUT responses toward higher toxicity. The toxicity level is assessed by an automated oracle based on an existing toxicity classifier. We conduct a quantitative and qualitative empirical evaluation using five state-of-the-art LLMs of increasing complexity (7–671B parameters) as evaluation subjects. Our quantitative evaluation assesses the cost-effectiveness of four alternative versions of EvoTox against existing baseline methods based on random search, curated datasets of toxic prompts, and adversarial attacks. Our qualitative assessment engages human evaluators to rate the fluency of the generated prompts and the perceived toxicity of the responses collected during the testing sessions. Results indicate that the effectiveness, in terms of detected toxicity level, is significantly higher than that of the selected baseline methods (effect size up to 1.0 against random search and up to 0.99 against adversarial attacks). Furthermore, EvoTox incurs a limited cost overhead (from 22% to 35% on average). This work includes examples of toxic degeneration by LLMs, which some readers may consider profane or offensive. Reader discretion is advised.
{"title":"How Toxic Can You Get? Search-Based Toxicity Testing for Large Language Models","authors":"Simone Corbo;Luca Bancale;Valeria De Gennaro;Livia Lestingi;Vincenzo Scotti;Matteo Camilli","doi":"10.1109/TSE.2025.3607625","DOIUrl":"10.1109/TSE.2025.3607625","url":null,"abstract":"Language is a deep-rooted means of perpetration of stereotypes and discrimination. Large Language Models, now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The standard way to address this issue is to align the LLM, which, however, dampens the issue without constituting a definitive solution. Therefore, testing LLM even after alignment efforts remains crucial for detecting any residual deviations with respect to ethical standards. We present EvoTox, an automated testing framework for LLMs’ inclination to toxicity, providing a way to quantitatively assess how much LLMs can be pushed towards toxic responses even in the presence of alignment. The framework adopts an iterative evolution strategy that exploits the interplay between two LLMs, the System Under Test (SUT) and the Prompt Generator steering SUT responses toward higher toxicity. The toxicity level is assessed by an automated oracle based on an existing toxicity classifier. We conduct a quantitative and qualitative empirical evaluation using five state-of-the-art LLMs as evaluation subjects having increasing complexity (7–671B parameters). Our quantitative evaluation assesses the cost-effectiveness of four alternative versions of EvoTox against existing baseline methods, based on random search, curated datasets of toxic prompts, and adversarial attacks. Our qualitative assessment engages human evaluators to rate the fluency of the generated prompts and the perceived toxicity of the responses collected during the testing sessions. Results indicate that the effectiveness, in terms of detected toxicity level, is significantly higher than the selected baseline methods (effect size up to 1.0 against random search and up to 0.99 against adversarial attacks). Furthermore, EvoTox yields a limited cost overhead (from 22% to 35% on average). This work includes examples of toxic degeneration by LLMs, which may be considered profane or offensive to some readers. Reader discretion is advised.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 11","pages":"3056-3071"},"PeriodicalIF":5.6,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145310828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-10-17 DOI: 10.1109/TSE.2025.3605145
Shuo Yang;Jiachi Chen;Lei Xiao;Jinyuan Hu;Dan Lin;Jiajing Wu;Tao Zhang;Zibin Zheng
Recently, the increasing complexity of smart contracts and their interactions has led to more sophisticated strategies for executing attacks. Hackers often need to deploy attacker contracts as delegators to automate these attacks on their behalf. Existing identification methods for attacker contracts either rely on simple patterns (e.g., recursive callback control flow) that suffer from high false-positive rates and limited extraction of interaction and call information, or lack fully automated detection capabilities. Consequently, these limitations reduce the effectiveness of current solutions in identifying modern, intricate attacks. To overcome these challenges, we introduce the concept of state manipulation attacks, which abstracts the exploitation of problematic state dependencies arising from contract interactions. During these attacks, hackers first alter the storage state of one contract (the manipulated contract), which determines the profit they can gain. They then call another contract (the victim contract) to exploit its dependency on the altered state and maximize their profits. We present SMAsher, a tool designed to automatically identify state manipulation attacker contracts. SMAsher leverages fine-grained state-aware dataflow analysis to detect exploitation traces and exploited state dependencies among contracts, focusing on recovering the call path and interaction semantics. Our extensive experiments on 1.38 million real-world contracts demonstrate that SMAsher successfully identifies 311 state manipulation attacker contracts with 100% precision; the attacks they carried out caused $6.95 million in losses. Our findings also reveal some notable malicious characteristics of hackers’ accounts through their deployed attacker contracts. Additionally, we have provided 10 PoCs (Proof-of-Concepts) for previously unidentified attacks, all of which have been confirmed and released to the community.
Who Is Pulling the Strings: Unveiling Smart Contract State Manipulation Attacks Through State-Aware Dataflow Analysis. IEEE Transactions on Software Engineering, vol. 51, no. 10, pp. 2942-2956.
Pub Date : 2025-10-16 DOI: 10.1109/tse.2025.3622251
Xinyue Zuo, Yan Xiao, Xiaochun Cao, Wenya Wang, Jin Song Dong
{"title":"DT4LM: Differential Testing for Reliable Language Model Updates in Classification Tasks","authors":"Xinyue Zuo, Yan Xiao, Xiaochun Cao, Wenya Wang, Jin Song Dong","doi":"10.1109/tse.2025.3622251","DOIUrl":"https://doi.org/10.1109/tse.2025.3622251","url":null,"abstract":"","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"91 1","pages":""},"PeriodicalIF":7.4,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145310829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-10-14 DOI: 10.1109/tse.2025.3621462
Xiaoxue Ren, Chaoqun Dai, Qiao Huang, Ye Wang, Chao Liu, Bo Jiang
{"title":"Hydra-Reviewer: A holistic multi-agent system for automatic code review comment generation","authors":"Xiaoxue Ren, Chaoqun Dai, Qiao Huang, Ye Wang, Chao Liu, Bo Jiang","doi":"10.1109/tse.2025.3621462","DOIUrl":"https://doi.org/10.1109/tse.2025.3621462","url":null,"abstract":"","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"102 1","pages":""},"PeriodicalIF":7.4,"publicationDate":"2025-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145289299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-10-14 DOI: 10.1109/TSE.2025.3620670
Wenting Zhao;Wuxia Jin;Yiran Zhang;Ming Fan;Haijun Wang;Li Li;Yang Liu;Ting Liu
The architecture of software systems evolves along with their upgrades and maintenance, inevitably creating a gap between the de facto architecture and the designed one. To perceive and fix this discrepancy, clustering-based architecture recovery methods have been developed to re-engineer the as-implemented system architecture from the code. However, existing solutions still face several limitations: they underutilize both code-level and architecture-level semantics underlying the source code, and they overlook implicit structural dependencies that complement explicit ones in reflecting code interactions. To address these challenges, we propose SemArc, an architecture recovery method that utilizes large language models to comprehend both implementation-level and architecture-level semantics, supported by well-established canonical architectural patterns as a knowledge base. SemArc also incorporates both implicit and explicit dependencies to complete the system behavior representations. Additionally, SemArc introduces a component-as-anchor guided clustering algorithm to improve the clustering process. We evaluated SemArc on 15 software systems written in C/C++, Java, and Python, using five different metrics. The results demonstrate that SemArc outperforms seven baseline methods by an average of 32 percentage points. We also examined how three factors (code semantics, architectural semantics, and implicit dependencies) as well as different levels of architectural semantic descriptions influence recovery accuracy. A case study on the Bash project indicates that SemArc has the potential to yield even more precise recovery results than those labeled by humans.
{"title":"Software Architecture Recovery Augmented With Semantics","authors":"Wenting Zhao;Wuxia Jin;Yiran Zhang;Ming Fan;Haijun Wang;Li Li;Yang Liu;Ting Liu","doi":"10.1109/TSE.2025.3620670","DOIUrl":"10.1109/TSE.2025.3620670","url":null,"abstract":"The architecture of software systems evolves along with their upgrades and maintenance, inevitably creating a gap between the defact architecture and the designed one. To perceive and fix the discrepancy, clustering-based architecture recovery methods have been developed to re-engineer the real-time system architecture from the code implementation. However, existing solutions still face several limitations. They underutilize both code-level and architecture-level semantics underlying the source code. Moreover, they overlook implicit structural dependencies that complement explicit ones to reflect code interactions. To address these challenges, we propose SemArc, an architecture recovery method that utilizes large language models to comprehend both implementation-level and architecture-level semantics, supported by well-established canonical architectural patterns as a knowledge base. SemArc also incorporates both implicit and explicit dependencies to complete the system behavior representations. Additionally, SemArc introduces a component-as-anchor guided clustering algorithm to improve the clustering process. We evaluated SemArc on 15 software systems written in C/C++, Java, and Python, using five different metrics. The results demonstrate that SemArc outperforms seven baseline methods by an average of 32 percentage points. We also examined how three factors—code semantics, architectural semantics, and implicit dependencies—as well as different levels of architectural semantic descriptions, influence recovery accuracy. A case study on the Bash project indicates that SemArc has the potential to yield even more precise recovery results than those labeled by humans.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 1","pages":"338-359"},"PeriodicalIF":5.6,"publicationDate":"2025-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145289302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-10-10 DOI: 10.1109/TSE.2025.3619966
Shiwen Ou;Yuwei Li;Lu Yu;Chengkun Wei;Tingke Wen;Qiangpu Chen;Yu Chen;Haizhi Tang;Zulie Pan
Deep learning (DL) frameworks serve as the backbone for a wide range of artificial intelligence applications. However, bugs within DL frameworks can cascade into critical issues in higher-level applications, jeopardizing reliability and security. While numerous techniques have been proposed to detect bugs in DL frameworks, research exploring common API patterns across frameworks and the potential risks they entail remains limited. Notably, many DL frameworks expose similar APIs with overlapping input parameters and functionalities, rendering them vulnerable to shared bugs, where a flaw in one API may extend to analogous APIs in other frameworks. To address this challenge, we propose MirrorFuzz, an automated API fuzzing solution to discover shared bugs in DL frameworks. MirrorFuzz operates in three stages: First, MirrorFuzz collects historical bug data for each API within a DL framework to identify potentially buggy APIs. Second, it matches each buggy API in a specific framework with similar APIs within and across other DL frameworks. Third, it employs large language models (LLMs) to synthesize code for the API under test, leveraging the historical bug data of similar APIs to trigger analogous bugs across APIs. We implement MirrorFuzz and evaluate it on four popular DL frameworks (TensorFlow, PyTorch, OneFlow, and Jittor). Extensive evaluation demonstrates that MirrorFuzz improves code coverage by 39.92% and 98.20% compared to state-of-the-art methods on TensorFlow and PyTorch, respectively. Moreover, MirrorFuzz discovers 315 bugs, 262 of which are newly found, and 80 bugs are fixed, with 52 of these bugs assigned CNVD IDs.
{"title":"MirrorFuzz: Leveraging LLM and Shared Bugs for Deep Learning Framework APIs Fuzzing","authors":"Shiwen Ou;Yuwei Li;Lu Yu;Chengkun Wei;Tingke Wen;Qiangpu Chen;Yu Chen;Haizhi Tang;Zulie Pan","doi":"10.1109/TSE.2025.3619966","DOIUrl":"10.1109/TSE.2025.3619966","url":null,"abstract":"Deep learning (DL) frameworks serve as the backbone for a wide range of artificial intelligence applications. However, bugs within DL frameworks can cascade into critical issues in higher-level applications, jeopardizing reliability and security. While numerous techniques have been proposed to detect bugs in DL frameworks, research exploring common API patterns across frameworks and the potential risks they entail remains limited. Notably, many DL frameworks expose similar APIs with overlapping input parameters and functionalities, rendering them vulnerable to shared bugs, where a flaw in one API may extend to analogous APIs in other frameworks. To address this challenge, we propose MirrorFuzz, an automated API fuzzing solution to discover shared bugs in DL frameworks. MirrorFuzz operates in three stages: First, MirrorFuzz collects historical bug data for each API within a DL framework to identify potentially buggy APIs. Second, it matches each buggy API in a specific framework with similar APIs within and across other DL frameworks. Third, it employs large language models (LLMs) to synthesize code for the API under test, leveraging the historical bug data of similar APIs to trigger analogous bugs across APIs. We implement MirrorFuzz and evaluate it on four popular DL frameworks (TensorFlow, PyTorch, OneFlow, and Jittor). Extensive evaluation demonstrates that MirrorFuzz improves code coverage by 39.92% and 98.20% compared to state-of-the-art methods on TensorFlow and PyTorch, respectively. Moreover, MirrorFuzz discovers 315 bugs, 262 of which are newly found, and 80 bugs are fixed, with 52 of these bugs assigned CNVD IDs.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 1","pages":"360-375"},"PeriodicalIF":5.6,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11201027","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145260849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-10-09 DOI: 10.1109/TSE.2025.3618976
Yu Gao;Dong Wang;Wensheng Dou;Wenhan Feng;Yu Liang;Jun Wei
The last five years have seen a rise in model checking guided testing (MCGT) approaches for systematically testing distributed systems. MCGT approaches generate test cases for distributed systems by traversing their verified abstract state spaces, simultaneously solving the three key problems faced in testing distributed systems, i.e., test input generation, test oracle construction, and execution space enumeration. However, existing MCGT approaches struggle with traversing the huge state space of distributed systems, which can contain billions of system states. This makes the process of finding bugs time-consuming and expensive, often taking several weeks. In this paper, we propose Mosso to speed up model checking guided testing for distributed systems. We observe that the abstract state space of distributed systems contains many redundant test scenarios. Considering the characteristics of these redundant test scenarios, we propose three strategies (action independence, node symmetry, and scenario equivalence) to identify and prioritize unique test scenarios when traversing the state space. We have applied Mosso on three real-world distributed systems. By employing the three strategies, our approach achieves an average speedup of 56X (up to 208X) compared to the state-of-the-art MCGT approach. Additionally, our approach has uncovered two previously unknown bugs.
Efficiently Testing Distributed Systems via Abstract State Space Prioritization. IEEE Transactions on Software Engineering, vol. 52, no. 2, pp. 395-410.