RepoTransBench: A Real-World Multilingual Benchmark for Repository-Level Code Translation
Yanli Wang;Yanlin Wang;Suiquan Wang;Daya Guo;Jiachi Chen;John Grundy;Xilin Liu;Yuchi Ma;Mingzhi Mao;Hongyu Zhang;Zibin Zheng
Pub Date: 2025-12-17 | DOI: 10.1109/TSE.2025.3645056 | IEEE Transactions on Software Engineering, vol. 52, no. 2, pp. 675-690
Repository-level code translation is the task of translating an entire code repository from one programming language to another while preserving the functionality of the source repository. Many benchmarks have been proposed to evaluate code translators, but most provide fine-grained samples, focusing on snippet-, function-, or file-level translation. Such benchmarks do not accurately reflect real-world demands, where entire repositories often need to be translated, involving longer code and more complex functionality. To address this gap, we propose RepoTransBench, a real-world multilingual repository-level code translation benchmark featuring 1,897 real-world repository samples across 13 language pairs, each with an automatically executable test suite. In addition, we introduce RepoTransAgent, a general agent framework for repository-level code translation. We evaluate both the benchmark's difficulty and the agent's effectiveness using several methods and backbone LLMs, and find that repository-level translation remains challenging: the best-performing method achieves only a 32.8% success rate. Our analysis further shows that difficulty varies significantly with the direction of the language pair; dynamic-to-static translation is far harder than the reverse (below 10% success vs. 45-63% for static-to-dynamic). Finally, we conduct a detailed error analysis and highlight current LLMs' deficiencies in repository-level code translation, which can guide further improvements. We provide the code and data at https://github.com/DeepSoftwareAnalytics/RepoTransBench.
Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality
Roham Koohestani;Philippe de Bekker;Begüm Koç;Maliheh Izadi
Pub Date: 2025-12-16 | DOI: 10.1109/TSE.2025.3644183 | IEEE Transactions on Software Engineering, vol. 52, no. 2, pp. 651-674
Benchmarks are essential for unified evaluation and reproducibility. The rapid rise of Artificial Intelligence for Software Engineering (AI4SE) has produced numerous benchmarks for tasks such as code generation and bug repair. However, this proliferation has led to major challenges: (1) fragmented knowledge across tasks, (2) difficulty in selecting contextually relevant benchmarks, (3) lack of standardization in benchmark creation, and (4) flaws that limit utility. Addressing these challenges requires a dual approach: systematically mapping existing benchmarks for informed selection and defining unified guidelines for robust, adaptable benchmark development. We review 247 studies, identifying 273 AI4SE benchmarks published since 2014. We categorize them, analyze their limitations, and expose gaps in current practices. Building on these insights, we introduce BenchScout, an extensible semantic search tool for locating suitable benchmarks. BenchScout applies automated clustering to contextual embeddings of benchmark-related studies, followed by dimensionality reduction. In a user study with 22 participants, BenchScout achieved usability, effectiveness, and intuitiveness scores of 4.5, 4.0, and 4.1 out of 5. To raise benchmarking standards, we propose BenchFrame, a unified approach for improving benchmark quality. Applying BenchFrame to HumanEval yielded HumanEvalNext, which features corrected errors, improved language conversion, higher test coverage, and greater difficulty. Evaluating 10 state-of-the-art code models shows that average pass@1 on HumanEvalNext drops by 31.22% and 19.94% relative to HumanEval and HumanEvalPlus, respectively, underscoring the need for continuous benchmark refinement. We further examine BenchFrame's scalability through an agentic pipeline and confirm its generalizability on the MBPP dataset. Lastly, we publicly release the material of our review, user study, and the enhanced benchmark at https://github.com/AISE-TUDelft/AI4SE-benchmarks.
{"title":"Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality","authors":"Roham Koohestani;Philippe de Bekker;Begüm Koç;Maliheh Izadi","doi":"10.1109/TSE.2025.3644183","DOIUrl":"10.1109/TSE.2025.3644183","url":null,"abstract":"Benchmarks are essential for unified evaluation and reproducibility. The rapid rise of Artificial Intelligence for Software Engineering (AI4SE) has produced numerous benchmarks for tasks such as code generation and bug repair. However, this proliferation has led to major challenges: (1) fragmented knowledge across tasks, (2) difficulty in selecting contextually relevant benchmarks, (3) lack of standardization in benchmark creation, and (4) flaws that limit utility. Addressing these requires a dual approach: systematically mapping existing benchmarks for informed selection and defining unified guidelines for robust, adaptable benchmark development. We conduct a review of 247 studies, identifying 273 AI4SE benchmarks since 2014. We categorize them, analyze limitations, and expose gaps in current practices. Building on these insights, we introduce BenchScout, an extensible semantic search tool for locating suitable benchmarks. BenchScout employs automated clustering with contextual embeddings of benchmark-related studies, followed by dimensionality reduction. In a user study with 22 participants, BenchScout achieved usability, effectiveness, and intuitiveness scores of 4.5, 4.0, and 4.1 out of 5. To improve benchmarking standards, we propose BenchFrame, a unified approach to improve benchmark quality. Applying BenchFrame to HumanEval yielded HumanEvalNext, which features corrected errors, improved language conversion, higher test coverage, and greater difficulty. Evaluating 10 state-of-the-art code models on HumanEval, HumanEvalPlus, and HumanEvalNext revealed average pass-at-1 drops of 31.22% and 19.94%, respectively, underscoring the need for continuous benchmark refinement. We further examine BenchFrame’s scalability through an agentic pipeline and confirm its generalizability on the MBPP dataset. Lastly, we publicly release the material of our review, user study, and the enhanced benchmark.<xref><sup>1</sup></xref><fn><label><sup>1</sup></label><p><uri>https://github.com/AISE-TUDelft/AI4SE-benchmarks</uri></p></fn>","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 2","pages":"651-674"},"PeriodicalIF":5.6,"publicationDate":"2025-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145770793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AdaCoder: An Adaptive Planning and Multi-Agent Framework for Function-Level Code Generation
Yueheng Zhu;Chao Liu;Xuan He;Xiaoxue Ren;Zhongxin Liu;Ruwei Pan;Hongyu Zhang
Pub Date: 2025-12-12 | DOI: 10.1109/TSE.2025.3642621 | IEEE Transactions on Software Engineering, vol. 52, no. 2, pp. 631-650
Recently, researchers have proposed many multi-agent frameworks for function-level code generation, which aim to improve software development productivity by automatically generating function-level source code from task descriptions. A typical multi-agent framework consists of Large Language Model (LLM)-based agents responsible for task planning, code generation, testing, debugging, and so on. Studies have shown that existing multi-agent code generation frameworks perform well with ChatGPT; however, their generalizability across other foundation LLMs has not been systematically explored. In this paper, we report an empirical study on the generalizability of four state-of-the-art multi-agent code generation frameworks across 12 open-source LLMs with varying code generation and instruction-following capabilities. Our study reveals that existing frameworks generalize unstably across diverse foundation LLMs. Based on these findings, we propose AdaCoder, a novel adaptive-planning, multi-agent framework for function-level code generation. AdaCoder has two phases. Phase-1 is an initial code generation step without planning: an LLM-based coding agent and a script-based testing agent exploit the LLM's native capability, identify the cases that exceed it, and determine the errors hindering execution. Phase-2 adds a rule-based debugging agent and an LLM-based planning agent for iterative code generation with planning. Our evaluation shows that AdaCoder achieves higher generalizability across diverse LLMs. Compared to the best baseline, MapCoder, AdaCoder is on average 27.69% higher in Pass@1, 16 times faster in inference, and consumes 12 times fewer tokens.
State of the Journal
Sebastian Uchitel
Pub Date: 2025-12-11 | DOI: 10.1109/tse.2025.3639694 | IEEE Transactions on Software Engineering
C2SaferRust: Transforming C Projects Into Safer Rust With NeuroSymbolic Techniques
Vikram Nitin;Rahul Krishna;Luiz Lemos do Valle;Baishakhi Ray
Pub Date: 2025-12-09 | DOI: 10.1109/TSE.2025.3641486 | IEEE Transactions on Software Engineering, vol. 52, no. 2, pp. 618-630
In recent years, there has been considerable interest in converting C code to Rust to benefit from Rust's memory and thread safety guarantees. C2Rust is a rule-based system that can automatically convert C code to functionally identical Rust, but the Rust it produces is non-idiomatic: it makes extensive use of unsafe Rust, a subset of the language that does not carry those memory or thread safety guarantees. At the other end of the spectrum are LLMs, which produce idiomatic Rust but can make mistakes and are constrained in the length of code they can process. In this paper, we present C2SaferRust, a novel approach for translating C to Rust that combines the strengths of C2Rust and LLMs. We first use C2Rust to convert C code to non-idiomatic, unsafe Rust. We then decompose the unsafe Rust code into slices that can be individually translated to safer Rust by an LLM. After processing each slice, we run end-to-end test cases to verify that the code still functions as expected. We also contribute a benchmark of 7 real-world programs translated from C to unsafe Rust using C2Rust, each with end-to-end test cases. On this benchmark, we reduce the number of raw pointers by up to 38% and the amount of unsafe code by up to 28%, indicating an increase in safety, while the resulting programs still pass all test cases. C2SaferRust also shows convincing gains over two previous techniques for making Rust code safer.
{"title":"C2SaferRust: Transforming C Projects Into Safer Rust With NeuroSymbolic Techniques","authors":"Vikram Nitin;Rahul Krishna;Luiz Lemos do Valle;Baishakhi Ray","doi":"10.1109/TSE.2025.3641486","DOIUrl":"10.1109/TSE.2025.3641486","url":null,"abstract":"In recent years, there has been a lot of interest in converting C code to Rust, to benefit from the memory and thread safety guarantees of Rust. C2Rust is a rule-based system that can automatically convert C code to functionally identical Rust, but the Rust code that it produces is non-idiomatic, i.e., makes extensive use of unsafe Rust, a subset of the language that <italic>doesn’t</i> have memory or thread safety guarantees. At the other end of the spectrum are LLMs, which produce idiomatic Rust code, but these have the potential to make mistakes and are constrained in the length of code they can process. In this paper, we present <sc>C2SaferRust</small>, a novel approach to translate C to Rust that combines the strengths of C2Rust and LLMs. We first use C2Rust to convert C code to non-idiomatic, unsafe Rust. We then decompose the unsafe Rust code into slices that can be individually translated to safer Rust by an LLM. After processing each slice, we run end-to-end test cases to verify that the code still functions as expected. We also contribute a benchmark of 7 real-world programs, translated from C to unsafe Rust using C2Rust. Each of these programs also comes with end-to-end test cases. On this benchmark, we are able to reduce the number of raw pointers by up to 38%, and reduce the amount of unsafe code by up to 28%, indicating an increase in safety. The resulting programs still pass all test cases. <sc>C2SaferRust</small> also shows convincing gains in performance against two previous techniques for making Rust code safer.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 2","pages":"618-630"},"PeriodicalIF":5.6,"publicationDate":"2025-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145717968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reaching Software Quality for Bioinformatics Applications: How Far Are We?
Xiaoyan Zhu;Tianxiang Xu;Xin Lai;Xin Lian;Hangyu Cheng;Jiayin Wang
Pub Date: 2025-12-08 | DOI: 10.1109/TSE.2025.3641225 | IEEE Transactions on Software Engineering, vol. 52, no. 2, pp. 595-617
With the rapid advancement of medicine, biology, and information technology, their deep integration has given rise to the emerging field of bioinformatics. High-throughput technologies such as genomics, transcriptomics, and proteomics generate massive volumes of biological data, and extracting their biological significance relies heavily on bioinformatics software for analysis and processing. Ensuring the quality of bioinformatics software and avoiding errors or hidden defects is therefore crucial for both scientific research and clinical applications. However, to date, no dedicated study has systematically analyzed the quality of bioinformatics software. We conduct a comprehensive empirical study that aggregates, synthesizes, and analyzes findings from 167 bioinformatics software projects. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol, we extract and evaluate quality-related data to answer our research questions (RQs). Our analysis reveals several key findings. The quality of bioinformatics software requires significant improvement, with an average defect density approximately 11.8× higher than that of general-purpose software. Moreover, unlike in traditional software domains, a considerable proportion of defects in bioinformatics software are related to annotations. These issues can lead developers to overlook potential security vulnerabilities or to make incorrect fixes, increasing the cost and complexity of subsequent code maintenance. Based on these findings, we discuss the challenges faced by bioinformatics software and propose potential solutions. This paper lays a foundation for further research on software quality in the bioinformatics domain and offers actionable insights for researchers and practitioners alike.
{"title":"Reaching Software Quality for Bioinformatics Applications: How Far Are We?","authors":"Xiaoyan Zhu;Tianxiang Xu;Xin Lai;Xin Lian;Hangyu Cheng;Jiayin Wang","doi":"10.1109/TSE.2025.3641225","DOIUrl":"10.1109/TSE.2025.3641225","url":null,"abstract":"With the rapid advancements in medicine, biology, and information technology, their deep integration has given rise to the emerging field of bioinformatics. In this process, high-throughput technologies such as genomics, transcriptomics, and proteomics have generated massive volumes of biological data. The biological significance of these data heavily relies on bioinformatics software for analysis and processing. Therefore, it is crucial for both scientific research and clinical applications to ensure the quality of bioinformatics software and avoiding errors or hidden defects. However, to date, no dedicated study has systematically analyzed the quality of bioinformatics software. We conduct a comprehensive empirical study that aggregates, synthesizes, and analyzes findings from 167 bioinformatics software projects. Following the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) protocol, we extract and evaluate quality-related data to answer our research questions (RQs). Our analysis reveals several key findings. The quality of bioinformatics software requires significant improvement, with an average defect density approximately 11.8× higher than that of general-purpose software. Additionally, unlike traditional software domains, a considerable proportion of defects in bioinformatics software are related to annotations. These issues can lead developers to overlook potential security vulnerabilities or make incorrect fixes, thereby increasing the cost and complexity of subsequent code maintenance. Based on these findings, we further discuss the challenges faced by bioinformatics software and propose potential solutions. This paper lays a foundation for further research on software quality in the bioinformatics domain and offers actionable insights for researchers and practitioners alike.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 2","pages":"595-617"},"PeriodicalIF":5.6,"publicationDate":"2025-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145704004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}