Pub Date: 2024-12-24; DOI: 10.1109/tse.2024.3522038
Leslie Lamport
A Retrospective of Proving the Correctness of Multiprocess Programs. IEEE Transactions on Software Engineering.
Pub Date: 2024-12-23; DOI: 10.1109/tse.2024.3521306
Jeff Kramer
Reflections of a Former Editor-in-Chief of TSE. IEEE Transactions on Software Engineering.
Pub Date: 2024-12-16; DOI: 10.1109/TSE.2024.3519159
Ishrak Hayet; Adam Scott; Marcelo d'Amorim
Test oracle generation is an important and challenging problem. Neural-based solutions have recently been proposed for oracle generation, but they are still inaccurate. For example, the accuracy of the state-of-the-art technique teco is only 27.5% on its dataset of 3,540 test cases. We propose ChatAssert, a prompt engineering framework designed for oracle generation that uses dynamic and static information to iteratively refine prompts for querying large language models (LLMs). ChatAssert uses code summaries and examples to assist an LLM in generating candidate test oracles, uses a lightweight static analysis to assist the LLM in repairing generated oracles that fail to compile, and uses dynamic information obtained from test runs to help the LLM repair oracles that compile but do not pass. Experimental results on an independent, publicly available dataset show that ChatAssert improves on the state-of-the-art technique, teco, on key evaluation metrics. For example, it improves Acc@1 by 15%. Overall, the results provide initial yet strong evidence that using external tools in the formulation of prompts is an important aid in LLM-based oracle generation.
ChatAssert: LLM-Based Test Oracle Generation With External Tools Assistance. IEEE Transactions on Software Engineering, vol. 51, no. 1, pp. 305-319.
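The ChatAssert abstract above describes an iterative loop: generate a candidate oracle, repair it with static information if it fails to compile, and repair it with dynamic information from a test run if it compiles but does not pass. The following is a minimal Python sketch of that control flow only. It is not the authors' implementation: `refine_oracle`, `Feedback`, the prompt wording, the round limit, and the stub LLM and checker are all hypothetical names and choices introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Feedback:
    """Result of checking one candidate oracle (hypothetical structure)."""
    compiles: bool
    passes: bool
    message: str = ""


def refine_oracle(
    ask_llm: Callable[[str], str],
    check: Callable[[str], Feedback],
    base_prompt: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Query the LLM, feeding compile or test feedback back into the prompt."""
    prompt = base_prompt
    for _ in range(max_rounds):
        candidate = ask_llm(prompt)
        fb = check(candidate)
        if fb.compiles and fb.passes:
            return candidate
        # Static feedback for compile failures, dynamic feedback otherwise.
        kind = "compile error" if not fb.compiles else "test failure"
        prompt = (
            f"{base_prompt}\nPrevious attempt: {candidate}\n"
            f"{kind}: {fb.message}\nRepair the assertion."
        )
    return None  # no passing oracle within the round budget


# Stubs standing in for a real LLM and a real compile-and-run step (assumptions).
def fake_llm(prompt: str) -> str:
    if "test failure" in prompt:
        return "assertEquals(4, add(2, 2));"
    if "compile error" in prompt:
        return "assertEquals(5, add(2, 2));"
    return "assertEquals(4, add(2, 2))"  # first attempt lacks a semicolon


def fake_check(candidate: str) -> Feedback:
    if not candidate.endswith(";"):
        return Feedback(False, False, "';' expected")
    if "4," not in candidate:
        return Feedback(True, False, "expected 5 but was 4")
    return Feedback(True, True)


result = refine_oracle(fake_llm, fake_check, "Write a JUnit assertion for add(2, 2).")
print(result)
```

With these stubs, the first candidate fails to compile, the second compiles but fails the test, and the third is accepted. In a real setting, `check` would wrap the static analysis and the test run, and `ask_llm` would call an actual model.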
Crowdsourced testing has gained prominence in software testing because it effectively addresses the fragmentation problem in mobile app testing. The inherent openness of crowdsourced testing brings diversity to the testing outcome, but it also leaves app developers with a large number of test reports to inspect. To help developers find the bugs in crowdsourced test reports as early as possible, crowdsourced test report prioritization has emerged as an effective technique that establishes a systematic, optimized order for inspecting reports. Crowdsourced test reports consist of app screenshots and textual descriptions, yet current prioritization approaches rely mostly on the textual descriptions; some add vectorized image features at the image-as-a-whole level or the widget level. These approaches still fail to characterize the distinctive features of crowdsourced test reports precisely. As for prioritization strategy, prevailing approaches simply combine features using weighted coefficients without adequately considering their semantics, which may lead to biased and ineffective outcomes. In this paper, we propose EncrePrior