Multitask-Based Evaluation of Open-Source LLM on Software Vulnerability
Xin Yin; Chao Ni; Shaohua Wang
Pub Date: 2024-10-07 | DOI: 10.1109/TSE.2024.3470333 | IEEE Transactions on Software Engineering, vol. 50, no. 11, pp. 3071-3087
This paper proposes a pipeline for quantitatively evaluating interactive Large Language Models (LLMs) using publicly available datasets. We carry out an extensive technical evaluation of LLMs on the Big-Vul dataset, covering four common software vulnerability tasks, and use it to assess the multi-tasking capabilities of LLMs. We find that existing state-of-the-art approaches and pre-trained Language Models (LMs) are generally superior to LLMs in software vulnerability detection. However, in software vulnerability assessment and location, certain LLMs (e.g., CodeLlama and WizardCoder) demonstrate superior performance to pre-trained LMs, and providing more contextual information can enhance the vulnerability assessment capabilities of LLMs. Moreover, LLMs exhibit strong vulnerability description capabilities, but their tendency to produce excessive output significantly weakens their performance compared to pre-trained LMs. Overall, although LLMs perform well in some respects, they still need to improve their understanding of the subtle differences between code vulnerabilities and their ability to describe vulnerabilities in order to fully realize their potential. Our evaluation pipeline provides valuable insights into the capabilities of LLMs in handling software vulnerabilities.
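A minimal sketch of what one step of such an evaluation could look like for the detection task is shown below: a yes/no prompt is built from a code snippet, the model's free-form answer is mapped to a binary label, and predictions are scored with F1. The prompt wording, the `query_llm` stub, and the sample records are illustrative assumptions, not the paper's actual prompts, models, or data; analogous prompts and task-specific metrics would be needed for the assessment, location, and description tasks.

```python
# Illustrative sketch of one vulnerability-detection evaluation step.
# The prompt template, query_llm stub, and sample records are assumptions
# for illustration only; they are not the paper's prompts or the Big-Vul data.
from sklearn.metrics import f1_score

PROMPT = (
    "You are a security expert. Answer only YES or NO.\n"
    "Is the following C function vulnerable?\n\n{code}\n"
)

def query_llm(prompt: str) -> str:
    # Placeholder for a call to an open-source LLM such as CodeLlama.
    return "NO"

def parse_label(response: str) -> int:
    # Map free-form model output to a binary label (1 = vulnerable).
    return 1 if "YES" in response.upper() else 0

# Tiny hypothetical sample standing in for Big-Vul records.
samples = [
    {"code": "int f(char *s){char b[8];strcpy(b,s);return 0;}", "label": 1},
    {"code": "int g(int a,int b){return a+b;}", "label": 0},
]

preds = [parse_label(query_llm(PROMPT.format(code=s["code"]))) for s in samples]
print("F1:", f1_score([s["label"] for s in samples], preds, zero_division=0))
```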
{"title":"Multitask-Based Evaluation of Open-Source LLM on Software Vulnerability","authors":"Xin Yin;Chao Ni;Shaohua Wang","doi":"10.1109/TSE.2024.3470333","DOIUrl":"10.1109/TSE.2024.3470333","url":null,"abstract":"This paper proposes a pipeline for quantitatively evaluating interactive Large Language Models (LLMs) using publicly available datasets. We carry out an extensive technical evaluation of LLMs using Big-Vul covering four different common software vulnerability tasks. This evaluation assesses the multi-tasking capabilities of LLMs based on this dataset. We find that the existing state-of-the-art approaches and pre-trained Language Models (LMs) are generally superior to LLMs in software vulnerability detection. However, in software vulnerability assessment and location, certain LLMs (e.g., CodeLlama and WizardCoder) have demonstrated superior performance compared to pre-trained LMs, and providing more contextual information can enhance the vulnerability assessment capabilities of LLMs. Moreover, LLMs exhibit strong vulnerability description capabilities, but their tendency to produce excessive output significantly weakens their performance compared to pre-trained LMs. Overall, though LLMs perform well in some aspects, they still need improvement in understanding the subtle differences in code vulnerabilities and the ability to describe vulnerabilities to fully realize their potential. Our evaluation pipeline provides valuable insights into the capabilities of LLMs in handling software vulnerabilities.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 11","pages":"3071-3087"},"PeriodicalIF":6.5,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142384407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qualitative Surveys in Software Engineering Research: Definition, Critical Review, and Guidelines
Jorge Melegati; Kieran Conboy; Daniel Graziotin
Pub Date: 2024-10-04 | DOI: 10.1109/TSE.2024.3474173 | IEEE Transactions on Software Engineering, vol. 50, no. 12, pp. 3172-3187 | PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10705351
Qualitative surveys are emerging as a popular research method in software engineering (SE), particularly as many aspects of the field are increasingly socio-technical and thus concerned with the subtle, social, and often ambiguous issues that are not amenable to a simple quantitative survey. While many argue that qualitative surveys play a vital role amongst the diverse range of methods employed in SE, a number of shortcomings inhibit their use and value. First, there is a lack of clarity as to what defines a qualitative survey and what features differentiate it from other methods. There is an absence of a clear set of principles and guidelines for its execution, and what does exist is very inconsistent and sometimes contradictory. These issues undermine the perceived reliability and rigour of the method. Researchers are unsure how to ensure reliability and rigour when designing qualitative surveys, and reviewers are unsure how such studies should be evaluated. In this paper, we present a systematic mapping study to identify how qualitative surveys have been employed in SE research to date. We then propose a set of principles, based on a multidisciplinary review of qualitative surveys, that captures some of the commonalities of the diffuse approaches found. Researchers can use these principles to decide whether to conduct a qualitative survey and, if so, to design their study; editors and reviewers can use them to judge the quality and rigour of qualitative surveys. We hope this will lead to more widespread use of the method and to more effective, evidence-based reviews of studies that employ it in the future.
{"title":"Qualitative Surveys in Software Engineering Research: Definition, Critical Review, and Guidelines","authors":"Jorge Melegati;Kieran Conboy;Daniel Graziotin","doi":"10.1109/TSE.2024.3474173","DOIUrl":"10.1109/TSE.2024.3474173","url":null,"abstract":"Qualitative surveys are emerging as a popular research method in software engineering (SE), particularly as many aspects of the field are increasingly socio-technical and thus concerned with the subtle, social, and often ambiguous issues that are not amenable to a simple quantitative survey. While many argue that qualitative surveys play a vital role amongst the diverse range of methods employed in SE there are a number of shortcomings that inhibits its use and value. First there is a lack of clarity as to what defines a qualitative survey and what features differentiate it from other methods. There is an absence of a clear set of principles and guidelines for its execution, and what does exist is very inconsistent and sometimes contradictory. These issues undermine the perceived reliability and rigour of this method. Researchers are unsure about how to ensure reliability and rigour when designing qualitative surveys and reviewers are unsure how these should be evaluated. In this paper, we present a systematic mapping study to identify how qualitative surveys have been employed in SE research to date. This paper proposes a set of principles, based on a multidisciplinary review of qualitative surveys and capturing some of the commonalities of the diffuse approaches found. These principles can be used by researchers when choosing whether to do a qualitative survey or not. They can then be used to design their study. The principles can also be used by editors and reviewers to judge the quality and rigour of qualitative surveys. It is hoped that this will result in more widespread use of the method and also more effective and evidence-based reviews of studies that use these methods in the future.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 12","pages":"3172-3187"},"PeriodicalIF":6.5,"publicationDate":"2024-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10705351","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142377310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair
Sakina Fatima; Hadi Hemmati; Lionel C. Briand
Pub Date: 2024-10-02 | DOI: 10.1109/TSE.2024.3472476 | IEEE Transactions on Software Engineering, vol. 50, no. 12, pp. 3146-3171 | PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704582
Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test, causing confusion and wasting development effort. While machine learning models have been used to predict flakiness and its root causes, there is much less work on providing support to fix the problem. To address this gap, in this paper, we focus on predicting the type of fix that is required to remove flakiness and then repairing the test code on that basis. We do this for the subset of flaky tests whose root cause lies in the test itself rather than in the production code. One key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category. Thus, we first propose a framework that automatically generates labeled datasets for 13 fix categories and trains models to predict the fix category of a flaky test by analyzing the test code only. Our experimental results using code models and few-shot learning show that we can correctly predict most of the fix categories. To show the usefulness of such fix category labels for automatically repairing flakiness, we augment the prompts of GPT-3.5 Turbo, a Large Language Model (LLM), with this extra knowledge to request repair suggestions. The results show that our suggested fix category labels, complemented with in-context learning, significantly enhance the capability of GPT-3.5 Turbo in generating fixes for flaky tests. Based on the execution and analysis of a sample of GPT-repaired flaky tests, we estimate that a large percentage of such repairs (roughly between 51% and 83%) can be expected to pass. For the failing repaired tests, on average, 16% of the test code needs to be further changed for them to pass.
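To illustrate the prompt-augmentation idea, the sketch below prepends a predicted fix-category label to a repair request sent to GPT-3.5 Turbo through the OpenAI chat API. The category wording, the prompt text, and the flaky test are hypothetical placeholders, not the authors' FlakyFix implementation.

```python
# Sketch of augmenting a repair prompt with a predicted fix-category label.
# The category label, prompt wording, and flaky test below are hypothetical;
# this is not the FlakyFix pipeline itself. Requires OPENAI_API_KEY to be set.
from openai import OpenAI

client = OpenAI()

flaky_test = """
def test_fetch_status():
    resp = start_async_fetch("https://example.org")
    assert resp.status == 200   # may run before the fetch completes
"""

predicted_fix_category = "add or adjust an explicit wait for an async call"

messages = [
    {"role": "system", "content": "You repair flaky unit tests."},
    {
        "role": "user",
        "content": (
            f"The following test is flaky. A classifier predicts the fix "
            f"category: '{predicted_fix_category}'.\n"
            f"Rewrite the test so it passes deterministically.\n{flaky_test}"
        ),
    },
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```

Placing the category label directly in the user message mirrors the in-context-learning setup described above, where the predicted fix category serves as extra knowledge that narrows the space of plausible repairs.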
{"title":"FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair","authors":"Sakina Fatima;Hadi Hemmati;Lionel C. Briand","doi":"10.1109/TSE.2024.3472476","DOIUrl":"10.1109/TSE.2024.3472476","url":null,"abstract":"Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test, causing confusion and wasting development effort. While machine learning models have been used to predict flakiness and its root causes, there is much less work on providing support to fix the problem. To address this gap, in this paper, we focus on predicting the type of fix that is required to remove flakiness and then repair the test code on that basis. We do this for a subset of flaky tests where the root cause of flakiness is in the test itself and not in the production code. One key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category. Thus, we first propose a framework that automatically generates labeled datasets for 13 fix categories and trains models to predict the fix category of a flaky test by analyzing the test code only. Our experimental results using code models and few-shot learning show that we can correctly predict most of the fix categories. To show the usefulness of such fix category labels for automatically repairing flakiness, we augment the prompts of GPT 3.5 Turbo, a Large Language Model (LLM), with such extra knowledge to request repair suggestions. The results show that our suggested fix category labels, complemented with in-context learning, significantly enhance the capability of GPT 3.5 Turbo in generating fixes for flaky tests. Based on the execution and analysis of a sample of GPT-repaired flaky tests, we estimate that a large percentage of such repairs, (roughly between 51% and 83%) can be expected to pass. For the failing repaired tests, on average, 16% of the test code needs to be further changed for them to pass.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 12","pages":"3146-3171"},"PeriodicalIF":6.5,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704582","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142368849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-30 | DOI: 10.1109/TSE.2024.3469582
Rongqi Pan; Taher A. Ghaleb; Lionel C. Briand
Test suites tend to grow when software evolves, often making it infeasible to execute all test cases within the allocated testing budget, especially for large software systems. Test suite minimization (TSM) is employed to improve the efficiency of software testing by removing redundant test cases, thus reducing testing time and resources while maintaining the fault detection capability of the test suite. Most existing TSM approaches rely on code coverage (white-box) or model-based features, which are not always available to test engineers. Recent TSM approaches that rely only on test code (black-box) have been proposed, such as ATM and FAST-R. The former yields higher fault detection rates (FDR