Do Chase Your Tail! Missing Key Aspects Augmentation in Textual Vulnerability Descriptions of Long-Tail Software Through Feature Inference
Authors: Linyi Han; Shidong Pan; Zhenchang Xing; Jiamou Sun; Sofonias Yitagesu; Xiaowang Zhang; Zhiyong Feng
Journal: IEEE Transactions on Software Engineering, vol. 51, no. 2, pp. 466-483 (JCR Q1, Computer Science, Software Engineering; impact factor 5.6)
Published: 2024-12-27
DOI: 10.1109/TSE.2024.3523284
URL: https://ieeexplore.ieee.org/document/10817126/
Abstract
Augmenting missing key aspects in Textual Vulnerability Descriptions (TVDs) is crucial for effective vulnerability analysis. For instance, key aspects of a TVD include Attack Vector and Vulnerability Type, among others. These key aspects help security engineers understand and address the vulnerability in a timely manner. For software with a large user base (non-long-tail software), augmenting these missing key aspects has significantly advanced vulnerability analysis and software security research. However, software with a limited user base (long-tail software) is often overlooked because of inconsistent software names, limited TVD availability, and domain-specific jargon, all of which complicate vulnerability analysis and software repair. In this paper, we introduce a novel software feature inference framework designed to augment the missing key aspects of TVDs for long-tail software. First, we tackle the issue of non-standard software names found in community-maintained vulnerability databases by cross-referencing government databases with Common Vulnerabilities and Exposures (CVEs). Next, we employ Large Language Models (LLMs) to generate the missing key aspects. However, the limited availability of historical TVDs restricts the variety of examples. To overcome this limitation, we utilize the Common Weakness Enumeration (CWE) to classify all TVDs and select cluster centers as representative examples. To ensure accuracy, we present Natural Language Inference (NLI) models specifically designed for long-tail software. These models identify and eliminate incorrect responses. Additionally, we use a wiki repository to provide explanations for proprietary terms. Our evaluations demonstrate that our approach significantly improves the accuracy of augmenting missing key aspects of TVDs for long-tail software from 0.27 to 0.56 (+107%). Interestingly, the accuracy for non-long-tail software also increases from 64% to 71%. As a result, our approach can be useful in various downstream tasks that require complete TVD information.
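The step of classifying TVDs by CWE and picking cluster centers as representative examples can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how TVDs are embedded or how centers are computed, so the sketch assumes whitespace tokenization, Jaccard similarity, and a medoid (the description most similar on average to the others in its CWE group) as the "center". The function names and the sample descriptions are hypothetical.

```python
from collections import defaultdict


def tokens(text):
    """Lowercased whitespace tokens as a set (assumed stand-in for a real embedding)."""
    return frozenset(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_representatives(tvds):
    """Group (cwe_id, description) pairs by CWE and return one medoid per group.

    The medoid is the description with the highest total similarity to all
    descriptions in its group; ties resolve to the earliest description.
    """
    groups = defaultdict(list)
    for cwe_id, description in tvds:
        groups[cwe_id].append(description)

    representatives = {}
    for cwe_id, descriptions in groups.items():
        token_sets = [tokens(d) for d in descriptions]
        best_index = max(
            range(len(descriptions)),
            key=lambda i: sum(jaccard(token_sets[i], t) for t in token_sets),
        )
        representatives[cwe_id] = descriptions[best_index]
    return representatives


# Hypothetical sample data, invented for illustration only.
sample = [
    ("CWE-79", "xss in search page allows script injection"),
    ("CWE-79", "xss in login page allows script injection via name parameter"),
    ("CWE-79", "xss in login form via name parameter"),
    ("CWE-89", "sql injection in id parameter"),
]
reps = select_representatives(sample)
for cwe_id, description in sorted(reps.items()):
    print(cwe_id, "->", description)
```

The selected descriptions could then serve as few-shot examples in an LLM prompt, so that a small pool of historical TVDs still covers each weakness category once. A production version would likely replace the token-overlap similarity with sentence embeddings.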
About the journal:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.