Automated Identification of Libraries from Vulnerability Data: Can We Do Better?

2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC) Pub Date: 2022-05-01 DOI: 10.1145/3524610.3527893

S. A. Haryono, Hong Jin Kang, Abhishek Sharma, Asankhaya Sharma, A. Santosa, Angela Yi, D. Lo


Citations: 4

Abstract

Software engineers depend heavily on software libraries and have to update their dependencies once vulnerabilities are found in them. Software Composition Analysis (SCA) helps developers identify vulnerable libraries used by an application. A key challenge is the identification of libraries related to a given reported vulnerability in the National Vulnerability Database (NVD), which may not explicitly indicate the affected libraries. Recently, researchers have tried to address the problem of identifying the libraries from an NVD report by treating it as an extreme multi-label learning (XML) problem, characterized by its large number of possible labels and severe data sparsity. As input, the NVD report is provided, and as output, a set of relevant libraries is returned. In this work, we evaluated multiple XML techniques. While previous work only evaluated a traditional XML technique, FastXML, we trained four other traditional XML models (DiSMEC, Parabel, Bonsai, ExtremeText) as well as two deep learning-based models (XML-CNN and LightXML). We compared both their effectiveness and the time cost of training and using the models for predictions. We find that, other than DiSMEC and XML-CNN, recent XML models outperform the FastXML model by 3%-10% in terms of F1-scores on Top-k (k=1,2,3) predictions. Furthermore, we observe significant improvements in both the training and prediction time of these XML models, with the Bonsai and Parabel models achieving 627x and 589x faster training time, respectively, and 12x faster prediction time than the FastXML baseline. We discuss the implications of our experimental results and highlight limitations for future work to address.
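To make the Top-k F1 metric mentioned in the abstract concrete, here is a minimal Python sketch. The abstract does not give the exact definition used in the paper, so this assumes the common formulation: per-report precision@k and recall@k combined by a harmonic mean, then averaged over reports. The function names (`f1_at_k`, `mean_f1_at_k`) and the library labels (`lib-a`, `lib-b`, `lib-c`) are hypothetical and used only for illustration.

```python
from typing import Sequence, Set


def f1_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """F1 over the top-k predicted library labels for a single NVD report.

    Assumed definition (not taken from the paper):
    precision@k = hits / k, recall@k = hits / |relevant|.
    """
    top_k = list(ranked)[:k]
    hits = sum(1 for label in top_k if label in relevant)
    if hits == 0:
        # No correct label in the top k (also covers an empty ground truth).
        return 0.0
    precision = hits / k
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def mean_f1_at_k(
    all_ranked: Sequence[Sequence[str]],
    all_relevant: Sequence[Set[str]],
    k: int,
) -> float:
    """Average the per-report F1@k over a collection of reports."""
    scores = [f1_at_k(r, t, k) for r, t in zip(all_ranked, all_relevant)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # One report whose true libraries are lib-a and lib-c (hypothetical labels).
    ranked = ["lib-a", "lib-b", "lib-c"]
    relevant = {"lib-a", "lib-c"}
    for k in (1, 2, 3):
        print(k, round(f1_at_k(ranked, relevant, k), 3))
```

Under this assumed definition, a model is rewarded both for placing correct libraries near the top of its ranking and for covering all libraries affected by the vulnerability.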
