基于模糊的基于最小种子输入集的语法学习

IF 1.8 3区计算机科学 Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING Journal of Computer Languages Pub Date : 2024-03-01 Epub Date: 2023-11-19 DOI:10.1016/j.cola.2023.101252

Hannes Sochor , Flavio Ferrarotti , Daniela Kaufmann

{"title":"基于模糊的基于最小种子输入集的语法学习","authors":"Hannes Sochor , Flavio Ferrarotti , Daniela Kaufmann","doi":"10.1016/j.cola.2023.101252","DOIUrl":null,"url":null,"abstract":"<div>To be effective, a fuzzer needs to generate inputs that are well formed, so that they are not outright rejected by the Software Under Test (SUT) and can thus detect meaningful bugs. Grammar based fuzzers solve this problem, but they obviously require a grammar of the input language accepted by the SUT. Many times such grammar is unknown. Therefore, different black- and white-box algorithms have been proposed for learning them from SUTs. Black-box algorithms rely only on membership queries, but need access to carefully crafted well formed inputs in order to obtain good results. White-box algorithms require access to the source code and generally produce grammars with higher precision and recall, but at the expense of working only for specific programming languages and libraries. We propose a new algorithm and show through extensive experimentation that it can learn grammars from recursive descendent parsers with consistently high levels of both, recall and precision. Notably, this result was obtained starting with a couple of arbitrary seed inputs and includes evaluations with sophisticated languages such as Java Script Object Notation (JSON). Different to other state of the art white-box approaches, our method does not require sophisticated program analysis techniques such as dynamic tainting or symbolic execution. In fact, the experiments confirm that our method performs extremely well with just a (standard) generic Abstract Syntax Tree (AST) of the parsing program as input. The core of our method uses fuzzing techniques combined with fundamental theoretical results on grammar learning. Compared to other white-box approaches, ours is not tied to specific programming languages and tools, and thus can be easily ported. Regarding performance, we have shown that our algorithm works well in practice and that, under reasonable assumptions, its worst-case complexity is polynomial (with low exponents) w.r.t. time and space requirements.</div>","PeriodicalId":48552,"journal":{"name":"Journal of Computer Languages","volume":"78 ","pages":"Article 101252"},"PeriodicalIF":1.8000,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fuzzing-based grammar learning from a minimal set of seed inputs\",\"authors\":\"Hannes Sochor , Flavio Ferrarotti , Daniela Kaufmann\",\"doi\":\"10.1016/j.cola.2023.101252\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>To be effective, a fuzzer needs to generate inputs that are well formed, so that they are not outright rejected by the Software Under Test (SUT) and can thus detect meaningful bugs. Grammar based fuzzers solve this problem, but they obviously require a grammar of the input language accepted by the SUT. Many times such grammar is unknown. Therefore, different black- and white-box algorithms have been proposed for learning them from SUTs. Black-box algorithms rely only on membership queries, but need access to carefully crafted well formed inputs in order to obtain good results. White-box algorithms require access to the source code and generally produce grammars with higher precision and recall, but at the expense of working only for specific programming languages and libraries. We propose a new algorithm and show through extensive experimentation that it can learn grammars from recursive descendent parsers with consistently high levels of both, recall and precision. Notably, this result was obtained starting with a couple of arbitrary seed inputs and includes evaluations with sophisticated languages such as Java Script Object Notation (JSON). Different to other state of the art white-box approaches, our method does not require sophisticated program analysis techniques such as dynamic tainting or symbolic execution. In fact, the experiments confirm that our method performs extremely well with just a (standard) generic Abstract Syntax Tree (AST) of the parsing program as input. The core of our method uses fuzzing techniques combined with fundamental theoretical results on grammar learning. Compared to other white-box approaches, ours is not tied to specific programming languages and tools, and thus can be easily ported. Regarding performance, we have shown that our algorithm works well in practice and that, under reasonable assumptions, its worst-case complexity is polynomial (with low exponents) w.r.t. time and space requirements.</div>\",\"PeriodicalId\":48552,\"journal\":{\"name\":\"Journal of Computer Languages\",\"volume\":\"78 \",\"pages\":\"Article 101252\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2024-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Computer Languages\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S259011842300062X\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2023/11/19 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computer Languages","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S259011842300062X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2023/11/19 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

摘要

为了有效，fuzzer需要生成格式良好的输入，这样它们就不会被被测试软件(SUT)直接拒绝，从而可以检测到有意义的错误。基于语法的fuzzers解决了这个问题，但它们显然需要SUT接受的输入语言的语法。很多时候这样的语法是未知的。因此，人们提出了不同的黑盒和白盒算法来从sut中学习它们。黑盒算法仅依赖于成员查询，但需要访问精心制作的格式良好的输入，以获得良好的结果。白盒算法需要访问源代码，并且通常产生具有更高精度和召回率的语法，但代价是只适用于特定的编程语言和库。我们提出了一种新的算法，并通过广泛的实验表明，它可以从递归的后代解析器中学习语法，并且具有一致的高水平，召回率和精度。值得注意的是，这个结果是从几个任意的种子输入开始获得的，并且包括使用Java Script Object Notation (JSON)等复杂语言的计算。与其他先进的白盒方法不同，我们的方法不需要复杂的程序分析技术，如动态污染或符号执行。事实上，实验证实，我们的方法仅使用解析程序的(标准的)通用抽象语法树(AST)作为输入就可以非常好地执行。我们的方法的核心是使用模糊技术结合语法学习的基本理论结果。与其他白盒方法相比，我们的方法不依赖于特定的编程语言和工具，因此可以很容易地移植。关于性能，我们已经证明我们的算法在实践中工作得很好，并且在合理的假设下，其最坏情况复杂度是多项式(低指数)，而不是时间和空间要求。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Fuzzing-based grammar learning from a minimal set of seed inputs

To be effective, a fuzzer needs to generate inputs that are well formed, so that they are not outright rejected by the Software Under Test (SUT) and can thus detect meaningful bugs. Grammar based fuzzers solve this problem, but they obviously require a grammar of the input language accepted by the SUT. Many times such grammar is unknown. Therefore, different black- and white-box algorithms have been proposed for learning them from SUTs. Black-box algorithms rely only on membership queries, but need access to carefully crafted well formed inputs in order to obtain good results. White-box algorithms require access to the source code and generally produce grammars with higher precision and recall, but at the expense of working only for specific programming languages and libraries. We propose a new algorithm and show through extensive experimentation that it can learn grammars from recursive descendent parsers with consistently high levels of both, recall and precision. Notably, this result was obtained starting with a couple of arbitrary seed inputs and includes evaluations with sophisticated languages such as Java Script Object Notation (JSON). Different to other state of the art white-box approaches, our method does not require sophisticated program analysis techniques such as dynamic tainting or symbolic execution. In fact, the experiments confirm that our method performs extremely well with just a (standard) generic Abstract Syntax Tree (AST) of the parsing program as input. The core of our method uses fuzzing techniques combined with fundamental theoretical results on grammar learning. Compared to other white-box approaches, ours is not tied to specific programming languages and tools, and thus can be easily ported. Regarding performance, we have shown that our algorithm works well in practice and that, under reasonable assumptions, its worst-case complexity is polynomial (with low exponents) w.r.t. time and space requirements.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Computer Languages Computer Science-Computer Networks and Communications

CiteScore

5.00

自引率

13.60%

发文量