基于模糊的基于最小种子输入集的语法学习

IF 1.7 3区 计算机科学 Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING Journal of Computer Languages Pub Date : 2023-11-19 DOI:10.1016/j.cola.2023.101252
Hannes Sochor , Flavio Ferrarotti , Daniela Kaufmann
{"title":"基于模糊的基于最小种子输入集的语法学习","authors":"Hannes Sochor ,&nbsp;Flavio Ferrarotti ,&nbsp;Daniela Kaufmann","doi":"10.1016/j.cola.2023.101252","DOIUrl":null,"url":null,"abstract":"<div><p><span>To be effective, a fuzzer needs to generate inputs that are well formed, so that they are not outright rejected by the Software Under Test (SUT) and can thus detect meaningful bugs. Grammar based fuzzers<span><span> solve this problem, but they obviously require a grammar of the input language accepted by the SUT. Many times such grammar is unknown. Therefore, different black- and white-box algorithms have been proposed for learning them from SUTs. Black-box algorithms rely only on membership queries, but need access to carefully crafted well formed inputs in order to obtain good results. White-box algorithms require access to the source code and generally produce grammars with higher precision and recall, but at the expense of working only for specific programming languages and libraries. We propose a new algorithm and show through extensive experimentation that it can learn grammars from recursive descendent parsers with consistently high levels of both, recall and precision. Notably, this result was obtained starting with a couple of arbitrary seed inputs and includes evaluations with sophisticated languages such as </span>Java Script Object Notation<span> (JSON). Different to other state of the art white-box approaches, our method does not require sophisticated program analysis techniques such as dynamic tainting or symbolic execution. In fact, the experiments confirm that our method performs extremely well with just a (standard) generic Abstract Syntax Tree (AST) of the </span></span></span>parsing program as input. The core of our method uses fuzzing techniques combined with fundamental theoretical results on grammar learning. Compared to other white-box approaches, ours is not tied to specific programming languages and tools, and thus can be easily ported. Regarding performance, we have shown that our algorithm works well in practice and that, under reasonable assumptions, its worst-case complexity is polynomial (with low exponents) w.r.t. time and space requirements.</p></div>","PeriodicalId":48552,"journal":{"name":"Journal of Computer Languages","volume":"78 ","pages":"Article 101252"},"PeriodicalIF":1.7000,"publicationDate":"2023-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fuzzing-based grammar learning from a minimal set of seed inputs\",\"authors\":\"Hannes Sochor ,&nbsp;Flavio Ferrarotti ,&nbsp;Daniela Kaufmann\",\"doi\":\"10.1016/j.cola.2023.101252\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p><span>To be effective, a fuzzer needs to generate inputs that are well formed, so that they are not outright rejected by the Software Under Test (SUT) and can thus detect meaningful bugs. Grammar based fuzzers<span><span> solve this problem, but they obviously require a grammar of the input language accepted by the SUT. Many times such grammar is unknown. Therefore, different black- and white-box algorithms have been proposed for learning them from SUTs. Black-box algorithms rely only on membership queries, but need access to carefully crafted well formed inputs in order to obtain good results. White-box algorithms require access to the source code and generally produce grammars with higher precision and recall, but at the expense of working only for specific programming languages and libraries. We propose a new algorithm and show through extensive experimentation that it can learn grammars from recursive descendent parsers with consistently high levels of both, recall and precision. Notably, this result was obtained starting with a couple of arbitrary seed inputs and includes evaluations with sophisticated languages such as </span>Java Script Object Notation<span> (JSON). Different to other state of the art white-box approaches, our method does not require sophisticated program analysis techniques such as dynamic tainting or symbolic execution. In fact, the experiments confirm that our method performs extremely well with just a (standard) generic Abstract Syntax Tree (AST) of the </span></span></span>parsing program as input. The core of our method uses fuzzing techniques combined with fundamental theoretical results on grammar learning. Compared to other white-box approaches, ours is not tied to specific programming languages and tools, and thus can be easily ported. Regarding performance, we have shown that our algorithm works well in practice and that, under reasonable assumptions, its worst-case complexity is polynomial (with low exponents) w.r.t. time and space requirements.</p></div>\",\"PeriodicalId\":48552,\"journal\":{\"name\":\"Journal of Computer Languages\",\"volume\":\"78 \",\"pages\":\"Article 101252\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2023-11-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Computer Languages\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S259011842300062X\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computer Languages","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S259011842300062X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0

摘要

为了有效,fuzzer需要生成格式良好的输入,这样它们就不会被被测试软件(SUT)直接拒绝,从而可以检测到有意义的错误。基于语法的fuzzers解决了这个问题,但它们显然需要SUT接受的输入语言的语法。很多时候这样的语法是未知的。因此,人们提出了不同的黑盒和白盒算法来从sut中学习它们。黑盒算法仅依赖于成员查询,但需要访问精心制作的格式良好的输入,以获得良好的结果。白盒算法需要访问源代码,并且通常产生具有更高精度和召回率的语法,但代价是只适用于特定的编程语言和库。我们提出了一种新的算法,并通过广泛的实验表明,它可以从递归的后代解析器中学习语法,并且具有一致的高水平,召回率和精度。值得注意的是,这个结果是从几个任意的种子输入开始获得的,并且包括使用Java Script Object Notation (JSON)等复杂语言的计算。与其他先进的白盒方法不同,我们的方法不需要复杂的程序分析技术,如动态污染或符号执行。事实上,实验证实,我们的方法仅使用解析程序的(标准的)通用抽象语法树(AST)作为输入就可以非常好地执行。我们的方法的核心是使用模糊技术结合语法学习的基本理论结果。与其他白盒方法相比,我们的方法不依赖于特定的编程语言和工具,因此可以很容易地移植。关于性能,我们已经证明我们的算法在实践中工作得很好,并且在合理的假设下,其最坏情况复杂度是多项式(低指数),而不是时间和空间要求。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Fuzzing-based grammar learning from a minimal set of seed inputs

To be effective, a fuzzer needs to generate inputs that are well formed, so that they are not outright rejected by the Software Under Test (SUT) and can thus detect meaningful bugs. Grammar based fuzzers solve this problem, but they obviously require a grammar of the input language accepted by the SUT. Many times such grammar is unknown. Therefore, different black- and white-box algorithms have been proposed for learning them from SUTs. Black-box algorithms rely only on membership queries, but need access to carefully crafted well formed inputs in order to obtain good results. White-box algorithms require access to the source code and generally produce grammars with higher precision and recall, but at the expense of working only for specific programming languages and libraries. We propose a new algorithm and show through extensive experimentation that it can learn grammars from recursive descendent parsers with consistently high levels of both, recall and precision. Notably, this result was obtained starting with a couple of arbitrary seed inputs and includes evaluations with sophisticated languages such as Java Script Object Notation (JSON). Different to other state of the art white-box approaches, our method does not require sophisticated program analysis techniques such as dynamic tainting or symbolic execution. In fact, the experiments confirm that our method performs extremely well with just a (standard) generic Abstract Syntax Tree (AST) of the parsing program as input. The core of our method uses fuzzing techniques combined with fundamental theoretical results on grammar learning. Compared to other white-box approaches, ours is not tied to specific programming languages and tools, and thus can be easily ported. Regarding performance, we have shown that our algorithm works well in practice and that, under reasonable assumptions, its worst-case complexity is polynomial (with low exponents) w.r.t. time and space requirements.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Journal of Computer Languages
Journal of Computer Languages Computer Science-Computer Networks and Communications
CiteScore
5.00
自引率
13.60%
发文量
36
期刊最新文献
Debugging in the Domain-Specific Modeling Languages for multi-agent systems GPotion: Embedding GPU programming in Elixir Near-Pruned single assignment transformation of programs MLAPW: A framework to assess the impact of feature selection and sampling techniques on anti-pattern prediction using WSDL metrics Editorial Board
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1