Grammatical Inference and Machine Learning Approaches to Post-Hoc LangSec

2016 IEEE Security and Privacy Workshops (SPW). Pub Date: 2016-05-22. DOI: 10.1109/SPW.2016.26

Sheridan S. Curley, Richard E. Harang


Citation count: 2

Abstract

Formal Language Theory for Security (LangSec) applies the tools of theoretical computer science to the problem of protocol design and analysis. In practice, most results have focused on protocol design, showing that by restricting the complexity of protocols it is possible to design parsers with desirable and formally verifiable properties, such as correctness and equivalence. Many existing protocols, however, were not subjected to formal analysis during their design, and many are not implemented in a manner consistent with their formal documentation. Determining a grammar for such protocols is the first step in analyzing them, which places the problem in the domain of grammatical inference, for which a deep theoretical literature exists. In particular, although it has been shown that the higher-level categories of the Chomsky hierarchy cannot be generically learned, it is also known that certain subcategories of that hierarchy can be effectively learned. In this paper, we summarize some theoretical results for inferring well-known Chomsky grammars, with special attention to context-free grammars (CFGs) and the languages they generate (CFLs). We then demonstrate that, despite negative learnability results in the theoretical regime, long short-term memory (LSTM) networks, a type of recurrent neural network (RNN) architecture, can learn with high accuracy a grammar for the URIs that appear in the Apache HTTP access logs of a particular server. We discuss these results in the context of grammatical inference, and suggest avenues for further research into the learnability of a subgroup of the context-free grammars.
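
The claim that some subcategories of the Chomsky hierarchy can be learned from examples can be made concrete with a small sketch. The code below is not the LSTM approach described in the abstract; it swaps in a much simpler technique, inference of a k-testable (strictly local) language, a subclass of the regular languages known to be identifiable from positive examples alone. The function names, the choice k = 3, and the sample URIs are all hypothetical and serve only as illustration.

```python
from typing import Iterable, NamedTuple


class KTestableModel(NamedTuple):
    """Sets that define a k-testable language (illustrative names)."""
    k: int
    prefixes: frozenset  # allowed first k-1 characters
    suffixes: frozenset  # allowed last k-1 characters
    kgrams: frozenset    # allowed substrings of length k
    short: frozenset     # accepted strings shorter than k


def learn_k_testable(samples: Iterable[str], k: int) -> KTestableModel:
    """Infer a k-testable language from positive examples only."""
    if k < 2:
        raise ValueError("k must be at least 2")
    prefixes, suffixes, kgrams, short = set(), set(), set(), set()
    for s in samples:
        if len(s) < k:
            short.add(s)
        if len(s) >= k - 1:
            prefixes.add(s[:k - 1])
            suffixes.add(s[-(k - 1):])
        for i in range(len(s) - k + 1):
            kgrams.add(s[i:i + k])
    return KTestableModel(k, frozenset(prefixes), frozenset(suffixes),
                          frozenset(kgrams), frozenset(short))


def accepts(model: KTestableModel, s: str) -> bool:
    """Check membership of s in the inferred language."""
    k = model.k
    if len(s) < k:
        return s in model.short
    return (
        s[:k - 1] in model.prefixes
        and s[-(k - 1):] in model.suffixes
        and all(s[i:i + k] in model.kgrams for i in range(len(s) - k + 1))
    )


# Hypothetical URIs standing in for entries extracted from an access log.
training_uris = [
    "/index.html",
    "/img/logo.png",
    "/img/icon.png",
    "/docs/index.html",
]
model = learn_k_testable(training_uris, k=3)

print(accepts(model, "/img/index.html"))     # unseen, but locally consistent
print(accepts(model, "/img/../etc/passwd"))  # contains unseen 3-grams
```

Under these assumptions the learner generalizes to the unseen URI `/img/index.html`, because every length-3 window in it was observed in training, and rejects `/img/../etc/passwd`. A model this local cannot capture nested or long-range structure, which is one motivation for considering richer learners such as the recurrent networks discussed in the abstract.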

