General parsing with regular expression matching

Journal of Computer Languages. IF 1.8; Zone 3 (Computer Science); JCR Q3 (Computer Science, Software Engineering). Pub Date: 2023-01-01. DOI: 10.1016/j.cola.2022.101176

Angelo Borsotti, Luca Breveglieri, Stefano Crespi Reghizzi, Angelo Morzenti

Journal of Computer Languages, volume 74, Article 101176. https://www.sciencedirect.com/science/article/pii/S2590118422000739

Citations: 1

Abstract

The context-free grammars extended with regular expressions (RE), known as ECF or EBNF grammars, are commonly used as they often allow for terser language definitions. Yet for such grammars the notion of syntax tree, and consequently of ambiguity, lacks an agreed definition. The simplified tree structures returned by the existing parsing algorithms do not faithfully represent all the ways a sentence is derivable by means of the REs present in the grammar rules. We contribute a precise definition of the regular parts in the structures of the EBNF syntax trees, which is aligned with the tree representations that have been adopted by the recent algorithms for RE matching. For an EBNF rule, the finest representation shows all the RE operators as nodes in the sub-tree, while the flat representation simply appends the string generated by the RE to the nonterminal node. A consequent notion of representation-dependent ambiguity follows. The above representations are incorporated into a Tomita-style parsing algorithm, i.e., a GLR(1) parser. To construct such a parser, we follow the positional method of the Berry–Sethi parsers for regular languages. Given an EBNF grammar, our parser-generator produces a parser compliant with the choices expressed by the user about the representation of the syntax trees. We also report the parsing performance, over a few representative sets of languages (benchmarks) for programming and data representation, in comparison with existing parsers for EBNF grammars.
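The difference between the finest and the flat representation, and the resulting representation-dependent ambiguity, can be illustrated with a small sketch. This is not the paper's GLR(1) construction: it is a brute-force enumerator, and the tuple encoding of REs and trees, the function names `finest_trees` and `flat_trees`, and the example RE `(a | a a)*` are all assumptions made for illustration only.

```python
# RE encoding (an assumption of this sketch, not the paper's notation):
#   ("chr", c)        a single terminal character c
#   ("alt", r0, r1)   union        r0 | r1
#   ("cat", r0, r1)   concatenation r0 r1
#   ("star", r)       iteration    r*

def finest_trees(re, s):
    """Enumerate every tree deriving s from re, one node per RE operator."""
    kind = re[0]
    if kind == "chr":
        return [re[1]] if s == re[1] else []
    if kind == "alt":
        # Record the chosen branch index so the two branches stay distinct.
        return [("alt", i, t)
                for i, r in enumerate(re[1:])
                for t in finest_trees(r, s)]
    if kind == "cat":
        return [("cat", left, right)
                for k in range(len(s) + 1)
                for left in finest_trees(re[1], s[:k])
                for right in finest_trees(re[2], s[k:])]
    if kind == "star":
        if s == "":
            return [("star",)]
        trees = []
        # Each iteration must consume at least one character, which keeps
        # the enumeration finite even if the star body is nullable.
        for k in range(1, len(s) + 1):
            for head in finest_trees(re[1], s[:k]):
                for rest in finest_trees(re, s[k:]):
                    trees.append(("star", head) + rest[1:])
        return trees
    raise ValueError(f"unknown RE node: {kind}")


def flat_trees(re, s):
    """Flat view: the matched characters hang directly under one node."""
    return [tuple(s)] if finest_trees(re, s) else []


# Example RE: (a | a a)*
A = ("chr", "a")
RE = ("star", ("alt", A, ("cat", A, A)))

for tree in finest_trees(RE, "aaa"):
    print(tree)
print(len(finest_trees(RE, "aaa")), "finest tree(s),",
      len(flat_trees(RE, "aaa")), "flat tree(s)")
```

Under these assumptions the string `aaa` has three finest trees (the iterations split it as `a·a·a`, `a·aa`, or `aa·a`) but only one flat tree, so the same RE is ambiguous in one representation and unambiguous in the other.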


Source journal: Journal of Computer Languages (Computer Science: Computer Networks and Communications). CiteScore: 5.00. Self-citation rate: 13.60%.