General parsing with regular expression matching

Journal of Computer Languages (IF 1.7; Zone 3, Computer Science; JCR Q3, Computer Science, Software Engineering). Pub Date: 2023-01-01. DOI: 10.1016/j.cola.2022.101176
Angelo Borsotti, Luca Breveglieri, Stefano Crespi Reghizzi, Angelo Morzenti
Journal of Computer Languages, volume 74, Article 101176. https://www.sciencedirect.com/science/article/pii/S2590118422000739
Citations: 1

Abstract

Context-free grammars extended with regular expressions (REs), known as ECF or EBNF grammars, are commonly used because they often allow terser language definitions. Yet for such grammars the notion of syntax tree, and consequently of ambiguity, lacks an agreed definition. The simplified tree structures returned by existing parsing algorithms do not faithfully represent all the ways a sentence can be derived by means of the REs present in the grammar rules. We contribute a precise definition of the regular parts of EBNF syntax trees, aligned with the tree representations adopted by recent algorithms for RE matching. For an EBNF rule, the finest representation shows all the RE operators as nodes in the subtree, while the flat representation simply appends the string generated by the RE to the nonterminal node. A notion of representation-dependent ambiguity follows. These representations are incorporated into a Tomita-style parsing algorithm, i.e., a GLR(1) parser. To construct such a parser, we follow the positional method of the Berry–Sethi parsers for regular languages. Given an EBNF grammar, our parser generator produces a parser that complies with the user's choices about the representation of the syntax trees. We also report parsing performance on a few representative benchmark sets of programming and data-representation languages, in comparison with existing parsers for EBNF grammars.
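
The difference between the finest and the flat representation, and why ambiguity depends on the choice, can be illustrated on the regular part of a single rule. The following is a minimal sketch and not the authors' algorithm: the classes `Sym`, `Cat`, `Alt`, `Star`, the tuple encoding of trees, and the functions `trees`, `finest_trees`, and `flat` are all hypothetical names introduced here. It enumerates by brute force every operator-level tree of an RE over a string, then collapses each tree to its sequence of leaves.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple, Union

# Hypothetical RE classes, introduced only for this illustration.
@dataclass(frozen=True)
class Sym:
    ch: str

@dataclass(frozen=True)
class Cat:
    left: "RE"
    right: "RE"

@dataclass(frozen=True)
class Alt:
    left: "RE"
    right: "RE"

@dataclass(frozen=True)
class Star:
    body: "RE"

RE = Union[Sym, Cat, Alt, Star]
Tree = Union[str, tuple]  # leaf = character, inner node = (tag, ...)

def trees(r: RE, s: str, i: int) -> Iterator[Tuple[Tree, int]]:
    """Yield (tree, end) for every way r matches s starting at index i."""
    if isinstance(r, Sym):
        if i < len(s) and s[i] == r.ch:
            yield r.ch, i + 1
    elif isinstance(r, Cat):
        for lt, j in trees(r.left, s, i):
            for rt, k in trees(r.right, s, j):
                yield ("cat", lt, rt), k
    elif isinstance(r, Alt):
        for branch, sub in (("L", r.left), ("R", r.right)):
            for t, j in trees(sub, s, i):
                yield ("alt", branch, t), j
    elif isinstance(r, Star):
        yield ("star",), i  # zero iterations
        for t, j in trees(r.body, s, i):
            if j > i:  # assumption: skip empty iterations so the enumeration stays finite
                for rest, k in trees(r, s, j):
                    yield ("star", t) + rest[1:], k

def finest_trees(r: RE, s: str) -> list:
    """All operator-level trees of r that cover the whole string s."""
    return [t for t, end in trees(r, s, 0) if end == len(s)]

def flat(tree: Tree) -> Tuple[str, ...]:
    """Collapse a tree to the sequence of its leaves, dropping operator nodes."""
    if isinstance(tree, str):
        return (tree,)
    if tree[0] == "alt":
        return flat(tree[2])  # tree[1] is only the branch label
    out: Tuple[str, ...] = ()
    for child in tree[1:]:  # "cat" and "star" nodes
        out += flat(child)
    return out

a, b = Sym("a"), Sym("b")
rule_body = Star(Alt(a, Alt(b, Cat(a, b))))  # (a | b | ab)*
finest = finest_trees(rule_body, "ab")
flats = {flat(t) for t in finest}
print(len(finest), "finest trees;", len(flats), "flat child sequence(s)")
```

For the input `ab`, the RE `(a | b | ab)*` admits two operator-level trees (two iterations choosing `a` then `b`, or one iteration choosing `ab`), yet both collapse to the same leaf sequence. Under the assumptions of this sketch, the rule body is therefore ambiguous in the finest representation and unambiguous in the flat one.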
