Automata-based constraints for language model decoding

Terry Koo, Frederick Liu, Luheng He
arXiv - CS - Formal Languages and Automata Theory, 2024-07-11
DOI: arxiv-2407.08103 (https://doi.org/arxiv-2407.08103)

Abstract

LMs are often expected to generate strings in some formal language; for example, structured data, API calls, or code snippets. Although LMs can be tuned to improve their adherence to formal syntax, this does not guarantee conformance, especially with smaller LMs suitable for large-scale deployment. In addition, tuning requires significant resources, making it impractical for uncommon or task-specific formats. To prevent downstream parsing errors we would ideally constrain the LM to only produce valid output, but this is severely complicated by tokenization, which is typically both ambiguous and misaligned with the formal grammar. We solve these issues through the application of automata theory, deriving an efficient closed-form solution for the regular languages, a broad class of formal languages with many practical applications, including API calls or schema-guided JSON and YAML. We also discuss pragmatic extensions for coping with the issue of high branching factor. Finally, we extend our techniques to deterministic context-free languages, which similarly admit an efficient closed-form solution. In spite of its flexibility and representative power, our approach only requires access to per-token decoding logits and lowers into simple calculations that are independent of LM size, making it both efficient and easy to apply to almost any LM architecture.
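
To make the tokenization issue concrete, the following is a minimal sketch of automaton-constrained decoding, not the construction derived in the paper. It assumes a hypothetical character-level DFA for the regular language a+b, a hypothetical toy vocabulary whose tokens span several characters, and a caller-supplied `logits_fn` standing in for the LM. At every step it naively walks each token through the DFA and masks the logits of tokens that would leave the language; the paper instead derives a closed-form solution, which this sketch does not attempt to reproduce.

```python
import math

# Hypothetical character-level DFA for the regular language a+b
# (one or more 'a' followed by a single 'b').
# States: 0 = start, 1 = seen at least one 'a', 2 = accepting.
# Every state here can still reach the accepting state, so no
# separate dead-end check is needed in this toy example.
DFA = {
    (0, "a"): 1,
    (1, "a"): 1,
    (1, "b"): 2,
}
START, ACCEPTING = 0, {2}

# Hypothetical toy vocabulary. Tokens are misaligned with the grammar:
# "ab" and the sequence "a", "b" spell the same string.
VOCAB = ["a", "b", "aa", "ab", "ba", "<eos>"]


def walk(state, token):
    """Run the DFA over the characters of one token.

    Returns the resulting state, or None if the token is rejected.
    """
    for ch in token:
        state = DFA.get((state, ch))
        if state is None:
            return None
    return state


def mask_logits(state, logits):
    """Replace the logit of every disallowed token with -inf."""
    masked = []
    for token, logit in zip(VOCAB, logits):
        if token == "<eos>":
            ok = state in ACCEPTING  # may only stop on a complete string
        else:
            ok = walk(state, token) is not None
        masked.append(logit if ok else -math.inf)
    return masked


def constrained_greedy_decode(logits_fn, max_steps=10):
    """Greedy decoding that only ever emits tokens the DFA permits."""
    state, output = START, []
    for _ in range(max_steps):
        masked = mask_logits(state, logits_fn(output))
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        if VOCAB[best] == "<eos>":
            break
        output.append(VOCAB[best])
        state = walk(state, VOCAB[best])
    return output, state in ACCEPTING
```

For example, calling `constrained_greedy_decode(lambda prefix: [0.1, 2.0, 0.5, 1.0, 3.0, 1.5])` simulates a model that most prefers the invalid token "ba"; the mask steers it to "ab" and then permits only "<eos>". The sketch touches nothing but per-token logits, which matches the access the abstract says the approach requires, although the per-step scan over the whole vocabulary is a simplification chosen for brevity.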