{"title":"Automata-based constraints for language model decoding","authors":"Terry Koo, Frederick Liu, Luheng He","doi":"arxiv-2407.08103","DOIUrl":null,"url":null,"abstract":"LMs are often expected to generate strings in some formal language; for\nexample, structured data, API calls, or code snippets. Although LMs can be\ntuned to improve their adherence to formal syntax, this does not guarantee\nconformance, especially with smaller LMs suitable for large-scale deployment.\nIn addition, tuning requires significant resources, making it impractical for\nuncommon or task-specific formats. To prevent downstream parsing errors we\nwould ideally constrain the LM to only produce valid output, but this is\nseverely complicated by tokenization, which is typically both ambiguous and\nmisaligned with the formal grammar. We solve these issues through the\napplication of automata theory, deriving an efficient closed-form solution for\nthe regular languages, a broad class of formal languages with many practical\napplications, including API calls or schema-guided JSON and YAML. We also\ndiscuss pragmatic extensions for coping with the issue of high branching\nfactor. Finally, we extend our techniques to deterministic context-free\nlanguages, which similarly admit an efficient closed-form solution. 
In spite of\nits flexibility and representative power, our approach only requires access to\nper-token decoding logits and lowers into simple calculations that are\nindependent of LM size, making it both efficient and easy to apply to almost\nany LM architecture.","PeriodicalId":501124,"journal":{"name":"arXiv - CS - Formal Languages and Automata Theory","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Formal Languages and Automata Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.08103","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
LMs are often expected to generate strings in some formal language; for example, structured data, API calls, or code snippets. Although LMs can be tuned to improve their adherence to formal syntax, this does not guarantee conformance, especially with smaller LMs suitable for large-scale deployment. In addition, tuning requires significant resources, making it impractical for uncommon or task-specific formats. To prevent downstream parsing errors we would ideally constrain the LM to produce only valid output, but this is severely complicated by tokenization, which is typically both ambiguous and misaligned with the formal grammar. We solve these issues through the application of automata theory, deriving an efficient closed-form solution for the regular languages, a broad class of formal languages with many practical applications, including API calls and schema-guided JSON and YAML. We also discuss pragmatic extensions for coping with the issue of high branching factor. Finally, we extend our techniques to deterministic context-free languages, which similarly admit an efficient closed-form solution. Despite its flexibility and representational power, our approach only requires access to per-token decoding logits and lowers into simple calculations that are independent of LM size, making it both efficient and easy to apply to almost any LM architecture.
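To illustrate the core idea of constraining decoding with per-token logits, the sketch below masks out vocabulary tokens that would drive a DFA into a dead state. This is a minimal toy illustration, not the paper's actual algorithm: the DFA, the vocabulary, and the `mask_logits` helper are all hypothetical, and the key detail it captures is that a multi-character token is admissible only if stepping the automaton through *every* character of the token keeps it alive (tokenization being misaligned with the grammar).

```python
import math

DEAD = -1  # sentinel for a dead (rejecting sink) state

class DFA:
    """A tiny deterministic finite automaton over characters."""
    def __init__(self, transitions, start, accepting):
        self.transitions = transitions  # dict: (state, char) -> state
        self.start = start
        self.accepting = accepting

    def step(self, state, text):
        """Advance through every character of a (possibly multi-char) token."""
        for ch in text:
            state = self.transitions.get((state, ch), DEAD)
            if state == DEAD:
                return DEAD
        return state

# Hypothetical DFA accepting the regular language (ab)+ :
# state 0 expects 'a', state 1 expects 'b', state 2 is accepting.
dfa = DFA(transitions={(0, 'a'): 1, (1, 'b'): 2, (2, 'a'): 1},
          start=0, accepting={2})

# Hypothetical tokenizer vocabulary: note tokens span multiple characters
# and do not align with the grammar's symbols.
vocab = ['a', 'b', 'ab', 'ba', 'abab', 'c']

def mask_logits(logits, state):
    """Set the logit of any token that would kill the DFA to -inf,
    so the LM can only sample tokens consistent with the language."""
    return [l if dfa.step(state, tok) != DEAD else -math.inf
            for l, tok in zip(logits, vocab)]

masked = mask_logits([0.0] * len(vocab), dfa.start)
allowed = [tok for tok, l in zip(vocab, masked) if l != -math.inf]
print(allowed)  # -> ['a', 'ab', 'abab']
```

Because the mask depends only on the automaton state and the vocabulary, not on the model itself, the per-step cost is independent of LM size, matching the abstract's claim that the method needs nothing beyond the decoding logits.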