{"title":"Automata-based constraints for language model decoding","authors":"Terry Koo, Frederick Liu, Luheng He","doi":"arxiv-2407.08103","DOIUrl":null,"url":null,"abstract":"LMs are often expected to generate strings in some formal language; for\nexample, structured data, API calls, or code snippets. Although LMs can be\ntuned to improve their adherence to formal syntax, this does not guarantee\nconformance, especially with smaller LMs suitable for large-scale deployment.\nIn addition, tuning requires significant resources, making it impractical for\nuncommon or task-specific formats. To prevent downstream parsing errors we\nwould ideally constrain the LM to only produce valid output, but this is\nseverely complicated by tokenization, which is typically both ambiguous and\nmisaligned with the formal grammar. We solve these issues through the\napplication of automata theory, deriving an efficient closed-form solution for\nthe regular languages, a broad class of formal languages with many practical\napplications, including API calls or schema-guided JSON and YAML. We also\ndiscuss pragmatic extensions for coping with the issue of high branching\nfactor. Finally, we extend our techniques to deterministic context-free\nlanguages, which similarly admit an efficient closed-form solution. 
In spite of\nits flexibility and representative power, our approach only requires access to\nper-token decoding logits and lowers into simple calculations that are\nindependent of LM size, making it both efficient and easy to apply to almost\nany LM architecture.","PeriodicalId":501124,"journal":{"name":"arXiv - CS - Formal Languages and Automata Theory","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Formal Languages and Automata Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.08103","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
LMs are often expected to generate strings in some formal language; for example, structured data, API calls, or code snippets. Although LMs can be tuned to improve their adherence to formal syntax, this does not guarantee conformance, especially with smaller LMs suitable for large-scale deployment. In addition, tuning requires significant resources, making it impractical for uncommon or task-specific formats. To prevent downstream parsing errors we would ideally constrain the LM to produce only valid output, but this is severely complicated by tokenization, which is typically both ambiguous and misaligned with the formal grammar. We solve these issues through the application of automata theory, deriving an efficient closed-form solution for the regular languages, a broad class of formal languages with many practical applications, including API calls and schema-guided JSON and YAML. We also discuss pragmatic extensions for coping with the issue of high branching factor. Finally, we extend our techniques to deterministic context-free languages, which similarly admit an efficient closed-form solution. Despite its flexibility and representational power, our approach only requires access to per-token decoding logits and lowers into simple calculations that are independent of LM size, making it both efficient and easy to apply to almost any LM architecture.
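To illustrate the core idea of constraining decoding with per-token logits, the sketch below masks out vocabulary tokens that would drive a DFA into a dead state. This is a minimal toy illustration, not the paper's actual algorithm: the DFA, the vocabulary, and the `mask_logits` helper are all hypothetical, and the key detail it captures is that a multi-character token is admissible only if stepping the automaton through *every* character of the token keeps it alive (tokenization being misaligned with the grammar).

```python
import math

DEAD = -1  # sentinel for a dead (rejecting sink) state

class DFA:
    """A tiny deterministic finite automaton over characters."""
    def __init__(self, transitions, start, accepting):
        self.transitions = transitions  # dict: (state, char) -> state
        self.start = start
        self.accepting = accepting

    def step(self, state, text):
        """Advance through every character of a (possibly multi-char) token."""
        for ch in text:
            state = self.transitions.get((state, ch), DEAD)
            if state == DEAD:
                return DEAD
        return state

# Hypothetical DFA accepting the regular language (ab)+ :
# state 0 expects 'a', state 1 expects 'b', state 2 is accepting.
dfa = DFA(transitions={(0, 'a'): 1, (1, 'b'): 2, (2, 'a'): 1},
          start=0, accepting={2})

# Hypothetical tokenizer vocabulary: note tokens span multiple characters
# and do not align with the grammar's symbols.
vocab = ['a', 'b', 'ab', 'ba', 'abab', 'c']

def mask_logits(logits, state):
    """Set the logit of any token that would kill the DFA to -inf,
    so the LM can only sample tokens consistent with the language."""
    return [l if dfa.step(state, tok) != DEAD else -math.inf
            for l, tok in zip(logits, vocab)]

masked = mask_logits([0.0] * len(vocab), dfa.start)
allowed = [tok for tok, l in zip(vocab, masked) if l != -math.inf]
print(allowed)  # -> ['a', 'ab', 'abab']
```

Because the mask depends only on the automaton state and the vocabulary, not on the model itself, the per-step cost is independent of LM size, matching the abstract's claim that the method needs nothing beyond the decoding logits.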