Context models in the MDL framework
E. Ristad, Robert G. Thomas
Proceedings DCC '95 Data Compression Conference, 1995-03-28
DOI: 10.1109/DCC.1995.515496 (https://doi.org/10.1109/DCC.1995.515496)
Current approaches to speech and handwriting recognition demand a strong language model with a small number of states and an even smaller number of parameters. We introduce four new techniques for statistical language models: multicontextual modeling, nonmonotonic contexts, implicit context growth, and the divergence heuristic. Together these techniques yield language models that have few states, even fewer parameters, and low message entropies. For example, our techniques achieve a message entropy of 2.16 bits/char on the Brown corpus using only 19374 contexts and 54621 parameters. Multicontextual modeling and nonmonotonic contexts are generalizations of the traditional context model. Implicit context growth ensures that the state transition probabilities of a variable-length Markov process are estimated accurately. This technique is generally applicable to any variable-length Markov process whose state transition probabilities are estimated from string frequencies. In our case, each state in the Markov process represents a context, and implicit context growth conditions the shorter contexts on the fact that the longer contexts did not occur. In a traditional unicontext model, this technique reduces the message entropy of typical English text by 0.1 bits/char. The divergence heuristic is a heuristic estimation algorithm based on Rissanen's (1978, 1983) minimum description length (MDL) principle and universal data compression algorithm.
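
The idea of conditioning shorter contexts on the absence of longer ones can be illustrated with a small sketch. This is one plausible reading of the abstract, not the authors' algorithm: the function names (`longest_context`, `count_naive`, `count_conditioned`, `bits_per_char`), the toy text, the hand-picked context set, and the add-one smoothing are all assumptions made for illustration. Naive counting credits a context with every occurrence of its string; conditioned counting credits it only when no longer context in the model also matches, which is the situation in which that state is actually used.

```python
import math
from collections import Counter, defaultdict


def longest_context(history, contexts):
    """Return the longest member of `contexts` that is a suffix of `history`."""
    best = ""
    for c in contexts:
        if history.endswith(c) and len(c) > len(best):
            best = c
    return best


def count_naive(text, contexts):
    """Count next symbols after every occurrence of each context string."""
    counts = defaultdict(Counter)
    for i, sym in enumerate(text):
        for c in contexts:
            if text[:i].endswith(c):
                counts[c][sym] += 1
    return counts


def count_conditioned(text, contexts):
    """Count next symbols only where the context is the longest match,
    i.e. where no longer context in the model occurred."""
    counts = defaultdict(Counter)
    for i, sym in enumerate(text):
        counts[longest_context(text[:i], contexts)][sym] += 1
    return counts


def bits_per_char(text, contexts, counts, alphabet):
    """Average code length of `text` in bits/char, predicting each symbol
    from its longest matching context with add-one smoothing (an assumption)."""
    total = 0.0
    for i, sym in enumerate(text):
        c = longest_context(text[:i], contexts)
        n = sum(counts[c].values())
        p = (counts[c][sym] + 1) / (n + len(alphabet))
        total -= math.log2(p)
    return total / len(text)


# Toy data: hypothetical text and context set, chosen only for illustration.
text = "abracadabra" * 20
contexts = {"", "a", "ra"}
alphabet = sorted(set(text))

naive = count_naive(text, contexts)
conditioned = count_conditioned(text, contexts)
print(f"naive counts:       {bits_per_char(text, contexts, naive, alphabet):.3f} bits/char")
print(f"conditioned counts: {bits_per_char(text, contexts, conditioned, alphabet):.3f} bits/char")
```

In this sketch the counts for the context `a` under conditioned counting omit every position where `ra` also matched, so the statistics of each state are drawn only from the positions where that state would be selected during coding. The toy corpus is scored on the same text used for counting, so the printed figures say nothing about the entropies reported in the abstract.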