String matching with stopper compression
J. Rautio, Jani Tanninen, J. Tarhio
Proceedings DCC 2002. Data Compression Conference, 2002-04-02
DOI: 10.1109/DCC.2002.1000012 (https://doi.org/10.1109/DCC.2002.1000012)
Citations: 4
Abstract
Summary form only given. We consider string searching in compressed texts, using a compression method related to static Huffman compression. Each character is encoded as a variable-length sequence of base symbols, and each base symbol consists of a fixed number of bits. Because the number of base symbols in a code varies, we divide the base symbols into stoppers and continuers so that the start of each new code can be recognized. A stopper can appear only as the last base symbol of a code. All other base symbols are continuers, which can appear anywhere except as the last base symbol of a code. Our searching algorithm is a variation of the Boyer-Moore-Horspool algorithm. Its shift function is based on several base symbols at once in order to achieve longer jumps than the ordinary occurrence heuristic. If base symbols are four bits long, we use eight-bit bytes for the shift calculation.
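The stopper/continuer idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. The split of the sixteen 4-bit values into 2 stoppers and 14 continuers, the frequency-rank code assignment, and all function names are assumptions made for illustration. The search uses the ordinary single-symbol Horspool shift rather than the multi-symbol shift function described in the abstract, and adds a check that a match begins at a code boundary.

```python
from collections import Counter

BITS = 4                                      # bits per base symbol
NUM_STOPPERS = 2                              # assumed split: values 0..1 are stoppers
NUM_CONTINUERS = (1 << BITS) - NUM_STOPPERS   # values 2..15 are continuers


def code_for_rank(rank):
    """Code (list of base symbols) for the rank-th most frequent character."""
    length, block = 1, NUM_STOPPERS
    while rank >= block:              # find the code length whose block holds this rank
        rank -= block
        length += 1
        block *= NUM_CONTINUERS
    prefix, last = divmod(rank, NUM_STOPPERS)
    symbols = [last]                  # the final symbol is always a stopper
    for _ in range(length - 1):       # all earlier symbols are continuers
        prefix, digit = divmod(prefix, NUM_CONTINUERS)
        symbols.append(NUM_STOPPERS + digit)
    return symbols[::-1]


def build_codebook(text):
    """Assign shorter codes to more frequent characters (hypothetical scheme)."""
    ranked = [ch for ch, _ in Counter(text).most_common()]
    return {ch: code_for_rank(r) for r, ch in enumerate(ranked)}


def encode(s, codebook):
    return [sym for ch in s for sym in codebook[ch]]


def search(encoded_text, encoded_pattern):
    """Horspool search over base symbols; returns base-symbol offsets of matches."""
    m, n = len(encoded_pattern), len(encoded_text)
    if m == 0:
        return []
    # Ordinary occurrence heuristic: distance from the last occurrence of each
    # symbol (excluding the final pattern position) to the end of the pattern.
    shift = {sym: m - 1 - i for i, sym in enumerate(encoded_pattern[:-1])}
    hits, pos = [], 0
    while pos + m <= n:
        window = encoded_text[pos:pos + m]
        # A true match must start at a code boundary: either at the start of
        # the text or right after a stopper.
        at_boundary = pos == 0 or encoded_text[pos - 1] < NUM_STOPPERS
        if window == encoded_pattern and at_boundary:
            hits.append(pos)
        pos += shift.get(window[-1], m)
    return hits


text = "aaaabbbccd"
codebook = build_codebook(text)
encoded = encode(text, codebook)
print(encoded)
print(search(encoded, encode("d", codebook)))
```

In this toy text, `a` and `b` receive one-symbol codes while `c` and `d` receive two-symbol codes that end in the same stoppers. Searching for `a` therefore also hits the stopper inside each code for `c`, and the boundary check is what rejects those false matches.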