AXECHOP: a grammar-based compressor for XML
G. Leighton, Jim Diamond, T. Müldner
Proceedings. Data Compression Conference, pages 467-, published 2005-03-29
DOI: 10.1109/DCC.2005.20 (https://doi.org/10.1109/DCC.2005.20)
Citation count: 21
Abstract
Summary form only given. XML is gaining widespread acceptance as a standard for storing and transmitting structured data. One drawback of XML is its verbosity: an XML representation of a data set can easily be ten times as large as a more economical representation of the same data. To overcome this limitation, we present AXECHOP, a compression scheme tailored specifically to XML. AXECHOP begins by dividing the source XML document into structural and data segments. The structural segment is represented using a byte tokenization scheme that preserves the original structure of the document (i.e., it maintains the proper nesting and ordering of elements, attributes, and data values). The MPM compression algorithm is then used to generate a context-free grammar capable of deriving this structure, and the grammar is passed through an adaptive arithmetic coder before being written to the compressed file. The document's data is organized into a series of containers, where container membership is determined by the identity of the XML element or attribute that encloses the data. The Burrows-Wheeler transform (BWT) is then applied to the contents of each container, and the results are appended to the compressed file.
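The split into a structure stream and per-name data containers, followed by a BWT over container contents, can be illustrated with a minimal Python sketch. This is not AXECHOP's actual format: the token values (`END_TOKEN`, `DATA_TOKEN`, `FIRST_NAME_TOKEN`), the `@` prefix for attribute names, and the function names `split_document` and `bwt` are all hypothetical choices made for illustration. The MPM grammar construction and the adaptive arithmetic coder are omitted, mixed content (text following a child element) is ignored, and the BWT is a naive sorted-rotations version rather than an efficient one.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical reserved tokens; the abstract does not give AXECHOP's byte values.
END_TOKEN = 0         # closes the current element
DATA_TOKEN = 1        # marks a data value that lives in a container
FIRST_NAME_TOKEN = 2  # first token handed out to an element/attribute name


def split_document(xml_text):
    """Split XML into a structure token list and per-name data containers."""
    names = {}                      # element/attribute name -> token
    structure = []                  # token stream preserving nesting and order
    containers = defaultdict(list)  # enclosing name -> list of data values

    def token_for(name):
        # Assign tokens in order of first appearance.
        if name not in names:
            names[name] = FIRST_NAME_TOKEN + len(names)
        return names[name]

    def visit(elem):
        structure.append(token_for(elem.tag))
        for attr, value in elem.attrib.items():
            key = "@" + attr        # assumed convention to mark attributes
            structure.append(token_for(key))
            structure.append(DATA_TOKEN)
            containers[key].append(value)
        text = (elem.text or "").strip()
        if text:
            structure.append(DATA_TOKEN)
            containers[elem.tag].append(text)
        for child in elem:
            visit(child)
        structure.append(END_TOKEN)

    visit(ET.fromstring(xml_text))
    return structure, dict(containers), names


def bwt(s, sentinel="\0"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    s = s + sentinel                # sentinel must not occur in s
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


doc = ('<books><book id="1"><title>A</title></book>'
       '<book id="2"><title>B</title></book></books>')
structure, containers, names = split_document(doc)
# Join each container's values with an assumed separator, then transform.
transformed = {name: bwt("\x01".join(values))
               for name, values in containers.items()}
```

Grouping values by their enclosing element or attribute places similar strings next to each other, which is the kind of input on which the BWT tends to produce long runs of repeated characters.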