{"title":"Efficient Compression Scheme for Large Natural Text Using Zipf Distribution","authors":"Md. Ashiq Mahmood, K. Hasan","doi":"10.1109/ICASERT.2019.8934651","DOIUrl":null,"url":null,"abstract":"Data compression is the way toward modifying, encoding or changing over the bit structure of data in such a way that it expends less space. Character encoding is somewhat related to data compression which represents a character by some sort of encoding framework. Encoding is the way toward putting a succession of characters into a specific arrangement for effective transmission or capacity. Compression of data covers a giant domain of employments including information correspondence, information storing and database development. In this paper we propose an efficient and new compression algorithm for large natural datasets where any characters is encoded by 5 bits called 5-Bit Compression (5BC). The algorithm manages an encoding procedure by 5 bits for any characters in English and Bangla using table look up. The look up table is constructed by using Zipf distribution. The Zipf distribution is a discrete distribution of commonly used characters in different languages. 8 bit characters are converted to 5 bits by parting the characters into 7 sets and utilizing them in a solitary table. The character’s location is then used uniquely encoding by 5 bits. The text can be compressed by 5BC is more than 60% of the actual text. The algorithm for decompression to recover the original data is depicted also. After the output string of 5BC is produced, LZW and Huffman techniques further compress the output string. Optimistic performance is demonstrated by our experimental result.","PeriodicalId":6613,"journal":{"name":"2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT)","volume":"70 1","pages":"1-6"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICASERT.2019.8934651","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Data compression is the process of modifying, encoding, or converting the bit structure of data so that it occupies less space. Character encoding is closely related to data compression: it represents each character through some encoding scheme. Encoding is the process of arranging a sequence of characters into a specific format for efficient transmission or storage. Data compression covers a wide range of applications, including data communication, data storage, and database development. In this paper we propose a new and efficient compression algorithm for large natural-text datasets, called 5-Bit Compression (5BC), in which every character is encoded with 5 bits. The algorithm encodes any English or Bangla character with 5 bits using table lookup. The lookup table is constructed using the Zipf distribution, a discrete distribution of commonly used characters in different languages. 8-bit characters are converted to 5 bits by partitioning the characters into 7 sets and arranging them in a single table; a character's position within its set is then encoded uniquely with 5 bits. 5BC compresses text by more than 60% of the original text size. The corresponding decompression algorithm for recovering the original data is also described. After the 5BC output string is produced, it is further compressed with LZW and Huffman coding. Our experimental results demonstrate promising performance.
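The abstract only outlines the 5BC idea, so the following is a minimal illustrative sketch rather than the authors' actual scheme: it assumes a small Zipf-style frequency ranking, splits the characters into fixed-size sets, and reserves one 5-bit code as a hypothetical set-switch escape. The ranking string, the set size of 31, and the escape mechanism are all assumptions made here for demonstration; the paper's lookup table covers English and Bangla and uses 7 sets.

```python
# Illustrative 5-bit table-lookup codec (NOT the paper's exact 5BC scheme).
# Assumptions: characters ranked by an assumed Zipf-style frequency order,
# split into sets of 31; code 31 acts as a hypothetical "switch set" escape
# followed by one more 5-bit value naming the new set.

FREQ_ORDER = " etaoinshrdlcumwfgypbvkjxqz.,!?0123456789"  # assumed ranking

SET_SIZE = 31          # codes 0..30 index a character inside the current set
ESCAPE = 31            # code 31 = hypothetical set-switch marker

SETS = [FREQ_ORDER[i:i + SET_SIZE] for i in range(0, len(FREQ_ORDER), SET_SIZE)]

def encode_5bc(text):
    """Return a list of 5-bit code words (integers 0..31)."""
    codes, current_set = [], 0
    for ch in text:
        for set_no, chars in enumerate(SETS):
            if ch in chars:
                if set_no != current_set:          # switch to the set holding ch
                    codes += [ESCAPE, set_no]
                    current_set = set_no
                codes.append(chars.index(ch))      # 5-bit position inside the set
                break
        else:
            raise ValueError(f"character {ch!r} not in lookup table")
    return codes

def decode_5bc(codes):
    """Invert encode_5bc and recover the original text."""
    out, current_set, i = [], 0, 0
    while i < len(codes):
        if codes[i] == ESCAPE:
            current_set, i = codes[i + 1], i + 2
        else:
            out.append(SETS[current_set][codes[i]])
            i += 1
    return "".join(out)

if __name__ == "__main__":
    sample = "this is a test 123"
    packed = encode_5bc(sample)
    assert decode_5bc(packed) == sample
    print(f"{len(sample) * 8} bits -> {len(packed) * 5} bits")
```

As in the paper's pipeline, the 5-bit code stream produced by such an encoder could then be passed to a general-purpose compressor such as LZW or Huffman coding for a further size reduction.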