Efficient Compression Scheme for Large Natural Text Using Zipf Distribution

Md. Ashiq Mahmood, K. Hasan
DOI: 10.1109/ICASERT.2019.8934651 (https://doi.org/10.1109/ICASERT.2019.8934651)
Published in: 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), pp. 1-6
Publication date: 2019-05-01
Citations: 2

Abstract

Data compression is the process of modifying, encoding, or converting the bit structure of data so that it occupies less space. Character encoding, which represents each character through some encoding scheme, is somewhat related to data compression: encoding arranges a sequence of characters into a specific format for efficient transmission or storage. Data compression has a wide range of applications, including data communication, data storage, and database development. In this paper we propose a new and efficient compression algorithm for large natural-text datasets, called 5-Bit Compression (5BC), in which every character is encoded in 5 bits. The algorithm encodes any character in English and Bangla in 5 bits using table lookup. The lookup table is constructed using the Zipf distribution, a discrete distribution that here describes how commonly characters are used in different languages. 8-bit characters are converted to 5 bits by partitioning the characters into 7 sets and using them in a single table. Each character's location in the table is then encoded uniquely in 5 bits. The text compressed by 5BC is more than 60% of the actual text. The decompression algorithm that recovers the original data is also described. After 5BC produces its output string, LZW and Huffman techniques compress that string further. Our experimental results demonstrate promising performance.
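The abstract does not spell out how the 7 sets share one table, so the following Python sketch is only one plausible reading, not the authors' method. It assumes (hypothetically) that 7 of the 32 possible 5-bit codes are reserved as set-switch markers, leaving 25 character slots per set, and that characters are assigned to sets in descending frequency order as a stand-in for the Zipf-based ranking. All names (`build_table`, `encode`, `decode`, `pack`) are illustrative.

```python
from collections import Counter

CODE_BITS = 5                        # from the abstract: every code is 5 bits
NUM_SETS = 7                         # from the abstract: characters split into 7 sets
SLOTS = 2 ** CODE_BITS - NUM_SETS    # assumption: 25 character slots + 7 switch codes


def build_table(sample):
    """Rank characters by frequency in `sample` and cut the ranking into sets."""
    ranked = [ch for ch, _ in Counter(sample).most_common()]
    if len(ranked) > NUM_SETS * SLOTS:
        raise ValueError("too many distinct characters for this layout")
    return [ranked[i:i + SLOTS] for i in range(0, len(ranked), SLOTS)]


def encode(text, table):
    """Return a list of 5-bit codes (ints in 0..31) for `text`."""
    where = {ch: (s, i) for s, chars in enumerate(table) for i, ch in enumerate(chars)}
    codes, current = [], 0
    for ch in text:
        s, i = where[ch]
        if s != current:
            codes.append(SLOTS + s)  # hypothetical switch code: select set s
            current = s
        codes.append(i)              # position of the character inside its set
    return codes


def decode(codes, table):
    """Invert `encode`, tracking the active set through the switch codes."""
    out, current = [], 0
    for c in codes:
        if c >= SLOTS:
            current = c - SLOTS
        else:
            out.append(table[current][c])
    return "".join(out)


def pack(codes):
    """Pack 5-bit codes into bytes, most significant bit first, zero-padded."""
    acc, bits, out = 0, 0, bytearray()
    for c in codes:
        acc = (acc << CODE_BITS) | c
        bits += CODE_BITS
        while bits >= 8:
            bits -= 8
            out.append((acc >> bits) & 0xFF)
            acc &= (1 << bits) - 1   # keep only the bits not yet written
    if bits:
        out.append((acc << (8 - bits)) & 0xFF)
    return bytes(out)


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog"
    table = build_table(text)
    codes = encode(text, table)
    assert decode(codes, table) == text
    print(len(text), "bytes in,", len(pack(codes)), "bytes out")
```

Under this layout, a run of characters from the same set costs 5 bits each instead of 8, which is 62.5% of the original size; every set switch adds one extra 5-bit code, so placing the most frequent characters in the first set keeps switches rare. A real decoder would also need the code count (or an end marker) to ignore the zero padding in the last byte.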