Natural Language Compression per Blocks

P. Procházka, J. Holub
{"title":"Natural Language Compression per Blocks","authors":"P. Procházka, J. Holub","doi":"10.1109/CCP.2011.25","DOIUrl":null,"url":null,"abstract":"We present a new natural language compression method: Semi-adaptive Two Byte Dense Code (STBDC). STBDC performs compression per blocks. It means that the input is divided into the several blocks and each of the blocks is compressed separately according to its own statistical model. To avoid the redundancy the final vocabulary file is composed as the sequence of the changes in the model of the two consecutive blocks. STBDC belongs to the family of Dense codes and keeps all their attractive properties including very high compression and decompression speed and acceptable compression ratio around 32 % on natural language text. Moreover STBDC provides other properties applicable in digital libraries and other textual databases. The compression method allows direct searching on the compressed text, whereas the vocabulary can be used as a block index. STBDC is very easy on limited bandwidth in the client/server architecture. It can send namely single compressed blocks only with corresponding part of the vocabulary. Further STBDC enables various approaches of updating and extending of the compressed text.","PeriodicalId":167131,"journal":{"name":"2011 First International Conference on Data Compression, Communications and Processing","volume":"129 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 First International Conference on Data Compression, Communications and Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCP.2011.25","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

We present a new natural language compression method: Semi-adaptive Two Byte Dense Code (STBDC). STBDC performs compression per blocks. It means that the input is divided into the several blocks and each of the blocks is compressed separately according to its own statistical model. To avoid the redundancy the final vocabulary file is composed as the sequence of the changes in the model of the two consecutive blocks. STBDC belongs to the family of Dense codes and keeps all their attractive properties including very high compression and decompression speed and acceptable compression ratio around 32 % on natural language text. Moreover STBDC provides other properties applicable in digital libraries and other textual databases. The compression method allows direct searching on the compressed text, whereas the vocabulary can be used as a block index. STBDC is very easy on limited bandwidth in the client/server architecture. It can send namely single compressed blocks only with corresponding part of the vocabulary. Further STBDC enables various approaches of updating and extending of the compressed text.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
每个块的自然语言压缩
提出了一种新的自然语言压缩方法:半自适应双字节密集码(STBDC)。STBDC对每个块执行压缩。这意味着输入被分成几个块,每个块根据自己的统计模型被单独压缩。为了避免冗余,最终词汇表文件由两个连续块的模型变化序列组成。STBDC属于密集代码家族,并保持了所有吸引人的特性,包括非常高的压缩和解压缩速度以及在自然语言文本上可接受的约32%的压缩比。此外,STBDC还提供了适用于数字图书馆和其他文本数据库的其他属性。压缩方法允许对压缩文本进行直接搜索,而词汇表可以用作块索引。在客户端/服务器架构中,STBDC在有限的带宽下非常容易实现。它可以发送单个压缩块,只包含词汇表的相应部分。此外,STBDC支持更新和扩展压缩文本的各种方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
CoTracks: A New Lossy Compression Schema for Tracking Logs Data Based on Multiparametric Segmentation Fast Implementation of Block Motion Estimation Algorithms in Video Encoders Electrophysiological Data Processing Using a Dynamic Range Compressor Coupled to a Ten Bits A/D Convertion Port A Generic Intrusion Detection and Diagnoser System Based on Complex Event Processing QoS Performance Testing of Multimedia Delivery over WiMAX Networks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1