Shirou Maruyama, M. Takeda, Masaya Nakahara, H. Sakamoto
Grammar-based compression is a well-studied technique for constructing a small context-free grammar (CFG) that uniquely derives a given text. In this paper, we present an online algorithm for lightweight grammar-based compression. Our algorithm is based on the LCA algorithm [Sakamoto et al. 2004], which guarantees a nearly optimum compression ratio and space. LCA, however, is an offline algorithm and requires external storage to keep its main-memory consumption low. We therefore present an online version that inherits most characteristics of the original LCA. Our algorithm guarantees an $O(\log^2 n)$ approximation ratio with respect to the optimum grammar size, and all work is carried out in main memory, within space bounded by the output size. In addition, we propose a more practical encoding based on the parentheses representation of a binary tree. Experimental results on repetitive texts demonstrate that our algorithm compresses effectively compared to other practical compressors and that its space consumption is smaller than the input text size.
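The abstract mentions an encoding based on the parentheses representation of a binary tree but does not spell it out; the sketch below is one standard way to realize such a representation (a pre-order bit string plus a leaf-symbol stream), offered as an assumption-laden illustration rather than the authors' exact encoding:

```python
# Hypothetical sketch: parentheses-style encoding of a binary parse tree.
# A leaf is a terminal symbol; an internal node is a pair (left, right),
# as produced by a grammar whose rules all have the form X -> Y Z.

def encode(node, bits, leaves):
    """Pre-order traversal: '1' marks an internal node, '0' marks a leaf."""
    if isinstance(node, tuple):          # internal node with two children
        bits.append('1')
        encode(node[0], bits, leaves)
        encode(node[1], bits, leaves)
    else:                                # leaf: store its symbol separately
        bits.append('0')
        leaves.append(node)

def decode(bits, leaves):
    """Rebuild the tree from the bit string and the leaf-symbol stream."""
    it_bits, it_leaves = iter(bits), iter(leaves)
    def build():
        if next(it_bits) == '1':
            left = build()
            right = build()
            return (left, right)
        return next(it_leaves)
    return build()

if __name__ == "__main__":
    # Parse tree of "abab" under the rules S -> A A, A -> a b.
    tree = (('a', 'b'), ('a', 'b'))
    bits, leaves = [], []
    encode(tree, bits, leaves)
    print(''.join(bits), leaves)         # 1100100 ['a', 'b', 'a', 'b']
    assert decode(bits, leaves) == tree
```

A tree with n leaves needs 2n - 1 bits for its shape, so the pointer structure of the grammar's parse tree is never stored explicitly.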
{"title":"An Online Algorithm for Lightweight Grammar-Based Compression","authors":"Shirou Maruyama, M. Takeda, Masaya Nakahara, H. Sakamoto","doi":"10.1109/CCP.2011.40","DOIUrl":"https://doi.org/10.1109/CCP.2011.40","url":null,"abstract":"Grammar-based compression is a well-studied technique for constructing a small context-free grammar (CFG) uniquely deriving a given text. In this paper, we present an online algorithm for lightweight grammar-based compression. Our algorithm is based on the LCA algorithm [Sakamoto et al. 2004]which guarantees nearly optimum compression ratio and space. LCA, however, is an offline algorithm and requires external space to save space consumption. Therefore, we present its online version which inherits most characteristics of the original LCA. Our algorithm guarantees $O(log^2 n)$-approximation ratio for an optimum grammar size, and all work is carried out on a main memory space which is bounded by the output size. In addition, we propose more practical encoding based on parentheses representation of a binary tree. Experimental results for repetitive texts demonstrate that our algorithm achieves effective compression compared to other practical compressors and the space consumption of our algorithm is smaller than the input text size.","PeriodicalId":167131,"journal":{"name":"2011 First International Conference on Data Compression, Communications and Processing","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121376914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This work presents a generic Intrusion Detection and Diagnosis System, which implements a comprehensive alert correlation workflow for the detection and diagnosis of complex intrusion scenarios in Large-scale Complex Critical Infrastructures. The online detection and diagnosis process is based on a hybrid and hierarchical approach, which detects intrusion scenarios by collecting diverse information at several architectural levels through distributed security probes, and performs complex event correlation using a Complex Event Processing Engine. The escalation from intrusion symptoms to the identified target and cause of the intrusion is driven by a knowledge base represented by an ontology. A prototype implementation of the proposed Intrusion Detection and Diagnosis framework is also presented.
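To make the correlation idea concrete, the toy sketch below groups symptom alerts from different probes on the same target within a time window; the event names, the rule, and the threshold are illustrative assumptions, not the authors' ontology or CEP engine:

```python
# Toy sketch of windowed alert correlation (the real system uses a CEP engine
# and an ontology-driven escalation; names and the rule here are illustrative).
from collections import namedtuple

Alert = namedtuple("Alert", "time probe symptom target")

def correlate(alerts, window=30.0):
    """Flag targets whose symptoms, reported by different probes within
    `window` seconds, match an assumed multi-level intrusion scenario."""
    scenarios = []
    alerts = sorted(alerts, key=lambda a: a.time)
    for i, a in enumerate(alerts):
        related = [b for b in alerts[i:] if b.target == a.target
                   and b.time - a.time <= window]
        symptoms = {b.symptom for b in related}
        probes = {b.probe for b in related}
        # Illustrative rule: network- and host-level symptoms on the same
        # target within the window escalate to an intrusion scenario.
        if {"port_scan", "privilege_escalation"} <= symptoms and len(probes) > 1:
            scenarios.append((a.target, sorted(symptoms)))
    return scenarios

alerts = [Alert(0.0, "net_probe", "port_scan", "srv1"),
          Alert(12.5, "host_probe", "privilege_escalation", "srv1")]
print(correlate(alerts))   # [('srv1', ['port_scan', 'privilege_escalation'])]
```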
{"title":"A Generic Intrusion Detection and Diagnoser System Based on Complex Event Processing","authors":"M. Ficco, L. Romano","doi":"10.1109/CCP.2011.43","DOIUrl":"https://doi.org/10.1109/CCP.2011.43","url":null,"abstract":"This work presents a generic Intrusion Detection and Diagnosis System, which implements a comprehensive alert correlation workflow for detection and diagnosis of complex intrusion scenarios in Large scale Complex Critical Infrastructures. The on-line detection and diagnosis process is based on an hybrid and hierarchical approach, which allows to detect intrusion scenarios by collecting diverse information at several architectural levels, using distributed security probes, as well as perform complex event correlation based on a Complex Event Processing Engine. The escalation process from intrusion symptoms to the identified target and cause of the intrusion is driven by a knowledge-base represented by an ontology. A prototype implementation of the proposed Intrusion Detection and Diagnosis framework is also presented.","PeriodicalId":167131,"journal":{"name":"2011 First International Conference on Data Compression, Communications and Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115533272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Intrusion Detection Systems are the major technology used for protecting information systems. However, they do not detect intrusions directly; they only monitor attack symptoms. Therefore, no assumption can be made about the outcome of an attack, and no assurance can be given once the system is compromised. Intrusion tolerance techniques focus on providing a minimal level of service even when the system has been partially compromised. This paper presents an intrusion-tolerant approach for Denial of Service attacks against Web Services. It focuses on the detection of attack symptoms as well as the diagnosis of intrusion effects, in order to react only if the attack succeeds. In particular, this work focuses on a specific Denial of Service attack, called Deeply-Nested XML. Preliminary experimental results show that the proposed approach improves the performance of Intrusion Detection Systems, both by increasing diagnosis capacity and by reducing service unavailability during an intrusion.
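A Deeply-Nested XML attack exhausts the service's parser with pathologically nested elements; the check below is a minimal illustration of that symptom using a streaming parser (the depth limit and the detection strategy are assumptions, not the authors' detector):

```python
# Minimal sketch: detect the Deeply-Nested XML symptom by bounding element
# depth during streaming parsing (the limit value is an arbitrary assumption).
import xml.parsers.expat

def nesting_depth(xml_bytes, limit=100):
    """Return the maximum element depth, aborting once `limit` is exceeded."""
    depth = {"cur": 0, "max": 0}
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        depth["cur"] += 1
        depth["max"] = max(depth["max"], depth["cur"])
        if depth["cur"] > limit:
            raise ValueError("possible Deeply-Nested XML attack")

    def end(name):
        depth["cur"] -= 1

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_bytes, True)
    return depth["max"]

payload = b"<a>" * 150 + b"x" + b"</a>" * 150   # pathologically nested request
try:
    nesting_depth(payload)
except ValueError as e:
    print(e)                                     # the symptom is reported
```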
{"title":"Intrusion Tolerant Approach for Denial of Service Attacks to Web Services","authors":"M. Ficco, M. Rak","doi":"10.1109/CCP.2011.44","DOIUrl":"https://doi.org/10.1109/CCP.2011.44","url":null,"abstract":"Intrusion Detection Systems are the major technology used for protecting information systems. However, they do not directly detect intrusion, but they only monitor the attack symptoms. Therefore, no assumption can be made on the outcome of the attack, no assurance can be assumed once the system is compromised. The intrusion tolerance techniques focus on providing minimal level of services, even when the system has been partially compromised. This paper presents an intrusion tolerant approach for Denial of Service attacks to Web Services. It focuses on the detection of attack symptoms as well as the diagnosis of intrusion effects in order to perform a proper reaction only if the attack succeeds. In particular, this work focuses on a specific Denial of Service attack, called Deeply-Nested XML. Preliminary experimental results show that the proposed approach results in a better performance of the Intrusion Detection Systems, in terms of increasing diagnosis capacity as well as reducing the service unavailability during an intrusion.","PeriodicalId":167131,"journal":{"name":"2011 First International Conference on Data Compression, Communications and Processing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130290173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The fast development of the Graphics Processing Unit (GPU) has led to the popularity of general-purpose computing on GPUs (GPGPU). Most modern computers now have a CPU-GPGPU heterogeneous architecture, with the CPU as the host processor. In this work, we propose a multithreaded file chunking prototype system that exploits the hardware organization of the CPU-GPGPU heterogeneous computer and determines which device should be used to chunk a file, in order to accelerate the content-based file chunking operation of deduplication. We built rules by which the system chooses the chunking device, and we also determined the optimal choice of other related parameters of both the CPU and GPGPU subsystems, such as segment size and block dimension. The prototype was implemented and tested. Results using a GTX460 (336 cores) and an Intel i5 (four cores) show that this system increases chunking speed by 63% compared to using the GPGPU alone and by 80% compared to using the CPU alone.
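The operation being accelerated is content-defined chunking: a rolling hash slides over the data and a chunk boundary is declared whenever the hash satisfies a boundary condition. The CPU-side sketch below illustrates that operation; the polynomial hash, window size, and mask are assumptions, since the abstract does not fix them:

```python
# Minimal CPU-side sketch of content-defined chunking for deduplication.
# The rolling hash, window size and boundary mask are illustrative only;
# the paper itself is about scheduling this work between CPU and GPGPU.
import random

def chunk_boundaries(data: bytes, window=48, mask=(1 << 13) - 1):
    """Yield cut points where the rolling hash of the last `window` bytes
    satisfies the boundary condition (expected chunk size ~ mask + 1)."""
    MOD = (1 << 61) - 1
    BASE = 257
    shift = pow(BASE, window - 1, MOD)   # weight of the byte leaving the window
    h, start = 0, 0
    for i, b in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * shift) % MOD
        h = (h * BASE + b) % MOD
        if i + 1 - start >= window and (h & mask) == mask:
            yield i + 1                  # chunk boundary after byte i
            start = i + 1
    if start < len(data):
        yield len(data)                  # final partial chunk

random.seed(0)
data = bytes(random.getrandbits(8) for _ in range(1 << 16))
print(list(chunk_boundaries(data))[:5])
```

Because boundaries depend only on local content, inserting bytes near the start of a file shifts at most a few chunk boundaries, which is what makes this chunking useful for deduplication.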
{"title":"Multithread Content Based File Chunking System in CPU-GPGPU Heterogeneous Architecture","authors":"Zhi Tang, Y. Won","doi":"10.1109/CCP.2011.20","DOIUrl":"https://doi.org/10.1109/CCP.2011.20","url":null,"abstract":"the fast development of Graphics Processing Unit (GPU) leads to the popularity of General-purpose usage of GPU (GPGPU). So far, most modern computers are CPU-GPGPU heterogeneous architecture and CPU is used as host processor. In this work, we promote a multithread file chunking prototype system, which is able to exploit the hardware organization of the CPU-GPGPU heterogeneous computer and determine which device should be used to chunk the file to accelerate the content based file chunking operation of deduplication. We built rules for the system to choose which device should be used to chunk file and also found the optimal choice of other related parameters of both CPU and GPGPU subsystem like segment size and block dimension. This prototype was implemented and tested. The result of using GTX460(336 cores) and Intel i5 (four cores) shows that this system can increase the chunking speed 63% compared to using GPGPU alone and 80% compared to using CPU alone.","PeriodicalId":167131,"journal":{"name":"2011 First International Conference on Data Compression, Communications and Processing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122043683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a new natural language compression method: Semi-adaptive Two Byte Dense Code (STBDC). STBDC performs compression per blocks: the input is divided into several blocks, and each block is compressed separately according to its own statistical model. To avoid redundancy, the final vocabulary file is composed as the sequence of changes in the model between two consecutive blocks. STBDC belongs to the family of Dense codes and keeps all their attractive properties, including very high compression and decompression speed and an acceptable compression ratio of around 32% on natural language text. Moreover, STBDC provides further properties applicable in digital libraries and other textual databases. The compression method allows direct searching on the compressed text, and the vocabulary can be used as a block index. STBDC is well suited to limited bandwidth in a client/server architecture, since individual compressed blocks can be sent together with only the corresponding part of the vocabulary. Furthermore, STBDC enables various approaches to updating and extending the compressed text.
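Dense codes assign byte-oriented codewords to words by frequency rank. The sketch below is an End-Tagged-Dense-Code-style illustration limited to two bytes per codeword; it is an assumption-based stand-in that does not reproduce STBDC's exact codeword assignment or its per-block vocabulary-difference mechanism:

```python
# Illustrative rank-based dense coding with at most two bytes per word.
# End tag convention (ETDC-style): a byte >= 128 terminates a codeword.
# This sketch handles up to 128 + 128*128 distinct words.
from collections import Counter

def build_ranks(words):
    """Map each distinct word to its rank in decreasing frequency order."""
    freq = Counter(words)
    ordered = sorted(freq, key=lambda w: (-freq[w], w))
    return {w: r for r, w in enumerate(ordered)}, ordered

def encode(words, ranks):
    out = bytearray()
    for w in words:
        r = ranks[w]
        if r < 128:                       # one-byte codeword
            out.append(128 + r)
        else:                             # two-byte codeword
            r -= 128
            out.append(r // 128)          # continuation byte, value < 128
            out.append(128 + r % 128)     # end-tagged final byte
    return bytes(out)

def decode(code, ordered):
    words, v, nbytes = [], 0, 0
    for b in code:
        nbytes += 1
        if b < 128:
            v = v * 128 + b               # continuation byte
        else:
            v = v * 128 + (b - 128)       # end tag closes the codeword
            rank = v + (128 if nbytes == 2 else 0)
            words.append(ordered[rank])
            v, nbytes = 0, 0
    return words

text = "to be or not to be".split()
ranks, ordered = build_ranks(text)
code = encode(text, ranks)
assert decode(code, ordered) == text
print(len(code), "bytes for", len(text), "words")
```

Because every codeword ends in a byte with the high bit set, a compressed pattern can be located directly in the compressed text with an ordinary byte-wise search, which is the property the abstract exploits for direct searching.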
{"title":"Natural Language Compression per Blocks","authors":"P. Procházka, J. Holub","doi":"10.1109/CCP.2011.25","DOIUrl":"https://doi.org/10.1109/CCP.2011.25","url":null,"abstract":"We present a new natural language compression method: Semi-adaptive Two Byte Dense Code (STBDC). STBDC performs compression per blocks. It means that the input is divided into the several blocks and each of the blocks is compressed separately according to its own statistical model. To avoid the redundancy the final vocabulary file is composed as the sequence of the changes in the model of the two consecutive blocks. STBDC belongs to the family of Dense codes and keeps all their attractive properties including very high compression and decompression speed and acceptable compression ratio around 32 % on natural language text. Moreover STBDC provides other properties applicable in digital libraries and other textual databases. The compression method allows direct searching on the compressed text, whereas the vocabulary can be used as a block index. STBDC is very easy on limited bandwidth in the client/server architecture. It can send namely single compressed blocks only with corresponding part of the vocabulary. Further STBDC enables various approaches of updating and extending of the compressed text.","PeriodicalId":167131,"journal":{"name":"2011 First International Conference on Data Compression, Communications and Processing","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127488581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Miguel Hernández-Cabronero, Ian Blanes, J. Serra-Sagristà, M. Marcellin
We review the state of the art in DNA microarray image compression. First, we describe the most relevant approaches published in the literature and classify them according to the stage of the typical image compression process where each approach makes its contribution. We then summarize the compression results reported for these microarray-specific image compression schemes. In a set of experiments conducted for this paper, we obtain results for several popular image coding techniques, including the most recent coding standards. The prediction-based schemes CALIC and JPEG-LS, and JPEG2000 using zero wavelet decomposition levels, are the best performing standard compressors, but all are outperformed by the best microarray-specific technique, Battiato's CNN-based scheme.
{"title":"A Review of DNA Microarray Image Compression","authors":"Miguel Hernández-Cabronero, Ian Blanes, J. Serra-Sagristà, M. Marcellin","doi":"10.1109/CCP.2011.21","DOIUrl":"https://doi.org/10.1109/CCP.2011.21","url":null,"abstract":"We review the state of the art in DNA micro array image compression. First, we describe the most relevant approaches published in the literature and classify them according to the stage of the typical image compression process where each approach makes its contribution. We then summarize the compression results reported for these specific-specific image compression schemes. In a set of experiments conducted for this paper, we obtain results for several popular image coding techniques, including the most recent coding standards. Prediction-based schemes CALIC and JPEG-LS, and JPEG2000 using zero wavelet decomposition levels are the best performing standard compressors, but are all outperformed by the best micro array-specific technique, Battiato's CNN-based scheme.","PeriodicalId":167131,"journal":{"name":"2011 First International Conference on Data Compression, Communications and Processing","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128618246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
For each $j \geq 1$, if $T_j$ is the finite rooted binary tree with $2^j$ leaves, the hierarchical type class of a binary string $x$ of length $2^j$ is obtained by placing the entries of $x$ as labels on the leaves of $T_j$ and then forming all permutations of $x$ according to the permutations of the leaf labels under all isomorphisms of the tree $T_j$ into itself. The set of binary strings of length $2^j$ is partitioned into hierarchical type classes, and in each such class, all of the strings have the same type $(n_0^j, n_1^j)$, where $n_0^j, n_1^j$ are respectively the numbers of zeroes and ones in the strings. Let $p(n_0^j, n_1^j)$ be the probability vector $(n_0^j/2^j, n_1^j/2^j)$ belonging to the set ${\cal P}_2$ of all two-dimensional probability vectors. For each $j \geq 1$, and each of the $2^j+1$ possible types $(n_0^j, n_1^j)$, a hierarchical type class ${\cal S}(n_0^j, n_1^j)$ is specified. Conditions are investigated under which there will exist a function $h:{\cal P}_2 \to [0, \infty)$ such that for each $p \in {\cal P}_2$, if $\{(n_0^j, n_1^j): j \geq 1\}$ is any sequence of types for which $p(n_0^j, n_1^j) \to p$, then the sequence $\{2^{-j}\log_2({\rm card}({\cal S}(n_0^j, n_1^j))): j \geq 1\}$ converges to $h(p)$. Such functions $h$, called hierarchical entropy functions, play the same role in hierarchical type class coding theory that the Shannon entropy function on ${\cal P}_2$ does in traditional type class coding theory, except that there are infinitely many hierarchical entropy functions but only one Shannon entropy function. One of the hierarchical entropy functions $h$ that is studied is a self-affine function for which a closed-form expression is obtained making use of an iterated function system whose attractor is the graph of $h$.
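For intuition, the hierarchical type class containing a given string of length $2^j$ can be enumerated directly from the first sentence of the abstract: every isomorphism of the complete binary tree into itself amounts to optionally swapping the two subtrees at each internal node, so the class of $x$ is obtained by recursively combining the classes of its two halves in either order. A small sketch (my own illustration, not the paper's construction of ${\cal S}(n_0^j, n_1^j)$):

```python
# Enumerate the hierarchical type class of a binary string of length 2^j by
# recursively combining the classes of its two halves in either order.
from math import log2

def hierarchical_class(x):
    if len(x) == 1:
        return {x}
    half = len(x) // 2
    left = hierarchical_class(x[:half])
    right = hierarchical_class(x[half:])
    return ({a + b for a in left for b in right}
            | {b + a for a in left for b in right})

x = "0111"                                   # j = 2, type (n0, n1) = (1, 3)
cls = hierarchical_class(x)
print(sorted(cls))                           # ['0111', '1011', '1101', '1110']
print(log2(len(cls)) / len(x))               # 2^{-j} log2 card(.), here 0.5
```

The printed normalized log-cardinality is exactly the quantity whose limiting behaviour defines the hierarchical entropy function $h(p)$ in the abstract.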
{"title":"Hierarchical Type Classes and Their Entropy Functions","authors":"J. Kieffer","doi":"10.1109/CCP.2011.36","DOIUrl":"https://doi.org/10.1109/CCP.2011.36","url":null,"abstract":"For each $j geq 1$, if $T_j$ is the finite rooted binary tree with $2^j$ leaves, the hierarchical type class of binary string $x$ of length $2^j$ is obtained by placing the entries of $x$ as label son the leaves of $T_j$ and then forming all permutations of $x$according to the permutations of the leaf labels under all isomorphisms of tree $T_j$ into itself. The set of binary strings of length $2^j$ is partitioned into hierarchical type classes, and in each such class, all of the strings have the same type $(n_0^j, n_1^j)$, where $n_0^j, n_1^j$ are respectively the numbers of zeroes and ones in the strings. Let $p(n_0^j, n_1^j)$ be the probability vector $(n_0^j/2^j, n_1^j/2^j)$belonging to the set ${cal P}_2$ of all two-dimensional probability vectors. For each $j geq 1$, and each of the $2^j+1$ possible types $(n_0^j, n_1^j)$, a hierarchical type class ${cal S}(n_0^j, n_1^j)$is specified. Conditions are investigated under which there will exist a function $h:{cal P}_2to [0, infty)$ such that for each $pin {cal P}_2$, if ${(n_0^j, n_1^j):jgeq 1}$ is any sequence of types for which $p(n_0^j, n_1^j) to p$, then the sequence ${2^{-j}log_2({rm card}({cal S}(n_0^j, n_1^j))):j geq 1}$converges to $h(p)$. Such functions $h$, called hierarchical entropy functions, play the same role in hierarchical type class coding theory that the Shannon entropy function on ${cal P}_2$ does in traditional type class coding theory, except that there are infinitely many hierarchical entropy functions but only one Shannon entropy function. One of the hierarchical entropy functions $h$ that is studied is a self-affine function for which a closed-form expression is obtained making use of an iterated function system whose attractor is the graph of $h$.","PeriodicalId":167131,"journal":{"name":"2011 First International Conference on Data Compression, Communications and Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130835446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The paper presents a hardware solution for in vivo electrophysiological signal processing, using continuous data acquisition on a PC. The originality of the paper lies in the proposed blocks that selectively amplify the biosignals. One of the major problems in electrophysiological signal monitoring is the impossibility of recording weak signals from deep organs that are covered by noise or by strong cardiac or muscular signals. An automatic gain control block is used, so that high-power skin signals are amplified less than the low-power components. The analog processing block is based on a dynamic range compressor containing the automatic gain control block. The following block is a clipper, which captures the transitions that escape the dynamic range compressor. A low-pass filter is connected at the clipper output to abruptly cut the high frequencies, such as 50 Hz and ECG components. The data vector recording is performed by a microcontroller with strong internal resources, including a ten-bit A/D conversion port.
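The described chain (automatic gain control inside a dynamic range compressor, a clipper, then a low-pass filter before A/D conversion) can be mimicked with a discrete-time software analogue; the sketch below is purely a numerical illustration with assumed coefficients, not the authors' analog hardware design:

```python
# Software analogue of the processing chain: envelope-based automatic gain
# control, hard clipper, and a one-pole low-pass filter. All parameters are
# illustrative assumptions.
import math

def agc(samples, target=0.3, alpha=0.01):
    """Amplify weak components more than strong (e.g. skin/cardiac) ones."""
    env, out = 1e-6, []
    for s in samples:
        env = (1 - alpha) * env + alpha * abs(s)   # slowly tracked envelope
        out.append(s * target / max(env, 1e-6))
    return out

def clip(samples, limit=1.0):
    """Catch transitions that escape the compressor."""
    return [max(-limit, min(limit, s)) for s in samples]

def lowpass(samples, fc=40.0, fs=1000.0):
    """One-pole low-pass to attenuate components above ~fc Hz (e.g. 50 Hz)."""
    a = 1.0 / (1.0 + fs / (2 * math.pi * fc))
    y, out = 0.0, []
    for s in samples:
        y += a * (s - y)
        out.append(y)
    return out

fs = 1000.0
t = [n / fs for n in range(1000)]
signal = [0.05 * math.sin(2 * math.pi * 5 * x)        # weak deep-organ component
          + 1.0 * math.sin(2 * math.pi * 50 * x)      # strong 50 Hz interference
          for x in t]
processed = lowpass(clip(agc(signal)), fc=40.0, fs=fs)
print(max(abs(s) for s in processed))
```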
{"title":"Electrophysiological Data Processing Using a Dynamic Range Compressor Coupled to a Ten Bits A/D Convertion Port","authors":"F. Babarada, C. Ravariu, A. Janel","doi":"10.1109/CCP.2011.24","DOIUrl":"https://doi.org/10.1109/CCP.2011.24","url":null,"abstract":"The paper presents a hardware solution of the in vivo electrophysiological signals processing, using a continuous data acquisition on PC. The originality of the paper comes from some blocks proposal, which selective amplify the bio signals. One of the major problems in the electrophysiological signals monitoring is the impossibility to record the weak signals from deep organs that are covered by noise or by strong cardiac or muscular signals. An automatic gain control block is used, so that the high power skin signals are less amplified than the low components. The analog processing block is based on a dynamic range compressor, containing the automatic gain control block. The following block is a clipper since to capture all the transitions that escape from the dynamic range compressor. At clipper output a low-pass filter is connected since to abruptly cut the high frequencies, like 50Hz, ECG. The data vector recording is performing by strong internal resources micro controller including ten bits A/D conversion port.","PeriodicalId":167131,"journal":{"name":"2011 First International Conference on Data Compression, Communications and Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114969853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider a compact text index based on evenly spaced sparse suffix trees of a text [9]. Such a tree is defined by partitioning the text into blocks of equal size and constructing the suffix tree only for those suffixes that start at block boundaries. We propose a new pattern matching algorithm on this structure. The algorithm is based on a notion of suffix links different from that of [9] and on the packing of several letters into one computer word.
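In an evenly spaced sparse suffix index only the suffixes starting at multiples of the block size k are indexed, so an occurrence at an arbitrary position is found by trying each of the k possible offsets of the match relative to the next block boundary and verifying the short prefix that precedes the boundary. The sketch below uses a plain sorted array of block-boundary suffixes in place of the paper's sparse suffix tree, and omits its suffix links and letter packing; it assumes the pattern is at least as long as the block size:

```python
# Sketch of pattern matching over an evenly spaced sparse suffix index.
import bisect

def build_sparse_index(text, k):
    """Block-boundary positions (multiples of k), sorted by their suffixes."""
    boundaries = list(range(0, len(text), k))
    boundaries.sort(key=lambda p: text[p:])
    return boundaries

def find(text, pattern, index, k):
    """All occurrences of pattern (assumed len(pattern) >= k) in text."""
    suffixes = [text[p:] for p in index]      # fine for a sketch, not at scale
    hits = set()
    for r in range(k):                        # distance from match start to next boundary
        head, tail = pattern[:r], pattern[r:]
        lo = bisect.bisect_left(suffixes, tail)
        while lo < len(suffixes) and suffixes[lo].startswith(tail):
            b = index[lo]                     # boundary where pattern[r:] matches
            i = b - r
            if i >= 0 and text[i:b] == head:  # verify the part before the boundary
                hits.add(i)
            lo += 1
    return sorted(hits)

text = "abracadabraabracadabra"
k = 4
idx = build_sparse_index(text, k)
print(find(text, "abracada", idx, k))         # [0, 11]
```

Since only n/k suffixes are indexed, the index is a factor of about k smaller than a full suffix tree, at the cost of the extra offset loop during search.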
{"title":"Pattern Matching on Sparse Suffix Trees","authors":"R. Kolpakov, G. Kucherov, Tatiana Starikovskaya","doi":"10.1109/CCP.2011.45","DOIUrl":"https://doi.org/10.1109/CCP.2011.45","url":null,"abstract":"We consider a compact text index based on evenly spaced sparse suffix trees of a text [9]. Such a tree is defined by partitioning the text into blocks of equal size and constructing the suffix tree only for those suffixes that start at block boundaries. We propose a new pattern matching algorithm on this structure. The algorithm is based on a notion of suffix links different from that of [9] and on the packing of several letters into one computer word.","PeriodicalId":167131,"journal":{"name":"2011 First International Conference on Data Compression, Communications and Processing","volume":"408 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116035739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An axiomatic approach to the notion of similarity of sequences, which seems natural in many cases (e.g. phylogenetic analysis), is proposed. Even though the sequences are not assumed to be a realization of a probabilistic process (e.g. a variable-order Markov process), it is demonstrated that any classifier that fully complies with the proposed similarity axioms must be based on modeling the training data contained in a (long) individual training sequence via a suffix tree with no more than O(N) leaves (or, alternatively, a table with O(N) entries), where N is the length of the test sequence. Some common classification algorithms may be slightly modified to comply with the proposed axiomatic conditions and the resulting organization of the training data, thus yielding a formal justification for their good empirical performance without relying on any a priori (and sometimes unjustified) probabilistic assumption. One such case is discussed in detail.
{"title":"An Axiomatic Approach to the Notion of Similarity of Individual Sequences and Their Classification","authors":"J. Ziv","doi":"10.1109/CCP.2011.29","DOIUrl":"https://doi.org/10.1109/CCP.2011.29","url":null,"abstract":"An axiomatic approach to the notion of similarity of sequences, that seems to be natural in many cases (e.g. Phylogenetic analysis), is proposed. Despite of the fact that it is not assume that the sequences are a realization of a probabilistic process (e.g. a variable-order Markov process), it is demonstrated that any classifier that fully complies with the proposed similarity axioms must be based on modeling of the training data that is contained in a (long) individual training sequence via a suffix tree with no more than O(N) leaves (or, alternatively, a table with O(N) entries) where N is the length of the test sequence. Some common classification algorithms may be slightly modified to comply with the proposed axiomatic conditions and the resulting organization of the training data, thus yielding a formal justification for their good empirical performance without relying on any a-priori (sometimes unjustified)probabilistic assumption. One such case is discussed in details.","PeriodicalId":167131,"journal":{"name":"2011 First International Conference on Data Compression, Communications and Processing","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123550957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}