The value of compression for taxonomic identification

Jorge Miguel Silva, João Rafael Almeida
2022 IEEE 35th International Symposium on Computer-Based Medical Systems (CBMS), published 2022-07-01
DOI: 10.1109/CBMS55023.2022.00055
Advances in DNA sequencing technologies have led to unprecedented growth in sequenced data. However, when sequencing de novo genomes, one of the biggest challenges is classifying DNA sequences that do not match any biological sequence from the literature. Reference-free methods supported by compressors are one strategy for the taxonomic identification of these organisms. However, given the large number of compressors available and the computational resources required to run them, it is difficult to select the best compressors for classification when computational resources are limited. In this paper, we present a two-step pipeline that analyzes nine compressors to understand which ones are the best candidates for taxonomic identification. We use 500 randomly selected sequences from five taxonomic groups to conduct this analysis. The results show that, besides being an excellent representative feature, the Normalized Compression (NC) reflects different aspects of the nature and complexity of a given sequence, depending on the compressor. Furthermore, we show that neither the compression capability of a compressor nor the compressibility of the file correlates with classification accuracy. The code used in this work is publicly available at https://github.com/bioinformatics-ua/COMPACT.
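
To make the notion of NC as a per-compressor feature more concrete, the following is a minimal sketch, not the COMPACT implementation. It assumes the commonly used definition NC(x) = C(x) / (|x| log2 |A|), where C(x) is the compressed size in bits, |x| the sequence length, and |A| the alphabet size (4 for DNA); the abstract itself does not spell this formula out. The three Python standard-library compressors are stand-ins chosen for self-containment, not the nine compressors analyzed in the paper, and the function names are hypothetical.

```python
import bz2
import lzma
import math
import random
import zlib

# Stand-in general-purpose compressors from the Python standard library.
# Assumption: these are NOT the nine compressors evaluated in the paper.
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def normalized_compression(sequence, compress, alphabet_size=4):
    """Return C(x) / (|x| * log2(|A|)), with C(x) measured in bits.

    Assumed definition of NC; container headers of the compressor
    are included in C(x), which inflates NC for very short inputs.
    """
    if not sequence:
        raise ValueError("sequence must be non-empty")
    compressed_bits = 8 * len(compress(sequence.encode("ascii")))
    return compressed_bits / (len(sequence) * math.log2(alphabet_size))


def nc_features(sequence):
    """One NC value per compressor, usable as a feature vector."""
    return {
        name: normalized_compression(sequence, compress)
        for name, compress in COMPRESSORS.items()
    }


if __name__ == "__main__":
    rng = random.Random(0)
    repetitive = "ACGT" * 2500  # highly redundant synthetic sequence
    shuffled = "".join(rng.choice("ACGT") for _ in range(10_000))
    print("repetitive:", nc_features(repetitive))
    print("random:    ", nc_features(shuffled))
```

Under these assumptions, a highly repetitive sequence yields NC values close to 0, while a pseudo-random sequence yields values near or above 1, and the exact values differ from one compressor to another. A vector of such values could then be passed to any classifier as input features.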