ClustDB: A High-Performance Tool for Large Scale Sequence Matching
J. Kleffe, Friedrich Möller, B. Wittig
17th International Workshop on Database and Expert Systems Applications (DEXA'06), 2006-09-04
DOI: 10.1109/DEXA.2006.40 (https://doi.org/10.1109/DEXA.2006.40)
High-throughput sampling of expressed sequence tags (ESTs) has generated huge collections of transcripts that are difficult to compare with each other using existing tools for sequence matching. The major problem is lack of computer memory. We therefore present a new exact and memory-efficient algorithm for the simultaneous identification of matching substrings in large sets of sequences. Applied to more than six million human ESTs in GenBank as of 2005-04-06, comprising more than 3.3 billion base pairs, it finds all of the more than seven million clusters of multiple substrings with a minimum length of, say, 50 nucleotides in less than four hours on a standard PC with 2 GB of RAM and a 2.8 GHz processor. The corresponding program, ClustDB, is able to handle at least eight times more data than VMATCH, the most memory-efficient exact software known today. Our program is freely available for academic use.
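To make the task concrete, the following is a minimal sketch of what clustering sequences by shared exact substrings of a minimum length can look like. It is not the ClustDB algorithm, which the abstract does not describe: it is a naive in-memory hash index, and the function name `shared_kmer_clusters`, the toy sequences, and the choice `k=6` are all hypothetical. An index of this kind stores every length-k substring, which illustrates why memory becomes the limiting factor on billions of base pairs.

```python
from collections import defaultdict


def shared_kmer_clusters(sequences, k):
    """Group sequence ids by the exact length-k substrings they share.

    Naive illustration only (an assumption, not the published method):
    every length-k substring is stored in a dictionary, so memory use
    grows with the total input size.
    """
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        # Slide a window of width k over the sequence.
        for start in range(len(seq) - k + 1):
            index[seq[start:start + k]].add(seq_id)
    # Keep only substrings that occur in at least two different sequences.
    return {kmer: sorted(ids) for kmer, ids in index.items() if len(ids) > 1}


# Hypothetical toy data standing in for ESTs.
toy_ests = {
    "est1": "ACGTACGTTT",
    "est2": "GGACGTACGA",
    "est3": "TTTTTTTTTT",
}

clusters = shared_kmer_clusters(toy_ests, k=6)
for kmer, ids in clusters.items():
    print(kmer, ids)
```

In this toy input, `est1` and `est2` share the 6-mers `ACGTAC` and `CGTACG`, while `est3` shares nothing with the others. Repeats within a single sequence are deliberately not reported as clusters in this sketch.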