USING HORIZONTALLY SCALE INFRASTRUCTURE IN SEARCHING FOR SIMILARITY IN GENOME DATA OF ECOSYSTEMS
A. Tskhai, S. Murzintsev
2018 11th International Multiconference Bioinformatics of Genome Regulation and Structure\Systems Biology (BGRS\SB), 2018-08-01
DOI: 10.1109/CSGB.2018.8544736 (https://doi.org/10.1109/CSGB.2018.8544736)
Citations: 0
Abstract
A special-purpose computer system has been developed for processing reference information (for example, from ENSEMBL, GenBank, and KEGG), namely for rapidly comparing the genomes of organisms in order to discover recurring sets of nucleotides. Because processing the source information produces a large amount of data, the system uses a non-relational database, which is more flexible and scalable. The approach is based on the distributed non-relational database MongoDB and the Winnowing data-processing algorithm. To identify genetic similarity with a non-relational database, it is proposed to represent the fingerprints of structural genomic variations as key-value pairs. The developed model was implemented in software. Computing experiments were performed: (1) loading data into a database using one and three shards (servers where the data is stored and where information is searched and processed); (2) searching for matches between genomes and a database of genomes using one and three shards; (3) measuring the speed of searching for genomes in the database; (4) measuring the rate of loading genomes into the database. The experiments confirmed that the proposed method of searching for genetic similarity is feasible, for example for analyzing deviations at the gene level. The work can be continued in the following directions: (1) determining the moment at which a node must be added to the cluster as the number of deviations considered and the number of genomes in the database of organisms grow; (2) studying genomic disorders to assess the probability of genetic abnormalities at the stage of recognizing a potentially unfavorable development of the situation.
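The combination of Winnowing fingerprints and key-value storage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the k-gram length `k`, the window size `w`, the choice of BLAKE2b as the hash, and the function names `winnow`, `build_index`, and `shared_fingerprints` are all assumptions made for illustration, and a plain in-memory dictionary stands in for the MongoDB collection.

```python
import hashlib
from collections import defaultdict


def kgram_hashes(seq, k):
    """Hash every k-gram of seq; returns a list of (hash, position)."""
    return [
        (int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(),
            "big"), i)
        for i in range(len(seq) - k + 1)
    ]


def winnow(seq, k=5, w=4):
    """Select (hash, position) fingerprints by winnowing.

    k and w are illustrative values, not taken from the paper.
    """
    hashes = kgram_hashes(seq, k)
    if not hashes:
        return set()
    fingerprints = set()
    # Slide a window of w consecutive k-gram hashes over the sequence.
    # A sequence with fewer than w k-grams is treated as a single window.
    for start in range(max(len(hashes) - w + 1, 1)):
        window = hashes[start:start + w]
        # Keep the minimum hash; on ties take the rightmost occurrence.
        fingerprints.add(min(window, key=lambda hp: (hp[0], -hp[1])))
    return fingerprints


def build_index(genomes, k=5, w=4):
    """Key-value index: fingerprint hash -> list of (genome_id, position).

    A dict stands in for the database collection (assumption).
    """
    index = defaultdict(list)
    for genome_id, seq in genomes.items():
        for h, pos in sorted(winnow(seq, k, w), key=lambda hp: hp[1]):
            index[h].append((genome_id, pos))
    return index


def shared_fingerprints(index, query, k=5, w=4):
    """Count, per stored genome, the distinct query fingerprints it shares."""
    counts = defaultdict(int)
    for h in {h for h, _ in winnow(query, k, w)}:
        for genome_id in {g for g, _ in index.get(h, [])}:
            counts[genome_id] += 1
    return dict(counts)
```

Winnowing guarantees that any substring of at least `w + k - 1` characters shared by two sequences produces at least one common fingerprint, so a lookup by fingerprint hash is enough to find candidate matches. In a sharded deployment, the fingerprint hash would be a natural candidate for the shard key, although the abstract does not say which key the authors used.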