With an ever-increasing amount of research data available, data literacy skills are becoming increasingly important for benefiting from this valuable resource. An integrative course was developed to teach students the fundamentals of data literacy through an engaging genome sequencing project. Each cohort of students planned the experiment, extracted DNA, performed nanopore sequencing, assembled the genome sequence, predicted genes in the assembled sequence, and assigned functional annotation terms to the predicted genes. Students learned how to communicate science by writing a protocol in the form of a scientific paper, providing comments during a peer-review process, and presenting their findings as part of an international symposium. Many students enjoyed the opportunity to own a project and to work towards a meaningful objective.
Title: Data literacy in genome research. Authors: Katharina Wolff, Ronja Friedhoff, Friderieke Schwarzer, Boas Pucker. Journal of Integrative Bioinformatics. Pub Date: 2023-12-05; DOI: 10.1515/jib-2023-0033. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10777367/pdf/
Pub Date: 2023-11-20; eCollection Date: 2023-09-01; DOI: 10.1515/jib-2023-0017
Yulia E Uvarova, Pavel S Demenkov, Irina N Kuzmicheva, Artur S Venzel, Elena L Mischenko, Timofey V Ivanisenko, Vadim M Efimov, Svetlana V Bannikova, Asya R Vasilieva, Vladimir A Ivanisenko, Sergey E Peltek
Bacillus strains are ubiquitous in the environment and are widely used in the microbiological industry as valuable enzyme sources, as well as in agriculture to stimulate plant growth. The Bacillus genus comprises several closely related groups of species, and their rapid classification remains challenging with existing methods. Techniques based on MALDI-TOF MS data analysis hold significant promise for fast and precise classification of microbial strains at both the genus and species levels. In previous work, we proposed a geometric approach to Bacillus strain classification based on mass spectra analysis via the centroid method (CM). One limitation of such methods is the noise in MS spectra. In this study, we used a denoising autoencoder (DAE) to improve bacteria classification accuracy under noisy MS spectra conditions. The DAE converts noisy MS spectra into latent variables representing molecular patterns in the original MS data, and the Random Forest (RF) method classifies bacterial strains from these latent variables. Comparison of DAE-RF with the CM method on artificially noisy test samples showed that DAE-RF offers higher noise robustness. Hence, the DAE-RF method could be used for noise-robust, fast, and accurate classification of Bacillus species from MALDI-TOF MS data.
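To make the noise-robustness comparison more concrete, the sketch below shows one way a centroid-style baseline could be evaluated on artificially noisy spectra. It is an illustration only: the abstract does not specify how the centroid method or the noise model is defined, so the nearest-centroid rule, the Gaussian noise, the four-bin spectra, and the labels `species_A` and `species_B` are all assumptions. The denoising autoencoder and Random Forest stages are not reproduced here.

```python
import math
import random

def centroid(vectors):
    """Component-wise mean of equal-length intensity vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest_centroid(spectrum, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(spectrum, centroids[label]))

def add_noise(spectrum, sigma, rng):
    """Add zero-mean Gaussian noise to every bin, clipping at zero intensity."""
    return [max(0.0, x + rng.gauss(0.0, sigma)) for x in spectrum]

# Hypothetical binned spectra (4 m/z bins) for two made-up species labels.
training = {
    "species_A": [[9.0, 1.0, 0.5, 0.0], [8.5, 1.5, 0.0, 0.5]],
    "species_B": [[0.5, 0.0, 7.0, 3.0], [0.0, 1.0, 8.0, 2.5]],
}
centroids = {label: centroid(vs) for label, vs in training.items()}

rng = random.Random(0)                 # fixed seed so the run is repeatable
clean = [8.8, 1.2, 0.3, 0.2]           # hypothetical test spectrum
noisy = add_noise(clean, sigma=0.5, rng=rng)
print(nearest_centroid(clean, centroids), nearest_centroid(noisy, centroids))
```

Repeating the last three lines over many test spectra and increasing `sigma` would give an accuracy-versus-noise curve, which is the kind of comparison the abstract describes between CM and DAE-RF.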
Title: Accurate noise-robust classification of Bacillus species from MALDI-TOF MS spectra using a denoising autoencoder. Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10757077/pdf/
Pub Date: 2023-11-20; eCollection Date: 2023-09-01; DOI: 10.1515/jib-2023-0013
Evgeniya A Antropova, Tamara M Khlebodarova, Pavel S Demenkov, Anastasiia R Volianskaia, Artur S Venzel, Nikita V Ivanisenko, Alexandr D Gavrilenko, Timofey V Ivanisenko, Anna V Adamovskaya, Polina M Revva, Nikolay A Kolchanov, Inna N Lavrik, Vladimir A Ivanisenko
Hepatocellular carcinoma (HCC) has been associated with hepatitis C viral (HCV) infection as a potential risk factor. Nonetheless, the precise genetic regulatory mechanisms triggered by the virus, leading to virus-induced hepatocarcinogenesis, remain unclear. We hypothesized that HCV proteins might modulate the activity of aberrantly methylated HCC genes through regulatory pathways. Virus-host regulatory pathways, including interactions between proteins and the regulation of gene expression, transport, and stability, were reconstructed using the ANDSystem. Gene expression regulation was statistically significant. Gene network analysis identified four out of 70 HCC marker genes whose expression regulation by viral proteins may be associated with HCC: DNA-binding protein inhibitor ID-1 (ID1), flap endonuclease 1 (FEN1), cyclin-dependent kinase inhibitor 2A (CDKN2A), and telomerase reverse transcriptase (TERT). The analysis suggested the following viral protein effects in HCV/human protein heterocomplexes: HCV NS3(p70) protein activates human STAT3 and NOTC1; NS2-3(p23), NS5B(p68), NS1(E2), and core(p21) activate SETD2; NS5A inhibits SMYD3; and NS3 inhibits CCN2. Interestingly, NS3 and E1(gp32) activate c-Jun when it positively regulates CDKN2A and inhibit it when it represses TERT. The discovered regulatory mechanisms might be key areas of focus for creating medications and preventative therapies to decrease the likelihood of HCC development during HCV infection.
Title: Reconstruction of the regulatory hypermethylation network controlling hepatocellular carcinoma development during hepatitis C viral infection. Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10757076/pdf/
Pub Date: 2023-11-16; eCollection Date: 2023-09-01; DOI: 10.1515/jib-2023-0032
Yuriy L Orlov, Ming Chen, Nikolay A Kolchanov, Ralf Hofestädt
Title: BGRS: bioinformatics of genome regulation and data integration. Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10757072/pdf/
Pub Date: 2023-09-21; eCollection Date: 2023-09-01; DOI: 10.1515/jib-2022-0029
Andrea Marino, Blerina Sinaimeri, Enrico Tronci, Tiziana Calamoneri
Many important aspects of biological knowledge at the molecular level can be represented by pathways. Through their analysis, we gain mechanistic insights and interpret lists of interesting genes from experiments (usually omics and functional genomic experiments). As a result, pathways play a central role in the development of bioinformatics methods and tools for computing predictions from known molecular-level mechanisms. Qualitative as well as quantitative knowledge about pathways can be effectively represented through biochemical networks linking the biochemical reactions and the compounds (e.g., proteins) occurring in the considered pathways. Repositories providing biochemical networks for known pathways therefore play a central role in bioinformatics and in systems biology. Here we focus on Reactome, a free, comprehensive, and widely used repository for biochemical networks and pathways. In this paper, we: (1) introduce a tool, StARGate-X (STatistical Analysis of the Reactome multi-GrAph Through nEtworkX), to carry out an automated analysis of the connectivity properties of the Reactome biochemical reaction network and of its biological hierarchy (i.e., cell compartments, namely, the closed parts within the cytosol, usually surrounded by a membrane); the code is freely available at https://github.com/marinoandrea/stargate-x; (2) show the effectiveness of our tool by providing a detailed automated analysis of the Reactome network in terms of centrality measures with respect to in- and out-degree. We focus both on the subgraphs induced by single compartments and on the graph whose nodes are the strongly connected components. To the best of our knowledge, this is the first freely available tool that enables automatic analysis of the large biochemical network within Reactome through easy-to-use APIs (Application Programming Interfaces).
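The two graph views mentioned in this abstract, degree-based centrality and the graph whose nodes are strongly connected components, can be illustrated on a toy directed network. The sketch below is not the StARGate-X API and does not use NetworkX; it relies only on the Python standard library, and the node names and edges are invented for illustration.

```python
from collections import defaultdict

def degrees(edges):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return dict(in_deg), dict(out_deg)

def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm, written iteratively to avoid deep recursion."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    def dfs(start, graph, seen, out):
        # Append nodes to `out` in order of DFS completion.
        stack = [(start, iter(graph[start]))]
        seen.add(start)
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                stack.pop()
                out.append(node)

    seen, order = set(), []
    for n in nodes:                      # pass 1: finishing order on the graph
        if n not in seen:
            dfs(n, fwd, seen, order)
    seen, components = set(), []
    for n in reversed(order):            # pass 2: trees on the reversed graph
        if n not in seen:
            comp = []
            dfs(n, rev, seen, comp)
            components.append(frozenset(comp))
    return components

def condensation(components, edges):
    """Edges between distinct components, identified by component index."""
    comp_of = {n: i for i, comp in enumerate(components) for n in comp}
    return {(comp_of[u], comp_of[v]) for u, v in edges if comp_of[u] != comp_of[v]}

# Hypothetical toy network: one 3-cycle feeding into one 2-cycle.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E"), ("E", "D")]
nodes = sorted({n for edge in edges for n in edge})
in_deg, out_deg = degrees(edges)
components = strongly_connected_components(nodes, edges)
print(in_deg, out_deg, components, condensation(components, edges))
```

The condensed graph is acyclic by construction, which is one reason it is a convenient coarse view of a large reaction network.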
Title: STARGATE-X: a Python package for statistical analysis on the REACTOME network. Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10757075/pdf/
Pub Date: 2023-08-25; eCollection Date: 2023-09-01; DOI: 10.1515/jib-2022-0046
John Anders, Peter F Stadler
The differentiation of regions with coding potential from non-coding regions remains a key task in computational biology. Methods such as RNAcode that exploit patterns of sequence conservation for this task have a substantial advantage in classification accuracy, in particular for short coding sequences, compared to methods that rely on a single input sequence. However, they require sequence alignments as input. Frequently, suitable multiple sequence alignments are not readily available and are tedious, and sometimes difficult, to construct. We therefore introduce a new web service that provides access to the well-known coding sequence detector RNAcode with minimal user overhead. It requires as input only a single target nucleotide sequence. The service automates the collection, selection, and preparation of homologous sequences from the NCBI database, as well as the construction of the multiple sequence alignments that are needed as input for RNAcode. It thus automates the entire pre- and postprocessing and makes the investigation of specific genomic regions for previously unannotated coding regions, such as small peptides or additional introns, a simple task that is easily accessible to non-expert users. RNAcode_Web is accessible online at rnacode.bioinf.uni-leipzig.de.
Title: RNAcode_Web - Convenient identification of evolutionary conserved protein coding regions. Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10757073/pdf/
Pub Date: 2023-08-21; eCollection Date: 2023-09-01; DOI: 10.1515/jib-2022-0059
Jaroslav Budiš, Werner Krampl, Marcel Kucharík, Rastislav Hekel, Adrián Goga, Jozef Sitarčík, Michal Lichvár, Dávid Smol'ak, Miroslav Böhmer, Andrej Baláž, František Ďuriš, Juraj Gazdarica, Katarína Šoltys, Ján Turňa, Ján Radvánszky, Tomáš Szemes
With the rapid growth of massively parallel sequencing technologies, ever more laboratories are utilising sequenced DNA fragments for genomic analyses. Interpretation of sequencing data is, however, strongly dependent on bioinformatics processing, which is often too demanding for clinicians and researchers without a computational background. Another problem is the reproducibility of computational analyses across separate computational centres with inconsistent versions of installed libraries and bioinformatics tools. We propose an easily extensible set of computational pipelines, called SnakeLines, for processing sequencing reads, including mapping, assembly, variant calling, viral identification, transcriptomics, and metagenomics analysis. Individual steps of an analysis, along with methods and their parameters, can be readily modified in a single configuration file. The provided pipelines are embedded in virtual environments that ensure isolation of required resources from the host operating system, rapid deployment, and reproducibility of analysis across different Unix-based platforms. SnakeLines is a powerful framework for the automation of bioinformatics analyses, with emphasis on simple set-up, modification, extensibility, and reproducibility. The framework is already routinely used in various research projects and their applications, especially in the Slovak national surveillance of SARS-CoV-2.
Title: SnakeLines: integrated set of computational pipelines for sequencing reads. Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10757078/pdf/
Pub Date: 2023-07-25; eCollection Date: 2023-06-01; DOI: 10.1515/jib-2022-0052
Carlos A C Bastos, Vera Afreixo, João M O S Rodrigues, Armando J Pinho
This work aims to describe the observed enrichment of inverted repeats in the human genome, and to identify and describe, with detailed length profiles, the regions with significant and relevant enrichment of inverted repeats. The enrichment is assessed and tested with a recently proposed z-score-based measure. We simulate a genome using an order-7 Markov model trained on data from the real genome. The simulated genome is used to establish the critical values that serve as decision thresholds for identifying regions with significantly enriched concentrations. Several human genome regions are highly enriched in inverted repeats, and this is observed in all human chromosomes. The distribution of inverted repeat lengths varies along the genome. The majority of the regions with severely exaggerated enrichment contain mainly short inverted repeats. There are also regions with regular peaks along the inverted repeat length distribution (periodic regularities) and, less frequently, regions with exaggerated enrichment for long lengths. However, adjacent regions tend to have similar distributions.
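The simulation idea in this abstract can be sketched in a few lines: estimate order-k transition counts from a sequence, generate a surrogate sequence from them, and compare inverted-repeat counts between the two. The sketch below is a simplified illustration, not the authors' procedure. The z-score measure is not reproduced, the "real" sequence is a random stand-in, the demo uses order 2 because a toy sequence is far too short to train an order-7 model, and the word length `k` and `max_gap` are arbitrary assumptions.

```python
import random
from collections import Counter, defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Reverse complement of a DNA word over the alphabet ACGT."""
    return word.translate(COMPLEMENT)[::-1]

def train_markov(seq, order):
    """Count which nucleotide follows each context of length `order`."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - order):
        counts[seq[i:i + order]][seq[i + order]] += 1
    return counts

def simulate(counts, order, length, seed_context, rng):
    """Generate a sequence whose transitions follow the trained counts."""
    out = list(seed_context)
    while len(out) < length:
        nxt = counts.get("".join(out[-order:]))
        if not nxt:                        # unseen context: fall back to uniform
            out.append(rng.choice("ACGT"))
            continue
        symbols, weights = zip(*nxt.items())
        out.append(rng.choices(symbols, weights=weights)[0])
    return "".join(out)

def count_inverted_repeats(seq, k, max_gap):
    """Count k-mers followed, at most `max_gap` bases later, by their reverse complement."""
    hits = 0
    for i in range(len(seq) - k + 1):
        target = reverse_complement(seq[i:i + k])
        last_start = min(i + k + max_gap, len(seq) - k)
        for j in range(i + k, last_start + 1):
            if seq[j:j + k] == target:
                hits += 1
    return hits

rng = random.Random(1)
real = "".join(rng.choice("ACGT") for _ in range(2000))   # stand-in for real DNA
order = 2                                                 # the abstract uses order 7
model = train_markov(real, order)
simulated = simulate(model, order, len(real), real[:order], rng)
print(count_inverted_repeats(real, 6, 20), count_inverted_repeats(simulated, 6, 20))
```

In the setting the abstract describes, counts from the simulated genome would supply the critical values against which windows of the real genome are judged; here the two printed numbers simply show the comparison being made.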
Title: Concentration of inverted repeats along human DNA. Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10561070/pdf/
Pub Date: 2023-07-25; eCollection Date: 2023-12-01; DOI: 10.1515/jib-2023-0012
Haoyu Chao, Shilong Zhang, Yueming Hu, Qingyang Ni, Saige Xin, Liang Zhao, Vladimir A Ivanisenko, Yuriy L Orlov, Ming Chen
Crop plant breeding involves selecting and developing new plant varieties with desirable traits such as increased yield, improved disease resistance, and enhanced nutritional value. With the development of high-throughput technologies, such as genomics, transcriptomics, and metabolomics, crop breeding has entered a new era. However, to effectively use these technologies, integration of multi-omics data from different databases is required. Integration of omics data provides a comprehensive understanding of the biological processes underlying plant traits and their interactions. This review highlights the importance of integrating omics databases in crop plant breeding, discusses available omics data and databases, describes integration challenges, and highlights recent developments and potential benefits. Taken together, the integration of omics databases is a critical step towards enhancing crop plant breeding and improving global food security.
Title: Integrating omics databases for enhanced crop breeding. Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10777369/pdf/
Transmembrane transport proteins (transporters) play a crucial role in the fundamental cellular processes of all organisms by facilitating the transport of hydrophilic substrates across hydrophobic membranes. Despite the availability of numerous membrane protein sequences, their structures and functions remain largely elusive. Recently, natural language processing (NLP) techniques have shown promise in the analysis of protein sequences. Bidirectional Encoder Representations from Transformers (BERT) is an NLP technique adapted for proteins to learn contextual embeddings of individual amino acids within a protein sequence. Our previous strategy, TooT-BERT-T, differentiated transporters from non-transporters by employing a logistic regression classifier with fine-tuned representations from ProtBERT-BFD. In this study, we expand upon this approach by utilizing representations from ProtBERT, ProtBERT-BFD, and MembraneBERT in combination with classical classifiers. Additionally, we introduce TooT-BERT-CNN-T, a novel method that fine-tunes ProtBERT-BFD and discriminates transporters using a Convolutional Neural Network (CNN). Our experimental results reveal that CNN surpasses traditional classifiers in discriminating transporters from non-transporters, achieving an MCC of 0.89 and an accuracy of 95.1 % on the independent test set. This represents an improvement of 0.03 and 1.11 percentage points compared to TooT-BERT-T, respectively.
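The two figures quoted above, accuracy and the Matthews correlation coefficient (MCC), are both computed from the confusion matrix of a binary classifier. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The function names and the toy label vectors are hypothetical and chosen only for illustration; they are not data from the study.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = transporter."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when any marginal sum is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy labels (not results from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("MCC:", mcc(*counts))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when transporters and non-transporters are unevenly represented in a test set.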
{"title":"Enhanced identification of membrane transport proteins: a hybrid approach combining ProtBERT-BFD and convolutional neural networks.","authors":"Hamed Ghazikhani, Gregory Butler","doi":"10.1515/jib-2022-0055","DOIUrl":"https://doi.org/10.1515/jib-2022-0055","url":null,"abstract":"<p><p>Transmembrane transport proteins (transporters) play a crucial role in the fundamental cellular processes of all organisms by facilitating the transport of hydrophilic substrates across hydrophobic membranes. Despite the availability of numerous membrane protein sequences, their structures and functions remain largely elusive. Recently, natural language processing (NLP) techniques have shown promise in the analysis of protein sequences. Bidirectional Encoder Representations from Transformers (BERT) is an NLP technique adapted for proteins to learn contextual embeddings of individual amino acids within a protein sequence. Our previous strategy, TooT-BERT-T, differentiated transporters from non-transporters by employing a logistic regression classifier with fine-tuned representations from ProtBERT-BFD. In this study, we expand upon this approach by utilizing representations from ProtBERT, ProtBERT-BFD, and MembraneBERT in combination with classical classifiers. Additionally, we introduce TooT-BERT-CNN-T, a novel method that fine-tunes ProtBERT-BFD and discriminates transporters using a Convolutional Neural Network (CNN). Our experimental results reveal that CNN surpasses traditional classifiers in discriminating transporters from non-transporters, achieving an MCC of 0.89 and an accuracy of 95.1 % on the independent test set. This represents an improvement of 0.03 and 1.11 percentage points compared to TooT-BERT-T, respectively.</p>","PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":null,"pages":null},"PeriodicalIF":1.9,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10389051/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9925128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}