Pub Date: 2024-03-27; eCollection Date: 2024-06-01; DOI: 10.1515/jib-2023-0046
Hugo López-Fernández, Miguel Pinto, Cristina P Vieira, Pedro Duque, Miguel Reboiro-Jato, Jorge Vieira
The vast amount of genome sequence data that is available, and that is predicted to increase drastically in the near future, can only be dealt with efficiently by building automated pipelines. Indeed, the Earth Biogenome Project will produce high-quality reference genome sequences for all 1.8 million named living eukaryote species, providing unprecedented insight into the evolution of genes and gene families, and thus into biological issues. Here, new modules for gene annotation, further BLAST search algorithms, further multiple sequence alignment methods, the addition of reference sequences, further tree rooting methods, the estimation of rates of synonymous and nonsynonymous substitutions, and the identification of positively selected amino acid sites have been added to auto-phylo (version 2), recently developed software for addressing biological problems using phylogenetic inferences. Additionally, we present auto-phylo-pipeliner, a graphical user interface application that further facilitates the creation and running of auto-phylo pipelines. Inferences on S-RNase specificity are critical both for cross-based breeding and for the establishment of pollination requirements. Therefore, as a test case, we develop an auto-phylo pipeline to identify amino acid sites under positive selection, which are, in principle, those determining S-RNase specificity, starting from both non-annotated Prunus genomes and sequences available in public databases.
Title: Auto-phylo v2 and auto-phylo-pipeliner: building advanced, flexible, and reusable pipelines for phylogenetic inferences, estimation of variability levels and identification of positively selected amino acid sites. Journal: Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11378518/pdf/
Pub Date: 2024-02-06; eCollection Date: 2024-03-01; DOI: 10.1515/jib-2023-0048
Sahar Aghakhani, Anna Niarakis, Sylvain Soliman
Molecular interaction maps (MIMs) are static graphical representations depicting complex biochemical networks that can be formalized using one of the Systems Biology Graphical Notation languages. Despite their extensive coverage of various biological processes, they offer limited dynamic insight. However, MIMs can serve as templates for developing dynamic computational models. We present MetaLo, an open-source Python package that enables the coupling of Boolean models inferred from process description MIMs with generic core metabolic networks. MetaLo provides a framework to study the impact of signaling cascades, gene regulation processes, and metabolic flux distribution of central energy production pathways. MetaLo computes the Boolean model's asynchronous asymptotic behavior through the identification of trap-spaces, and extracts metabolic constraints to contextualize the generic metabolic network. MetaLo is able to handle large-scale Boolean models and genome-scale metabolic models without requiring kinetic information or manual tuning. The framework behind MetaLo enables in-depth analysis of the regulatory model and may help compensate for the lack of omics data needed to contextualize generic metabolic networks in poorly addressed biological fields, as well as for improper automatic reconstructions of cell- and/or disease-specific metabolic networks. MetaLo is available at https://pypi.org/project/metalo/ under the terms of the GNU General Public License v3.
Title: MetaLo: metabolic analysis of Logical models extracted from molecular interaction maps. Journal: Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11293895/pdf/
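The trap-space notion mentioned in the MetaLo abstract can be illustrated on a toy model. The sketch below is not MetaLo's code or API; it is a brute-force illustration, assuming a hypothetical three-node Boolean model (`RULES`) invented for this example. A trap space is a subspace (some nodes fixed, the rest free) that no update can leave, and the minimal ones describe the asynchronous asymptotic behavior.

```python
from itertools import product

# Hypothetical three-node Boolean model (an assumption, not taken from MetaLo):
# each rule maps a full state (dict of node -> 0/1) to that node's next value.
RULES = {
    "A": lambda s: s["A"],             # A keeps its current value
    "B": lambda s: s["A"] and s["C"],
    "C": lambda s: s["A"] or s["B"],
}
NODES = sorted(RULES)


def states_of(subspace):
    """Yield every full state compatible with a partial assignment."""
    free = [n for n in NODES if n not in subspace]
    for values in product((0, 1), repeat=len(free)):
        yield {**subspace, **dict(zip(free, values))}


def is_trap_space(subspace):
    """A subspace is a trap space if no update can change one of its fixed nodes."""
    return all(
        int(bool(RULES[node](state))) == value
        for state in states_of(subspace)
        for node, value in subspace.items()
    )


def contains(big, small):
    """True if subspace `big` strictly contains subspace `small`."""
    return big != small and all(small.get(n) == v for n, v in big.items())


def minimal_trap_spaces():
    """Exhaustive search over all subspaces; only feasible for tiny models."""
    traps = []
    for values in product((0, 1, None), repeat=len(NODES)):
        subspace = {n: v for n, v in zip(NODES, values) if v is not None}
        if is_trap_space(subspace):
            traps.append(subspace)
    # Keep only trap spaces that do not strictly contain another trap space.
    return [t for t in traps if not any(contains(t, other) for other in traps)]


if __name__ == "__main__":
    print(minimal_trap_spaces())
```

The exhaustive enumeration grows as 3^n with the number of nodes, so it only serves to make the definition concrete; large-scale models of the kind the abstract describes require dedicated solvers.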
Pub Date: 2023-12-28; eCollection Date: 2023-12-01; DOI: 10.1515/jib-2023-0026
Lena Raupach, Cassandra Königs
The first approaches in recent years to integrating pharmacogenomic plausibility checks into clinical practice show both a promising improvement in drug therapy safety and difficulties in application. One of the difficulties is the meaningful interpretation of the text-based results by the medical practitioner. To avoid misunderstandings and to include evidence-based pharmacogenomic recommendations in prescriptions, we propose graph-based visualization of the reports as an appropriate and sensible solution. This allows for a plausible interpretation and relates complex, even contradictory, guidelines. The improved overview of the pharmacogenomics (PGx) guidelines provided by the graphical visualization makes the medical practitioner's choice of dose and medication more patient-specific, improves the treatment outcome, and thus increases drug therapy safety.
Title: PharmoCo: a graph-based visualization of pharmacogenomic plausibility check reports for clinical decision support systems. Journal: Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10777363/pdf/
Pub Date: 2023-12-22; eCollection Date: 2023-12-01; DOI: 10.1515/jib-2023-0011
Andreas Ian Lackner, Jürgen Pollheimer, Paulina Latos, Martin Knöfler, Sandra Haider
During early pregnancy, extravillous trophoblasts (EVTs) play a crucial role in modifying the maternal uterine environment. Failures in EVT lineage formation and differentiation can lead to pregnancy complications such as preeclampsia, fetal growth restriction, and pregnancy loss. Despite recent advances, our knowledge of the molecular and external factors that control and affect EVT development remains incomplete. Using trophoblast organoid in vitro models, we recently discovered that coordinated manipulation of transforming growth factor beta (TGFβ) signaling is essential for EVT development. To further investigate gene networks involved in EVT function and development, we performed weighted gene co-expression network analysis (WGCNA) on our RNA-Seq data. We identified 10 modules with a median module membership of over 0.8 and sizes ranging from 1005 (M1) to 72 (M27) network genes associated with TGFβ activation status or in vitro culturing, the latter being indicative of yet undiscovered factors that shape the EVT phenotypes. Lastly, we hypothesized that certain therapeutic drugs might unintentionally interfere with placentation by affecting EVT-specific gene expression. We used the STRING database to map correlations and the Drug-Gene Interaction database to identify drug targets. Our comprehensive dataset of drug-gene interactions provides insights into potential risks associated with certain drugs in early gestation.
Title: Gene-network based analysis of human placental trophoblast subtypes identifies critical genes as potential targets of therapeutic drugs. Journal: Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10777358/pdf/
Pub Date: 2023-12-15; eCollection Date: 2023-12-01; DOI: 10.1515/jib-2023-0002
Marco Zurdo-Tabernero, Ángel Canal-Alonso, Fernando de la Prieta, Sara Rodríguez, Javier Prieto, Juan Manuel Corchado
Epilepsy is a neurological disorder (the third most common, following stroke and migraines). A key aspect of its diagnosis is the presence of seizures that occur without a known cause and the potential for new seizures to occur. Machine learning has shown potential as a cost-effective alternative for rapid diagnosis. In this study, we review the current state of machine learning in the detection and prediction of epileptic seizures. The objective of this study is to portray the existing machine learning methods for seizure prediction. Internet bibliographical searches were conducted to identify relevant literature on the topic. Through cross-referencing from key articles, additional references were obtained to provide a comprehensive overview of the techniques. As the aim of this paper is not a pure bibliographical review of the subject, the publications cited here have been selected from among many others based on their number of citations. To implement accurate diagnostic and treatment tools, it is necessary to achieve a balance between prediction time, sensitivity, and specificity. This balance can be achieved using deep learning algorithms. The best performance and results are often achieved by combining multiple techniques and features, but this approach can also increase computational requirements.
Title: An overview of machine learning and deep learning techniques for predicting epileptic seizures. Journal: Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10777364/pdf/
Pub Date: 2023-12-14; eCollection Date: 2023-12-01; DOI: 10.1515/jib-2023-0006
Jing Chen, Zixiang Wang, Jia Huang
Proteins are important components of biological structures and encode a great deal of biological information. Protein-protein interaction (PPI) network alignment is a model for analyzing proteins that helps discover conserved functions between organisms and predict unknown functions. In particular, multi-network alignment aims at finding the mapping relationship among the nodes of multiple networks, so as to transfer knowledge across species. However, with the increasing complexity of PPI networks, performing network alignment more accurately and efficiently has become a new challenge. This paper proposes a new global network alignment algorithm called Simulated Annealing Multiple Network Alignment (SAMNA), which uses both network topology and sequence homology information. To generate the alignment, SAMNA first generates cross-network candidate clusters by applying a clustering algorithm to a k-partite similarity graph constructed with sequence similarity information, and then selects candidate cluster nodes as alignment results and optimizes them using an improved simulated annealing algorithm. Finally, the SAMNA algorithm was evaluated on synthetic and real-world network datasets, and the results showed that SAMNA outperformed the state-of-the-art algorithm in biological performance.
Title: SAMNA: accurate alignment of multiple biological networks based on simulated annealing. Journal: Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10777366/pdf/
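The second stage described in the SAMNA abstract, choosing nodes from candidate clusters and refining the choice by simulated annealing, can be sketched generically. This is not the SAMNA implementation: it uses textbook simulated annealing rather than the paper's improved variant, aligns only two toy networks, scores by conserved edges alone, and does not enforce a one-to-one mapping. All node names, candidate lists, and parameter values are hypothetical.

```python
import math
import random


def conserved_edges(mapping, edges_a, edges_b):
    """Count edges of network A whose mapped endpoints are linked in network B."""
    return sum(
        1 for u, v in edges_a
        if frozenset((mapping[u], mapping[v])) in edges_b
    )


def anneal(candidates, edges_a, edges_b, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Textbook simulated annealing over one-candidate-per-node selections."""
    rng = random.Random(seed)
    current = {node: options[0] for node, options in candidates.items()}
    score = conserved_edges(current, edges_a, edges_b)
    best, best_score = dict(current), score
    nodes = list(candidates)
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        node = rng.choice(nodes)
        proposal = dict(current)
        proposal[node] = rng.choice(candidates[node])
        new_score = conserved_edges(proposal, edges_a, edges_b)
        delta = new_score - score
        # Accept non-worsening moves; accept worse ones with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, score = proposal, new_score
            if score > best_score:
                best, best_score = dict(current), score
    return best, best_score


# Toy data (hypothetical): a four-node path in each network, with candidate
# partners for each node of A standing in for sequence-similarity clusters.
edges_a = [("a1", "a2"), ("a2", "a3"), ("a3", "a4")]
edges_b = {frozenset(e) for e in [("b1", "b2"), ("b2", "b3"), ("b3", "b4")]}
candidates = {
    "a1": ["b4", "b1"],
    "a2": ["b3", "b2"],
    "a3": ["b1", "b3"],
    "a4": ["b2", "b4"],
}

if __name__ == "__main__":
    best, best_score = anneal(candidates, edges_a, edges_b)
    print(best, best_score)
```

Restricting each move to the candidate list is what keeps the search space small: sequence similarity prunes the options, and the annealing step only has to resolve which of the remaining choices preserves the most topology.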
Pub Date: 2023-12-12; eCollection Date: 2023-09-01; DOI: 10.1515/jib-2023-0014
Grigory A Oborotov, Konstantin A Koshechkin, Yuriy L Orlov
Applications of Artificial Intelligence in medical informatics solutions for risk sharing have social value. At a time of ever-increasing costs for the provision of medicines to citizens, there is a need to restrain the growth of health care costs. The search for computer technologies to stop or slow down the growth of costs takes on new and significant importance. We discuss two information technologies in pharmacotherapy and the possibility of combining and sharing them, namely the combination of risk-sharing agreements and Machine Learning, which was made possible by the development of Artificial Intelligence (AI). Neural networks could be used to predict the outcome in order to reduce the risk factors for treatment. AI-based data processing automation technologies could also be used to automate risk-sharing agreements.
Title: Application of Artificial Intelligence or machine learning in risk sharing agreements for pharmacotherapy risk management. Journal: Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10757074/pdf/
Pub Date: 2023-12-05; eCollection Date: 2024-03-01; DOI: 10.1515/jib-2023-0021
Avery Mecham, Ashlie Stephenson, Badi I Quinteros, Grace S Brown, Stephen R Piccolo
TidyGEO is a Web-based tool for downloading, tidying, and reformatting data series from Gene Expression Omnibus (GEO). As a freely accessible repository with data from over 6 million biological samples across more than 4000 organisms, GEO provides diverse opportunities for secondary research. Although scientists may find assay data relevant to a given research question, most analyses require sample-level annotations. In GEO, such annotations are stored alongside assay data in delimited, text-based files. However, the structure and semantics of the annotations vary widely from one series to another, and many annotations are not useful for analysis purposes. Thus, every GEO series must be tidied before it is analyzed. Manual approaches may be used, but these are error-prone and take time away from other research tasks. Custom computer scripts can be written, but many scientists lack the computational expertise to create such scripts. To address these challenges, we created TidyGEO, which supports essential data-cleaning tasks for sample-level annotations, such as selecting informative columns, renaming columns, splitting or merging columns, standardizing data values, and filtering samples. Additionally, users can integrate annotations with assay data, restructure assay data, and generate code that enables others to reproduce these steps.
Title: TidyGEO: preparing analysis-ready datasets from Gene Expression Omnibus. Journal: Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11294518/pdf/
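The data-cleaning tasks listed in the TidyGEO abstract (splitting columns, renaming them, standardizing values, filtering samples) can be pictured with a small script. This is an illustrative sketch only, not TidyGEO's code or the code it generates; it assumes sample characteristics arrive as "key: value" strings, as they typically do in GEO series files, and the sample identifiers, field names, and value mappings are invented for the example.

```python
def tidy_samples(raw_rows, rename=None, value_map=None, keep=None):
    """Split 'key: value' annotations into columns, then rename, standardize, filter."""
    rename = rename or {}
    value_map = value_map or {}
    tidy = []
    for row in raw_rows:
        record = {"sample": row["sample"]}
        for field in row["characteristics"]:
            # Split one combined annotation into a column name and a value.
            key, _, value = field.partition(":")
            key = rename.get(key.strip(), key.strip())
            value = value.strip()
            # Standardize the value if a mapping exists for this column.
            record[key] = value_map.get(key, {}).get(value.lower(), value)
        # Keep the sample unless a filter is given and rejects it.
        if keep is None or keep(record):
            tidy.append(record)
    return tidy


# Hypothetical raw annotations with inconsistent capitalization and coding.
raw = [
    {"sample": "GSM0001", "characteristics": ["tissue: Liver", "gender: F"]},
    {"sample": "GSM0002", "characteristics": ["tissue: liver", "gender: female"]},
    {"sample": "GSM0003", "characteristics": ["tissue: kidney", "gender: M"]},
]

tidy = tidy_samples(
    raw,
    rename={"gender": "sex"},
    value_map={"sex": {"f": "female", "m": "male"}, "tissue": {"liver": "liver"}},
    keep=lambda record: record["tissue"] == "liver",
)

if __name__ == "__main__":
    for record in tidy:
        print(record)
```

Writing the steps as one small function with explicit `rename`, `value_map`, and `keep` arguments mirrors the idea of generated, reproducible code: the arguments document exactly which cleaning decisions were made for a given series.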
With an ever-increasing amount of research data available, it becomes increasingly important to possess data literacy skills in order to benefit from this valuable resource. An integrative course was developed to teach students the fundamentals of data literacy through an engaging genome sequencing project. Each cohort of students carried out the planning of the experiment, DNA extraction, nanopore sequencing, genome sequence assembly, prediction of genes in the assembled sequence, and assignment of functional annotation terms to predicted genes. Students learned how to communicate science by writing a protocol in the form of a scientific paper, providing comments during a peer-review process, and presenting their findings as part of an international symposium. Many students enjoyed the opportunity to own a project and to work towards a meaningful objective.
Title: Data literacy in genome research. Authors: Katharina Wolff, Ronja Friedhoff, Friderieke Schwarzer, Boas Pucker. Pub Date: 2023-12-05; DOI: 10.1515/jib-2023-0033. Journal: Journal of Integrative Bioinformatics. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10777367/pdf/
Pub Date: 2023-11-20; eCollection Date: 2023-09-01; DOI: 10.1515/jib-2023-0017
Yulia E Uvarova, Pavel S Demenkov, Irina N Kuzmicheva, Artur S Venzel, Elena L Mischenko, Timofey V Ivanisenko, Vadim M Efimov, Svetlana V Bannikova, Asya R Vasilieva, Vladimir A Ivanisenko, Sergey E Peltek
Bacillus strains are ubiquitous in the environment and are widely used in the microbiological industry as valuable enzyme sources, as well as in agriculture to stimulate plant growth. The Bacillus genus comprises several closely related groups of species, and classifying them rapidly remains challenging with existing methods. Techniques based on MALDI-TOF MS data analysis hold significant promise for fast and precise classification of microbial strains at both the genus and species levels. In previous work, we proposed a geometric approach to Bacillus strain classification based on mass spectra analysis via the centroid method (CM). One limitation of such methods is their sensitivity to noise in MS spectra. In this study, we used a denoising autoencoder (DAE) to improve bacterial classification accuracy under noisy MS spectra conditions. The DAE converts noisy MS spectra into latent variables representing molecular patterns in the original MS data, and the Random Forest (RF) method then classifies bacterial strains from these latent variables. Comparison of the DAE-RF method with the CM method on artificially noisy test samples showed that DAE-RF offers higher noise robustness. Hence, the DAE-RF method could be used for noise-robust, fast, and accurate classification of Bacillus species from MALDI-TOF MS data.
{"title":"Accurate noise-robust classification of Bacillus species from MALDI-TOF MS spectra using a denoising autoencoder.","authors":"Yulia E Uvarova, Pavel S Demenkov, Irina N Kuzmicheva, Artur S Venzel, Elena L Mischenko, Timofey V Ivanisenko, Vadim M Efimov, Svetlana V Bannikova, Asya R Vasilieva, Vladimir A Ivanisenko, Sergey E Peltek","doi":"10.1515/jib-2023-0017","DOIUrl":"10.1515/jib-2023-0017","url":null,"PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.9,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10757077/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136400294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
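The two-stage idea in the Bacillus abstract above (a denoising autoencoder that maps noisy spectra to latent variables, followed by a Random Forest on those latent variables) can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The Gaussian noise model, the 12-bin synthetic "spectra", the network size, the learning rate, and every function name are assumptions made for illustration, and a bagged ensemble of depth-1 trees stands in for a full Random Forest to keep the example short.

```python
import math
import random


def add_noise(spectrum, sigma, rng):
    """Corrupt a spectrum with Gaussian noise (an assumed noise model)."""
    return [v + rng.gauss(0.0, sigma) for v in spectrum]


class DenoisingAutoencoder:
    """One tanh hidden layer, linear output, trained to rebuild the clean input."""

    def __init__(self, n_in, n_hidden, rng):
        s = 1.0 / math.sqrt(n_in)
        self.w1 = [[rng.uniform(-s, s) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-s, s) for _ in range(n_hidden)] for _ in range(n_in)]
        self.b2 = [0.0] * n_in

    def encode(self, x):
        return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def train_step(self, noisy, clean, lr):
        h = self.encode(noisy)
        out = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(self.w2, self.b2)]
        err = [o - c for o, c in zip(out, clean)]  # gradient of 0.5 * squared error
        # Hidden-layer gradient, computed before w2 is modified.
        dh = [sum(e * row[j] for e, row in zip(err, self.w2)) * (1.0 - h[j] ** 2)
              for j in range(len(h))]
        for i, e in enumerate(err):
            self.w2[i] = [w - lr * e * hj for w, hj in zip(self.w2[i], h)]
            self.b2[i] -= lr * e
        for j, d in enumerate(dh):
            self.w1[j] = [w - lr * d * xk for w, xk in zip(self.w1[j], noisy)]
            self.b1[j] -= lr * d
        return 0.5 * sum(e * e for e in err)


def majority(labels):
    """Most frequent label; ties go to the smallest label."""
    return max(sorted(set(labels)), key=labels.count)


def fit_stump(rows, labels, feature):
    """Pick the threshold on one latent feature that classifies most rows correctly."""
    best = (-1, feature, math.inf, majority(labels), majority(labels))
    for t in sorted({r[feature] for r in rows}):
        left = [lab for r, lab in zip(rows, labels) if r[feature] <= t]
        right = [lab for r, lab in zip(rows, labels) if r[feature] > t]
        if left and right:
            score = left.count(majority(left)) + right.count(majority(right))
            if score > best[0]:
                best = (score, feature, t, majority(left), majority(right))
    return best[1:]


def fit_forest(rows, labels, n_trees, rng):
    """Bootstrap sample plus one random feature per tree (depth-1 trees only)."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx],
                                rng.randrange(len(rows[0]))))
    return forest


def predict(forest, row):
    return majority([ll if row[f] <= t else rl for f, t, ll, rl in forest])


def make_spectrum(label, rng, n_bins=12):
    """Hypothetical toy spectrum: two peaks whose positions depend on the class."""
    peaks = (2, 7) if label == 0 else (4, 9)
    return [rng.uniform(0.8, 1.0) if i in peaks else 0.0 for i in range(n_bins)]


def run_demo(seed=0, sigma=0.3, epochs=100):
    rng = random.Random(seed)
    labels = [i % 2 for i in range(40)]
    clean = [make_spectrum(lab, rng) for lab in labels]
    dae = DenoisingAutoencoder(n_in=12, n_hidden=4, rng=rng)
    losses = []
    for _ in range(epochs):  # fresh noise is drawn for every sample in every epoch
        losses.append(sum(dae.train_step(add_noise(x, sigma, rng), x, lr=0.05)
                          for x in clean) / len(clean))
    # Assumption: the forest is trained on latent codes of noisy training spectra.
    forest = fit_forest([dae.encode(add_noise(x, sigma, rng)) for x in clean], labels, 25, rng)
    test_labels = [i % 2 for i in range(20)]
    latents = [dae.encode(add_noise(make_spectrum(lab, rng), sigma, rng)) for lab in test_labels]
    preds = [predict(forest, z) for z in latents]
    accuracy = sum(p == t for p, t in zip(preds, test_labels)) / len(preds)
    return losses, latents, preds, accuracy


if __name__ == "__main__":
    losses, latents, preds, accuracy = run_demo()
    print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}, noisy-test accuracy {accuracy:.2f}")
```

Running the script prints the mean reconstruction loss for the first and last epoch and the accuracy on freshly generated noisy spectra. On real MALDI-TOF MS data one would more plausibly use an established deep-learning framework for the autoencoder and a library Random Forest, with preprocessing choices that the abstract does not specify.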