Pub Date: 2023-08-01 | DOI: 10.1093/bioinformatics/btad436
Tong Xin, Yanan Lv, Haoran Chen, Linlin Li, Lijun Shen, Guangcun Shan, Xi Chen, Hua Han
Motivation: The registration of serial section electron microscope images is a critical step in reconstructing biological tissue volumes, and it aims to eliminate complex nonlinear deformations from sectioning and replicate the correct neurite structure. However, due to the inherent properties of biological structures and the challenges posed by section preparation of biological tissues, achieving an accurate registration of serial sections remains a significant challenge. Conventional nonlinear registration techniques, which are effective in eliminating nonlinear deformation, can also eliminate the natural morphological variation of neurites across sections. Additionally, accumulation of registration errors alters the neurite structure.
Results: This article proposes a novel method for serial section registration that utilizes an unsupervised optical flow network to measure feature similarity rather than pixel similarity to eliminate nonlinear deformation and achieve pairwise registration between sections. The optical flow network is then employed to estimate and compensate for cumulative registration error, thereby allowing for the reconstruction of the structure of biological tissues. Based on the novel serial section registration method, a serial split technique is proposed for long-serial sections. Experimental results demonstrate that the state-of-the-art method proposed here effectively improves the spatial continuity of serial sections, leading to more accurate registration and improved reconstruction of the structure of biological tissues.
Availability and implementation: The source code and data are available at https://github.com/TongXin-CASIA/EFSR.
Title: A novel registration method for long-serial section images of EM with a serial split technique based on unsupervised optical flow network. Journal: Bioinformatics, volume 39, issue 8. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403427/pdf/
Pub Date: 2023-08-01 | DOI: 10.1093/bioinformatics/btad441
Amir Feizi, Kamalika Ray
Motivation: Open Targets Genetics is a comprehensive resource portal that offers variant-centric statistical evidence, enabling the prioritization of causal variants and the identification of potential drug targets. The portal uses GraphQL for efficient data querying and provides endpoints for programmatic access by R and Python users. However, retrieving data through GraphQL can be challenging, time-consuming, and repetitive: it requires familiarity with the GraphQL query language and conversion of nested JSON (JavaScript Object Notation) output into tidy data tables. Open-source tools are therefore needed to simplify data retrieval so that valuable genetic information can be integrated seamlessly into data-driven target discovery pipelines.
Results: otargen is an open-source R package designed to make data retrieval and analysis from the Open Targets Genetics portal as simple as possible for R users. The package offers a suite of functions covering all query types and returns results in a tidy table format. With a single line of code, otargen users avoid repeatedly scripting complex GraphQL queries and the associated post-processing steps. In addition, otargen provides plotting functions for visualizing, and gaining insights from, the complex data tables returned by several key functions.
Availability and implementation: otargen is available at https://amirfeizi.github.io/otargen/.
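The post-processing step described above, turning nested JSON into tidy rows, can be illustrated with a minimal Python sketch. This is not otargen's code (otargen is an R package), and the response shape and field names (`associations`, `variant`, `pval`) are invented for illustration; they are not the actual Open Targets Genetics schema.

```python
from typing import Any, Dict, List


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with dots."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def tidy_rows(response: Dict[str, Any], path: List[str]) -> List[Dict[str, Any]]:
    """Follow `path` into the response and flatten each item of the list found there."""
    node: Any = response
    for key in path:
        node = node[key]
    return [flatten(item) for item in node]


# Hypothetical response, made up for this example only.
example_response = {
    "data": {
        "associations": [
            {"variant": {"id": "v1", "position": 100}, "pval": 1e-8},
            {"variant": {"id": "v2", "position": 250}, "pval": 3e-5},
        ]
    }
}

rows = tidy_rows(example_response, ["data", "associations"])
for row in rows:
    print(row)
```

Each row is a flat dictionary with one key per column, which is the shape a tidy table expects; lists nested inside records would need an extra unnesting rule that this sketch leaves out.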
Title: otargen: GraphQL-based R package for tidy data accessing and processing from Open Targets Genetics. Journal: Bioinformatics, volume 39, issue 8. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10394122/pdf/
Pub Date: 2023-08-01 | DOI: 10.1093/bioinformatics/btad443
Ondrej Vavra, Jakub Beranek, Jan Stourac, Martin Surkovsky, Jiri Filipovic, Jiri Damborsky, Jan Martinovic, David Bednar
Summary: Access pathways in enzymes are crucial for the passage of substrates and products of catalysed reactions. The process can be studied by computational means with variable degrees of precision. Our in-house approximate method CaverDock provides a fast and easy way to set up and run ligand binding and unbinding calculations through protein tunnels and channels. Here we introduce pyCaverDock, a Python 3 API designed to improve the user experience with the tool and further facilitate ligand transport analyses. The API simplifies the steps needed to use CaverDock, from automating setup processes to designing screening pipelines.
Availability and implementation: pyCaverDock API is implemented in Python 3 and is freely available with detailed documentation and practical examples at https://loschmidt.chemi.muni.cz/caverdock/.
Title: pyCaverDock: Python implementation of the popular tool for analysis of ligand transport with advanced caching and batch calculation support. Journal: Bioinformatics, volume 39, issue 8. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10397418/pdf/
Pub Date: 2023-08-01 | DOI: 10.1093/bioinformatics/btad459
Gabriel Cretin, Charlotte Périn, Nicolas Zimmermann, Tatiana Galochkina, Jean-Christophe Gelly
Motivation: Alignment of protein structures is a major problem in structural biology. The most common first approach is to treat proteins as rigid bodies. However, alignment of protein structures can be very complex because of conformational variability or complex evolutionary relationships between proteins, such as insertions, circular permutations or repetitions. In such cases, introducing flexibility is useful for two reasons: (i) it helps compare two protein chains that have adopted different conformational states, for example owing to protein-ligand interactions or post-translational modifications, and (ii) it aids in the identification of conserved regions in proteins that may have distant evolutionary relationships.
Results: We propose ICARUS, a new approach for flexible structural alignment based on identification of Protein Units, evolutionarily preserved structural descriptors of intermediate size, between secondary structures and domains. ICARUS significantly outperforms reference methods on a dataset of very difficult structural alignments.
Availability and implementation: Code is freely available online at https://github.com/DSIMB/ICARUS.
Title: ICARUS: flexible protein structural alignment based on Protein Units. Journal: Bioinformatics, volume 39, issue 8. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10400377/pdf/
Pub Date: 2023-08-01 | DOI: 10.1093/bioinformatics/btad466
Emanuel Cunha, Davide Lagoa, José P Faria, Filipe Liu, Christopher S Henry, Oscar Dias
Motivation: The importance and rate of development of genome-scale metabolic models have been growing for the last few years, increasing the demand for software solutions that automate several steps of this process. However, since TRIAGE's release, software development for the automatic integration of transport reactions into models has stalled.
Results: Here, we present the Transport Systems Tracker (TranSyT). Unlike other transport systems annotation software, TranSyT does not rely on manual curation to expand its internal database, which is derived from highly curated records retrieved from the Transporters Classification Database and complemented with information from other data sources. TranSyT compiles information regarding transporter families and proteins, and derives reactions into its internal database, making it available for rapid annotation of complete genomes. All transport reactions have GPR associations and can be exported with identifiers from four different metabolite databases. TranSyT is currently available as a plugin for merlin v4.0 and an app for KBase.
Availability and implementation: TranSyT web service: https://transyt.bio.di.uminho.pt/; GitHub for the tool: https://github.com/BioSystemsUM/transyt; GitHub with examples and instructions to run TranSyT: https://github.com/ecunha1996/transyt_paper.
Title: TranSyT, an innovative framework for identifying transport systems. Journal: Bioinformatics, volume 39, issue 8. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10444967/pdf/btad466.pdf
Pub Date: 2023-08-01 | DOI: 10.1093/bioinformatics/btad509
Thaidy Moreno, Joaquin Magana, David A Quigley
Summary: Resistance to two classes of FDA-approved therapies that target DNA repair-deficient tumors is caused by mutations that restore the tumor cell's DNA repair function. Identifying these "reversion" mutations currently requires manual annotation of patient tumor sequence data. Here we present AARDVARK, an R package that automatically identifies reversion mutations from DNA sequence data.
Availability and implementation: AARDVARK is implemented in R (≥3.5). It is available on GitHub at https://github.com/davidquigley/aardvark. It is licensed under the MIT license.
Title: AARDVARK: an automated reversion detector for variants affecting resistance kinetics. Journal: Bioinformatics, volume 39, issue 8. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457659/pdf/
Pub Date: 2023-08-01 | DOI: 10.1093/bioinformatics/btad477
Serene W H Wong, Chiara Pastrello, Max Kotlyar, Christos Faloutsos, Igor Jurisica
Motivation: Many real-world problems can be modeled as annotated graphs. Scalable graph algorithms that extract actionable information from such data are in demand since these graphs are large, varying in topology, and have diverse node/edge annotations. When these graphs change over time they create dynamic graphs, and open the possibility to find patterns across different time points. In this article, we introduce a scalable algorithm that finds unique dense regions across time points in dynamic graphs. Such algorithms have applications in many different areas, including the biological, financial, and social domains.
Results: This manuscript makes three contributions. First, we designed a scalable algorithm, USNAP, that identifies dense subgraphs unique to a time stamp in a given dynamic graph. Importantly, USNAP provides a lower bound on the density measure at each step of the greedy algorithm. Second, validation on real data shows its effectiveness. While USNAP is domain independent, we applied it to four non-small cell lung cancer gene expression datasets. Stages of non-small cell lung cancer were modeled as dynamic graphs and given as input to USNAP. Pathway enrichment analyses and comprehensive interpretation of the literature show that USNAP identified biologically relevant mechanisms for different stages of cancer progression. Third, USNAP is scalable, with a time complexity of O(m + m_c log n_c + n_c log n_c), where m is the number of edges and n is the number of vertices in the dynamic graph, and m_c and n_c are the numbers of edges and vertices, respectively, in the collapsed graph.
Availability and implementation: The code of USNAP is available at https://www.cs.utoronto.ca/~juris/data/USNAP22.
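To give a feel for what a greedy dense-subgraph search looks like, here is a minimal Python sketch of the classic greedy peeling heuristic for a single static graph: repeatedly remove a minimum-degree vertex and keep the densest intermediate subgraph, with density taken as edges divided by vertices. This is a generic textbook technique swapped in for illustration. It is not USNAP: it does not handle time stamps, uniqueness, or graph collapsing, and its linear scan for the minimum-degree vertex is slower than the complexity quoted in the abstract.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def greedy_densest_subgraph(
    edges: Iterable[Tuple[Hashable, Hashable]],
) -> Tuple[Set[Hashable], float]:
    """Greedy peeling: return (vertex set, density) of the densest subgraph seen."""
    # Build an undirected adjacency map, ignoring self-loops.
    adj: Dict[Hashable, Set[Hashable]] = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    remaining: Set[Hashable] = set(adj)
    best_nodes: Set[Hashable] = set(remaining)
    best_density = num_edges / len(remaining) if remaining else 0.0

    while len(remaining) > 1:
        # Pick a vertex of minimum current degree (simple linear scan).
        u = min(remaining, key=lambda x: len(adj[x]))
        num_edges -= len(adj[u])
        for v in adj[u]:
            adj[v].discard(u)
        adj[u] = set()
        remaining.remove(u)

        density = num_edges / len(remaining)
        if density > best_density:
            best_nodes, best_density = set(remaining), density

    return best_nodes, best_density


# A 4-clique on {1, 2, 3, 4} with a short tail 4-5-6 attached.
example_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
nodes, density = greedy_densest_subgraph(example_edges)
print(nodes, density)
```

On this toy graph the tail vertices are peeled first, leaving the 4-clique as the densest subgraph found. Replacing the linear scan with a degree-bucketed queue is the usual way to make peeling near-linear in the number of edges.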
Title: USNAP: fast unique dense region detection and its application to lung cancer. Journal: Bioinformatics, volume 39, issue 8. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10425186/pdf/
Pub Date: 2023-08-01 | DOI: 10.1093/bioinformatics/btad479
Md Ashiqur Rahman, Abdullah Aman Tutul, Mahfuza Sharmin, Md Shamsuzzoha Bayzid
Motivation: Analyzing large-scale single-cell transcriptomic datasets generated using different technologies is challenging due to the presence of batch-specific systematic variations known as batch effects. Since biological and technological differences are often interspersed, detecting and accounting for batch effects in RNA-seq datasets is critical for effective data integration and interpretation. Low-dimensional embeddings, such as those produced by principal component analysis (PCA), are widely used for visual inspection and estimation of batch effects. Linear dimensionality reduction methods like PCA are effective in assessing the presence of batch effects, especially when the batch effects exhibit linear patterns. However, batch effects are inherently complex, and existing linear dimensionality reduction methods can be inadequate and imprecise in the presence of sophisticated nonlinear batch effects.
Results: We present Batch Effect Estimation using Nonlinear Embedding (BEENE), a deep nonlinear auto-encoder network which is specially tailored to generate an alternative lower dimensional embedding suitable for both linear and nonlinear batch effects. BEENE simultaneously learns the batch and biological variables from RNA-seq data, resulting in an embedding that is more robust and sensitive than PCA embedding in terms of detecting and quantifying batch effects. BEENE was assessed on a collection of carefully controlled simulated datasets as well as biological datasets, including two technical replicates of mouse embryogenesis cells, peripheral blood mononuclear cells from three largely different experiments and five studies of pancreatic islet cells.
Availability and implementation: BEENE is freely available as an open source project at https://github.com/ashiq24/BEENE.
Title: BEENE: deep learning-based nonlinear embedding improves batch effect estimation. Journal: Bioinformatics, volume 39, issue 8. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10448987/pdf/
Pub Date: 2023-08-01 | DOI: 10.1093/bioinformatics/btad484
A Zadorozhny, A Smirnov, D Filimonov, A Lagunin
Motivation: Next Generation Sequencing technologies make it possible to detect rare genetic variants in individual patients. Currently, more than a dozen software and web services have been created to predict the pathogenicity of variants related with changing of amino acid residues. Despite considerable efforts in this area, at the moment there is no ideal method to classify pathogenic and harmless variants, and the assessment of the pathogenicity is often contradictory. In this article, we propose to use peptides structural formulas of proteins as an amino acid residues substitutions description, rather than a single-letter code. This allowed us to investigate the effectiveness of chemoinformatics approach to assess the pathogenicity of variants associated with amino acid substitutions.
Results: The structure-activity relationships analysis relying on protein-specific data and atom centric substructural multilevel neighborhoods of atoms (MNA) descriptors of molecular fragments appeared to be suitable for predicting the pathogenic effect of single amino acid variants. MNA-based Naïve Bayes classifier algorithm, ClinVar and humsavar data were used for the creation of structure-activity relationships models for 10 proteins. The performance of the models was compared with 11 different predicting tools: 8 individual (SIFT 4G, Polyphen2 HDIV, MutationAssessor, PROVEAN, FATHMM, MVP, LIST-S2, MutPred) and 3 consensus (M-CAP, MetaSVM, MetaLR). The accuracy of MNA-based method varies for the proteins (AUC: 0.631-0.993; MCC: 0.191-0.891). It was similar for both the results of comparisons with the other individual predictors and third-party protein-specific predictors. For several proteins (BRCA1, BRCA2, COL1A2, and RYR1), the performance of the MNA-based method was outstanding, capable of capturing the pathogenic effect of structural changes in amino acid substitutions.
Availability and implementation: The datasets are available as supplementary data at Bioinformatics online. A Python script to convert amino acid and nucleotide sequences from single-letter codes to SD files is available at https://github.com/SmirnygaTotoshka/SequenceToSDF. The authors provide trial licenses for MultiPASS software to interested readers upon request.
Title: Prediction of pathogenic single amino acid substitutions using molecular fragment descriptors. Journal: Bioinformatics, 39(8), 2023-08-01 (journal article; impact factor 5.8). Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10435372/pdf/
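The abstract above reports model quality as AUC and MCC (Matthews correlation coefficient). As a reminder of what the MCC range means, here is a minimal sketch of the standard MCC formula computed from confusion-matrix counts. The function name and the example counts are hypothetical illustrations, not taken from the paper or from its software.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns a value in [-1, 1]: 1 is perfect agreement, 0 is no better
    than chance, -1 is total disagreement. By the usual convention, 0.0
    is returned when any marginal sum is zero and the formula is undefined.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts for one protein-specific classifier
    # (pathogenic = positive class, harmless = negative class).
    print(round(matthews_corrcoef(tp=45, tn=40, fp=10, fn=5), 3))
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when pathogenic and harmless variants are unevenly represented for a given protein.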
Pub Date: 2023-08-01 DOI: 10.1093/bioinformatics/btad513
Kynon J M Benjamin, Tarun Katipalli, Apuã C M Paquola
Motivation: Advances in technology have generated larger omics datasets with potential applications for machine learning. In many datasets, however, cost and limited sample availability result in far more features than observations. Moreover, biological processes are associated with networks of core and peripheral genes, while traditional feature selection approaches capture only core genes.
Results: To overcome these limitations, we present dRFEtools, which implements dynamic recursive feature elimination (RFE). It reduces computational time while maintaining high accuracy compared to standard RFE, extends dynamic RFE to regression algorithms, and outputs the subsets of features that hold predictive power with and without peripheral features. dRFEtools integrates with scikit-learn (the popular Python machine learning platform) and thus provides new opportunities for dynamic RFE in large-scale omics data while enhancing its interpretability.
Availability and implementation: dRFEtools is freely available on PyPI at https://pypi.org/project/drfetools/ and on GitHub at https://github.com/LieberInstitute/dRFEtools. It is implemented in Python 3 and supported on Linux, Windows, and macOS.
Title: dRFEtools: dynamic recursive feature elimination for omics. Journal: Bioinformatics, 39(8), 2023-08-01 (journal article; impact factor 5.8). Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10471895/pdf/
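To illustrate why a dynamic elimination scheme can cut computational time relative to standard RFE, the sketch below compares how many elimination rounds each needs. It assumes that "dynamic" means removing a fixed fraction of the remaining features per round, whereas standard RFE removes one feature per round. The function name, the rate of 0.2, and the feature count are hypothetical; this is not the dRFEtools API.

```python
import math


def elimination_schedule(n_features, rate=None, min_features=1):
    """Return the feature-set sizes visited during recursive elimination.

    rate=None  : standard RFE, one feature dropped per round.
    0 < rate < 1: dynamic RFE (assumed here), that fraction of the
                  remaining features dropped per round, at least one.
    """
    sizes = [n_features]
    n = n_features
    while n > min_features:
        drop = 1 if rate is None else max(1, math.floor(n * rate))
        n = max(min_features, n - drop)
        sizes.append(n)
    return sizes


if __name__ == "__main__":
    n = 20000  # hypothetical number of genes in an expression matrix
    standard = elimination_schedule(n)
    dynamic = elimination_schedule(n, rate=0.2)
    print("standard RFE rounds:", len(standard) - 1)
    print("dynamic RFE rounds: ", len(dynamic) - 1)
```

Because each round requires refitting the model, the number of rounds is a rough proxy for run time. Under this assumption, the fractional schedule shrinks the feature set geometrically while it is large and then slows to single-feature steps near the end.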