Afpdb - an efficient structure manipulation package for AI protein design
Pub Date: 2024-11-05
DOI: 10.1093/bioinformatics/btae654
Yingyao Zhou, Jiayi Cox, Bin Zhou, Steven Zhu, Yang Zhong, Glen Spraggon
Motivation: The advent of AlphaFold and other protein Artificial Intelligence (AI) models has transformed protein design, necessitating efficient handling of large-scale data and complex workflows. Using existing programming packages that predate recent AI advancements often leads to inefficiencies in human coding and slow code execution. To address this gap, we developed the Afpdb package.
Results: Afpdb, built on AlphaFold's NumPy architecture, offers a high-performance core. It adopts RFDiffusion's contig syntax to streamline residue and atom selection, making code simpler and more readable. By integrating PyMOL's visualization capabilities, Afpdb enables automated visual quality control. With over 180 methods commonly used in protein AI design, functionality that is otherwise hard to find in one place, Afpdb enhances productivity in structural biology by supporting the development of concise, high-performance code.
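As a sketch of what contig-style selection might look like in practice (the class and method names below — Protein, extract, save — follow the project README as best understood and should be treated as assumptions rather than verified API, as should the file names):

```python
# Hypothetical afpdb usage; names and signatures are assumptions, not verified API.
from afpdb.afpdb import Protein

p = Protein("design.pdb")       # load a structure from a local PDB file
binder = p.extract("A:10-45")   # RFDiffusion-style contig: chain A, residues 10-45
binder.save("binder.pdb")       # write the selection for downstream tools
```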
Availability: Code and documentation are available on GitHub (https://github.com/data2code/afpdb) and PyPI (https://pypi.org/project/afpdb). An interactive tutorial is accessible through Google Colab.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Afpdb - an efficient structure manipulation package for AI protein design.","authors":"Yingyao Zhou, Jiayi Cox, Bin Zhou, Steven Zhu, Yang Zhong, Glen Spraggon","doi":"10.1093/bioinformatics/btae654","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae654","url":null,"abstract":"<p><strong>Motivation: </strong>The advent of AlphaFold and other protein Artificial Intelligence (AI) models has transformed protein design, necessitating efficient handling of large-scale data and complex workflows. Using existing programming packages that predate recent AI advancements often leads to inefficiencies in human coding and slow code execution. To address this gap, we developed the Afpdb package.</p><p><strong>Results: </strong>Afpdb, built on AlphaFold's NumPy architecture, offers a high-performance core. It uses RFDiffusion's contig syntax to streamline residue and atom selection, making coding simpler and more readable. Integrating PyMOL's visualization capabilities, Afpdb allows automatic visual quality control. With over 180 methods commonly used in protein AI design, which are otherwise hard to find, Afpdb enhances productivity in structural biology by supporting the development of concise, high-performance code.</p><p><strong>Availability: </strong>Code and documentation are available on GitHub (https://github.com/data2code/afpdb) and PyPI (https://pypi.org/project/afpdb). An interactive tutorial is accessible through Google Colab.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142585154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ENKIE: A package for predicting enzyme kinetic parameter values and their uncertainties
Pub Date: 2024-11-04
DOI: 10.1093/bioinformatics/btae652
Mattia G Gollub, Thierry Backes, Hans-Michael Kaltenbach, Jörg Stelling
Motivation: Relating metabolite and enzyme abundances to metabolic fluxes requires reaction kinetics, core elements of dynamic and enzyme cost models. However, kinetic parameters have been measured only for a fraction of all known enzymes, and the reliability of the available values is unknown.
Results: The ENzyme KInetics Estimator (ENKIE) uses Bayesian multilevel models to predict the values and uncertainties of KM and kcat parameters. Our models use five categorical predictors and achieve prediction performance comparable to deep learning approaches that use sequence and structure information. They provide calibrated uncertainty predictions and interpretable insights into the main sources of uncertainty. We expect our tool to simplify the construction of priors for Bayesian kinetic models of metabolism.
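To illustrate the flavor of multilevel (partial-pooling) prediction — not the ENKIE model itself, which uses five categorical predictors and full Bayesian inference — here is a minimal empirical-Bayes sketch that pools log10(kcat) estimates within a single hypothetical categorical predictor; the variance values and data are toy assumptions:

```python
# Conceptual partial-pooling sketch (not the ENKIE implementation).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ec_class": ["1.1.1", "1.1.1", "2.7.1", "2.7.1", "2.7.1", "3.2.1"],
    "log10_kcat": [1.2, 0.8, 2.1, 1.9, 2.4, 0.3],
})

mu0 = df["log10_kcat"].mean()        # global mean acts as the prior mean
tau2 = df["log10_kcat"].var(ddof=1)  # between-group variance proxy
sigma2 = 0.25                        # assumed within-group measurement variance

def pooled_estimate(group):
    n = len(group)
    w = (n / sigma2) / (n / sigma2 + 1 / tau2)   # shrinkage weight toward group mean
    mean = w * group.mean() + (1 - w) * mu0      # posterior mean (shrunk estimate)
    var = 1 / (n / sigma2 + 1 / tau2)            # posterior variance of the mean
    return pd.Series({"post_mean": mean, "post_sd": np.sqrt(var)})

# groups with few observations are pulled more strongly toward the global mean
print(df.groupby("ec_class")["log10_kcat"].apply(pooled_estimate))
```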
Availability: Code and Python package are available at https://gitlab.com/csb.ethz/enkie and https://pypi.org/project/enkie/.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"ENKIE: A package for predicting enzyme kinetic parameter values and their uncertainties.","authors":"Mattia G Gollub, Thierry Backes, Hans-Michael Kaltenbach, Jörg Stelling","doi":"10.1093/bioinformatics/btae652","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae652","url":null,"abstract":"<p><strong>Motivation: </strong>Relating metabolite and enzyme abundances to metabolic fluxes requires reaction kinetics, core elements of dynamic and enzyme cost models. However, kinetic parameters have been measured only for a fraction of all known enzymes, and the reliability of the available values is unknown.</p><p><strong>Results: </strong>The ENzyme KInetics Estimator (ENKIE) uses Bayesian Multilevel Models to predict value and uncertainty of KM and kcat parameters. Our models use five categorical predictors and achieve prediction performances comparable to deep learning approaches that use sequence and structure information. They provide calibrated uncertainty predictions and interpretable insights into the main sources of uncertainty. We expect our tool to simplify the construction of priors for Bayesian kinetic models of metabolism.</p><p><strong>Availability: </strong>Code and Python package are available at https://gitlab.com/csb.ethz/enkie and https://pypi.org/project/enkie/.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142570340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
findGSEP: estimating genome size of polyploid species using k-mer frequencies
Pub Date: 2024-11-01
DOI: 10.1093/bioinformatics/btae647
Laiyi Fu, Yanxin Xie, Shunkang Ling, Ying Wang, Binzhong Wang, Hejun Du, Qinke Peng, Hequan Sun
Summary: Estimating genome size from k-mer frequencies, which plays a fundamental role in designing genome sequencing and analysis projects, has remained challenging for polyploid species, i.e., ploidy p > 2. To address this, we introduce findGSEP, which is based on iterative curve fitting of k-mer frequencies. Specifically, it first disentangles up to p normal distributions by analyzing k-mer frequencies in whole-genome sequencing of the focal species. Second, it computes the sizes of genomic regions covered by 1∼p (homologous) chromosomes from the respective fitted curves, from which it infers the full polyploid and average haploid genome size. findGSEP can handle any level of ploidy p and infers more accurate genome sizes than other well-known tools, as shown by tests on simulated and real genomic sequencing data from various species, including octoploids.
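The following sketch illustrates the curve-fitting idea on a toy tetraploid-like k-mer spectrum — it is not the findGSEP code, and the peak positions, window widths, and amplitudes are invented for the example. Each peak at m·c corresponds to k-mers shared by m homologous chromosomes, and dividing the total k-mer count by the homozygous-peak depth yields a genome-size estimate:

```python
# Illustrative Gaussian fitting of a synthetic k-mer spectrum (toy values).
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, a, mu, sd):
    return a * np.exp(-0.5 * ((x - mu) / sd) ** 2)

depth = np.arange(1, 200, dtype=float)   # k-mer depth bins
c = 25.0                                  # assumed per-copy coverage
counts = sum(a * gauss(depth, 1.0, m * c, 6.0)        # peaks at c, 2c, 3c, 4c
             for m, a in [(1, 4e6), (2, 2e6), (3, 8e5), (4, 1e6)])

# fit each peak in a window around its expected position (findGSEP iterates this)
peaks = []
for m in (1, 2, 3, 4):
    w = (depth > m * c - 12) & (depth < m * c + 12)
    popt, _ = curve_fit(gauss, depth[w], counts[w], p0=[1e6, m * c, 5.0])
    peaks.append(popt)

total_kmers = float((depth * counts).sum())
genome_size = total_kmers / peaks[-1][1]  # divide by fitted homozygous (4c) depth
print(f"estimated genome size ≈ {genome_size:,.0f} bp")
```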
Availability and implementation: "findGSEP" was implemented as a web server, which is freely available at http://146.56.237.198:3838/findGSEP/. Also, "findGSEP" was implemented as an R package for parallel processing of multiple samples. Source code and tutorial on its installation and usage is available at https://github.com/sperfu/findGSEP.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"findGSEP: estimating genome size of polyploid species using k-mer frequencies.","authors":"Laiyi Fu, Yanxin Xie, Shunkang Ling, Ying Wang, Binzhong Wang, Hejun Du, Qinke Peng, Hequan Sun","doi":"10.1093/bioinformatics/btae647","DOIUrl":"10.1093/bioinformatics/btae647","url":null,"abstract":"<p><strong>Summary: </strong>Estimating genome size using k-mer frequencies, which plays a fundamental role in designing genome sequencing and analysis projects, has remained challenging for polyploid species, i.e., ploidy p > 2. To address this, we introduce \"findGSEP,\" which is designed based on iterative curve fitting of k-mer frequencies. Precisely, it first disentangles up to p normal distributions by analyzing k-mer frequencies in whole genome sequencing of the focal species. Second, it computes the sizes of genomic regions related to 1∼p (homologous) chromosome(s) using each respective curve fitting, from which it infers the full polyploid and average haploid genome size. \"findGSEP\" can handle any level of ploidy p, and infer more accurate genome size than other well-known tools, as shown by tests using simulated and real genomic sequencing data of various species including octoploids.</p><p><strong>Availability and implementation: </strong>\"findGSEP\" was implemented as a web server, which is freely available at http://146.56.237.198:3838/findGSEP/. Also, \"findGSEP\" was implemented as an R package for parallel processing of multiple samples. Source code and tutorial on its installation and usage is available at https://github.com/sperfu/findGSEP.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11552620/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142549519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ADAPT: Analysis of Microbiome Differential Abundance by Pooling Tobit Models
Pub Date: 2024-11-01
DOI: 10.1093/bioinformatics/btae661
Mukai Wang, Simon Fontaine, Hui Jiang, Gen Li
Motivation: Microbiome differential abundance analysis (DAA) remains a challenging problem despite multiple methods proposed in the literature. The excessive zeros and compositionality of metagenomics data are two main challenges for DAA.
Results: We propose a novel method called "Analysis of Microbiome Differential Abundance by Pooling Tobit Models" (ADAPT) to overcome these two challenges. ADAPT interprets zero counts as left-censored observations to avoid unfounded assumptions and complex models. ADAPT also encompasses a theoretically justified way of selecting non-differentially abundant microbiome taxa as a reference to reveal differentially abundant taxa while avoiding false discoveries. We generate synthetic data using independent simulation frameworks to show that ADAPT has more consistent false discovery rate control and higher statistical power than competitors. We use ADAPT to analyze 16S rRNA sequencing of saliva samples and shotgun metagenomics sequencing of plaque samples collected from infants in the COHRA2 study. The results provide novel insights into the association between the oral microbiome and early childhood dental caries.
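To make the Tobit idea concrete — zero counts treated as left-censored observations of a latent abundance — here is a self-contained likelihood sketch. It is a simplified illustration, not the ADAPT implementation (which pools taxon-level Tobit models and selects a reference set of non-differential taxa):

```python
# Tobit (left-censored) regression likelihood, illustrative only.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, y, X, limit=0.0):
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    mu = X @ beta
    censored = y <= limit
    ll = np.where(
        censored,
        norm.logcdf((limit - mu) / sigma),           # P(latent value <= limit)
        norm.logpdf((y - mu) / sigma) - log_sigma,   # density of observed values
    )
    return -ll.sum()

# toy data: y = latent log-abundance, left-censored at 0 (i.e. zeros in the counts)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.integers(0, 2, 200)])  # intercept + group
latent = X @ np.array([1.0, 0.8]) + rng.normal(0, 1.0, 200)
y = np.maximum(latent, 0.0)                                   # censoring step

res = minimize(tobit_negloglik, x0=np.zeros(3), args=(y, X))
print(res.x)  # fitted [beta0, group effect, log_sigma]
```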
Availability and implementation: The R package ADAPT can be installed from Bioconductor at https://bioconductor.org/packages/release/bioc/html/ADAPT.html or from GitHub at https://github.com/mkbwang/ADAPT. The source code for the simulation studies and real data analysis is available at https://github.com/mkbwang/ADAPT_example.
{"title":"ADAPT: Analysis of Microbiome Differential Abundance by Pooling Tobit Models.","authors":"Mukai Wang, Simon Fontaine, Hui Jiang, Gen Li","doi":"10.1093/bioinformatics/btae661","DOIUrl":"10.1093/bioinformatics/btae661","url":null,"abstract":"<p><strong>Motivation: </strong>Microbiome differential abundance analysis (DAA) remains a challenging problem despite multiple methods proposed in the literature. The excessive zeros and compositionality of metagenomics data are two main challenges for DAA.</p><p><strong>Results: </strong>We propose a novel method called \"Analysis of Microbiome Differential Abundance by Pooling Tobit Models\" (ADAPT) to overcome these two challenges. ADAPT interprets zero counts as left-censored observations to avoid unfounded assumptions and complex models. ADAPT also encompasses a theoretically justified way of selecting non-differentially abundant microbiome taxa as a reference to reveal differentially abundant taxa while avoiding false discoveries. We generate synthetic data using independent simulation frameworks to show that ADAPT has more consistent false discovery rate control and higher statistical power than competitors. We use ADAPT to analyze 16S rRNA sequencing of saliva samples and shotgun metagenomics sequencing of plaque samples collected from infants in the COHRA2 study. The results provide novel insights into the association between the oral microbiome and early childhood dental caries.</p><p><strong>Availability and implementation: </strong>The R package ADAPT can be installed from Bioconductor at https://bioconductor.org/packages/release/bioc/html/ADAPT.html or from Github at https://github.com/mkbwang/ADAPT. The source codes for simulation studies and real data analysis are available at https://github.com/mkbwang/ADAPT_example.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142607231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TRAITER: transformer-guided diagnosis and prognosis of heart failure using cell nuclear morphology and DNA damage marker
Pub Date: 2024-11-01
DOI: 10.1093/bioinformatics/btae610
Hiromu Hayashi, Toshiyuki Ko, Zhehao Dai, Kanna Fujita, Seitaro Nomura, Hiroki Kiyoshima, Shinya Ishihara, Momoko Hamano, Issei Komuro, Yoshihiro Yamanishi
Motivation: Heart failure (HF), a major cause of morbidity and mortality, necessitates precise diagnostic and prognostic methods.
Results: This study presents a novel deep learning approach, Transformer-based Analysis of Images of Tissue for Effective Remedy (TRAITER), for HF diagnosis and prognosis. Using image segmentation techniques and a Vision Transformer, TRAITER predicts HF likelihood from images of cardiac tissue cell nuclear morphology, and the potential for left ventricular reverse remodeling (LVRR) from dual-stained images of cell nuclei and DNA damage markers. In HF prediction using 31 158 images from 9 patients, TRAITER achieved 83.1% accuracy. For LVRR prediction with 231 840 images from 46 patients, TRAITER attained 84.2% accuracy for individual images and 92.9% for individual patients. TRAITER outperformed other neural network models in terms of receiver operating characteristic and precision-recall curves. Our method promises to advance decision-making in personalized HF medicine.
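A hedged sketch of what the classification stage could look like — not the TRAITER code; this simply shows a torchvision Vision Transformer with a two-class head producing a per-image HF likelihood, with all dimensions and the binary labeling scheme assumed:

```python
# Vision Transformer binary classifier sketch (illustrative, not TRAITER itself).
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

model = vit_b_16(weights=None)                                  # 224x224-input ViT
model.heads.head = nn.Linear(model.heads.head.in_features, 2)   # HF vs. non-HF head

x = torch.randn(4, 3, 224, 224)            # toy batch of nuclear-morphology patches
logits = model(x)
probs = logits.softmax(dim=1)[:, 1]        # per-image HF likelihood
print(probs.shape)                         # torch.Size([4])
```

Patient-level predictions (as in the 92.9% figure) would then aggregate per-image probabilities across all patches from one patient, e.g. by averaging.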
Availability and implementation: The source code and data are available at the following link: https://github.com/HamanoLaboratory/predict-of-HF-and-LVRR.
{"title":"TRAITER: transformer-guided diagnosis and prognosis of heart failure using cell nuclear morphology and DNA damage marker.","authors":"Hiromu Hayashi, Toshiyuki Ko, Zhehao Dai, Kanna Fujita, Seitaro Nomura, Hiroki Kiyoshima, Shinya Ishihara, Momoko Hamano, Issei Komuro, Yoshihiro Yamanishi","doi":"10.1093/bioinformatics/btae610","DOIUrl":"10.1093/bioinformatics/btae610","url":null,"abstract":"<p><strong>Motivation: </strong>Heart failure (HF), a major cause of morbidity and mortality, necessitates precise diagnostic and prognostic methods.</p><p><strong>Results: </strong>This study presents a novel deep learning approach, Transformer-based Analysis of Images of Tissue for Effective Remedy (TRAITER), for HF diagnosis and prognosis. Using image segmentation techniques and a Vision Transformer, TRAITER predicts HF likelihood from cardiac tissue cell nuclear morphology images and the potential for left ventricular reverse remodeling (LVRR) from dual-stained images with cell nuclei and DNA damage markers. In HF prediction using 31 158 images from 9 patients, TRAITER achieved 83.1% accuracy. For LVRR prediction with 231 840 images from 46 patients, TRAITER attained 84.2% accuracy for individual images and 92.9% for individual patients. TRAITER outperformed other neural network models in terms of receiver operating characteristics, and precision-recall curves. Our method promises to advance personalized HF medicine decision-making.</p><p><strong>Availability and implementation: </strong>The source code and data are available at the following link: https://github.com/HamanoLaboratory/predict-of-HF-and-LVRR.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11552630/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142482950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
iSeq: an integrated tool to fetch public sequencing data
Pub Date: 2024-11-01
DOI: 10.1093/bioinformatics/btae641
Haoyu Chao, Zhuojin Li, Dijun Chen, Ming Chen
Motivation: High-throughput next-generation sequencing (NGS) technologies are increasingly used to address diverse biological questions. Despite the rich information in NGS data, particularly with the growing datasets in repositories such as the Genome Sequence Archive (GSA) at NGDC, programmatic access to public sequencing data and metadata remains limited.
Results: We developed iSeq to enable quick and straightforward retrieval of metadata and NGS data from multiple databases via a command-line interface. iSeq supports simultaneous retrieval from the GSA, SRA, ENA, and DDBJ databases. It handles over 25 different accession formats and supports Aspera downloads, parallel and multi-threaded downloads, FASTQ file merging, and integrity verification, simplifying data acquisition and enhancing the capacity to reanalyze NGS data.
Availability and implementation: iSeq is freely available on Bioconda (https://anaconda.org/bioconda/iseq) and GitHub (https://github.com/BioOmics/iSeq).
{"title":"iSeq: an integrated tool to fetch public sequencing data.","authors":"Haoyu Chao, Zhuojin Li, Dijun Chen, Ming Chen","doi":"10.1093/bioinformatics/btae641","DOIUrl":"10.1093/bioinformatics/btae641","url":null,"abstract":"<p><strong>Motivation: </strong>High-throughput sequencing technologies [next-generation sequencing (NGS)] are increasingly used to address diverse biological questions. Despite the rich information in NGS data, particularly with the growing datasets from repositories like the Genome Sequence Archive (GSA) at NGDC, programmatic access to public sequencing data and metadata remains limited.</p><p><strong>Results: </strong>We developed iSeq to enable quick and straightforward retrieval of metadata and NGS data from multiple databases via the command-line interface. iSeq supports simultaneous retrieval from GSA, SRA, ENA, and DDBJ databases. It handles over 25 different accession formats, supports Aspera downloads, parallel downloads, multi-threaded processes, FASTQ file merging, and integrity verification, simplifying data acquisition and enhancing the capacity for reanalyzing NGS data.</p><p><strong>Availability and implementation: </strong>iSeq is freely available on Bioconda (https://anaconda.org/bioconda/iseq) and GitHub (https://github.com/BioOmics/iSeq).</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11561040/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142514577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generative haplotype prediction outperforms statistical methods for small variant detection in next-generation sequencing data
Pub Date: 2024-11-01
DOI: 10.1093/bioinformatics/btae565
Brendan O'Fallon, Ashini Bolia, Jacob Durtschi, Luobin Yang, Eric Fredrickson, Hunter Best
Motivation: Detection of germline variants in next-generation sequencing data is an essential component of modern genomics analysis. Variant detection tools typically rely on algorithmic and statistical techniques such as de Bruijn graphs or hidden Markov models, often coupled with heuristics and thresholds to maximize accuracy. Despite significant progress in recent years, current methods still generate thousands of false-positive detections in a typical human whole genome, creating a significant manual review burden.
Results: We introduce a new approach that replaces the handcrafted statistical techniques of previous methods with a single deep generative model. Using a standard transformer-based encoder and double-decoder architecture, our model learns to construct diploid germline haplotypes in the same generative fashion as modern large language models. We train our model on 37 whole-genome sequences from Genome-in-a-Bottle samples and demonstrate that it learns to produce accurate haplotypes with correct phase and genotype for all classes of small variants. We compare our method, called Jenever, to FreeBayes, GATK HaplotypeCaller, Clair3, and DeepVariant, and demonstrate that it has superior overall accuracy. At F1-maximizing quality thresholds, our model delivers the highest sensitivity and precision and the fewest genotyping errors for insertion and deletion variants. For single nucleotide variants, our model demonstrates the highest sensitivity, at somewhat lower precision, and achieves the highest overall F1 score among all callers tested.
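A rough architectural sketch of the encoder/double-decoder idea — two decoders over a shared encoder memory, one per haplotype. This is not the Jenever code: the vocabulary, dimensions, layer counts, shared embedding, and omission of causal masks are all simplifications for illustration:

```python
# Encoder with two haplotype decoders, illustrative only (not Jenever).
import torch
import torch.nn as nn

class HaplotypeModel(nn.Module):
    def __init__(self, d_model=128, vocab=6):   # e.g. A, C, G, T, gap, start
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)   # shared embedding (simplification)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), 4)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.dec_h1 = nn.TransformerDecoder(dec_layer, 2)   # decoder for haplotype 1
        self.dec_h2 = nn.TransformerDecoder(dec_layer, 2)   # decoder for haplotype 2
        self.out = nn.Linear(d_model, vocab)

    def forward(self, pileup_tokens, prev_h1, prev_h2):
        # encode read-pileup features once, decode both haplotypes against it
        memory = self.encoder(self.embed(pileup_tokens))
        h1 = self.out(self.dec_h1(self.embed(prev_h1), memory))
        h2 = self.out(self.dec_h2(self.embed(prev_h2), memory))
        return h1, h2   # per-position base logits for each haplotype

model = HaplotypeModel()
tok = torch.randint(0, 6, (1, 64))                 # toy pileup token sequence
logits1, logits2 = model(tok, tok[:, :16], tok[:, :16])
print(logits1.shape, logits2.shape)                # (1, 16, 6) each
```

At inference, each decoder would emit bases autoregressively (with causal masking), and variants would be read off by aligning the generated haplotypes to the reference.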
Availability and implementation: Jenever is implemented as a Python-based command-line tool. Source code is available at https://github.com/ARUP-NGS/jenever/.
{"title":"Generative haplotype prediction outperforms statistical methods for small variant detection in next-generation sequencing data.","authors":"Brendan O'Fallon, Ashini Bolia, Jacob Durtschi, Luobin Yang, Eric Fredrickson, Hunter Best","doi":"10.1093/bioinformatics/btae565","DOIUrl":"10.1093/bioinformatics/btae565","url":null,"abstract":"<p><strong>Motivation: </strong>Detection of germline variants in next-generation sequencing data is an essential component of modern genomics analysis. Variant detection tools typically rely on statistical algorithms such as de Bruijn graphs or Hidden Markov models, and are often coupled with heuristic techniques and thresholds to maximize accuracy. Despite significant progress in recent years, current methods still generate thousands of false-positive detections in a typical human whole genome, creating a significant manual review burden.</p><p><strong>Results: </strong>We introduce a new approach that replaces the handcrafted statistical techniques of previous methods with a single deep generative model. Using a standard transformer-based encoder and double-decoder architecture, our model learns to construct diploid germline haplotypes in a generative fashion identical to modern large language models. We train our model on 37 whole genome sequences from Genome-in-a-Bottle samples, and demonstrate that our method learns to produce accurate haplotypes with correct phase and genotype for all classes of small variants. We compare our method, called Jenever, to FreeBayes, GATK HaplotypeCaller, Clair3, and DeepVariant, and demonstrate that our method has superior overall accuracy compared to other methods. At F1-maximizing quality thresholds, our model delivers the highest sensitivity, precision, and the fewest genotyping errors for insertion and deletion variants. For single nucleotide variants, our model demonstrates the highest sensitivity but at somewhat lower precision, and achieves the highest overall F1 score among all callers we tested.</p><p><strong>Availability and implementation: </strong>Jenever is implemented as a python-based command line tool. Source code is available at https://github.com/ARUP-NGS/jenever/.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11549014/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142303224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lifestyle factors in the biomedical literature: an ontology and comprehensive resources for named entity recognition
Pub Date: 2024-11-01
DOI: 10.1093/bioinformatics/btae613
Esmaeil Nourani, Mikaela Koutrouli, Yijia Xie, Danai Vagiaki, Sampo Pyysalo, Katerina Nastou, Søren Brunak, Lars Juhl Jensen
Motivation: Although lifestyle factors (LSFs) are increasingly acknowledged to shape individual health trajectories, particularly in chronic diseases, they have not yet been systematically described in the biomedical literature. This is in part because no named entity recognition (NER) system exists that can comprehensively detect all types of LSFs in text. The task is challenging due to their inherent diversity, the lack of a comprehensive LSF classification for dictionary-based NER, and the lack of a corpus for deep learning-based NER.
Results: We present a novel lifestyle factor ontology (LSFO), which we used to develop a dictionary-based system for the recognition and normalization of LSFs. Additionally, we introduce a manually annotated corpus for LSFs (LSF200), suitable for training and evaluating NER systems, and use it to train a transformer-based system. Evaluating both NER systems on the corpus revealed an F-score of 64% for the dictionary-based system and 76% for the transformer-based system. Large-scale application of these systems to PubMed abstracts and PMC Open Access articles identified over 300 million mentions of LSFs in the biomedical literature.
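To show the shape of dictionary-based NER with normalization, here is a minimal sketch. The term-to-identifier mapping below is a hypothetical stand-in for the LSFO dictionary, and real systems add tokenization, term variants, and disambiguation on top of this:

```python
# Minimal dictionary-based NER with normalization (illustrative only).
import re

lsfo = {  # hypothetical term -> ontology identifier mapping
    "smoking": "LSFO:0000123",
    "physical activity": "LSFO:0000456",
    "mediterranean diet": "LSFO:0000789",
}
# prefer longer terms so "physical activity" wins over any shorter overlap
pattern = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, lsfo), key=len, reverse=True)) + r")\b",
    re.IGNORECASE)

text = "Smoking and a Mediterranean diet modulate cardiovascular risk."
for m in pattern.finditer(text):
    print(m.start(), m.end(), m.group(1), "->", lsfo[m.group(1).lower()])
```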
Availability and implementation: The LSFO, the annotated LSF200 corpus, and the LSFs detected in PubMed and PMC-OA articles by both NER systems are available under open licenses via the following GitHub repository: https://github.com/EsmaeilNourani/LSFO-expansion. This repository contains links to two associated GitHub repositories and a Zenodo project related to the study. LSFO is also available at BioPortal: https://bioportal.bioontology.org/ontologies/LSFO.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Lifestyle factors in the biomedical literature: an ontology and comprehensive resources for named entity recognition.","authors":"Esmaeil Nourani, Mikaela Koutrouli, Yijia Xie, Danai Vagiaki, Sampo Pyysalo, Katerina Nastou, Søren Brunak, Lars Juhl Jensen","doi":"10.1093/bioinformatics/btae613","DOIUrl":"10.1093/bioinformatics/btae613","url":null,"abstract":"<p><strong>Motivation: </strong>Despite lifestyle factors (LSFs) being increasingly acknowledged in shaping individual health trajectories, particularly in chronic diseases, they have still not been systematically described in the biomedical literature. This is in part because no named entity recognition (NER) system exists, which can comprehensively detect all types of LSFs in text. The task is challenging due to their inherent diversity, lack of a comprehensive LSF classification for dictionary-based NER, and lack of a corpus for deep learning-based NER.</p><p><strong>Results: </strong>We present a novel lifestyle factor ontology (LSFO), which we used to develop a dictionary-based system for recognition and normalization of LSFs. Additionally, we introduce a manually annotated corpus for LSFs (LSF200) suitable for training and evaluation of NER systems, and use it to train a transformer-based system. Evaluating the performance of both NER systems on the corpus revealed an F-score of 64% for the dictionary-based system and 76% for the transformer-based system. Large-scale application of these systems on PubMed abstracts and PMC Open Access articles identified over 300 million mentions of LSF in the biomedical literature.</p><p><strong>Availability and implementation: </strong>LSFO, the annotated LSF200 corpus, and the detected LSFs in PubMed and PMC-OA articles using both NER systems, are available under open licenses via the following GitHub repository: https://github.com/EsmaeilNourani/LSFO-expansion. This repository contains links to two associated GitHub repositories and a Zenodo project related to the study. LSFO is also available at BioPortal: https://bioportal.bioontology.org/ontologies/LSFO.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11543612/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142482939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
OpenDock: a PyTorch-based open-source framework for protein-ligand docking and modelling
Pub Date: 2024-11-01
DOI: 10.1093/bioinformatics/btae628
Qiuyue Hu, Zechen Wang, Jintao Meng, Weifeng Li, Jingjing Guo, Yuguang Mu, Sheng Wang, Liangzhen Zheng, Yanjie Wei
Motivation: Molecular docking is an invaluable computational tool with broad applications in computer-aided drug design and enzyme engineering. However, current molecular docking tools are typically implemented in languages such as C++ for calculation speed, which makes them inflexible and difficult to extend. Moreover, validating the effectiveness of external scoring functions for molecular docking and screening within these frameworks is challenging, and implementing more efficient sampling strategies is not straightforward.
Results: To address these limitations, we have developed an open-source molecular docking framework, OpenDock, based on Python and PyTorch. The framework supports the integration of multiple scoring functions: some can be used during molecular docking and pose optimization, while others can be used for post-processing scoring. For sampling, the current version supports simulated annealing and Monte Carlo optimization, and it can be extended with methods such as genetic algorithms and particle swarm optimization for sampling docking poses and protein side-chain orientations. Distance constraints are also implemented to enable covalent docking, restricted docking, or distance-map-guided pose sampling. Overall, this framework serves as a valuable tool in drug design and enzyme engineering, offering significant flexibility for most protein-ligand modelling tasks.
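The sampling loop such frameworks run is easy to sketch generically. The following is not the OpenDock API: `score_fn` and the 7-dimensional pose vector (translation plus quaternion) are stand-ins for a real scoring function and ligand transform, and the schedule parameters are arbitrary:

```python
# Generic simulated-annealing pose search (illustrative, not the OpenDock API).
import numpy as np

def score_fn(pose):                  # placeholder scoring function: quadratic well
    return float(np.sum(pose ** 2))

rng = np.random.default_rng(42)
pose = rng.normal(size=7)            # x, y, z translation + quaternion (assumed)
best, best_score = pose.copy(), score_fn(pose)
T = 1.0                              # initial temperature
for step in range(5000):
    cand = pose + rng.normal(scale=0.1, size=7)      # random perturbation
    dE = score_fn(cand) - score_fn(pose)
    if dE < 0 or rng.random() < np.exp(-dE / T):     # Metropolis acceptance
        pose = cand
        if score_fn(pose) < best_score:
            best, best_score = pose.copy(), score_fn(pose)
    T *= 0.999                                        # geometric cooling schedule
print(f"best score {best_score:.4f} after annealing")
```

In a PyTorch-based framework, making `score_fn` differentiable additionally allows gradient-based pose refinement between stochastic moves.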
Availability and implementation: OpenDock is publicly available at: https://github.com/guyuehuo/opendock.
{"title":"OpenDock: a pytorch-based open-source framework for protein-ligand docking and modelling.","authors":"Qiuyue Hu, Zechen Wang, Jintao Meng, Weifeng Li, Jingjing Guo, Yuguang Mu, Sheng Wang, Liangzhen Zheng, Yanjie Wei","doi":"10.1093/bioinformatics/btae628","DOIUrl":"10.1093/bioinformatics/btae628","url":null,"abstract":"<p><strong>Motivation: </strong>Molecular docking is an invaluable computational tool with broad applications in computer-aided drug design and enzyme engineering. However, current molecular docking tools are typically implemented in languages such as C++ for calculation speed, which lack flexibility and user-friendliness for further development. Moreover, validating the effectiveness of external scoring functions for molecular docking and screening within these frameworks is challenging, and implementing more efficient sampling strategies is not straightforward.</p><p><strong>Results: </strong>To address these limitations, we have developed an open-source molecular docking framework, OpenDock, based on Python and PyTorch. This framework supports the integration of multiple scoring functions; some can be utilized during molecular docking and pose optimization, while others can be used for post-processing scoring. In terms of sampling, the current version of this framework supports simulated annealing and Monte Carlo optimization. Additionally, it can be extended to include methods such as genetic algorithms and particle swarm optimization for sampling docking poses and protein side chain orientations. Distance constraints are also implemented to enable covalent docking, restricted docking or distance map constraints guided pose sampling. Overall, this framework serves as a valuable tool in drug design and enzyme engineering, offering significant flexibility for most protein-ligand modelling tasks.</p><p><strong>Availability and implementation: </strong>OpenDock is publicly available at: https://github.com/guyuehuo/opendock.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11552628/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142482943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Target controllability: a feed-forward greedy algorithm in complex networks, meeting Kalman's rank condition
Pub Date: 2024-11-01
DOI: 10.1093/bioinformatics/btae630
Seyedeh Fatemeh Khezri, Ali Ebrahimi, Changiz Eslahchi
Motivation: The concept of controllability in complex networks concerns finding the minimal set of driver vertices through which external signals can be applied to control the state of every vertex in the network. Target controllability refines this concept by requiring control of only a designated subset of vertices; both problems are known to be NP-hard. Crucially, the effectiveness of a driver set in achieving control is contingent on satisfying a specific rank condition introduced by Kalman. Structural controllability, on the other hand, provides a complementary approach that identifies driver vertices based on the network's structural properties alone. However, in structural controllability approaches, the Kalman condition may not always be satisfied.
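For concreteness, Kalman's condition states that a linear system x' = Ax + Bu is controllable iff rank([B, AB, A²B, …, Aⁿ⁻¹B]) = n; for target controllability one checks the rank of C·[B, AB, …] for a target-selection matrix C. A small worked example (the 3-vertex chain network is invented for illustration):

```python
# Kalman rank condition on a toy 3-vertex chain network.
import numpy as np

def controllability_matrix(A, B):
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])   # B, AB, A^2 B, ...
    return np.hstack(blocks)

A = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [0., 0., 0.]])            # chain: 3 -> 2 -> 1
B = np.array([[0.], [0.], [1.]])        # single driver signal on vertex 3
K = controllability_matrix(A, B)
print(np.linalg.matrix_rank(K))         # 3 -> the full network is controllable

C = np.array([[1., 0., 0.]])            # target set: vertex 1 only
print(np.linalg.matrix_rank(C @ K))     # 1 = |targets| -> target controllable
```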
Results: In this study, we address the challenge of target controllability by proposing a feed-forward greedy algorithm designed to handle large networks efficiently while meeting the Kalman controllability rank condition. We further enhance the method's efficacy by integrating it with Barabási et al.'s structural controllability approach. This integration allows for a more comprehensive control strategy, leveraging both the dynamical requirements specified by Kalman's rank condition and the structural properties of the network. Empirical evaluation across various network topologies demonstrates the superior performance of our algorithm compared to existing methods, consistently requiring fewer driver vertices for effective control. Additionally, applying our method to protein-protein interaction networks associated with breast cancer reveals potential drug repurposing candidates, underscoring its biomedical relevance. This study highlights the importance of addressing both the structural and dynamical aspects of network controllability for advancing control strategies in complex systems.
Availability and implementation: The source code is freely available at: https://github.com/fatemeKhezry/targetControllability.
{"title":"Target controllability: a feed-forward greedy algorithm in complex networks, meeting Kalman's rank condition.","authors":"Seyedeh Fatemeh Khezri, Ali Ebrahimi, Changiz Eslahchi","doi":"10.1093/bioinformatics/btae630","DOIUrl":"10.1093/bioinformatics/btae630","url":null,"abstract":"<p><strong>Motivation: </strong>The concept of controllability within complex networks is pivotal in determining the minimal set of driver vertices required for the exertion of external signals, thereby enabling control over the entire network's vertices. Target controllability further refines this concept by focusing on a subset of vertices within the network as the specific targets for control, both of which are known to be NP-hard problems. Crucially, the effectiveness of the driver set in achieving control of the network is contingent upon satisfying a specific rank condition, as introduced by Kalman. On the other hand, structural controllability provides a complementary approach to understanding network control, emphasizing the identification of driver vertices based on the network's structural properties. However, in structural controllability approaches, the Kalman condition may not always be satisfied.</p><p><strong>Results: </strong>In this study, we address the challenge of target controllability by proposing a feed-forward greedy algorithm designed to efficiently handle large networks while meeting the Kalman controllability rank condition. We further enhance our method's efficacy by integrating it with Barabasi et al.'s structural controllability approach. This integration allows for a more comprehensive control strategy, leveraging both the dynamical requirements specified by Kalman's rank condition and the structural properties of the network. Empirical evaluation across various network topologies demonstrates the superior performance of our algorithms compared to existing methods, consistently requiring fewer driver vertices for effective control. Additionally, our method's application to protein-protein interaction networks associated with breast cancer reveals potential drug repurposing candidates, underscoring its biomedical relevance. This study highlights the importance of addressing both structural and dynamical aspects of network controllability for advancing control strategies in complex systems.</p><p><strong>Availability and implementation: </strong>The source code is available for free at:Https://github.com/fatemeKhezry/targetControllability.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11568069/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142514579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}