Bioinformatics Advances: Latest Articles

Transfer learning improves performance in volumetric electron microscopy organelle segmentation across tissues.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-04-02 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf021

Ronald Xie, Ben Mulcahy, Ali Darbandi, Sagar Marwah, Fez Ali, Yuna Lee, Gunes Parlakgul, Gokhan S Hotamisligil, Bo Wang, Sonya MacParland, Mei Zhen, Gary D Bader

Motivation: Volumetric electron microscopy (VEM) enables nanoscale resolution three-dimensional imaging of biological samples. Identification and labeling of organelles, cells, and other structures in the image volume is required for image interpretation, but manual labeling is extremely time-consuming. This can be automated using deep learning segmentation algorithms, but these traditionally require substantial manual annotation for training and typically these labeled datasets are unavailable for new samples.

Results: We show that transfer learning can help address this challenge. By pretraining on VEM data from multiple mammalian tissues and organelle types and then fine-tuning on a target dataset, we segment multiple organelles at high performance, yet require a relatively small amount of new training data. We benchmark our method on three published VEM datasets and a new rat liver dataset we imaged over a 56×56×11 μm volume measuring 7000×7000×219 px using serial block face scanning electron microscopy with corresponding manually labeled mitochondria and endoplasmic reticulum structures. We further benchmark our approach against the Segment Anything Model 2 and MitoNet in zero-shot, prompted, and fine-tuned settings.

Availability and implementation: Our rat liver dataset's raw image volume, manual ground truth annotation, and model predictions are freely shared at github.com/Xrioen/cross-tissue-transfer-learning-in-VEM.
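As a rough aid to reading the dataset dimensions quoted above, the following sketch converts the stated physical extent and pixel grid into an approximate voxel size. It assumes the 56×56×11 μm extent maps exactly onto the 7000×7000×219 px grid (no padding or cropping); the function name `voxel_size_nm` is introduced here for illustration and does not come from the paper.

```python
# Illustrative arithmetic only: assumes the stated physical extent maps
# exactly onto the stated pixel grid (no padding or cropping).
extent_um = (56.0, 56.0, 11.0)   # x, y, z extent in micrometres (from the abstract)
shape_px = (7000, 7000, 219)     # x, y, z size in pixels/slices (from the abstract)


def voxel_size_nm(extent_um, shape_px):
    """Return approximate voxel edge lengths in nanometres (hypothetical helper)."""
    return tuple(1000.0 * extent / count for extent, count in zip(extent_um, shape_px))


vx, vy, vz = voxel_size_nm(extent_um, shape_px)
print(f"x: {vx:.1f} nm, y: {vy:.1f} nm, z: {vz:.1f} nm")
# prints "x: 8.0 nm, y: 8.0 nm, z: 50.2 nm"
```

Under that assumption the volume is strongly anisotropic: each slice is several times thicker than an in-plane pixel is wide, which is typical for serial block face imaging.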



Citations: 0

Adaptive adjustment of profile HMM significance thresholds improves functional and metabolic insights into microbial genomes.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-03-21 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf039

Kathryn Kananen, Iva Veseli, Christian J Quiles Pérez, Samuel E Miller, A Murat Eren, Patrick H Bradley

Motivation: Gene function annotation in microbial genomes and metagenomes is a fundamental in silico first step toward understanding metabolic potential and determinants of fitness. The Kyoto Encyclopedia of Genes and Genomes publishes a curated list of profile hidden Markov models to identify orthologous gene families (KOfams) with roles in metabolism. However, the computational tools that rely upon KOfams yield different annotations for the same set of genomes, leading to different downstream biological inferences.

Results: Here, we apply three open-source software tools that can annotate KOfams to genomes of phylogenetically diverse bacterial families from host-associated and free-living biomes. We use multiple computational approaches to benchmark these methods and investigate individual case studies where they differ. Our results show that despite their fundamental similarities, these methods have different annotation rates and quality. In particular, a method that adaptively tunes sequence similarity thresholds substantially improves sensitivity while maintaining high accuracy. We observe particularly large improvements for protein families with few reference sequences, or when annotating genomes from nonmodel organisms (such as gut-dwelling Lachnospiraceae). Our findings show that small improvements in annotation workflows can maximize the utility of existing databases and meaningfully improve in silico characterizations of microbial metabolism.

Availability and implementation: Anvi'o is available at https://anvio.org under the GNU GPL license. Scripts and workflow are available at https://github.com/pbradleylab/2023-anvio-comparison under the MIT license.
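The idea of adaptively relaxing a per-family score threshold can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure implemented by any of the benchmarked tools: the function `annotate`, the `relax_fraction` value of 0.5, the family names, and the scores are all hypothetical. A gene is first annotated if a hit clears its family's own threshold; otherwise it is annotated only if every hit above a relaxed threshold points to the same family.

```python
from collections import defaultdict


def annotate(hits, relax_fraction=0.5):
    """Assign one family per gene (hypothetical sketch).

    hits: iterable of (gene, family, bitscore, family_threshold).
    relax_fraction: illustrative assumption, not a value from the paper.
    """
    by_gene = defaultdict(list)
    for gene, family, score, threshold in hits:
        by_gene[gene].append((family, score, threshold))

    annotations = {}
    for gene, gene_hits in by_gene.items():
        # Pass 1 (strict): keep the best-scoring hit that clears its own threshold.
        strict = [(score, family) for family, score, threshold in gene_hits
                  if score >= threshold]
        if strict:
            annotations[gene] = (max(strict)[1], "strict")
            continue
        # Pass 2 (relaxed): accept only if all near-threshold hits agree on one family.
        relaxed = {family for family, score, threshold in gene_hits
                   if score >= relax_fraction * threshold}
        if len(relaxed) == 1:
            annotations[gene] = (relaxed.pop(), "relaxed")
    return annotations


# Made-up hits: (gene, family, bitscore, family_threshold)
example_hits = [
    ("g1", "famA", 250.0, 200.0),  # clears the strict threshold
    ("g2", "famB", 150.0, 200.0),  # below strict, unambiguous when relaxed
    ("g3", "famC", 120.0, 200.0),  # below strict, two candidate families
    ("g3", "famD", 130.0, 220.0),
    ("g4", "famE", 40.0, 200.0),   # too weak even when relaxed
]
result = annotate(example_hits)
print(result)
```

In this toy input, `g2` gains an annotation that a fixed threshold would miss, while the ambiguous `g3` and the weak `g4` stay unannotated, which is the kind of sensitivity-versus-accuracy trade-off the abstract describes.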


Citations: 0

peptidy: a light-weight Python library for peptide representation in machine learning.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-03-21 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf058

Rıza Özçelik, Laura van Weesep, Sarah de Ruiter, Francesca Grisoni

Motivation: Peptides are widely used in applications ranging from drug discovery to food technologies. Machine learning has become increasingly prominent in accelerating the search for new peptides, and user-friendly computational tools can further enhance these efforts.

Results: In this work, we introduce peptidy, a lightweight Python library that facilitates converting peptides (expressed as amino acid sequences) to numerical representations suited to machine learning. peptidy is free from external dependencies, integrates seamlessly into modern Python environments, and supports a range of encoding strategies suitable for both predictive and generative machine learning approaches. Additionally, peptidy supports peptides with post-translational modifications, such as phosphorylation, acetylation, and methylation, thereby extending the functionality of existing Python packages for peptides and proteins.

Availability and implementation: peptidy is freely available with a permissive license on GitHub at the following URL: https://github.com/molML/peptidy.
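To make "converting peptides to numerical representations" concrete, here is a dependency-free sketch of one common encoding, one-hot encoding with zero padding. It does not use or reproduce the peptidy API; the names `AMINO_ACIDS`, `INDEX`, and `one_hot_encode` are defined here purely for illustration, and post-translational modifications are not handled.

```python
# Generic illustration of one-hot peptide encoding; NOT the peptidy API.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
INDEX = {residue: i for i, residue in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_length):
    """Return a max_length x 20 matrix as a list of lists.

    Rows beyond the end of the sequence stay all-zero (padding).
    """
    if len(sequence) > max_length:
        raise ValueError("sequence longer than max_length")
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_length)]
    for position, residue in enumerate(sequence):
        if residue not in INDEX:
            raise ValueError(f"unsupported residue: {residue!r}")
        matrix[position][INDEX[residue]] = 1
    return matrix


encoded = one_hot_encode("ACDK", max_length=6)
print(len(encoded), len(encoded[0]))  # prints "6 20"
```

Padding every peptide to the same `max_length` yields fixed-size inputs, which is what most predictive models expect; generative models more often work with integer token indices instead.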



Citations: 0

AntiFold: improved structure-based antibody design using inverse folding.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-03-21 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbae202

Magnus Haraldson Høie, Alissa M Hummer, Tobias H Olsen, Broncio Aguilar-Sanjuan, Morten Nielsen, Charlotte M Deane

Summary: The design and optimization of antibodies requires an intricate balance across multiple properties. Protein inverse folding models, capable of generating diverse sequences folding into the same structure, are promising tools for maintaining structural integrity during antibody design. Here, we present AntiFold, an antibody-specific inverse folding model, fine-tuned from ESM-IF1 on solved and predicted antibody structures. AntiFold outperforms existing inverse folding tools on sequence recovery across complementarity-determining regions, with designed sequences showing high structural similarity to their solved counterpart. It additionally achieves stronger correlations when predicting antibody-antigen binding affinity in a zero-shot manner. AntiFold assigns low probabilities to mutations that disrupt antigen binding, synergizing with protein language model residue probabilities, and demonstrates promise for guiding antibody optimization while retaining structure-related properties.

Availability and implementation: AntiFold is freely available under the BSD 3-Clause as a web server (https://opig.stats.ox.ac.uk/webapps/antifold/) and pip-installable package (https://github.com/oxpig/AntiFold).



Citations: 0

Zepyros: a webserver to evaluate the shape complementarity of protein-protein interfaces.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-03-20 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf051

Mattia Miotto, Lorenzo Di Rienzo, Leonardo Bo', Giancarlo Ruocco, Edoardo Milanetti

Motivation: Shape complementarity of molecular surfaces at the interfaces is a well-known characteristic of protein-protein binding regions, and it is critical in influencing the stability of the complex. Measuring such complementarity is of great importance for a number of theoretical and practical implications; however, only a limited number of tools are currently available to efficiently and rapidly assess it.

Results: Here, we introduce Zepyros (ZErnike Polynomials analYsis of pROtein Shapes), a webserver for fast measurement of the shape complementarity between two molecular interfaces of a given protein-protein complex using structural information. Zepyros is implemented as a publicly available tool with a user-friendly interface.

Availability and implementation: Our server can be found at the following link (all major browsers supported): https://zepyros.bio-groups.com.

Citations: 0

Aggregating residue-level protein language model embeddings with optimal transport.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-03-20 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf060

Navid NaderiAlizadeh, Rohit Singh

Motivation: Protein language models (PLMs) have emerged as powerful approaches for mapping protein sequences into embeddings suitable for various applications. As protein representation schemes, PLMs generate per-token (i.e. per-residue) representations, resulting in variable-sized outputs based on protein length. This variability poses a challenge for protein-level prediction tasks that require uniform-sized embeddings for consistent analysis across different proteins. Previous work has typically used average pooling to summarize token-level PLM outputs, but it is unclear whether this method effectively prioritizes the relevant information across token-level representations.

Results: We introduce a novel method utilizing optimal transport to convert variable-length PLM outputs into fixed-length representations. We conceptualize per-token PLM outputs as samples from a probabilistic distribution and employ sliced-Wasserstein distances to map these samples against a reference set, creating a Euclidean embedding in the output space. The resulting embedding is agnostic to the length of the input and represents the entire protein. We demonstrate the superiority of our method over average pooling for several downstream prediction tasks, particularly with constrained PLM sizes, enabling smaller-scale PLMs to match or exceed the performance of average-pooled larger-scale PLMs. Our aggregation scheme is especially effective for longer protein sequences by capturing essential information that might be lost through average pooling.

Availability and implementation: Our implementation code can be found at https://github.com/navid-naderi/PLM_SWE.
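The aggregation idea described above can be sketched in plain Python. This is a simplified illustration of a sliced-Wasserstein embedding, not the authors' implementation (see their repository for that): the reference set and slice directions are drawn at random here as an assumption, and all function names are hypothetical. For each slice, the per-token projections are sorted, resampled to as many quantiles as there are reference points, and compared with the reference projections, which gives an output whose length does not depend on the number of tokens.

```python
import math
import random


def unit_vector(dim, rng):
    """Draw a random direction on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(points, direction):
    """Project each point onto a direction (dot product)."""
    return [sum(p_i * d_i for p_i, d_i in zip(p, direction)) for p in points]


def resample_sorted(values, m):
    """Linearly interpolate the sorted values at m evenly spaced quantile positions."""
    s = sorted(values)
    n = len(s)
    out = []
    for r in range(m):
        pos = (r / (m - 1)) * (n - 1) if m > 1 else 0.5 * (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        out.append(s[lo] + (pos - lo) * (s[hi] - s[lo]))
    return out


def sliced_wasserstein_embedding(tokens, reference, directions):
    """Map a variable-length token set to len(directions) * len(reference) numbers."""
    m = len(reference)
    embedding = []
    for direction in directions:
        token_quantiles = resample_sorted(project(tokens, direction), m)
        ref_proj = project(reference, direction)
        # Match the r-th smallest reference projection with the r-th token quantile
        # (the one-dimensional optimal transport map) and store the displacement.
        order = sorted(range(m), key=lambda i: ref_proj[i])
        slice_out = [0.0] * m
        for rank, idx in enumerate(order):
            slice_out[idx] = token_quantiles[rank] - ref_proj[idx]
        embedding.extend(slice_out)
    return embedding


rng = random.Random(0)
dim, num_ref, num_slices = 8, 4, 3  # toy sizes, chosen for illustration
reference = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_ref)]
directions = [unit_vector(dim, rng) for _ in range(num_slices)]
short_protein = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
long_protein = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]

emb_short = sliced_wasserstein_embedding(short_protein, reference, directions)
emb_long = sliced_wasserstein_embedding(long_protein, reference, directions)
print(len(emb_short), len(emb_long))  # prints "12 12"
```

Unlike average pooling, which keeps only the mean of each embedding dimension, this representation retains quantile information along each slice, which is one plausible reason it could preserve more of the signal in long sequences.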


Citations: 0

Biological databases in the age of generative artificial intelligence.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-03-20 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf044

Mihai Pop, Teresa K Attwood, Judith A Blake, Philip E Bourne, Ana Conesa, Terry Gaasterland, Lawrence Hunter, Carl Kingsford, Oliver Kohlbacher, Thomas Lengauer, Scott Markel, Yves Moreau, William S Noble, Christine Orengo, B F Francis Ouellette, Laxmi Parida, Natasa Przulj, Teresa M Przytycka, Shoba Ranganathan, Russell Schwartz, Alfonso Valencia, Tandy Warnow

Summary: Modern biological research critically depends on public databases. The introduction and propagation of errors within and across databases can lead to wasted resources as scientists are led astray by bad data or have to conduct expensive validation experiments. The emergence of generative artificial intelligence systems threatens to compound this problem owing to the ease with which massive volumes of synthetic data can be generated. We provide an overview of several key issues that occur within the biological data ecosystem and make several recommendations aimed at reducing data errors and their propagation. We specifically highlight the critical importance of improved educational programs aimed at biologists and life scientists that emphasize best practices in data engineering. We also argue for increased theoretical and empirical research on data provenance, error propagation, and on understanding the impact of errors on analytic pipelines. Furthermore, we recommend enhanced funding for the stewardship and maintenance of public biological databases.

Availability and implementation: Not applicable.


Citations: 0

S2Map: a novel computational platform for identifying secretio-types through cell secretion-signal map.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-03-20 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf059

Zongliang Yue, Lang Zhou, Peizhen Sun, Xuejia Kang, Fengyuan Huang, Pengyu Chen

Motivation: Cell communication is predominantly governed by secreted proteins, whose diverse secretion patterns often signify underlying physiological irregularities. Understanding these secreted signals at an individual cell level is crucial for gaining insights into regulatory mechanisms involving various molecular agents. To elucidate the array of cell secretion signals, which encompass different types of biomolecular secretion cues from individual immune cells, we introduce the secretion-signal map (S2Map).

Results: S2Map is an online interactive analytical platform designed to explore and interpret distinct cell secretion-signal patterns visually. It incorporates two innovative qualitative metrics, the signal inequality index (SII) and the signal coverage index (SCI), which are exquisitely sensitive in measuring dissymmetry and diffusion of signals in temporal data. S2Map's innovation lies in its depiction of signals through time-series analysis with multi-layer visualization. We tested the performance of the SII and SCI in distinguishing simulated signal diffusion models. S2Map hosts a repository of single-cell secretion-signal data for exploring cell secretio-types, a new form of cell phenotyping based on the cell secretion-signal pattern. We anticipate that S2Map will be a powerful tool to delve into the complexities of physiological systems, providing insights into the regulation of protein production, such as that of cytokines, at the remarkable resolution of single cells.

Availability and implementation: The S2Map server is publicly accessible via https://au-s2map.streamlit.app/.


Citations: 0

RNApysoforms: fast rendering interactive visualization of RNA isoform structure and expression in Python.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-03-14 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf057

Bernardo Aguzzoli Heberle, Madeline L Page, Emil K Gustavsson, Mina Ryten, Mark T W Ebbert

Summary: Alternative splicing generates multiple RNA isoforms from a single gene, enriching genetic diversity and impacting gene function. Effective visualization of these isoforms and their expression patterns is crucial but challenging due to limitations in existing tools. Traditional genome browsers lack programmability, while other tools offer limited customization, produce static plots, or cannot simultaneously display structures and expression levels. RNApysoforms was developed to address these gaps by providing a Python-based package that enables concurrent visualization of RNA isoform structures and expression data. Leveraging the plotly and polars libraries, it offers an interactive, customizable, and faster-rendering framework suitable for web applications, enhancing the analysis and dissemination of RNA isoform research.

Availability and implementation: RNApysoforms is a Python package available at https://github.com/UK-SBCoA-EbbertLab/RNApysoforms and https://zenodo.org/records/14941190 under the open-source MIT license. It can be installed using the pip package installer for Python. Thorough documentation and usage vignettes are available at https://rna-pysoforms.readthedocs.io/en/latest/.


Citations: 0

Challenges in predicting PROTAC-mediated protein-protein interfaces with AlphaFold reveal a general limitation on small interfaces.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances

Pub Date : 2025-03-14 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf056

Gilberto P Pereira, Corentin Gouzien, Paulo C T Souza, Juliette Martin

Motivation: Proteolysis Targeting Chimeras (PROTACs) are heterobifunctional molecules composed of two ligands, one binding a target protein and the other an E3 ligase complex, connected by a linker; they induce proximity-based degradation of the target protein. PROTACs are promising alternatives to conventional drugs against cancer. Predicting PROTAC-mediated complexes is often the first step in in silico PROTAC design pipelines. We previously noted that AlphaFold2 (AF2) fails to predict PROTAC-mediated complexes.

Results: Here, we investigate the potential causes of this limitation. We consider a set of 326 protein heterodimers orthogonal to the AF2 training set, and evaluate AF2 models focusing on the interface size and the presence of an interface ligand. Our results show that AF2-multimer predictions are sensitive to the size of the interface to be predicted, even in the absence of ligands, with the majority of models being incorrect for the smallest interfaces. We also benchmark both AF2 and AF3 on a set of 28 PROTAC-mediated dimers and show that AF3 does not significantly improve upon the accuracy of AF2. The low accuracy of AF2 on complexes with small interfaces has strong implications for computational PROTAC design pipelines, because PROTACs typically stabilize small interfaces, and more generally for any prediction task that involves small interfaces.
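As a rough illustration of the kind of analysis described above, the following Python sketch groups predicted models by interface size and reports the fraction judged correct in each bin. It is not the authors' pipeline: the records, the unit of interface size, the bin edges, and the names `models`, `BIN_EDGES`, `bin_label`, and `success_rate_by_bin` are all hypothetical, and the values are invented placeholders rather than results from the study.

```python
from collections import defaultdict

# Hypothetical records: (interface size, whether the model was judged correct).
# The unit (for example, buried surface area in square angstroms) and the
# correctness criterion are assumptions. These values are invented
# placeholders, NOT data from the study.
models = [
    (450.0, False), (520.0, False), (610.0, True),
    (900.0, True), (1100.0, False), (1300.0, True),
    (1800.0, True), (2100.0, True), (2500.0, True),
]

# Hypothetical bin edges; the binning used in the study is not stated here.
BIN_EDGES = [0.0, 750.0, 1500.0, float("inf")]


def bin_label(size):
    """Return the label of the half-open bin [lo, hi) that contains size."""
    for lo, hi in zip(BIN_EDGES, BIN_EDGES[1:]):
        if lo <= size < hi:
            return f"[{lo:g}, {hi:g})"
    raise ValueError(f"interface size {size} is outside all bins")


def success_rate_by_bin(records):
    """Map each interface-size bin to the fraction of correct models in it."""
    counts = defaultdict(lambda: [0, 0])  # label -> [n_correct, n_total]
    for size, correct in records:
        entry = counts[bin_label(size)]
        entry[0] += int(correct)
        entry[1] += 1
    return {label: c / n for label, (c, n) in counts.items()}


rates = success_rate_by_bin(models)
for label, rate in rates.items():
    print(f"{label}: {rate:.2f}")
```

With real per-model annotations in place of the placeholders, a success rate that rises from the smallest to the largest bin would correspond to the size sensitivity reported in the abstract.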

Availability and implementation: All the models analyzed in this article are available in the Zenodo archive https://zenodo.org/records/14810843.


Citations: 0