Bioinformatics: latest articles, page 7

Position-Specific Enrichment Ratio Matrix scores predict antibody variant properties from deep sequencing data.

IF 5.8 | Zone 3 (Biology) | Q1 Biochemical Research Methods

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad446

Matthew D Smith, Marshall A Case, Emily K Makowski, Peter M Tessier

Motivation: Deep sequencing of antibody and related protein libraries after phage or yeast-surface display sorting is widely used to identify variants with increased affinity, specificity, and/or improvements in key biophysical properties. Conventional approaches for identifying optimal variants typically use the frequencies of observation in enriched libraries or the corresponding enrichment ratios. However, these approaches disregard the vast majority of deep sequencing data and often fail to identify the best variants in the libraries.

Results: Here, we present a method, Position-Specific Enrichment Ratio Matrix (PSERM) scoring, that uses entire deep sequencing datasets from pre- and post-selections to score each observed protein variant. The PSERM scores are the sum of the site-specific enrichment ratios observed at each mutated position. We find that PSERM scores are much more reproducible and correlate more strongly with experimentally measured properties than frequencies or enrichment ratios, including for multiple antibody properties (affinity and non-specific binding) for a clinical-stage antibody (emibetuzumab). We expect that this method will be broadly applicable to diverse protein engineering campaigns.

Availability and implementation: All deep sequencing datasets and code to perform the analyses presented within are available via https://github.com/Tessier-Lab-UMich/PSERM_paper.
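The scoring idea in the Results paragraph (sum the site-specific enrichment ratios over the mutated positions) can be sketched roughly as follows. This is an illustrative sketch, not the code from the PSERM_paper repository: the log2 transform, the pseudocount, the function names, and the representation of each variant as a fixed-length string of its library positions are all assumptions made here.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids


def position_frequencies(seqs, pseudocount=1.0):
    """Per-position residue frequencies with a pseudocount (assumed smoothing)."""
    length = len(seqs[0])
    total = len(seqs) + pseudocount * len(ALPHABET)
    freqs = []
    for pos in range(length):
        counts = Counter(s[pos] for s in seqs)
        freqs.append({aa: (counts.get(aa, 0) + pseudocount) / total for aa in ALPHABET})
    return freqs


def build_pserm(pre_seqs, post_seqs, pseudocount=1.0):
    """Matrix of log2(post frequency / pre frequency) per position and residue."""
    f_pre = position_frequencies(pre_seqs, pseudocount)
    f_post = position_frequencies(post_seqs, pseudocount)
    return [
        {aa: math.log2(post[aa] / pre[aa]) for aa in ALPHABET}
        for pre, post in zip(f_pre, f_post)
    ]


def pserm_score(seq, pserm):
    """Score a variant as the sum of its per-position enrichment entries."""
    return sum(pserm[i][aa] for i, aa in enumerate(seq))


# Toy libraries (hypothetical reads covering two mutated positions).
pre = ["AC", "AD", "GC", "GD"]
post = ["AC", "AC", "AD", "GC"]
matrix = build_pserm(pre, post)
print(pserm_score("AC", matrix), pserm_score("GD", matrix))
```

Because every read contributes to the per-position frequencies, a variant seen only a handful of times still receives a score built from the whole dataset, which is the property the abstract contrasts with per-variant frequencies and enrichment ratios.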

Volume 39, issue 9. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10477941/pdf/

Citations: 0

An extensive benchmark study on biomedical text generation and mining with ChatGPT.

IF 5.8 | Zone 3 (Biology) | Q1 Biochemical Research Methods

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad557

Qijie Chen, Haotong Sun, Haoyang Liu, Yinghui Jiang, Ting Ran, Xurui Jin, Xianglu Xiao, Zhimin Lin, Hongming Chen, Zhangmin Niu

Motivation: In recent years, advances in natural language processing (NLP) technologies and deep learning hardware have led to significant improvements in large language models (LLMs). ChatGPT, the state-of-the-art LLM built on GPT-3.5 and GPT-4, shows excellent capabilities in general language understanding and reasoning. Researchers have also tested the GPT models on a variety of NLP-related tasks and benchmarks and obtained excellent results. Given its strong performance in everyday chat, researchers have begun to explore ChatGPT's capacity in areas of expertise that require professional education for humans; here we focus on the biomedical domain.

Results: To evaluate the performance of ChatGPT on biomedical tasks, this article presents a comprehensive benchmark study on the use of ChatGPT for biomedical corpora, including article abstracts, clinical trial descriptions, biomedical questions, and so on. Typical NLP tasks such as named entity recognition, relation extraction, sentence similarity, question answering, and document classification are included. Overall, ChatGPT achieved a BLURB score of 58.50, while the state-of-the-art model scored 84.30. Through a series of experiments, we demonstrated the effectiveness and versatility of ChatGPT in biomedical text understanding, reasoning, and generation, as well as the limitations of ChatGPT built on GPT-3.5.

Availability and implementation: All the datasets are available from the BLURB benchmark at https://microsoft.github.io/BLURB/index.html. The prompts are described in the article.

Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10562950/pdf/

Citations: 2

DeepMHCI: an anchor position-aware deep interaction model for accurate MHC-I peptide binding affinity prediction.

IF 5.8 | Zone 3 (Biology) | Q1 Biochemical Research Methods

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad551

Wei Qu, Ronghui You, Hiroshi Mamitsuka, Shanfeng Zhu

Motivation: Computationally predicting major histocompatibility complex class I (MHC-I) peptide binding affinity is an important problem in immunological bioinformatics, which is also crucial for the identification of neoantigens for personalized therapeutic cancer vaccines. Recent cutting-edge deep learning-based methods for this problem cannot achieve satisfactory performance, especially for non-9-mer peptides. This is because such methods generate the input by simply concatenating the two given sequences: a peptide and (the pseudo sequence of) an MHC class I molecule, which cannot precisely capture the anchor positions of the MHC binding motif for the peptides with variable lengths. We thus developed an anchor position-aware and high-performance deep model, DeepMHCI, with a position-wise gated layer and a residual binding interaction convolution layer. This allows the model to control the information flow in peptides to be aware of anchor positions and model the interactions between peptides and the MHC pseudo (binding) sequence directly with multiple convolutional kernels.

Results: The performance of DeepMHCI has been thoroughly validated by extensive experiments on four benchmark datasets under various settings, such as 5-fold cross-validation, validation with an independent testing set, external HPV vaccine identification, and external CD8+ epitope identification. Experimental results, together with visualization of binding motifs, demonstrate that DeepMHCI outperformed all competing methods, especially in binding prediction for non-9-mer peptides.

Availability and implementation: DeepMHCI is publicly available at https://github.com/ZhuLab-Fudan/DeepMHCI.
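To make the phrase "position-wise gated layer" more concrete, here is a minimal, generic gating sketch in plain Python. It is not DeepMHCI's actual layer: the per-position weight vectors, the sigmoid gate, and all names are assumptions chosen only to show how a learned gate could pass or suppress information at individual peptide positions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def position_wise_gate(embeddings, gate_weights, gate_biases):
    """Scale each position's embedding by its own gate value in (0, 1).

    embeddings:   list of L vectors, one per peptide position
    gate_weights: list of L weight vectors (one set of weights per position)
    gate_biases:  list of L floats
    """
    gated = []
    for vec, weights, bias in zip(embeddings, gate_weights, gate_biases):
        gate = sigmoid(sum(w * x for w, x in zip(weights, vec)) + bias)
        gated.append([gate * x for x in vec])
    return gated


# Hypothetical two-position example: the first gate is nearly open,
# the second nearly closed.
emb = [[1.0, 2.0], [3.0, 4.0]]
out = position_wise_gate(emb, [[0.0, 0.0], [0.0, 0.0]], [20.0, -20.0])
print(out)
```

In a trained model the gate parameters would be learned, so positions that behave like anchors could end up with gates close to one while less informative positions are damped.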

Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516514/pdf/

Citations: 0

BioThings Explorer: a query engine for a federated knowledge graph of biomedical APIs.

IF 4.4 | Zone 3 (Biology) | Q1 Biochemical Research Methods

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad570

Jackson Callaghan, Colleen H Xu, Jiwen Xin, Marco Alvarado Cano, Anders Riutta, Eric Zhou, Rohan Juneja, Yao Yao, Madhumita Narayan, Kristina Hanspers, Ayushi Agrawal, Alexander R Pico, Chunlei Wu, Andrew I Su

Summary: Knowledge graphs are an increasingly common data structure for representing biomedical information. These knowledge graphs can easily represent heterogeneous types of information, and many algorithms and tools exist for querying and analyzing graphs. Biomedical knowledge graphs have been used in a variety of applications, including drug repurposing, identification of drug targets, prediction of drug side effects, and clinical decision support. Typically, knowledge graphs are constructed by centralization and integration of data from multiple disparate sources. Here, we describe BioThings Explorer, an application that can query a virtual, federated knowledge graph derived from the aggregated information in a network of biomedical web services. BioThings Explorer leverages semantically precise annotations of the inputs and outputs for each resource, and automates the chaining of web service calls to execute multi-step graph queries. Because there is no large, centralized knowledge graph to maintain, BioThings Explorer is distributed as a lightweight application that dynamically retrieves information at query time.

Availability and implementation: More information can be found at https://explorer.biothings.io and code is available at https://github.com/biothings/biothings_explorer.
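The chaining of annotated web service calls described in the Summary can be illustrated with a small in-memory sketch. Nothing here uses the real BioThings Explorer API: the service registry, the type labels, the identifiers, and the function names are hypothetical stand-ins, and each "service" is a local lookup table rather than an HTTP call.

```python
def make_lookup(table):
    """Wrap a dict as a stand-in for a remote web service call."""
    return lambda key: set(table.get(key, ()))


# Each entry: (input type, output type, callable). The type annotations play
# the role of the semantic input/output annotations mentioned in the Summary.
SERVICES = [
    ("Disease", "Gene", make_lookup({"disease:D1": ["gene:G1", "gene:G2"]})),
    ("Gene", "Drug", make_lookup({"gene:G1": ["drug:X"], "gene:G2": ["drug:X", "drug:Y"]})),
    ("Gene", "Drug", make_lookup({"gene:G2": ["drug:Z"]})),
]


def run_query(start_ids, type_path, services=SERVICES):
    """Execute a multi-step query by chaining every service that fits each hop."""
    frontier = set(start_ids)
    for src, dst in zip(type_path, type_path[1:]):
        matching = [call for s, d, call in services if s == src and d == dst]
        next_frontier = set()
        for node in frontier:
            for call in matching:
                next_frontier |= call(node)
        frontier = next_frontier
    return frontier


drugs = run_query({"disease:D1"}, ["Disease", "Gene", "Drug"])
print(sorted(drugs))
```

No merged graph is stored anywhere in this sketch; the answer is assembled at query time from whichever services match each hop, which mirrors the "virtual, federated" design the Summary describes.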

Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11015316/pdf/

Citations: 0

Reducing cost in DNA-based data storage by sequence analysis-aided soft information decoding of variable-length reads.

IF 5.8 | Zone 3 (Biology) | Q1 Biochemical Research Methods

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad548

Seong-Joon Park, Sunghwan Kim, Jaeho Jeong, Albert No, Jong-Seon No, Hosung Park

Motivation: DNA-based data storage is one of the most attractive research areas for future archival storage. However, it faces the problems of high writing and reading costs for practical use. There have been many efforts to resolve this problem, but existing schemes are not fully suitable for DNA-based data storage, and more cost reduction is needed.

Results: We propose complete encoding and decoding procedures for DNA storage. The encoding procedure uses a carefully designed single low-density parity-check code as an inter-oligo code, which corrects errors and dropouts efficiently. We apply new clustering and alignment methods that operate on variable-length reads to improve decoding performance. During the sequence analysis-aided decoding procedure, we use edit distance and quality scores to discard abnormal reads and to exploit high-quality soft information. We store 548.83 KB of an image file in DNA oligos and achieve a writing cost reduction of 7.46% and significant reading cost reductions of 26.57% and 19.41% compared with the two previous works.

Availability and implementation: Data and codes for all the algorithms proposed in this study are available at: https://github.com/sjpark0905/DNA-LDPC-codes.
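As a rough illustration of how edit distance can be used to discard abnormal reads, the sketch below computes the Levenshtein distance between each read and a reference sequence and keeps only the reads within a threshold. The choice of reference (for example a cluster consensus), the threshold, and the function names are assumptions made for this example; the authors' actual pipeline is in the repository linked above.

```python
def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def filter_reads(reads, reference, max_distance):
    """Keep reads whose edit distance to the reference is within the threshold."""
    return [r for r in reads if edit_distance(r, reference) <= max_distance]


# Hypothetical variable-length reads compared against a 4-nt reference.
kept = filter_reads(["ACGT", "ACT", "TTTTTTTT"], "ACGT", max_distance=1)
print(kept)
```

Edit distance suits variable-length reads because it counts insertions and deletions as well as substitutions, so a read with a single dropped base is not penalised as if every downstream base were wrong.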

Volume 39, issue 9. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10500082/pdf/

Citations: 0

FunTaxIS-lite: a simple and light solution to investigate protein functions in all living organisms.

IF 5.8 | Zone 3 (Biology) | Q1 Biochemical Research Methods

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad549

Federico Bianca, Emilio Ispano, Ermanno Gazzola, Enrico Lavezzo, Paolo Fontana, Stefano Toppo

Motivation: Defining the full domain of protein functions belonging to an organism is a complex challenge that is due to the huge heterogeneity of the taxonomy, where single or small groups of species can bear unique functional characteristics. FunTaxIS-lite provides a solution to this challenge by determining taxon-based constraints on Gene Ontology (GO) terms, which specify the functions that an organism can or cannot perform. The tool employs a set of rules to generate and spread the constraints across both the taxon hierarchy and the GO graph.

Results: The taxon-based constraints produced by FunTaxIS-lite extend those provided by the Gene Ontology Consortium by an average of 300%. The implementation of these rules significantly reduces errors in function predictions made by automatic algorithms and can assist in correcting inconsistent protein annotations in databases.

Availability and implementation: FunTaxIS-lite is available on https://www.medcomp.medicina.unipd.it/funtaxis-lite and from https://github.com/MedCompUnipd/FunTaxIS-lite.
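Spreading a constraint "across both the taxon hierarchy and the GO graph" could look something like the sketch below. It shows one plausible rule only: a "this function is never found in this taxon" constraint is extended to every descendant GO term and every descendant taxon. The rule set FunTaxIS-lite actually applies may differ, and the term names, taxon names, and function names are placeholders.

```python
def descendants(node, children):
    """All nodes reachable from `node` by following parent -> child edges."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def propagate_never_in(constraints, go_children, taxon_children):
    """Expand (GO term, taxon) exclusions down both hierarchies."""
    expanded = set()
    for go_term, taxon in constraints:
        go_terms = {go_term} | descendants(go_term, go_children)
        taxa = {taxon} | descendants(taxon, taxon_children)
        expanded |= {(g, t) for g in go_terms for t in taxa}
    return expanded


# Placeholder hierarchies (not real GO identifiers or a real taxonomy dump).
go_children = {"term_parent": ["term_child"]}
taxon_children = {"taxon_A": ["taxon_B"], "taxon_B": ["taxon_C"]}
result = propagate_never_in({("term_parent", "taxon_A")}, go_children, taxon_children)
print(len(result))
```

A single curated constraint expands here into six (term, taxon) pairs, which gives a feel for how a small rule set can yield a much larger set of taxon-based constraints.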

Volume 39, issue 9. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10500080/pdf/

Citations: 0

MuTATE-an R package for comprehensive multi-objective molecular modeling.

IF 5.8 | Zone 3 (Biology) | Q1 Biochemical Research Methods

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad507

Sarah G Ayton, Víctor Treviño

Motivation: Comprehensive multi-omics studies have driven advances in disease modeling for effective precision medicine but pose a challenge for existing machine-learning approaches, which have limited interpretability across clinical endpoints. Automated, comprehensive disease modeling requires a machine-learning approach that can simultaneously identify disease subgroups and their defining molecular biomarkers by explaining multiple clinical endpoints. Current tools are restricted to individual endpoints or limited variable types, necessitate advanced computation skills, and require resource-intensive manual expert interpretation.

Results: We developed Multi-Target Automated Tree Engine (MuTATE) for automated and comprehensive molecular modeling, which enables user-friendly multi-objective decision tree construction and visualization of relationships between molecular biomarkers and patient subgroups characterized by multiple clinical endpoints. MuTATE incorporates multiple targets throughout model construction and allows for target weights, enabling construction of interpretable decision trees that provide insights into disease heterogeneity and molecular signatures. MuTATE eliminates the need for manual synthesis of multiple non-explainable models, making it highly efficient and accessible for bioinformaticians and clinicians. The flexibility and versatility of MuTATE make it applicable to a wide range of complex diseases, including cancer, where it can improve therapeutic decisions by providing comprehensive molecular insights for precision medicine. MuTATE has the potential to transform biomarker discovery and subtype identification, leading to more effective and personalized treatment strategies in precision medicine, and advancing our understanding of disease mechanisms at the molecular level.

Availability and implementation: MuTATE is freely available at GitHub (https://github.com/SarahAyton/MuTATE) under the GPLv3 license.
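One way to picture "multiple targets throughout model construction" with "target weights" is a split criterion that sums each target's impurity reduction, scaled by its weight. The sketch below does this for numeric targets using variance as the impurity measure. It is written in Python for illustration and is not MuTATE's R implementation: the criterion, the function names, and the toy data are all assumptions.

```python
def variance(values):
    """Population variance; defined as 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def weighted_split_gain(feature, targets, weights, threshold):
    """Weighted sum over targets of the variance reduction from one split."""
    left = [i for i, x in enumerate(feature) if x <= threshold]
    right = [i for i, x in enumerate(feature) if x > threshold]
    n = len(feature)
    gain = 0.0
    for name, values in targets.items():
        child = (len(left) / n) * variance([values[i] for i in left]) \
              + (len(right) / n) * variance([values[i] for i in right])
        gain += weights[name] * (variance(values) - child)
    return gain


def best_threshold(feature, targets, weights):
    """Pick the threshold with the largest weighted gain (needs >= 2 distinct values)."""
    candidates = sorted(set(feature))[:-1]
    return max(candidates, key=lambda t: weighted_split_gain(feature, targets, weights, t))


# Hypothetical biomarker values and two clinical endpoints.
biomarker = [1, 2, 3, 4]
endpoints = {"survival": [0, 0, 1, 1], "grade": [0, 1, 1, 1]}
print(best_threshold(biomarker, endpoints, {"survival": 1.0, "grade": 0.0}))
print(best_threshold(biomarker, endpoints, {"survival": 0.0, "grade": 1.0}))
```

Changing the weights moves the chosen split, which shows how a single tree can be steered toward whichever clinical endpoints matter most for a given analysis.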

动机:全面的多组学研究推动了有效精准医学疾病建模的进步，但对现有的机器学习方法提出了挑战，这些方法在临床终点的可解释性有限。自动化、全面的疾病建模需要一种机器学习方法，该方法可以通过解释多个临床终点同时识别疾病亚组及其定义的分子生物标志物。当前的工具仅限于单个端点或有限的变量类型，需要高级计算技能，并且需要资源密集型的人工专家解释。结果:我们开发了多目标自动化树引擎(MuTATE)，用于自动化和全面的分子建模，支持用户友好的多目标决策树构建和可视化分子生物标志物与具有多个临床终点特征的患者亚组之间的关系。MuTATE在整个模型构建过程中包含多个目标，并允许目标权重，从而能够构建可解释的决策树，从而深入了解疾病异质性和分子特征。MuTATE消除了人工合成多个不可解释模型的需要，使生物信息学家和临床医生能够高效地使用它。MuTATE的灵活性和多功能性使其适用于广泛的复杂疾病，包括癌症，它可以通过为精准医学提供全面的分子见解来改善治疗决策。MuTATE有可能改变生物标志物的发现和亚型鉴定，在精准医学中导致更有效和个性化的治疗策略，并在分子水平上推进我们对疾病机制的理解。可用性和实现:MuTATE在GPLv3许可下可在GitHub (https://github.com/SarahAyton/MuTATE)免费获得。

Citations: 0

chem16S: community-level chemical metrics for exploring genomic adaptation to environments.

IF 5.8 Zone 3 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad564

Jeffrey M Dick, Xun Kang

Summary: The chem16S package combines taxonomic classifications of 16S rRNA gene sequences with amino acid compositions of prokaryotic reference proteomes to generate community reference proteomes. Taxonomic classifications from the RDP Classifier or data objects created by the phyloseq R package are supported. Users can calculate and visualize a variety of chemical metrics in order to explore the effects of redox, salinity, and other physicochemical variables on the genomic adaptation of protein sequences at the community level.

Availability and implementation: Development of chem16S is hosted at https://github.com/jedick/chem16S. Version 1.0.0 is freely available from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/package=chem16S.
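
The combination step described in the summary can be illustrated with a minimal sketch, assuming that a community reference proteome is an abundance-weighted sum of per-taxon amino acid compositions. This is written in Python for illustration only and does not use or reproduce the chem16S R API; the function name, taxon names, and counts below are hypothetical.

```python
from collections import defaultdict


def community_reference_proteome(taxon_counts, reference_aa):
    """Sum reference amino acid compositions, weighted by taxon abundance.

    taxon_counts: {taxon: number of classified 16S reads} (hypothetical input)
    reference_aa: {taxon: {amino acid: count in the reference proteome}}
    Taxa without a reference proteome are skipped (an assumption of this sketch).
    """
    totals = defaultdict(float)
    for taxon, count in taxon_counts.items():
        composition = reference_aa.get(taxon)
        if composition is None:
            continue  # no reference proteome available for this taxon
        for amino_acid, n in composition.items():
            totals[amino_acid] += count * n
    return dict(totals)


# Toy data: two mapped genera and one taxon with no reference proteome.
taxon_counts = {"GenusA": 30, "GenusB": 10, "Unmapped": 5}
reference_aa = {
    "GenusA": {"Ala": 10, "Cys": 1, "Glu": 7},
    "GenusB": {"Ala": 8, "Cys": 2, "Glu": 6},
}
community = community_reference_proteome(taxon_counts, reference_aa)
print(community)
```

Chemical metrics would then be computed from the resulting community-level composition rather than from any single taxon.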

Citations: 0

HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes.

IF 5.8 Zone 3 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad535

Sophie Wharrie, Zhiyu Yang, Vishnu Raj, Remo Monti, Rahul Gupta, Ying Wang, Alicia Martin, Luke J O'Connor, Samuel Kaski, Pekka Marttinen, Pier Francesco Palamara, Christoph Lippert, Andrea Ganna

Motivation: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking.

Results: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. Compared with alternative methods, HAPNEST is computationally faster and shows a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through a comparison of seven methods for generating polygenic risk scores across multiple ancestry groups and different genetic architectures.

Availability and implementation: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
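
To make the terms heritability and polygenicity concrete, here is a minimal sketch of a standard additive phenotype model. It is not HAPNEST's algorithm, which the abstract does not specify. It assumes polygenicity is the fraction of variants with a nonzero effect and heritability is the share of phenotypic variance explained by the genetic component; the function name, parameters, and toy genotypes are hypothetical.

```python
import random
import statistics


def simulate_phenotype(genotypes, heritability, polygenicity, seed=0):
    """Simulate one phenotype as genetic value plus Gaussian noise.

    genotypes: list of per-individual lists of allele counts (0, 1, or 2).
    heritability: target variance of the genetic component (total is about 1).
    polygenicity: fraction of variants given a nonzero effect.
    """
    rng = random.Random(seed)
    n_variants = len(genotypes[0])
    n_causal = max(1, round(polygenicity * n_variants))
    causal = rng.sample(range(n_variants), n_causal)
    effects = {j: rng.gauss(0.0, 1.0) for j in causal}

    # Raw genetic value: effect-weighted sum over causal variants only.
    raw = [sum(effects[j] * row[j] for j in causal) for row in genotypes]
    mean, sd = statistics.fmean(raw), statistics.pstdev(raw)
    if sd == 0:
        raise ValueError("causal variants do not vary across individuals")

    # Rescale so the genetic component has variance equal to heritability.
    genetic = [(g - mean) / sd * heritability ** 0.5 for g in raw]
    noise_sd = (1.0 - heritability) ** 0.5
    phenotype = [g + rng.gauss(0.0, noise_sd) for g in genetic]
    return phenotype, genetic


# Toy genotype matrix: 200 individuals, 50 variants.
toy_rng = random.Random(1)
genotypes = [[toy_rng.choice((0, 1, 2)) for _ in range(50)] for _ in range(200)]
phenotype, genetic = simulate_phenotype(genotypes, heritability=0.5, polygenicity=0.2)
print(round(statistics.pvariance(genetic), 6))
```

Varying `heritability` and `polygenicity` across calls gives a family of traits with different genetic architectures, which is the kind of variation the nine phenotypes in the abstract are described as covering.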

Citations: 0

MSDRP: a deep learning model based on multisource data for predicting drug response.

IF 5.8 Zone 3 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Bioinformatics

Pub Date : 2023-09-02 DOI: 10.1093/bioinformatics/btad514

Haochen Zhao, Xiaoyu Zhang, Qichang Zhao, Yaohang Li, Jianxin Wang

Motivation: Cancer heterogeneity drastically affects cancer therapeutic outcomes. Predicting drug response in vitro is expected to help formulate personalized therapy regimens. In recent years, several computational models based on machine learning and deep learning have been proposed to predict drug response in vitro. However, most of these methods capture drug features from a single drug description (e.g. drug structure), without considering the relationships between drugs and biological entities (e.g. targets, diseases, and side effects). Moreover, most of these methods collect features separately for drugs and cell lines but fail to consider the pairwise interactions between drugs and cell lines.

Results: In this paper, we propose a deep learning framework, named MSDRP, for drug response prediction. MSDRP uses an interaction module to capture interactions between drugs and cell lines, and integrates multiple associations/interactions between drugs and biological entities through similarity network fusion algorithms, outperforming some state-of-the-art models in all performance measures for all experiments. The results of the de novo and independent tests demonstrate the excellent performance of our model for new drugs. Furthermore, several case studies illustrate the rationale for representing drugs with feature vectors derived from drug similarity matrices built from multisource data, as well as the interpretability of our model.

Availability and implementation: The code for MSDRP is available at https://github.com/xyzhang-10/MSDRP.

Citations: 0