Frontiers in bioinformatics: latest articles, page 6

Posterior inference of Hi-C contact frequency through sampling.

Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2024-02-22 eCollection Date: 2023-01-01 DOI: 10.3389/fbinf.2023.1285828

Yanlin Zhang, Christopher J F Cameron, Mathieu Blanchette

Hi-C is one of the most widely used approaches to study three-dimensional genome conformations. Contacts captured by a Hi-C experiment are represented in a contact frequency matrix. Due to the limited sequencing depth and other factors, Hi-C contact frequency matrices are only approximations of the true interaction frequencies and are further reported without any quantification of uncertainty. Hence, downstream analyses based on Hi-C contact maps (e.g., TAD and loop annotation) are themselves point estimations. Here, we present the Hi-C interaction frequency sampler (HiCSampler) that reliably infers the posterior distribution of the interaction frequency for a given Hi-C contact map by exploiting dependencies between neighboring loci. Posterior predictive checks demonstrate that HiCSampler can infer highly predictive chromosomal interaction frequency. Summary statistics calculated by HiCSampler provide a measurement of the uncertainty for Hi-C experiments, and samples inferred by HiCSampler are ready for use by most downstream analysis tools off the shelf and permit uncertainty measurements in these analyses without modifications.
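To make concrete what it means to report a posterior distribution instead of a point estimate, the sketch below samples the interaction frequency of a single contact-matrix entry under a conjugate Gamma-Poisson model. This is an illustrative simplification and not HiCSampler's model: it treats each entry independently and therefore ignores the dependencies between neighboring loci that the abstract describes. The function names and prior parameters are assumptions.

```python
import random
import statistics


def sample_posterior_frequency(count, n_samples=2000, prior_shape=1.0,
                               prior_rate=1.0, seed=0):
    """Draw posterior samples of a Poisson rate given one observed contact count.

    With a Gamma(shape, rate) prior and a Poisson likelihood, the posterior
    is Gamma(shape + count, rate + 1). The prior values are assumptions.
    """
    rng = random.Random(seed)
    post_shape = prior_shape + count
    post_scale = 1.0 / (prior_rate + 1.0)  # gammavariate expects a scale, not a rate
    return [rng.gammavariate(post_shape, post_scale) for _ in range(n_samples)]


def summarize(samples, lower=0.025, upper=0.975):
    """Return the sample mean and an empirical credible interval."""
    ordered = sorted(samples)
    lo = ordered[int(lower * (len(ordered) - 1))]
    hi = ordered[int(upper * (len(ordered) - 1))]
    return statistics.fmean(ordered), (lo, hi)


samples = sample_posterior_frequency(count=12)
mean, interval = summarize(samples)
```

Each element of `samples` is one plausible value of the underlying frequency, so the width of `interval` gives a rough measure of uncertainty for that entry.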

Citations: 0

Making bioinformatics training FAIR: the EMBL-EBI training portal.

Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2024-01-31 eCollection Date: 2024-01-01 DOI: 10.3389/fbinf.2024.1347168

A L Swan, A Broadbent, P Singh Gaur, A Mishra, K Gurwitz, A Mithani, S L Morgan, G Malhotra, C Brooksbank

EMBL-EBI provides a broad range of training in data-driven life sciences. To improve awareness and access to training course listings and to make digital learning materials findable and simple to use, the EMBL-EBI Training website, www.ebi.ac.uk/training, was redesigned and restructured. To provide a framework for the redesign of the website, the FAIR (findable, accessible, interoperable, reusable) principles were applied to both the listings of live training courses and the presentation of on-demand training content. Each of the FAIR principles guided decisions on the choice of technology used to develop the website, including the details provided about training and the way in which training was presented. Since its release the openly accessible website has been accessed by an average of 58,492 users a month. There have also been over 12,000 unique users creating accounts since the functionality was added in March 2022, allowing these users to track their learning and record completion of training. Development of the website was completed using the Agile Scrum project management methodology and a focus on user experience. This framework continues to be used now that the website is live for the maintenance and improvement of the website, as feedback continues to be collected and further ways to make training FAIR are identified. Here, we describe the process of making EMBL-EBI's training FAIR through the development of a new website and our experience of implementing Agile Scrum.

Citations: 0

Systematic computational hunting for small RNAs derived from ncRNAs during dengue virus infection in endothelial HMEC-1 cells.

Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2024-01-31 eCollection Date: 2024-01-01 DOI: 10.3389/fbinf.2024.1293412

Aimer Gutierrez-Diaz, Steve Hoffmann, Juan Carlos Gallego-Gómez, Clara Isabel Bermudez-Santana

In recent years, a population of small RNA fragments derived from non-coding RNAs (sfd-RNAs) has gained significant interest due to its functional and structural resemblance to miRNAs, adding another level of complexity to our comprehension of small-RNA-mediated gene regulation. Despite this, scientists need more tools to test the differential expression of sfd-RNAs since the current methods to detect miRNAs may not be directly applied to them. The primary reasons are the lack of accurate small RNA and ncRNA annotation, the multi-mapping read (MMR) placement, and the multicopy nature of ncRNAs in the human genome. To solve these issues, a methodology that allows the detection of differentially expressed sfd-RNAs, including canonical miRNAs, by using an integrated copy-number-corrected ncRNA annotation was implemented. This approach was coupled with sixteen different computational strategies composed of combinations of four aligners and four normalization methods to provide a rank-order of prediction for each differentially expressed sfd-RNA. By systematically addressing the three main problems, we could detect differentially expressed miRNAs and sfd-RNAs in dengue virus-infected human dermal microvascular endothelial cells. Although more biological evaluations are required, two molecular targets of the hsa-mir-103a and hsa-mir-494 (CDK5 and PI3/AKT) appear relevant for dengue virus (DENV) infections. Here, we performed a comprehensive annotation and differential expression analysis, which can be applied in other studies addressing the role of small fragment RNA populations derived from ncRNAs in virus infection.
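The sixteen strategies described above are the Cartesian product of four aligners and four normalization methods. The sketch below enumerates such a grid and ranks candidates by how many strategies report them. The abstract names neither the tools nor the ranking rule, so the labels are placeholders and ranking by support count is an assumption, not the authors' procedure.

```python
from collections import Counter
from itertools import product

# Placeholder labels: the abstract does not name the aligners or normalizations.
ALIGNERS = ["aligner_1", "aligner_2", "aligner_3", "aligner_4"]
NORMALIZATIONS = ["norm_1", "norm_2", "norm_3", "norm_4"]


def build_strategies(aligners=ALIGNERS, normalizations=NORMALIZATIONS):
    """Return every (aligner, normalization) pair."""
    return list(product(aligners, normalizations))


def rank_by_support(calls_per_strategy):
    """Rank candidates by the number of strategies that report them.

    calls_per_strategy maps a strategy to the set of candidate IDs it calls
    differentially expressed. Ties are broken alphabetically.
    """
    support = Counter()
    for called in calls_per_strategy.values():
        support.update(called)
    return sorted(support.items(), key=lambda item: (-item[1], item[0]))


strategies = build_strategies()
# Hypothetical calls: every strategy reports frag_A, only one reports frag_B.
calls = {strategy: {"frag_A"} for strategy in strategies}
calls[strategies[0]] = {"frag_A", "frag_B"}
ranking = rank_by_support(calls)
```

A candidate supported by most aligner and normalization combinations is less likely to be an artifact of a single tool choice, which is the motivation for ranking across the grid.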

Citations: 0

A breast cancer-specific combinational QSAR model development using machine learning and deep learning approaches.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2024-01-15 eCollection Date: 2023-01-01 DOI: 10.3389/fbinf.2023.1328262

Anush Karampuri, Shyam Perugu

Breast cancer is the most prevalent and heterogeneous form of cancer affecting women worldwide. Various therapeutic strategies are in practice based on the extent of disease spread, such as surgery, chemotherapy, radiotherapy, and immunotherapy. Combinational therapy is another strategy that has proven to be effective in controlling cancer progression. It involves administering an anchor drug, a well-established primary therapeutic agent with known efficacy for specific targets, together with a library drug, a supplementary drug that enhances the efficacy of the anchor drug and broadens the therapeutic approach. Our work focused on harnessing regression-based machine learning (ML) and deep learning (DL) algorithms to develop a structure-activity relationship between the molecular descriptors of drug pairs and their combined biological activity through a quantitative structure-activity relationship (QSAR) model. Eleven widely used machine learning and deep learning algorithms were used to develop QSAR models. A total of 52 breast cancer cell lines, 25 anchor drugs, and 51 library drugs were considered in developing the QSAR model. It was observed that Deep Neural Networks (DNNs) achieved an R² (coefficient of determination) of 0.94, with an RMSE (root mean square error) of 0.255, making it the most effective algorithm for developing a structure-activity relationship with strong generalization capabilities. In conclusion, applying combinational therapy alongside ML and DL techniques represents a promising approach to combating breast cancer.
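The two metrics reported above have standard definitions, sketched here in plain Python for reference. The function names and example values are illustrative and are not taken from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
observed = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]
mean_only = [2.5, 2.5, 2.5, 2.5]
```

A model that always predicts the mean of the observed values scores an R² of 0, so a value of 0.94 means the model leaves about 6% of the variance in the response unexplained on the data it was evaluated on.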

Citations: 0

BPAGS: a web application for bacteriocin prediction via feature evaluation using alternating decision tree, genetic algorithm, and linear support vector classifier

Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2024-01-10 DOI: 10.3389/fbinf.2023.1284705

Suraiya Akhter, John H. Miller

The use of bacteriocins has emerged as a propitious strategy in the development of new drugs to combat antibiotic resistance, given their ability to kill bacteria with both broad and narrow natural spectra. Hence, a compelling requirement arises for a precise and efficient computational model that can accurately predict novel bacteriocins. Machine learning’s ability to learn patterns and features from bacteriocin sequences that are difficult to capture using sequence matching-based methods makes it a potentially superior choice for accurate prediction. A web application for predicting bacteriocin was created in this study, utilizing a machine learning approach. The feature sets employed in the application were chosen using alternating decision tree (ADTree), genetic algorithm (GA), and linear support vector classifier (linear SVC)-based feature evaluation methods. Initially, potential features were extracted from the physicochemical, structural, and sequence-profile attributes of both bacteriocin and non-bacteriocin protein sequences. We assessed the candidate features first using the Pearson correlation coefficient, followed by separate evaluations with ADTree, GA, and linear SVC to eliminate unnecessary features. Finally, we constructed random forest (RF), support vector machine (SVM), decision tree (DT), logistic regression (LR), k-nearest neighbors (KNN), and Gaussian naïve Bayes (GNB) models using reduced feature sets. We obtained the overall top performing model using SVM with ADTree-reduced features, achieving an accuracy of 99.11% and an AUC value of 0.9984 on the testing dataset. We also assessed the predictive capabilities of our best-performing models for each reduced feature set relative to our previously developed software solution, a sequence alignment-based tool, and a deep-learning approach. 
A web application, titled BPAGS (Bacteriocin Prediction based on ADTree, GA, and linear SVC), was developed to incorporate the predictive models built using ADTree, GA, and linear SVC-based feature sets. Currently, the web-based tool provides classification results with associated probability values and has options to add new samples in the training data to improve the predictive efficacy. BPAGS is freely accessible at https://shiny.tricities.wsu.edu/bacteriocin-prediction/.
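The first screening step mentioned above, assessing candidate features with the Pearson correlation coefficient, is commonly implemented as a redundancy filter. The sketch below shows one such filter. The abstract does not state how the coefficient was applied, so the threshold, the keep-first rule, and the feature names are assumptions rather than the authors' procedure.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is <= threshold.

    features maps a feature name to its values across samples. The 0.9
    threshold is an assumed value for illustration.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns: f2 is a scaled copy of f1 and gets dropped.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.0, 4.0, 6.0, 8.0],
    "f3": [4.0, 1.0, 3.0, 2.0],
}
selected = drop_correlated(features)
```

The surviving features would then go to the model-based evaluators (ADTree, GA, and linear SVC in the study), which are not reproduced here.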

Citations: 0

No-boundary thinking for artificial intelligence in bioinformatics and education

Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2024-01-08 DOI: 10.3389/fbinf.2023.1332902

Prajay Patel, Nisha Pillai, Inimary T. Toby

No-boundary thinking enables the scientific community to reflect in a thoughtful manner and discover new opportunities, create innovative solutions, and break through barriers that might have otherwise constrained their progress. This concept encourages thinking without being confined by traditional rules, limitations, or established norms, and a mindset that is not limited by previous work, leading to fresh perspectives and innovative outcomes. So, where do we see the field of artificial intelligence (AI) in bioinformatics going in the next 30 years? That was the theme of a “No-Boundary Thinking” Session as part of the Mid-South Computational Bioinformatics Society’s (MCBIOS) 19th annual meeting in Irving, Texas. This session addressed various areas of AI in an open discussion and raised perspectives on how popular tools like ChatGPT can be integrated into bioinformatics, how to communicate with scientists in different fields to properly utilize the potential of these algorithms, and how to continue educational outreach to foster interest in data science and informatics among the next generation of scientists.

Citations: 0

Protein-lipid interactions and protein anchoring modulate the modes of association of the globular domain of the Prion protein and Doppel protein to model membrane patches

Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2024-01-05 DOI: 10.3389/fbinf.2023.1321287

Patricia Soto, Davis T. Thalhuber, Frank Luceri, Jamie Janos, Mason R. Borgman, Noah M. Greenwood, Sofia Acosta, Hunter Stoffel

The Prion protein is the molecular hallmark of the incurable prion diseases affecting mammals, including humans. The protein-only hypothesis states that the misfolding, accumulation, and deposition of the Prion protein play a critical role in toxicity. The cellular Prion protein (PrPC) anchors to the extracellular leaflet of the plasma membrane and prefers cholesterol- and sphingomyelin-rich membrane domains. Conformational Prion protein conversion into the pathological isoform happens on the cell surface. In vitro and in vivo experiments indicate that Prion protein misfolding, aggregation, and toxicity are sensitive to the lipid composition of plasma membranes and vesicles. A picture of the underlying biophysical driving forces that explain the effect of Prion protein - lipid interactions in physiological conditions is needed to develop a structural model of Prion protein conformational conversion. To this end, we use molecular dynamics simulations that mimic the interactions between the globular domain of PrPC anchored to model membrane patches. In addition, we also simulate the Doppel protein anchored to such membrane patches. The Doppel protein is the closest in the phylogenetic tree to PrPC, localizes in an extracellular milieu similar to that of PrPC, and exhibits a similar topology to PrPC even if the amino acid sequence is only 25% identical. Our simulations show that specific protein-lipid interactions and conformational constraints imposed by GPI anchoring together favor specific binding sites in globular PrPC but not in Doppel. Interestingly, the binding sites we found in PrPC correspond to prion protein loops, which are critical in aggregation and prion disease transmission barrier (β2-α2 loop) and in initial spontaneous misfolding (α2-α3 loop). We also found that the membrane re-arranges locally to accommodate protein residues inserted in the membrane surface as a response to protein binding.

Citations: 0

RxNorm for drug name normalization: a case study of prescription opioids in the FDA adverse events reporting system

Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2024-01-05 DOI: 10.3389/fbinf.2023.1328613

Huyen Le, Ru Chen, Stephen Harris, Hong Fang, Beverly Lyn-Cook, H. Hong, W. Ge, Paul Rogers, Weida Tong, Wen Zou

Numerous studies have used the US Food and Drug Administration (FDA) Adverse Events Reporting System (FAERS) database to assess post-marketing reporting rates for drug safety review and risk assessment. However, drug names in FAERS adverse event (AE) reports are heterogeneous because the information, submitted mandatorily by pharmaceutical companies and voluntarily by patients, healthcare professionals, and the public, lacks uniformity. Studies that use FAERS or other spontaneous-reporting AE databases without drug name normalization may miss AE reports filed under non-standard drug names, which can affect the accuracy of their results. In this study, we demonstrated the applicability of RxNorm, developed by the National Library of Medicine, for drug name normalization in FAERS. Using prescription opioids as a case study, we used the RxNorm application programming interface (API) to map all FDA-approved prescription opioids described in FAERS AE reports to their equivalent RxNorm Concept Unique Identifiers (RxCUIs) and RxNorm names. The different names of the opioids were then extracted, and their usage frequencies were calculated in a collection of more than 14.9 million AE reports for 13 FDA-approved prescription opioid classes, reported over 17 years. The results showed that a large number of different names were consistently used for opioids in FAERS reports, with 2,086 different names (out of 7,892) used at least three times and 842 different names used at least ten times for each of the 92 RxNorm names of FDA-approved opioids. Our RxNorm API mapping method was confirmed to be efficient and accurate and to significantly reduce the heterogeneity of prescription opioid names in FAERS AE reports. It is also expected to apply broadly to other sets of drug names from any database where drug names are diverse and unnormalized, and to automatically standardize and link different representations of the same drug, helping to build an intact, high-quality database for diverse research, particularly postmarketing data analysis in pharmacovigilance initiatives.
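
The map-then-count step described in this abstract can be illustrated with a small offline sketch. It is not the authors' pipeline: the `NAME_TO_CONCEPT` table is a hypothetical stand-in for what a lookup against the RxNorm API would return, the concept IDs are placeholders rather than real RxCUIs, and the reported names are toy data. Brand and salt forms are grouped under one concept here purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical lookup standing in for RxNorm API responses.
# Values are placeholder concept IDs, NOT real RxCUIs.
NAME_TO_CONCEPT = {
    "oxycodone": "CONCEPT_A",
    "oxycodone hcl": "CONCEPT_A",
    "oxycontin": "CONCEPT_A",
    "morphine": "CONCEPT_B",
    "morphine sulfate": "CONCEPT_B",
}


def clean(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def name_usage(reported_names):
    """Count how often each distinct reported name occurs under each concept.

    Returns (usage, unmapped): usage maps concept -> Counter of cleaned names,
    unmapped counts cleaned names that have no entry in the lookup table.
    """
    usage = defaultdict(Counter)
    unmapped = Counter()
    for raw in reported_names:
        key = clean(raw)
        concept = NAME_TO_CONCEPT.get(key)
        if concept is None:
            unmapped[key] += 1
        else:
            usage[concept][key] += 1
    return usage, unmapped


def names_used_at_least(usage, threshold):
    """For each concept, count the distinct names used at least `threshold` times."""
    return {
        concept: sum(1 for n in counts.values() if n >= threshold)
        for concept, counts in usage.items()
    }


# Toy report names (illustrative only).
reports = [
    "Oxycodone", "OXYCODONE  HCL", "oxycontin", "oxycodone",
    "Morphine Sulfate", "morphine", "oxycodone", "unknown drug",
]
usage, unmapped = name_usage(reports)
```

Keeping unmapped names in a separate counter makes it easy to see which reported strings the lookup failed to normalize, which is where manual review would normally focus.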



Citations: 0

Editorial: Expert opinions in protein bioinformatics: 2022

Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2024-01-05 DOI: 10.3389/fbinf.2023.1338560

Daisuke Kihara

Citations: 0

Genomic risk prediction of cardiovascular diseases among type 2 diabetes patients in the UK Biobank

Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2024-01-04 DOI: 10.3389/fbinf.2023.1320748

Yixuan Ye, Jiaqi Hu, Fuyuan Pang, Can Cui, Hongyu Zhao

Background: Polygenic risk score (PRS) has proved useful in predicting the risk of cardiovascular diseases (CVD) based on the genotypes of an individual, but most analyses have focused on disease onset in the general population. The usefulness of PRS to predict CVD risk among type 2 diabetes (T2D) patients remains unclear. Methods: We built a meta-PRSCVD upon the candidate PRSs developed from state-of-the-art PRS methods for three CVD subtypes of significant importance: coronary artery disease (CAD), ischemic stroke (IS), and heart failure (HF). To evaluate the prediction performance of the meta-PRSCVD, we restricted our analysis to 21,092 white British T2D patients in the UK Biobank, among whom 4,015 had CVD events. Results: The meta-PRSCVD was significantly associated with CVD risk, with a hazard ratio per standard deviation increase of 1.28 (95% CI: 1.23–1.33). The meta-PRSCVD alone predicted CVD incidence with an area under the receiver operating characteristic curve (AUC) of 0.57 (95% CI: 0.54–0.59). When restricted to early-onset patients (onset age ≤ 55), the AUC increased further to 0.61 (95% CI: 0.56–0.67). Conclusion: Our results highlight the potential role of genomic screening for secondary prevention of CVD among T2D patients, especially among early-onset patients.
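
For readers unfamiliar with the AUC values quoted above: the AUC is the probability that a randomly chosen case receives a higher score than a randomly chosen control, so 0.5 means no discrimination and 1.0 means perfect ranking. The sketch below computes it by direct pairwise comparison. It is a generic illustration on made-up scores, not the authors' evaluation code, and the function name `auc` is an assumption.

```python
def auc(scores, labels):
    """AUC as the fraction of case/control pairs ranked correctly.

    `labels` uses 1 for cases (event) and 0 for controls (no event).
    Tied scores contribute 0.5 to the count of correctly ranked pairs.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy risk scores (illustrative only): two controls, then two cases.
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_labels = [0, 0, 1, 1]
print(auc(toy_scores, toy_labels))  # 3 of 4 pairs ranked correctly -> 0.75
```

The pairwise form is quadratic in sample size, so for a cohort of this scale a rank-based implementation would be the practical choice; the pairwise version is shown because it mirrors the definition directly.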



Citations: 0