BMC Bioinformatics latest articles, page 10

mulea: An R package for enrichment analysis using multiple ontologies and empirical false discovery rate.

IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-10-18 DOI: 10.1186/s12859-024-05948-7

Cezary Turek, Márton Ölbei, Tamás Stirling, Gergely Fekete, Ervin Tasnádi, Leila Gul, Balázs Bohár, Balázs Papp, Wiktor Jurkowski, Eszter Ari

Traditional gene set enrichment analyses are typically limited to a few ontologies and do not account for the interdependence of gene sets or terms, resulting in overcorrected p-values. To address these challenges, we introduce mulea, an R package offering comprehensive overrepresentation and functional enrichment analysis. mulea employs a progressive empirical false discovery rate (eFDR) method, specifically designed for interconnected biological data, to accurately identify significant terms within diverse ontologies. mulea expands beyond traditional tools by incorporating a wide range of ontologies, encompassing Gene Ontology, pathways, regulatory elements, genomic locations, and protein domains. This flexibility enables researchers to tailor enrichment analysis to their specific questions, such as identifying enriched transcriptional regulators in gene expression data or overrepresented protein domains in protein sets. To facilitate seamless analysis, mulea provides gene sets (in standardised GMT format) for 27 model organisms, covering 22 ontology types from 16 databases and various identifiers resulting in almost 900 files. Additionally, the muleaData ExperimentData Bioconductor package simplifies access to these pre-defined ontologies. Finally, mulea's architecture allows for easy integration of user-defined ontologies, or GMT files from external sources (e.g., MSigDB or Enrichr), expanding its applicability across diverse research areas. mulea is distributed as a CRAN R package downloadable from https://cran.r-project.org/web/packages/mulea/ and https://github.com/ELTEbioinformatics/mulea . It offers researchers a powerful and flexible toolkit for functional enrichment analysis, addressing limitations of traditional tools with its progressive eFDR and by supporting a variety of ontologies. Overall, mulea fosters the exploration of diverse biological questions across various model organisms.
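As a rough illustration of the overrepresentation step that such a workflow builds on, the sketch below parses GMT-formatted lines and computes a one-sided hypergeometric p-value per term. It is a minimal sketch in Python, not mulea's R API, and it does not implement the progressive eFDR described in the abstract. The function names (`parse_gmt`, `hypergeom_upper_tail`, `overrepresentation`) and the toy gene identifiers are hypothetical.

```python
from math import comb


def parse_gmt(lines):
    """Parse GMT lines: term ID, description, then member genes (tab-separated)."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        term_id, _description, *genes = fields
        gene_sets[term_id] = {g for g in genes if g}
    return gene_sets


def hypergeom_upper_tail(k, n_set, n_selected, n_background):
    """P(X >= k) when n_selected genes are drawn from n_background genes,
    n_set of which belong to the term."""
    max_k = min(n_set, n_selected)
    total = comb(n_background, n_selected)
    hits = sum(
        comb(n_set, i) * comb(n_background - n_set, n_selected - i)
        for i in range(k, max_k + 1)
    )
    return hits / total


def overrepresentation(gene_sets, selected, background):
    """Return {term: p-value} using the plain hypergeometric test (no FDR control)."""
    background = set(background)
    selected = set(selected) & background
    results = {}
    for term, members in gene_sets.items():
        members = members & background
        overlap = len(members & selected)
        results[term] = hypergeom_upper_tail(
            overlap, len(members), len(selected), len(background)
        )
    return results


# Toy example with hypothetical gene IDs g0..g9.
toy_gmt = ["A\tterm A\tg0\tg1\tg2", "B\tterm B\tg5\tg6"]
toy_sets = parse_gmt(toy_gmt)
toy_result = overrepresentation(
    toy_sets, ["g0", "g1", "g2"], [f"g{i}" for i in range(10)]
)
```

The raw p-values from a test like this are what a multiple-testing procedure (such as the eFDR the authors describe) would subsequently adjust.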


BMC Bioinformatics 25(1):334. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11490090/pdf/

Citations: 0

repDilPCR: a tool for automated analysis of qPCR assays by the dilution-replicate method.

IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-10-15 DOI: 10.1186/s12859-024-05954-9

Deyan Yordanov Yosifov, Michaela Reichenzeller, Stephan Stilgenbauer, Daniel Mertens

Background: The dilution-replicate experimental design for qPCR assays is especially efficient. It is based on multiple linear regression of multiple 3-point standard curves that are derived from the experimental samples themselves and thus obviates the need for a separate standard curve produced by serial dilution of a standard. The method minimizes the total number of reactions and guarantees that Cq values are within the linear dynamic range of the dilution-replicate standard curves. However, the lack of specialized software has so far precluded the widespread use of the dilution-replicate approach.

Results: Here we present repDilPCR, the first tool that utilizes the dilution-replicate method and extends it by adding the possibility to use multiple reference genes. repDilPCR offers extensive statistical and graphical functions that can also be used with preprocessed data (relative expression values) obtained by usual assay designs and evaluation methods. repDilPCR has been designed with the philosophy to automate and speed up data analysis (typically less than a minute from Cq values to publication-ready plots), and features automatic selection and performance of appropriate statistical tests, at least in the case of one-factor experimental designs. Nevertheless, the program also allows users to export intermediate data and perform more sophisticated analyses with external statistical software, e.g. if two-way ANOVA is necessary.

Conclusions: repDilPCR is a user-friendly tool that can contribute to more efficient planning of qPCR experiments and their robust analysis. A public web server is freely accessible at https://repdilpcr.eu without registration. The program can also be used as an R script or as a locally installed Shiny app, which can be downloaded from https://github.com/deyanyosifov/repDilPCR where the source code is also available.
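The regression idea behind the dilution-replicate design can be sketched as follows. This is a minimal Python illustration, not repDilPCR's code: it assumes each sample contributes a short dilution series, that all series share one slope (set by amplification efficiency), and that each sample has its own intercept. The function names and the synthetic Cq values are hypothetical.

```python
from math import log10


def fit_shared_slope(samples):
    """Least-squares fit of Cq = intercept[sample] + slope * log10(relative input).

    samples: {name: [(log10_relative_input, cq), ...]}
    Returns (slope, {name: intercept}); the slope is pooled across all samples.
    """
    sxy = sxx = 0.0
    means = {}
    for name, points in samples.items():
        x_mean = sum(x for x, _ in points) / len(points)
        y_mean = sum(y for _, y in points) / len(points)
        means[name] = (x_mean, y_mean)
        sxy += sum((x - x_mean) * (y - y_mean) for x, y in points)
        sxx += sum((x - x_mean) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercepts = {n: ym - slope * xm for n, (xm, ym) in means.items()}
    return slope, intercepts


def amplification_efficiency(slope):
    """Efficiency implied by the slope; 1.0 means perfect doubling per cycle."""
    return 10 ** (-1.0 / slope) - 1.0


def relative_quantity(intercepts, sample, reference, slope):
    """Starting quantity of `sample` relative to `reference`."""
    return 10 ** ((intercepts[sample] - intercepts[reference]) / slope)


# Synthetic 3-point series (undiluted, 1:10, 1:100) with perfect efficiency.
true_slope = -1.0 / log10(2.0)
dilutions = [0.0, -1.0, -2.0]
demo = {
    "control": [(x, 20.0 + true_slope * x) for x in dilutions],
    "treated": [(x, 22.0 + true_slope * x) for x in dilutions],
}
slope, intercepts = fit_shared_slope(demo)
ratio = relative_quantity(intercepts, "treated", "control", slope)
```

In this synthetic case the treated sample crosses the threshold two cycles later than the control, which at perfect efficiency corresponds to a quarter of the starting quantity.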


BMC Bioinformatics 25(1):331. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11476982/pdf/

Citations: 0

Predicting stroke occurrences: a stacked machine learning approach with feature selection and data preprocessing.

IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-10-15 DOI: 10.1186/s12859-024-05866-8

Pritam Chakraborty, Anjan Bandyopadhyay, Preeti Padma Sahu, Aniket Burman, Saurav Mallik, Najah Alsubaie, Mohamed Abbas, Mohammed S Alqahtani, Ben Othman Soufiene

Stroke prediction remains a critical area of research in healthcare, aiming to enhance early intervention and patient care strategies. This study investigates the efficacy of machine learning techniques, particularly principal component analysis (PCA) and a stacking ensemble method, for predicting stroke occurrences based on demographic, clinical, and lifestyle factors. We systematically varied the number of PCA components and implemented a stacking model comprising random forest, decision tree, and K-nearest neighbors (KNN). Our findings demonstrate that setting the number of PCA components to 16 gave the best predictive accuracy, reaching 98.6% in stroke prediction. Evaluation metrics underscored the robustness of our approach in handling class imbalance and improving model performance. Comparative analyses against traditional machine learning algorithms such as SVM, logistic regression, and Naive Bayes highlighted the superiority of our proposed method.
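Because the abstract reports accuracy alongside a claim about class imbalance, it may help to see how accuracy and imbalance-aware metrics can diverge. The sketch below is a generic Python illustration on hypothetical labels; it is not the authors' evaluation code and uses none of their data.

```python
def imbalance_aware_metrics(y_true, y_pred, positive=1):
    """Return accuracy, recall, specificity and balanced accuracy for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical cohort: 5 stroke cases among 100 patients, and a trivial
# classifier that predicts "no stroke" for everyone.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
metrics = imbalance_aware_metrics(y_true, y_pred)
```

On this toy cohort the trivial classifier reaches 95% accuracy while detecting no stroke cases, which is why recall or balanced accuracy is usually reported next to accuracy for imbalanced outcomes.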


BMC Bioinformatics 25(1):329. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11476080/pdf/

Citations: 0

Be-dataHIVE: a base editing database.

IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-10-15 DOI: 10.1186/s12859-024-05898-0

Lucas Schneider, Peter Minary

Base editing is an enhanced gene editing approach that enables the precise transformation of single nucleotides and has the potential to cure rare diseases. The design process of base editors is labour-intensive and outcomes are not easily predictable. For any clinical use, base editing has to be accurate and efficient, so bystander mutations have to be minimized. In recent years, computational models to predict base editing outcomes have been developed. However, the overall robustness and performance of those models are limited. One way to improve performance is to train models on a diverse, feature-rich, and large dataset, which does not exist for the base editing field. Hence, we develop BE-dataHIVE, a MySQL database that covers over 460,000 gRNA target combinations. The current version of BE-dataHIVE consists of data from five studies and is enriched with melting temperatures and energy terms. Furthermore, multiple different data structures for machine learning were computed and are directly available. The database can be accessed via our website https://be-datahive.com/ or API and is therefore suitable for practitioners and machine learning researchers.
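To make the notion of bystander mutations concrete, the following Python sketch enumerates every sequence a cytosine base editor could produce when any subset of cytosines inside an editing window is converted to thymine. It is only an illustration: the window of protospacer positions 4 to 8, the example sequence, and the function name `cbe_outcomes` are assumptions, and the code does not use the BE-dataHIVE API or data.

```python
from itertools import combinations


def cbe_outcomes(protospacer, window=(4, 8), target="C", product="T"):
    """Enumerate sequences reachable by editing any subset of target bases.

    window: 1-based inclusive positions along the protospacer (assumed here).
    Returns a list of (edited_sequence, edited_positions) tuples, starting
    with the unedited sequence.
    """
    start, end = window
    editable = [
        i
        for i in range(start - 1, min(end, len(protospacer)))
        if protospacer[i] == target
    ]
    outcomes = []
    for r in range(len(editable) + 1):
        for subset in combinations(editable, r):
            seq = list(protospacer)
            for i in subset:
                seq[i] = product
            outcomes.append(("".join(seq), tuple(i + 1 for i in subset)))
    return outcomes


# Hypothetical 20-nt protospacer with cytosines at positions 3, 4 and 7;
# only positions 4 and 7 fall inside the assumed window.
outcomes = cbe_outcomes("GACCATCGGATTACAGGCTA")
```

If the intended edit is at position 7, every outcome that also converts position 4 contains a bystander edit; outcome-prediction models of the kind the abstract mentions aim to estimate how frequent each of these outcomes is.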


BMC Bioinformatics 25(1):330. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11476525/pdf/

Citations: 0

LDAGM: prediction lncRNA-disease asociations by graph convolutional auto-encoder and multilayer perceptron based on multi-view heterogeneous networks.

IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-10-15 DOI: 10.1186/s12859-024-05950-z

Bing Zhang, Haoyu Wang, Chao Ma, Hai Huang, Zhou Fang, Jiaxing Qu

Background: Long non-coding RNAs (lncRNAs) can prevent, diagnose, and treat a variety of complex human diseases, and it is crucial to establish a method to efficiently predict lncRNA-disease associations.

Results: In this paper, we propose a prediction method for the lncRNA-disease association relationship, named LDAGM, which is based on the Graph Convolutional Autoencoder and Multilayer Perceptron model. The method first extracts the functional similarity and Gaussian interaction profile kernel similarity of lncRNAs and miRNAs, as well as the semantic similarity and Gaussian interaction profile kernel similarity of diseases. It then constructs six homogeneous networks and deeply fuses them using a deep topology feature extraction method. The fused networks facilitate feature complementation and deep mining of the original association relationships, capturing the deep connections between nodes. Next, by combining the obtained deep topological features with the similarity network of lncRNA, disease, and miRNA interactions, we construct a multi-view heterogeneous network model. The Graph Convolutional Autoencoder is employed for nonlinear feature extraction. Finally, the extracted nonlinear features are combined with the deep topological features of the multi-view heterogeneous network to obtain the final feature representation of the lncRNA-disease pair. Prediction of the lncRNA-disease association relationship is performed using the Multilayer Perceptron model. To enhance the performance and stability of the Multilayer Perceptron model, we introduce a hidden layer called the aggregation layer in the Multilayer Perceptron model. Through a gate mechanism, it controls the flow of information between each hidden layer in the Multilayer Perceptron model, aiming to achieve optimal feature extraction from each hidden layer.

Conclusions: Parameter analysis, ablation studies, and comparison experiments verified the effectiveness of this method, and case studies verified the accuracy of this method in predicting lncRNA-disease association relationships.
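One ingredient named in the Results, Gaussian interaction profile (GIP) kernel similarity, has a compact standard definition that can be sketched directly. The Python code below follows the commonly used form, in which the bandwidth is normalised by the mean squared norm of the interaction profiles. Whether LDAGM uses exactly this normalisation is an assumption, and the toy association matrix is hypothetical.

```python
from math import exp


def gip_kernel(interaction_matrix, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of a 0/1 matrix.

    Rows are entities (for example lncRNAs) and columns are partners (for
    example diseases). K[i][j] = exp(-gamma * ||row_i - row_j||^2), with
    gamma = gamma_prime / mean(||row||^2).
    """
    sq_norms = [sum(v * v for v in row) for row in interaction_matrix]
    mean_sq_norm = sum(sq_norms) / len(sq_norms)
    if mean_sq_norm == 0:
        raise ValueError("interaction matrix contains no known associations")
    gamma = gamma_prime / mean_sq_norm
    n = len(interaction_matrix)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            dist_sq = sum(
                (a - b) ** 2
                for a, b in zip(interaction_matrix[i], interaction_matrix[j])
            )
            kernel[i][j] = kernel[j][i] = exp(-gamma * dist_sq)
    return kernel


# Toy matrix: three lncRNAs (rows) against three diseases (columns).
associations = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
similarity = gip_kernel(associations)
```

Rows with identical association profiles get similarity 1, and the value decays as profiles diverge; applying the same function to the transposed matrix gives the disease-side similarity.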


BMC Bioinformatics 25(1):332. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11481433/pdf/

Citations: 0

DNASimCLR: a contrastive learning-based deep learning approach for gene sequence data classification.

IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-10-14 DOI: 10.1186/s12859-024-05955-8

Minghao Yang, Zehua Wang, Zizhuo Yan, Wenxiang Wang, Qian Zhu, Changlong Jin

Background: The rapid advancements in deep neural network models have significantly enhanced the ability to extract features from microbial sequence data, which is critical for addressing biological challenges. However, the scarcity and complexity of labeled microbial data pose substantial difficulties for supervised learning approaches. To address these issues, we propose DNASimCLR, an unsupervised framework designed for efficient gene sequence data feature extraction.

Results: DNASimCLR leverages convolutional neural networks and the SimCLR framework, based on contrastive learning, to extract intricate features from diverse microbial gene sequences. Pre-training was conducted on two classic large-scale unlabelled datasets encompassing metagenomes and viral gene sequences. Subsequent classification tasks were performed by fine-tuning the pretrained model. Our experiments demonstrate that DNASimCLR is at least comparable to state-of-the-art techniques for gene sequence classification. Among convolutional neural network-based approaches, DNASimCLR surpasses the latest existing feature extraction methods. Furthermore, the model performs well across diverse tasks in analyzing biological sequence data, showcasing its robust adaptability.

Conclusions: DNASimCLR represents a robust and database-agnostic solution for gene sequence classification. Its versatility allows it to perform well in scenarios involving novel or previously unseen gene sequences, making it a valuable tool for diverse applications in genomics.
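The contrastive objective used by the SimCLR framework, the normalised temperature-scaled cross-entropy (NT-Xent) loss, can be written in a few lines. The Python sketch below operates on plain lists of embedding vectors. The encoder, the sequence augmentations, and the temperature value used by DNASimCLR are not given in the abstract, so the inputs and the default temperature here are hypothetical.

```python
from math import exp, log, sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss over N positive pairs.

    view_a[i] and view_b[i] are embeddings of two augmented views of the same
    sequence; all other embeddings in the batch act as negatives.
    """
    z = list(view_a) + list(view_b)
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)
        sims = [cosine(z[i], z[k]) / temperature for k in range(2 * n)]
        # The denominator excludes the anchor's similarity with itself.
        denom = sum(exp(s) for k, s in enumerate(sims) if k != i)
        total += -log(exp(sims[positive]) / denom)
    return total / (2 * n)


# Two toy sequences: views that agree give a lower loss than views that are
# paired with the wrong partner.
aligned = nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimising this loss pulls the two views of each sequence together and pushes other sequences apart, which is what allows pre-training on unlabelled data before fine-tuning on a labelled classification task.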


BMC Bioinformatics 25(1):328. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11476100/pdf/

Citations: 0

A multi-task graph deep learning model to predict drugs combination of synergy and sensitivity scores.

IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-10-10 DOI: 10.1186/s12859-024-05925-0

Samar Monem, Aboul Ella Hassanien, Alaa H Abdel-Hamid

Background: Drug combination treatments have proven to be a realistic technique for treating challenging diseases such as cancer by enhancing efficacy and mitigating side effects. To achieve the therapeutic goals of these combinations, it is essential to employ multi-targeted drug combinations, which maximize effectiveness and synergistic effects.

Results: This paper proposes 'MultiComb', a multi-task deep learning (MTDL) model designed to simultaneously predict the synergy and sensitivity of drug combinations. The model uses a graph convolution network to represent the Simplified Molecular-Input Line-Entry System (SMILES) strings of two drugs, generating their respective features. In addition, three fully connected subnetworks extract features of the cancer cell line. These drug and cell line features are then concatenated and processed through an attention mechanism, which outputs two optimized feature representations for the target tasks. The cross-stitch model learns the relationship between these tasks. Finally, each learned task feature is fed into fully connected subnetworks to predict the synergy and sensitivity scores. The proposed model is validated using the O'Neil benchmark dataset, which includes 38 unique drugs combined to form 17,901 drug combination pairs and tested across 37 unique cancer cells. Performance is measured with mean square error ( $MSE$ ), mean absolute error ( $MAE$ ), coefficient of determination ( $R^{2}$ ), and Spearman and Pearson correlation. For synergy prediction, the mean scores on these metrics are 232.37, 9.59, 0.57, 0.76, and 0.73, respectively; for sensitivity prediction, they are 15.59, 2.74, 0.90, 0.95, and 0.95.

Conclusion: This paper proposes an MTDL model to predict synergy and sensitivity scores for drug combinations targeting specific cancer cell lines. The model outperforms existing approaches.

BMC Bioinformatics 25(1): 327. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11468365/pdf/

Citations: 0

MethylSeqLogo: DNA methylation smart sequence logos.

IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-10-09 DOI: 10.1186/s12859-024-05896-2

Fei-Man Hsu, Paul Horton

Background: Some transcription factors, MYC for example, bind sites of potentially methylated DNA. This may increase binding specificity, as such sites (1) are highly under-represented in the genome and (2) offer additional, tissue-specific information in the form of hypo- or hyper-methylation. Fortunately, bisulfite sequencing data can be used to investigate this phenomenon.

Method: We developed MethylSeqLogo, an extension of sequence logos that includes new elements to indicate DNA methylation and under-represented dimers at each position of a set of binding sites. Our method displays information from both DNA strands and takes into account the sequence context (CpG or other) and genome region (promoter versus whole genome) needed to properly assess the expected background dimer frequency and level of methylation. MethylSeqLogo preserves sequence logo semantics: the relative height of nucleotides within a column represents their proportion in the binding sites, the absolute height of each column represents information (relative entropy), and the height of all columns added together represents total information.

Results: We present figures illustrating the utility of MethylSeqLogo for summarizing data from several CpG-binding transcription factors. The logos show that unmethylated CpG binding sites are a feature of transcription factors such as MYC and ZBTB33, while some other CpG-binding transcription factors, such as CEBPB, appear methylation neutral.

Conclusions: Our software enables users to explore bisulfite and ChIP sequencing data sets and, in the process, obtain publication-quality figures.
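The sequence logo semantics described in the Method paragraph (letter height proportional to frequency, column height equal to relative entropy, total height equal to summed information) can be sketched as follows. This is a plain single-nucleotide logo against a fixed background, not MethylSeqLogo itself: the methylation elements, the dimer-aware background, and any small-sample correction are omitted, and the binding sites and uniform background are hypothetical.

```python
import math
from collections import Counter


def column_heights(column, background):
    """Return (information in bits, per-letter heights) for one logo column.

    Information is the relative entropy sum_b p_b * log2(p_b / q_b), where p is
    the observed frequency and q the background. Unobserved letters contribute 0.
    """
    counts = Counter(column)
    total = len(column)
    freqs = {base: n / total for base, n in counts.items()}
    info = sum(p * math.log2(p / background[base]) for base, p in freqs.items())
    heights = {base: p * info for base, p in freqs.items()}
    return info, heights


def logo_information(sites, background):
    """Per-column information and total information for aligned sites."""
    per_column = [column_heights(col, background)[0] for col in zip(*sites)]
    return per_column, sum(per_column)


# Hypothetical aligned binding sites and a uniform background (assumptions).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
sites = ["CACGTG", "CACGTG", "CACGTG", "CACATG"]

per_column, total = logo_information(sites, uniform)
print(per_column, total)
print(column_heights([s[3] for s in sites], uniform))
```

Swapping `uniform` for a genome- or promoter-specific background changes every column's height, which is why the abstract stresses choosing the background appropriate to the region and sequence context.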

BMC Bioinformatics 25(Suppl 2): 326. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11462690/pdf/

Citations: 0

NeuroimaGene: an R package for assessing the neurological correlates of genetically regulated gene expression.

IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-10-08 DOI: 10.1186/s12859-024-05936-x

Xavier Bledsoe, Eric R Gamazon

Background: We present the NeuroimaGene resource as an R package designed to assist researchers in identifying genes and neurologic features relevant to psychiatric and neurological health. While recent studies have identified hundreds of genes as potential components of pathophysiology in neurologic and psychiatric disease, interpreting the physiological consequences of this variation is challenging. The integration of neuroimaging data with molecular findings is a step toward addressing this challenge. In addition to sharing associations with both molecular variation and clinical phenotypes, neuroimaging features are intrinsically informative of cognitive processes. NeuroimaGene provides a tool to understand how disease-associated genes relate to the intermediate structure of the brain.

Results: We created NeuroimaGene, a user-friendly, open access R package now available for public use. Its primary function is to identify neuroimaging derived brain features that are impacted by genetically regulated expression of user-provided genes or gene sets. This resource can be used to (1) characterize individual genes or gene sets as relevant to the structure and function of the brain, (2) identify the region(s) of the brain or body in which expression of target gene(s) is neurologically relevant, (3) impute the brain features most impacted by user-defined gene sets such as those produced by cohort level gene association studies, and (4) generate publication level, modifiable visual plots of significant findings. We demonstrate the utility of the resource by identifying neurologic correlates of stroke-associated genes derived from pre-existing analyses.

Conclusions: Integrating neurologic data as an intermediate phenotype in the pathway from genes to brain-based diagnostic phenotypes increases the interpretability of molecular studies and enriches our understanding of disease pathophysiology. The NeuroimaGene R package is designed to assist in this process and is publicly available for use.

BMC Bioinformatics 25(1): 325. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11463069/pdf/

Citations: 0

Crossfeat: a transformer-based cross-feature learning model for predicting drug side effect frequency.

IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics

Pub Date : 2024-10-08 DOI: 10.1186/s12859-024-05915-2

Bin Baek, Hyunju Lee

Background: Safe drug treatment requires an understanding of the potential side effects. Identifying the frequency of drug side effects can reduce the risks associated with drug use. However, existing computational methods for predicting drug side effect frequencies heavily depend on known drug side effect frequency information. Consequently, these methods face challenges when predicting the side effect frequencies of new drugs. Although a few methods can predict the side effect frequencies of new drugs, they exhibit unreliable performance owing to the exclusion of drug-side effect relationships.

Results: This study proposes CrossFeat, a model based on a convolutional neural network-transformer architecture with cross-feature learning that can predict the occurrence and frequency of side effects for new drugs, even in the absence of information regarding drug-side effect relationships. CrossFeat learns drug and side effect information concurrently within its transformer architecture. This simultaneous exchange lets drug representations incorporate information about their associated side effects, while side effect representations incorporate information about the respective drugs. Such bidirectional learning allows for the comprehensive integration of drug and side effect knowledge. Our five-fold cross-validation experiments demonstrated that CrossFeat outperforms existing methods in predicting side effect frequencies for new drugs without prior knowledge.

Conclusions: Our model offers a promising approach for predicting drug side effect frequencies, particularly for new drugs where prior information is limited. CrossFeat's superior performance in cross-validation experiments, along with evidence from case studies and ablation experiments, highlights its effectiveness.
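Evaluating a model on "new drugs" usually means that no drug in the test fold appears in training. The sketch below shows one way to build such drug-wise five-fold splits in plain Python. It is an assumption about the evaluation setup, since the abstract does not describe how CrossFeat's folds were constructed, and the record layout, function names, and frequency classes are hypothetical.

```python
import random


def drug_wise_folds(drug_ids, n_folds=5, seed=0):
    """Partition the unique drugs into n_folds disjoint groups."""
    unique = sorted(set(drug_ids))
    random.Random(seed).shuffle(unique)
    return [unique[i::n_folds] for i in range(n_folds)]


def split_pairs(pairs, test_drugs):
    """Split (drug, side_effect, frequency) records so test drugs are unseen."""
    test_drugs = set(test_drugs)
    train = [p for p in pairs if p[0] not in test_drugs]
    test = [p for p in pairs if p[0] in test_drugs]
    return train, test


# Hypothetical records: 10 drugs x 3 side effects, frequency class in 1..5.
pairs = [(f"drug{d}", f"se{s}", (d + s) % 5 + 1)
         for d in range(10) for s in range(3)]

folds = drug_wise_folds([p[0] for p in pairs])
for i, fold in enumerate(folds):
    train, test = split_pairs(pairs, fold)
    print(i, len(train), len(test))
```

Splitting by drug rather than by individual drug-side effect pair prevents the model from seeing any known frequency for a test drug, which is the cold-start condition the abstract targets.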

BMC Bioinformatics 25(1): 324. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11459996/pdf/

Citations: 0