Biosynthetic gene clusters (BGCs) encode enzymatic pathways that produce diverse natural products with applications in pharmaceuticals, agriculture, and biotechnology. Broad-spectrum tools such as antiSMASH and DeepBGC cover many BGC classes but may face challenges in detecting atypical or hybrid architectures. Here we present RFBGCPred, an open-source machine learning classifier focused on five clinically and agriculturally important classes—PKS, NRPS, RiPPs, terpenes, and PKS–NRPS hybrids. Rather than replacing existing pipelines, RFBGCPred complements them by improving class-level discrimination within this subset.
Using curated data from the MIBiG database, we applied Word2Vec for feature extraction, supervised UMAP for dimensionality reduction, and SMOTE to address class imbalance. Multiple classifiers were benchmarked, and Random Forest was identified as the top performer using the TOPSIS multi-criteria decision-making method. The final model achieved an accuracy of 98.0% (MCC: 0.9752, AUC: 0.9928) on a balanced test set and maintained strong generalization on an unbalanced validation set (accuracy: 94.8%, MCC: 0.89, AUC: 0.96). Compared with antiSMASH and DeepBGC, RFBGCPred showed improved recall for hybrid PKS–NRPS clusters while sustaining competitive precision, thereby reducing misclassification of atypical arrangements.
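To illustrate how a TOPSIS-based ranking of candidate classifiers can work, the following is a minimal sketch rather than the authors' implementation. The function name `topsis`, the equal weights, the placeholder model names, and the metric values are all assumptions made for illustration; none of them are taken from the study's benchmark.

```python
import math


def topsis(matrix, weights, benefit):
    """Return a TOPSIS closeness score in [0, 1] for each alternative (row).

    matrix  : list of rows, one per alternative, one column per criterion
    weights : importance of each criterion (assumed here to sum to 1)
    benefit : True if higher is better for that criterion, False if lower is
    """
    n_crit = len(weights)
    # Vector-normalize each criterion column so scales are comparable.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    # Ideal best and ideal worst value per criterion.
    cols = list(zip(*weighted))
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)    # distance to the ideal solution
        d_worst = math.dist(row, worst)  # distance to the anti-ideal solution
        total = d_best + d_worst
        # If all alternatives are identical, no ranking is possible.
        scores.append(d_worst / total if total > 0 else 0.5)
    return scores


# Hypothetical metrics (accuracy, MCC, AUC); NOT results from the paper.
names = ["model_a", "model_b", "model_c"]
metrics = [
    [0.98, 0.97, 0.99],
    [0.95, 0.93, 0.97],
    [0.90, 0.88, 0.94],
]
scores = topsis(metrics, weights=[1 / 3, 1 / 3, 1 / 3],
                benefit=[True, True, True])
ranking = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A candidate that is best on every criterion coincides with the ideal solution and therefore receives a closeness score of 1, while one that is worst on every criterion receives 0. In practice the weights and the set of criteria would follow whatever the benchmarking protocol specifies.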
RFBGCPred supports FASTA, GenBank, and CSV inputs, with full source code, curated datasets, and documentation available at: https://github.com/SHARANBASAPPA/RFBGCPred.git.
