Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers.

Cancer Informatics; IF 2.4; Q2, Mathematical & Computational Biology; Pub Date: 2023-01-01; DOI: 10.1177/11769351231167992

Lavanya C, Pooja S, Abhay H Kashyap, Abdur Rahaman, Swarna Niranjan, Vidya Niranjan

Cancer Informatics, volume 22, article 11769351231167992. Publication type: Journal Article.

PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/c4/97/10.1177_11769351231167992.PMC10126698.pdf

Article page: https://doi.org/10.1177/11769351231167992

Citations: 3

Abstract

Lung cancer is considered the most common and deadliest type of cancer. It is mainly of two types: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for about 85% of cases, while SCLC accounts for only about 14%. Over the last decade, functional genomics has emerged as a revolutionary tool for studying genetics and uncovering changes in gene expression. RNA-Seq has been applied to investigate rare and novel transcripts that aid in discovering the genetic changes occurring in tumours in different lung cancers. Although RNA-Seq helps to understand and characterise the gene expression involved in lung cancer diagnostics, discovering biomarkers remains a challenge. Classification models help to uncover and classify biomarkers based on gene expression levels across the different lung cancers.

The current research concentrates on computing transcript statistics from gene transcript files with a normalised fold change of genes, and on identifying quantifiable differences in gene expression levels between the reference genome and lung cancer samples. The collected data were analysed, and machine learning models were developed to classify genes as causing NSCLC, causing SCLC, causing both, or causing neither. An exploratory data analysis was performed to identify the probability distribution and principal features. Because only a limited number of features were available, all of them were used to predict the class. To address the imbalance in the dataset, the Near Miss under-sampling algorithm was applied. For classification, the research primarily focused on 4 supervised machine learning algorithms (Logistic Regression, KNN classifier, SVM classifier and Random Forest classifier); additionally, 2 ensemble algorithms were considered (XGBoost and AdaBoost).

Of these, based on the weighted metrics considered, the Random Forest classifier, with 87% accuracy, was judged the best-performing algorithm and was therefore used to predict the biomarkers causing NSCLC and SCLC. The imbalance and limited features in the dataset restrict any further improvement in the model's accuracy or precision. In the present study, using the gene expression values (LogFC, P Value) as the feature set in the Random Forest classifier, BRAF, KRAS, NRAS and EGFR are predicted from the transcriptome analysis to be possible biomarkers causing NSCLC, and ATF6, ATF3, PGDFA, PGDFD, PGDFC and PIP5K1C are predicted to be possible biomarkers causing SCLC. After fine-tuning, the model gave a precision of 91.3% and a recall of 91%. Some of the common biomarkers predicted for both NSCLC and SCLC were CDK4, CDK6, BAK1, CDKN1A and DDB2.
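The workflow the abstract outlines (two expression features per gene, Near Miss under-sampling, a Random Forest classifier, weighted metrics) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' code: the data are synthetic, the class sizes and logFC centres are invented, `make_toy_genes` and `near_miss_1` are hypothetical helpers, and Near Miss is re-implemented in a simplified NearMiss-1 form with NumPy instead of whichever library the study used. Under-sampling only the training split is also a choice made for this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split


def make_toy_genes(seed=0):
    """Synthetic stand-in for the gene table: columns are [logFC, p-value].

    Labels (hypothetical encoding): 0 = neither, 1 = NSCLC, 2 = SCLC, 3 = both.
    Class sizes and logFC centres are invented purely for illustration.
    """
    rng = np.random.default_rng(seed)
    sizes = {0: 400, 1: 120, 2: 80, 3: 40}
    centres = {0: 0.0, 1: 2.0, 2: -2.0, 3: 4.0}
    X_parts, y_parts = [], []
    for label, n in sizes.items():
        log_fc = rng.normal(centres[label], 0.7, size=n)
        p_value = rng.uniform(0.0, 1.0 if label == 0 else 0.05, size=n)
        X_parts.append(np.column_stack([log_fc, p_value]))
        y_parts.append(np.full(n, label))
    return np.vstack(X_parts), np.concatenate(y_parts)


def near_miss_1(X, y, n_neighbors=3):
    """Simplified NearMiss-1 under-sampling.

    Every class is cut down to the size of the smallest class. From each larger
    class, keep the samples whose mean distance to their n_neighbors nearest
    smallest-class samples is lowest.
    """
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_keep = counts.min()
    X_min = X[y == minority]
    k = min(n_neighbors, len(X_min))
    keep = [np.flatnonzero(y == minority)]
    for cls in classes:
        if cls == minority:
            continue
        idx = np.flatnonzero(y == cls)
        # Pairwise Euclidean distances, shape (n_cls, n_minority).
        dists = np.linalg.norm(X[idx][:, None, :] - X_min[None, :, :], axis=2)
        mean_knn = np.sort(dists, axis=1)[:, :k].mean(axis=1)
        keep.append(idx[np.argsort(mean_knn)[:n_keep]])
    keep = np.concatenate(keep)
    return X[keep], y[keep]


X, y = make_toy_genes()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Balance the training split only; the test split keeps its natural imbalance.
X_bal, y_bal = near_miss_1(X_train, y_train)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_bal, y_bal)
y_pred = model.predict(X_test)

# "Weighted" averages per-class scores by each class's support in y_test.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="weighted", zero_division=0
)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The scores printed for this toy data say nothing about the study's reported figures; the sketch only shows how the steps fit together. Swapping `RandomForestClassifier` for another scikit-learn estimator would give the kind of side-by-side comparison the abstract describes.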

Source journal: Cancer Informatics (Medicine-Oncology)

CiteScore: 3.00

Self-citation rate: 5.00%

Review time: 8 weeks

About the journal: The field of cancer research relies on advances in many other disciplines, including omics technology, mass spectrometry, radio imaging, computer science, and biostatistics. Cancer Informatics provides open access to peer-reviewed, high-quality manuscripts reporting bioinformatics analysis of molecular genetics and/or clinical data pertaining to cancer, emphasizing the use of machine learning, artificial intelligence, statistical algorithms, advanced imaging techniques, data visualization, and high-throughput technologies. As the leading journal dedicated exclusively to reporting the use of computational methods in cancer research and practice, Cancer Informatics leverages methodological improvements in systems biology, genomics, proteomics, metabolomics, and molecular biochemistry into the fields of cancer detection, treatment, classification, risk-prediction, prevention, outcome, and modeling.