Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers.

Cancer Informatics (IF 2.4, JCR Q2, Mathematical & Computational Biology). Pub Date: 2023-01-01. DOI: 10.1177/11769351231167992
Lavanya C, Pooja S, Abhay H Kashyap, Abdur Rahaman, Swarna Niranjan, Vidya Niranjan
Full-text PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/c4/97/10.1177_11769351231167992.PMC10126698.pdf
Citations: 3

Abstract


摘要图片

摘要图片

摘要图片

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Lung cancer is considered the most common and deadliest cancer type. It has 2 main types: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for about 85% of cases, while SCLC accounts for only about 14%. Over the last decade, functional genomics has emerged as a revolutionary tool for studying genetics and uncovering changes in gene expression. RNA-Seq has been applied to investigate rare and novel transcripts that aid in discovering the genetic changes that occur in tumours in different lung cancers. Although RNA-Seq helps to understand and characterise the gene expression involved in lung cancer diagnostics, discovering biomarkers remains a challenge. Classification models help uncover and classify biomarkers based on gene expression levels across the different lung cancers. The current research concentrates on computing transcript statistics from gene transcript files with a normalised fold change of genes, and on identifying quantifiable differences in gene expression levels between the reference genome and lung cancer samples. The collected data were analysed, and machine learning models were developed to classify genes as causing NSCLC, causing SCLC, causing both, or causing neither. An exploratory data analysis was performed to identify the probability distribution and principal features. Because only a limited number of features was available, all of them were used to predict the class. To address the class imbalance in the dataset, the NearMiss under-sampling algorithm was applied. For classification, the research primarily focused on 4 supervised machine learning algorithms (Logistic Regression, KNN classifier, SVM classifier and Random Forest classifier); 2 additional ensemble algorithms, XGBoost and AdaBoost, were also considered. Of these, based on the weighted metrics considered, the Random Forest classifier, with 87% accuracy, was judged the best-performing algorithm and was therefore used to predict the biomarkers causing NSCLC and SCLC. The imbalance and limited features in the dataset restrict any further improvement in the model's accuracy or precision. In the present study, using the gene expression values (LogFC, P value) as the feature set in the Random Forest classifier, BRAF, KRAS, NRAS and EGFR are predicted to be possible biomarkers causing NSCLC, and ATF6, ATF3, PDGFA, PDGFD, PDGFC and PIP5K1C are predicted to be possible biomarkers causing SCLC, from the transcriptome analysis. After fine-tuning, the model gave a precision of 91.3% and a recall of 91%. Common biomarkers predicted for both NSCLC and SCLC included CDK4, CDK6, BAK1, CDKN1A and DDB2.
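
The abstract names two steps that are easy to misread without a concrete example: under-sampling the larger classes and scoring with weighted metrics. The sketch below illustrates both in plain Python. It is a simplified NearMiss-1-style rule and a support-weighted precision/recall, written for illustration only; it is not the authors' code, and it is not the implementation found in any particular library. The function names, the toy (logFC, p_value) rows and the labels are all hypothetical.

```python
import math
from collections import Counter


def near_miss_1(X, y, k=3):
    """NearMiss-1-style under-sampling (simplified sketch, an assumption).

    Every class larger than the smallest one is cut down to the size of the
    smallest class, keeping the samples whose mean Euclidean distance to
    their k nearest minority-class samples is smallest.
    """
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    n_keep = counts[minority]
    minority_pts = [x for x, label in zip(X, y) if label == minority]

    def score(x):
        # Mean distance to the k closest minority-class points.
        nearest = sorted(math.dist(x, m) for m in minority_pts)[:k]
        return sum(nearest) / len(nearest)

    X_out, y_out = [], []
    for label in counts:
        members = [x for x, lab in zip(X, y) if lab == label]
        if label != minority:
            members = sorted(members, key=score)[:n_keep]
        X_out.extend(members)
        y_out.extend([label] * len(members))
    return X_out, y_out


def weighted_precision_recall(y_true, y_pred):
    """Per-class precision and recall, averaged with class-support weights."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        class_precision = tp / predicted if predicted else 0.0
        class_recall = tp / n
        precision += (n / total) * class_precision
        recall += (n / total) * class_recall
    return precision, recall


# Hypothetical toy data: each row is (logFC, p_value); labels are made up.
X = [(2.1, 0.01), (2.4, 0.02), (0.1, 0.60), (0.2, 0.55),
     (0.0, 0.70), (1.9, 0.03), (-0.1, 0.80), (0.3, 0.50)]
y = ["NSCLC", "NSCLC", "neither", "neither",
     "neither", "neither", "neither", "neither"]

X_bal, y_bal = near_miss_1(X, y, k=2)
print(Counter(y_bal))  # both classes now have 2 samples each

y_true = ["NSCLC", "NSCLC", "SCLC", "neither"]
y_pred = ["NSCLC", "SCLC", "SCLC", "neither"]
print(weighted_precision_recall(y_true, y_pred))  # (0.875, 0.75)
```

In the toy run, the two retained "neither" rows are the ones closest to the NSCLC rows, which is the intent of this family of under-samplers: they keep majority samples near the class boundary. A classifier such as a random forest would then be trained on the balanced set and scored with the weighted metrics.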

Source journal: Cancer Informatics (Medicine, Oncology)
CiteScore: 3.00
Self-citation rate: 5.00%
Articles published: 30
Review time: 8 weeks
Journal description: The field of cancer research relies on advances in many other disciplines, including omics technology, mass spectrometry, radio imaging, computer science, and biostatistics. Cancer Informatics provides open access to peer-reviewed high-quality manuscripts reporting bioinformatics analysis of molecular genetics and/or clinical data pertaining to cancer, emphasizing the use of machine learning, artificial intelligence, statistical algorithms, advanced imaging techniques, data visualization, and high-throughput technologies. As the leading journal dedicated exclusively to reporting the use of computational methods in cancer research and practice, Cancer Informatics leverages methodological improvements in systems biology, genomics, proteomics, metabolomics, and molecular biochemistry into the fields of cancer detection, treatment, classification, risk-prediction, prevention, outcome, and modeling.