TULIP: An RNA-seq-based Primary Tumor Type Prediction Tool Using Convolutional Neural Networks.

Cancer Informatics; IF 2.4; JCR Q2 (Mathematical & Computational Biology); Pub Date: 2022-01-01; DOI: 10.1177/11769351221139491
Sara Jones, Matthew Beyers, Maulik Shukla, Fangfang Xia, Thomas Brettin, Rick Stevens, M Ryan Weil, Satishkumar Ranganathan Ganakammal
Cancer Informatics, volume 21, article 11769351221139491. Full-text PDF (PMC9729992): https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/b1/0b/10.1177_11769351221139491.PMC9729992.pdf
Citations: 2

Abstract

Background: Cancer is one of the leading causes of death worldwide, and accurate prediction of the primary tumor type is critical for identifying genetic factors that can inhibit or slow tumor progression. Over the last several years, efforts have been made to categorize primary tumor types from gene expression data using machine learning and, more recently, deep learning.

Methods: We developed four 1-dimensional (1D) convolutional neural network (CNN) models that classify RNA-seq count data as either one of 17 highly represented primary tumor types or one of 32 primary tumor types regardless of imbalanced representation. The models take as input either all Ensembl genes (60,483) or protein-coding genes only (19,758). Unlike previous work, we avoided selection bias by not filtering genes based on expression values. For training and validating the models, RNA-seq count data expressed as FPKM-UQ were downloaded from the Genomic Data Commons (GDC) for 9,025 and 10,940 samples from The Cancer Genome Atlas (TCGA), corresponding to the 17 and 32 primary tumor types, respectively.
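The four models described above are the combinations of two gene sets and two label sets. The following is a minimal sketch of that grid, not the authors' code: the gene, class, and sample counts come from the abstract, while the names `ModelVariant`, `GENE_SETS`, `CLASS_SETS`, and the single-channel `(n_genes, 1)` input shape are illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import product

# Counts below are taken from the abstract; all names are hypothetical.
GENE_SETS = {"all_ensembl": 60_483, "protein_coding": 19_758}
CLASS_SETS = {17: 9_025, 32: 10_940}  # number of tumor types -> TCGA samples


@dataclass(frozen=True)
class ModelVariant:
    gene_set: str
    n_genes: int
    n_classes: int
    n_samples: int

    @property
    def input_shape(self):
        # Assumption: each sample is fed to the 1D CNN as (length, channels),
        # with one expression value per gene and a single channel.
        return (self.n_genes, 1)


def enumerate_variants():
    """Return one ModelVariant per (gene set, label set) combination."""
    return [
        ModelVariant(name, n_genes, n_classes, n_samples)
        for (name, n_genes), (n_classes, n_samples)
        in product(GENE_SETS.items(), CLASS_SETS.items())
    ]


variants = enumerate_variants()
for v in variants:
    print(f"{v.gene_set:>14} | {v.n_classes} classes | "
          f"input {v.input_shape} | {v.n_samples} samples")
```

Because no genes are filtered by expression value, the input length in this sketch is simply the full size of the chosen gene set.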

Results: All four 1D-CNN models achieved an overall accuracy of 94.7% to 97.6% on the test dataset. Further evaluation indicates that, for both the 17 and 32 primary tumor type settings, the models using only protein-coding genes as features were more accurate than the models using all Ensembl genes. For all models, per-type accuracy was above 80% for most primary tumor types.
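The distinction between overall accuracy and accuracy by tumor type matters when classes are imbalanced. The sketch below shows one common way to compute both; it assumes per-type accuracy means the fraction of samples of each true type that were predicted correctly, which may differ from the paper's exact definition. The labels are made-up toy data using TCGA-style abbreviations, not results from the study.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """For each true label, the fraction of its samples predicted correctly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in totals.items()}


# Toy, imbalanced example (hypothetical labels, not data from the paper).
y_true = ["BRCA"] * 8 + ["LUAD"] * 2
y_pred = ["BRCA"] * 8 + ["LUAD", "BRCA"]

print(overall_accuracy(y_true, y_pred))    # 0.9
print(per_class_accuracy(y_true, y_pred))  # {'BRCA': 1.0, 'LUAD': 0.5}
```

In this toy case the overall accuracy is 0.9 even though the rarer type is classified correctly only half the time, which is why reporting accuracy per tumor type alongside the overall figure is informative.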

Conclusions: We packaged all four models as a Python-based deep learning classification tool called TULIP (TUmor CLassIfication Predictor) for performing quality control on primary tumor samples and characterizing cancer samples of unknown tumor type. Further optimization of the models is needed to improve accuracy for certain primary tumor types.

Source journal: Cancer Informatics (Medicine-Oncology)
CiteScore: 3.00
Self-citation rate: 5.00%
Articles published: 30
Review time: 8 weeks
Journal description: The field of cancer research relies on advances in many other disciplines, including omics technology, mass spectrometry, radio imaging, computer science, and biostatistics. Cancer Informatics provides open access to peer-reviewed high-quality manuscripts reporting bioinformatics analysis of molecular genetics and/or clinical data pertaining to cancer, emphasizing the use of machine learning, artificial intelligence, statistical algorithms, advanced imaging techniques, data visualization, and high-throughput technologies. As the leading journal dedicated exclusively to the report of the use of computational methods in cancer research and practice, Cancer Informatics leverages methodological improvements in systems biology, genomics, proteomics, metabolomics, and molecular biochemistry into the fields of cancer detection, treatment, classification, risk-prediction, prevention, outcome, and modeling.