TULIP: An RNA-seq-based Primary Tumor Type Prediction Tool Using Convolutional Neural Networks.

Cancer Informatics (IF 2.5; Q2, Mathematical & Computational Biology) | Pub Date: 2022-01-01 | DOI: 10.1177/11769351221139491

Sara Jones, Matthew Beyers, Maulik Shukla, Fangfang Xia, Thomas Brettin, Rick Stevens, M Ryan Weil, Satishkumar Ranganathan Ganakammal


Citations: 2

Abstract

Background: Cancer is one of the leading causes of death worldwide, and accurate prediction of primary tumor type is critical for identifying genetic factors that can inhibit or slow tumor progression. Over the last several years, machine learning, and more recently deep learning, has been used to categorize primary tumor types from gene expression data.

Methods: We developed four 1-dimensional (1D) convolutional neural network (CNN) models that classify RNA-seq count data as one of 17 highly represented primary tumor types or as one of 32 primary tumor types regardless of imbalanced representation. The models take as input either all Ensembl genes (60,483) or protein-coding genes only (19,758). Unlike previous work, we avoided selection bias by not filtering genes based on expression values. To train and validate the models, we downloaded RNA-seq count data expressed as FPKM-UQ from the Genomic Data Commons (GDC) for 9,025 and 10,940 samples from The Cancer Genome Atlas (TCGA), corresponding to the 17 and 32 primary tumor types respectively.
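To make the idea of a 1D CNN over an expression vector concrete, the following is a minimal pure-Python sketch of a single forward pass: one convolution filter, ReLU, max pooling, a dense layer, and softmax over tumor types. It is not the TULIP architecture. The layer count, kernel size, pooling size, random weights, and toy dimensions (20 genes, 3 classes) are all assumptions chosen for illustration; the real models take 19,758 or 60,483 genes and predict 17 or 32 types.

```python
import math
import random


def conv1d(signal, kernel, stride=1):
    """Valid 1D cross-correlation of a list of values with a kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def dense(values, weights, biases):
    """Fully connected layer with one weight row per output class."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(expression, kernel, weights, biases, pool_size=2):
    """Map one sample's expression vector to class probabilities."""
    features = max_pool1d(relu(conv1d(expression, kernel)), pool_size)
    return softmax(dense(features, weights, biases))


# Toy dimensions and random (untrained) weights: assumptions for illustration.
rng = random.Random(0)
n_genes, n_classes, kernel_size, pool_size = 20, 3, 5, 2
expression = [rng.random() for _ in range(n_genes)]
kernel = [rng.uniform(-1, 1) for _ in range(kernel_size)]
n_features = (n_genes - kernel_size + 1) // pool_size
weights = [[rng.uniform(-1, 1) for _ in range(n_features)]
           for _ in range(n_classes)]
biases = [0.0] * n_classes

probs = predict(expression, kernel, weights, biases, pool_size)
predicted_class = max(range(n_classes), key=probs.__getitem__)
print(predicted_class, [round(p, 3) for p in probs])
```

Because the convolution slides along the gene axis, the output depends on gene order in the input vector, so a fixed gene ordering across samples is assumed here.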

Results: All four 1D-CNN models achieved an overall accuracy between 94.7% and 97.6% on the test dataset. Further evaluation indicates that, for both the 17-type and 32-type settings, the models using only protein-coding genes as features were more accurate than the models using all Ensembl genes. For all models, accuracy by primary tumor type was above 80% for most primary tumor types.
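The distinction between overall accuracy and accuracy by primary tumor type matters when classes are imbalanced. The sketch below computes both on made-up labels; the label lists are toy data, not results from the paper, and "accuracy by type" is assumed here to mean the fraction of samples of each true type that were predicted correctly (per-class recall).

```python
from collections import Counter


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of all samples whose predicted type matches the true type."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def per_type_accuracy(true_labels, predicted_labels):
    """Fraction of samples of each true type that were predicted correctly."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels (illustrative only): one rare type is always misclassified.
true_labels = ["BRCA"] * 6 + ["LUAD"] * 3 + ["CHOL"]
predicted_labels = ["BRCA"] * 6 + ["LUAD"] * 3 + ["LUAD"]

overall = overall_accuracy(true_labels, predicted_labels)
by_type = per_type_accuracy(true_labels, predicted_labels)
print(overall)
print(by_type)
```

In this toy case the overall accuracy is 0.9 even though the single rare-type sample is misclassified, which is why a per-type breakdown is reported alongside the overall figure.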

Conclusions: We packaged all four models as a Python-based deep learning classification tool called TULIP (TUmor CLassIfication Predictor) for performing quality control on primary tumor samples and characterizing cancer samples of unknown tumor type. Further optimization of the models is needed to improve accuracy for certain primary tumor types.
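One way a classifier's predictions could support quality control is to flag samples whose predicted type disagrees with the annotated type, or whose top probability is low. The sketch below is a hypothetical illustration only: the record fields, the `qc_flags` helper, and the 0.8 threshold are assumptions and are not part of TULIP's interface.

```python
def qc_flags(samples, min_confidence=0.8):
    """Return {sample_id: [reasons]} for samples that warrant manual review.

    Each sample is a dict with hypothetical keys: 'sample_id', 'annotated'
    (the recorded tumor type, or None if unknown), 'predicted', and
    'confidence' (the top class probability).
    """
    flags = {}
    for s in samples:
        reasons = []
        if s["annotated"] is not None and s["annotated"] != s["predicted"]:
            reasons.append("label mismatch")
        if s["confidence"] < min_confidence:
            reasons.append("low confidence")
        if reasons:
            flags[s["sample_id"]] = reasons
    return flags


# Toy records (illustrative only).
samples = [
    {"sample_id": "S1", "annotated": "BRCA", "predicted": "BRCA", "confidence": 0.99},
    {"sample_id": "S2", "annotated": "LUAD", "predicted": "LUSC", "confidence": 0.95},
    {"sample_id": "S3", "annotated": None, "predicted": "KIRC", "confidence": 0.55},
]

flagged = qc_flags(samples)
print(flagged)
```

A sample with no annotated type (here `S3`) is never a mismatch; it is flagged only if the prediction confidence falls below the chosen threshold.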



Source journal

Cancer Informatics (Medicine: Oncology)

CiteScore: 3.00

Self-citation rate: 5.00%

Review time: 8 weeks

About the journal: The field of cancer research relies on advances in many other disciplines, including omics technology, mass spectrometry, radio imaging, computer science, and biostatistics. Cancer Informatics provides open access to peer-reviewed high-quality manuscripts reporting bioinformatics analysis of molecular genetics and/or clinical data pertaining to cancer, emphasizing the use of machine learning, artificial intelligence, statistical algorithms, advanced imaging techniques, data visualization, and high-throughput technologies. As the leading journal dedicated exclusively to the report of the use of computational methods in cancer research and practice, Cancer Informatics leverages methodological improvements in systems biology, genomics, proteomics, metabolomics, and molecular biochemistry into the fields of cancer detection, treatment, classification, risk-prediction, prevention, outcome, and modeling.