Tissue Classification Using RNA-Seq Transcriptomics with Distribution Analysis and SVM Models*

2023 Systems and Information Engineering Design Symposium (SIEDS). Pub Date: 2023-04-27. DOI: 10.1109/SIEDS58326.2023.10137900

Dominick DeCanio, Minah Kim, Samuel Haddox, G. Guadagni



Abstract

The human body produces more proteins than it has protein-coding genes. This diversity stems from the alternative ways in which RNA can be spliced and reassembled. Each alternative version of an RNA transcript produces a different protein, allowing the body to produce a wide range of proteins from a single gene. Some alternative RNA transcripts, however, contain splicing errors and produce faulty proteins involved in genetic diseases. Understanding splicing patterns and profiles therefore has wide implications for our understanding of healthy and diseased tissue states. Currently, little is known about the splicing profiles of healthy tissue, which vary across individuals and, within individuals, by tissue type. This project therefore explored the use of RNA splicing data from the first chromosome to predict the tissue type of non-cancerous samples using distribution analysis and supervised learning methods. The Kolmogorov-Smirnov test, used to classify samples based on their empirical cumulative distribution functions, was not able to reliably distinguish between tissue types. Our SVM model was run on both the quantity of splice junctions observed and their presence, and achieved high prediction accuracy on both data sets. The performance of the two SVM models was not significantly different. Overall, the findings suggest the utility of splice junction data in biological classification and set the foundation for future work on mapping splicing patterns to phenotype.
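The abstract does not say how the Kolmogorov-Smirnov test was turned into a classifier, so the following is only a minimal sketch under one plausible assumption: a sample is assigned to the tissue whose pooled reference values give the smallest two-sample KS statistic against the sample's own values. The function names (`ks_statistic`, `classify_by_ks`, `to_presence`), the tissue labels, and the synthetic data are all hypothetical and are not taken from the paper. The `to_presence` helper illustrates the difference between the two feature sets mentioned for the SVM (junction quantity versus junction presence).

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two ECDFs."""
    a, b = sorted(a), sorted(b)
    # Both ECDFs are step functions that change only at observed values,
    # so evaluating the gap at every observed value is sufficient.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def classify_by_ks(sample, references):
    """Assumed rule: pick the tissue whose reference values are closest in KS distance."""
    return min(references, key=lambda tissue: ks_statistic(sample, references[tissue]))


def to_presence(counts):
    """Collapse splice-junction read counts to 0/1 presence flags."""
    return [1 if c > 0 else 0 for c in counts]


# Synthetic data only: two made-up tissues with clearly different distributions.
rng = random.Random(0)
references = {
    "tissue_A": [rng.gauss(5, 1) for _ in range(200)],
    "tissue_B": [rng.gauss(20, 1) for _ in range(200)],
}
sample = [rng.gauss(20, 1) for _ in range(50)]

predicted = classify_by_ks(sample, references)
presence = to_presence([0, 3, 0, 12])
```

On real splicing data the per-tissue distributions may overlap heavily, which would be consistent with the reported difficulty of separating tissue types with this kind of distribution-level comparison.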


