Integrative Sufficient Dimension Reduction Methods for Multi-Omics Data Analysis

Yashita Jain, Shanshan Ding
Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Published 2017-08-20. DOI: 10.1145/3107411.3108225 (https://doi.org/10.1145/3107411.3108225)
Citations: 0

Abstract

With the advent of high-throughput genome-wide assays, it has become possible to measure multiple types of genomic data simultaneously. Projects such as TCGA, ICGC, and NCI-60 have generated comprehensive, multi-dimensional maps of key genomic changes, including miRNA, mRNA, and proteomics profiles, from cancer samples [2,4]. These genomic data can be used to classify tumor types [5]. Integrative analysis of data from multiple sources can potentially provide additional biological insights, but methods for such analysis are lacking. A widely used way to handle high-dimensional data is to remove redundant information from the integrated sample. Most of the expressed genes overlap and can be projected onto a lower-dimensional space, which can then be used to classify tumor types with little or no loss of information. Sufficient dimension reduction (SDR) [1], a supervised dimension reduction approach, is well suited to this goal.

In this paper, we propose a novel integrative SDR method that reduces the dimensions of multiple data types simultaneously while sharing common latent structures, in order to improve prediction and interpretation. In particular, we extend sliced inverse regression (SIR), a major SDR method, to integrate multiple omics data for simultaneous dimension reduction. SIR is a supervised dimension reduction method that assumes the outcome variable Y depends on the predictor X only through d unknown linear combinations of the predictor [3]. The predictor is replaced by its projection onto a lower-dimensional subspace of the predictor space without loss of information. The aim is to find the central subspace (CS), defined as the intersection of all subspaces $\delta$ of the predictor space satisfying $Y \perp\!\!\!\perp X \mid P_{\delta} X$, where $P_{\delta}$ denotes the projection onto $\delta$. To integrate multiple types of data, we propose and implement a new integrative SDR method that extends SIR [3], called integrative SIR.
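The conventional SIR step described above can be illustrated with a minimal sketch. It is not the authors' implementation and does not cover the integrative extension. It assumes a categorical outcome (each class is treated as one slice) and more samples than predictors, so that the sample covariance is invertible; with 400 variables per data type and a small number of cell lines, some form of regularization would be needed instead. The function name `sir_directions` and its parameters are hypothetical.

```python
import numpy as np

def sir_directions(X, y, d):
    """Estimate d SIR directions for a categorical response y.

    X: (n, p) predictor matrix; assumes n > p (invertible covariance).
    y: (n,) class labels; each class is used as one slice.
    Returns a (p, d) basis estimate on the original X scale.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    p = X.shape[1]
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Inverse square root of the covariance via its eigendecomposition.
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mean) @ inv_sqrt  # standardized predictors
    # Weighted covariance of the slice means of Z.
    M = np.zeros((p, p))
    for label in np.unique(y):
        mask = y == label
        slice_mean = Z[mask].mean(axis=0)
        M += mask.mean() * np.outer(slice_mean, slice_mean)
    # Leading eigenvectors of M span the estimated subspace on the Z scale.
    w, v = np.linalg.eigh(M)
    top = v[:, np.argsort(w)[::-1][:d]]
    return inv_sqrt @ top  # map the directions back to the X scale
```

The reduced data are then `X @ sir_directions(X, y, d)`. With K classes used as slices, the slice-mean matrix has rank at most K - 1, so at most K - 1 directions can be recovered this way.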
The main idea of integrative SIR is to use the information in all of the multi-omics data simultaneously while finding a basis matrix for each data type, with some latent structure shared across data types. The result is d-dimensional data, where d is much smaller than the original dimension. The reduced dimension d was chosen by cross-validation.

To demonstrate integrated analysis of multi-omics data, we applied conventional SIR and integrative SIR to mRNA, miRNA, and proteomics expression profiles of a subset of cell lines from the NCI-60 panel and compared the results. The data are taken from [6]. The outcomes to classify are the CNS, leukemia, and melanoma tumor types. We pre-screened 400 variables from each data type, using high variance as the criterion. To estimate classification error, we applied random forest classification after each dimension reduction method, with leave-one-out cross-validation. Integrative SIR led to lower classification error than conventional SIR.

To summarize, we proposed integrative SIR, a supervised dimension reduction technique for integrative analysis of multi-omics data. Unlike conventional SDR methods, the new approach reduces the dimensions of multiple omics data types simultaneously while sharing common latent structures across data types, without losing information for prediction. Our numerical study shows that, by efficiently capturing the common information, integrative SIR classifies tumor types more accurately than conventional SDR methods.
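The evaluation procedure described in the abstract (variance pre-screening, followed by random forest classification scored with leave-one-out cross-validation) might look roughly like the sketch below. It uses scikit-learn's `RandomForestClassifier` and `LeaveOneOut`. The helper names `top_variance_columns` and `loocv_error`, the number of trees, and the random seed are assumptions made for illustration and are not taken from the paper. The dimension reduction step itself is omitted here; it would sit between the screening and the classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut

def top_variance_columns(X, k):
    """Indices of the k columns of X with the largest sample variance."""
    variances = np.var(X, axis=0, ddof=1)
    return np.argsort(variances)[::-1][:k]

def loocv_error(features, labels, n_trees=100, seed=0):
    """Leave-one-out misclassification rate of a random forest."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(features):
        # Refit the forest on all samples except the held-out one.
        model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        model.fit(features[train_idx], labels[train_idx])
        predicted = model.predict(features[test_idx])[0]
        errors += int(predicted != labels[test_idx][0])
    return errors / len(labels)
```

For one data type, `X[:, top_variance_columns(X, 400)]` would give the pre-screened matrix. Variance screening does not use the labels, so applying it once to the full data is a common shortcut. A supervised reduction such as SIR does use the labels, so refitting it inside each fold avoids an optimistic error estimate.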