Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data

IF 1.8 4区 计算机科学 Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Journal of Classification Pub Date : 2024-02-28 DOI:10.1007/s00357-024-09463-5
Aboubacry Gaye, Abdou Ka Diongue, Seydou Nourou Sylla, Maryam Diarra, Amadou Diallo, Cheikh Talla, Cheikh Loucoubar
{"title":"Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data","authors":"Aboubacry Gaye, Abdou Ka Diongue, Seydou Nourou Sylla, Maryam Diarra, Amadou Diallo, Cheikh Talla, Cheikh Loucoubar","doi":"10.1007/s00357-024-09463-5","DOIUrl":null,"url":null,"abstract":"<p>This work addresses the problem of supervised classification for high-dimensional and highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling and an extension of principal component analysis (PCA) incorporating conditional class moment estimates in the low-dimensional projection. Block partitioning allows us to handle the high correlation of our data by grouping them into blocks where the correlation within the same block is maximized and the correlation between variables in different blocks is minimized. The extended PCA allows us to perform low-dimensional projection and clustering supervised. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), this method shows good clustering and prediction performances. SNPs are a type of genetic variation that represents a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct population origin of an individual and can act in isolation or simultaneously to impact a phenotype. In this regard, the study of the contribution of genetics in infectious disease phenotypes is crucial. The classical statistical models currently used in the field of genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest in the study of complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. Then, we use supervised approaches, in particular, the approach that extends PCA by incorporating conditional class moment estimates in the low-dimensional projection, to identify the determining SNPs in malaria episodes. Experimental results obtained on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach has significantly high accuracy in predicting malaria episodes.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"6 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Classification","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00357-024-09463-5","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

This work addresses the problem of supervised classification for high-dimensional and highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling and an extension of principal component analysis (PCA) incorporating conditional class moment estimates in the low-dimensional projection. Block partitioning allows us to handle the high correlation of our data by grouping them into blocks where the correlation within the same block is maximized and the correlation between variables in different blocks is minimized. The extended PCA allows us to perform low-dimensional projection and clustering supervised. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), this method shows good clustering and prediction performances. SNPs are a type of genetic variation that represents a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct population origin of an individual and can act in isolation or simultaneously to impact a phenotype. In this regard, the study of the contribution of genetics in infectious disease phenotypes is crucial. The classical statistical models currently used in the field of genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest in the study of complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. Then, we use supervised approaches, in particular, the approach that extends PCA by incorporating conditional class moment estimates in the low-dimensional projection, to identify the determining SNPs in malaria episodes. Experimental results obtained on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach has significantly high accuracy in predicting malaria episodes.

Abstract Image

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
高维相关数据的监督分类:基因组数据的应用
这项研究利用相关块和监督降维解决了高维和高度相关数据的监督分类问题。我们提出了一种方法,该方法结合了基于区间图建模的块划分和主成分分析 (PCA) 的扩展,在低维投影中纳入了条件类矩估计。块划分法允许我们通过将数据分组为块来处理数据的高相关性,其中同一块内的相关性最大化,而不同块内变量间的相关性最小化。扩展 PCA 允许我们执行低维投影和聚类监督。将该方法应用于 445 个个体的基因表达数据,这些数据分为两组(患病和未患病),包含 719,656 个单核苷酸多态性(SNPs),结果显示该方法具有良好的聚类和预测性能。单核苷酸多态性(SNPs)是一种遗传变异,它代表了单个脱氧核糖核酸(DNA)结构单元(即核苷酸)的差异。以往的研究表明,SNPs 可用于识别个体的正确种群来源,并可单独或同时对表型产生影响。因此,研究遗传学在传染病表型中的作用至关重要。目前在全基因组关联研究(GWAS)领域使用的经典统计模型在检测哮喘或疟疾等复杂疾病研究中的相关基因方面已显示出其局限性。在本研究中,我们首先研究了一种基于区间图建模的连锁不平衡(LD)区块划分方法,以处理 SNP 之间的高度相关性。然后,我们使用监督方法,特别是通过在低维投影中纳入条件类矩估计来扩展 PCA 的方法,来识别疟疾发病中的决定性 SNPs。在 Dielmo-Ndiop 项目数据集上获得的实验结果表明,线性判别分析(LDA)方法在预测疟疾发作方面具有很高的准确性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Journal of Classification
Journal of Classification 数学-数学跨学科应用
CiteScore
3.60
自引率
5.00%
发文量
16
审稿时长
>12 weeks
期刊介绍: To publish original and valuable papers in the field of classification, numerical taxonomy, multidimensional scaling and other ordination techniques, clustering, tree structures and other network models (with somewhat less emphasis on principal components analysis, factor analysis, and discriminant analysis), as well as associated models and algorithms for fitting them. Articles will support advances in methodology while demonstrating compelling substantive applications. Comprehensive review articles are also acceptable. Contributions will represent disciplines such as statistics, psychology, biology, information retrieval, anthropology, archeology, astronomy, business, chemistry, computer science, economics, engineering, geography, geology, linguistics, marketing, mathematics, medicine, political science, psychiatry, sociology, and soil science.
期刊最新文献
How to Measure the Researcher Impact with the Aid of its Impactable Area: A Concrete Approach Using Distance Geometry Multi-task Support Vector Machine Classifier with Generalized Huber Loss Clustering-Based Oversampling Algorithm for Multi-class Imbalance Learning Combining Semi-supervised Clustering and Classification Under a Generalized Framework Slope Stability Classification Model Based on Single-Valued Neutrosophic Matrix Energy and Its Application Under a Single-Valued Neutrosophic Matrix Scenario
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1