On potential limitations of differential expression analysis with non-linear machine learning models

EMBnet.journal, Pub Date: 2023-03-08, DOI: 10.14806/ej.28.0.1035

G. Sabbatini, L. Manganaro


Citations: 1

Abstract

Recently, interest has grown in bioinformatics in adopting increasingly complex machine learning models for the analysis of next-generation sequencing data, with the goal of disease subtyping (i.e., patient stratification based on molecular features) or risk-based classification for specific endpoints, such as survival. With gene-expression data, a common approach is to characterise the emerging groups through differential expression analysis, which selects relevant gene sets, coupled with pathway enrichment analysis, which provides insight into the underlying biological processes. However, when non-linear machine learning models are involved, differential expression analysis could be limiting: the patient groupings identified by the model could be based on a set of genes that are hidden from differential expression analysis because of its linear nature, which affects subsequent biological characterisation and validation. The aim of this study is to provide a proof-of-concept example demonstrating this limitation. Moreover, we suggest that the issue could be overcome by adopting the innovative paradigm of eXplainable Artificial Intelligence, which consists of building an additional explainer to obtain a trustworthy interpretation of the model outputs and a reliable set of genes characterising each group, also preserving non-linear relations, to be used for downstream analysis and validation.
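The limitation described in the abstract can be illustrated with a toy simulation. What follows is a minimal sketch under stated assumptions, not the authors' proof of concept: the three simulated genes, the XOR-style grouping rule, the k-nearest-neighbour classifier and the permutation-importance explainer are all hypothetical stand-ins chosen for brevity. The groups depend jointly on the signs of genes 0 and 1, so neither gene differs in mean expression between groups, and a per-gene Welch t-statistic is expected to stay near zero. Shuffling either gene, however, should clearly reduce the accuracy of the non-linear classifier.

```python
import math
import random
import statistics


def simulate(n, rng):
    """Simulate n samples of three hypothetical genes (gene 2 is pure noise).

    The group label is an XOR of the signs of genes 0 and 1 (an assumed rule).
    """
    X, y = [], []
    for _ in range(n):
        genes = [rng.gauss(0, 1) for _ in range(3)]
        X.append(genes)
        y.append(int((genes[0] > 0) != (genes[1] > 0)))
    return X, y


def welch_t(x0, x1):
    """Welch t-statistic for the mean difference of one gene between two groups."""
    se = math.sqrt(statistics.variance(x0) / len(x0) + statistics.variance(x1) / len(x1))
    return (statistics.fmean(x1) - statistics.fmean(x0)) / se


def knn_predict(train_X, train_y, X, k=7):
    """Majority vote among the k nearest training samples (a simple non-linear model)."""
    preds = []
    for row in X:
        nearest = sorted(range(len(train_X)), key=lambda i: math.dist(row, train_X[i]))[:k]
        votes = sum(train_y[i] for i in nearest)
        preds.append(int(2 * votes > k))
    return preds


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(train_X, train_y, X, y, gene, base, rng, repeats=3):
    """Mean drop in test accuracy after shuffling one gene across the test samples."""
    drops = []
    for _ in range(repeats):
        column = [row[gene] for row in X]
        rng.shuffle(column)
        shuffled = [row[:gene] + [v] + row[gene + 1:] for row, v in zip(X, column)]
        drops.append(base - accuracy(y, knn_predict(train_X, train_y, shuffled)))
    return statistics.fmean(drops)


rng = random.Random(0)
train_X, train_y = simulate(300, rng)
test_X, test_y = simulate(200, rng)

# Per-gene test for a mean difference between groups, as in a linear DE analysis.
t_stats = [
    welch_t([r[g] for r, lab in zip(train_X, train_y) if lab == 0],
            [r[g] for r, lab in zip(train_X, train_y) if lab == 1])
    for g in range(3)
]

# Model-agnostic explanation of the non-linear classifier.
base_acc = accuracy(test_y, knn_predict(train_X, train_y, test_X))
importances = [
    permutation_importance(train_X, train_y, test_X, test_y, g, base_acc, rng)
    for g in range(3)
]

for g in range(3):
    print(f"gene {g}: Welch t = {t_stats[g]:+.2f}, "
          f"permutation importance = {importances[g]:+.3f}")
```

In this sketch the t-statistics do not separate the two informative genes from the noise gene, whereas the permutation importances should. Real differential expression pipelines use more elaborate statistical models than a Welch test, and an explainer for real data would need its own validation, so the example only conveys the shape of the argument.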

