Semantically Rich Local Dataset Generation for Explainable AI in Genomics
Authors: Pedro Barbosa, Rosina Savisaar, Alcides Fonseca
Source: arXiv - QuanBio - Genomics, published 2024-07-03
DOI: arxiv-2407.02984 (https://doi.org/arxiv-2407.02984)
Citation count: 0
Abstract
Black box deep learning models trained on genomic sequences excel at
predicting the outcomes of different gene regulatory mechanisms. Therefore,
interpreting these models may provide novel insights into the underlying
biology, supporting downstream biomedical applications. Because these models
are so complex, interpretable surrogate models can only be built for local
explanations (e.g., a single instance). However, accomplishing this requires
generating a dataset in the neighborhood of the input, which must maintain
syntactic similarity to the original data while introducing semantic
variability in the model's predictions. This task is challenging due to the
complex sequence-to-function relationship of DNA.

We propose using Genetic Programming to generate datasets by evolving
perturbations in sequences that contribute to their semantic diversity. Our
custom, domain-guided individual representation effectively constrains
syntactic similarity, and we provide two alternative fitness functions that
promote diversity at no additional computational cost. Applied to the RNA splicing
domain, our approach quickly achieves good diversity and significantly
outperforms a random baseline in exploring the search space, as shown on a
short, proof-of-concept RNA sequence. Furthermore, we assess its
generalizability and demonstrate scalability to larger sequences, resulting in
an improvement of approximately 30% over the baseline.
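
To make the idea of a local dataset concrete, the following is a minimal sketch of the kind of random baseline the abstract compares against, not the authors' Genetic Programming method. Everything in it is an assumption made for illustration: `toy_model` is a hypothetical stand-in for a black-box predictor (it simply returns GC fraction), syntactic similarity is approximated by a cap on the number of single-nucleotide substitutions (`max_edits`), and `bin_coverage` is one plausible way to score the semantic diversity of the predictions. The abstract does not specify the actual perturbation grammar or fitness functions.

```python
import random

ALPHABET = "ACGU"  # RNA nucleotides


def toy_model(seq: str) -> float:
    """Hypothetical stand-in for a black-box predictor; returns a score in [0, 1].

    This is just the GC fraction of the sequence, NOT a real splicing model.
    """
    return sum(base in "GC" for base in seq) / len(seq)


def perturb(seq: str, rng: random.Random, max_edits: int = 3) -> str:
    """Apply between 1 and max_edits substitutions at distinct positions.

    Capping the number of edits keeps the result syntactically close to seq.
    Requires max_edits <= len(seq).
    """
    chars = list(seq)
    n_edits = rng.randint(1, max_edits)
    for pos in rng.sample(range(len(chars)), n_edits):
        # Always pick a different base so every edit changes the sequence.
        chars[pos] = rng.choice([b for b in ALPHABET if b != chars[pos]])
    return "".join(chars)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def bin_coverage(preds: list[float], n_bins: int = 10) -> float:
    """Fraction of equal-width bins over [0, 1] holding at least one prediction.

    Higher coverage means the dataset spans more of the model's output range.
    """
    occupied = {min(int(p * n_bins), n_bins - 1) for p in preds}
    return len(occupied) / n_bins


def random_neighborhood(seq: str, size: int, seed: int = 0,
                        max_edits: int = 3) -> list[str]:
    """Random baseline: sample `size` perturbed copies of seq."""
    rng = random.Random(seed)
    return [perturb(seq, rng, max_edits) for _ in range(size)]


if __name__ == "__main__":
    original = "GGUAAGUACCUGCAGAUCCA"  # arbitrary example sequence
    dataset = random_neighborhood(original, size=200)
    preds = [toy_model(s) for s in dataset]
    print("max Hamming distance:", max(hamming(original, s) for s in dataset))
    print("bin coverage:", bin_coverage(preds))
```

Under these assumptions, an evolutionary search would replace `random_neighborhood` with a loop that keeps the perturbations which raise a diversity score such as `bin_coverage`, while the edit cap continues to bound how far each sequence drifts from the original.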