Francisco Carrillo-Perez, Marija Pizurica, Yuanning Zheng, Tarak Nath Nandi, Ravi Madduri, Jeanne Shen, Olivier Gevaert
Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models.
Nature Biomedical Engineering (Journal Article), published 2024-03-21
DOI: 10.1038/s41551-024-01193-8 (https://doi.org/10.1038/s41551-024-01193-8)
Citations: 0
Abstract
Training machine-learning models with synthetically generated data can alleviate the problem of data scarcity when acquiring diverse and sufficiently large datasets is costly and challenging. Here we show that cascaded diffusion models can be used to synthesize realistic whole-slide image tiles from latent representations of RNA-sequencing data from human tumours. Alterations in gene expression affected the composition of cell types in the generated synthetic image tiles, which accurately preserved the distribution of cell types and maintained the cell fraction observed in bulk RNA-sequencing data, as we show for lung adenocarcinoma, kidney renal papillary cell carcinoma, cervical squamous cell carcinoma, colon adenocarcinoma and glioblastoma. Machine-learning models pretrained with the generated synthetic data performed better than models trained from scratch. Synthetic data may accelerate the development of machine-learning models in scarce-data settings and allow for the imputation of missing data modalities.
About the journal:
Nature Biomedical Engineering is an online-only monthly journal launched in January 2017. It aims to publish original research, reviews, and commentary on applied biomedicine and health technology. Its target audience includes life scientists who develop experimental or computational systems and methods to improve our understanding of human physiology; biomedical researchers and engineers who design or optimize therapies, assays, devices, or procedures for diagnosing or treating diseases; and clinicians who use research outputs to evaluate patient health or administer therapy in various clinical settings and healthcare contexts.