STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics
Jiawen Chen, Muqing Zhou, Wenrong Wu, Jinwei Zhang, Yun Li, Didong Li
arXiv - QuanBio - Genomics, 2024-06-10
DOI: arxiv-2406.06393
Abstract
Recent advances in multi-modal algorithms have driven and been driven by the
increasing availability of large image-text datasets, leading to significant
strides in various fields, including computational pathology. However, in most
existing medical image-text datasets, the text typically provides high-level
summaries that may not sufficiently describe sub-tile regions within a large
pathology image. For example, an image might cover an extensive tissue area
containing cancerous and healthy regions, but the accompanying text might only
specify that this image is a cancer slide, lacking the nuanced details needed
for in-depth analysis. In this study, we introduce STimage-1K4M, a novel
dataset designed to bridge this gap by providing genomic features for sub-tile
images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics
data, which captures gene expression information at the level of individual
spatial spots within a pathology image. Specifically, each image in the dataset
is broken down into smaller sub-image tiles, with each tile paired with a
15,000-30,000 dimensional gene expression vector. With 4,293,195 pairs of sub-tile
images and gene expressions, STimage-1K4M offers unprecedented granularity,
paving the way for a wide range of advanced research in multi-modal data
analysis and innovative applications in computational pathology and beyond.
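
The tile-to-expression pairing described above can be illustrated with a minimal sketch. This is not the authors' pipeline or the dataset's actual file layout: the function name `extract_tile_pairs`, the pixel-coordinate array `spot_xy`, the tile size, and the random toy data are all assumptions made for illustration. It only shows the general idea of cropping a tile around each spatial spot and pairing it with that spot's gene expression vector.

```python
import numpy as np

def extract_tile_pairs(image, spot_xy, expression, tile_size=32):
    """Pair a square crop around each spot with that spot's expression vector.

    image:      (H, W, 3) array, a stand-in for a pathology image
    spot_xy:    (n_spots, 2) integer pixel coordinates (x, y), hypothetical
    expression: (n_spots, n_genes) expression matrix, hypothetical
    """
    half = tile_size // 2
    height, width = image.shape[:2]
    pairs = []
    for (x, y), expr in zip(spot_xy, expression):
        # Skip spots whose tile would extend past the image border.
        if x - half < 0 or y - half < 0 or x + half > width or y + half > height:
            continue
        tile = image[y - half:y + half, x - half:x + half]
        pairs.append((tile, expr))
    return pairs

# Toy stand-ins; real images and expression matrices are far larger.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
spot_xy = np.array([[40, 40], [128, 200], [250, 10]])  # last spot is too close to the border
expression = rng.poisson(1.0, size=(3, 15000))

pairs = extract_tile_pairs(image, spot_xy, expression)
```

Each element of `pairs` is an (image tile, expression vector) tuple, which is the unit of data the abstract counts when it reports 4,293,195 pairs.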