MoViT:为医学图像分析记忆视觉变换器。

Yiqing Shen, Pengfei Guo, Jingpu Wu, Qianqi Huang, Nhat Le, Jinyuan Zhou, Shanshan Jiang, Mathias Unberath
{"title":"MoViT:为医学图像分析记忆视觉变换器。","authors":"Yiqing Shen, Pengfei Guo, Jingpu Wu, Qianqi Huang, Nhat Le, Jinyuan Zhou, Shanshan Jiang, Mathias Unberath","doi":"10.1007/978-3-031-45676-3_21","DOIUrl":null,"url":null,"abstract":"<p><p>The synergy of long-range dependencies from transformers and local representations of image content from convolutional neural networks (CNNs) has led to advanced architectures and increased performance for various medical image analysis tasks due to their complementary benefits. However, compared with CNNs, transformers require considerably more training data, due to a larger number of parameters and an absence of inductive bias. The need for increasingly large datasets continues to be problematic, particularly in the context of medical imaging, where both annotation efforts and data protection result in limited data availability. In this work, inspired by the human decision-making process of correlating new \"evidence\" with previously memorized \"experience\", we propose a Memorizing Vision Transformer (MoViT) to alleviate the need for large-scale datasets to successfully train and deploy transformer-based architectures. MoViT leverages an external memory structure to cache history attention snapshots during the training stage. To prevent overfitting, we incorporate an innovative memory update scheme, attention temporal moving average, to update the stored external memories with the historical moving average. For inference speedup, we design a prototypical attention learning method to distill the external memory into smaller representative subsets. We evaluate our method on a public histology image dataset and an in-house MRI dataset, demonstrating that MoViT applied to varied medical image analysis tasks, can outperform vanilla transformer models across varied data regimes, especially in cases where only a small amount of annotated data is available. More importantly, MoViT can reach a competitive performance of ViT with only 3.0% of the training data. In conclusion, MoViT provides a simple plug-in for transformer architectures which may contribute to reducing the training data needed to achieve acceptable models for a broad range of medical image analysis tasks.</p>","PeriodicalId":74092,"journal":{"name":"Machine learning in medical imaging. MLMI (Workshop)","volume":"14349 ","pages":"205-213"},"PeriodicalIF":0.0000,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11008051/pdf/","citationCount":"0","resultStr":"{\"title\":\"MoViT: Memorizing Vision Transformers for Medical Image Analysis.\",\"authors\":\"Yiqing Shen, Pengfei Guo, Jingpu Wu, Qianqi Huang, Nhat Le, Jinyuan Zhou, Shanshan Jiang, Mathias Unberath\",\"doi\":\"10.1007/978-3-031-45676-3_21\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>The synergy of long-range dependencies from transformers and local representations of image content from convolutional neural networks (CNNs) has led to advanced architectures and increased performance for various medical image analysis tasks due to their complementary benefits. However, compared with CNNs, transformers require considerably more training data, due to a larger number of parameters and an absence of inductive bias. The need for increasingly large datasets continues to be problematic, particularly in the context of medical imaging, where both annotation efforts and data protection result in limited data availability. In this work, inspired by the human decision-making process of correlating new \\\"evidence\\\" with previously memorized \\\"experience\\\", we propose a Memorizing Vision Transformer (MoViT) to alleviate the need for large-scale datasets to successfully train and deploy transformer-based architectures. MoViT leverages an external memory structure to cache history attention snapshots during the training stage. To prevent overfitting, we incorporate an innovative memory update scheme, attention temporal moving average, to update the stored external memories with the historical moving average. For inference speedup, we design a prototypical attention learning method to distill the external memory into smaller representative subsets. We evaluate our method on a public histology image dataset and an in-house MRI dataset, demonstrating that MoViT applied to varied medical image analysis tasks, can outperform vanilla transformer models across varied data regimes, especially in cases where only a small amount of annotated data is available. More importantly, MoViT can reach a competitive performance of ViT with only 3.0% of the training data. In conclusion, MoViT provides a simple plug-in for transformer architectures which may contribute to reducing the training data needed to achieve acceptable models for a broad range of medical image analysis tasks.</p>\",\"PeriodicalId\":74092,\"journal\":{\"name\":\"Machine learning in medical imaging. MLMI (Workshop)\",\"volume\":\"14349 \",\"pages\":\"205-213\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11008051/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Machine learning in medical imaging. MLMI (Workshop)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/978-3-031-45676-3_21\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2023/10/15 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine learning in medical imaging. MLMI (Workshop)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/978-3-031-45676-3_21","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2023/10/15 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

变压器的长程依赖性和卷积神经网络(CNN)对图像内容的局部表征的协同作用,为各种医学图像分析任务提供了先进的架构和更高的性能,因为它们具有互补优势。然而,与卷积神经网络相比,变换器需要更多的训练数据,这是因为变换器需要更多的参数,而且不存在归纳偏差。对越来越大的数据集的需求仍然是个问题,特别是在医学成像领域,注释工作和数据保护都导致数据可用性有限。在这项工作中,受将新 "证据 "与先前记忆的 "经验 "关联起来的人类决策过程的启发,我们提出了记忆视觉转换器(MoViT),以缓解对大规模数据集的需求,从而成功地训练和部署基于转换器的架构。MoViT 利用外部内存结构,在训练阶段缓存历史注意力快照。为防止过度拟合,我们采用了一种创新的内存更新方案--注意力时空移动平均法,用历史移动平均值更新存储的外部内存。为了加快推理速度,我们设计了一种原型注意力学习方法,将外部记忆提炼为更小的代表性子集。我们在一个公共组织学图像数据集和一个内部核磁共振成像数据集上对我们的方法进行了评估,结果表明,将 MoViT 应用于各种医学图像分析任务时,它在各种数据环境下的表现都优于香草变换器模型,尤其是在只有少量注释数据的情况下。更重要的是,MoViT 只需 3.0% 的训练数据就能达到 ViT 的竞争性能。总之,MoViT 为变换器架构提供了一个简单的插件,它可以帮助减少训练数据,从而为广泛的医学图像分析任务建立可接受的模型。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
MoViT: Memorizing Vision Transformers for Medical Image Analysis.

The synergy of long-range dependencies from transformers and local representations of image content from convolutional neural networks (CNNs) has led to advanced architectures and increased performance for various medical image analysis tasks due to their complementary benefits. However, compared with CNNs, transformers require considerably more training data, due to a larger number of parameters and an absence of inductive bias. The need for increasingly large datasets continues to be problematic, particularly in the context of medical imaging, where both annotation efforts and data protection result in limited data availability. In this work, inspired by the human decision-making process of correlating new "evidence" with previously memorized "experience", we propose a Memorizing Vision Transformer (MoViT) to alleviate the need for large-scale datasets to successfully train and deploy transformer-based architectures. MoViT leverages an external memory structure to cache history attention snapshots during the training stage. To prevent overfitting, we incorporate an innovative memory update scheme, attention temporal moving average, to update the stored external memories with the historical moving average. For inference speedup, we design a prototypical attention learning method to distill the external memory into smaller representative subsets. We evaluate our method on a public histology image dataset and an in-house MRI dataset, demonstrating that MoViT applied to varied medical image analysis tasks, can outperform vanilla transformer models across varied data regimes, especially in cases where only a small amount of annotated data is available. More importantly, MoViT can reach a competitive performance of ViT with only 3.0% of the training data. In conclusion, MoViT provides a simple plug-in for transformer architectures which may contribute to reducing the training data needed to achieve acceptable models for a broad range of medical image analysis tasks.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Probabilistic 3D Correspondence Prediction from Sparse Unsegmented Images. Class-Balanced Deep Learning with Adaptive Vector Scaling Loss for Dementia Stage Detection. MoViT: Memorizing Vision Transformers for Medical Image Analysis. Robust Unsupervised Super-Resolution of Infant MRI via Dual-Modal Deep Image Prior. IA-GCN: Interpretable Attention based Graph Convolutional Network for Disease Prediction.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1