Non-local Attention Improves Description Generation for Retinal Images

Jia-Hong Huang, Ting-Wei Wu, C. Yang, Zenglin Shi, I-Hung Lin, J. Tegnér, M. Worring
DOI: 10.1109/WACV51458.2022.00331
Venue: 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
Publication date: 2022-01-01
Citations: 6

Abstract

Automatically generating medical reports from retinal images is a difficult task in which an algorithm must generate semantically coherent descriptions for a given retinal image. Existing methods mainly rely on the input image to generate descriptions. However, many abstract medical concepts or descriptions cannot be generated from image information alone. In this work, we integrate additional information to help solve this task; we observe that early in the diagnosis process, ophthalmologists usually write down a small set of keywords denoting important information. These keywords are subsequently used to aid the later creation of medical reports for a patient. Since these keywords commonly exist and are useful for generating medical reports, we incorporate them into automatic report generation. Since we have two types of inputs - expert-defined unordered keywords and images - effectively fusing features from these different modalities is challenging. To that end, we propose a new keyword-driven medical report generation method based on a non-local attention-based multi-modal feature fusion approach, TransFuser, which is capable of fusing features from different types of inputs based on such attention. Our experiments show the proposed method successfully captures the mutual information of keywords and image content. We further show our proposed keyword-driven generation model reinforced by the TransFuser is superior to baselines under the popular text evaluation metrics BLEU, CIDEr, and ROUGE. TransFuser GitHub: https://github.com/Jhhuangkay/Non-local-Attention-ImprovesDescription-Generation-for-Retinal-Images.
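The abstract describes fusing two modalities - unordered expert keywords and image features - via non-local attention. The paper's actual TransFuser architecture is not reproduced here, but the underlying mechanism, cross-modal scaled dot-product attention in which image region features attend over keyword embeddings, can be sketched as follows. All function names, dimensions, and the residual-fusion choice below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_fuse(image_feats, keyword_feats):
    """Cross-modal non-local attention sketch: each image region
    attends over all keyword embeddings (order-invariant, so it
    suits unordered keyword sets), then the attended keyword
    context is added back to the region features.

    image_feats:   (R, d) array, R image regions
    keyword_feats: (K, d) array, K keyword embeddings
    returns:       (R, d) fused features
    """
    d = keyword_feats.shape[-1]
    scores = image_feats @ keyword_feats.T / np.sqrt(d)  # (R, K)
    attn = softmax(scores, axis=-1)                      # rows sum to 1
    context = attn @ keyword_feats                       # (R, d)
    return image_feats + context                         # residual fusion

# Toy example: 4 image regions, 3 keywords, 8-dim features.
rng = np.random.default_rng(0)
img = rng.standard_normal((4, 8))
kw = rng.standard_normal((3, 8))
fused = nonlocal_fuse(img, kw)
assert fused.shape == (4, 8)
```

Because the attention weights are computed over the whole keyword set at once, permuting the keywords leaves the fused output unchanged, which matches the abstract's emphasis on handling *unordered* keywords.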