Explainability for Medical Image Captioning
D. Beddiar, Mourad Oussalah, T. Seppänen
2022 Eleventh International Conference on Image Processing Theory, Tools and Applications (IPTA), published 2022-04-19
DOI: 10.1109/IPTA54936.2022.9784146
https://doi.org/10.1109/IPTA54936.2022.9784146

Medical image captioning is the task of generating clinically meaningful descriptions of medical images; its most common application is medical report generation. Automatic captioning of medical images is of great interest to medical experts because it can assist in diagnosis and disease treatment and automate parts of health practitioners' workflow. Despite considerable recent effort to obtain accurate descriptions, medical image captioning models still produce weak and incorrect descriptions. To alleviate this issue, it is important to explain why a model produced a particular caption on the basis of specific features. This is the aim of explainable artificial intelligence (XAI), which seeks to open up the ‘black-box’ nature of deep-learning-based models. In this paper we present an explainability module for medical image captioning that provides a sound interpretation of our attention-based encoder-decoder model by explaining the correspondence between visual features and semantic features. To this end, we use self-attention to compute the importance of each word among the semantic features, and visual attention to identify the image regions that correspond to each generated word of the caption. We also visualize the visual features extracted at each layer of the Convolutional Neural Network (CNN) encoder. Finally, we evaluate our model on the ImageCLEF medical captioning dataset.
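
The idea of relating each generated word to image regions through visual attention can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the attention formulation, so scaled dot-product attention is assumed here, and the function name `visual_attention_maps`, the feature dimensions, and the random inputs are all hypothetical stand-ins for real CNN region features and decoder states.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def visual_attention_maps(region_feats, word_states, grid_shape):
    """Return one attention heatmap per generated word (assumed formulation).

    region_feats: (R, D) array, one feature vector per image region.
    word_states:  (T, D) array, one decoder state per generated word.
    grid_shape:   (H, W) with H * W == R, the spatial layout of the regions.
    """
    d = region_feats.shape[1]
    # Similarity of every word state to every region, scaled by sqrt(D).
    scores = word_states @ region_feats.T / np.sqrt(d)   # (T, R)
    # Normalize over regions so each word's weights sum to 1.
    weights = softmax(scores, axis=1)
    # Reshape each row back onto the spatial grid for display as a heatmap.
    return weights.reshape(len(word_states), *grid_shape)

# Random placeholders, not real data: a 7x7 feature grid and five words.
rng = np.random.default_rng(0)
region_feats = rng.normal(size=(49, 64))
word_states = rng.normal(size=(5, 64))
maps = visual_attention_maps(region_feats, word_states, (7, 7))
print(maps.shape)
```

Each slice `maps[t]` is a non-negative grid that sums to 1 and could be upsampled and overlaid on the input image to show which regions received the most weight when word `t` was generated.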