Fine-Grained Temporal Site Monitoring in EGD Streams via Visual Time-Aware Embedding and Vision-Text Asymmetric Coworking

IEEE Journal of Biomedical and Health Informatics; IF 6.8; Zone 2 (Medicine); JCR Q1 (Computer Science, Information Systems); Pub Date: 2024-10-30; DOI: 10.1109/JBHI.2024.3488514

Fang Peng;Hongkuan Shi;Shiquan He;Qiang Hu;Ting Li;Fan Huang;Xinxia Feng;Mei Liu;Jiazhi Liao;Qiang Li;Zhiwei Wang

Vol. 29, No. 2, pp. 1308-1319; Publication type: Journal Article; Source: https://ieeexplore.ieee.org/document/10738274/

Citations: 0

Abstract



Fine-Grained Temporal Site Monitoring in EGD Streams via Visual Time-Aware Embedding and Vision-Text Asymmetric Coworking

Esophagogastroduodenoscopy (EGD) requires complete inspection of numerous upper gastrointestinal (UGI) sites for precise cancer screening. Automated temporal site monitoring for EGD assistance is thus in high demand, yet existing online action detection methods often fail when applied directly. The key challenges are two-fold: 1) global camera motion dominates, invalidating the temporal patterns derived from object optical flow, and 2) the UGI sites are fine-grained, yielding highly homogeneous appearances. In this paper, we propose an EGD-customized model, powered by two novel designs, i.e., Visual Time-aware Embedding plus Vision-text Asymmetric Coworking (VTE+VAC), for real-time, accurate, fine-grained UGI site monitoring. Concretely, VTE learns visual embeddings by differentiating frames via classification losses, and at the same time by reordering sampled time-agnostic frames to be temporally coherent via a ranking loss. This joint objective encourages VTE to capture the sequential relation without resorting to the inapplicable object optical flow, and thus to provide time-aware frame-wise embeddings. In the subsequent analysis, VAC uses a temporal sliding window and extracts vision-text multimodal knowledge from each frame and its corresponding textualized prediction via the learned VTE and a frozen BERT. The text embeddings provide more representative cues, but may also mislead the model when predictions are wrong. VAC therefore randomly drops or replaces historical predictions to increase error tolerance and to avoid collapsing onto the last few predictions. Qualitative and quantitative experiments demonstrate that the proposed method outperforms other state-of-the-art methods, with an average F1-score improvement of at least 7.66%.
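The two training ideas named in the abstract, a joint classification-plus-ranking objective for VTE and random dropping or replacing of historical predictions for VAC, can be illustrated with a small sketch. This is not the authors' implementation. The abstract specifies neither the form of the ranking loss nor the corruption scheme, so the pairwise hinge loss, the scalar `order_scores`, the `weight` and `margin` parameters, the `MASK` placeholder, and the probabilities `p_drop` and `p_replace` are all assumptions made for illustration.

```python
import math
import random

def cross_entropy(logits, label):
    """Softmax cross-entropy for one frame (numerically stabilised)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def pairwise_ranking_loss(order_scores, timestamps, margin=1.0):
    """Assumed hinge form: a later frame should score at least `margin` higher."""
    total, pairs = 0.0, 0
    n = len(order_scores)
    for i in range(n):
        for j in range(n):
            if timestamps[i] < timestamps[j]:
                gap = order_scores[j] - order_scores[i]
                total += max(0.0, margin - gap)
                pairs += 1
    return total / pairs if pairs else 0.0

def joint_loss(logits_batch, labels, order_scores, timestamps, weight=1.0):
    """Mean classification loss plus a weighted ranking term (weight is hypothetical)."""
    ce = sum(cross_entropy(l, y) for l, y in zip(logits_batch, labels)) / len(labels)
    return ce + weight * pairwise_ranking_loss(order_scores, timestamps)

MASK = "[MASK]"  # hypothetical placeholder standing in for a dropped prediction

def corrupt_history(history, site_names, p_drop=0.1, p_replace=0.1, rng=random):
    """Randomly drop or replace textualized past predictions in a window.

    Dropping is modelled as substituting MASK so the window length is kept;
    replacing picks a different site name uniformly. Both are assumptions.
    """
    out = []
    for pred in history:
        r = rng.random()
        if r < p_drop:
            out.append(MASK)
        elif r < p_drop + p_replace:
            alternatives = [s for s in site_names if s != pred]
            out.append(rng.choice(alternatives) if alternatives else pred)
        else:
            out.append(pred)
    return out
```

In a training loop under these assumptions, `corrupt_history` would be applied to the window of past predictions before they are text-encoded, so the model cannot rely on the most recent predictions being correct. `joint_loss` would be computed on frames sampled without regard to their original order, with `timestamps` supplying the true order for the ranking term.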


Source journal

IEEE Journal of Biomedical and Health Informatics (Computer Science, Information Systems; Computer Science, Interdisciplinary Applications)

CiteScore: 13.60

Self-citation rate: 6.50%

Articles published: 1151

About the journal: IEEE Journal of Biomedical and Health Informatics publishes original papers presenting recent advances where information and communication technologies intersect with health, healthcare, life sciences, and biomedicine. Topics include acquisition, transmission, storage, retrieval, management, and analysis of biomedical and health information. The journal covers applications of information technologies in healthcare, patient monitoring, preventive care, early disease diagnosis, therapy discovery, and personalized treatment protocols. It explores electronic medical and health records, clinical information systems, decision support systems, medical and biological imaging informatics, wearable systems, body area/sensor networks, and more. Integration-related topics like interoperability, evidence-based medicine, and secure patient data are also addressed.