Self-Aligning Multi-Modal Transformer for Oropharyngeal Swab Point Localization

IF 6.6, Zone 1 (Computer Science), Q1 Multidisciplinary, Tsinghua Science and Technology, Pub Date: 2024-02-09, DOI: 10.26599/TST.2023.9010070

Tianyu Liu;Fuchun Sun

Tsinghua Science and Technology, vol. 29, no. 4, pp. 1082-1091. Full text: https://ieeexplore.ieee.org/document/10431728/


Abstract

Oropharyngeal swabbing is a pre-diagnostic procedure used to test for various respiratory diseases, including COVID and Influenza A (H1N1). To improve testing efficiency, robots need a real-time, accurate, and robust algorithm for localizing the sampling point. However, current solutions rely heavily on visual input, which is not reliable enough for large-scale deployment. The transformer has significantly improved performance on image-related tasks and challenged the dominance of traditional convolutional neural networks (CNNs) in the image field. Inspired by its success, we propose a novel self-aligning multi-modal transformer (SAMMT) that dynamically attends to different parts of unaligned feature maps, preventing the information loss caused by perspective disparity and simplifying the overall implementation. Unlike existing multi-modal transformers, our attention mechanism works in image space instead of embedding space, which removes the need for a sensor registration process. To facilitate the multi-modal task, we collected an oropharynx localization/segmentation dataset annotated by trained medical personnel. This dataset is open-sourced and can be used for future multi-modal research. Our experiments show that our model improves the performance of the localization task by 4.2% compared to the pure visual model, and reduces the pixel-wise error rate of the segmentation task by 16.7% compared to the CNN baseline.
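
The idea of attending across unaligned feature maps can be illustrated with generic scaled dot-product cross-attention. The sketch below is not the SAMMT architecture, whose details the abstract does not give; it only shows, under simplifying assumptions, how each location in one sensor's feature map can pull information from every location in another sensor's map without a prior registration step. The function name `cross_modal_attention`, the toy feature values, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import math

def cross_modal_attention(query_feats, context_feats):
    """Generic cross-attention between two flattened feature maps.

    query_feats:   list of N feature vectors (one per spatial location, sensor A)
    context_feats: list of M feature vectors (one per spatial location, sensor B)
    The two maps may differ in size and need not be spatially aligned.
    Learned projections are omitted (assumption, for brevity).
    """
    dim = len(query_feats[0])
    attended, all_weights = [], []
    for q in query_feats:
        # Similarity of this query location to every context location.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in context_feats]
        # Numerically stable softmax over context locations.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of context features for this query location.
        attended.append([sum(w * k[d] for w, k in zip(weights, context_feats))
                         for d in range(dim)])
        all_weights.append(weights)
    return attended, all_weights

# Toy data (hypothetical): 2 locations from sensor A, 3 from sensor B.
sensor_a = [[1.0, 0.0], [0.0, 1.0]]
sensor_b = [[0.0, 5.0], [5.0, 0.0], [0.0, 0.0]]
attended, weights = cross_modal_attention(sensor_a, sensor_b)
```

Each row of `weights` is a distribution over the second sensor's locations, so the correspondence between the two views is learned by the attention scores rather than fixed by calibration.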


Source journal

Tsinghua Science and Technology (Computer Science, Information Systems; Computer Science, Software Engineering)

CiteScore: 10.20

Self-citation rate: 10.60%

Articles published: 2340

Journal introduction: Tsinghua Science and Technology (Tsinghua Sci Technol) started publication in 1996. It is an international academic journal sponsored by Tsinghua University and is published bimonthly. The journal aims to present up-to-date scientific achievements in computer science, electronic engineering, and other IT fields. Contributions from all over the world are welcome.
