S2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR

Jialun Pei, Diandian Guo, Jingyang Zhang, Manxi Lin, Yueming Jin, Pheng-Ann Heng
{"title":"S2Former-OR:用于在 OR 中生成场景图的单级双模变换器。","authors":"Jialun Pei, Diandian Guo, Jingyang Zhang, Manxi Lin, Yueming Jin, Pheng-Ann Heng","doi":"10.1109/TMI.2024.3444279","DOIUrl":null,"url":null,"abstract":"<p><p>Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR). However, previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection. This pipeline may potentially compromise the flexibility of learning multimodal representations, consequently constraining the overall effectiveness. In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed, S<sup>2</sup>Former-OR, aimed to complementally leverage multi-view 2D scenes and 3D point clouds for SGG in an end-to-end manner. Concretely, our model embraces a View-Sync Transfusion scheme to encourage multi-view visual information interaction. Concurrently, a Geometry-Visual Cohesion operation is designed to integrate the synergic 2D semantic features into 3D point cloud features. Moreover, based on the augmented feature, we propose a novel relation-sensitive transformer decoder that embeds dynamic entity-pair queries and relational trait priors, which enables the direct prediction of entity-pair relations for graph generation without intermediate steps. Extensive experiments have validated the superior SGG performance and lower computational cost of S<sup>2</sup>Former-OR on 4D-OR benchmark, compared with current OR-SGG methods, e.g., 3 percentage points Precision increase and 24.2M reduction in model parameters. We further compared our method with generic single-stage SGG methods with broader metrics for a comprehensive evaluation, with consistently better performance achieved. Our source code can be made available at: https://github.com/PJLallen/S2Former-OR.</p>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"S<sup>2</sup>Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR.\",\"authors\":\"Jialun Pei, Diandian Guo, Jingyang Zhang, Manxi Lin, Yueming Jin, Pheng-Ann Heng\",\"doi\":\"10.1109/TMI.2024.3444279\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR). However, previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection. This pipeline may potentially compromise the flexibility of learning multimodal representations, consequently constraining the overall effectiveness. In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed, S<sup>2</sup>Former-OR, aimed to complementally leverage multi-view 2D scenes and 3D point clouds for SGG in an end-to-end manner. Concretely, our model embraces a View-Sync Transfusion scheme to encourage multi-view visual information interaction. Concurrently, a Geometry-Visual Cohesion operation is designed to integrate the synergic 2D semantic features into 3D point cloud features. 
Moreover, based on the augmented feature, we propose a novel relation-sensitive transformer decoder that embeds dynamic entity-pair queries and relational trait priors, which enables the direct prediction of entity-pair relations for graph generation without intermediate steps. Extensive experiments have validated the superior SGG performance and lower computational cost of S<sup>2</sup>Former-OR on 4D-OR benchmark, compared with current OR-SGG methods, e.g., 3 percentage points Precision increase and 24.2M reduction in model parameters. We further compared our method with generic single-stage SGG methods with broader metrics for a comprehensive evaluation, with consistently better performance achieved. Our source code can be made available at: https://github.com/PJLallen/S2Former-OR.</p>\",\"PeriodicalId\":94033,\"journal\":{\"name\":\"IEEE transactions on medical imaging\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on medical imaging\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TMI.2024.3444279\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on medical imaging","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TMI.2024.3444279","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR). However, previous works have relied primarily on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes of pose estimation and object detection. This pipeline may compromise the flexibility of learning multimodal representations, consequently constraining overall effectiveness. In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR, which complementarily leverages multi-view 2D scenes and 3D point clouds for SGG in an end-to-end manner. Concretely, our model employs a View-Sync Transfusion scheme to encourage multi-view visual information interaction. Concurrently, a Geometry-Visual Cohesion operation is designed to integrate synergic 2D semantic features into 3D point cloud features. Moreover, based on the augmented features, we propose a novel relation-sensitive transformer decoder that embeds dynamic entity-pair queries and relational trait priors, enabling the direct prediction of entity-pair relations for graph generation without intermediate steps. Extensive experiments validate the superior SGG performance and lower computational cost of S2Former-OR on the 4D-OR benchmark compared with current OR-SGG methods, e.g., a 3-percentage-point increase in Precision and a 24.2M reduction in model parameters. We further compared our method with generic single-stage SGG methods using broader metrics for a comprehensive evaluation, with consistently better performance achieved. Our source code is available at: https://github.com/PJLallen/S2Former-OR.
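To make the described pipeline concrete, below is a minimal PyTorch-style sketch of a single-stage bi-modal forward pass. This is an illustration under assumed module names and dimensions, not the authors' implementation (see the linked repository for that): cross-view self-attention stands in for the View-Sync Transfusion scheme, a concatenate-and-project fusion stands in for Geometry-Visual Cohesion, and learned entity-pair queries are decoded against the fused 2D/3D tokens to predict relation logits directly.

```python
import torch
import torch.nn as nn


class SingleStageORSGG(nn.Module):
    """Minimal sketch of a single-stage bi-modal OR-SGG pipeline.

    All names and sizes here are illustrative assumptions, not the
    S2Former-OR implementation: cross-view attention approximates
    View-Sync Transfusion, a concat-and-project layer approximates
    Geometry-Visual Cohesion, and a transformer decoder with learned
    entity-pair queries predicts relation logits in one shot.
    """

    def __init__(self, d_model=256, num_pair_queries=50, num_relations=14):
        super().__init__()
        # Cross-view attention: tokens from all camera views attend
        # to each other (stand-in for View-Sync Transfusion).
        self.view_sync = nn.MultiheadAttention(d_model, num_heads=8,
                                               batch_first=True)
        # Fuse a pooled 2D semantic summary into each 3D point token
        # (stand-in for Geometry-Visual Cohesion).
        self.geo_vis_fuse = nn.Linear(2 * d_model, d_model)
        # Dynamic entity-pair queries, one per candidate relation triplet.
        self.pair_queries = nn.Embedding(num_pair_queries, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8,
                                                   batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)
        # Relation classification head over decoded pair queries.
        self.rel_head = nn.Linear(d_model, num_relations)

    def forward(self, view_feats, point_feats):
        # view_feats:  (B, V, N2d, C) tokens from V camera views
        # point_feats: (B, N3d, C) tokens from the 3D point cloud
        B, V, N2d, C = view_feats.shape
        tokens2d = view_feats.reshape(B, V * N2d, C)
        tokens2d, _ = self.view_sync(tokens2d, tokens2d, tokens2d)
        # Broadcast a global 2D summary onto every 3D point token.
        global2d = tokens2d.mean(dim=1, keepdim=True)
        global2d = global2d.expand(-1, point_feats.shape[1], -1)
        fused3d = self.geo_vis_fuse(
            torch.cat([point_feats, global2d], dim=-1))
        # Entity-pair queries decode relations directly from the fused
        # bi-modal tokens: no detection or pose-estimation stage.
        memory = torch.cat([tokens2d, fused3d], dim=1)
        queries = self.pair_queries.weight.unsqueeze(0).expand(B, -1, -1)
        decoded = self.decoder(queries, memory)
        return self.rel_head(decoded)  # (B, num_pair_queries, num_relations)


if __name__ == "__main__":
    model = SingleStageORSGG()
    views = torch.randn(2, 6, 49, 256)   # 2 scenes, 6 views, 7x7 tokens each
    points = torch.randn(2, 1024, 256)   # 1024 point-cloud tokens per scene
    print(model(views, points).shape)    # torch.Size([2, 50, 14])
```

The structural point mirrored from the abstract is that relation prediction consumes the fused bi-modal tokens in a single decoder pass, which is what removes the intermediate object-detection and pose-estimation stages of prior multi-stage OR-SGG pipelines.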
