Panoptic segmentation-based semantic embedding matching model for scene graph generation

Pattern Recognition Letters · IF 3.3 · CAS Tier 3, JCR Q2 (Computer Science, Artificial Intelligence) · Volume 193, Pages 56-63 · Pub Date: 2025-07-01 · Epub Date: 2025-04-19 · DOI: 10.1016/j.patrec.2025.04.005
Ming Zhao, Jing Zhang
Citations: 0

Abstract

Scene Graph Generation aims to construct a structured representation of entities and their relationships in an image. Traditional methods use object detection for entity localization but struggle with relationship modeling in complex scenes. Most approaches also face challenges in predicate classification due to inter-class similarity and intra-class variability. Additionally, when multiple entities are present in an image, the contextual information between them is crucial. To address these challenges, this paper proposes a Panoptic Segmentation-based Semantic Embedding Matching Network, which optimizes the entire pipeline from entity localization to entity-pair and predicate prediction. Specifically, we use a panoptic segmentation module to locate all entities (both foreground and background), providing comprehensive support for predicate prediction in complex scenes. Simultaneously, a semantic embedding module is introduced to fuse the visual and semantic features of entities and predicates, respectively, constructing a similarity-based matching mechanism. Furthermore, we incorporate a graph attention network before the semantic embedding of entities, effectively capturing contextual information among multiple entities and dynamically adjusting the semantic embedding module. Experiments on the PSG dataset validate the proposed method's effectiveness. The results show that our model outperforms existing methods in relationship detection and generation in complex scenes.
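The similarity-based matching mechanism described in the abstract can be sketched as follows. This is a minimal illustration only: the fusion weight `alpha`, the additive fusion form, and the function names are assumptions for exposition, not the paper's exact formulation, which fuses visual and semantic features and then scores candidates by similarity against class embeddings.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def similarity_matching(visual_feat, semantic_feat, class_embeds, alpha=0.5):
    """Fuse an entity's (or predicate's) visual and semantic features, then
    score each candidate class by cosine similarity against its embedding.

    visual_feat, semantic_feat: (d,) feature vectors for one entity/predicate
    class_embeds: (num_classes, d) embedding per candidate class
    alpha: assumed fusion weight between the two modalities
    Returns a (num_classes,) vector of cosine similarities; the predicted
    class is the argmax.
    """
    fused = l2norm(alpha * visual_feat + (1.0 - alpha) * semantic_feat)
    class_embeds = l2norm(class_embeds)
    return fused @ class_embeds.T
```

In practice the class embeddings would come from a learned or pretrained word-embedding table, and the similarities would feed a softmax for training; the sketch shows only the matching step.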
Source Journal

Pattern Recognition Letters (Engineering & Technology, Computer Science: Artificial Intelligence)
CiteScore: 12.40
Self-citation rate: 5.90%
Annual articles: 287
Review time: 9.1 months
Journal description: Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.