
Latest Publications in IEEE Transactions on Multimedia

Spatial-Temporal Saliency Guided Unbiased Contrastive Learning for Video Scene Graph Generation
IF 8.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557688
Weijun Zhuang;Bowen Dong;Zhilin Zhu;Zhijun Li;Jie Liu;Yaowei Wang;Xiaopeng Hong;Xin Li;Wangmeng Zuo
Accurately detecting objects and their interrelationships for Video Scene Graph Generation (VidSGG) confronts two primary challenges. The first involves identifying active objects interacting with humans among the numerous background objects, while the second is the long-tailed distribution among predicate classes. To tackle these challenges, we propose STABILE, a novel framework with a spatial-temporal saliency-guided contrastive learning scheme. For the first challenge, STABILE features an active object retriever that includes an object saliency fusion block, which enhances object embeddings with motion cues, alongside an object temporal encoder that captures temporal dependencies. For the second challenge, STABILE introduces an unbiased relationship representation learning module with an Unbiased Multi-Label (UML) contrastive loss to mitigate the effect of the long-tailed distribution. With the enhancements in both aspects, STABILE substantially boosts the accuracy of scene graph generation. Extensive experiments demonstrate the superiority of STABILE, which sets new benchmarks in the field by offering enhanced accuracy and unbiased scene graph generation.
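The abstract does not give the exact form of the UML contrastive loss, so the following is only a minimal PyTorch sketch of the general idea it describes: a multi-label supervised contrastive loss whose per-sample weights come from inverse predicate frequencies, so that tail predicates contribute more. The function name, weighting rule, and tensor shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def unbiased_multilabel_contrastive_loss(embeddings, labels, class_freq, temperature=0.07):
    # embeddings: (N, D) relationship features; labels: (N, C) multi-hot predicate labels
    # class_freq: (C,) predicate counts in the training set (assumed available)
    z = F.normalize(embeddings, dim=1)
    n = z.size(0)
    sim = z @ z.t() / temperature
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))            # drop self-pairs
    # samples sharing at least one predicate label are treated as positives
    pos_mask = ((labels.float() @ labels.float().t()) > 0) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(~pos_mask, 0.0)            # keep positive terms only
    pos_cnt = pos_mask.float().sum(1).clamp(min=1)
    per_sample = -log_prob.sum(1) / pos_cnt
    # inverse-frequency weighting so tail predicates dominate the objective
    inv_freq = 1.0 / class_freq.float().clamp(min=1)
    weight = (labels.float() * inv_freq).sum(1) / labels.float().sum(1).clamp(min=1)
    return (weight * per_sample).sum() / weight.sum().clamp(min=1e-12)
```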
Citations: 0
Leveraging Concise Concepts With Probabilistic Modeling for Interpretable Visual Recognition
IF 8.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557677
Yixuan Zhang;Chuanbin Liu;Yizhi Liu;Yifan Gao;Zhiying Lu;Hongtao Xie;Yongdong Zhang
Interpretable visual recognition is essential for decision-making in high-stakes situations. Recent advancements have automated the construction of interpretable models by leveraging Visual Language Models (VLMs) and Large Language Models (LLMs) with Concept Bottleneck Models (CBMs), which introduce a bottleneck layer associated with human-understandable concepts. However, existing methods suffer from two main problems: a) the concepts collected from LLMs can be redundant with task-irrelevant descriptions, resulting in an inferior concept space with potential mismatches; b) having VLMs directly map global, deterministic image embeddings to fine-grained concepts yields an ambiguous process with imprecise mapping results. To address these two issues, we propose a novel solution for CBMs with Concise Concept and Probabilistic Modeling (CCPM) that achieves superior classification performance via high-quality concepts and a precise mapping strategy. First, we leverage in-context examples as category-related clues to guide the LLM concept generation process. To mitigate redundancy in the concept space, we propose a Relation-Aware Selection (RAS) module that obtains a concise concept set that is discriminative and relevant, based on image-concept and inter-concept relationships. Second, for precise mapping, we employ a Probabilistic Distribution Adapter (PDA) that estimates the inherent ambiguity of the image embeddings of pre-trained VLMs to capture their complex relationships with concepts. Extensive experiments indicate that our model achieves state-of-the-art results, with a 6.18% improvement in classification accuracy on eight mainstream recognition benchmarks, as well as reliable explainability through interpretable analysis.
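As a rough illustration of the probabilistic mapping idea, the sketch below (not the authors' code; module names, dimensions, and the Monte-Carlo scoring are assumptions) turns a deterministic image embedding into a Gaussian with learned mean and variance, samples from it, and scores each concept by the averaged cosine similarity, which could then feed a standard CBM bottleneck classifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticDistributionAdapter(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.mu_head = nn.Linear(dim, dim)       # mean of the image-embedding Gaussian
        self.logvar_head = nn.Linear(dim, dim)   # log-variance, capturing ambiguity

    def forward(self, img_emb, concept_emb, n_samples=8):
        # img_emb: (B, D) deterministic VLM image embeddings
        # concept_emb: (K, D) text embeddings of the selected concise concepts
        mu = self.mu_head(img_emb)
        logvar = self.logvar_head(img_emb)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn(n_samples, *mu.shape, device=mu.device)
        samples = F.normalize(mu.unsqueeze(0) + eps * std.unsqueeze(0), dim=-1)  # (S, B, D)
        concepts = F.normalize(concept_emb, dim=-1)
        # concept activations for the bottleneck layer, averaged over the samples
        scores = torch.einsum('sbd,kd->sbk', samples, concepts).mean(dim=0)      # (B, K)
        return scores, mu, logvar
```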
Citations: 0
Adversarial Geometric Attacks for 3D Point Cloud Object Tracking
IF 8.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557613
Rui Yao;Anqi Zhang;Yong Zhou;Jiaqi Zhao;Bing Liu;Abdulmotaleb El Saddik
3D point cloud object tracking (3D PCOT) plays a vital role in applications such as autonomous driving and robotics. Adversarial attacks offer a promising approach to enhancing the robustness and security of tracking models. However, existing adversarial attack methods for 3D PCOT seldom leverage the geometric structure of point clouds and often overlook the transferability of attack strategies. To address these limitations, this paper proposes an adversarial geometric attack method tailored for 3D PCOT, which includes a point perturbation attack module (non-isometric transformation) and a rotation attack module (isometric transformation). First, we introduce a curvature-aware point perturbation attack module that enhances local transformations by applying normal perturbations to critical points identified through geometric features such as curvature and entropy. Second, we design a Thompson-sampling-based rotation attack module that applies subtle global rotations to the point cloud, introducing tracking errors while maintaining imperceptibility. Additionally, we design a fused loss function to iteratively optimize the point cloud within the search region, generating adversarially perturbed samples. The proposed method is evaluated on multiple 3D PCOT models and validated through black-box tracking experiments on benchmarks. For P2B, white-box attacks on KITTI reduce the success rate from 53.3% to 29.6% and precision from 68.4% to 37.1%. On NuScenes, the success rate drops from 39.0% to 27.6%, and precision from 39.9% to 26.8%. Black-box attacks also transfer across models, with BAT showing maximum drops of 47.0% in success rate and 47.2% in precision on KITTI, and of 22.5% and 27.0% on NuScenes.
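A minimal sketch of what a curvature-aware perturbation might look like, assuming curvature and normals are estimated from a k-NN PCA of each point's local neighborhood; the paper additionally uses entropy and a fused loss, which are omitted here, and all parameter values are illustrative.

```python
import torch

def curvature_aware_perturb(points, k=16, top_ratio=0.1, epsilon=0.02):
    # points: (N, 3) point cloud of the tracking search region, with N > k assumed
    N = points.size(0)
    d = torch.cdist(points, points)                         # (N, N) pairwise distances
    knn_idx = d.topk(k + 1, largest=False).indices[:, 1:]   # k nearest neighbors, excluding self
    neigh = points[knn_idx]                                  # (N, k, 3)
    centered = neigh - neigh.mean(dim=1, keepdim=True)
    cov = centered.transpose(1, 2) @ centered / k            # (N, 3, 3) local covariance
    evals, evecs = torch.linalg.eigh(cov)                    # eigenvalues in ascending order
    normals = evecs[:, :, 0]                                 # eigenvector of smallest eigenvalue
    curvature = evals[:, 0] / evals.sum(dim=1).clamp(min=1e-12)
    n_crit = max(1, int(top_ratio * N))
    crit = curvature.topk(n_crit).indices                    # highest-curvature critical points
    perturbed = points.clone()
    perturbed[crit] += epsilon * normals[crit]               # push critical points along normals
    return perturbed, crit
```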
Citations: 0
ICE: Interactive 3D Game Character Facial Editing via Dialogue
IF 8.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557611
Haoqian Wu;Minda Zhao;Zhipeng Hu;Changjie Fan;Lincheng Li;Weijie Chen;Rui Zhao;Xin Yu
Most recent popular Role-Playing Games (RPGs) allow players to create in-game characters with hundreds of adjustable parameters, including bone positions and various makeup options. Although text-driven auto-customization systems have been developed to simplify the complex process of adjusting these intricate character parameters, they are limited by their single-round generation and lack the capability for further editing and fine-tuning. In this paper, we propose an Interactive Character Editing framework (ICE) to achieve a multi-round dialogue-based refinement process. In a nutshell, our ICE offers a more user-friendly way to enable players to convey creative ideas iteratively while ensuring that created characters align with the expectations of players. Specifically, we propose an Instruction Parsing Module (IPM) that utilizes large language models (LLMs) to parse multi-round dialogues into clear editing instruction prompts in each round. To reliably and swiftly modify character control parameters at a fine-grained level, we propose a Semantic-guided Low-dimension Parameter Solver (SLPS) that edits character control parameters according to prompts in a zero-shot manner. Our SLPS first localizes the character control parameters related to the fine-grained modification, and then optimizes the corresponding parameters in a low-dimension space to avoid unrealistic results. Extensive experimental results demonstrate the effectiveness of our proposed ICE for in-game character creation and the superior editing performance of ICE.
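The following is a heavily simplified sketch of the low-dimension optimization idea behind SLPS, under the assumption that a fixed decoder maps a low-dimensional latent to the full set of character control parameters and that a differentiable score function stands in for the paper's semantic guidance; all names are placeholders, and the paper instead localizes and optimizes the relevant parameters directly.

```python
import torch
import torch.nn as nn

def solve_low_dim(decoder: nn.Module, score_fn, instruction_emb,
                  latent_dim=32, steps=200, lr=0.05):
    # decoder: fixed mapping from a low-dim latent to full character control parameters
    # score_fn(params, instruction_emb): differentiable agreement with the parsed instruction
    z = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        params = decoder(z)                         # low-dim latent -> control parameters
        loss = -score_fn(params, instruction_emb)   # maximize agreement with the instruction
        opt.zero_grad()
        loss.backward()
        opt.step()
    return decoder(z).detach()                      # edited character parameters
```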
Citations: 0
Cross-Modality Prompts: Few-Shot Multi-Label Recognition With Single-Label Training
IF 8.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557700
Zixuan Ding;Zihan Zhou;Hui Chen;Tianxiang Hao;Yizhe Xiong;Sicheng Zhao;Qiang Zhang;Jungong Han
Few-shot multi-label recognition (FS-MLR) presents a significant challenge due to the need to assign multiple labels to images with limited examples. Existing methods often struggle to balance the learning of novel classes and the retention of knowledge from base classes. To address this issue, we propose a novel Cross-Modality Prompts (CMP) approach. Unlike conventional methods that rely on additional semantic information to mitigate the impact of limited samples, our approach leverages multimodal prompts to adaptively tune the feature extraction network. A new FS-MLR benchmark is also proposed, which includes single-label training and multi-label testing, accompanied by benchmark datasets constructed from MS-COCO and NUS-WIDE. Extensive experiments on these datasets demonstrate the superior performance of our CMP approach, highlighting its effectiveness and adaptability. Our results show that CMP outperforms CoOp on the MS-COCO dataset, with maximal improvements of 19.47% and 23.94% in mAP_harmonic for the 5-way 1-shot and 5-way 5-shot settings, respectively.
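A minimal sketch of cross-modality prompt tuning in the spirit of CMP: learnable context vectors are prepended to both the text and visual token sequences of a frozen CLIP-like backbone, only these prompts are trained, and per-class sigmoids give the multi-label scores. The encoder interfaces, dimensions, and temperature are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalityPrompts(nn.Module):
    def __init__(self, text_encoder, image_encoder, class_token_embs, n_ctx=4, dim=512):
        super().__init__()
        self.text_encoder = text_encoder            # frozen text transformer (placeholder)
        self.image_encoder = image_encoder          # frozen visual transformer (placeholder)
        self.register_buffer('class_token_embs', class_token_embs)  # (C, L, dim) class-name tokens
        self.text_ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.visual_ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, patch_embs):
        # patch_embs: (B, P, dim) visual tokens from the frozen backbone stem
        B = patch_embs.size(0)
        v = torch.cat([self.visual_ctx.expand(B, -1, -1), patch_embs], dim=1)
        img_feat = F.normalize(self.image_encoder(v), dim=-1)            # (B, dim)
        C = self.class_token_embs.size(0)
        t = torch.cat([self.text_ctx.expand(C, -1, -1), self.class_token_embs], dim=1)
        txt_feat = F.normalize(self.text_encoder(t), dim=-1)             # (C, dim)
        logits = img_feat @ txt_feat.t() / 0.07
        return torch.sigmoid(logits)                # independent per-class probabilities
```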
Citations: 0
Segmenting Anything in the Dark via Depth Perception
IF 8.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557612
Peng Liu;Jinhong Deng;Lixin Duan;Wen Li;Fengmao Lv
Image segmentation under low-light conditions is essential in real-world applications, such as autonomous driving and video surveillance systems. The recent Segment Anything Model (SAM) exhibits strong segmentation capability in various vision applications. However, its performance can be severely degraded under low-light conditions. On the other hand, multimodal information has been exploited to help models construct a more comprehensive understanding of scenes under low-light conditions by providing complementary information (e.g., depth). Therefore, in this work, we present a pioneering attempt that elevates a unimodal vision foundation model (e.g., SAM) to a multimodal one by efficiently integrating additional depth information under low-light conditions. To achieve this, we propose a novel method called Depth Perception SAM (DPSAM) based on the SAM framework. Specifically, we design a modality encoder to extract depth information and Depth Perception Layers (DPLs) for mutual feature refinement between RGB and depth features. The DPLs employ a cross-modal attention mechanism to mutually query effective information from both RGB and depth for subsequent feature refinement. Thus, DPLs can effectively leverage the complementary information from depth to enrich the RGB representations and obtain comprehensive multimodal visual representations for segmenting anything in the dark. To this end, our DPSAM maximally preserves SAM's inherent expertise in RGB image segmentation and further leverages the strength of depth for an enhanced segment-anything capability, especially in cases that are likely to fail with RGB alone (e.g., low light or complex textures). As demonstrated by extensive experiments on four RGBD benchmark datasets, DPSAM clearly improves segment-anything performance in the dark, e.g., by +12.90% mIoU and +16.23% mIoU on LLRGBD and DeLiVER, respectively.
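A minimal sketch of a cross-modal attention block in the spirit of the Depth Perception Layers: RGB tokens query depth tokens and vice versa, and each stream is refined with a residual update. Layer sizes, normalization placement, and the fusion rule are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class DepthPerceptionLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.rgb_from_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, depth_tokens):
        # rgb_tokens, depth_tokens: (B, N, dim) flattened spatial features of each modality
        r, _ = self.rgb_from_depth(rgb_tokens, depth_tokens, depth_tokens)   # RGB queries depth
        d, _ = self.depth_from_rgb(depth_tokens, rgb_tokens, rgb_tokens)     # depth queries RGB
        rgb_refined = self.norm_rgb(rgb_tokens + r)      # residual cross-modal refinement
        depth_refined = self.norm_depth(depth_tokens + d)
        return rgb_refined, depth_refined
```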
Citations: 0
Multi-Modal Self-Perception Enhanced Large Language Model for 3D Region-of-Interest Captioning With Limited Data
IF 8.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557703
Lu Shi;Shichao Kan;Yi Jin;Linna Zhang;Yigang Cen
3D Region-of-Interest (RoI) captioning involves translating a model's understanding of specific objects within a complex 3D scene into descriptive captions. Recent advancements in Large Language Models (LLMs) have shown great potential in this area. Existing methods capture the visual information from RoIs as input tokens for LLMs. However, this approach may not provide enough detailed information for LLMs to generate accurate region-specific captions. In this paper, we introduce Self-RoI, a Large Language Model with multi-modal self-perception capabilities for 3D RoI captioning. To ensure that LLMs receive more precise and sufficient information, Self-RoI incorporates Implicit Textual Info. Perception to construct multi-modal vision-language information. This module utilizes a simple mapping network to generate textual information about basic properties of the RoI from the vision-following response of LLMs. This textual information is then integrated with the RoI's visual representation to form a comprehensive multi-modal instruction for LLMs. Given the limited availability of 3D RoI-captioning data, we propose a two-stage training strategy to optimize Self-RoI efficiently. In the first stage, we align 3D RoI vision and caption representations. In the second stage, we focus on 3D RoI vision-caption interaction, using a disparate contrastive embedding module to improve the reliability of the implicit textual information and employing a language modeling loss to ensure accurate caption generation. Our experiments demonstrate that Self-RoI significantly outperforms previous 3D RoI captioning models. Moreover, Implicit Textual Info. Perception can be integrated into other multi-modal LLMs for performance enhancement. We will make our code available for further research.
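As a rough illustration of the mapping-network idea, the sketch below maps a pooled RoI feature to a short textual description of a few basic properties that could be spliced into the LLM instruction; the property vocabularies, dimensions, and module names are invented placeholders, since the abstract does not specify them.

```python
import torch
import torch.nn as nn

# Hypothetical property vocabularies; placeholders for illustration only.
PROPERTIES = {
    "color": ["red", "green", "blue", "white", "black"],
    "size": ["small", "medium", "large"],
}

class RoIPropertyMapper(nn.Module):
    """Maps a pooled RoI feature to one predicted word per property, then
    verbalizes the predictions into an implicit textual description."""
    def __init__(self, roi_dim=1024, hidden=512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(roi_dim, hidden), nn.GELU())
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, len(words)) for name, words in PROPERTIES.items()}
        )

    def forward(self, roi_feat):
        # roi_feat: (B, roi_dim) pooled feature of the 3D region of interest
        h = self.backbone(roi_feat)
        parts_per_property = []
        for name, words in PROPERTIES.items():
            idx = self.heads[name](h).argmax(dim=-1)                  # (B,)
            parts_per_property.append([f"{name}: {words[i]}" for i in idx.tolist()])
        # one comma-separated description string per RoI in the batch
        return [", ".join(parts) for parts in zip(*parts_per_property)]
```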
Citations: 0
Visual-Linguistic Feature Alignment With Semantic and Kinematic Guidance for Referring Multi-Object Tracking
IF 8.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557710
Yizhe Li;Sanping Zhou;Zheng Qin;Le Wang
Referring Multi-Object Tracking (RMOT) aims to dynamically track an arbitrary number of referred targets in a video sequence according to a language expression. Previous methods mainly focus on cross-modal fusion at the feature level with specially designed structures. However, insufficient visual-linguistic alignment is prone to causing visual-linguistic mismatches, leading to some targets being tracked but not correctly referred to, especially for language expressions with complex semantics or motion descriptions. To this end, we propose to conduct visual-linguistic alignment with semantic and kinematic guidance, effectively aligning visual features with more diverse language expressions. In this paper, we put forward a novel end-to-end RMOT framework, SKTrack, which follows a transformer-based architecture with a Language-Guided Decoder (LGD) and a Motion-Aware Aggregator (MAA). In particular, the LGD performs deep semantic interaction layer by layer within a single frame to enhance the alignment ability of the model, while the MAA conducts temporal feature fusion and alignment across multiple frames to enable alignment between visual targets and language expressions with motion descriptions. Extensive experiments on Refer-KITTI and Refer-KITTI-v2 demonstrate that SKTrack achieves state-of-the-art performance and verify the effectiveness of our framework and its components.
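A minimal sketch of one language-guided decoder layer, assuming a DETR-style design in which object queries attend first to the frame's visual tokens and then to the language tokens; dimensions, normalization placement, and the FFN shape are assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class LanguageGuidedDecoderLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, visual_tokens, language_tokens):
        # queries: (B, Q, dim) object queries; visual_tokens: (B, N, dim); language_tokens: (B, L, dim)
        q = self.norm1(queries + self.vis_attn(queries, visual_tokens, visual_tokens)[0])
        q = self.norm2(q + self.lang_attn(q, language_tokens, language_tokens)[0])   # semantic interaction
        return self.norm3(q + self.ffn(q))
```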
Citations: 0
Towards Open-Vocabulary Video Semantic Segmentation
IF 8.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557719
Xinhao Li;Yun Liu;Guolei Sun;Min Wu;Le Zhang;Ce Zhu
Semantic segmentation in videos has been a focal point of recent research. However, existing models encounter challenges when faced with unfamiliar categories. To address this, we introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) task, designed to accurately segment every pixel across a wide range of open-vocabulary categories, including those that are novel or previously unexplored. To enhance OV-VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module, allowing the model to utilize temporal relationships across consecutive frames. Additionally, we incorporate a random frame enhancement module, broadening the model's understanding of semantic context throughout the entire video sequence. Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the video context. Comprehensive evaluations on benchmark datasets such as VSPW and Cityscapes highlight OV-VSS's zero-shot generalization capabilities, especially in handling novel categories. The results validate OV2VSS's effectiveness, demonstrating improved performance in semantic segmentation tasks across diverse video datasets.
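A minimal sketch of the open-vocabulary classification step such a model relies on: dense pixel embeddings aligned to a text space are scored against text embeddings of an arbitrary class-name list, so novel categories can be segmented without retraining. The encoders producing these embeddings are placeholders, not the paper's modules, and the temperature is an assumption.

```python
import torch
import torch.nn.functional as F

def open_vocab_segment(pixel_embs, class_text_embs, temperature=0.07):
    # pixel_embs: (B, D, H, W) dense visual embeddings aligned to the text space
    # class_text_embs: (C, D) text embeddings of the current (possibly novel) class names
    p = F.normalize(pixel_embs, dim=1)
    t = F.normalize(class_text_embs, dim=1)
    logits = torch.einsum('bdhw,cd->bchw', p, t) / temperature
    return logits.argmax(dim=1)                  # (B, H, W) per-pixel class indices
```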
Citations: 0
An Adaptive Framework Embedded With LLM for Knowledge Graph Construction
IF 8.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557717
Qingwang Wang;Chaohui Li;Yi Liu;Qiubai Zhu;Jian Song;Tao Shen
Knowledge graph construction aims to store and represent knowledge of the objective world in a structured form. Existing methods for automatic knowledge graph construction suffer from problems such as difficulty in understanding latent semantics and low precision. The emergence of Large Language Models (LLMs) provides an effective way to construct knowledge graphs automatically. However, using LLMs as automatic knowledge graph construction engines relies on embedding the schema layer, which challenges the input length of LLMs. In this paper, we present a framework for Adaptive Construction of Knowledge Graphs, named ACKG-LLM, which leverages the exceptional generation capabilities of LLMs and the latent relational semantic information of triples. Our proposed framework divides the knowledge graph construction task into three subtasks within a unified pipeline: open information triple extraction, embedding of additional relational semantic information, and knowledge graph normalization based on schema-level embedding. The framework can construct knowledge graphs in different domains, addressing the shortcoming of existing frameworks that need to retrain and fine-tune an internal model. Extensive experiments demonstrate that our proposed ACKG-LLM performs favorably against representative methods on the REBEL and WiKi-NRE datasets.
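A minimal pipeline skeleton of the three subtasks, under the assumption that an LLM call returns open triples and that a sentence-embedding function maps relation phrases to unit-norm vectors; all callables are placeholders, and the paper's schema-level embedding is reduced here to nearest-neighbor matching for illustration.

```python
import numpy as np

def normalize_triples(text, llm_extract, embed, schema_relations):
    # llm_extract: callable prompting an LLM, returning [(head, relation, tail), ...]
    # embed: callable mapping a phrase to a unit-norm vector; schema_relations: list[str]
    triples = llm_extract(text)                                    # subtask 1: open triple extraction
    schema_vecs = np.stack([embed(r) for r in schema_relations])   # (R, D) schema relation embeddings
    normalized = []
    for head, rel, tail in triples:
        v = embed(rel)                                             # subtask 2: relation semantics embedding
        best = int(np.argmax(schema_vecs @ v))                     # subtask 3: map to nearest schema relation
        normalized.append((head, schema_relations[best], tail))
    return normalized
```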
Citations: 0