
Latest Articles from IEEE Transactions on Multimedia

Imp: Highly Capable Large Multimodal Models for Mobile Devices
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-11 | DOI: 10.1109/TMM.2025.3557680
Zhenwei Shao;Zhou Yu;Jun Yu;Xuecheng Ouyang;Lihao Zheng;Zhenbiao Gai;Mingyang Wang;Zhenzhong Kuang;Jiajun Ding
By harnessing the capabilities of large language models (LLMs), recent large multimodal models (LMMs) have shown remarkable versatility in open-world multimodal understanding. Nevertheless, they are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios. To this end, several lightweight LMMs have been proposed successively to maximize the capabilities under constrained scale (e.g., 3B). Despite the encouraging results achieved by these methods, most of them only focus on one or two aspects of the design space, and the key design choices that influence model capability have not yet been thoroughly investigated. In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data. Based on our findings, we obtain Imp—a family of highly capable LMMs at the 2B–4B scales. Notably, our Imp-3B model steadily outperforms all the existing lightweight LMMs of similar size, and even surpasses the state-of-the-art LMMs at the 13B scale. With low-bit quantization and resolution reduction techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile chip with a high inference speed of about 13 tokens/s.
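The deployment figures above hinge in part on low-bit weight quantization. As a rough illustration of what that entails, the sketch below applies symmetric per-channel int8 quantization to a single projection matrix; it is a generic recipe under assumed tensor sizes, not the Imp authors' pipeline.

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a 2-D weight matrix."""
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)  # one scale per row
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)                  # stand-in for one LLM projection matrix
q, s = quantize_weight_int8(w)
print((w - dequantize(q, s)).abs().max())    # per-weight quantization error stays small
```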
Citations: 0
AutoGeo: Automating Geometric Image Dataset Creation for Enhanced Geometry Understanding
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-10 | DOI: 10.1109/TMM.2025.3557720
Zihan Huang;Tao Wu;Wang Lin;Shengyu Zhang;Jingyuan Chen;Fei Wu
With the rapid advancement of large language models, there has been a growing interest in their capabilities in mathematical reasoning. However, existing research has primarily focused on text-based algebra problems, neglecting the study of geometry due to the lack of high-quality geometric datasets. To address this gap, this paper introduces AutoGeo, a novel approach for automatically generating mathematical geometric images to fulfill the demand for large-scale and diverse geometric datasets. AutoGeo facilitates the creation of AutoGeo-100k, an extensive repository comprising 100k high-quality geometry image-text pairs. By leveraging precisely defined geometric clauses, AutoGeo-100k contains a wide variety of geometric shapes, including lines, polygons, circles, and complex spatial relationships, etc. Furthermore, this paper demonstrates the efficacy of AutoGeo-100k in enhancing the performance of multimodal large language models through fine-tuning. Experimental results indicate significant improvements in the model's ability in handling geometric images, as evidenced by enhanced accuracy in tasks such as geometric captioning and mathematical reasoning. This research not only fills a critical gap in the availability of geometric datasets but also paves the way for the advancement of sophisticated AI-driven tools in education and research.
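As a loose illustration of clause-driven image generation, the sketch below renders one symbolic geometric clause to an image and pairs it with a caption; the clause schema and caption template are invented for this example and do not reflect AutoGeo's actual specification.

```python
import numpy as np
import matplotlib.pyplot as plt

def render_clause(clause, path):
    """Render one geometric clause to an image file and return a paired caption."""
    fig, ax = plt.subplots(figsize=(3, 3))
    if clause["type"] == "circle":
        ax.add_patch(plt.Circle(clause["center"], clause["radius"], fill=False))
        caption = f"A circle centered at {clause['center']} with radius {clause['radius']}."
    else:  # "polygon"
        pts = np.array(clause["points"])
        ax.add_patch(plt.Polygon(pts, closed=True, fill=False))
        caption = f"A polygon with {len(pts)} vertices."
    ax.set_xlim(-2, 2); ax.set_ylim(-2, 2); ax.set_aspect("equal"); ax.axis("off")
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return caption

print(render_clause({"type": "circle", "center": (0, 0), "radius": 1.0}, "sample.png"))
```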
Citations: 0
Exploiting EfficientSAM and Temporal Coherence for Audio-Visual Segmentation
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-10 | DOI: 10.1109/TMM.2025.3557637
Yue Zhu;Kun Li;Zongxin Yang
Audio-Visual Segmentation (AVS) aims to accurately identify and segment sound sources within video content at the pixel level and requires a fine-grained semantic understanding of both visual and audio cues. While the Segment Anything Model (SAM) has demonstrated outstanding results across various segmentation tasks, its design is primarily focused on single-image segmentation with points, boxes, and mask prompts. As a result, when SAM is applied directly to AVS, it struggles to effectively leverage contextual information from audio data and capture temporal correlations across video frames. Additionally, its high computational requirements pose challenges to its practical applicability in AVS applications. In this paper, we introduce ESAM-AVS, a new framework built on EfficientSAM, aimed at transferring SAM's prior knowledge to the AVS domain. Specifically, we utilize EfficientSAM as the backbone to maintain model adaptability while significantly lowering computational and processing costs. To tackle the challenges posed by temporal and audio-visual correlations, we design the Inter-Frame Coherence module, which independently integrates the temporal information from both visual and audio modalities. Furthermore, we incorporate an audio-guided prompt encoder that generates audio prompts to provide guidance, effectively integrating audio cues into the segmentation process. By combining these components, our model maximizes the potential of SAM's prior knowledge and adapts it to the more complex AVS task. Extensive experiments on the AVSBench dataset demonstrate that ESAM-AVS outperforms existing state-of-the-art methods.
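The audio-guided prompt encoder can be pictured as a small projection from an audio embedding to a handful of prompt tokens that a SAM-style mask decoder could consume. The sketch below shows that interface only; all dimensions and names are assumptions rather than the ESAM-AVS implementation.

```python
import torch
import torch.nn as nn

class AudioPromptEncoder(nn.Module):
    """Project a clip-level audio embedding into a few sparse prompt tokens."""
    def __init__(self, audio_dim=128, prompt_dim=256, num_prompts=4):
        super().__init__()
        self.num_prompts, self.prompt_dim = num_prompts, prompt_dim
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, prompt_dim), nn.GELU(),
            nn.Linear(prompt_dim, num_prompts * prompt_dim))

    def forward(self, audio_feat):                     # audio_feat: (B, audio_dim)
        p = self.proj(audio_feat)                      # (B, num_prompts * prompt_dim)
        return p.view(-1, self.num_prompts, self.prompt_dim)

prompts = AudioPromptEncoder()(torch.randn(2, 128))
print(prompts.shape)                                   # torch.Size([2, 4, 256])
```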
Citations: 0
VideoDreamer: Customized Multi-Subject Text-to-Video Generation With Disen-Mix Finetuning on Language-Video Foundation Models
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-08 | DOI: 10.1109/TMM.2025.3557634
Hong Chen;Xin Wang;Guanning Zeng;Yipeng Zhang;Yuwei Zhou;Feilin Han;Yaofei Wu;Wenwu Zhu
Customized text-to-video generation aims to generate text-guided videos with user-given subjects, which has gained increasing attention. However, existing works are primarily limited to single-subject oriented text-to-video generation, leaving the more challenging problem of customized multi-subject generation unexplored. In this paper, we fill this gap and propose a novel VideoDreamer framework, which can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects. Specifically, VideoDreamer adopts the pretrained Stable Diffusion with temporal modules as its base video generator, harnessing the power of the text-to-image model to generate diversified content. The video generator is further customized for multiple subjects, leveraging the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy to tackle the attribute binding problem of multi-subject generation. Additionally, we present a disentangled motion customization strategy to finetune the temporal modules so that we can generate videos with both customized subjects and motions. To evaluate the performance of customized multi-subject text-to-video generation, we introduce the MultiStudioBench benchmark. Extensive experiments demonstrate the remarkable ability of VideoDreamer to generate videos with new content such as new events and backgrounds, tailored to the customized multiple subjects.
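One common way to add "temporal modules" to a pretrained image diffusion backbone is a pseudo-3D attention block that attends across the frame axis at each spatial location. The sketch below shows that generic pattern under assumed token layout and dimensions; it is not the VideoDreamer code.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, applied independently per spatial token."""
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                               # x: (B, T, N, C), N = H*W latent tokens
        b, t, n, c = x.shape
        h = x.permute(0, 2, 1, 3).reshape(b * n, t, c)  # group tokens by spatial position
        q = self.norm(h)
        h = self.attn(q, q, q)[0] + h                   # residual temporal attention
        return h.reshape(b, n, t, c).permute(0, 2, 1, 3)

out = TemporalAttention()(torch.randn(1, 8, 64, 320))   # 8 frames of an 8x8 latent grid
print(out.shape)                                        # torch.Size([1, 8, 64, 320])
```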
Citations: 0
Cps-STS: Bridging the Gap Between Content and Position for Coarse-Point-Supervised Scene Text Spotter
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-04 | DOI: 10.1109/TMM.2024.3521756
Weida Chen;Jie Jiang;Linfei Wang;Huafeng Li;Yibing Zhan;Dapeng Tao
Recently, weakly supervised methods for scene text spotting have become increasingly popular with researchers due to their potential to significantly reduce dataset annotation effort. The latest progress in this field is text spotters based on single- or multi-point annotations. However, such methods struggle with the sensitivity of text recognition to the precise annotation location and fail to capture the relative positions and shapes of characters, leading to impaired recognition of texts with extensive rotations and flips. To address these challenges, this paper develops a novel method named Coarse-point-supervised Scene Text Spotter (Cps-STS). Cps-STS first utilizes a few approximate points as text location labels and introduces a learnable position modulation mechanism, easing the accuracy requirements for annotations and enhancing model robustness. Additionally, we incorporate a Spatial Compatibility Attention (SCA) module for text decoding to effectively utilize spatial data such as position and shape. This module fuses compound queries and global feature maps, serving as a bias in the SCA module to express text spatial morphology. To accurately locate and decode text content, we introduce features containing spatial morphology information and text content into the input features of the text decoder. Ablation experiments demonstrate that introducing features with spatial morphology information as bias terms into the text decoder enables the model to effectively identify and utilize the relationship between text content and position, enhancing recognition performance. One significant advantage of Cps-STS is its ability to achieve full supervision-level performance with just a few imprecise coarse points at a low cost. Extensive experiments validate the effectiveness and superiority of Cps-STS over existing approaches.
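The SCA description, where spatial-morphology features act as a bias inside attention, matches the generic pattern of adding a learned spatial term to the attention logits. The sketch below illustrates only that pattern; shapes, names, and the scalar-bias form are assumptions rather than the Cps-STS design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialBiasAttention(nn.Module):
    """Cross-attention whose logits receive an additive bias from spatial features."""
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.bias_proj = nn.Linear(dim, 1)   # spatial feature -> one bias per key position

    def forward(self, queries, feats, spatial_feats):
        # queries: (B, Nq, C); feats, spatial_feats: (B, Nk, C)
        logits = self.q(queries) @ self.k(feats).transpose(1, 2) / feats.shape[-1] ** 0.5
        logits = logits + self.bias_proj(spatial_feats).transpose(1, 2)  # broadcast (B, 1, Nk)
        return F.softmax(logits, dim=-1) @ self.v(feats)

attn = SpatialBiasAttention()
out = attn(torch.randn(2, 10, 256), torch.randn(2, 100, 256), torch.randn(2, 100, 256))
print(out.shape)                              # torch.Size([2, 10, 256])
```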
Citations: 0
Multimodal Large Models are Effective Action Anticipators
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557615
Binglu Wang;Yao Tian;Shunzhou Wang;Le Yang
The task of long-term action anticipation demands solutions that can effectively model temporal dynamics over extended periods while deeply understanding the inherent semantics of actions. Traditional approaches, which primarily rely on recurrent units or Transformer layers to capture long-term dependencies, often fall short in addressing these challenges. Large Language Models (LLMs), with their robust sequential modeling capabilities and extensive commonsense knowledge, present new opportunities for long-term action anticipation. In this work, we introduce the ActionLLM framework, a novel approach that treats video sequences as successive tokens, leveraging LLMs to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer, enabling straightforward action prediction without the need for complex instructions or redundant descriptions. To further harness the commonsense reasoning of LLMs, we predict action categories for observed frames and use sequential textual clues to guide semantic understanding. In addition, we introduce a Cross-Modality Interaction Block, designed to explore the specificity within each modality and capture interactions between vision and textual modalities, thereby enhancing multimodal tuning. Extensive experiments on benchmark datasets demonstrate the superiority of the proposed ActionLLM framework, encouraging a promising direction to explore LLMs in the context of action anticipation.
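The interface described above, projecting observed frame features into a sequence model's token space, appending learnable future tokens, and decoding actions with a single linear layer, can be sketched as follows. The backbone here is a small stand-in transformer rather than an LLM, and all names and sizes are assumptions, not the ActionLLM code.

```python
import torch
import torch.nn as nn

class FrameToActionHead(nn.Module):
    """Frame tokens plus learnable future tokens -> per-step action logits via a linear head."""
    def __init__(self, vis_dim=768, model_dim=512, num_future=8, num_actions=100):
        super().__init__()
        self.proj = nn.Linear(vis_dim, model_dim)             # visual tokens into model space
        self.future = nn.Parameter(torch.randn(num_future, model_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(model_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the LLM
        self.head = nn.Linear(model_dim, num_actions)         # single linear "decoder"

    def forward(self, frames):                                # frames: (B, T, vis_dim)
        b = frames.shape[0]
        tokens = torch.cat([self.proj(frames), self.future.expand(b, -1, -1)], dim=1)
        out = self.backbone(tokens)[:, -self.future.shape[0]:]
        return self.head(out)                                 # (B, num_future, num_actions)

logits = FrameToActionHead()(torch.randn(2, 16, 768))
print(logits.shape)                                           # torch.Size([2, 8, 100])
```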
Citations: 0
Semantics Alternating Enhancement and Bidirectional Aggregation for Referring Video Object Segmentation
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557689
Jiaxing Yang;Lihe Zhang;Huchuan Lu
Referring Video Object Segmentation (RVOS) aims at segmenting out the described object in a video clip according to a given expression. The task requires methods to effectively fuse cross-modality features, communicate temporal information, and delineate referent appearance. However, existing solutions bias their focus toward mining only one or two of these cues, which limits their performance. In this paper, we propose Semantics Alternating Enhancement (SAE) to achieve cross-modality fusion and temporal-spatial semantics mining in an alternating way that makes comprehensive exploitation of all three cues possible. During each update, SAE generates a cross-modality, temporal-aware vector that guides the vision feature to amplify its referent semantics while filtering out irrelevant content. In return, the purified feature provides the contextual soil to produce a more refined guider. Overall, cross-modality interaction and temporal communication are interleaved into axial semantics enhancement steps. Moreover, we design a simplified SAE by dropping the spatial semantics enhancement steps, and employ this variant in the early stages of the vision encoder to further enhance usability. To integrate features of different scales, we propose a Bidirectional Semantic Aggregation decoder (BSA) to obtain the referent mask. The BSA arranges the comprehensively-enhanced features into two groups, and then employs a difference-awareness step to achieve bidirectional intra-group feature aggregation and a consistency-constraint step to realize inter-group integration of semantics-dense and appearance-rich features. Extensive results on challenging benchmarks show that our method performs favorably against state-of-the-art competitors.
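A single SAE-style update can be pictured as channel-wise gating: a cross-modality, temporal-aware vector gates the vision feature, and the purified feature in turn refines the guiding vector. The sketch below shows only this alternating loop in a generic form; the sigmoid gating function and the shapes are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn

class GuidedGate(nn.Module):
    """One alternating step: the guide gates vision channels, gated vision refines the guide."""
    def __init__(self, vis_dim=256, guide_dim=256):
        super().__init__()
        self.to_gate = nn.Sequential(nn.Linear(guide_dim, vis_dim), nn.Sigmoid())
        self.to_guide = nn.Linear(vis_dim, guide_dim)

    def forward(self, vis, guide):
        # vis: (B, N, C) frame tokens; guide: (B, guide_dim) cross-modal vector
        gated = vis * self.to_gate(guide).unsqueeze(1)        # amplify referent channels
        new_guide = guide + self.to_guide(gated.mean(dim=1))  # refined guider for the next step
        return gated, new_guide

step = GuidedGate()
vis, guide = torch.randn(2, 1024, 256), torch.randn(2, 256)
for _ in range(3):                                            # a few alternating updates
    vis, guide = step(vis, guide)
print(vis.shape, guide.shape)
```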
Citations: 0
Multimodal Evidential Learning for Open-World Weakly-Supervised Video Anomaly Detection
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557682
Chao Huang;Weiliang Huang;Qiuping Jiang;Wei Wang;Jie Wen;Bob Zhang
Efforts in weakly-supervised video anomaly detection center on detecting abnormal events within videos using coarse-grained labels, which has been successfully applied to many real-world applications. However, a significant limitation of most existing methods is that they are only effective for specific objects in specific scenarios, which makes them prone to misclassification or omission when confronted with previously unseen anomalies. Relative to conventional anomaly detection tasks, Open-world Weakly-supervised Video Anomaly Detection (OWVAD) poses greater challenges due to the absence of labels and fine-grained annotations for unknown anomalies. To address the above problem, we propose a multi-scale evidential vision-language model to achieve open-world video anomaly detection. Specifically, we leverage generalized visual-language associations derived from CLIP to harness the full potential of large pre-trained models in addressing the OWVAD task. Subsequently, we integrate a multi-scale temporal modeling module with a multimodal evidence collector to achieve precise frame-level detection of both seen and unseen anomalies. Extensive experiments on two widely-utilized benchmarks have conclusively validated the effectiveness of our method. The code will be made publicly available.
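The term "evidence collector" points to the standard evidential deep-learning head, where non-negative evidence defines a Dirichlet distribution whose total strength yields an uncertainty score. The sketch below shows that generic recipe for frame-level scores; it is not the paper's exact collector, and the shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def evidential_head(logits):
    """Map frame-level logits to Dirichlet-based class probabilities and uncertainty."""
    # logits: (B, T, K) scores for K classes per frame
    evidence = F.softplus(logits)              # non-negative evidence
    alpha = evidence + 1.0                     # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    prob = alpha / strength                    # expected class probabilities
    uncertainty = logits.shape[-1] / strength  # K / S: high when evidence is scarce
    return prob, uncertainty

prob, unc = evidential_head(torch.randn(2, 16, 2))
print(prob.shape, unc.shape)                   # torch.Size([2, 16, 2]) torch.Size([2, 16, 1])
```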
Citations: 0
Few-Shot 3D Point Cloud Segmentation via Relation Consistency-Guided Heterogeneous Prototypes
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557699
Lili Wei;Congyan Lang;Zheming Xu;Liqian Liang;Jun Liu
Few-shot 3D point cloud semantic segmentation is a challenging task due to the lack of labeled point clouds (support set). To segment unlabeled query point clouds, existing prototype-based methods learn 3D prototypes from point features of the support set and then measure their distances to the query points. However, such homogeneous 3D prototypes are often of low quality because they overlook the valuable heterogeneous information buried in the support set, such as semantic labels and projected 2D depth maps. To address this issue, in this paper, we propose a novel Relation Consistency-guided Heterogeneous Prototype learning framework (RCHP), which improves prototype quality by integrating heterogeneous information using large multi-modal models (e.g., CLIP). RCHP achieves this through two core components: a Heterogeneous Prototype Generation module, which collaborates with 3D networks and CLIP to generate heterogeneous prototypes, and a Heterogeneous Prototype Fusion module, which effectively fuses heterogeneous prototypes to obtain high-quality prototypes. Furthermore, to bridge the gap between heterogeneous prototypes, we introduce a Heterogeneous Relation Consistency loss, which transfers more reliable inter-class relations (i.e., inter-prototype relations) from refined prototypes to heterogeneous ones. Extensive experiments conducted on five point cloud segmentation datasets, including four indoor datasets (S3DIS, ScanNet, SceneNN, NYU Depth V2) and one outdoor dataset (Semantic3D), demonstrate the superiority and generalization capability of our method, outperforming state-of-the-art approaches across all datasets.
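For context, the homogeneous baseline that RCHP improves on computes one prototype per class by masked averaging of support features and labels query points by cosine similarity, as sketched below. The CLIP-based heterogeneous prototypes and the relation-consistency loss are omitted, and the sizes are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

def masked_prototypes(support_feat, support_mask):
    """support_feat: (N, C) point features; support_mask: (N, K) one-hot class labels."""
    summed = support_mask.t().float() @ support_feat              # (K, C) per-class feature sums
    counts = support_mask.sum(dim=0).clamp(min=1).unsqueeze(1)    # points per class
    return summed / counts                                        # (K, C) class prototypes

def classify_queries(query_feat, prototypes):
    """Assign each query point to the most cosine-similar prototype."""
    sim = F.normalize(query_feat, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    return sim.argmax(dim=-1)

feat = torch.randn(2048, 320)
mask = F.one_hot(torch.randint(0, 3, (2048,)), num_classes=3)
pred = classify_queries(torch.randn(4096, 320), masked_prototypes(feat, mask))
print(pred.shape)                                                 # torch.Size([4096])
```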
Citations: 0
RoG-SAM: A Language-Driven Framework for Instance-Level Robotic Grasping Detection
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-04-03 | DOI: 10.1109/TMM.2025.3557685
Yunpeng Mei;Jian Sun;Zhihong Peng;Fang Deng;Gang Wang;Jie Chen
Robotic grasping is a crucial topic in robotics and computer vision, with broad applications in industrial production and intelligent manufacturing. Although some methods have begun addressing instance-level grasping, most remain limited to predefined instances and categories, lacking flexibility for open-vocabulary grasp prediction based on user-specified instructions. To address this, we propose RoG-SAM, a language-driven, instance-level grasp detection framework built on Segment Anything Model (SAM). RoG-SAM utilizes open-vocabulary prompts for object localization and grasp pose prediction, adapting SAM through transfer learning with encoder adapters and multi-head decoders to extend its segmentation capabilities to grasp pose estimation. Experimental results show that RoG-SAM achieves competitive performance on single-object datasets (Cornell and Jacquard) and cluttered datasets (GraspNet-1Billion and OCID), with instance-level accuracies of 91.2% and 90.1%, respectively, while using only 28.3% of SAM's trainable parameters. The effectiveness of RoG-SAM was also validated in real-world environments.
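Two of the building blocks named above, encoder adapters and a grasp-pose head, can be sketched generically: a bottleneck adapter attached to a frozen encoder feature, and a small head regressing a planar grasp (center, angle, width) from a pooled instance feature. The parameterization and sizes are assumptions, not the RoG-SAM design.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter applied to a frozen encoder feature."""
    def __init__(self, dim=256, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class GraspHead(nn.Module):
    """Regress a planar grasp (x, y, angle, width) from a pooled instance feature."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 5))

    def forward(self, feat):                        # feat: (B, dim)
        x, y, cos_t, sin_t, width = self.mlp(feat).unbind(dim=-1)
        angle = torch.atan2(sin_t, cos_t)           # rotation recovered from (cos, sin)
        return torch.stack([x, y, angle, width], dim=-1)

grasp = GraspHead()(Adapter()(torch.randn(2, 256)))
print(grasp.shape)                                  # torch.Size([2, 4])
```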
Citations: 0