By harnessing the capabilities of large language models (LLMs), recent large multimodal models (LMMs) have shown remarkable versatility in open-world multimodal understanding. Nevertheless, they are usually parameter-heavy and computation-intensive, which hinders their applicability in resource-constrained scenarios. To this end, several lightweight LMMs have been proposed successively to maximize capability under a constrained scale (e.g., 3B). Despite the encouraging results achieved by these methods, most of them focus on only one or two aspects of the design space, and the key design choices that influence model capability have not yet been thoroughly investigated. In this paper, we conduct a systematic study of lightweight LMMs from the aspects of model architecture, training strategy, and training data. Based on our findings, we obtain Imp—a family of highly capable LMMs at the 2B–4B scales. Notably, our Imp-3B model steadily outperforms all existing lightweight LMMs of similar size, and even surpasses state-of-the-art LMMs at the 13B scale. With low-bit quantization and resolution-reduction techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8 Gen 3 mobile chip with a high inference speed of about 13 tokens/s.
{"title":"Imp: Highly Capable Large Multimodal Models for Mobile Devices","authors":"Zhenwei Shao;Zhou Yu;Jun Yu;Xuecheng Ouyang;Lihao Zheng;Zhenbiao Gai;Mingyang Wang;Zhenzhong Kuang;Jiajun Ding","doi":"10.1109/TMM.2025.3557680","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557680","url":null,"abstract":"By harnessing the capabilities of large language models (LLMs), recent large multimodal models (LMMs) have shown remarkable versatility in open-world multimodal understanding. Nevertheless, they are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios. To this end, several lightweight LMMs have been proposed successively to maximize the capabilities under constrained scale (e.g., 3B). Despite the encouraging results achieved by these methods, most of them only focus on one or two aspects of the design space, and the key design choices that influence model capability have not yet been thoroughly investigated. In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data. Based on our findings, we obtain Imp—a family of highly capable LMMs at the 2B<inline-formula><tex-math>$sim$</tex-math></inline-formula>4B scales. Notably, our Imp-3B model steadily outperforms all the existing lightweight LMMs of similar size, and even surpasses the state-of-the-art LMMs at the 13B scale. With low-bit quantization and resolution reduction techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile chip with a high inference speed of about 13 tokens/s.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2961-2974"},"PeriodicalIF":8.4,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the rapid advancement of large language models, there has been growing interest in their capabilities in mathematical reasoning. However, existing research has primarily focused on text-based algebra problems, neglecting the study of geometry due to the lack of high-quality geometric datasets. To address this gap, this paper introduces AutoGeo, a novel approach for automatically generating mathematical geometric images to fulfill the demand for large-scale and diverse geometric datasets. AutoGeo facilitates the creation of AutoGeo-100k, an extensive repository comprising 100k high-quality geometry image-text pairs. By leveraging precisely defined geometric clauses, AutoGeo-100k covers a wide variety of geometric shapes, including lines, polygons, circles, and complex spatial relationships. Furthermore, this paper demonstrates the efficacy of AutoGeo-100k in enhancing the performance of multimodal large language models through fine-tuning. Experimental results indicate significant improvements in the model's ability to handle geometric images, as evidenced by enhanced accuracy in tasks such as geometric captioning and mathematical reasoning. This research not only fills a critical gap in the availability of geometric datasets but also paves the way for the advancement of sophisticated AI-driven tools in education and research.
{"title":"AutoGeo: Automating Geometric Image Dataset Creation for Enhanced Geometry Understanding","authors":"Zihan Huang;Tao Wu;Wang Lin;Shengyu Zhang;Jingyuan Chen;Fei Wu","doi":"10.1109/TMM.2025.3557720","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557720","url":null,"abstract":"With the rapid advancement of large language models, there has been a growing interest in their capabilities in mathematical reasoning. However, existing research has primarily focused on text-based algebra problems, neglecting the study of geometry due to the lack of high-quality geometric datasets. To address this gap, this paper introduces AutoGeo, a novel approach for automatically generating mathematical geometric images to fulfill the demand for large-scale and diverse geometric datasets. AutoGeo facilitates the creation of AutoGeo-100 k, an extensive repository comprising 100 k high-quality geometry image-text pairs. By leveraging precisely defined geometric clauses, AutoGeo-100 k contains a wide variety of geometric shapes, including lines, polygons, circles, and complex spatial relationships, etc. Furthermore, this paper demonstrates the efficacy of AutoGeo-100 k in enhancing the performance of multimodal large language models through fine-tuning. Experimental results indicate significant improvements in the model's ability in handling geometric images, as evidenced by enhanced accuracy in tasks such as geometric captioning and mathematical reasoning. This research not only fills a critical gap in the availability of geometric datasets but also paves the way for the advancement of sophisticated AI-driven tools in education and research.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3105-3116"},"PeriodicalIF":8.4,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-10. DOI: 10.1109/TMM.2025.3557637
Yue Zhu;Kun Li;Zongxin Yang
Audio-Visual Segmentation (AVS) aims to accurately identify and segment sound sources within video content at the pixel level and requires a fine-grained semantic understanding of both visual and audio cues. While the Segment Anything Model (SAM) has demonstrated outstanding results across various segmentation tasks, its design is primarily focused on single-image segmentation with point, box, and mask prompts. As a result, when SAM is applied directly to AVS, it struggles to effectively leverage contextual information from audio data and capture temporal correlations across video frames. Additionally, its high computational requirements limit its practicality in AVS applications. In this paper, we introduce ESAM-AVS, a new framework built on EfficientSAM, aimed at transferring SAM's prior knowledge to the AVS domain. Specifically, we utilize EfficientSAM as the backbone to maintain model adaptability while significantly lowering computational and processing costs. To tackle the challenges posed by temporal and audio-visual correlations, we design the Inter-Frame Coherence module, which independently integrates temporal information from both the visual and audio modalities. Furthermore, we incorporate an audio-guided prompt encoder that generates audio prompts to provide guidance, effectively integrating audio cues into the segmentation process. By combining these components, our model maximizes the potential of SAM's prior knowledge and adapts it to the more complex AVS task. Extensive experiments on the AVSBench dataset demonstrate that ESAM-AVS outperforms existing state-of-the-art methods.
{"title":"Exploiting EfficientSAM and Temporal Coherence for Audio-Visual Segmentation","authors":"Yue Zhu;Kun Li;Zongxin Yang","doi":"10.1109/TMM.2025.3557637","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557637","url":null,"abstract":"Audio-Visual Segmentation (AVS) aims to accurately identify and segment sound sources within video content at the pixel level and requires a fine-grained semantic understanding of both visual and audio cues. While the Segment Anything Model (SAM) has demonstrated outstanding results across various segmentation tasks, its design is primarily focused on single-image segmentation with points, boxes, and mask prompts. As a result, when SAM is applied directly to AVS, it struggles to effectively leverage contextual information from audio data and capture temporal correlations across video frames. Additionally, its high computational requirements pose challenges to its practical applicability in AVS applications. In this paper, we introduce ESAM-AVS, a new framework built on EfficientSAM, aimed at transferring SAM's prior knowledge to the AVS domain. Specifically, we utilize the EfficientSAM as the backbone to maintain model adaptability while significantly lowering computational and processing costs. To tackle the challenges posed by temporal and audio-visual correlations, we designed the Inter-Frame Coherence module, which independently integrates the temporal information from both visual and audio modalities. Furthermore, we incorporate an audio-guided prompt encoder that generates audio prompts to provide guidance, effectively integrating audio cues into the segmentation process. By combining these components, our model maximizes the potential of SAM's prior knowledge, and adapts it to the more complex AVS task. Extensive experiments on the AVSBench dataset demonstrate that ESAM-AVS outperforms existing state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2999-3008"},"PeriodicalIF":8.4,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-08. DOI: 10.1109/TMM.2025.3557634
Hong Chen;Xin Wang;Guanning Zeng;Yipeng Zhang;Yuwei Zhou;Feilin Han;Yaofei Wu;Wenwu Zhu
Customized text-to-video generation aims to generate text-guided videos with user-given subjects, and has gained increasing attention. However, existing works are primarily limited to single-subject-oriented text-to-video generation, leaving the more challenging problem of customized multi-subject generation unexplored. In this paper, we fill this gap and propose a novel VideoDreamer framework, which can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects. Specifically, VideoDreamer adopts the pretrained Stable Diffusion with temporal modules as its base video generator, leveraging the power of the text-to-image model to generate diversified content. The video generator is further customized for multiple subjects via the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategies, which tackle the attribute binding problem of multi-subject generation. Additionally, we present a disentangled motion customization strategy to finetune the temporal modules so that we can generate videos with both customized subjects and motions. To evaluate the performance of customized multi-subject text-to-video generation, we introduce the MultiStudioBench benchmark. Extensive experiments demonstrate the remarkable ability of VideoDreamer to generate videos with new content, such as new events and backgrounds, tailored to the customized multiple subjects.
{"title":"VideoDreamer: Customized Multi-Subject Text-to-Video Generation With Disen-Mix Finetuning on Language-Video Foundation Models","authors":"Hong Chen;Xin Wang;Guanning Zeng;Yipeng Zhang;Yuwei Zhou;Feilin Han;Yaofei Wu;Wenwu Zhu","doi":"10.1109/TMM.2025.3557634","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557634","url":null,"abstract":"Customized text-to-video generation aims to generate text-guided videos with user-given subjects, which has gained increasing attention. However, existing works are primarily limited to single-subject oriented text-to-video generation, leaving the more challenging problem of customized multi-subject generation unexplored. In this paper, we fill this gap and propose a novel VideoDreamer framework, which can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects. Specifically, VideoDreamer adopts the pretrained Stable Diffusion with temporal modules as its base video generator, taking the power of the text-to-image model to generate diversified content. The video generator is further customized for multi-subjects, which leverages the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy, to tackle the attribute binding problem of multi-subject generation. Additionally, we present a disentangled motion customization strategy to finetune the temporal modules so that we can generate videos with both customized subjects and motions. To evaluate the performance of customized multi-subject text-to-video generation, we introduce the MultiStudioBench benchmark. Extensive experiments demonstrate the remarkable ability of VideoDreamer to generate videos with new content such as new events and backgrounds, tailored to the customized multiple subjects.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2875-2885"},"PeriodicalIF":8.4,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144170897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-04. DOI: 10.1109/TMM.2024.3521756
Weida Chen;Jie Jiang;Linfei Wang;Huafeng Li;Yibing Zhan;Dapeng Tao
Recently, weakly supervised methods for scene text spotting have become increasingly popular with researchers due to their potential to significantly reduce dataset annotation effort. The latest progress in this field is text spotters based on single- or multi-point annotations. However, such methods are sensitive to the precise annotation location and fail to capture the relative positions and shapes of characters, leading to impaired recognition of texts with extensive rotations and flips. To address these challenges, this paper develops a novel method named Coarse-point-supervised Scene Text Spotter (Cps-STS). Cps-STS first utilizes a few approximate points as text location labels and introduces a learnable position modulation mechanism, easing the accuracy requirements for annotations and enhancing model robustness. Additionally, we incorporate a Spatial Compatibility Attention (SCA) module for text decoding to effectively utilize spatial information such as position and shape. This module fuses compound queries and global feature maps, which serve as a bias in the SCA module to express text spatial morphology. To accurately locate and decode text content, we feed features containing spatial morphology information, together with text content, into the text decoder. Ablation experiments demonstrate that introducing these spatial-morphology features as bias terms enables the model to identify and exploit the relationship between text content and position, enhancing recognition performance. One significant advantage of Cps-STS is its ability to achieve full-supervision-level performance at low cost with just a few imprecise coarse points. Extensive experiments validate the effectiveness and superiority of Cps-STS over existing approaches.
{"title":"Cps-STS: Bridging the Gap Between Content and Position for Coarse-Point-Supervised Scene Text Spotter","authors":"Weida Chen;Jie Jiang;Linfei Wang;Huafeng Li;Yibing Zhan;Dapeng Tao","doi":"10.1109/TMM.2024.3521756","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521756","url":null,"abstract":"Recently, weakly supervised methods for scene text spotter are increasingly popular with researchers due to their potential to significantly reduce dataset annotation efforts. The latest progress in this field is text spotter based on single or multi-point annotations. However, this method struggles with the sensitivity of text recognition to the precise annotation location and fails to capture the relative positions and shapes of characters, leading to impaired recognition of texts with extensive rotations and flips. To address these challenges, this paper develops a novel method named Coarse-point-supervised Scene Text Spotter (Cps-STS). Cps-STS first utilizes a few approximate points as text location labels and introduces a learnable position modulation mechanism, easing the accuracy requirements for annotations and enhancing model robustness. Additionally, we incorporate a Spatial Compatibility Attention (SCA) module for text decoding to effectively utilize spatial data such as position and shape. This module fuses compound queries and global feature maps, serving as a bias in the SCA module to express text spatial morphology. In order to accurately locate and decode text content, we introduce features containing spatial morphology information and text content into the input features of the text decoder. By introducing features with spatial morphology information as bias terms into the text decoder, ablation experiments demonstrate that this operation enables the model to effectively identify and utilize the relationship between text content and position to enhance the recognition performance of our model. One significant advantage of Cps-STS is its ability to achieve full supervision-level performance with just a few imprecise coarse points at a low cost. Extensive experiments validate the effectiveness and superiority of Cps-STS over existing approaches.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1652-1664"},"PeriodicalIF":8.4,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-03. DOI: 10.1109/TMM.2025.3557615
Binglu Wang;Yao Tian;Shunzhou Wang;Le Yang
The task of long-term action anticipation demands solutions that can effectively model temporal dynamics over extended periods while deeply understanding the inherent semantics of actions. Traditional approaches, which primarily rely on recurrent units or Transformer layers to capture long-term dependencies, often fall short in addressing these challenges. Large Language Models (LLMs), with their robust sequential modeling capabilities and extensive commonsense knowledge, present new opportunities for long-term action anticipation. In this work, we introduce the ActionLLM framework, a novel approach that treats video sequences as successive tokens, leveraging LLMs to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer, enabling straightforward action prediction without the need for complex instructions or redundant descriptions. To further harness the commonsense reasoning of LLMs, we predict action categories for observed frames and use sequential textual clues to guide semantic understanding. In addition, we introduce a Cross-Modality Interaction Block, designed to explore the specificity within each modality and capture interactions between vision and textual modalities, thereby enhancing multimodal tuning. Extensive experiments on benchmark datasets demonstrate the superiority of the proposed ActionLLM framework, encouraging a promising direction to explore LLMs in the context of action anticipation.
{"title":"Multimodal Large Models are Effective Action Anticipators","authors":"Binglu Wang;Yao Tian;Shunzhou Wang;Le Yang","doi":"10.1109/TMM.2025.3557615","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557615","url":null,"abstract":"The task of long-term action anticipation demands solutions that can effectively model temporal dynamics over extended periods while deeply understanding the inherent semantics of actions. Traditional approaches, which primarily rely on recurrent units or Transformer layers to capture long-term dependencies, often fall short in addressing these challenges. Large Language Models (LLMs), with their robust sequential modeling capabilities and extensive commonsense knowledge, present new opportunities for long-term action anticipation. In this work, we introduce the ActionLLM framework, a novel approach that treats video sequences as successive tokens, leveraging LLMs to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer, enabling straightforward action prediction without the need for complex instructions or redundant descriptions. To further harness the commonsense reasoning of LLMs, we predict action categories for observed frames and use sequential textual clues to guide semantic understanding. In addition, we introduce a Cross-Modality Interaction Block, designed to explore the specificity within each modality and capture interactions between vision and textual modalities, thereby enhancing multimodal tuning. Extensive experiments on benchmark datasets demonstrate the superiority of the proposed ActionLLM framework, encouraging a promising direction to explore LLMs in the context of action anticipation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2949-2960"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-03. DOI: 10.1109/TMM.2025.3557689
Jiaxing Yang;Lihe Zhang;Huchuan Lu
Referring Video Object Segmentation (RVOS) aims to segment out the object described by a given expression in a video clip. The task requires methods to effectively fuse cross-modality features, communicate temporal information, and delineate the referent's appearance. However, existing solutions bias their focus toward mining only one or two of these clues, which leaves their performance inferior. In this paper, we propose Semantics Alternating Enhancement (SAE) to achieve cross-modality fusion and temporal-spatial semantics mining in an alternating way that makes comprehensive exploitation of all three cues possible. During each update, SAE generates a cross-modality, temporal-aware vector that guides the vision feature to amplify its referent semantics while filtering out irrelevant content. In return, the purified feature provides the context to produce a more refined guidance vector. Overall, cross-modality interaction and temporal communication are interleaved into axial semantics enhancement steps. Moreover, we design a simplified SAE by dropping the spatial semantics enhancement steps and employ this variant in the early stages of the vision encoder to further enhance usability. To integrate features of different scales, we propose a Bidirectional Semantic Aggregation (BSA) decoder to obtain the referent mask. The BSA arranges the comprehensively enhanced features into two groups, and then employs a difference-awareness step to achieve intra-group feature aggregation bidirectionally and a consistency-constraint step to realize inter-group integration of semantics-dense and appearance-rich features. Extensive results on challenging benchmarks show that our method performs favorably against state-of-the-art competitors.
{"title":"Semantics Alternating Enhancement and Bidirectional Aggregation for Referring Video Object Segmentation","authors":"Jiaxing Yang;Lihe Zhang;Huchuan Lu","doi":"10.1109/TMM.2025.3557689","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557689","url":null,"abstract":"Referring Video Object Segmentation (RVOS) aims at segmenting out the described object in a video clip according to given expression. The task requires methods to effectively fuse cross-modality features, communicate temporal information, and delineate referent appearance. However, existing solutions bias their focus to mainly mining one or two clues, causing their performance inferior. In this paper, we propose Semantics Alternating Enhancement (SAE) to achieve cross-modality fusion and temporal-spatial semantics mining in an alternate way that makes comprehensive exploit of three cues possible. During each update, SAE will generate a cross-modality and temporal-aware vector that guides vision feature to amplify its referent semantics while filtering out irrelevant contents. In return, the purified feature will provide the contextual soil to produce a more refined guider. Overall, cross-modality interaction and temporal communication are together interleaved into axial semantics enhancement steps. Moreover, we design a simplified SAE by dropping spatial semantics enhancement steps, and employ the variant in the early stages of vision encoder to further enhance usability. To integrate features of different scales, we propose Bidirectional Semantic Aggregation decoder (BSA) to obtain referent mask. The BSA arranges the comprehensively-enhanced features into two groups, and then employs difference awareness step to achieve intra-group feature aggregation bidirectionally and consistency constraint step to realize inter-group integration of semantics-dense and appearance-rich features. Extensive results on challenging benchmarks show that our method performs favorably against the state-of-the-art competitors.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2987-2998"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efforts in weakly-supervised video anomaly detection center on detecting abnormal events within videos using coarse-grained labels, and have been successfully applied in many real-world applications. However, a significant limitation of most existing methods is that they are only effective for specific objects in specific scenarios, which makes them prone to misclassification or omission when confronted with previously unseen anomalies. Relative to conventional anomaly detection tasks, Open-world Weakly-supervised Video Anomaly Detection (OWVAD) poses greater challenges due to the absence of labels and fine-grained annotations for unknown anomalies. To address this problem, we propose a multi-scale evidential vision-language model to achieve open-world video anomaly detection. Specifically, we leverage generalized vision-language associations derived from CLIP to harness the full potential of large pre-trained models in addressing the OWVAD task. Subsequently, we integrate a multi-scale temporal modeling module with a multimodal evidence collector to achieve precise frame-level detection of both seen and unseen anomalies. Extensive experiments on two widely-used benchmarks conclusively validate the effectiveness of our method. The code will be made publicly available.
{"title":"Multimodal Evidential Learning for Open-World Weakly-Supervised Video Anomaly Detection","authors":"Chao Huang;Weiliang Huang;Qiuping Jiang;Wei Wang;Jie Wen;Bob Zhang","doi":"10.1109/TMM.2025.3557682","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557682","url":null,"abstract":"Efforts in weakly-supervised video anomaly detection center on detecting abnormal events within videos by coarse-grained labels, which has been successfully applied to many real-world applications. However, a significant limitation of most existing methods is that they are only effective for specific objects in specific scenarios, which makes them prone to misclassification or omission when confronted with previously unseen anomalies. Relative to conventional anomaly detection tasks, Open-world Weakly-supervised Video Anomaly Detection (OWVAD) poses greater challenges due to the absence of labels and fine-grained annotations for unknown anomalies. To address the above problem, we propose a multi-scale evidential vision-language model to achieve open-world video anomaly detection. Specifically, we leverage generalized visual-language associations derived from CLIP to harness the full potential of large pre-trained models in addressing the OWVAD task. Subsequently, we integrate a multi-scale temporal modeling module with a multimodal evidence collector to achieve precise frame-level detection of both seen and unseen anomalies. Extensive experiments on two widely-utilized benchmarks have conclusively validated the effectiveness of our method. The code will be made publicly available.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3132-3143"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-03. DOI: 10.1109/TMM.2025.3557699
Lili Wei;Congyan Lang;Zheming Xu;Liqian Liang;Jun Liu
Few-shot 3D point cloud semantic segmentation is a challenging task due to the scarcity of labeled point clouds (the support set). To segment unlabeled query point clouds, existing prototype-based methods learn 3D prototypes from point features of the support set and then measure their distances to the query points. However, such homogeneous 3D prototypes are often of low quality because they overlook the valuable heterogeneous information buried in the support set, such as semantic labels and projected 2D depth maps. To address this issue, in this paper, we propose a novel Relation Consistency-guided Heterogeneous Prototype learning framework (RCHP), which improves prototype quality by integrating heterogeneous information using large multi-modal models (e.g., CLIP). RCHP achieves this through two core components: a Heterogeneous Prototype Generation module, which collaborates with 3D networks and CLIP to generate heterogeneous prototypes, and a Heterogeneous Prototype Fusion module, which effectively fuses heterogeneous prototypes to obtain high-quality prototypes. Furthermore, to bridge the gap between heterogeneous prototypes, we introduce a Heterogeneous Relation Consistency loss, which transfers more reliable inter-class relations (i.e., inter-prototype relations) from refined prototypes to heterogeneous ones. Extensive experiments conducted on five point cloud segmentation datasets, including four indoor datasets (S3DIS, ScanNet, SceneNN, NYU Depth V2) and one outdoor dataset (Semantic3D), demonstrate the superiority and generalization capability of our method, which outperforms state-of-the-art approaches across all datasets.
{"title":"Few-Shot 3D Point Cloud Segmentation via Relation Consistency-Guided Heterogeneous Prototypes","authors":"Lili Wei;Congyan Lang;Zheming Xu;Liqian Liang;Jun Liu","doi":"10.1109/TMM.2025.3557699","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557699","url":null,"abstract":"Few-shot 3D point cloud semantic segmentation is a challenging task due to the lack of labeled point clouds (support set). To segment unlabeled query point clouds, existing prototype-based methods learn 3D prototypes from point features of the support set and then measure their distances to the query points. However, such homogeneous 3D prototypes are often of low quality because they overlook the valuable heterogeneous information buried in the support set, such as semantic labels and projected 2D depth maps. To address this issue, in this paper, we propose a novel <bold>R</b>elation <bold>C</b>onsistency-guided <bold>H</b>eterogeneous <bold>P</b>rototype learning framework (RCHP), which improves prototype quality by integrating heterogeneous information using large multi-modal models (<italic>e.g.</i> CLIP). RCHP achieves this through two core components: Heterogeneous Prototype Generation module which collaborates with 3D networks and CLIP to generate heterogeneous prototypes, and Heterogeneous Prototype Fusion module which effectively fuses heterogeneous prototypes to obtain high-quality prototypes. Furthermore, to bridge the gap between heterogeneous prototypes, we introduce a Heterogeneous Relation Consistency loss, which transfers more reliable inter-class relations (<italic>i.e.</i>, inter-prototype relations) from refined prototypes to heterogeneous ones. Extensive experiments conducted on five point cloud segmentation datasets, including four indoor datasets (S3DIS, ScanNet, SceneNN, NYU Depth V2) and one outdoor dataset (Semantic3D), demonstrate the superiority and generalization capability of our method, outperforming state-of-the-art approaches across all datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3158-3170"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robotic grasping is a crucial topic in robotics and computer vision, with broad applications in industrial production and intelligent manufacturing. Although some methods have begun addressing instance-level grasping, most remain limited to predefined instances and categories, lacking the flexibility for open-vocabulary grasp prediction based on user-specified instructions. To address this, we propose RoG-SAM, a language-driven, instance-level grasp detection framework built on the Segment Anything Model (SAM). RoG-SAM utilizes open-vocabulary prompts for object localization and grasp pose prediction, adapting SAM through transfer learning with encoder adapters and multi-head decoders to extend its segmentation capabilities to grasp pose estimation. Experimental results show that RoG-SAM achieves competitive performance on single-object datasets (Cornell and Jacquard) and cluttered datasets (GraspNet-1Billion and OCID), with instance-level accuracies of 91.2% and 90.1%, respectively, while using only 28.3% of SAM's trainable parameters. The effectiveness of RoG-SAM was also validated in real-world environments.
{"title":"RoG-SAM: A Language-Driven Framework for Instance-Level Robotic Grasping Detection","authors":"Yunpeng Mei;Jian Sun;Zhihong Peng;Fang Deng;Gang Wang;Jie Chen","doi":"10.1109/TMM.2025.3557685","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557685","url":null,"abstract":"Robotic grasping is a crucial topic in robotics and computer vision, with broad applications in industrial production and intelligent manufacturing. Although some methods have begun addressing instance-level grasping, most remain limited to predefined instances and categories, lacking flexibility for open-vocabulary grasp prediction based on user-specified instructions. To address this, we propose RoG-SAM, a language-driven, instance-level grasp detection framework built on Segment Anything Model (SAM). RoG-SAM utilizes open-vocabulary prompts for object localization and grasp pose prediction, adapting SAM through transfer learning with encoder adapters and multi-head decoders to extend its segmentation capabilities to grasp pose estimation. Experimental results show that RoG-SAM achieves competitive performance on single-object datasets (Cornell and Jacquard) and cluttered datasets (GraspNet-1Billion and OCID), with instance-level accuracies of 91.2% and 90.1%, respectively, while using only 28.3% of SAM's trainable parameters. The effectiveness of RoG-SAM was also validated in real-world environments.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3057-3068"},"PeriodicalIF":8.4,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}