Pub Date : 2025-07-04 DOI: 10.1109/TMM.2025.3586103
Ke Wang;Qi Ma;Xingcan Li;Chongqiang Shen;Rui Leng;Jianbo Lu
Traditional object detection algorithms in intelligent vehicle perception systems cannot maintain stable recognition performance in unknown and changing road environments. We find that uncertainty quantification is of great significance for detection in unknown, complex environments and helps to improve the robustness and safety of autonomous driving systems. Therefore, this paper proposes an Uncertainty-Based Transformer (UBT) object detection algorithm. First, a double Gaussian feature map network (DGF) is designed to quantify and exploit the uncertainty of the features derived from the backbone network. Second, we propose an RBF-based query filtering model (RBQF), which uses the summed uncertainty of each query vector as the criterion for query screening. We also propose an uncertainty detection head (UDH), so that the final model outputs quantified uncertainty alongside its predictions, improving detection performance and enhancing the reliability of the algorithm. To further demonstrate the detection performance of the proposed method in real driving scenes, we evaluate it on COCO, Cityscapes, FoggyCityscapes, RainCityscapes, and a self-made traffic scene dataset, showing that our algorithm applies well to large datasets and complex road scenes.
{"title":"UBTransformer: Uncertainty-Based Transformer Model for Complex Scenarios Detection in Autonomous Driving","authors":"Ke Wang;Qi Ma;Xingcan Li;Chongqiang Shen;Rui Leng;Jianbo Lu","doi":"10.1109/TMM.2025.3586103","DOIUrl":"https://doi.org/10.1109/TMM.2025.3586103","url":null,"abstract":"The traditional object detection algorithm in the intelligent vehicle perception system cannot maintain stable recognition performance in the unknown and changing road environment. We find that uncertainty quantification is of great significance in detecting unknown complex environments and helps to improve the robustness and safety of autonomous driving systems. Therefore, this paper proposes an Uncertainty-based Transformer (UBT) object detection algorithm. Firstly, the double Gaussian feature map network (DGF) is designed to quantify and utilize the uncertainty of the features derived from the backbone network. Secondly, we propose a RBF-based query filtering model(RBQF), which takes uncertainty sum as the index of query vector screening. At the same time, this paper proposes an uncertainty detection head (UDH); the final model output results are quantitative uncertainty, improved detection performance and enhanced algorithm reliability. To further prove the detection performance of the proposed method in real driving scenes, we use COCO, Cityscapes, FoggyCityscapes, RainCityscapes and self-made traffic scene datasets for verification, which shows that our algorithm is well applicable to large datasets and complex road scenes.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6581-6592"},"PeriodicalIF":9.7,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-04 DOI: 10.1109/TMM.2025.3586108
Feng Hou;Jin Yuan;Ying Yang;Yao Zhang;Yang Liu;Yang Zhang;Cheng Zhong;Zhongchao Shi;Jianping Fan;Zhiqiang He;Yong Rui
Traditional cross-domain tasks, including unsupervised domain adaptation (UDA), domain generalization (DG), and test-time adaptation (TTA), rely heavily on training models with source-domain data, whether for specific or arbitrary target domains. With the recent advance of vision-language models (VLMs), which can serve as natural source models transferable to various downstream tasks without any parameter training, we propose a novel cross-domain task that directly combines the strengths of UDA and DG, named Training-Free Adaptive Domain Generalization (TF-ADG). However, current cross-domain datasets have many limitations, such as unrealistic domains, unclear domain definitions, and the inability to support fine-grained domain decomposition; the resulting lack of accurate and fair evaluation on fine-grained realistic domains hinders the real-world application of current cross-domain models. These insights motivate us to establish a novel realistic benchmark for TF-ADG. Benefiting from the introduced hierarchical definition of domain shifts, our proposed dataset DomainVerse addresses these issues by providing about 0.5 million images from 390 realistic, hierarchical, and balanced domains, allowing decomposition across multiple domains within each image. With the help of the constructed DomainVerse and VLMs, we further propose two algorithms, Domain CLIP and Domain++ CLIP, for training-free adaptive domain generalization. Extensive and comprehensive experiments demonstrate the significance of the dataset and the effectiveness of the proposed methods.
{"title":"DomainVerse: A Benchmark Towards Real-World Distribution Shifts for Training-Free Adaptive Domain Generalization","authors":"Feng Hou;Jin Yuan;Ying Yang;Yao Zhang;Yang Liu;Yang Zhang;Cheng Zhong;Zhongchao Shi;Jianping Fan;Zhiqiang He;Yong Rui","doi":"10.1109/TMM.2025.3586108","DOIUrl":"https://doi.org/10.1109/TMM.2025.3586108","url":null,"abstract":"Traditional cross-domain tasks, including unsupervised domain adaptation (UDA), domain generalization (DG) and test-time adaptation (TTA), rely heavily on the training model by source domain data whether for specific or arbitrary target domains. With the recent advance of vision-language models (VLMs), recognized as natural source models that can be transferred to various downstream tasks without any parameter training, we propose a novel cross-domain task directly combining the strengths of both UDA and DG, named Training-Free Adaptive Domain Generalization (TF-ADG). However, current cross-domain datasets have many limitations, such as unrealistic domains, unclear domain definitions, and the inability to fine-grained domain decomposition, which hinder the real-world application of current cross-domain models due to the lack of accurate and fair evaluation of fine-grained realistic domains. These insights motivate us to establish a novel realistic benchmark for TF-ADG. Benefiting from the introduced hierarchical definition of domain shifts, our proposed dataset DomainVerse addresses these issues by providing about 0.5 million images from 390 realistic, hierarchical, and balanced domains, allowing for decomposition across multiple domains within each image. With the help of the constructed DomainVerse and VLMs, we further propose two algorithms called Domain CLIP and Domain++ CLIP for training-free adaptive domain generalization. Extensive and comprehensive experiments demonstrate the significance of the dataset and the effectiveness of the proposed methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6648-6660"},"PeriodicalIF":9.7,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-03 DOI: 10.1109/TMM.2025.3586119
Mulin Chen;Yajie Wang;Xuelong Li
The goal of image-to-music generation is to create pure music according to a given image. Unlike tasks such as text-to-image generation, there is no explicit connection between image content and musical melody. Some existing studies attempt to generate music by directly mapping image features (such as color and edges) to musical notes, which may result in melodic incoherence. Inspired by neuroscience, it is desirable to employ emotion as a bridge between these two modalities. However, the continuity and complexity of emotions make the cross-modal correlation difficult to capture. Drawing on human mechanisms of emotion perception, a Progressive Image-to-Music Generation (PIMG) framework is proposed. The framework designs a mean-teacher-based association network to guide the music generation process progressively, starting from highly correlated image-music pairs. The generation network gradually receives more challenging sample pairs, eventually capturing complex cross-modal emotional correspondences. Additionally, a contrastive learning strategy is introduced into the diffusion models to better capture the consistency between pieces of music with similar emotions. Extensive experimental results demonstrate that the proposed framework is able to generate high-quality and emotionally consistent music from images.
{"title":"PIMG: Progressive Image-to-Music Generation With Contrastive Diffusion Models","authors":"Mulin Chen;Yajie Wang;Xuelong Li","doi":"10.1109/TMM.2025.3586119","DOIUrl":"https://doi.org/10.1109/TMM.2025.3586119","url":null,"abstract":"The goal of Image-to-Music Generation is to create pure music according to the given image. Unlike existing tasks such as text-to-image generation, there is no explicit connection between image content and musical melody. Some existing studies attempt to generate music by directly mapping image features (such as color, edges, etc.) into musical notes, which may result in the melodic incoherence. Inspired by neuroscience, it is desirable to employ emotion to bridge these two modalities. However, the continuity and complexity of emotions make it difficult to capture the cross-modal correlation. Drawing from human perception mechanisms of emotions, a Progressive Image-to-Music Generation (PIMG) framework is proposed. The framework designs a mean-teacher based association network to guide the music generation process progressively, starting from highly correlated image-music pairs. The generation network receives more challenging sample pairs gradually, eventually capturing complex cross-modal emotional correspondences. Additionally, a contrastive learning strategy is introduced into the diffusion models to better capture the consistency between pieces of music with the similar emotions. Extensive experimental results demonstrate that the proposed framework is able to generate high-quality and emotionally consistent music from images.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6732-6739"},"PeriodicalIF":9.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-30 DOI: 10.1109/TMM.2025.3565984
Lei Lei;Xianxian Li
Vision-language tracking is a crucial branch of multi-modal object tracking that aims to locate an object jointly from visual information and a language description. Typically, existing vision-language trackers employ language and visual encoders to extract features from the language description and the visual input, respectively, and then use a cross-modal interaction module to produce multi-modal features for locating the target. However, they ignore the differences between the visual and language modalities. Because language descriptions lack pixel-level position information, the positional information in the multi-modal features is greatly weakened by the cross-modal interaction modules, so these trackers cannot effectively capture subtle changes in the target's position. To address this problem, we propose a multi-modal hybrid interaction vision-language tracking method (MHITrack), in which a multi-modal hybrid interaction decoder is designed to enhance the positional information of the multi-modal features. The decoder consists of a visual-language interaction module, a multi-level position interaction module, and a hybrid interaction module. The multi-level position interaction module captures fine-grained position information of the target from multi-level features, while the visual-language interaction module performs cross-modal interaction between visual and language features to obtain multi-modal features. The hybrid interaction module then integrates the multi-modal features with the target position information, enhancing their positional content so that the tracker can effectively capture subtle changes in the target's position. Through extensive experiments on four benchmark datasets, namely TNL2k, LaSOT, OTB-Lang, and LaSOText, we demonstrate that the proposed tracker achieves promising performance compared with existing state-of-the-art vision-language trackers.
{"title":"Multi-Modal Hybrid Interaction Vision-Language Tracking","authors":"Lei Lei;Xianxian Li","doi":"10.1109/TMM.2025.3565984","DOIUrl":"https://doi.org/10.1109/TMM.2025.3565984","url":null,"abstract":"Vision-language tracking is a crucial branch of multi-modal object tracking, aiming to jointly locate an object by utilizing visual information and language descriptions. Typically, existing vision-language trackers employ language and visual encoders to extract features from language descriptions and visual information, respectively. Based on these extracted visual and language features, a cross-modal interaction module is used to extract multi-modal features to locate the targets. However, they ignore the differences between visual and language modalities. Due to the lack of pixel-level position information in language descriptions, the positional information of the multi-modal features is greatly weakened by the cross-modal interaction modules. As a result, the vision-language trackers cannot effectively capture subtle changes in the target's positions. To address this problem, we propose a multi-modal hybrid interaction vision-language tracking method (named MHITrack), in which a multi-modal hybrid interaction decoder is designed to enhance the positional information of multi-modal features. The proposed multi-modal hybrid interaction decoder consists of a visual-language interaction module, a multi-level position interaction module, and a hybrid interaction module. Firstly, the multi-level position interaction module is utilized to capture fine-grained position information of the target from multi-level features. Meanwhile, the visual-language interaction module performs cross-modal interaction between visual and language features to obtain multi-modal features. Furthermore, the hybrid interaction module is employed to integrate the multi-modal features with target position information, enhancing the positional information of the multi-modal features. Finally, the proposed tracker can effectively capture subtle changes in the target's positions. Through extensive experiments on four benchmark datasets, namely TNL2k, LaSOT, OTB-Lang, and LaSOText, we demonstrate that the proposed vision-language tracker achieves promising performance compared to existing state-of-the-art vision-language trackers.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5857-5865"},"PeriodicalIF":9.7,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-30 DOI: 10.1109/TMM.2025.3565973
Ruifan Zuo;Chaoqun Zheng;Lei Zhu;Wenpeng Lu;Jiasheng Si;Weiyu Zhang
Multi-modal hashing achieves low storage costs and high retrieval speeds by using compact hash codes to represent complex and heterogeneous multi-modal data, effectively addressing the inefficiency and resource intensiveness of traditional multi-modal retrieval methods. However, balancing intraclass compactness and interclass separability remains a challenge in existing works due to coarse-grained feature limitations, simplified fusion strategies that overlook semantic complementarity, and neglect of the structural information within the multi-modal data. To address these limitations comprehensively, we propose a Proto-centric Multi-modal Hashing with Pronounced Category Differences (PMH-PCD) model. Specifically, PMH-PCD first learns modality-specific prototypes by deeply exploring within-modality class information, ensuring effective fusion of each modality's unique characteristics. Furthermore, it learns multi-modal integrated class prototypes that seamlessly incorporate semantic information across modalities to capture and represent the intricate relationships and complementary semantic content embedded within the multi-modal data. Additionally, to generate more discriminative and representative binary hash codes, PMH-PCD integrates multifaceted semantic information, encompassing both low-level pairwise relations and high-level structural patterns, thereby holistically capturing intricate data details and leveraging the underlying structures. Experimental results demonstrate that, compared with existing advanced methods, PMH-PCD achieves superior and consistent performance in multi-modal retrieval tasks.
{"title":"Compact-Yet-Separate: Proto-Centric Multi-Modal Hashing With Pronounced Category Differences for Multi-Modal Retrieval","authors":"Ruifan Zuo;Chaoqun Zheng;Lei Zhu;Wenpeng Lu;Jiasheng Si;Weiyu Zhang","doi":"10.1109/TMM.2025.3565973","DOIUrl":"https://doi.org/10.1109/TMM.2025.3565973","url":null,"abstract":"Multi-modal hashing achieves low storage costs and high retrieval speeds by using compact hash codes to represent complex and heterogeneous multi-modal data, effectively addressing the inefficiency and resource intensiveness challenges faced by the traditional multi-modal retrieval methods. However, balancing intraclass compactness and interclass separability remains a struggle in existing works due to coarse-grained feature limitations, simplified fusion strategies that overlook semantic complementarity, and neglect of the structural information within the multi-modal data. To address these limitations comprehensively, we propose a <italic>Proto-centric Multi-modal Hashing with Pronounced Category Differences</i> (PMH-PCD) model. Specifically, PMH-PCD first learns modality-specific prototypes by deeply exploring within-modality class information, ensuring effective fusion of each modality's unique characteristics. Furthermore, it learns multi-modal integrated class prototypes that seamlessly incorporate semantic information across modalities to effectively capture and represent the intricate relationships and complementary semantic content embedded within the multi-modal data. Additionally, to generate more discriminative and representative binary hash codes, PMH-PCD integrates multifaceted semantic information, encompassing both low-level pairwise relations and high-level structural patterns, holistically capturing intricate data details and leveraging underlying structures. The experimental results demonstrate that, compared with existing advanced methods, PMH-PCD achieves superior and consistent performances in multi-modal retrieval tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5843-5856"},"PeriodicalIF":9.7,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-30 DOI: 10.1109/TMM.2025.3565933
Bo Han;Lihuo He;Junjie Ke;Jinjian Wu;Xinbo Gao
Inconsistent accuracy between the classification and localization tasks is a common challenge in modern object detection. Task decoupling, which employs distinct features or labeling strategies for each task, is a widely used approach to this issue. Although it has led to noteworthy advancements, this approach is insufficient because it neglects task interdependence and lacks an explicit consistency constraint. To bridge this gap, this paper proposes the Progressive Semi-Decoupled Detector (ProSDD) to enhance both classification and localization accuracy. Specifically, a new detection head is designed that incorporates a feature suppression and enhancement mechanism (FSEM) and a bidirectional interaction module (BIM). Compared with a decoupled head, it not only filters out task-irrelevant information and enhances task-related information, but also avoids excessive decoupling at the feature level. Both FSEM and BIM are applied multiple times, forming a progressive semi-decoupled head. In addition, a novel consistency loss is proposed and integrated into the object detection loss function, ensuring harmonious performance in classification and localization. Experimental results demonstrate that the proposed ProSDD effectively alleviates the accuracy inconsistency and achieves high-quality object detection. With a pretrained ResNet-50 backbone, ProSDD achieves a remarkable 43.3 AP on the MS COCO dataset, surpassing contemporary state-of-the-art detectors by a substantial margin under equivalent configurations.
{"title":"Progressive Semi-Decoupled Detector for Accurate Object Detection","authors":"Bo Han;Lihuo He;Junjie Ke;Jinjian Wu;Xinbo Gao","doi":"10.1109/TMM.2025.3565933","DOIUrl":"https://doi.org/10.1109/TMM.2025.3565933","url":null,"abstract":"Inconsistent accuracy between classification and localization tasks is a common challenge in modern object detection. Task decoupling, which employs distinct features or labeling strategies for each task, is a widely used approach to address this issue. Although it has led to noteworthy advancements, this approach is insufficient as it neglects task interdependence and lacks an explicit consistency constraint. To bridge this gap, this paper proposes the <bold>Pro</b>gressive <bold>S</b>emi-<bold>D</b>ecoupled <bold>D</b>etector (ProSDD) to enhance both classification and localization accuracy. Specifically, a new detection head is designed that incorporates feature suppression and enhancement mechanism (FSEM) and bidirectional interaction module (BIM). Compared with the decoupled head, it not only filters out task-irrelevant information and enhances task-related information, but also avoids excessive decoupling at the feature level. Moreover, both FSEM and BIM are used multiple times, thus forming a progressive semi-decoupled head. Then, a novel consistency loss is proposed and integrated into the loss function of object detection, ensuring harmonic performance in classification and localization. Experimental results demonstrate that the proposed ProSDD effectively alleviates inconsistent accuracy and achieves high-quality object detection. Taking the pretrained ResNet-50 as the backbone, ProSDD achieves a remarkable 43.3 AP on the MS COCO dataset, surpassing contemporary state-of-the-art detectors by a substantial margin under the equivalent configurations.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5866-5878"},"PeriodicalIF":9.7,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-16 DOI: 10.1109/TMM.2025.3561661
Samuel Ortega;Tatiana N. Ageeva;Silje Kristoffersen;Karsten Heia;Heidi A. Nilsen
Fish quality and shelf life can be evaluated using various assessment methods, such as sensory analysis, biochemical tests, microbiological evaluations, and physicochemical analyses. However, these methods are invasive and time-consuming, driving interest in technologies capable of estimating shelf life through non-invasive procedures. This study investigates the potential of hyperspectral imaging as a non-invasive technology for predicting the shelf life of Atlantic cod. A storage experiment was conducted that included both gutted fish with heads (GFWH) and fillets, with sensory evaluation and biochemical measurements employed to determine shelf life. Subsequently, hyperspectral images of the fish samples were captured under industrial production conditions, and the spectral data were analyzed using different regression algorithms. The majority of the regression techniques utilized in this research successfully predicted shelf life for both fillets and GFWH, achieving a root mean square error (RMSE) lower than one day. While most regression models exhibited comparable performance in predicting the shelf life of fillets, deep learning-based models demonstrated superior performance for GFWH. These results suggest that hyperspectral imaging technology has significant potential as a non-invasive tool for estimating the shelf life of Atlantic cod, thereby enabling effective quality-based sorting, reducing food waste, and enhancing sustainability in the seafood supply chain.
{"title":"High Throughput Shelf Life Determination of Atlantic Cod (Gadus morhua L.) by Use of Hyperspectral Imaging","authors":"Samuel Ortega;Tatiana N. Ageeva;Silje Kristoffersen;Karsten Heia;Heidi A. Nilsen","doi":"10.1109/TMM.2025.3561661","DOIUrl":"https://doi.org/10.1109/TMM.2025.3561661","url":null,"abstract":"Fish quality and shelf life can be evaluated using various assessment methods, such as sensory analysis, biochemical tests, microbiological evaluations, and physicochemical analyses. However, these methods are invasive and time-consuming, driving interest in technologies capable of estimating shelf life through non-invasive procedures. This study investigates the potential of hyperspectral imaging as a non-invasive technology for predicting the shelf life of Atlantic cod. A storage experiment was conducted that included both gutted fish with heads (GFWH) and fillets, with sensory evaluation and biochemical measurements employed to determine shelf life. Subsequently, hyperspectral images of the fish samples were captured under industrial production conditions, and the spectral data were analyzed using different regression algorithms. The majority of the regression techniques utilized in this research successfully predicted shelf life for both fillets and GFWH, achieving a root mean square error (RMSE) lower than one day. While most regression models exhibited comparable performance in predicting the shelf life of fillets, deep learning-based models demonstrated superior performance for GFWH. These results suggest that hyperspectral imaging technology has significant potential as a non-invasive tool for estimating the shelf life of Atlantic cod, thereby enabling effective quality-based sorting, reducing food waste, and enhancing sustainability in the seafood supply chain.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2809-2824"},"PeriodicalIF":8.4,"publicationDate":"2025-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10966199","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144170923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-14 DOI: 10.1109/TMM.2025.3557618
Chaofan Luo;Donglin Di;Xun Yang;Yongjia Ma;Zhou Xue;Wei Chen;Xiaofei Gou;Yebin Liu
Despite significant strides in the field of 3D scene editing, current methods encounter substantial challenges, particularly in preserving 3D consistency during the multi-view editing process. To tackle this challenge, we propose a progressive 3D editing strategy that ensures multi-view consistency via a Trajectory-Anchored Scheme (TAS) with a dual-branch editing mechanism. Specifically, TAS facilitates a tightly coupled iterative process between 2D view editing and 3D updating, preventing the error accumulation introduced by the text-to-image process. Additionally, we explore the connection between optimization-based and reconstruction-based methods, offering a unified perspective for selecting superior design choices and supporting the rationale behind the designed TAS. We further present a tuning-free View-Consistent Attention Control (VCAC) module that leverages cross-view semantic and geometric references from the source branch to yield aligned views from the target branch during the editing of 2D views. To validate the effectiveness of our method, we analyze 2D examples to demonstrate the improved consistency with the VCAC module. Extensive quantitative and qualitative results in text-guided 3D scene editing clearly indicate that our method achieves superior editing quality compared with state-of-the-art 3D scene editing methods.
{"title":"TrAME: Trajectory-Anchored Multi-View Editing for Text-Guided 3D Gaussian Manipulation","authors":"Chaofan Luo;Donglin Di;Xun Yang;Yongjia Ma;Zhou Xue;Wei Chen;Xiaofei Gou;Yebin Liu","doi":"10.1109/TMM.2025.3557618","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557618","url":null,"abstract":"Despite significant strides in the field of 3D scene editing, current methods encounter substantial challenge, particularly in preserving 3D consistency during the multi-view editing process. To tackle this challenge, we propose a progressive 3D editing strategy that ensures multi-view consistency via a Trajectory-Anchored Scheme (TAS) with a dual-branch editing mechanism. Specifically, TAS facilitates a tightly coupled iterative process between 2D view editing and 3D updating, preventing error accumulation yielded from the text-to-image process. Additionally, we explore the connection between optimization-based methods and reconstruction-based methods, offering a unified perspective for selecting superior design choices, supporting the rationale behind the designed TAS. We further present a tuning-free View-Consistent Attention Control (VCAC) module that leverages cross-view semantic and geometric reference from the source branch to yield aligned views from the target branch during the editing of 2D views. To validate the effectiveness of our method, we analyze 2D examples to demonstrate the improved consistency with the VCAC module. Extensive quantitative and qualitative results in text-guided 3D scene editing clearly indicate that our method can achieve superior editing quality compared with state-of-the-art 3D scene editing methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2886-2898"},"PeriodicalIF":8.4,"publicationDate":"2025-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144171044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-11 DOI: 10.1109/TMM.2025.3557680
Zhenwei Shao;Zhou Yu;Jun Yu;Xuecheng Ouyang;Lihao Zheng;Zhenbiao Gai;Mingyang Wang;Zhenzhong Kuang;Jiajun Ding
By harnessing the capabilities of large language models (LLMs), recent large multimodal models (LMMs) have shown remarkable versatility in open-world multimodal understanding. Nevertheless, they are usually parameter-heavy and computation-intensive, which hinders their applicability in resource-constrained scenarios. To this end, several lightweight LMMs have been proposed to maximize capability under a constrained scale (e.g., 3B parameters). Despite the encouraging results achieved by these methods, most of them focus on only one or two aspects of the design space, and the key design choices that influence model capability have not yet been thoroughly investigated. In this paper, we conduct a systematic study of lightweight LMMs from the aspects of model architecture, training strategy, and training data. Based on our findings, we obtain Imp, a family of highly capable LMMs at the 2B-4B scales. Notably, our Imp-3B model steadily outperforms all existing lightweight LMMs of similar size, and even surpasses state-of-the-art LMMs at the 13B scale. With low-bit quantization and resolution reduction techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile chip with a high inference speed of about 13 tokens/s.
{"title":"Imp: Highly Capable Large Multimodal Models for Mobile Devices","authors":"Zhenwei Shao;Zhou Yu;Jun Yu;Xuecheng Ouyang;Lihao Zheng;Zhenbiao Gai;Mingyang Wang;Zhenzhong Kuang;Jiajun Ding","doi":"10.1109/TMM.2025.3557680","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557680","url":null,"abstract":"By harnessing the capabilities of large language models (LLMs), recent large multimodal models (LMMs) have shown remarkable versatility in open-world multimodal understanding. Nevertheless, they are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios. To this end, several lightweight LMMs have been proposed successively to maximize the capabilities under constrained scale (e.g., 3B). Despite the encouraging results achieved by these methods, most of them only focus on one or two aspects of the design space, and the key design choices that influence model capability have not yet been thoroughly investigated. In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data. Based on our findings, we obtain Imp—a family of highly capable LMMs at the 2B<inline-formula><tex-math>$sim$</tex-math></inline-formula>4B scales. Notably, our Imp-3B model steadily outperforms all the existing lightweight LMMs of similar size, and even surpasses the state-of-the-art LMMs at the 13B scale. With low-bit quantization and resolution reduction techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile chip with a high inference speed of about 13 tokens/s.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2961-2974"},"PeriodicalIF":8.4,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-10 DOI: 10.1109/TMM.2025.3557720
Zihan Huang;Tao Wu;Wang Lin;Shengyu Zhang;Jingyuan Chen;Fei Wu
With the rapid advancement of large language models, there has been growing interest in their mathematical reasoning capabilities. However, existing research has primarily focused on text-based algebra problems, neglecting geometry due to the lack of high-quality geometric datasets. To address this gap, this paper introduces AutoGeo, a novel approach for automatically generating mathematical geometric images to meet the demand for large-scale and diverse geometric datasets. AutoGeo facilitates the creation of AutoGeo-100k, an extensive repository comprising 100k high-quality geometry image-text pairs. By leveraging precisely defined geometric clauses, AutoGeo-100k contains a wide variety of geometric shapes, including lines, polygons, and circles, as well as complex spatial relationships. Furthermore, this paper demonstrates the efficacy of AutoGeo-100k in enhancing the performance of multimodal large language models through fine-tuning. Experimental results indicate significant improvements in the models' ability to handle geometric images, as evidenced by enhanced accuracy in tasks such as geometric captioning and mathematical reasoning. This research not only fills a critical gap in the availability of geometric datasets but also paves the way for the advancement of sophisticated AI-driven tools in education and research.
{"title":"AutoGeo: Automating Geometric Image Dataset Creation for Enhanced Geometry Understanding","authors":"Zihan Huang;Tao Wu;Wang Lin;Shengyu Zhang;Jingyuan Chen;Fei Wu","doi":"10.1109/TMM.2025.3557720","DOIUrl":"https://doi.org/10.1109/TMM.2025.3557720","url":null,"abstract":"With the rapid advancement of large language models, there has been a growing interest in their capabilities in mathematical reasoning. However, existing research has primarily focused on text-based algebra problems, neglecting the study of geometry due to the lack of high-quality geometric datasets. To address this gap, this paper introduces AutoGeo, a novel approach for automatically generating mathematical geometric images to fulfill the demand for large-scale and diverse geometric datasets. AutoGeo facilitates the creation of AutoGeo-100 k, an extensive repository comprising 100 k high-quality geometry image-text pairs. By leveraging precisely defined geometric clauses, AutoGeo-100 k contains a wide variety of geometric shapes, including lines, polygons, circles, and complex spatial relationships, etc. Furthermore, this paper demonstrates the efficacy of AutoGeo-100 k in enhancing the performance of multimodal large language models through fine-tuning. Experimental results indicate significant improvements in the model's ability in handling geometric images, as evidenced by enhanced accuracy in tasks such as geometric captioning and mathematical reasoning. This research not only fills a critical gap in the availability of geometric datasets but also paves the way for the advancement of sophisticated AI-driven tools in education and research.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3105-3116"},"PeriodicalIF":8.4,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}