When Visual Privacy Protection Meets Multimodal Large Language Models
Pub Date: 2026-03-07  DOI: 10.1007/s11263-026-02761-y
Xiaofei Hui, Qian Wu, Haoxuan Qu, Majid Mirmehdi, Hossein Rahmani, Jun Liu
The emergence of Multimodal Large Language Models (MLLMs) and the widespread use of MLLM cloud services such as GPT-4V have raised great concerns about privacy leakage in visual data. As these models are typically deployed as cloud services, users are required to submit their images and videos, posing serious privacy risks. However, how to tackle such privacy concerns is an under-explored problem. In this paper, we therefore conduct a new investigation into protecting visual privacy while enjoying the convenience brought by MLLM services. We address the practical case where the MLLM is a “black box”, i.e., we only have access to its input and output without knowing its internal model information. To tackle this challenging yet demanding problem, we propose a novel framework in which we carefully design the learning objective with Pareto optimality to seek a better trade-off between visual privacy and the MLLM’s performance, and we propose critical-history enhanced optimization to effectively optimize the framework with the black-box MLLM. Our experiments show that our method is effective on different benchmarks.

Iterative Global Mapping-Local Searching for Heterogeneous Change Detection with Unregistered Images
Pub Date: 2026-03-06  DOI: 10.1007/s11263-025-02719-6
Yuli Sun, Junzheng Wu, Han Zhang, Zhang Li, Lin Lei, Gangyao Kuang
{"title":"Iterative Global Mapping-Local Searching for Heterogeneous Change Detection with Unregistered Images","authors":"Yuli Sun, Junzheng Wu, Han Zhang, Zhang Li, Lin Lei, Gangyao Kuang","doi":"10.1007/s11263-025-02719-6","DOIUrl":"https://doi.org/10.1007/s11263-025-02719-6","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"68 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147368095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Object-Scene-Camera Decomposition and Recomposition for Data Efficient Monocular 3D Object Detection
Pub Date: 2026-03-06  DOI: 10.1007/s11263-026-02755-w
Zhaonian Kuang, Rui Ding, Meng Yang, Xinhu Zheng, Gang Hua
{"title":"Object-Scene-Camera Decomposition and Recomposition for Data Efficient Monocular 3D Object Detection","authors":"Zhaonian Kuang, Rui Ding, Meng Yang, Xinhu Zheng, Gang Hua","doi":"10.1007/s11263-026-02755-w","DOIUrl":"https://doi.org/10.1007/s11263-026-02755-w","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147368101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unveiling Deep Shadows: A Survey and Benchmark on Image and Video Shadow Detection, Removal, and Generation in the Deep Learning Era","authors":"Xiaowei Hu, Zhenghao Xing, Tianyu Wang, Chi-Wing Fu, Pheng-Ann Heng","doi":"10.1007/s11263-026-02744-z","DOIUrl":"https://doi.org/10.1007/s11263-026-02744-z","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"46 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147368105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Watching Swarm Dynamics from Above: A Framework for Advanced Object Tracking in Drone Videos
Pub Date: 2026-03-06  DOI: 10.1007/s11263-025-02713-y
Pia Bideau, Duc Pham, Félicie Dhellemmes, Matthew Hansen, Jens Krause
Easily accessible technologies, such as drones equipped with diverse onboard sensors, have greatly expanded opportunities to study animal behavior in natural environments. However, analyzing large volumes of unlabeled video data, often spanning hours, remains a significant challenge for machine learning, particularly in computer vision. Existing approaches typically process only a small number of frames, and accurate georeferencing of tracked positions is still largely unresolved, particularly in dynamic environments where static landmarks cannot be established. In this work, we focus on long-term tracking of animal behavior in real-world geographic coordinates. To address this challenge, we utilize classical probabilistic methods for state estimation, such as particle filtering. Particle filters offer a useful algorithmic structure for recursively incorporating new incoming information, thus ensuring temporal consistency. By incorporating recent developments in semantic object segmentation, we enable continuous tracking of rapidly evolving object formations, even in scenarios with limited data availability. We propose a novel approach for tracking schools of fish in the open ocean from drone videos. Our framework not only performs classical object tracking in image coordinates but also tracks the position and spatial extent of the fish school in geographic coordinates by fusing video data with the drone’s onboard sensor information (GPS and IMU). No landmarks with known geographic coordinates are required, making the proposed method adaptable to unstructured, dynamic environments such as the open ocean, where static landmarks are unavailable. The presented framework thus enables researchers to study the collective behavior of fish schools within their social and environmental context.
{"title":"Watching Swarm Dynamics from Above: A Framework for Advanced Object Tracking in Drone Videos","authors":"Pia Bideau, Duc Pham, Félicie Dhellemmes, Matthew Hansen, Jens Krause","doi":"10.1007/s11263-025-02713-y","DOIUrl":"https://doi.org/10.1007/s11263-025-02713-y","url":null,"abstract":"Easily accessible technologies, such as drones equipped with diverse onboard sensors, have greatly expanded opportunities to study animal behavior in natural environments. However, analyzing large volumes of unlabeled video data, often spanning hours, remains a significant challenge for machine learning, particularly in computer vision. Existing approaches typically process only a small number of frames, and accurate georeferencing of tracked positions is still largely unresolved, particularly in dynamic environments where static landmarks cannot be established. In this work, we focus on long-term tracking of animal behavior in real-world geographic coordinates. To address this challenge, we utilize classical probabilistic methods for state estimation, such as particle filtering. Particle filters offer a useful algorithmic structure for recursively adding new incoming information and thus ensuring time consistency. By incorporating recent developments in semantic object segmentation, we enable continuous tracking of rapidly evolving object formations, even in scenarios with limited data availability. We propose a novel approach for tracking schools of fish in the open ocean from drone videos. Our framework not only performs classical object tracking in image coordinates, instead it additionally tracks the position and spatial expansion of the fish school in geographic coordinates by fusing video data and the drone’s on board sensor information (GPS and IMU). No landmarks with known geographic coordinates are required, making the proposed method adaptable to unstructured, dynamic environments like the open ocean, where static landmarks are unavailable. With this, the presented framework enables researchers to study the collective behavior of fish schools within their social and environmental context.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"56 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147368106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts
Pub Date: 2026-03-06  DOI: 10.1007/s11263-026-02736-z
Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Yan Li, Chi-Min Chan, Wei Xue, Wenhan Luo, Qifeng Liu, Yike Guo
The potential for higher-resolution image generation using pretrained diffusion models is immense. However, these models often struggle with object repetition and structural artifacts, especially when scaling to 4K resolution and beyond. Our analysis reveals the cause of the problem: a single prompt used for generation at multiple scales provides insufficient guidance. To address this, we propose HiPrompt, a new tuning-free solution that tackles the above problems by introducing hierarchical prompts, which provide both global and local semantic guidance. Specifically, the global prompt captures overall scene semantics from user input, while local guidance comes from patch-wise descriptions generated by MLLMs to refine regional structures and textures. Furthermore, during inverse denoising, the noise is decomposed into low- and high-frequency components, each conditioned on a different prompt level, facilitating prompt-guided denoising under hierarchical semantic guidance. This allows the generation to focus more on local spatial regions and ensures that the generated images maintain coherent local and global semantics, structures, and textures with high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality. The demo and code can be found on the project website: https://liuxinyv.github.io/HiPrompt/
{"title":"HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts","authors":"Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Yan Li, Chi-Min Chan, Wei Xue, Wenhan Luo, Qifeng Liu, Yike Guo","doi":"10.1007/s11263-026-02736-z","DOIUrl":"https://doi.org/10.1007/s11263-026-02736-z","url":null,"abstract":"The potential for higher-resolution image generation using pretrained diffusion models is immense. However, these models often struggle with object repetition and structural artifacts especially when scaling to 4K resolution and beyond. Our analysis reveals that causes the problem, a single prompt for the generation of multiple scales provides insufficient efficacy. To address this, we propose HiPrompt, a new tuning-free solution that tackles the above problems by introducing hierarchical prompts. The hierarchical prompts provide both global and local semantic guidance. Specifically, the global prompt captures overall scene semantics from user input, while local guidance comes from patch-wise descriptions generated by MLLMs to refine regional structures and textures. Furthermore, during inverse denoising, noise is decomposed into low- and high-frequency components, each conditioned on different prompt levels, facilitating prompt-guided denoising under hierarchical semantic guidance. It further allows the generation to focus more on local spatial regions and ensures the generated images maintain coherent local and global semantics, structures, and textures with high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality. The demo and code can be found on the project website: <jats:ext-link xmlns:xlink=\"http://www.w3.org/1999/xlink\" xlink:href=\"https://liuxinyv.github.io/HiPrompt/\" ext-link-type=\"uri\">https://liuxinyv.github.io/HiPrompt/</jats:ext-link> .","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"53 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147368098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Disentangling Local and Global Semantics in Diffusion Models for Image Editing
Pub Date: 2026-03-06  DOI: 10.1007/s11263-025-02694-y
Manos Plitsis, Theodoros Kouzelis, Panagiotis Koromilas, Vassilis Katsouros, Mihalis A. Nicolaou, Yannis Panagakis
Diffusion models have achieved state-of-the-art image synthesis, yet unlike GANs, they lack a well-structured latent space for intuitive image editing. Existing diffusion-based editing methods often rely on supervised fine-tuning or text-based guidance, while recent unsupervised techniques leveraging the model’s bottleneck layer suffer from one or more key limitations: (i) they focus only on global attributes, (ii) fail to disentangle local and global semantics, or (iii) require extensive human intervention. To fill this gap, we first propose an unsupervised method for localized image editing in pre-trained unconditional diffusion models that disentangles local and global semantics in the model’s latent space. Given an input image and a user-specified region of interest, our approach uses the denoising network’s Jacobian to map that region to a corresponding latent subspace. We then separate this subspace into shared (global) and region-specific components to uncover latent directions that control local attributes. These directions generalize across images, enabling semantically consistent edits without retraining. We go one step further and extend our method to minimize manual supervision, automatically inferring edit directions from a single reference image and generating region masks without human input. Experiments on multiple datasets show that our method yields more localized, high-fidelity edits than state-of-the-art approaches.
{"title":"Disentangling Local and Global Semantics in Diffusion Models for Image Editing","authors":"Manos Plitsis, Theodoros Kouzelis, Panagiotis Koromilas, Vassilis Katsouros, Mihalis A. Nicolaou, Yannis Panagakis","doi":"10.1007/s11263-025-02694-y","DOIUrl":"https://doi.org/10.1007/s11263-025-02694-y","url":null,"abstract":"Diffusion models have achieved state-of-the-art image synthesis, yet unlike GANs, they lack a well-structured latent space for intuitive image editing. Existing diffusion-based editing methods often rely on supervised fine-tuning or text-based guidance, while recent unsupervised techniques leveraging the model’s bottleneck layer suffer from one or more key limitations: (i) they focus only on global attributes, (ii) fail to disentangle local and global semantics, or (iii) require extensive human intervention. To fill this gap, we first propose an unsupervised method for localized image editing in pre-trained unconditional diffusion models that disentangles local and global semantics in the model’s latent space. Given an input image and a user-specified region of interest, our approach uses the denoising network’s Jacobian to map that region to a corresponding latent subspace. We then separate this subspace into shared (global) and region-specific components to uncover latent directions that control local attributes. These directions generalize across images, enabling semantically consistent edits without retraining. We go one step further by extending our method to minimize manual supervision by automatically inferring edit directions from a single reference image and generating region masks without human input. Experiments on multiple datasets show that our method yields more localized, high-fidelity edits than state-of-the-art approaches.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"30 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147368100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Reconstructing a Sphere and the Camera Focal Length from a Single View by Fitting Planes
Pub Date: 2026-03-06  DOI: 10.1007/s11263-025-02697-9
Erol Ozgur, Mohammad Alkhatib, Youcef Mezouar, Adrien Bartoli
{"title":"Reconstructing a Sphere and the Camera Focal Length from a Single View by Fitting Planes","authors":"Erol Ozgur, Mohammad Alkhatib, Youcef Mezouar, Adrien Bartoli","doi":"10.1007/s11263-025-02697-9","DOIUrl":"https://doi.org/10.1007/s11263-025-02697-9","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"5 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147368102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}