Liquid: Language Models are Scalable and Unified Multi-Modal Generators
Pub Date : 2026-01-01 DOI: 10.1007/s11263-025-02639-5
Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, Xiang Bai
We present Liquid, a versatile and native auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared feature space for both vision and language. Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration using any existing large language model (LLM), eliminating the need for external pretrained visual modules such as CLIP and diffusion models. For the first time, Liquid reveals that the power-law scaling laws of unified multimodal models align with those observed in language models, and it shows that the trade-offs between visual and language tasks diminish as model size increases. Furthermore, the unified token space enables visual generation and comprehension tasks to mutually enhance each other, effectively removing the interference typically seen in earlier models. We demonstrate that existing LLMs can serve as strong foundations for Liquid, reducing training costs by 100× while surpassing Chameleon in multimodal capabilities. Compared to previous unified multimodal models, Liquid maintains language performance on par with mainstream LLMs such as Llama2, preserving its potential as a foundational model. Building on this foundation, Liquid outperforms visual generation models such as SD v2.1 and SD-XL (FID of 5.47 on MJHQ-30K), excelling in both vision-language and text-only tasks. The code and models are available at https://github.com/FoundationVision/Liquid.
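To make the shared token space concrete, here is a minimal, hedged sketch of the general idea (not the authors' released code): a VQ tokenizer maps an image to discrete codes, those codes extend the text vocabulary, and a single autoregressive transformer is trained with next-token prediction over the mixed sequence. All names, sizes, and layer choices below are illustrative assumptions.

```python
# Minimal sketch (an assumption, not the released Liquid code): the base LLM
# vocabulary is extended with discrete VQ image codes so that one autoregressive
# transformer embeds and predicts text and image tokens in a shared space.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 32000   # hypothetical base LLM vocabulary size
IMAGE_CODES = 8192   # hypothetical VQ codebook size
VOCAB = TEXT_VOCAB + IMAGE_CODES

class UnifiedARModel(nn.Module):
    def __init__(self, dim=512, layers=4, heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)      # shared text + image-code embeddings
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, VOCAB)          # next-token logits over both modalities

    def forward(self, tokens):
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.backbone(self.embed(tokens), mask=causal)
        return self.head(x)

# A mixed sequence: a text prompt followed by image codes (offset into the shared vocab).
text = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(0, IMAGE_CODES, (1, 64)) + TEXT_VOCAB
seq = torch.cat([text, image], dim=1)
logits = UnifiedARModel()(seq)                     # (1, 80, VOCAB)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
```

Because comprehension and generation share one vocabulary and one loss, the same model can be prompted with text to emit image codes or with image codes to emit text, which is the property the abstract attributes to the unified token space.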
{"title":"Liquid: Language Models are Scalable and Unified Multi-Modal Generators","authors":"Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, Xiang Bai","doi":"10.1007/s11263-025-02639-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02639-5","url":null,"abstract":"We present Liquid, a versatile and native auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared feature space for both vision and language. Unlike previous multimodal large language model (MLLM), Liquid achieves this integration using any existing large language models (LLMs), eliminating the need for external pretrained visual modules such as CLIP and diffusion models. For the first time, Liquid reveals that the power-law scaling laws of unified multimodal models align with those observed in language models, and it discovers that the trade-offs between visual and language tasks diminish as model size increases. Furthermore, the unified token space enables visual generation and comprehension tasks to mutually enhance each other, effectively removing the typical interference seen in earlier models. We demonstrate that existing LLMs can serve as strong foundations for Liquid, saving training costs by 100<italic>times</italic> while surpassing Chameleon in multimodal capabilities. Compared to previous unified multimodal models, Liquid maintains on-par language performance comparable to mainstream LLMs like Llama2, preserving its potential as a foundational model. Building on this foundation, Liquid outperforms visual generation models like SD v2.1 and SD-XL (FID of 5.47 on MJHQ-30K), excelling in both vision-language and text-only tasks. The code and models are available at <ext-link ext-link-type=\"uri\" xlink:href=\"https://github.com/FoundationVision/Liquid\">https://github.com/FoundationVision/Liquid</ext-link>.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"14 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145903757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Concept-Based Explanation for Deep Vision Models: A Comprehensive Survey on Techniques, Taxonomy, Applications, and Recent Advances
Pub Date : 2026-01-01 DOI: 10.1007/s11263-025-02647-5
Razan Alharith, Jiashu Zhang, Ashraf Osman Ibrahim, Zhenyu Wu
Concept-based explanation is an important and rapidly evolving approach to improving the interpretability and transparency of deep learning models by explaining their behaviors and predictions in terms of human-understandable concepts. However, the current literature lacks a comprehensive survey and classification of the strategies and methodologies employed to analyze these models. This paper fills that gap by introducing a new taxonomy of concept-based explanation strategies. Following a thorough review of 101 relevant studies, we developed a preliminary taxonomy that groups strategies by criteria such as data modality, level of supervision, model complexity, explanation scope, and model interpretability. We also present a comprehensive evaluation of the advantages and limitations of the various methodologies, as well as the datasets commonly used in this field, and identify promising avenues for further exploration. Our study aims to serve as a useful resource for researchers and practitioners interested in advancing concept-based explanation. Furthermore, we have built a GitHub project page that gathers key materials for concept-based explanations, accessible at https://github.com/razanalharith/Concept-Based-Explanation.
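As a rough illustration only (not taken from the paper), the five taxonomy axes named above could be used to catalogue a method as follows; all field values are hypothetical.

```python
# Minimal sketch (an assumption, not from the survey): the five taxonomy axes
# named in the abstract, used to catalogue a hypothetical explanation method.
from dataclasses import dataclass

@dataclass
class ConceptExplanationMethod:
    name: str
    data_modality: str        # e.g. "image", "video", "text"
    supervision: str          # e.g. "supervised concepts", "unsupervised discovery"
    model_complexity: str     # e.g. "post-hoc on a CNN", "inherently interpretable"
    explanation_scope: str    # e.g. "local (per prediction)", "global (per class)"
    interpretability: str     # e.g. "concept activation scores", "prototype parts"

# Hypothetical entry, for illustration only.
example = ConceptExplanationMethod(
    name="TCAV-style method",
    data_modality="image",
    supervision="supervised concepts",
    model_complexity="post-hoc on a CNN",
    explanation_scope="global (per class)",
    interpretability="concept activation scores",
)
```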
{"title":"Concept-Based Explanation for Deep Vision Models: A Comprehensive Survey on Techniques, Taxonomy, Applications, and Recent Advances","authors":"Razan Alharith, Jiashu Zhang, Ashraf Osman Ibrahim, Zhenyu Wu","doi":"10.1007/s11263-025-02647-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02647-5","url":null,"abstract":"Concept-based explanation represents an important yet rapidly evolving method aimed at enhancing the interpretability and transparency of deep learning models by clarifying their behaviors and predictions using understandable concepts. However, the current literature lacks a comprehensive survey and classification of the various strategies and methodologies employed to analyze these models. This paper aims to fill this gap by introducing a new taxonomy of concept-based explanation strategies. Following a thorough review of 101 relevant studies, a preliminary taxonomy was developed that groups strategies based on criteria such as data modality, level of supervision, model complexity, explanation scope, and model interpretability. Furthermore, we present a comprehensive evaluation of the advantages and limitations of various methodologies, as well as the datasets commonly used in this field. We also identify promising avenues for further exploration. Our study aims to serve as a useful tool for researchers and professionals interested in advancing concept-based explanation. Furthermore, we have built a GitHub project page that gathers key materials for concept-based explanations, which may be accessible through : <ext-link ext-link-type=\"uri\" xlink:href=\"https://github.com/razanalharith/Concept-Based-Explanation\">https://github.com/razanalharith/Concept-Based-Explanation</ext-link>.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"30 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145903763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FourierMIL: Fourier Filtering-based Multiple Instance Learning for Whole Slide Image Analysis
Pub Date : 2025-12-28 DOI: 10.1007/s11263-025-02679-x
Yi Zheng, Harsh Sharma, Margrit Betke, Jonathan D. Cherry, Jesse B. Mez, Jennifer E. Beane, Vijaya B. Kolachalama
{"title":"FourierMIL: Fourier Filtering-based Multiple Instance Learning for Whole Slide Image Analysis","authors":"Yi Zheng, Harsh Sharma, Margrit Betke, Jonathan D. Cherry, Jesse B. Mez, Jennifer E. Beane, Vijaya B. Kolachalama","doi":"10.1007/s11263-025-02679-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02679-x","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"29 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145847152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UIL-AQA: Uncertainty-Aware Clip-Level Interpretable Action Quality Assessment
Pub Date : 2025-12-28 DOI: 10.1007/s11263-025-02638-6
Xu Dong, Xinran Liu, Wanqing Li, Anthony Adeyemi-Ejeye, Andrew Gilbert
This work proposes UIL-AQA, a method for long-term Action Quality Assessment (AQA) designed to be clip-level interpretable and uncertainty-aware. AQA evaluates the execution quality of actions in videos. However, the complexity and diversity of actions, especially in long videos, make AQA difficult. Existing AQA methods address this by generally restricting themselves to short-term videos. These approaches lack detailed semantic interpretation for individual clips and fail to account for the impact of human biases and subjectivity in the data during model training. Moreover, although query-based Transformer networks demonstrate strong capabilities in long-term modelling, their interpretability in AQA remains insufficient. This is primarily due to a phenomenon we identified, termed Temporal Skipping, where the model skips self-attention layers to prevent output degradation. We introduce an Attention Loss function and a Query Initialization Module to enhance the modelling capability of query-based Transformer networks. Additionally, we incorporate a Gaussian Noise Injection Module to simulate biases in human scoring, mitigating the influence of uncertainty and improving model reliability. Furthermore, we propose a Difficulty-Quality Regression Module, which decomposes each clip's action score into independent difficulty and quality components, enabling a more fine-grained and interpretable evaluation. Extensive quantitative and qualitative analysis demonstrates that our method achieves state-of-the-art performance on three long-term real-world AQA datasets. Our code is available at https://github.com/dx199771/Interpretability-AQA.
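Two of the ideas above are easy to illustrate with a hedged sketch (not the released UIL-AQA code): injecting Gaussian noise into ground-truth scores to model annotator bias, and a regression head that splits each clip score into difficulty and quality factors. The module names, feature shapes, and the product-style composition of the two factors are assumptions for illustration.

```python
# Illustrative sketch (assumptions, not the UIL-AQA release): per-clip features are
# regressed into separate difficulty and quality factors, and Gaussian noise is added
# to ground-truth scores during training to mimic annotator bias and subjectivity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifficultyQualityHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.difficulty = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.quality = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, clip_feats):                    # clip_feats: (batch, clips, dim)
        d = self.difficulty(clip_feats).squeeze(-1)   # per-clip difficulty
        q = self.quality(clip_feats).squeeze(-1)      # per-clip quality
        clip_scores = d * q                           # assumed composition of the two factors
        return clip_scores.sum(dim=1), d, q           # video score = sum over clips

def noisy_targets(scores, sigma=0.05):
    # Gaussian noise injection on ground-truth scores (training only).
    return scores + sigma * torch.randn_like(scores)

head = DifficultyQualityHead()
feats = torch.randn(2, 8, 256)                        # 2 videos x 8 clips of Transformer features
pred, d, q = head(feats)
loss = F.mse_loss(pred, noisy_targets(torch.tensor([72.5, 88.0])))
```

Keeping difficulty and quality as separate outputs is what makes the per-clip scores inspectable, which is the clip-level interpretability the abstract emphasizes.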
{"title":"UIL-AQA: Uncertainty-Aware Clip-Level Interpretable Action Quality Assessment","authors":"Xu Dong, Xinran Liu, Wanqing Li, Anthony Adeyemi-Ejeye, Andrew Gilbert","doi":"10.1007/s11263-025-02638-6","DOIUrl":"https://doi.org/10.1007/s11263-025-02638-6","url":null,"abstract":"This work proposes UIL-AQA for long-term Action Quality Assessment AQA designed to be clip-level interpretable and uncertainty-aware. AQA evaluates the execution quality of actions in videos. However, the complexity and diversity of actions, especially in long videos, increase the difficulty of AQA. Existing AQA methods solve this by limiting themselves generally to short-term videos. These approaches lack detailed semantic interpretation for individual clips and fail to account for the impact of human biases and subjectivity in the data during model training. Moreover, although query-based Transformer networks demonstrate strong capabilities in long-term modelling, their interpretability in AQA remains insufficient. This is primarily due to a phenomenon we identified, termed <jats:italic>Temporal Skipping</jats:italic> , where the model skips self-attention layers to prevent output degradation. We introduce an Attention Loss function and a Query Initialization Module to enhance the modelling capability of query-based Transformer networks. Additionally, we incorporate a Gaussian Noise Injection Module to simulate biases in human scoring, mitigating the influence of uncertainty and improving model reliability. Furthermore, we propose a Difficulty-Quality Regression Module, which decomposes each clip’s action score into independent difficulty and quality components, enabling a more fine-grained and interpretable evaluation. Our extensive quantitative and qualitative analysis demonstrates that our proposed method achieves state-of-the-art performance on three long-term real-world AQA datasets. Our code is available at: <jats:ext-link xmlns:xlink=\"http://www.w3.org/1999/xlink\" xlink:href=\"https://github.com/dx199771/Interpretability-AQA\" ext-link-type=\"uri\">https://github.com/dx199771/Interpretability-AQA</jats:ext-link>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"23 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145847154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding","authors":"Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Jiahao Wang, Zhe Chen, Zhiqi Li, Tong Lu, Limin Wang","doi":"10.1007/s11263-025-02597-y","DOIUrl":"https://doi.org/10.1007/s11263-025-02597-y","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"33 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145836243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs
Pub Date : 2025-12-26 DOI: 10.1007/s11263-025-02607-z
Chongjun Tu, Peng Ye, Dongzhan Zhou, Lei Bai, Gang Yu, Tao Chen, Wanli Ouyang
{"title":"Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs","authors":"Chongjun Tu, Peng Ye, Dongzhan Zhou, Lei Bai, Gang Yu, Tao Chen, Wanli Ouyang","doi":"10.1007/s11263-025-02607-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02607-z","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"30 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145836245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}