Shiming Ge, Shengwei Zhao, Chenyu Li, Yu Zhang, Jia Li
Face recognition in the wild is now advancing towards light-weight models, fast inference speed and resolution-adapted capability. In this paper, we propose a bridge distillation approach to turn a complex face model pretrained on private high-resolution faces into a light-weight one for low-resolution face recognition. In our approach, such a cross-dataset resolution-adapted knowledge transfer problem is solved via two-step distillation. In the first step, we conduct cross-dataset distillation to transfer the prior knowledge from private high-resolution faces to public high-resolution faces and generate compact and discriminative features. In the second step, the resolution-adapted distillation is conducted to further transfer the prior knowledge to synthetic low-resolution faces via multi-task learning. By learning low-resolution face representations and mimicking the adapted high-resolution knowledge, a light-weight student model can be constructed with high efficiency and promising accuracy in recognizing low-resolution faces. Experimental results show that the student model performs impressively in recognizing low-resolution faces with only 0.21M parameters and 0.057MB memory. Meanwhile, its speed reaches up to 14,705, ~934 and 763 faces per second on GPU, CPU and mobile phone, respectively.
{"title":"Efficient Low-Resolution Face Recognition via Bridge Distillation","authors":"Shiming Ge, Shengwei Zhao, Chenyu Li, Yu Zhang, Jia Li","doi":"arxiv-2409.11786","DOIUrl":"https://doi.org/arxiv-2409.11786","journal":{"name":"arXiv - CS - Multimedia","volume":"51 1","pages":""},"publicationDate":"2024-09-18","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142254298"}
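The second distillation step described above trains the student by multi-task learning: it learns low-resolution face representations while mimicking the adapted high-resolution knowledge. A minimal sketch of such a combined objective follows. The abstract does not state the paper's loss functions, so the cross-entropy term, the squared-error mimicking term, and the trade-off weight `alpha` are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Identity-classification loss for one sample with an integer label."""
    return -math.log(softmax(logits)[label])


def mimic_loss(student_feat, teacher_feat):
    """Mean squared error between student and teacher feature vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(diffs) / len(diffs)


def multitask_loss(student_logits, label, student_feat, teacher_feat, alpha=0.5):
    """Weighted sum of the classification and feature-mimicking terms.

    alpha is a hypothetical trade-off weight, not a value from the paper.
    """
    ce = cross_entropy(student_logits, label)
    mimic = mimic_loss(student_feat, teacher_feat)
    return (1 - alpha) * ce + alpha * mimic
```

In this sketch the student would see a synthetic low-resolution face while the teacher features come from the corresponding high-resolution face, so the mimicking term pulls the two representations together.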
Kalakonda Sai Shashank, Shubh Maheshwari, Ravi Kiran Sarvadevabhatla
We introduce MoRAG, a novel multi-part fusion based retrieval-augmented generation strategy for text-based human motion generation. The method enhances motion diffusion models by leveraging additional knowledge obtained through an improved motion retrieval process. By effectively prompting large language models (LLMs), we address spelling errors and rephrasing issues in motion retrieval. Our approach utilizes a multi-part retrieval strategy to improve the generalizability of motion retrieval across the language space. We create diverse samples through the spatial composition of the retrieved motions. Furthermore, by utilizing low-level, part-specific motion information, we can construct motion samples for unseen text descriptions. Our experiments demonstrate that our framework can serve as a plug-and-play module, improving the performance of motion diffusion models. Code, pretrained models and sample videos will be made available at: https://motion-rag.github.io/
{"title":"MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion","authors":"Kalakonda Sai Shashank, Shubh Maheshwari, Ravi Kiran Sarvadevabhatla","doi":"arxiv-2409.12140","DOIUrl":"https://doi.org/arxiv-2409.12140","journal":{"name":"arXiv - CS - Multimedia","volume":"10 1","pages":""},"publicationDate":"2024-09-18","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142254296"}
Qiuhong Shen, Xingyi Yang, Michael Bi Mi, Xinchao Wang
We embark on the age-old quest: unveiling the hidden dimensions of objects from mere glimpses of their visible parts. To address this, we present Vista3D, a framework that realizes swift and consistent 3D generation within a mere 5 minutes. At the heart of Vista3D lies a two-phase approach: the coarse phase and the fine phase. In the coarse phase, we rapidly generate initial geometry with Gaussian Splatting from a single image. In the fine phase, we extract a Signed Distance Function (SDF) directly from learned Gaussian Splatting, optimizing it with a differentiable isosurface representation. Furthermore, it elevates the quality of generation by using a disentangled representation with two independent implicit functions to capture both visible and obscured aspects of objects. Additionally, it harmonizes gradients from 2D diffusion prior with 3D-aware diffusion priors by angular diffusion prior composition. Through extensive evaluation, we demonstrate that Vista3D effectively sustains a balance between the consistency and diversity of the generated 3D objects. Demos and code will be available at https://github.com/florinshen/Vista3D.
{"title":"Vista3D: Unravel the 3D Darkside of a Single Image","authors":"Qiuhong Shen, Xingyi Yang, Michael Bi Mi, Xinchao Wang","doi":"arxiv-2409.12193","DOIUrl":"https://doi.org/arxiv-2409.12193","journal":{"name":"arXiv - CS - Multimedia","volume":"30 1","pages":""},"publicationDate":"2024-09-18","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142254295"}
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: https://nvlm-project.github.io/.
{"title":"NVLM: Open Frontier-Class Multimodal LLMs","authors":"Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping","doi":"arxiv-2409.11402","DOIUrl":"https://doi.org/arxiv-2409.11402","journal":{"name":"arXiv - CS - Multimedia","volume":"5 1","pages":""},"publicationDate":"2024-09-17","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142254301"}
Most recent few-shot learning approaches are based on meta-learning with episodic training. However, prior studies encounter two crucial problems: (1) the presence of inductive bias, and (2) the occurrence of catastrophic forgetting. In this paper, we propose a novel Multi-Level Contrastive Constraints (MLCC) framework that jointly integrates within-episode learning and across-episode learning into a unified interactive learning paradigm to solve these issues. Specifically, we employ a space-aware interaction modeling scheme to explore the correct inductive paradigms for each class between within-episode similarity/dissimilarity distributions. Additionally, with the aim of better utilizing prior knowledge, a cross-stage distribution adaption strategy is designed to align the across-episode distributions from different time stages, thus reducing the semantic gap between current and past prediction distributions. Extensive experiments on multiple few-shot datasets demonstrate the consistent superiority of the MLCC approach over existing state-of-the-art baselines.
{"title":"Enhancing Few-Shot Classification without Forgetting through Multi-Level Contrastive Constraints","authors":"Bingzhi Chen, Haoming Zhou, Yishu Liu, Biqing Zeng, Jiahui Pan, Guangming Lu","doi":"arxiv-2409.11286","DOIUrl":"https://doi.org/arxiv-2409.11286","journal":{"name":"arXiv - CS - Multimedia","volume":"201 1","pages":""},"publicationDate":"2024-09-17","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142254299"}
Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang
The rapid advancement of Multimodal Large Language Models (MLLMs) has led to remarkable performance across various domains. However, this progress is accompanied by a substantial surge in the resource consumption of these models. We address this pressing issue by introducing a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance. Inspired by human attention patterns in Visual Question Answering (VQA) tasks, TRIM presents a fresh perspective on the selection and reduction of image tokens. The TRIM method has been extensively tested across 12 datasets, and the results demonstrate a significant reduction in computational overhead while maintaining a consistent level of performance. This research marks a critical stride in efficient MLLM development, promoting greater accessibility and sustainability of high-performing models.
{"title":"Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs","authors":"Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang","doi":"arxiv-2409.10994","DOIUrl":"https://doi.org/arxiv-2409.10994","journal":{"name":"arXiv - CS - Multimedia","volume":"54 1","pages":""},"publicationDate":"2024-09-17","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142269113"}
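The TRIM abstract describes selecting and reducing image tokens with a CLIP metric but does not spell out the procedure. A minimal sketch of one plausible reading follows, assuming each image token is scored by its cosine similarity to a pooled text embedding (the similarity CLIP uses to compare images and text) and only the top-scoring tokens are kept. The function names and the fixed `keep` count are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_tokens(image_tokens, text_embedding, keep):
    """Return indices of the `keep` image tokens most similar to the text.

    Indices are returned in their original order so that positional
    structure is preserved for the language model (an assumption).
    """
    scores = [cosine(tok, text_embedding) for tok in image_tokens]
    ranked = sorted(range(len(image_tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

For example, with four two-dimensional toy tokens and a question embedding pointing along the first axis, `select_tokens([[1, 0], [0, 1], [1, 1], [-1, 0]], [1, 0], keep=2)` keeps the first and third tokens.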
Sina Malakouti, Aysan Aghazadeh, Ashmit Khandelwal, Adriana Kovashka
Vision language models (VLMs) have shown strong zero-shot generalization across various tasks, especially when integrated with large language models (LLMs). However, their ability to comprehend rhetorical and persuasive visual media, such as advertisements, remains understudied. Ads often employ atypical imagery, using surprising object juxtapositions to convey shared properties. For example, Fig. 1 (e) shows a beer with a feather-like texture. This requires advanced reasoning to deduce that this atypical representation signifies the beer's lightness. We introduce three novel tasks, Multi-label Atypicality Classification, Atypicality Statement Retrieval, and Atypical Object Recognition, to benchmark VLMs' understanding of atypicality in persuasive images. We evaluate how well VLMs use atypicality to infer an ad's message and test their reasoning abilities by employing semantically challenging negatives. Finally, we pioneer atypicality-aware verbalization by extracting comprehensive image descriptions sensitive to atypical elements. Our findings reveal that: (1) VLMs lack advanced reasoning capabilities compared to LLMs; (2) simple, effective strategies can extract atypicality-aware information, leading to comprehensive image verbalization; (3) atypicality aids persuasive advertisement understanding. Code and data will be made available.
{"title":"Benchmarking VLMs' Reasoning About Persuasive Atypical Images","authors":"Sina Malakouti, Aysan Aghazadeh, Ashmit Khandelwal, Adriana Kovashka","doi":"arxiv-2409.10719","DOIUrl":"https://doi.org/arxiv-2409.10719","journal":{"name":"arXiv - CS - Multimedia","volume":"36 1","pages":""},"publicationDate":"2024-09-16","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142269109"}
Recent Multimodal Large Language Models (MLLMs) often use a large number of image tokens to compensate for their visual shortcomings, which not only introduces obvious redundancy but also greatly increases the already high computation cost. Token pruning is an effective solution for speeding up MLLMs, but when and how to drop tokens remains a challenge. In this paper, we propose a novel, training-free approach for effective visual token pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for MLLMs according to a pre-defined budget. Specifically, FitPrune treats token pruning as a statistical problem of the MLLM, and its objective is to find an optimal pruning scheme that minimizes the divergence between the attention distributions before and after pruning. In practice, FitPrune can be quickly accomplished based on the attention statistics from a small batch of inference data, avoiding expensive trials of MLLMs. According to the pruning recipe, an MLLM can directly remove the redundant visual tokens of different examples during inference. To validate FitPrune, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that FitPrune can greatly reduce computational complexity while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT with only a 0.5% accuracy drop. Notably, the pruning recipe can be obtained in about 5 minutes. Our code is available at https://github.com/ywh187/FitPrune.
{"title":"Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models","authors":"Weihao Ye, Qiong Wu, Wenhao Lin, Yiyi Zhou","doi":"arxiv-2409.10197","DOIUrl":"https://doi.org/arxiv-2409.10197","journal":{"name":"arXiv - CS - Multimedia","volume":"1 1","pages":""},"publicationDate":"2024-09-16","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142254049"}
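The FitPrune objective described above, minimizing the divergence between attention distributions before and after pruning under a fixed budget, can be illustrated with a toy case. The sketch below is not the authors' algorithm: it assumes a single attention distribution p over visual tokens (averaged over a small batch), a budget given as a token count, and KL divergence as the measure. If q is p restricted to the kept tokens and renormalized, then KL(q‖p) = -log S, where S is the attention mass that was kept, so under these assumptions the optimum is simply to keep the highest-attention tokens.

```python
import math


def mean_attention(batch):
    """Average per-token attention over a batch, renormalized to sum to 1.

    Assumes every row has the same length and the total mass is nonzero.
    """
    n = len(batch[0])
    avg = [sum(row[i] for row in batch) / len(batch) for i in range(n)]
    total = sum(avg)
    return [a / total for a in avg]


def prune_to_budget(attention, budget):
    """Keep the `budget` tokens carrying the most attention mass.

    Returns (kept_indices, divergence), where divergence is KL(q || p)
    between the renormalized kept distribution q and the original p,
    which equals -log(kept mass).
    """
    if not 1 <= budget <= len(attention):
        raise ValueError("budget must be between 1 and the number of tokens")
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    kept = sorted(order[:budget])
    kept_mass = sum(attention[i] for i in kept)
    return kept, -math.log(kept_mass)
```

Sweeping `budget` and recording the returned divergence gives a small table of cost against distortion, which is one way to think about what the abstract calls a pruning recipe.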
With the widespread use of mobile devices and the rapid growth of micro-video platforms such as TikTok and Kwai, the demand for personalized micro-video recommendation systems has significantly increased. Micro-videos typically contain diverse information, such as textual metadata, visual cues (e.g., cover images), and dynamic video content, significantly affecting user interaction and engagement patterns. However, most existing approaches often suffer from the problem of over-smoothing, which limits their ability to capture comprehensive interaction information effectively. Additionally, cold-start scenarios present ongoing challenges due to sparse interaction data and the underutilization of available interaction signals. To address these issues, we propose a Multi-view Hypergraph-based Contrastive learning model for cold-start micro-video Recommendation (MHCR). MHCR introduces a multi-view multimodal feature extraction layer to capture interaction signals from various perspectives and incorporates multi-view self-supervised learning tasks to provide additional supervisory signals. Through extensive experiments on two real-world datasets, we show that MHCR significantly outperforms existing video recommendation models and effectively mitigates cold-start challenges. Our code is available at https://anonymous.4open.science/r/MHCR-02EF.
{"title":"Multi-view Hypergraph-based Contrastive Learning Model for Cold-Start Micro-video Recommendation","authors":"Sisuo Lyu, Xiuze Zhou, Xuming Hu","doi":"arxiv-2409.09638","DOIUrl":"https://doi.org/arxiv-2409.09638","journal":{"name":"arXiv - CS - Multimedia","volume":"14 1","pages":""},"publicationDate":"2024-09-15","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142254300"}
Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu
Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, audio deepfakes, poses a significant threat to both society and individuals. Existing countermeasures largely focus on determining the genuineness of speech from complete original audio recordings, which often contain private content. This oversight may keep deepfake detection out of many applications, particularly in scenarios involving sensitive information such as business secrets. In this paper, we propose SafeEar, a novel framework that aims to detect deepfake audio without accessing the speech content within. Our key idea is to develop a neural audio codec into a novel decoupling model that cleanly separates the semantic and acoustic information in audio samples, and to use only the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content is exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real-world codec augmentation. Extensive experiments conducted on four benchmark datasets demonstrate SafeEar's effectiveness in detecting various deepfake techniques, with an equal error rate (EER) down to 2.02%. At the same time, it shields speech content in five languages from being deciphered by both machine and human auditory analysis, as demonstrated by word error rates (WERs) all above 93.93% and by our user study. Furthermore, our benchmark constructed for anti-deepfake and anti-content recovery evaluation provides a basis for future research on audio privacy preservation and deepfake detection.
{"title":"SafeEar: Content Privacy-Preserving Audio Deepfake Detection","authors":"Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu","doi":"arxiv-2409.09272","DOIUrl":"https://doi.org/arxiv-2409.09272","url":null,"abstract":"Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, audio deepfake poses a significant threat to both society and individuals. Existing countermeasures largely focus on determining the genuineness of speech based on complete original audio recordings, which however often contain private content. This oversight may refrain deepfake detection from many applications, particularly in scenarios involving sensitive information like business secrets. In this paper, we propose SafeEar, a novel framework that aims to detect deepfake audios without relying on accessing the speech content within. Our key idea is to devise a neural audio codec into a novel decoupling model that well separates the semantic and acoustic information from audio samples, and only use the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content will be exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real-world codec augmentation. Extensive experiments conducted on four benchmark datasets demonstrate SafeEar's effectiveness in detecting various deepfake techniques with an equal error rate (EER) down to 2.02%. Simultaneously, it shields five-language speech content from being deciphered by both machine and human auditory analysis, demonstrated by word error rates (WERs) all above 93.93% and our user study. Furthermore, our benchmark constructed for anti-deepfake and anti-content recovery evaluation helps provide a basis for future research in the realms of audio privacy preservation and deepfake detection.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"63 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142254052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
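The SafeEar abstract reports detection quality as an equal error rate (EER), the operating point at which the false acceptance rate (fakes accepted as genuine) equals the false rejection rate (genuine audio rejected). As a rough illustration of how such a figure is obtained from detector scores, here is a minimal sketch. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that higher scores mean "more likely genuine", and the toy scores are all assumptions made for this example.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a threshold over the observed scores.

    scores: list of floats; higher means "more likely genuine" (assumption).
    labels: list of ints; 1 = genuine (bona fide) audio, 0 = deepfake.
    """
    n_fake = labels.count(0)
    n_real = labels.count(1)
    if n_fake == 0 or n_real == 0:
        raise ValueError("need at least one genuine and one fake sample")

    best_gap, eer = None, None
    for t in sorted(set(scores)):
        # A sample is accepted as genuine when its score is >= the threshold.
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        far = false_accepts / n_fake  # false acceptance rate
        frr = false_rejects / n_real  # false rejection rate
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy data (hypothetical): one fake scores above one genuine sample.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

On the toy data, the threshold 0.6 accepts one of four fakes and rejects one of four genuine samples, so both error rates are 0.25. With finite data the two rates rarely cross exactly, which is why the sketch averages them at the closest point; evaluation toolkits typically interpolate along the ROC curve instead.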