MLLM-TA: Leveraging Multimodal Large Language Models for Precise Temporal Video Grounding
Pub Date : 2024-12-04 DOI: 10.1109/LSP.2024.3511426
Yi Liu;Haowen Hou;Fei Ma;Shiguang Ni;Fei Richard Yu
In untrimmed video tasks, identifying temporal boundaries in videos is crucial for temporal video grounding. With the emergence of multimodal large language models (MLLMs), recent studies have focused on endowing these models with the capability of temporal perception in untrimmed videos. To address this challenge, in this paper we introduce a multimodal large language model named MLLM-TA with precise temporal perception for obtaining temporal attention. Unlike traditional MLLMs, which answer temporal questions with only one or two temporally related words, we leverage the descriptive proficiency of MLLMs to acquire video temporal attention through description. Specifically, we design dual temporal-aware generative branches aimed at the visual space of the entire video and the textual space of global descriptions, simultaneously generating mutually supervised, consistent temporal attention, thereby enhancing the video temporal perception capabilities of MLLMs. Finally, we evaluate our approach on both the video grounding and highlight detection tasks on three popular benchmarks: Charades-STA, ActivityNet Captions, and QVHighlights. The extensive results show that our MLLM-TA significantly outperforms previous approaches in both zero-shot and supervised settings, achieving state-of-the-art performance.
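The abstract describes two generative branches that produce "mutually supervised, consistent temporal attention" but gives no implementation details. As a rough illustration of what such mutual supervision could look like, here is a minimal PyTorch sketch of a symmetric consistency loss between two temporal-attention distributions; the function name, tensor shapes, and the KL formulation are assumptions for illustration, not details from the paper.

```python
# Illustrative sketch only: the dual-branch design is described at a high
# level in the abstract, so the shapes and the symmetric-KL choice below
# are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def consistency_loss(attn_visual: torch.Tensor, attn_textual: torch.Tensor) -> torch.Tensor:
    """Mutual supervision between two temporal-attention score vectors.

    attn_visual, attn_textual: (batch, num_frames) unnormalized scores from
    the visual-space and textual-space branches, respectively.
    """
    p = F.log_softmax(attn_visual, dim=-1)
    q = F.log_softmax(attn_textual, dim=-1)
    # Symmetric KL: each branch supervises the other, pushing the two
    # temporal-attention maps toward agreement.
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                  + F.kl_div(q, p, log_target=True, reduction="batchmean"))

# Example: a batch of 2 videos with 64 sampled frames each.
loss = consistency_loss(torch.randn(2, 64), torch.randn(2, 64))
```

A symmetric divergence is one natural way to make the supervision mutual rather than one-directional, since neither branch is treated as the fixed teacher.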
{"title":"MLLM-TA: Leveraging Multimodal Large Language Models for Precise Temporal Video Grounding","authors":"Yi Liu;Haowen Hou;Fei Ma;Shiguang Ni;Fei Richard Yu","doi":"10.1109/LSP.2024.3511426","DOIUrl":"https://doi.org/10.1109/LSP.2024.3511426","url":null,"abstract":"In untrimmed video tasks, identifying temporal boundaries in videos is crucial for temporal video grounding. With the emergence of multimodal large language models (MLLMs), recent studies have focused on endowing these models with the capability of temporal perception in untrimmed videos. To address the challenge, in this paper, we introduce a multimodal large language model named MLLM-TA with precise temporal perception to obtain temporal attention. Unlike the traditional MLLMs, answering temporal questions through one or two words related to temporal information, we leverage the text description proficiency of MLLMs to acquire video temporal attention with description. Specifically, we design a dual temporal-aware generative branches aimed at the visual space of the entire video and the textual space of global descriptions, simultaneously generating mutually supervised consistent temporal attention, thereby enhancing the video temporal perception capabilities of MLLMs. Finally, we evaluate our approach on both video grounding task and highlight detection task on three popular benchmarks, including Charades-STA, ActivityNet Captions and QVHighlights. The extensive results show that our MLLM-TA significantly outperforms previous approaches both on zero-shot and supervised setting, achieving state-of-the-art performance.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"281-285"},"PeriodicalIF":3.2,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beyond Diagonal RIS: Key to Next-Generation Integrated Sensing and Communications?
Pub Date : 2024-12-04 DOI: 10.1109/LSP.2024.3511395
Tara Esmaeilbeig;Kumar Vijay Mishra;Mojtaba Soltanalian
Reconfigurable intelligent surfaces (RIS) offer unprecedented flexibility for smart wireless channels. Recent research shows that RIS platforms enhance signal quality, coverage, and link capacity in integrated sensing and communication (ISAC) systems. This paper explores the use of fully-connected beyond diagonal RIS (BD-RIS) in ISAC. BD-RIS provides additional degrees of freedom by allowing non-zero off-diagonal elements in the scattering matrix, enhancing functionality and performance. We aim to maximize the weighted sum of the signal-to-noise ratio (SNR) at both the radar receiver and communication users using BD-RIS. Numerical results demonstrate the advantages of BD-RIS in ISAC, significantly improving SNR for both radar and communication users.
{"title":"Beyond Diagonal RIS: Key to Next-Generation Integrated Sensing and Communications?","authors":"Tara Esmaeilbeig;Kumar Vijay Mishra;Mojtaba Soltanalian","doi":"10.1109/LSP.2024.3511395","DOIUrl":"https://doi.org/10.1109/LSP.2024.3511395","url":null,"abstract":"Reconfigurable intelligent surfaces (RIS) offer unprecedented flexibility for smart wireless channels. Recent research shows that RIS platforms enhance signal quality, coverage, and link capacity in integrated sensing and communication (ISAC) systems. This paper explores the use of fully-connected beyond diagonal RIS (BD-RIS) in ISAC. BD-RIS provides additional degrees of freedom by allowing non-zero off-diagonal elements in the scattering matrix, enhancing functionality and performance. We aim to maximize the weighted sum of the signal-to-noise ratio (SNR) at both the radar receiver and communication users using BD-RIS. Numerical results demonstrate the advantages of BD-RIS in ISAC, significantly improving SNR for both radar and communication users.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"216-220"},"PeriodicalIF":3.2,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10777522","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142844408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The explosive growth of generative AI has saturated the internet with AI-generated images, raising security concerns and increasing the need for reliable detection methods. The primary requirement for such detection is generalizability, typically achieved by training on numerous fake images from various models. However, practical limitations, such as closed-source models and restricted access, often result in limited training samples. Therefore, training a general detector with few-shot samples is essential for modern detection mechanisms. To address this challenge, we propose FAMSeC, a general AI-generated image detection method based on LoRA-based F
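The abstract is truncated here, so FAMSeC's actual module design is unknown; it does, however, name LoRA as a building block. For background only, a minimal generic LoRA adapter (Hu et al., 2021) looks like the following; none of the names or hyperparameters below reflect FAMSeC itself.

```python
# Background sketch of a generic LoRA adapter; the abstract above is cut off,
# so nothing here reflects FAMSeC's actual module design.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the low-rank factors are trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# Few-shot fine-tuning then updates only A and B, which suits settings with
# limited fake-image training samples, as described in the abstract.
layer = LoRALinear(nn.Linear(768, 768), r=4)
```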