
Latest Publications in IEEE Transactions on Multimedia

Phase-shifted tACS can modulate cortical alpha waves in human subjects.
IF 3.1 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-08-01 | Epub Date: 2023-08-29 | DOI: 10.1007/s11571-023-09997-1
Alexandre Aksenov, Malo Renaud-D'Ambra, Vitaly Volpert, Anne Beuter

In the present study, we investigated traveling waves induced by transcranial alternating current stimulation in the alpha frequency band of healthy subjects. Electroencephalographic data were recorded in 12 healthy subjects before, during, and after phase-shifted stimulation with a device combining both electroencephalographic and stimulation capabilities. In addition, we analyzed the results of numerical simulations and compared them with the results of an identical analysis of real EEG data. The numerical simulations indicate that the imposed transcranial alternating current stimulation induces a rotating electric field. The wave direction induced by stimulation was observed more often for at least 30 s after the end of stimulation, demonstrating aftereffects of the stimulation. The results suggest that the proposed approach could be used to modulate the interaction between distant areas of the cortex. Non-invasive transcranial alternating current stimulation can thus be used to facilitate the propagation of circulating waves at a particular frequency and in a controlled direction. These results open new opportunities for developing innovative and personalized transcranial alternating current stimulation protocols to treat various neurological disorders.

Supplementary information: The online version contains supplementary material available at 10.1007/s11571-023-09997-1.
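As a rough illustration of what phase-shifted stimulation means in practice, the sketch below generates alpha-band sinusoids with evenly spaced phase offsets across a small electrode array, so the waveform peak sweeps around the array like a traveling pattern. The sampling rate, frequency, duration, and electrode count are assumed values for illustration, not parameters from the study.

```python
import numpy as np

# Assumed parameters for illustration only; the study's stimulation settings are not given here.
fs = 1000            # sampling rate, Hz
f_alpha = 10.0       # stimulation frequency in the alpha band, Hz
duration = 2.0       # seconds
n_electrodes = 5     # electrodes arranged around the target cortical region

t = np.arange(0, duration, 1.0 / fs)

# Evenly spaced phase offsets so the sinusoid's peak sweeps across the array,
# approximating a traveling (rotating) stimulation pattern.
phases = np.linspace(0.0, 2.0 * np.pi, n_electrodes, endpoint=False)
currents = np.stack([np.sin(2.0 * np.pi * f_alpha * t + p) for p in phases])

print(currents.shape)  # (5, 2000): one phase-shifted waveform per electrode
```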

{"title":"Phase-shifted tACS can modulate cortical alpha waves in human subjects.","authors":"Alexandre Aksenov, Malo Renaud-D'Ambra, Vitaly Volpert, Anne Beuter","doi":"10.1007/s11571-023-09997-1","DOIUrl":"10.1007/s11571-023-09997-1","url":null,"abstract":"<p><p>In the present study, we investigated traveling waves induced by transcranial alternating current stimulation in the alpha frequency band of healthy subjects. Electroencephalographic data were recorded in 12 healthy subjects before, during, and after phase-shifted stimulation with a device combining both electroencephalographic and stimulation capacities. In addition, we analyzed the results of numerical simulations and compared them to the results of identical analysis on real EEG data. The results of numerical simulations indicate that imposed transcranial alternating current stimulation induces a rotating electric field. The direction of waves induced by stimulation was observed more often during at least 30 s after the end of stimulation, demonstrating the presence of aftereffects of the stimulation. Results suggest that the proposed approach could be used to modulate the interaction between distant areas of the cortex. Non-invasive transcranial alternating current stimulation can be used to facilitate the propagation of circulating waves at a particular frequency and in a controlled direction. The results presented open new opportunities for developing innovative and personalized transcranial alternating current stimulation protocols to treat various neurological disorders.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1007/s11571-023-09997-1.</p>","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"24 1","pages":"1575-1592"},"PeriodicalIF":3.1,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11297852/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"52867081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Guest Editorial Introduction to the Issue on Pre-Trained Models for Multi-Modality Understanding
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-31 | DOI: 10.1109/TMM.2024.3384680
Wengang Zhou;Jiajun Deng;Niculae Sebe;Qi Tian;Alan L. Yuille;Concetto Spampinato;Zakia Hammal
In the ever-evolving domain of multimedia, the significance of multi-modality understanding cannot be overstated. As multimedia content becomes increasingly sophisticated and ubiquitous, the ability to effectively combine and analyze the diverse information from different types of data, such as text, audio, image, video and point clouds, will be paramount in pushing the boundaries of what technology can achieve in understanding and interacting with the world around us. Accordingly, multi-modality understanding has attracted a tremendous amount of research, establishing itself as an emerging topic. Pre-trained models, in particular, have revolutionized this field, providing a way to leverage vast amounts of data without task-specific annotation to facilitate various downstream tasks.
{"title":"Guest Editorial Introduction to the Issue on Pre-Trained Models for Multi-Modality Understanding","authors":"Wengang Zhou;Jiajun Deng;Niculae Sebe;Qi Tian;Alan L. Yuille;Concetto Spampinato;Zakia Hammal","doi":"10.1109/TMM.2024.3384680","DOIUrl":"10.1109/TMM.2024.3384680","url":null,"abstract":"In the ever-evolving domain of multimedia, the significance of multi-modality understanding cannot be overstated. As multimedia content becomes increasingly sophisticated and ubiquitous, the ability to effectively combine and analyze the diverse information from different types of data, such as text, audio, image, video and point clouds, will be paramount in pushing the boundaries of what technology can achieve in understanding and interacting with the world around us. Accordingly, multi-modality understanding has attracted a tremendous amount of research, establishing itself as an emerging topic. Pre-trained models, in particular, have revolutionized this field, providing a way to leverage vast amounts of data without task-specific annotation to facilitate various downstream tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"8291-8296"},"PeriodicalIF":8.4,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10616245","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141862636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Zero-Shot Video Moment Retrieval With Angular Reconstructive Text Embeddings
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-19 | DOI: 10.1109/TMM.2024.3396272
Xun Jiang;Xing Xu;Zailei Zhou;Yang Yang;Fumin Shen;Heng Tao Shen
Given an untrimmed video and a text query, Video Moment Retrieval (VMR) aims at retrieving a specific moment where the video content is semantically related to the text query. Conventional VMR methods rely on video-text paired data or specific temporal annotations for each target event. However, the subjectivity and time-consuming nature of the labeling process limit their practicality in multimedia applications. To address this issue, researchers recently proposed a Zero-Shot Learning setting for VMR (ZS-VMR) that trains VMR models without manual supervision signals, thereby reducing the data cost. In this paper, we tackle the challenging ZS-VMR problem with Angular Reconstructive Text embeddings (ART), generalizing the image-text matching pre-trained model CLIP to the VMR task. Specifically, assuming that visual embeddings are close to their semantically related text embeddings in angular space, our ART method generates pseudo-text embeddings of video event proposals through the hypersphere of CLIP. Moreover, to address the temporal nature of videos, we also design local multimodal fusion learning to narrow the gaps between image-text matching and video-text matching. Our experimental results on two widely used VMR benchmarks, Charades-STA and ActivityNet-Captions, show that our method outperforms current state-of-the-art ZS-VMR methods. It also achieves competitive performance compared to recent weakly-supervised VMR methods.
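As a minimal sketch of the angular-space assumption (not the authors' ART implementation), the snippet below samples a pseudo-text embedding within a small angle of a visual embedding on the unit hypersphere. The function name, embedding dimension, and maximum angle are arbitrary, and a random vector stands in for a CLIP image feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudo_text_embedding(v, max_angle_rad=0.2):
    """Sample a unit vector within a small angle of the visual embedding v."""
    v = v / np.linalg.norm(v)
    r = rng.normal(size=v.shape)
    u = r - (r @ v) * v              # remove the component of r along v
    u /= np.linalg.norm(u)           # unit vector orthogonal to v
    theta = rng.uniform(0.0, max_angle_rad)
    return np.cos(theta) * v + np.sin(theta) * u   # unit norm, close to v in angle

visual = rng.normal(size=512)        # stand-in for a CLIP image embedding of an event proposal
pseudo = pseudo_text_embedding(visual)
angle = np.degrees(np.arccos(np.clip(pseudo @ (visual / np.linalg.norm(visual)), -1.0, 1.0)))
print(f"angular distance: {angle:.2f} degrees")
```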
{"title":"Zero-Shot Video Moment Retrieval With Angular Reconstructive Text Embeddings","authors":"Xun Jiang;Xing Xu;Zailei Zhou;Yang Yang;Fumin Shen;Heng Tao Shen","doi":"10.1109/TMM.2024.3396272","DOIUrl":"10.1109/TMM.2024.3396272","url":null,"abstract":"Given an untrimmed video and a text query, Video Moment Retrieval (VMR) aims at retrieving a specific moment where the video content is semantically related to the text query. Conventional VMR methods rely on video-text paired data or specific temporal annotations for each target event. However, the subjectivity and time-consuming nature of the labeling process limit their practicality in multimedia applications. To address this issue, recently researchers proposed a Zero-Shot Learning setting for VMR (ZS-VMR) that trains VMR models without manual supervision signals, thereby reducing the data cost. In this paper, we tackle the challenging ZS-VMR problem with \u0000<italic>Angular Reconstructive Text embeddings (ART)</i>\u0000, generalizing the image-text matching pre-trained model CLIP to the VMR task. Specifically, assuming that visual embeddings are close to their semantically related text embeddings in angular space, our ART method generates pseudo-text embeddings of video event proposals through the hypersphere of CLIP. Moreover, to address the temporal nature of videos, we also design local multimodal fusion learning to narrow the gaps between image-text matching and video-text matching. Our experimental results on two widely used VMR benchmarks, Charades-STA and ActivityNet-Captions, show that our method outperforms current state-of-the-art ZS-VMR methods. It also achieves competitive performance compared to recent weakly-supervised VMR methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9657-9670"},"PeriodicalIF":8.4,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141743163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Toward Efficient Video Compression Artifact Detection and Removal: A Benchmark Dataset
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-03 | DOI: 10.1109/TMM.2024.3414549
Liqun Lin;Mingxing Wang;Jing Yang;Keke Zhang;Tiesong Zhao
Video compression leads to compression artifacts, among which Perceivable Encoding Artifacts (PEAs) degrade user perception. Most existing state-of-the-art Video Compression Artifact Removal (VCAR) methods indiscriminately process all artifacts, thus leading to over-enhancement in non-PEA regions. Therefore, accurate detection and localization of PEAs are crucial. In this paper, we propose the largest-ever Fine-grained PEA database (FPEA). First, we employ the popular video codecs VVC and AVS3, as well as their common test settings, to generate four types of spatial PEAs (blurring, blocking, ringing, and color bleeding) and two types of temporal PEAs (flickering and floating). Second, we design a labeling platform and recruit sufficient subjects to manually locate all the above types of PEAs. Third, we propose a voting mechanism and feature matching to synthesize all subjective labels and obtain the final PEA labels with fine-grained locations. In addition, we provide Mean Opinion Score (MOS) values for all compressed video sequences. Experimental results show the effectiveness of the FPEA database on both VCAR and compressed Video Quality Assessment (VQA). We envision that the FPEA database will benefit the future development of VCAR, VQA, and perception-aware video encoders. The FPEA database has been made publicly available.
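A toy version of the label-voting step might look like the following; the function name, vote threshold, and mask resolution are assumptions, and the paper's additional feature-matching stage is omitted.

```python
import numpy as np

def fuse_pea_labels(subject_masks, min_votes=3):
    """Keep a pixel in the final PEA mask only if enough subjects marked it."""
    votes = np.sum(subject_masks, axis=0)           # per-pixel vote count
    return (votes >= min_votes).astype(np.uint8)    # fused fine-grained PEA mask

# Toy data: five subjects annotating a 4x4 region of one frame.
rng = np.random.default_rng(1)
masks = rng.integers(0, 2, size=(5, 4, 4))
print(fuse_pea_labels(masks))
```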
{"title":"Toward Efficient Video Compression Artifact Detection and Removal: A Benchmark Dataset","authors":"Liqun Lin;Mingxing Wang;Jing Yang;Keke Zhang;Tiesong Zhao","doi":"10.1109/TMM.2024.3414549","DOIUrl":"10.1109/TMM.2024.3414549","url":null,"abstract":"Video compression leads to compression artifacts, among which Perceivable Encoding Artifacts (PEAs) degrade user perception. Most of existing state-of-the-art Video Compression Artifact Removal (VCAR) methods indiscriminately process all artifacts, thus leading to over-enhancement in non-PEA regions. Therefore, accurate detection and location of PEAs is crucial. In this paper, we propose the largest-ever Fine-grained PEA database (FPEA). First, we employ the popular video codecs, VVC and AVS3, as well as their common test settings, to generate four types of spatial PEAs (blurring, blocking, ringing and color bleeding) and two types of temporal PEAs (flickering and floating). Second, we design a labeling platform and recruit sufficient subjects to manually locate all the above types of PEAs. Third, we propose a voting mechanism and feature matching to synthesize all subjective labels to obtain the final PEA labels with fine-grained locations. Besides, we also provide Mean Opinion Score (MOS) values of all compressed video sequences. Experimental results show the effectiveness of FPEA database on both VCAR and compressed Video Quality Assessment (VQA). We envision that FPEA database will benefit the future development of VCAR, VQA and perception-aware video encoders. The FPEA database has been made publicly available.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10816-10827"},"PeriodicalIF":8.4,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141549758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Human-Centric Behavior Description in Videos: New Benchmark and Model
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-02 | DOI: 10.1109/TMM.2024.3414263
Lingru Zhou;Yiqi Gao;Manqing Zhang;Peng Wu;Peng Wang;Yanning Zhang
In the domain of video surveillance, describing the behavior of each individual within the video is becoming increasingly essential, especially in complex scenarios with multiple individuals present. This is because describing each individual's behavior provides more detailed situational analysis, enabling accurate assessment of and response to potential risks and ensuring the safety and harmony of public places. Currently, video-level captioning datasets cannot provide fine-grained descriptions of each individual's specific behavior. Moreover, mere descriptions at the video level fail to provide an in-depth interpretation of individual behaviors, making it challenging to accurately determine the specific identity of each individual. To address this challenge, we construct a human-centric video surveillance captioning dataset, which provides detailed descriptions of the dynamic behaviors of 7,820 individuals. Specifically, we have labeled several aspects of each person, such as location, clothing, and interactions with other elements in the scene, and these people are distributed across 1,012 videos. Based on this dataset, we can link individuals to their respective behaviors, allowing for further analysis of each person's behavior in surveillance videos. Besides the dataset, we propose a novel video captioning approach that can describe individual behavior in detail at the person level, achieving state-of-the-art results.
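To make the person-level annotation idea concrete, here is a hypothetical record layout; the class name, field names, and values are illustrative and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PersonAnnotation:
    """Hypothetical person-level record for a surveillance captioning dataset."""
    person_id: int
    video_id: str
    bbox: List[float]                                       # [x, y, w, h] location in the frame
    clothing: str                                           # e.g. "dark coat, blue jeans"
    interactions: List[str] = field(default_factory=list)   # e.g. ["walks beside person 2"]
    behavior_caption: str = ""                              # free-text behavior description

record = PersonAnnotation(
    person_id=7, video_id="surv_0001",
    bbox=[120.0, 60.0, 48.0, 130.0],
    clothing="dark coat",
    interactions=["walks beside person 2"],
    behavior_caption="A person in a dark coat walks across the lobby while talking.",
)
print(record)
```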
{"title":"Human-Centric Behavior Description in Videos: New Benchmark and Model","authors":"Lingru Zhou;Yiqi Gao;Manqing Zhang;Peng Wu;Peng Wang;Yanning Zhang","doi":"10.1109/TMM.2024.3414263","DOIUrl":"10.1109/TMM.2024.3414263","url":null,"abstract":"In the domain of video surveillance, describing the behavior of each individual within the video is becoming increasingly essential, especially in complex scenarios with multiple individuals present. This is because describing each individual's behavior provides more detailed situational analysis, enabling accurate assessment and response to potential risks, ensuring the safety and harmony of public places. Currently, video-level captioning datasets cannot provide fine-grained descriptions for each individual's specific behavior. However, mere descriptions at the video-level fail to provide an in-depth interpretation of individual behaviors, making it challenging to accurately determine the specific identity of each individual. To address this challenge, we construct a human-centric video surveillance captioning dataset, which provides detailed descriptions of the dynamic behaviors of 7,820 individuals. Specifically, we have labeled several aspects of each person, such as location, clothing, and interactions with other elements in the scene, and these people are distributed across 1,012 videos. Based on this dataset, we can link individuals to their respective behaviors, allowing for further analysis of each person's behavior in surveillance videos. Besides the dataset, we propose a novel video captioning approach that can describe individual behavior in detail on a person-level basis, achieving state-of-the-art results.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10867-10878"},"PeriodicalIF":8.4,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141531141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-26 | DOI: 10.1109/TMM.2024.3405724
Peiguang Jing;Xuan Zhao;Fugui Fan;Fan Yang;Yun Li;Yuting Su
Micro-videos, as an increasingly popular form of user-generated content (UGC), naturally include diverse multimodal cues. However, in pursuit of consistent representations, existing methods neglect to simultaneously explore modality discrepancy and preserve modality diversity. In this paper, we propose a multimodal progressive modulation network (MPMNet) for micro-video multi-label classification, which enhances the indicative ability of each modality by gradually regulating various modality biases. In MPMNet, we first leverage a unimodal-centered parallel aggregation strategy to obtain preliminary comprehensive representations. We then integrate a feature-domain disentangled modulation process and a category-domain adaptive modulation process into a unified framework to jointly refine modality-oriented representations. In the former modulation process, we constrain inter-modal dependencies in a latent space to obtain modality-oriented sample representations, and introduce a disentangled paradigm to further maintain modality diversity. In the latter modulation process, we construct global-context-aware graph convolutional networks to acquire modality-oriented category representations, and develop two instance-level parameter generators to further regulate unimodal semantic biases. Extensive experiments on two micro-video multi-label datasets show that our proposed approach outperforms the state-of-the-art methods.
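As a loose analogue of instance-level modulation over parallel modality streams (not the MPMNet architecture), the sketch below gates and aggregates per-modality features; the class name, feature dimension, gating form, and modality count are assumptions.

```python
import torch
import torch.nn as nn

class GatedModalityAggregation(nn.Module):
    """Gate and sum per-modality features with instance-conditioned weights."""
    def __init__(self, dim=256, n_modalities=3):
        super().__init__()
        # One small gate per modality, conditioned on that modality's own features.
        self.gates = nn.ModuleList([nn.Linear(dim, 1) for _ in range(n_modalities)])

    def forward(self, feats):                       # feats: list of (B, dim) tensors
        scores = torch.cat([g(f) for g, f in zip(self.gates, feats)], dim=1)
        weights = torch.softmax(scores, dim=1)      # instance-level modality weights
        stacked = torch.stack(feats, dim=1)         # (B, M, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)

visual, acoustic, textual = (torch.randn(4, 256) for _ in range(3))
fused = GatedModalityAggregation()([visual, acoustic, textual])
print(fused.shape)                                  # torch.Size([4, 256])
```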
{"title":"Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification","authors":"Peiguang Jing;Xuan Zhao;Fugui Fan;Fan Yang;Yun Li;Yuting Su","doi":"10.1109/TMM.2024.3405724","DOIUrl":"10.1109/TMM.2024.3405724","url":null,"abstract":"Micro-videos, as an increasingly popular form of user-generated content (UGC), naturally include diverse multimodal cues. However, in pursuit of consistent representations, existing methods neglect the simultaneous consideration of exploring modality discrepancy and preserving modality diversity. In this paper, we propose a multimodal progressive modulation network (MPMNet) for micro-video multi-label classification, which enhances the indicative ability of each modality through gradually regulating various modality biases. In MPMNet, we first leverage a unimodal-centered parallel aggregation strategy to obtain preliminary comprehensive representations. We then integrate feature-domain disentangled modulation process and category-domain adaptive modulation process into a unified framework to jointly refine modality-oriented representations. In the former modulation process, we constrain inter-modal dependencies in a latent space to obtain modality-oriented sample representations, and introduce a disentangled paradigm to further maintain modality diversity. In the latter modulation process, we construct global-context-aware graph convolutional networks to acquire modality-oriented category representations, and develop two instance-level parameter generators to further regulate unimodal semantic biases. Extensive experiments on two micro-video multi-label datasets show that our proposed approach outperforms the state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10134-10144"},"PeriodicalIF":8.4,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141528982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Relation-Aware Weight Sharing in Decoupling Feature Learning Network for UAV RGB-Infrared Vehicle Re-Identification
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-21 | DOI: 10.1109/TMM.2024.3400675
Xingyue Liu;Jiahao Qi;Chen Chen;Kangcheng Bin;Ping Zhong
Owing to its capacity for full-time target search, cross-modality vehicle re-identification based on unmanned aerial vehicles (UAV) is gaining attention in both video surveillance and public security. However, this promising and innovative line of research has not been studied sufficiently due to data inadequacy. Meanwhile, the cross-modality discrepancy and orientation discrepancy challenges further aggravate the difficulty of this task. To this end, we pioneer a cross-modality vehicle Re-ID benchmark named UAV Cross-Modality Vehicle Re-ID (UCM-VeID), containing 753 identities with 16015 RGB and 13913 infrared images. Moreover, to address the cross-modality discrepancy and orientation discrepancy challenges, we present a hybrid weights decoupling network (HWDNet) to learn shared, discriminative, orientation-invariant features. For the first challenge, we propose a hybrid weights siamese network with a well-designed weight restrainer and its corresponding objective function to learn both modality-specific and modality-shared information. For the second challenge, three effective decoupling structures with two pretext tasks are investigated to flexibly conduct the orientation-invariant feature separation task. Comprehensive experiments are carried out to validate the effectiveness of the proposed method.
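A minimal sketch of the hybrid-weights idea, assuming modality-specific stems feeding a shared trunk; the class name and layer sizes are arbitrary, and the actual HWDNet weight restrainer and objective are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridWeightsSiamese(nn.Module):
    """Two-stream network: separate RGB/IR stems, shared embedding trunk."""
    def __init__(self, feat_dim=128):
        super().__init__()
        def stem():   # modality-specific shallow layers
            return nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.rgb_stem, self.ir_stem = stem(), stem()
        self.shared = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(64, feat_dim))   # modality-shared layers

    def forward(self, x, modality):
        h = self.rgb_stem(x) if modality == "rgb" else self.ir_stem(x)
        return F.normalize(self.shared(h), dim=1)   # unit-norm Re-ID embedding

net = HybridWeightsSiamese()
rgb, ir = torch.randn(2, 3, 128, 64), torch.randn(2, 3, 128, 64)  # IR replicated to 3 channels
print(net(rgb, "rgb").shape, net(ir, "ir").shape)
```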
{"title":"Relation-Aware Weight Sharing in Decoupling Feature Learning Network for UAV RGB-Infrared Vehicle Re-Identification","authors":"Xingyue Liu;Jiahao Qi;Chen Chen;Kangcheng Bin;Ping Zhong","doi":"10.1109/TMM.2024.3400675","DOIUrl":"10.1109/TMM.2024.3400675","url":null,"abstract":"Owing to the capacity of performing full-time target searches, cross-modality vehicle re-identification based on unmanned aerial vehicles (UAV) is gaining more attention in both video surveillance and public security. However, this promising and innovative research has not been studied sufficiently due to the issue of data inadequacy. Meanwhile, the cross-modality discrepancy and orientation discrepancy challenges further aggravate the difficulty of this task. To this end, we pioneer a cross-modality vehicle Re-ID benchmark named UAV Cross-Modality Vehicle Re-ID (UCM-VeID), containing 753 identities with \u0000<bold>16015</b>\u0000 RGB and \u0000<bold>13913</b>\u0000 infrared images. Moreover, to meet cross-modality discrepancy and orientation discrepancy challenges, we present a hybrid weights decoupling network (HWDNet) to learn the shared discriminative orientation-invariant features. For the first challenge, we proposed a hybrid weights siamese network with a well-designed weight restrainer and its corresponding objective function to learn both modality-specific and modality shared information. In terms of the second challenge, three effective decoupling structures with two pretext tasks are investigated to flexibly conduct orientation-invariant feature separation task. Comprehensive experiments are carried out to validate the effectiveness of the proposed method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9839-9853"},"PeriodicalIF":8.4,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-21 | DOI: 10.1109/TMM.2024.3417694
Yahui Xu;Yi Bin;Jiwei Wei;Yang Yang;Guoqing Wang;Heng Tao Shen
We study the task of image retrieval with text feedback, where a reference image and modification text are composed to retrieve the desired target image. To accomplish this goal, existing methods typically obtain multimodal representations through different feature encoders and then adopt different strategies to model the correlation between the composed inputs and the target image. However, the multimodal query brings more challenges, as it requires not only a synergistic understanding of the semantics of the heterogeneous multimodal inputs but also the ability to accurately build the underlying semantic correlation in each inputs-target triplet, i.e., reference image, modification text, and target image. In this paper, we tackle these issues with a novel Align and Retrieve (AlRet) framework. First, our method employs a contrastive loss in the feature encoders to learn meaningful multimodal representations while placing the subsequent correlation modeling in a more harmonious space. We then propose to learn the accurate correlation between the composed inputs and the target image in a novel composition-and-decomposition paradigm. Specifically, the composition network couples the reference image and modification text into a joint representation to learn the correlation between the joint representation and the target image. The decomposition network conversely decouples the target image into visual and text subspaces to exploit the underlying correlation between the target image and each query element. The composition-and-decomposition paradigm forms a closed loop, in which the two networks can be optimized simultaneously and promote each other's performance. Extensive comparison experiments on three real-world datasets confirm the effectiveness of the proposed method.
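The composition side of such a pipeline can be sketched as below, assuming precomputed reference-image, modification-text, and target-image embeddings; the class name, fusion MLP, embedding size, and temperature are placeholders rather than the AlRet design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionHead(nn.Module):
    """Fuse a reference-image embedding and a modification-text embedding into a joint query."""
    def __init__(self, dim=512):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_emb, txt_emb):
        return F.normalize(self.fuse(torch.cat([img_emb, txt_emb], dim=-1)), dim=-1)

def contrastive_loss(query, target, temperature=0.07):
    """InfoNCE-style loss matching each composed query to its own target image."""
    logits = query @ F.normalize(target, dim=-1).t() / temperature
    labels = torch.arange(query.size(0))          # diagonal pairs are positives
    return F.cross_entropy(logits, labels)

ref, txt, tgt = (torch.randn(8, 512) for _ in range(3))   # stand-in embeddings
query = CompositionHead()(ref, txt)
print(contrastive_loss(query, tgt).item())
```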
{"title":"Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback","authors":"Yahui Xu;Yi Bin;Jiwei Wei;Yang Yang;Guoqing Wang;Heng Tao Shen","doi":"10.1109/TMM.2024.3417694","DOIUrl":"10.1109/TMM.2024.3417694","url":null,"abstract":"We study the task of image retrieval with text feedback, where a reference image and modification text are composed to retrieve the desired target image. To accomplish this goal, existing methods always get the multimodal representations through different feature encoders and then adopt different strategies to model the correlation between the composed inputs and the target image. However, the multimodal query brings more challenges as it requires not only the synergistic understanding of the semantics from the heterogeneous multimodal inputs but also the ability to accurately build the underlying semantic correlation existing in each inputs-target triplet, i.e., reference image, modification text, and target image. In this paper, we tackle these issues with a novel Align and Retrieve (AlRet) framework. First, our proposed methods employ the contrastive loss in the feature encoders to learn meaningful multimodal representation while making the subsequent correlation modeling process in a more harmonious space. Then we propose to learn the accurate correlation between the composed inputs and target image in a novel composition-and-decomposition paradigm. Specifically, the composition network couples the reference image and modification text into a joint representation to learn the correlation between the joint representation and target image. The decomposition network conversely decouples the target image into visual and text subspaces to exploit the underlying correlation between the target image with each query element. The composition-and-decomposition paradigm forms a closed loop, which can be optimized simultaneously to promote each other in the performance. Massive comparison experiments on three real-world datasets confirm the effectiveness of the proposed method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9936-9948"},"PeriodicalIF":8.4,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141504211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DeepSpoof: Deep Reinforcement Learning-Based Spoofing Attack in Cross-Technology Multimedia Communication
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-20 | DOI: 10.1109/TMM.2024.3414660
Demin Gao;Liyuan Ou;Ye Liu;Qing Yang;Honggang Wang
Cross-technology communication is essential for the Internet of Multimedia Things (IoMT) applications, enabling seamless integration of diverse media formats, optimized data transmission, and improved user experiences across devices and platforms. This integration drives innovative and efficient IoMT solutions in areas like smart homes, smart cities, and healthcare monitoring. However, this integration of diverse wireless standards within cross-technology multimedia communication increases the susceptibility of wireless networks to attacks. Current methods lack robust authentication mechanisms, leaving them vulnerable to spoofing attacks. To mitigate this concern, we introduce DeepSpoof, a spoofing system that utilizes deep learning to analyze historical wireless traffic and anticipate future patterns in the IoMT context. This innovative approach significantly boosts an attacker's impersonation capabilities and offers a higher degree of covertness compared to traditional spoofing methods. Rigorous evaluations, leveraging both simulated and real-world data, confirm that DeepSpoof significantly elevates the average success rate of attacks.
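As a simplified stand-in for "learn from traffic history, anticipate future patterns" (DeepSpoof itself is formulated with deep reinforcement learning, which is not reproduced here), the sketch below predicts the next traffic feature vector from a history window with an LSTM; the class name, feature set, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TrafficPredictor(nn.Module):
    """Predict the next wireless-traffic feature vector from a history window."""
    def __init__(self, n_features=4, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, history):                 # history: (B, T, n_features)
        out, _ = self.lstm(history)
        return self.head(out[:, -1])            # predicted next-step features

# Assumed per-step features: packet interval, length, channel index, RSSI.
history = torch.randn(16, 50, 4)
print(TrafficPredictor()(history).shape)        # torch.Size([16, 4])
```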
{"title":"DeepSpoof: Deep Reinforcement Learning-Based Spoofing Attack in Cross-Technology Multimedia Communication","authors":"Demin Gao;Liyuan Ou;Ye Liu;Qing Yang;Honggang Wang","doi":"10.1109/TMM.2024.3414660","DOIUrl":"10.1109/TMM.2024.3414660","url":null,"abstract":"Cross-technology communication is essential for the Internet of Multimedia Things (IoMT) applications, enabling seamless integration of diverse media formats, optimized data transmission, and improved user experiences across devices and platforms. This integration drives innovative and efficient IoMT solutions in areas like smart homes, smart cities, and healthcare monitoring. However, this integration of diverse wireless standards within cross-technology multimedia communication increases the susceptibility of wireless networks to attacks. Current methods lack robust authentication mechanisms, leaving them vulnerable to spoofing attacks. To mitigate this concern, we introduce DeepSpoof, a spoofing system that utilizes deep learning to analyze historical wireless traffic and anticipate future patterns in the IoMT context. This innovative approach significantly boosts an attacker's impersonation capabilities and offers a higher degree of covertness compared to traditional spoofing methods. Rigorous evaluations, leveraging both simulated and real-world data, confirm that DeepSpoof significantly elevates the average success rate of attacks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10879-10891"},"PeriodicalIF":8.4,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Perceptual Image Hashing Using Feature Fusion of Orthogonal Moments
IF 8.4 | Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-20 | DOI: 10.1109/TMM.2024.3405660
Xinran Li;Zichi Wang;Guorui Feng;Xinpeng Zhang;Chuan Qin
Due to the limited number of stable image feature descriptors and the simplistic concatenation approach to hash generation, existing hashing methods have not achieved a satisfactory balance between robustness and discrimination. To this end, a novel perceptual hashing method is proposed in this paper using feature fusion of fractional-order continuous orthogonal moments (FrCOMs). Specifically, two robust image descriptors, i.e., fractional-order Chebyshev Fourier moments (FrCHFMs) and fractional-order radial harmonic Fourier moments (FrRHFMs), are used to extract global structural features of a color image. Then, the canonical correlation analysis (CCA) strategy is employed to fuse these features during the final hash generation process. Compared to direct concatenation, CCA excels in eliminating redundancies between feature vectors, resulting in a shorter hash sequence and higher authentication performance. A series of experiments demonstrates that the proposed method achieves satisfactory robustness, discrimination, and security. In particular, the proposed method exhibits better tampering-detection ability and robustness against combined content-preserving manipulations in practical applications.
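A generic example of CCA-based feature fusion followed by sign quantization is shown below; random matrices stand in for the FrCHFM and FrRHFM features, and the function name, component count, and binarization rule are assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_fused_hash(feats_a, feats_b, n_components=16):
    """Fuse two feature matrices with CCA and binarize by sign to get per-image hashes."""
    cca = CCA(n_components=n_components)
    a_c, b_c = cca.fit_transform(feats_a, feats_b)     # projected, maximally correlated views
    fused = np.concatenate([a_c, b_c], axis=1)         # much shorter than raw concatenation
    return (fused > 0).astype(np.uint8)                # one binary hash per image

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(100, 64))   # stand-in for FrCHFM features of 100 images
feats_b = rng.normal(size=(100, 64))   # stand-in for FrRHFM features of the same images
print(cca_fused_hash(feats_a, feats_b).shape)          # (100, 32)
```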
{"title":"Perceptual Image Hashing Using Feature Fusion of Orthogonal Moments","authors":"Xinran Li;Zichi Wang;Guorui Feng;Xinpeng Zhang;Chuan Qin","doi":"10.1109/TMM.2024.3405660","DOIUrl":"10.1109/TMM.2024.3405660","url":null,"abstract":"Due to the limited number of stable image feature descriptors and the simplistic concatenation approach to hash generation, existing hashing methods have not achieved a satisfactory balance between robustness and discrimination. To this end, a novel perceptual hashing method is proposed in this paper using feature fusion of fractional-order continuous orthogonal moments (FrCOMs). Specifically, two robust image descriptors, i.e., fractional-order Chebyshev Fourier moments (FrCHFMs) and fractional-order radial harmonic Fourier moments (FrRHFMs), are used to extract global structural features of a color image. Then, the canonical correlation analysis (CCA) strategy is employed to fuse these features during the final hash generation process. Compared to direct concatenation, CCA excels in eliminating redundancies between feature vectors, resulting in a shorter hash sequence and higher authentication performance. A series of experiments demonstrate that the proposed method achieves satisfactory robustness, discrimination and security. Particularly, the proposed method exhibits better tampering detection ability and robustness against combined content-preserving manipulations in practical applications.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10041-10054"},"PeriodicalIF":8.4,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0