
IEEE Transactions on Multimedia: Latest Publications

Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-Labeling
IF 8.4 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-16 · DOI: 10.1109/TMM.2024.3443670
Xu Wang;Yifan Li;Qiudan Zhang;Wenhui Wu;Mark Junjie Li;Lin Ma;Jianmin Jiang
Learning to build 3D scene graphs is essential for real-world perception in a structured and rich fashion. However, previous 3D scene graph generation methods follow a fully supervised learning paradigm and require large amounts of entity-level annotations of objects and relations, which are extremely resource-consuming and tedious to obtain. To tackle this problem, we propose 3D-VLAP, a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling. Specifically, our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds. First, we establish the positional correspondence from 3D point clouds to 2D images via camera intrinsic and extrinsic parameters, thereby achieving alignment of 3D point clouds and 2D images. Subsequently, a large-scale cross-modal visual-linguistic model is employed to indirectly align 3D instances with the textual category labels of objects by matching 2D images with object category labels. The pseudo labels for objects and relations are then produced for 3D-VLAP model training by calculating the similarity between the visual embeddings and the textual category embeddings of objects and relations encoded by the visual-linguistic model, respectively. Ultimately, we design an edge self-attention based graph neural network to generate scene graphs of 3D point clouds. Experiments demonstrate that our 3D-VLAP achieves results comparable to current fully supervised methods while alleviating the data annotation burden.
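The two mechanisms the abstract leans on, projecting 3D points into 2D views with camera intrinsics/extrinsics and assigning pseudo category labels by visual-text similarity, can be sketched as follows. This is a minimal illustration, not the authors' code; the CLIP-style embedding assumption and the function and variable names are hypothetical.

```python
# Hedged sketch of the two steps described above (not the authors' implementation):
# (1) project 3D instance points into a 2D image using camera intrinsics K and extrinsics [R|t];
# (2) produce a pseudo category label by matching the cropped 2D region's embedding against
#     textual category embeddings from a CLIP-style visual-linguistic model (an assumption here).
import numpy as np

def project_points(points_3d, K, R, t):
    """Project Nx3 world-space points to pixel coordinates."""
    cam = points_3d @ R.T + t          # world frame -> camera frame
    uv = cam @ K.T                     # camera frame -> homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

def pseudo_label(crop_embedding, text_embeddings, category_names):
    """Assign the category whose text embedding is most similar to the 2D crop embedding."""
    crop = crop_embedding / np.linalg.norm(crop_embedding)
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    sims = texts @ crop                # cosine similarities, one per category
    return category_names[int(np.argmax(sims))], sims
```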
{"title":"Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-Labeling","authors":"Xu Wang;Yifan Li;Qiudan Zhang;Wenhui Wu;Mark Junjie Li;Lin Ma;Jianmin Jiang","doi":"10.1109/TMM.2024.3443670","DOIUrl":"10.1109/TMM.2024.3443670","url":null,"abstract":"Learning to build 3D scene graphs is essential for real-world perception in a structured and rich fashion. However, previous 3D scene graph generation methods utilize a fully supervised learning manner and require a large amount of entity-level annotation data of objects and relations, which is extremely resource-consuming and tedious to obtain. To tackle this problem, we propose 3D-VLAP, a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling. Specifically, our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds. First, we establish the positional correspondence from 3D point clouds to 2D images via camera intrinsic and extrinsic parameters, thereby achieving alignment of 3D point clouds and 2D images. Subsequently, a large-scale cross-modal visual-linguistic model is employed to indirectly align 3D instances with the textual category labels of objects by matching 2D images with object category labels. The pseudo labels for objects and relations are then produced for 3D-VLAP model training by calculating the similarity between visual embeddings and textual category embeddings of objects and relations encoded by the visual-linguistic model, respectively. Ultimately, we design an edge self-attention based graph neural network to generate scene graphs of 3D point clouds. Experiments demonstrate that our 3D-VLAP achieves comparable results with current fully supervised methods, meanwhile alleviating the data annotation pressure.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11164-11175"},"PeriodicalIF":8.4,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Controllable Syllable-Level Lyrics Generation From Melody With Prior Attention
IF 8.4 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-15 · DOI: 10.1109/TMM.2024.3443664
Zhe Zhang;Yi Yu;Atsuhiro Takasu
Melody-to-lyrics generation, which is based on syllable-level generation, is an intriguing and challenging topic in the interdisciplinary field of music, multimedia, and machine learning. Many previous research projects generate word-level lyrics sequences due to the lack of alignments between syllables and musical notes. Moreover, controllable lyrics generation from melody is less explored yet important for helping users generate diverse, desired lyrics. In this work, we propose a controllable melody-to-lyrics model that is able to generate syllable-level lyrics with user-desired rhythm. An explicit n-gram (EXPLING) loss is proposed to train the Transformer-based model to capture the sequence dependency and alignment relationship between melody and lyrics and predict the lyrics sequences at the syllable level. A prior attention mechanism is proposed to enhance the controllability and diversity of lyrics generation. Experiments and evaluation metrics verify that our proposed model generates higher-quality lyrics than previous methods and that it can interact with users for controllable and diverse lyrics generation. We believe this work provides valuable insights into human-centered AI research in music generation tasks. The source code for this work will be made publicly available for further reference and exploration.
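As a rough illustration of how a "prior attention" mechanism can steer generation, the sketch below biases standard scaled dot-product attention with a user-supplied prior over melody-lyric alignments. It is an assumption about one plausible realization, not the paper's exact formulation; the names attention_with_prior, prior_logits, and alpha are hypothetical.

```python
# Illustrative sketch only: biasing attention logits with a user-controlled prior,
# one generic way to realize "prior attention"; the paper's formulation may differ.
import torch
import torch.nn.functional as F

def attention_with_prior(q, k, v, prior_logits, alpha=1.0):
    """q, k, v: (batch, heads, length, dim); prior_logits: (batch, 1, len_q, len_k)
    encoding a user-desired alignment (e.g. a target rhythm); alpha scales the prior's influence."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # standard attention logits
    scores = scores + alpha * prior_logits        # add the prior bias before normalization
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights
```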
{"title":"Controllable Syllable-Level Lyrics Generation From Melody With Prior Attention","authors":"Zhe Zhang;Yi Yu;Atsuhiro Takasu","doi":"10.1109/TMM.2024.3443664","DOIUrl":"10.1109/TMM.2024.3443664","url":null,"abstract":"Melody-to-lyrics generation, which is based on syllable-level generation, is an intriguing and challenging topic in the interdisciplinary field of music, multimedia, and machine learning. Many previous research projects generate word-level lyrics sequences due to the lack of alignments between syllables and musical notes. Moreover, controllable lyrics generation from melody is also less explored but important for facilitating humans to generate diverse desired lyrics. In this work, we propose a controllable melody-to-lyrics model that is able to generate syllable-level lyrics with user-desired rhythm. An explicit n-gram (EXPLING) loss is proposed to train the Transformer-based model to capture the sequence dependency and alignment relationship between melody and lyrics and predict the lyrics sequences at the syllable level. A prior attention mechanism is proposed to enhance the controllability and diversity of lyrics generation. Experiments and evaluation metrics verified that our proposed model has the ability to generate higher-quality lyrics than previous methods and the feasibility of interacting with users for controllable and diverse lyrics generation. We believe this work provides valuable insights into human-centered AI research in music generation tasks. The source codes for this work will be made publicly available for further reference and exploration.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11083-11094"},"PeriodicalIF":8.4,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10637751","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Anti-Collapse Loss for Deep Metric Learning
IF 8.4 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-15 · DOI: 10.1109/TMM.2024.3443616
Xiruo Jiang;Yazhou Yao;Xili Dai;Fumin Shen;Liqiang Nie;Heng-Tao Shen
Deep metric learning (DML) aims to learn a discriminative high-dimensional embedding space for downstream tasks like classification, clustering, and retrieval. Prior literature predominantly focuses on pair-based and proxy-based methods to maximize inter-class discrepancy and minimize intra-class diversity. However, these methods tend to suffer from the collapse of the embedding space due to their over-reliance on label information. This leads to sub-optimal feature representation and inferior model performance. To maintain the structure of embedding space and avoid feature collapse, we propose a novel loss function called Anti-Collapse Loss. Specifically, our proposed loss primarily draws inspiration from the principle of Maximal Coding Rate Reduction. It promotes the sparseness of feature clusters in the embedding space to prevent collapse by maximizing the average coding rate of sample features or class proxies. Moreover, we integrate our proposed loss with pair-based and proxy-based methods, resulting in notable performance improvement. Comprehensive experiments on benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art methods. Extensive ablation studies verify the effectiveness of our method in preventing embedding space collapse and promoting generalization performance.
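The coding-rate quantity from Maximal Coding Rate Reduction that the abstract draws on can be written compactly as R(Z) = 0.5 * logdet(I + d/(n ε²) ZᵀZ) for n d-dimensional embeddings; maximizing it keeps features spread out. The sketch below is a generic version of that term used as an anti-collapse regularizer, not necessarily the paper's exact Anti-Collapse Loss.

```python
# Hedged sketch: the generic MCR coding-rate term, negated so it can be added to a
# training loss; minimizing the penalty maximizes the coding rate and resists collapse.
import torch

def coding_rate(Z, eps=0.5):
    """Z: (n, d) L2-normalized embeddings. Returns R(Z) = 0.5 * logdet(I + d/(n*eps^2) * Z^T Z)."""
    n, d = Z.shape
    scatter = Z.T @ Z                                    # (d, d) feature scatter matrix
    I = torch.eye(d, device=Z.device, dtype=Z.dtype)
    return 0.5 * torch.logdet(I + (d / (n * eps ** 2)) * scatter)

def anti_collapse_penalty(Z, eps=0.5):
    """Additive loss term; a lower penalty means a higher coding rate, i.e. a less collapsed embedding."""
    return -coding_rate(Z, eps)
```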
{"title":"Anti-Collapse Loss for Deep Metric Learning","authors":"Xiruo Jiang;Yazhou Yao;Xili Dai;Fumin Shen;Liqiang Nie;Heng-Tao Shen","doi":"10.1109/TMM.2024.3443616","DOIUrl":"10.1109/TMM.2024.3443616","url":null,"abstract":"Deep metric learning (DML) aims to learn a discriminative high-dimensional embedding space for downstream tasks like classification, clustering, and retrieval. Prior literature predominantly focuses on pair-based and proxy-based methods to maximize inter-class discrepancy and minimize intra-class diversity. However, these methods tend to suffer from the collapse of the embedding space due to their over-reliance on label information. This leads to sub-optimal feature representation and inferior model performance. To maintain the structure of embedding space and avoid feature collapse, we propose a novel loss function called Anti-Collapse Loss. Specifically, our proposed loss primarily draws inspiration from the principle of Maximal Coding Rate Reduction. It promotes the sparseness of feature clusters in the embedding space to prevent collapse by maximizing the average coding rate of sample features or class proxies. Moreover, we integrate our proposed loss with pair-based and proxy-based methods, resulting in notable performance improvement. Comprehensive experiments on benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art methods. Extensive ablation studies verify the effectiveness of our method in preventing embedding space collapse and promoting generalization performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11139-11150"},"PeriodicalIF":8.4,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Gist, Content, Target-Oriented: A 3-Level Human-Like Framework for Video Moment Retrieval
IF 8.4 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-14 · DOI: 10.1109/TMM.2024.3443672
Di Wang;Xiantao Lu;Quan Wang;Yumin Tian;Bo Wan;Lihuo He
Video moment retrieval (VMR) aims to locate corresponding moments in an untrimmed video via a given natural language query. While most existing approaches treat this task as a cross-modal content matching or boundary prediction problem, recent studies have started to solve the VMR problem from a reading comprehension perspective. However, the cross-modal interaction processes of existing models are either insufficient or overly complex. Therefore, we reanalyze human behaviors in the document fragment location task of reading comprehension, and design a specific module for each behavior to propose a 3-level human-like moment retrieval framework (Tri-MRF). Specifically, we summarize human behaviors such as grasping the general structures of the document and the question separately, cross-scanning to mark the direct correspondences between keywords in the document and in the question, and summarizing to obtain the overall correspondences between document fragments and the question. Correspondingly, the proposed Tri-MRF model contains three modules: 1) a gist-oriented intra-modal comprehension module establishes contextual dependencies within each modality; 2) a content-oriented fine-grained comprehension module explores direct correspondences between clips and words; and 3) a target-oriented integrated comprehension module verifies the overall correspondence between candidate moments and the query. In addition, we introduce a biconnected GCN feature enhancement module to optimize query-guided moment representations. Extensive experiments conducted on three benchmarks, TACoS, ActivityNet Captions, and Charades-STA, demonstrate that the proposed framework outperforms state-of-the-art methods.
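A minimal sketch of the "cross-scanning" idea behind the content-oriented module: marking direct correspondences between clip features and query-word features via a similarity matrix. The shapes and names here are assumptions for illustration, not the Tri-MRF implementation.

```python
# Hedged illustration: a clip-word correspondence matrix, the kind of fine-grained
# matching the content-oriented comprehension module is described as performing.
import torch
import torch.nn.functional as F

def clip_word_correspondence(clip_feats, word_feats):
    """clip_feats: (num_clips, d); word_feats: (num_words, d).
    Returns a (num_clips, num_words) matrix: per-clip attention over query words."""
    c = F.normalize(clip_feats, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    sim = c @ w.T                      # cosine similarity between every clip and every word
    return sim.softmax(dim=-1)
```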
{"title":"Gist, Content, Target-Oriented: A 3-Level Human-Like Framework for Video Moment Retrieval","authors":"Di Wang;Xiantao Lu;Quan Wang;Yumin Tian;Bo Wan;Lihuo He","doi":"10.1109/TMM.2024.3443672","DOIUrl":"10.1109/TMM.2024.3443672","url":null,"abstract":"Video moment retrieval (VMR) aims to locate corresponding moments in an untrimmed video via a given natural language query. While most existing approaches treat this task as a cross-modal content matching or boundary prediction problem, recent studies have started to solve the VMR problem from a reading comprehension perspective. However, the cross-modal interaction processes of existing models are either insufficient or overly complex. Therefore, we reanalyze human behaviors in the document fragment location task of reading comprehension, and design a specific module for each behavior to propose a 3-level human-like moment retrieval framework (Tri-MRF). Specifically, we summarize human behaviors such as grasping the general structures of the document and the question separately, cross-scanning to mark the direct correspondences between keywords in the document and in the question, and summarizing to obtain the overall correspondences between document fragments and the question. Correspondingly, the proposed Tri-MRF model contains three modules: 1) a gist-oriented intra-modal comprehension module is used to establish contextual dependencies within each modality; 2) a content-oriented fine-grained comprehension module is used to explore direct correspondences between clips and words; and 3) a target-oriented integrated comprehension module is used to verify the overall correspondence between the candidate moments and the query. In addition, we introduce a biconnected GCN feature enhancement module to optimize query-guided moment representations. Extensive experiments conducted on three benchmarks, TACoS, ActivityNet Captions and Charades-STA demonstrate that the proposed framework outperforms State-of-the-Art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11044-11056"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Sparse Pedestrian Character Learning for Trajectory Prediction
IF 8.4 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-14 · DOI: 10.1109/TMM.2024.3443591
Yonghao Dong;Le Wang;Sanping Zhou;Gang Hua;Changyin Sun
Pedestrian trajectory prediction in a first-person view has recently attracted much attention due to its importance in autonomous driving. Recent work utilizes pedestrian character information, i.e., action and appearance, to improve the learned trajectory embedding and achieves state-of-the-art performance. However, it neglects the invalid and negative pedestrian character information, which is harmful to trajectory representation and thus leads to performance degradation. To address this issue, we present a two-stream sparse-character-based network (TSNet) for pedestrian trajectory prediction. Specifically, TSNet learns the negative-removed characters in the sparse character representation stream to improve the trajectory embedding obtained in the trajectory representation stream. Moreover, to model the negative-removed characters, we propose a novel sparse character graph, including the sparse category and sparse temporal character graphs, to learn the different effects of various characters in the category and temporal dimensions, respectively. Extensive experiments on two first-person view datasets, PIE and JAAD, show that our method outperforms existing state-of-the-art methods. In addition, ablation studies demonstrate the different effects of various characters and show that TSNet outperforms approaches that do not eliminate negative characters.
{"title":"Sparse Pedestrian Character Learning for Trajectory Prediction","authors":"Yonghao Dong;Le Wang;Sanping Zhou;Gang Hua;Changyin Sun","doi":"10.1109/TMM.2024.3443591","DOIUrl":"10.1109/TMM.2024.3443591","url":null,"abstract":"Pedestrian trajectory prediction in a first-person view has recently attracted much attention due to its importance in autonomous driving. Recent work utilizes pedestrian character information, i.e., action and appearance, to improve the learned trajectory embedding and achieves state-of-the-art performance. However, it neglects the invalid and negative pedestrian character information, which is harmful to trajectory representation and thus leads to performance degradation. To address this issue, we present a two-stream sparse-character-based network (TSNet) for pedestrian trajectory prediction. Specifically, TSNet learns the negative-removed characters in the sparse character representation stream to improve the trajectory embedding obtained in the trajectory representation stream. Moreover, to model the negative-removed characters, we propose a novel sparse character graph, including the sparse category and sparse temporal character graphs, to learn the different effects of various characters in category and temporal dimensions, respectively. Extensive experiments on two first-person view datasets, PIE and JAAD, show that our method outperforms existing state-of-the-art methods. In addition, ablation studies demonstrate different effects of various characters and prove that TSNet outperforms approaches without eliminating negative characters.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11070-11082"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
MMVS: Enabling Robust Adaptive Video Streaming for Wildly Fluctuating and Heterogeneous Networks
IF 8.4 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-14 · DOI: 10.1109/TMM.2024.3443609
Shuoyao Wang;Jiawei Lin;Yu Dai
With the advancement of wireless technology, the fifth-generation mobile communication network (5G) can provide exceptionally high bandwidth to support high-quality video streaming services. Nevertheless, this network exhibits substantial fluctuations, posing a significant challenge to ensuring reliable video streaming services. This research introduces a novel algorithm, the Multi-type data perception-based Meta-learning-enabled adaptive Video Streaming algorithm (MMVS), designed to adapt to diverse network conditions, encompassing 3G and mmWave 5G networks. The proposed algorithm integrates the proximal policy optimization technique with the meta-learning framework to cope with gradient estimation noise under network fluctuations. To further improve the robustness of the algorithm, MMVS introduces meta advantage normalization. Additionally, MMVS treats network information as multiple types of input data, enabling distinct network structures to be precisely defined for perceiving each type accurately. Experimental results on real-world network trace datasets illustrate that MMVS delivers an additional 6% average QoE in mmWave 5G networks and outperforms the representative benchmarks across six pairs of heterogeneous networks and user preferences.
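For context, the sketch below shows the standard PPO clipped surrogate with per-batch advantage normalization, the kind of mechanism that "meta advantage normalization" builds on; the meta-learning outer loop and the paper's exact normalization scheme are not reproduced here.

```python
# Hedged sketch: PPO clipped objective with normalized advantages (generic, not MMVS itself).
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """logp_new / logp_old: log-probabilities of the taken actions under the new/old policy."""
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)  # advantage normalization
    ratio = torch.exp(logp_new - logp_old)                              # importance ratio
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()                        # negate to minimize
```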
{"title":"MMVS: Enabling Robust Adaptive Video Streaming for Wildly Fluctuating and Heterogeneous Networks","authors":"Shuoyao Wang;Jiawei Lin;Yu Dai","doi":"10.1109/TMM.2024.3443609","DOIUrl":"10.1109/TMM.2024.3443609","url":null,"abstract":"With the advancement of wireless technology, the fifth-generation mobile communication network (5G) has the capability to provide exceptionally high bandwidth for supporting high-quality video streaming services. Nevertheless, this network exhibits substantial fluctuations, posing a significant challenge in ensuring the reliability of video streaming services. This research introduces a novel algorithm, the Multi-type data perception-based Meta-learning-enabled adaptive Video Streaming algorithm (MMVS), designed to adapt to diverse network conditions, encompassing 3G and mmWave 5G networks. The proposed algorithm integrates the proximal policy optimization technique with the meta-learning framework to cope with the gradient estimation noise in network fluctuation. To further improve the robustness of the algorithm, MMVS introduces meta advantage normalization. Additionally, MMVS treats network information as multiple types of input data, thus enabling the precise definition of distinct network structures for perceiving them accurately. The experimental results on network trace datasets in real-world scenarios illustrate that MMVS is capable of delivering an additional 6% average QoE in mmWave 5G network, and outperform the representative benchmarks in six pairs of heterogeneous networks and user preferences.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11018-11030"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
HSSHG: Heuristic Semantics-Constrained Spatio-Temporal Heterogeneous Graph for VideoQA
IF 8.4 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-14 · DOI: 10.1109/TMM.2024.3443661
Ruomei Wang;Yuanmao Luo;Fuwei Zhang;Mingyang Liu;Xiaonan Luo
Video question answering is a challenging task that requires models to recognize visual information in videos and perform spatio-temporal reasoning. Current models increasingly focus on enabling spatio-temporal reasoning over objects via graph neural networks. However, existing graph network-based models still have deficiencies when constructing spatio-temporal relationships between objects: (1) spatio-temporal constraints between objects are not considered when defining adjacency relationships; (2) semantic correlations between objects are not fully considered when generating edge weights. These deficiencies leave the model without an adequate representation of spatio-temporal interactions between objects, which directly weakens its object relation reasoning. To solve the above problems, this paper designs a heuristic semantics-constrained spatio-temporal heterogeneous graph, employing a semantic consistency-aware strategy to construct spatio-temporal interactions between objects. The spatio-temporal relationships between objects are constrained by object co-occurrence and object consistency. Plot summaries and object locations are used as heuristic semantic priors to constrain the weights of spatial and temporal edges. The spatio-temporal heterogeneous graph more accurately restores the spatio-temporal relationships between objects and strengthens the model's spatio-temporal reasoning over objects. Based on the spatio-temporal heterogeneous graph, this paper proposes the Heuristic Semantics-constrained Spatio-temporal Heterogeneous Graph for VideoQA (HSSHG), which achieves state-of-the-art performance on the benchmark MSVD-QA and FrameQA datasets and demonstrates competitive results on the benchmark MSRVTT-QA and ActivityNet-QA datasets. Extensive ablation experiments verify the effectiveness of each network component and the rationality of the hyperparameter settings, and qualitative analysis verifies HSSHG's object-level spatio-temporal reasoning ability.
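The two constraints named above, co-occurrence for spatial edges and object consistency for temporal edges, can be sketched as a simple adjacency-construction routine. This is an assumed illustration of the constraint logic only; edge weighting by the semantic priors (plot summaries, object locations) is omitted.

```python
# Hedged sketch: spatial edges only between objects co-occurring in a frame, and temporal
# edges only between detections of the same tracked object in consecutive observations.
from collections import defaultdict

def build_spatio_temporal_edges(detections):
    """detections: iterable of (frame_id, track_id, node_id) tuples.
    Returns (spatial_edges, temporal_edges) as lists of node-id pairs."""
    by_frame, by_track = defaultdict(list), defaultdict(list)
    for frame_id, track_id, node_id in detections:
        by_frame[frame_id].append(node_id)
        by_track[track_id].append((frame_id, node_id))
    spatial = [(a, b) for nodes in by_frame.values()
               for i, a in enumerate(nodes) for b in nodes[i + 1:]]       # co-occurrence constraint
    temporal = []
    for track in by_track.values():
        ordered = sorted(track)                                            # order detections by frame
        temporal += [(a[1], b[1]) for a, b in zip(ordered, ordered[1:])]   # object-consistency constraint
    return spatial, temporal
```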
{"title":"HSSHG: Heuristic Semantics-Constrained Spatio-Temporal Heterogeneous Graph for VideoQA","authors":"Ruomei Wang;Yuanmao Luo;Fuwei Zhang;Mingyang Liu;Xiaonan Luo","doi":"10.1109/TMM.2024.3443661","DOIUrl":"10.1109/TMM.2024.3443661","url":null,"abstract":"Video question answering is a challenging task that requires models to recognize visual information in videos and perform spatio-temporal reasoning. Current models increasingly focus on enabling objects spatio-temporal reasoning via graph neural networks. However, the existing graph network-based models still have deficiencies when constructing the spatio-temporal relationship between objects: (1) The lack of consideration of the spatio-temporal constraints between objects when defining the adjacency relationship; (2) The semantic correlation between objects is not fully considered when generating edge weights. These make the model lack representation of spatio-temporal interaction between objects, which directly affects the ability of object relation reasoning. To solve the above problems, this paper designs a heuristic semantics-constrained spatio-temporal heterogeneous graph, employing a semantic consistency-aware strategy to construct the spatio-temporal interaction between objects. The spatio-temporal relationship between objects is constrained by the object co-occurrence relationship and the object consistency. The plot summaries and object locations are used as heuristic semantic priors to constrain the weights of spatial and temporal edges. The spatio-temporal heterogeneity graph more accurately restores the spatio-temporal relationship between objects and strengthens the model's object spatio-temporal reasoning ability. Based on the spatio-temporal heterogeneous graph, this paper proposes Heuristic Semantics-constrained Spatio-temporal Heterogeneous Graph for VideoQA (HSSHG), which achieves state-of-the-art performance on benchmark MSVD-QA and FrameQA datasets, and demonstrates competitive results on benchmark MSRVTT-QA and ActivityNet-QA dataset. Extensive ablation experiments verify the effectiveness of each component in the network and the rationality of hyperparameter settings, and qualitative analysis verifies the object-level spatio-temporal reasoning ability of HSSHG.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11176-11190"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
GS-SFS: Joint Gaussian Splatting and Shape-From-Silhouette for Multiple Human Reconstruction in Large-Scale Sports Scenes
IF 8.4 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-14 · DOI: 10.1109/TMM.2024.3443637
Yuqi Jiang;Jing Li;Haidong Qin;Yanran Dai;Jing Liu;Guodong Zhang;Canbin Zhang;Tao Yang
We introduce GS-SFS, a method that utilizes a camera array with wide baselines for high-quality multiple human mesh reconstruction in large-scale sports scenes. Traditional human reconstruction methods in sports scenes, such as Shape-from-Silhouette (SFS), struggle with sparse camera setups and small human targets, making it challenging to obtain complete and accurate human representations. Despite advances in differentiable rendering, including 3D Gaussian Splatting (3DGS), which can produce photorealistic novel-view renderings with dense inputs, accurately depicting surfaces and generating detailed meshes remain challenging. Our approach uniquely combines 3DGS's view synthesis with an optimized SFS method, thereby significantly enhancing the quality of multiperson mesh reconstruction in large-scale sports scenes. Specifically, we introduce body shape priors, including the human surface point clouds extracted through SFS and human silhouettes, to constrain 3DGS to a more accurate representation of the human body only. Then, we develop an improved mesh reconstruction method based on SFS, mainly by adding viewpoints rendered through 3DGS and obtaining a more accurate surface, to achieve higher-quality reconstruction models. We implement a high-density scene resampling strategy based on spherical sampling of human bounding boxes and render new perspectives using 3D Gaussian Splatting to create precise and dense multi-view human silhouettes. During mesh reconstruction, we integrate the human body's 2D Signed Distance Function (SDF) into the computation of the SFS's implicit surface field, resulting in smoother and more accurate surfaces. Moreover, we enhance mesh texture mapping by blending original and rendered images with different weights, preserving high-quality textures while compensating for missing details. Experimental results from real basketball game scenarios demonstrate the significant improvements our approach brings to multiple human body model reconstruction in complex sports settings.
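The Shape-from-Silhouette core that GS-SFS builds on, keeping only 3D points whose projections fall inside the human silhouette in every view, can be sketched as below. This is a generic visual-hull check under an assumed pinhole camera model, not the authors' pipeline; the SDF-based surface refinement and the 3DGS-rendered extra views are omitted.

```python
# Hedged sketch of the basic visual-hull (Shape-from-Silhouette) test: a candidate voxel
# survives only if every camera projects it inside that view's binary human silhouette.
import numpy as np

def visual_hull(voxels, cameras, silhouettes):
    """voxels: (N, 3) candidate points; cameras: list of (K, R, t); silhouettes: list of HxW masks."""
    keep = np.ones(len(voxels), dtype=bool)
    for (K, R, t), mask in zip(cameras, silhouettes):
        cam = voxels @ R.T + t                            # world -> camera frame
        in_front = cam[:, 2] > 1e-6                       # ignore points behind the camera
        uv = cam @ K.T
        uv = np.round(uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)).astype(int)
        h, w = mask.shape
        inside = in_front & (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        hit = np.zeros(len(voxels), dtype=bool)
        hit[inside] = mask[uv[inside, 1], uv[inside, 0]] > 0
        keep &= hit                                       # carve away anything outside a silhouette
    return voxels[keep]
```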
{"title":"GS-SFS: Joint Gaussian Splatting and Shape-From-Silhouette for Multiple Human Reconstruction in Large-Scale Sports Scenes","authors":"Yuqi Jiang;Jing Li;Haidong Qin;Yanran Dai;Jing Liu;Guodong Zhang;Canbin Zhang;Tao Yang","doi":"10.1109/TMM.2024.3443637","DOIUrl":"10.1109/TMM.2024.3443637","url":null,"abstract":"We introduce GS-SFS, a method that utilizes a camera array with wide baselines for high-quality multiple human mesh reconstruction in large-scale sports scenes. Traditional human reconstruction methods in sports scenes, such as Shape-from-Silhouette (SFS), struggle with sparse camera setups and small human targets, making it challenging to obtain complete and accurate human representations. Despite advances in differentiable rendering, including 3D Gaussian Splatting (3DGS), which can produce photorealistic novel-view renderings with dense inputs, accurate depiction of surfaces and generation of detailed meshes is still challenging. Our approach uniquely combines 3DGS's view synthesis with an optimized SFS method, thereby significantly enhancing the quality of multiperson mesh reconstruction in large-scale sports scenes. Specifically, we introduce body shape priors, including the human surface point clouds extracted through SFS and human silhouettes, to constrain 3DGS to a more accurate representation of the human body only. Then, we develop an improved mesh reconstruction method based on SFS, mainly by adding additional viewpoints through 3DGS and obtaining a more accurate surface to achieve higher-quality reconstruction models. We implement a high-density scene resampling strategy based on spherical sampling of human bounding boxes and render new perspectives using 3D Gaussian Splatting to create precise and dense multi-view human silhouettes. During mesh reconstruction, we integrate the human body's 2D Signed Distance Function (SDF) into the computation of the SFS's implicit surface field, resulting in smoother and more accurate surfaces. Moreover, we enhance mesh texture mapping by blending original and rendered images with different weights, preserving high-quality textures while compensating for missing details. The experimental results from real basketball game scenarios demonstrate the significant improvements of our approach for multiple human body model reconstruction in complex sports settings.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11095-11110"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
RCVS: A Unified Registration and Fusion Framework for Video Streams
IF 8.4 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-14 · DOI: 10.1109/TMM.2024.3443673
Housheng Xie;Meng Sang;Yukuan Zhang;Yang Yang;Shan Zhao;Jianbo Zhong
Infrared-visible cross-modal registration and fusion can generate more comprehensive representations of object and scene information. Previous frameworks primarily focus on addressing modality disparities and the impact of preserving diverse modality information on the performance of registration and fusion across different static image pairs. However, these frameworks overlook practical deployment on real-world devices, particularly in the context of video streams. Consequently, the resulting video streams often suffer from instability in registration and fusion, characterized by fusion artifacts and inter-frame jitter. In light of these considerations, this paper proposes a unified registration and fusion scheme for video streams, termed RCVS. It utilizes a robust matcher and a spatial-temporal calibration module to achieve stable registration of video sequences. Subsequently, RCVS incorporates a fast, lightweight fusion network to provide stable fused video streams for infrared and visible imaging. Additionally, we collect an infrared and visible video dataset, HDO, which comprises high-quality infrared and visible video data captured across diverse scenes. Our RCVS exhibits superior performance in video stream registration and fusion tasks, adapting well to real-world demands. Overall, our proposed framework and HDO dataset offer the first effective and comprehensive benchmark in this field, solving stability and real-time challenges in infrared and visible video stream fusion while assessing the performance of different solutions to foster development in this area.
{"title":"RCVS: A Unified Registration and Fusion Framework for Video Streams","authors":"Housheng Xie;Meng Sang;Yukuan Zhang;Yang Yang;Shan Zhao;Jianbo Zhong","doi":"10.1109/TMM.2024.3443673","DOIUrl":"10.1109/TMM.2024.3443673","url":null,"abstract":"The infrared and visible cross-modal registration and fusion can generate more comprehensive representations of object and scene information. Previous frameworks primarily focus on addressing the modality disparities and the impact of preserving diverse modality information on the performance of registration and fusion tasks among different static image pairs. However, these frameworks overlook the practical deployment on real-world devices, particularly in the context of video streams. Consequently, the resulting video streams often suffer from instability in registration and fusion, characterized by fusion artifacts and inter-frame jitter. In light of these considerations, this paper proposes a unified registration and fusion scheme for video streams, termed RCVS. It utilizes a robust matcher and spatial-temporal calibration module to achieve stable registration of video sequences. Subsequently, RCVS combines a fast lightweight fusion network to provide stable fusion video streams for infrared and visible imaging. Additionally, we collect a infrared and visible video dataset HDO, which comprises high-quality infrared and visible video data captured across diverse scenes. Our RCVS exhibits superior performance in video stream registration and fusion tasks, adapting well to real-world demands. Overall, our proposed framework and HDO dataset offer the first effective and comprehensive benchmark in this field, solving stability and real-time challenges in infrared and visible video stream fusion while assessing different solution performances to foster development in this area.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11031-11043"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
BI-AVAN: A Brain-Inspired Adversarial Visual Attention Network for Characterizing Human Visual Attention From Neural Activity
IF 8.4 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-14 · DOI: 10.1109/TMM.2024.3443623
Heng Huang;Lin Zhao;Haixing Dai;Lu Zhang;Xintao Hu;Dajiang Zhu;Tianming Liu
Visual attention is a fundamental mechanism in the human brain, and it inspires the design of attention mechanisms in deep neural networks. However, most of the visual attention studies adopted eye-tracking data rather than the direct measurement of brain activity to characterize human visual attention. In addition, the adversarial relationship between the attention-related objects and attention-neglected background in the human visual system was not fully exploited. To bridge these gaps, we propose a novel brain-inspired adversarial visual attention network (BI-AVAN) to characterize human visual attention directly from functional brain activity. Our BI-AVAN model imitates the biased competition process between attention-related/neglected objects to identify and locate the visual objects in a movie frame the human brain focuses on in an unsupervised manner. We use independent eye-tracking data as ground truth for validation and experimental results show that our model achieves robust and promising results when inferring meaningful human visual attention and mapping the relationship between brain activities and visual stimuli. Our BI-AVAN model contributes to the emerging field of leveraging the brain's functional architecture to inspire and guide the model design in artificial intelligence (AI), e.g., deep neural networks.
{"title":"BI-AVAN: A Brain-Inspired Adversarial Visual Attention Network for Characterizing Human Visual Attention From Neural Activity","authors":"Heng Huang;Lin Zhao;Haixing Dai;Lu Zhang;Xintao Hu;Dajiang Zhu;Tianming Liu","doi":"10.1109/TMM.2024.3443623","DOIUrl":"10.1109/TMM.2024.3443623","url":null,"abstract":"Visual attention is a fundamental mechanism in the human brain, and it inspires the design of attention mechanisms in deep neural networks. However, most of the visual attention studies adopted eye-tracking data rather than the direct measurement of brain activity to characterize human visual attention. In addition, the adversarial relationship between the attention-related objects and attention-neglected background in the human visual system was not fully exploited. To bridge these gaps, we propose a novel brain-inspired adversarial visual attention network (BI-AVAN) to characterize human visual attention directly from functional brain activity. Our BI-AVAN model imitates the biased competition process between attention-related/neglected objects to identify and locate the visual objects in a movie frame the human brain focuses on in an unsupervised manner. We use independent eye-tracking data as ground truth for validation and experimental results show that our model achieves robust and promising results when inferring meaningful human visual attention and mapping the relationship between brain activities and visual stimuli. Our BI-AVAN model contributes to the emerging field of leveraging the brain's functional architecture to inspire and guide the model design in artificial intelligence (AI), e.g., deep neural networks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11191-11203"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0