Learning to build 3D scene graphs is essential for structured and rich perception of real-world scenes. However, previous 3D scene graph generation methods rely on fully supervised learning and require large amounts of entity-level annotations for objects and relations, which are extremely resource-consuming and tedious to obtain. To tackle this problem, we propose 3D-VLAP, a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling. Specifically, 3D-VLAP exploits the strong ability of current large-scale visual-linguistic models to align the semantics of texts and 2D images, together with the naturally existing correspondences between 2D images and 3D point clouds, to implicitly construct correspondences between texts and 3D point clouds. First, we establish the positional correspondence from 3D point clouds to 2D images via camera intrinsic and extrinsic parameters, thereby aligning 3D point clouds with 2D images. Subsequently, a large-scale cross-modal visual-linguistic model is employed to indirectly align 3D instances with the textual category labels of objects by matching 2D images with object category labels. Pseudo labels for objects and relations are then produced for training 3D-VLAP by computing the similarity between the visual embeddings and the textual category embeddings of objects and relations encoded by the visual-linguistic model. Finally, we design an edge self-attention based graph neural network to generate scene graphs of 3D point clouds. Experiments demonstrate that 3D-VLAP achieves results comparable to current fully supervised methods while greatly alleviating the data annotation burden.
{"title":"Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-Labeling","authors":"Xu Wang;Yifan Li;Qiudan Zhang;Wenhui Wu;Mark Junjie Li;Lin Ma;Jianmin Jiang","doi":"10.1109/TMM.2024.3443670","DOIUrl":"10.1109/TMM.2024.3443670","url":null,"abstract":"Learning to build 3D scene graphs is essential for real-world perception in a structured and rich fashion. However, previous 3D scene graph generation methods utilize a fully supervised learning manner and require a large amount of entity-level annotation data of objects and relations, which is extremely resource-consuming and tedious to obtain. To tackle this problem, we propose 3D-VLAP, a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling. Specifically, our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds. First, we establish the positional correspondence from 3D point clouds to 2D images via camera intrinsic and extrinsic parameters, thereby achieving alignment of 3D point clouds and 2D images. Subsequently, a large-scale cross-modal visual-linguistic model is employed to indirectly align 3D instances with the textual category labels of objects by matching 2D images with object category labels. The pseudo labels for objects and relations are then produced for 3D-VLAP model training by calculating the similarity between visual embeddings and textual category embeddings of objects and relations encoded by the visual-linguistic model, respectively. Ultimately, we design an edge self-attention based graph neural network to generate scene graphs of 3D point clouds. Experiments demonstrate that our 3D-VLAP achieves comparable results with current fully supervised methods, meanwhile alleviating the data annotation pressure.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11164-11175"},"PeriodicalIF":8.4,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-15 | DOI: 10.1109/TMM.2024.3443664
Zhe Zhang;Yi Yu;Atsuhiro Takasu
Melody-to-lyrics generation at the syllable level is an intriguing and challenging topic at the intersection of music, multimedia, and machine learning. Many previous research projects generate word-level lyrics sequences due to the lack of alignments between syllables and musical notes. Moreover, controllable lyrics generation from melody is less explored, yet it is important for helping users produce diverse, desired lyrics. In this work, we propose a controllable melody-to-lyrics model that generates syllable-level lyrics with user-desired rhythm. An explicit n-gram (EXPLING) loss is proposed to train the Transformer-based model to capture the sequence dependency and alignment relationship between melody and lyrics and to predict lyrics sequences at the syllable level. A prior attention mechanism is proposed to enhance the controllability and diversity of lyrics generation. Experiments and evaluation metrics verify that the proposed model generates higher-quality lyrics than previous methods and can interact with users for controllable and diverse lyrics generation. We believe this work provides valuable insights into human-centered AI research in music generation tasks. The source code for this work will be made publicly available for further reference and exploration.
{"title":"Controllable Syllable-Level Lyrics Generation From Melody With Prior Attention","authors":"Zhe Zhang;Yi Yu;Atsuhiro Takasu","doi":"10.1109/TMM.2024.3443664","DOIUrl":"10.1109/TMM.2024.3443664","url":null,"abstract":"Melody-to-lyrics generation, which is based on syllable-level generation, is an intriguing and challenging topic in the interdisciplinary field of music, multimedia, and machine learning. Many previous research projects generate word-level lyrics sequences due to the lack of alignments between syllables and musical notes. Moreover, controllable lyrics generation from melody is also less explored but important for facilitating humans to generate diverse desired lyrics. In this work, we propose a controllable melody-to-lyrics model that is able to generate syllable-level lyrics with user-desired rhythm. An explicit n-gram (EXPLING) loss is proposed to train the Transformer-based model to capture the sequence dependency and alignment relationship between melody and lyrics and predict the lyrics sequences at the syllable level. A prior attention mechanism is proposed to enhance the controllability and diversity of lyrics generation. Experiments and evaluation metrics verified that our proposed model has the ability to generate higher-quality lyrics than previous methods and the feasibility of interacting with users for controllable and diverse lyrics generation. We believe this work provides valuable insights into human-centered AI research in music generation tasks. The source codes for this work will be made publicly available for further reference and exploration.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11083-11094"},"PeriodicalIF":8.4,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10637751","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep metric learning (DML) aims to learn a discriminative high-dimensional embedding space for downstream tasks like classification, clustering, and retrieval. Prior literature predominantly focuses on pair-based and proxy-based methods to maximize inter-class discrepancy and minimize intra-class diversity. However, these methods tend to suffer from the collapse of the embedding space due to their over-reliance on label information. This leads to sub-optimal feature representation and inferior model performance. To maintain the structure of embedding space and avoid feature collapse, we propose a novel loss function called Anti-Collapse Loss. Specifically, our proposed loss primarily draws inspiration from the principle of Maximal Coding Rate Reduction. It promotes the sparseness of feature clusters in the embedding space to prevent collapse by maximizing the average coding rate of sample features or class proxies. Moreover, we integrate our proposed loss with pair-based and proxy-based methods, resulting in notable performance improvement. Comprehensive experiments on benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art methods. Extensive ablation studies verify the effectiveness of our method in preventing embedding space collapse and promoting generalization performance.
{"title":"Anti-Collapse Loss for Deep Metric Learning","authors":"Xiruo Jiang;Yazhou Yao;Xili Dai;Fumin Shen;Liqiang Nie;Heng-Tao Shen","doi":"10.1109/TMM.2024.3443616","DOIUrl":"10.1109/TMM.2024.3443616","url":null,"abstract":"Deep metric learning (DML) aims to learn a discriminative high-dimensional embedding space for downstream tasks like classification, clustering, and retrieval. Prior literature predominantly focuses on pair-based and proxy-based methods to maximize inter-class discrepancy and minimize intra-class diversity. However, these methods tend to suffer from the collapse of the embedding space due to their over-reliance on label information. This leads to sub-optimal feature representation and inferior model performance. To maintain the structure of embedding space and avoid feature collapse, we propose a novel loss function called Anti-Collapse Loss. Specifically, our proposed loss primarily draws inspiration from the principle of Maximal Coding Rate Reduction. It promotes the sparseness of feature clusters in the embedding space to prevent collapse by maximizing the average coding rate of sample features or class proxies. Moreover, we integrate our proposed loss with pair-based and proxy-based methods, resulting in notable performance improvement. Comprehensive experiments on benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art methods. Extensive ablation studies verify the effectiveness of our method in preventing embedding space collapse and promoting generalization performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11139-11150"},"PeriodicalIF":8.4,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-14 | DOI: 10.1109/TMM.2024.3443672
Di Wang;Xiantao Lu;Quan Wang;Yumin Tian;Bo Wan;Lihuo He
Video moment retrieval (VMR) aims to locate the moment in an untrimmed video that corresponds to a given natural language query. While most existing approaches treat this task as a cross-modal content matching or boundary prediction problem, recent studies have started to solve VMR from a reading comprehension perspective. However, the cross-modal interaction processes of existing models are either insufficient or overly complex. Therefore, we reanalyze human behaviors in the document fragment location task of reading comprehension and design a specific module for each behavior, yielding a 3-level human-like moment retrieval framework (Tri-MRF). Specifically, we identify behaviors such as grasping the general structures of the document and the question separately, cross-scanning to mark the direct correspondences between keywords in the document and in the question, and summarizing to obtain the overall correspondences between document fragments and the question. Correspondingly, the proposed Tri-MRF model contains three modules: 1) a gist-oriented intra-modal comprehension module that establishes contextual dependencies within each modality; 2) a content-oriented fine-grained comprehension module that explores direct correspondences between clips and words; and 3) a target-oriented integrated comprehension module that verifies the overall correspondence between candidate moments and the query. In addition, we introduce a biconnected GCN feature enhancement module to optimize query-guided moment representations. Extensive experiments on three benchmarks, TACoS, ActivityNet Captions, and Charades-STA, demonstrate that the proposed framework outperforms state-of-the-art methods.
{"title":"Gist, Content, Target-Oriented: A 3-Level Human-Like Framework for Video Moment Retrieval","authors":"Di Wang;Xiantao Lu;Quan Wang;Yumin Tian;Bo Wan;Lihuo He","doi":"10.1109/TMM.2024.3443672","DOIUrl":"10.1109/TMM.2024.3443672","url":null,"abstract":"Video moment retrieval (VMR) aims to locate corresponding moments in an untrimmed video via a given natural language query. While most existing approaches treat this task as a cross-modal content matching or boundary prediction problem, recent studies have started to solve the VMR problem from a reading comprehension perspective. However, the cross-modal interaction processes of existing models are either insufficient or overly complex. Therefore, we reanalyze human behaviors in the document fragment location task of reading comprehension, and design a specific module for each behavior to propose a 3-level human-like moment retrieval framework (Tri-MRF). Specifically, we summarize human behaviors such as grasping the general structures of the document and the question separately, cross-scanning to mark the direct correspondences between keywords in the document and in the question, and summarizing to obtain the overall correspondences between document fragments and the question. Correspondingly, the proposed Tri-MRF model contains three modules: 1) a gist-oriented intra-modal comprehension module is used to establish contextual dependencies within each modality; 2) a content-oriented fine-grained comprehension module is used to explore direct correspondences between clips and words; and 3) a target-oriented integrated comprehension module is used to verify the overall correspondence between the candidate moments and the query. In addition, we introduce a biconnected GCN feature enhancement module to optimize query-guided moment representations. Extensive experiments conducted on three benchmarks, TACoS, ActivityNet Captions and Charades-STA demonstrate that the proposed framework outperforms State-of-the-Art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11044-11056"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-14 | DOI: 10.1109/TMM.2024.3443591
Yonghao Dong;Le Wang;Sanping Zhou;Gang Hua;Changyin Sun
Pedestrian trajectory prediction in a first-person view has recently attracted much attention due to its importance in autonomous driving. Recent work utilizes pedestrian character information, i.e., action and appearance, to improve the learned trajectory embedding and achieves state-of-the-art performance. However, it neglects invalid and negative pedestrian character information, which harms the trajectory representation and thus degrades performance. To address this issue, we present a two-stream sparse-character-based network (TSNet) for pedestrian trajectory prediction. Specifically, TSNet learns negative-removed characters in the sparse character representation stream to improve the trajectory embedding obtained in the trajectory representation stream. Moreover, to model the negative-removed characters, we propose a novel sparse character graph, comprising sparse category and sparse temporal character graphs, to learn the different effects of various characters in the category and temporal dimensions, respectively. Extensive experiments on two first-person view datasets, PIE and JAAD, show that our method outperforms existing state-of-the-art methods. In addition, ablation studies demonstrate the different effects of various characters and show that TSNet outperforms approaches that do not eliminate negative characters.
{"title":"Sparse Pedestrian Character Learning for Trajectory Prediction","authors":"Yonghao Dong;Le Wang;Sanping Zhou;Gang Hua;Changyin Sun","doi":"10.1109/TMM.2024.3443591","DOIUrl":"10.1109/TMM.2024.3443591","url":null,"abstract":"Pedestrian trajectory prediction in a first-person view has recently attracted much attention due to its importance in autonomous driving. Recent work utilizes pedestrian character information, i.e., action and appearance, to improve the learned trajectory embedding and achieves state-of-the-art performance. However, it neglects the invalid and negative pedestrian character information, which is harmful to trajectory representation and thus leads to performance degradation. To address this issue, we present a two-stream sparse-character-based network (TSNet) for pedestrian trajectory prediction. Specifically, TSNet learns the negative-removed characters in the sparse character representation stream to improve the trajectory embedding obtained in the trajectory representation stream. Moreover, to model the negative-removed characters, we propose a novel sparse character graph, including the sparse category and sparse temporal character graphs, to learn the different effects of various characters in category and temporal dimensions, respectively. Extensive experiments on two first-person view datasets, PIE and JAAD, show that our method outperforms existing state-of-the-art methods. In addition, ablation studies demonstrate different effects of various characters and prove that TSNet outperforms approaches without eliminating negative characters.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11070-11082"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-14 | DOI: 10.1109/TMM.2024.3443609
Shuoyao Wang;Jiawei Lin;Yu Dai
With the advancement of wireless technology, the fifth-generation mobile communication network (5G) can provide exceptionally high bandwidth to support high-quality video streaming services. Nevertheless, 5G links exhibit substantial fluctuations, posing a significant challenge to the reliability of video streaming services. This research introduces a novel algorithm, the Multi-type data perception-based Meta-learning-enabled adaptive Video Streaming algorithm (MMVS), designed to adapt to diverse network conditions, encompassing 3G and mmWave 5G networks. The proposed algorithm integrates the proximal policy optimization technique with a meta-learning framework to cope with gradient estimation noise under network fluctuation. To further improve robustness, MMVS introduces meta advantage normalization. Additionally, MMVS treats network information as multiple types of input data, enabling dedicated network structures to be defined for perceiving each type accurately. Experimental results on real-world network trace datasets show that MMVS delivers an additional 6% average QoE in mmWave 5G networks and outperforms representative benchmarks across six pairs of heterogeneous networks and user preferences.
{"title":"MMVS: Enabling Robust Adaptive Video Streaming for Wildly Fluctuating and Heterogeneous Networks","authors":"Shuoyao Wang;Jiawei Lin;Yu Dai","doi":"10.1109/TMM.2024.3443609","DOIUrl":"10.1109/TMM.2024.3443609","url":null,"abstract":"With the advancement of wireless technology, the fifth-generation mobile communication network (5G) has the capability to provide exceptionally high bandwidth for supporting high-quality video streaming services. Nevertheless, this network exhibits substantial fluctuations, posing a significant challenge in ensuring the reliability of video streaming services. This research introduces a novel algorithm, the Multi-type data perception-based Meta-learning-enabled adaptive Video Streaming algorithm (MMVS), designed to adapt to diverse network conditions, encompassing 3G and mmWave 5G networks. The proposed algorithm integrates the proximal policy optimization technique with the meta-learning framework to cope with the gradient estimation noise in network fluctuation. To further improve the robustness of the algorithm, MMVS introduces meta advantage normalization. Additionally, MMVS treats network information as multiple types of input data, thus enabling the precise definition of distinct network structures for perceiving them accurately. The experimental results on network trace datasets in real-world scenarios illustrate that MMVS is capable of delivering an additional 6% average QoE in mmWave 5G network, and outperform the representative benchmarks in six pairs of heterogeneous networks and user preferences.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11018-11030"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-14 | DOI: 10.1109/TMM.2024.3443661
Ruomei Wang;Yuanmao Luo;Fuwei Zhang;Mingyang Liu;Xiaonan Luo
Video question answering is a challenging task that requires models to recognize visual information in videos and perform spatio-temporal reasoning. Current models increasingly focus on enabling object-level spatio-temporal reasoning via graph neural networks. However, existing graph network-based models still have deficiencies when constructing spatio-temporal relationships between objects: (1) spatio-temporal constraints between objects are not considered when defining the adjacency relationship; and (2) the semantic correlation between objects is not fully considered when generating edge weights. As a result, these models lack a representation of spatio-temporal interaction between objects, which directly limits object relation reasoning. To solve these problems, this paper designs a heuristic semantics-constrained spatio-temporal heterogeneous graph, employing a semantic consistency-aware strategy to construct the spatio-temporal interaction between objects. The spatio-temporal relationship between objects is constrained by the object co-occurrence relationship and object consistency. Plot summaries and object locations are used as heuristic semantic priors to constrain the weights of spatial and temporal edges. The spatio-temporal heterogeneous graph more accurately restores the spatio-temporal relationships between objects and strengthens the model's object-level spatio-temporal reasoning ability. Based on this graph, this paper proposes the Heuristic Semantics-constrained Spatio-temporal Heterogeneous Graph for VideoQA (HSSHG), which achieves state-of-the-art performance on the benchmark MSVD-QA and FrameQA datasets and demonstrates competitive results on the benchmark MSRVTT-QA and ActivityNet-QA datasets. Extensive ablation experiments verify the effectiveness of each component in the network and the rationality of the hyperparameter settings, and qualitative analysis verifies the object-level spatio-temporal reasoning ability of HSSHG.
{"title":"HSSHG: Heuristic Semantics-Constrained Spatio-Temporal Heterogeneous Graph for VideoQA","authors":"Ruomei Wang;Yuanmao Luo;Fuwei Zhang;Mingyang Liu;Xiaonan Luo","doi":"10.1109/TMM.2024.3443661","DOIUrl":"10.1109/TMM.2024.3443661","url":null,"abstract":"Video question answering is a challenging task that requires models to recognize visual information in videos and perform spatio-temporal reasoning. Current models increasingly focus on enabling objects spatio-temporal reasoning via graph neural networks. However, the existing graph network-based models still have deficiencies when constructing the spatio-temporal relationship between objects: (1) The lack of consideration of the spatio-temporal constraints between objects when defining the adjacency relationship; (2) The semantic correlation between objects is not fully considered when generating edge weights. These make the model lack representation of spatio-temporal interaction between objects, which directly affects the ability of object relation reasoning. To solve the above problems, this paper designs a heuristic semantics-constrained spatio-temporal heterogeneous graph, employing a semantic consistency-aware strategy to construct the spatio-temporal interaction between objects. The spatio-temporal relationship between objects is constrained by the object co-occurrence relationship and the object consistency. The plot summaries and object locations are used as heuristic semantic priors to constrain the weights of spatial and temporal edges. The spatio-temporal heterogeneity graph more accurately restores the spatio-temporal relationship between objects and strengthens the model's object spatio-temporal reasoning ability. Based on the spatio-temporal heterogeneous graph, this paper proposes Heuristic Semantics-constrained Spatio-temporal Heterogeneous Graph for VideoQA (HSSHG), which achieves state-of-the-art performance on benchmark MSVD-QA and FrameQA datasets, and demonstrates competitive results on benchmark MSRVTT-QA and ActivityNet-QA dataset. Extensive ablation experiments verify the effectiveness of each component in the network and the rationality of hyperparameter settings, and qualitative analysis verifies the object-level spatio-temporal reasoning ability of HSSHG.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11176-11190"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-14 | DOI: 10.1109/TMM.2024.3443637
Yuqi Jiang;Jing Li;Haidong Qin;Yanran Dai;Jing Liu;Guodong Zhang;Canbin Zhang;Tao Yang
We introduce GS-SFS, a method that utilizes a camera array with wide baselines for high-quality reconstruction of multiple human meshes in large-scale sports scenes. Traditional human reconstruction methods in sports scenes, such as Shape-from-Silhouette (SFS), struggle with sparse camera setups and small human targets, making it challenging to obtain complete and accurate human representations. Despite advances in differentiable rendering, including 3D Gaussian Splatting (3DGS), which can produce photorealistic novel-view renderings from dense inputs, accurately depicting surfaces and generating detailed meshes remains challenging. Our approach combines 3DGS's view synthesis with an optimized SFS method, thereby significantly enhancing the quality of multi-person mesh reconstruction in large-scale sports scenes. Specifically, we introduce body shape priors, including human surface point clouds extracted through SFS and human silhouettes, to constrain 3DGS to a more accurate representation of the human body only. We then develop an improved SFS-based mesh reconstruction method, mainly by adding viewpoints rendered through 3DGS and obtaining a more accurate surface, to achieve higher-quality reconstructed models. We implement a high-density scene resampling strategy based on spherical sampling of human bounding boxes and render new perspectives using 3D Gaussian Splatting to create precise and dense multi-view human silhouettes. During mesh reconstruction, we integrate the human body's 2D Signed Distance Function (SDF) into the computation of the SFS implicit surface field, resulting in smoother and more accurate surfaces. Moreover, we enhance mesh texture mapping by blending original and rendered images with different weights, preserving high-quality textures while compensating for missing details. Experimental results from real basketball game scenarios demonstrate significant improvements of our approach for reconstructing multiple human body models in complex sports settings.
{"title":"GS-SFS: Joint Gaussian Splatting and Shape-From-Silhouette for Multiple Human Reconstruction in Large-Scale Sports Scenes","authors":"Yuqi Jiang;Jing Li;Haidong Qin;Yanran Dai;Jing Liu;Guodong Zhang;Canbin Zhang;Tao Yang","doi":"10.1109/TMM.2024.3443637","DOIUrl":"10.1109/TMM.2024.3443637","url":null,"abstract":"We introduce GS-SFS, a method that utilizes a camera array with wide baselines for high-quality multiple human mesh reconstruction in large-scale sports scenes. Traditional human reconstruction methods in sports scenes, such as Shape-from-Silhouette (SFS), struggle with sparse camera setups and small human targets, making it challenging to obtain complete and accurate human representations. Despite advances in differentiable rendering, including 3D Gaussian Splatting (3DGS), which can produce photorealistic novel-view renderings with dense inputs, accurate depiction of surfaces and generation of detailed meshes is still challenging. Our approach uniquely combines 3DGS's view synthesis with an optimized SFS method, thereby significantly enhancing the quality of multiperson mesh reconstruction in large-scale sports scenes. Specifically, we introduce body shape priors, including the human surface point clouds extracted through SFS and human silhouettes, to constrain 3DGS to a more accurate representation of the human body only. Then, we develop an improved mesh reconstruction method based on SFS, mainly by adding additional viewpoints through 3DGS and obtaining a more accurate surface to achieve higher-quality reconstruction models. We implement a high-density scene resampling strategy based on spherical sampling of human bounding boxes and render new perspectives using 3D Gaussian Splatting to create precise and dense multi-view human silhouettes. During mesh reconstruction, we integrate the human body's 2D Signed Distance Function (SDF) into the computation of the SFS's implicit surface field, resulting in smoother and more accurate surfaces. Moreover, we enhance mesh texture mapping by blending original and rendered images with different weights, preserving high-quality textures while compensating for missing details. The experimental results from real basketball game scenarios demonstrate the significant improvements of our approach for multiple human body model reconstruction in complex sports settings.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11095-11110"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-modal registration and fusion of infrared and visible imagery can generate more comprehensive representations of object and scene information. Previous frameworks primarily address the modality disparities between individual static image pairs and the impact of preserving diverse modality information on registration and fusion performance. However, they overlook practical deployment on real-world devices, particularly for video streams. Consequently, the resulting video streams often suffer from unstable registration and fusion, characterized by fusion artifacts and inter-frame jitter. In light of these considerations, this paper proposes a unified registration and fusion scheme for video streams, termed RCVS. It utilizes a robust matcher and a spatial-temporal calibration module to achieve stable registration of video sequences. RCVS then combines a fast, lightweight fusion network to provide stable fused video streams for infrared and visible imaging. Additionally, we collect an infrared and visible video dataset, HDO, which comprises high-quality infrared and visible video data captured across diverse scenes. RCVS exhibits superior performance in video stream registration and fusion tasks, adapting well to real-world demands. Overall, the proposed framework and the HDO dataset offer the first effective and comprehensive benchmark in this field, addressing the stability and real-time challenges of infrared and visible video stream fusion while assessing the performance of different solutions to foster development in this area.
{"title":"RCVS: A Unified Registration and Fusion Framework for Video Streams","authors":"Housheng Xie;Meng Sang;Yukuan Zhang;Yang Yang;Shan Zhao;Jianbo Zhong","doi":"10.1109/TMM.2024.3443673","DOIUrl":"10.1109/TMM.2024.3443673","url":null,"abstract":"The infrared and visible cross-modal registration and fusion can generate more comprehensive representations of object and scene information. Previous frameworks primarily focus on addressing the modality disparities and the impact of preserving diverse modality information on the performance of registration and fusion tasks among different static image pairs. However, these frameworks overlook the practical deployment on real-world devices, particularly in the context of video streams. Consequently, the resulting video streams often suffer from instability in registration and fusion, characterized by fusion artifacts and inter-frame jitter. In light of these considerations, this paper proposes a unified registration and fusion scheme for video streams, termed RCVS. It utilizes a robust matcher and spatial-temporal calibration module to achieve stable registration of video sequences. Subsequently, RCVS combines a fast lightweight fusion network to provide stable fusion video streams for infrared and visible imaging. Additionally, we collect a infrared and visible video dataset HDO, which comprises high-quality infrared and visible video data captured across diverse scenes. Our RCVS exhibits superior performance in video stream registration and fusion tasks, adapting well to real-world demands. Overall, our proposed framework and HDO dataset offer the first effective and comprehensive benchmark in this field, solving stability and real-time challenges in infrared and visible video stream fusion while assessing different solution performances to foster development in this area.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11031-11043"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-14 | DOI: 10.1109/TMM.2024.3443623
Heng Huang;Lin Zhao;Haixing Dai;Lu Zhang;Xintao Hu;Dajiang Zhu;Tianming Liu
Visual attention is a fundamental mechanism in the human brain, and it inspires the design of attention mechanisms in deep neural networks. However, most visual attention studies have adopted eye-tracking data rather than direct measurements of brain activity to characterize human visual attention. In addition, the adversarial relationship between attention-related objects and the attention-neglected background in the human visual system has not been fully exploited. To bridge these gaps, we propose a novel brain-inspired adversarial visual attention network (BI-AVAN) that characterizes human visual attention directly from functional brain activity. Our BI-AVAN model imitates the biased competition between attention-related and attention-neglected objects to identify and locate, in an unsupervised manner, the visual objects in a movie frame that the human brain focuses on. We use independent eye-tracking data as ground truth for validation, and experimental results show that our model achieves robust and promising results when inferring meaningful human visual attention and mapping the relationship between brain activities and visual stimuli. BI-AVAN contributes to the emerging field of leveraging the brain's functional architecture to inspire and guide model design in artificial intelligence (AI), e.g., deep neural networks.
{"title":"BI-AVAN: A Brain-Inspired Adversarial Visual Attention Network for Characterizing Human Visual Attention From Neural Activity","authors":"Heng Huang;Lin Zhao;Haixing Dai;Lu Zhang;Xintao Hu;Dajiang Zhu;Tianming Liu","doi":"10.1109/TMM.2024.3443623","DOIUrl":"10.1109/TMM.2024.3443623","url":null,"abstract":"Visual attention is a fundamental mechanism in the human brain, and it inspires the design of attention mechanisms in deep neural networks. However, most of the visual attention studies adopted eye-tracking data rather than the direct measurement of brain activity to characterize human visual attention. In addition, the adversarial relationship between the attention-related objects and attention-neglected background in the human visual system was not fully exploited. To bridge these gaps, we propose a novel brain-inspired adversarial visual attention network (BI-AVAN) to characterize human visual attention directly from functional brain activity. Our BI-AVAN model imitates the biased competition process between attention-related/neglected objects to identify and locate the visual objects in a movie frame the human brain focuses on in an unsupervised manner. We use independent eye-tracking data as ground truth for validation and experimental results show that our model achieves robust and promising results when inferring meaningful human visual attention and mapping the relationship between brain activities and visual stimuli. Our BI-AVAN model contributes to the emerging field of leveraging the brain's functional architecture to inspire and guide the model design in artificial intelligence (AI), e.g., deep neural networks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11191-11203"},"PeriodicalIF":8.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}