Pub Date: 2024-12-24 | DOI: 10.1109/TMM.2024.3521678
Yue Dai;Shihui Ying;Yue Gao
Rotation-invariant point cloud analysis is essential for many real-world applications where objects can appear in arbitrary orientations. Traditional local rotation-invariant methods rely on lossy region descriptors, limiting the global comprehension of 3D objects. Conversely, global features derived from pose alignment can capture complementary information. To leverage both local and global consistency for enhanced accuracy, we propose the Global-Local-Consistent Hypergraph Cross-Attention Network (GLC-HCAN). This framework includes the Global Consistent Feature (GCF) representation branch, the Local Consistent Feature (LCF) representation branch, and the Hypergraph Cross-Attention (HyperCA) network to model complex correlations through global-local-consistent hypergraph representation learning. Specifically, the GCF branch employs a multi-pose grouping and aggregation strategy based on PCA for improved global comprehension. Simultaneously, the LCF branch uses local farthest reference point features to enhance local region descriptions. To capture high-order and complex global-local correlations, we construct hypergraphs that integrate both features, mutually enhancing and fusing the representations. The inductive HyperCA module leverages attention techniques to better utilize these high-order relations for comprehensive understanding. Consequently, GLC-HCAN offers an effective and robust rotation-invariant point cloud analysis network, suitable for object classification and shape retrieval tasks in SO(3). Experimental results on both synthetic and scanned point cloud datasets demonstrate that GLC-HCAN outperforms state-of-the-art methods.
{"title":"Exploring Local and Global Consistent Correlation on Hypergraph for Rotation Invariant Point Cloud Analysis","authors":"Yue Dai;Shihui Ying;Yue Gao","doi":"10.1109/TMM.2024.3521678","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521678","url":null,"abstract":"Rotation invariant point cloud analysis is essential for many real-world applications where objects can appear in arbitrary orientations. Traditional local rotation-invariant methods rely on lossy region descriptors, limiting the global comprehension of 3D objects. Conversely, global features derived from pose alignment can capture complementary information. To leverage both local and global consistency for enhanced accuracy, we propose the Global-Local-Consistent Hypergraph Cross-Attention Network (GLC-HCAN). This framework includes the Global Consistent Feature (GCF) representation branch, the Local Consistent Feature (LCF) representation branch, and the Hypergraph Cross-Attention (HyperCA) network to model complex correlations through the global-local-consistent hypergraph representation learning. Specifically, the GCF branch employs a multi-pose grouping and aggregation strategy based on PCA for improved global comprehension. Simultaneously, the LCF branch uses local farthest reference point features to enhance local region descriptions. To capture high-order and complex global-local correlations, we construct hypergraphs that integrate both features, mutually enhancing and fusing the representations. The inductive HyperCA module leverages attention techniques to better utilize these high-order relations for comprehensive understanding. Consequently, GLC-HCAN offers an effective and robust rotation-invariant point cloud analysis network, suitable for object classification and shape retrieval tasks in SO(3). Experimental results on both synthetic and scanned point cloud datasets demonstrate that GLC-HCAN outperforms state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"186-197"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521696
Meng Yang;Jun Chen;Xin Tian;Longsheng Wei;Jiayi Ma
Finding reliable correspondences between two-view images and recovering the camera poses are key problems in photogrammetry and image signal processing. The multilayer perceptron (MLP) is widely used in two-view correspondence learning because it is good at learning disordered sparse correspondences, but it is susceptible to dominant outliers and requires additional functional blocks to capture context information. A CNN can naturally extract local context information, but it cannot handle disordered data or extract global context and channel information. To overcome the shortcomings of MLPs and CNNs, we design a correspondence learning network based on the Transformer, named Vector Rectifier Transformer (VRTNet). The Transformer is an encoder-decoder structure that can handle disordered sparse correspondences and output sequences of arbitrary length. Therefore, we design two sub-Transformers in VRTNet to achieve mutual conversion between disordered and ordered correspondences. Their self-attention and cross-attention mechanisms allow VRTNet to focus on the global context relations of all correspondences. To capture local context and channel information, we propose a rectifier network (comprising a CNN and a channel attention block) as the backbone of VRTNet, which avoids the complex design of additional blocks. The rectifier network corrects the errors of ordered correspondences to obtain rectified correspondences. Finally, outliers are removed by comparing the original and rectified correspondences. VRTNet performs better than state-of-the-art methods on the tasks of relative pose estimation, outlier removal, and image registration.
{"title":"VRTNet: Vector Rectifier Transformer for Two-View Correspondence Learning","authors":"Meng Yang;Jun Chen;Xin Tian;Longsheng Wei;Jiayi Ma","doi":"10.1109/TMM.2024.3521696","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521696","url":null,"abstract":"Finding reliable correspondences in two-view image and recovering the camera poses are key problems in photogrammetry and image signal processing. Multilayer perceptron (MLP) has a wide application in two-view correspondence learning for which is good at learning disordered sparse correspondences, but it is susceptible to the dominant outliers and requires additional functional blocks to capture context information. CNN can naturally extract local context information, but it cannot handle disordered data and extract global context and channel information. In order to overcome the shortcomings of MLP and CNN, we design a correspondence learning network based on Transformer, named Vector Rectifier Transformer (VRTNet). Transformer is an encoder-decoder structure which can handle disordered sparse correspondences and output sequences of arbitrary length. Therefore, we design two sub-Transformers in VRTNet to achieve the mutual conversion between disordered and ordered correspondences. The self-attention and cross-attention mechanisms in them allow VRTNet to focus on the global context relations of all correspondences. To capture local context and channel information, we propose rectifier network (including CNN and channel attention block) as the backbone of VRTNet, which avoids the complex design of additional blocks. Rectifier network can correct the errors of ordered correspondences to obtain rectified correspondences. Finally, outliers are removed by comparing original and rectified correspondences. VRTNet performs better than the state-of-the-art methods in the tasks of relative pose estimation, outlier removal and image registration.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"515-530"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521789
Xingfeng Li;Yuangang Pan;Yuan Sun;Quansen Sun;Yinghui Sun;Ivor W. Tsang;Zhenwen Ren
Compared to static anchor selection, existing dynamic anchor learning can automatically learn more flexible anchors to improve the performance of large-scale multi-view clustering. Despite improving the flexibility of anchors, these methods do not pay sufficient attention to the alignment and fairness of the learned anchors. Specifically, within each cluster, the positions and quantities of cross-view anchors may not align, and anchors may even be absent in some clusters, leading to severe anchor misalignment and imbalance issues. These issues result in inaccurate graph fusion and a reduction in clustering performance. Besides, in practical applications, missing information caused by sensor malfunctions or data losses can further exacerbate anchor misalignment and imbalance. To overcome such challenges, a novel Incomplete Multi-view Clustering with Paired and Balanced Dynamic Anchor Learning (PBDAL) method is proposed to ensure the alignment and fairness of anchors. Unlike existing unsupervised anchor learning, we first design a paired and balanced dynamic anchor learning scheme that supervises dynamic anchors to be aligned and fair in each cluster. Meanwhile, we develop an enhanced bipartite graph tensor learning scheme to refine the paired and balanced anchors. The superiority, effectiveness, and efficiency of our method are all validated by extensive experiments on multiple public datasets.
{"title":"Incomplete Multi-View Clustering With Paired and Balanced Dynamic Anchor Learning","authors":"Xingfeng Li;Yuangang Pan;Yuan Sun;Quansen Sun;Yinghui Sun;Ivor W. Tsang;Zhenwen Ren","doi":"10.1109/TMM.2024.3521789","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521789","url":null,"abstract":"Compared to static anchor selection, existing dynamic anchor learning could automatically learn more flexible anchors to improve the performance of large-scale multi-view clustering. Despite improving the flexibility of anchors, these methods do not pay sufficient attention to the alignment and fairness of learned anchors. Specifically, within each cluster, the positions and quantities of cross-view anchors may not align, or even anchor absence in some clusters, leading to severe anchor misalignment and imbalance issues. These issues result in inaccurate graph fusion and a reduction in clustering performance. Besides, in practical applications, missing information caused by sensor malfunctions or data losses could further exacerbate anchor misalignment and imbalance. To overcome such challenges, a novel Incomplete Multi-view Clustering with <bold>Paired and Balanced Dynamic Anchor Learning (PBDAL)</b> is proposed to ensure the alignment and fairness of anchors. Unlike existing unsupervised anchor learning, we first design a paired and balanced dynamic anchor learning scheme to supervise dynamic anchors to be aligned and fair in each cluster. Meanwhile, we develop an enhanced bipartite graph tensor learning to refine paired and balanced anchors. Our superiority, effectiveness, and efficiency are all validated by performing extensive experiments on multiple public datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1486-1497"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521795
Zeke Zexi Hu;Xiaoming Chen;Vera Yuk Ying Chung;Yiran Shen
The effective extraction of spatial-angular features plays a crucial role in light field image super-resolution (LFSR) tasks, and the introduction of convolution and Transformers has led to significant improvement in this area. Nevertheless, due to the large 4D data volume of light field images, many existing methods opt to decompose the data into a number of lower-dimensional subspaces and apply Transformers in each subspace individually. As a side effect, these methods inadvertently restrict the self-attention mechanisms to a One-to-One scheme accessing only a limited subset of LF data, preventing comprehensive optimization over all spatial and angular cues. In this paper, we identify this limitation as subspace isolation and introduce a novel Many-to-Many Transformer (M2MT) to address it. M2MT aggregates angular information in the spatial subspace before performing the self-attention mechanism. It enables complete access to all information across all sub-aperture images (SAIs) in a light field image. Consequently, M2MT can comprehensively capture long-range correlation dependencies. With M2MT as the foundational component, we develop a simple yet effective M2MT network for LFSR. Our experimental results demonstrate that M2MT achieves state-of-the-art performance across various public datasets, and it offers a favorable balance between model performance and efficiency, yielding higher-quality LFSR results with substantially lower demand for memory and computation. We further conduct an in-depth analysis using local attribution maps (LAM) to obtain visual interpretability, and the results validate that M2MT is empowered with a truly non-local context in both the spatial and angular subspaces to mitigate subspace isolation and acquire effective spatial-angular representations.
{"title":"Beyond Subspace Isolation: Many-to-Many Transformer for Light Field Image Super-Resolution","authors":"Zeke Zexi Hu;Xiaoming Chen;Vera Yuk Ying Chung;Yiran Shen","doi":"10.1109/TMM.2024.3521795","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521795","url":null,"abstract":"The effective extraction of spatial-angular features plays a crucial role in light field image super-resolution (LFSR) tasks, and the introduction of convolution and Transformers leads to significant improvement in this area. Nevertheless, due to the large 4D data volume of light field images, many existing methods opted to decompose the data into a number of lower-dimensional subspaces and perform Transformers in each sub-space individually. As a side effect, these methods inadvertently restrict the self-attention mechanisms to a One-to-One scheme accessing only a limited subset of LF data, explicitly preventing comprehensive optimization on all spatial and angular cues. In this paper, we identify this limitation as subspace isolation and introduce a novel Many-to-Many Transformer (M2MT) to address it. M2MT aggregates angular information in the spatial subspace before performing the self-attention mechanism. It enables complete access to all information across all sub-aperture images (SAIs) in a light field image. Consequently, M2MT is enabled to comprehensively capture long-range correlation dependencies. With M2MT as the foundational component, we develop a simple yet effective M2MT network for LFSR. Our experimental results demonstrate that M2MT achieves state-of-the-art performance across various public datasets, and it offers a favorable balance between model performance and efficiency, yielding higher-quality LFSR results with substantially lower demand for memory and computation. We further conduct in-depth analysis using local attribution maps (LAM) to obtain visual interpretability, and the results validate that M2MT is empowered with a truly non-local context in both spatial and angular subspaces to mitigate subspace isolation and acquire effective spatial-angular representation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1334-1348"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521662
Yusra Alkendi;Rana Azzam;Sajid Javed;Lakmal Seneviratne;Yahya Zweiri
Moving object segmentation is critical for interpreting scene dynamics in robotic navigation systems operating in challenging environments. Neuromorphic vision sensors are tailored for motion perception due to their asynchronous nature, high temporal resolution, and reduced power consumption. However, their unconventional output requires novel perception paradigms to leverage their spatially sparse and temporally dense nature. In this work, we propose a novel event-based motion segmentation algorithm using a Graph Transformer Neural Network, dubbed GTNN. Our proposed algorithm processes event streams as 3D graphs through a series of nonlinear transformations to unveil local and global spatiotemporal correlations between events. Based on these correlations, events belonging to moving objects are segmented from the background without prior knowledge of the dynamic scene geometry. The algorithm is trained on publicly available datasets, including MOD, EV-IMO, and EV-IMO2, using the proposed training scheme to facilitate efficient training on extensive datasets. Moreover, we introduce the Dynamic Object Mask-aware Event Labeling (DOMEL) approach for generating approximate ground-truth labels for event-based motion segmentation datasets. We use DOMEL to label our own recorded Event dataset for Motion Segmentation (EMS-DOMEL), which we release to the public for further research and benchmarking. Rigorous experiments are conducted on several unseen publicly available datasets, where the results reveal that GTNN outperforms state-of-the-art methods in the presence of dynamic background variations, motion patterns, and multiple dynamic objects with varying sizes and velocities. GTNN achieves significant performance gains with an average increase of 9.4% and 4.5% in motion segmentation accuracy (IoU%) and detection rate (DR%), respectively.
{"title":"Neuromorphic Vision-Based Motion Segmentation With Graph Transformer Neural Network","authors":"Yusra Alkendi;Rana Azzam;Sajid Javed;Lakmal Seneviratne;Yahya Zweiri","doi":"10.1109/TMM.2024.3521662","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521662","url":null,"abstract":"Moving object segmentation is critical to interpret scene dynamics for robotic navigation systems in challenging environments. Neuromorphic vision sensors are tailored for motion perception due to their asynchronous nature, high temporal resolution, and reduced power consumption. However, their unconventional output requires novel perception paradigms to leverage their spatially sparse and temporally dense nature. In this work, we propose a novel event-based motion segmentation algorithm using a Graph Transformer Neural Network, dubbed GTNN. Our proposed algorithm processes event streams as 3D graphs by a series of nonlinear transformations to unveil local and global spatiotemporal correlations between events. Based on these correlations, events belonging to moving objects are segmented from the background without prior knowledge of the dynamic scene geometry. The algorithm is trained on publicly available datasets including MOD, EV-IMO, and EV-IMO2 using the proposed training scheme to facilitate efficient training on extensive datasets. Moreover, we introduce the Dynamic Object Mask-aware Event Labeling (DOMEL) approach for generating approximate ground-truth labels for event-based motion segmentation datasets. We use DOMEL to label our own recorded Event dataset for Motion Segmentation (EMS-DOMEL), which we release to the public for further research and benchmarking. Rigorous experiments are conducted on several unseen publicly-available datasets where the results revealed that GTNN outperforms state-of-the-art methods in the presence of dynamic background variations, motion patterns, and multiple dynamic objects with varying sizes and velocities. GTNN achieves significant performance gains with an average increase of 9.4% and 4.5% in terms of motion segmentation accuracy (<italic>IoU</i>%) and detection rate (<italic>DR</i>%), respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"385-400"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10812712","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521830
Wenyang Liu;Kejun Wu;Tianyi Liu;Yi Wang;Kim-Hui Yap;Lap-Pui Chau
Multimedia file fragment classification (MFFC) aims to identify file fragment types, e.g., image/video, audio, and text, without system metadata. It is of vital importance in multimedia storage and communication. Existing MFFC methods typically treat fragments as 1D byte sequences and emphasize the relations between separate bytes (interbytes) for classification. However, the more informative relations inside bytes (intrabytes) are overlooked and seldom investigated. By looking inside bytes, the bit-level details of file fragments can be accessed, enabling a more accurate classification. Motivated by this, we first propose Byte2Image, a novel visual representation model that incorporates previously overlooked intrabyte information into file fragments and reinterprets these fragments as 2D grayscale images. This model involves a sliding byte window to reveal the intrabyte information and a rowwise stacking of intrabyte n-grams for embedding fragments into a 2D space. Thus, complex interbyte and intrabyte correlations can be mined simultaneously using powerful vision networks. Additionally, we propose an end-to-end dual-branch network, ByteNet, to enhance robust correlation mining and feature representation. ByteNet makes full use of the raw 1D byte sequence and the converted 2D image through a shallow byte branch feature extraction (BBFE) network and a deep image branch feature extraction (IBFE) network. In particular, the BBFE, composed of a single fully-connected layer, adaptively recognizes the co-occurrence of specific bytes within the raw byte sequence, while the IBFE, built on a vision Transformer, effectively mines the complex interbyte and intrabyte correlations from the converted image. Experiments on two representative benchmarks covering 14 cases validate that our proposed method outperforms state-of-the-art approaches by up to 12.2%.
{"title":"ByteNet: Rethinking Multimedia File Fragment Classification Through Visual Perspectives","authors":"Wenyang Liu;Kejun Wu;Tianyi Liu;Yi Wang;Kim-Hui Yap;Lap-Pui Chau","doi":"10.1109/TMM.2024.3521830","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521830","url":null,"abstract":"Multimedia file fragment classification (MFFC) aims to identify file fragment types, e.g., image/video, audio, and text without system metadata. It is of vital importance in multimedia storage and communication. Existing MFFC methods typically treat fragments as 1D byte sequences and emphasize the relations between separate bytes (interbytes) for classification. However, the more informative relations inside bytes (intrabytes) are overlooked and seldom investigated. By looking inside bytes, the bit-level details of file fragments can be accessed, enabling a more accurate classification. Motivated by this, we first propose <bold>Byte2Image</b>, a novel visual representation model that incorporates previously overlooked intrabyte information into file fragments and reinterprets these fragments as 2D grayscale images. This model involves a sliding byte window to reveal the intrabyte information and a rowwise stacking of intrabyte n-grams for embedding fragments into a 2D space. Thus, complex interbyte and intrabyte correlations can be mined simultaneously using powerful vision networks. Additionally, we propose an end-to-end dual-branch network <bold>ByteNet</b> to enhance robust correlation mining and feature representation. ByteNet makes full use of the raw 1D byte sequence and the converted 2D image through a shallow byte branch feature extraction (BBFE) and a deep image branch feature extraction (IBFE) network. In particular, the BBFE, composed of a single fully-connected layer, adaptively recognizes the co-occurrence of several some specific bytes within the raw byte sequence, while the IBFE, built on a vision Transformer, effectively mines the complex interbyte and intrabyte correlations from the converted image. Experiments on the two representative benchmarks, including 14 cases, validate that our proposed method outperforms state-of-the-art approaches on different cases by up to 12.2%.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1305-1319"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521843
Min Liu;Zhu Zhang;Yuan Bian;Xueping Wang;Yeqing Sun;Baida Zhang;Yaonan Wang
Visible-infrared person re-identification (VI-ReID) seeks to identify and match individuals across the visible and infrared ranges within intelligent monitoring environments. Most current approaches explore a two-stream network structure that extracts global or rigidly split part features and introduce an extra modality for image compensation to guide the network in reducing the large differences between the two modalities. However, these methods are sensitive to misalignment caused by pose/viewpoint variations and to the additional noise produced when generating the extra modality. In this article, we address the above issues and propose a Cross-modality Semantic Consistency Learning (CSCL) network to excavate semantically consistent features in different modalities by utilizing human semantic information. Specifically, a Parsing-aligned Attention Module (PAM) is introduced to filter out irrelevant noise with channel-wise attention and dynamically highlight the semantic-aware representations across modalities at different stages of the network. Then, a Semantic-guided Part Alignment Module (SPAM) is introduced, aimed at efficiently producing a collection of semantically aligned fine-grained features. This is achieved by incorporating parsing loss and division loss constraints, ultimately enhancing the overall person representation. Finally, an Identity-aware Center Mining (ICM) loss is presented to reduce the discrepancy between modality centers within classes, thereby further alleviating intra-class modality discrepancies. Extensive experiments indicate that CSCL outperforms the state-of-the-art methods on the SYSU-MM01 and RegDB datasets. Notably, the Rank-1/mAP accuracy on the SYSU-MM01 dataset reaches 75.72%/72.08%.
{"title":"Cross-Modality Semantic Consistency Learning for Visible-Infrared Person Re-Identification","authors":"Min Liu;Zhu Zhang;Yuan Bian;Xueping Wang;Yeqing Sun;Baida Zhang;Yaonan Wang","doi":"10.1109/TMM.2024.3521843","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521843","url":null,"abstract":"Visible-infrared person re-identification (VI-ReID) seeks to identify and match individuals across visible and infrared ranges within intelligent monitoring environments. Most current approaches predominantly explore a two-stream network structure that extract global or rigidly split part features and introduce an extra modality for image compensation to guide networks reducing the huge differences between the two modalities. However, these methods are sensitive to misalignment caused by pose/viewpoint variations and additional noises produced by extra modality generating. Within the confines of this articles, we clearly consider addresses above issues and propose a Cross-modality Semantic Consistency Learning (CSCL) network to excavate the semantic consistent features in different modalities by utilizing human semantic information. Specifically, a Parsing-aligned Attention Module (PAM) is introduced to filter out the irrelevant noises with channel-wise attention and dynamically highlight the semantic-aware representations across modalities in different stages of the network. Then, a Semantic-guided Part Alignment Module (SPAM) is introduced, aimed at efficiently producing a collection of semantic-aligned fine-grained features. This is achieved by incorporating parsing loss and division loss constraints, ultimately enhancing the overall person representation. Finally, an Identity-aware Center Mining (ICM) loss is presented to reduce the distribution between modality centers within classes, thereby further alleviating intra-class modality discrepancies. Extensive experiments indicate that CSCL outperforms the state-of-the-art methods on the SYSU-MM01 and RegDB datasets. Notably, the Rank-1/mAP accuracy on the SYSU-MM01 dataset can achieve 75.72%/72.08%.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"568-580"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521853
Yumo Zhang;Zhanchuan Cai
To provide an image compression method with better compression performance and lower computational complexity, a new image compression algorithm is proposed in this paper. First, a double-layer non-uniform partition algorithm is proposed, which analyzes the texture complexity of image blocks and performs partitioning and merging of the image blocks at different scales to provide a priori information that helps reduce spatial redundancy in the subsequent block-wise compression. Next, by considering multiple transform cores, we propose an adaptive U transform scheme, which performs more targeted coding for different types of image blocks to enhance the coding performance. Finally, to make bit allocation more flexible and accurate, a fully adaptive quantization technique is proposed. It not only formulates the quantization coefficient relationship between image blocks of different sizes but also further refines the quantization coefficient relationship between image blocks under different topologies. Extensive experiments indicate that the compression performance of the proposed algorithm not only significantly surpasses JPEG but also surpasses some state-of-the-art compression algorithms with similar computational complexity. In addition, compared with the JPEG2000 algorithm, which has higher computational complexity, the proposed algorithm still offers certain advantages in compression performance.
{"title":"DNP-AUT: Image Compression Using Double-Layer Non-Uniform Partition and Adaptive U Transform","authors":"Yumo Zhang;Zhanchuan Cai","doi":"10.1109/TMM.2024.3521853","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521853","url":null,"abstract":"To provide an image compression method with better compression performance and lower computational complexity, a new image compression algorithm is proposed in this paper. First, a double-layer non-uniform partition algorithm is proposed, which analyzes the texture complexity of image blocks and performs partitioning and merging of the image blocks at different scales to provide a priori information that helps to reduce the spatial redundancy for subsequent compression against the blocks. Next, by considering the multi-transform cores, we propose an adaptive U transform scheme, which performs more specific coding for different types of image blocks to enhance the coding performance. Finally, in order that the bit allocation can be more flexible and accurate, a fully adaptive quantization technique is proposed. It not only formulates the quantization coefficient relationship between image blocks of different sizes but also further refines the quantization coefficient relationship between image blocks under different topologies. Extensive experiments indicate that the compression performance of the proposed algorithm not only significantly surpasses the JPEG but also surpasses some state-of-the-art compression algorithms with similar computational complexity. In addition, compared with the JPEG2000 compression algorithm, which has greater with higher computational complexity, its compression performance also has certain advantages.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"249-262"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521677
Hao Tan;Zichang Tan;Dunfang Weng;Ajian Liu;Jun Wan;Zhen Lei;Stan Z. Li
Pedestrian attribute recognition has achieved high accuracy by exploring the relations between image regions and attributes. However, existing methods typically adopt features directly extracted from the backbone or utilize a single structure (e.g., a transformer) to explore the relations, leading to inefficient and incomplete relation mining. To overcome these limitations, this paper proposes a comprehensive relationship framework called Vision Transformer with Relation Exploration (ViT-RE) for pedestrian attribute recognition, which includes two novel modules, namely the Attribute and Contextual Feature Projection (ACFP) and the Relation Exploration Module (REM). In ACFP, attribute-specific features and contextual-aware features are learned individually to capture discriminative information tailored for attributes and image regions, respectively. Then, REM employs Graph Convolutional Network (GCN) Blocks and Transformer Blocks to concurrently explore attribute, contextual, and attribute-contextual relations. To enable fine-grained relation mining, a Dynamic Adjacency Module (DAM) is further proposed to construct instance-wise adjacency matrices for the GCN Block. Equipped with comprehensive relation information, ViT-RE achieves promising performance on three popular benchmarks, namely the PETA, RAP, and PA-100K datasets. Moreover, ViT-RE achieved first place in the WACV 2023 UPAR Challenge.
{"title":"Vision Transformer With Relation Exploration for Pedestrian Attribute Recognition","authors":"Hao Tan;Zichang Tan;Dunfang Weng;Ajian Liu;Jun Wan;Zhen Lei;Stan Z. Li","doi":"10.1109/TMM.2024.3521677","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521677","url":null,"abstract":"Pedestrian attribute recognition has achieved high accuracy by exploring the relations between image regions and attributes. However, existing methods typically adopt features directly extracted from the backbone or utilize a single structure (e.g., transformer) to explore the relations, leading to inefficient and incomplete relation mining. To overcome these limitations, this paper proposes a comprehensive relationship framework called Vision Transformer with Relation Exploration (ViT-RE) for pedestrian attribute recognition, which includes two novel modules, namely Attribute and Contextual Feature Projection (ACFP) and Relation Exploration Module (REM). In ACFP, attribute-specific features and contextual-aware features are learned individually to capture discriminative information tailored for attributes and image regions, respectively. Then, REM employs Graph Convolutional Network (GCN) Blocks and Transformer Blocks to concurrently explore attribute, contextual, and attribute-contextual relations. To enable fine-grained relation mining, a Dynamic Adjacency Module (DAM) is further proposed to construct instance-wise adjacency matrix for the GCN Block. Equipped with comprehensive relation information, ViT-RE achieves promising performance on three popular benchmarks, including PETA, RAP, and PA-100 K datasets. Moreover, ViT-RE achieves the first place in the <italic>WACV 2023 UPAR Challenge</i>.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"198-208"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521849
Abdullah Aman Khan;Jie Shao;Sidra Shafiq;Shuyuan Zhu;Heng Tao Shen
Few-shot learning is a crucial aspect of modern machine learning that enables models to recognize and classify objects efficiently with limited training data. The shortage of labeled 3D point cloud data calls for innovative solutions, particularly when novel classes emerge frequently. In this paper, we propose a novel few-shot learning method for recognizing 3D point clouds. More specifically, this paper addresses the challenges of applying few-shot learning to 3D point cloud data, which poses unique difficulties due to the unordered and irregular nature of these data. We propose two new modules for few-shot 3D point cloud classification, i.e., the Soft Interaction Module (SIM) and the Self-Attention Residual Feedforward (SARF) Module. These modules balance and enhance the feature representation by enabling more relevant feature interactions and capturing long-range dependencies between query and support features. To validate the effectiveness of the proposed method, extensive experiments are conducted on benchmark datasets, including ModelNet40, ShapeNetCore, and ScanObjectNN. Our approach demonstrates superior performance in handling the abrupt feature changes that occur during the meta-learning process. The experimental results indicate the superiority of our proposed method by demonstrating its robust generalization ability and better classification performance for 3D point cloud data with limited training samples.
{"title":"Enhancing Few-Shot 3D Point Cloud Classification With Soft Interaction and Self-Attention","authors":"Abdullah Aman Khan;Jie Shao;Sidra Shafiq;Shuyuan Zhu;Heng Tao Shen","doi":"10.1109/TMM.2024.3521849","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521849","url":null,"abstract":"Few-shot learning is a crucial aspect of modern machine learning that enables models to recognize and classify objects efficiently with limited training data. The shortage of labeled 3D point cloud data calls for innovative solutions, particularly when novel classes emerge more frequently. In this paper, we propose a novel few-shot learning method for recognizing 3D point clouds. More specifically, this paper addresses the challenges of applying few-shot learning to 3D point cloud data, which poses unique difficulties due to the unordered and irregular nature of these data. We propose two new modules for few-shot based 3D point cloud classification, i.e., the Soft Interaction Module (SIM) and Self-Attention Residual Feedforward (SARF) Module. These modules balance and enhance the feature representation by enabling more relevant feature interactions and capturing long-range dependencies between query and support features. To validate the effectiveness of the proposed method, extensive experiments are conducted on benchmark datasets, including ModelNet40, ShapeNetCore, and ScanObjectNN. Our approach demonstrates superior performance in handling abrupt feature changes occurring during the meta-learning process. The results of the experiments indicate the superiority of our proposed method by demonstrating its robust generalization ability and better classification performance for 3D point cloud data with limited training samples.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1127-1141"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}