Pub Date: 2024-12-24 | DOI: 10.1109/TMM.2024.3521678
Yue Dai;Shihui Ying;Yue Gao
Rotation-invariant point cloud analysis is essential for many real-world applications where objects can appear in arbitrary orientations. Traditional local rotation-invariant methods rely on lossy region descriptors, limiting the global comprehension of 3D objects. Conversely, global features derived from pose alignment can capture complementary information. To leverage both local and global consistency for enhanced accuracy, we propose the Global-Local-Consistent Hypergraph Cross-Attention Network (GLC-HCAN). This framework includes the Global Consistent Feature (GCF) representation branch, the Local Consistent Feature (LCF) representation branch, and the Hypergraph Cross-Attention (HyperCA) network to model complex correlations through global-local-consistent hypergraph representation learning. Specifically, the GCF branch employs a multi-pose grouping and aggregation strategy based on PCA for improved global comprehension. Simultaneously, the LCF branch uses local farthest reference point features to enhance local region descriptions. To capture high-order and complex global-local correlations, we construct hypergraphs that integrate both features, mutually enhancing and fusing the representations. The inductive HyperCA module leverages attention techniques to better utilize these high-order relations for comprehensive understanding. Consequently, GLC-HCAN offers an effective and robust rotation-invariant point cloud analysis network, suitable for object classification and shape retrieval tasks in SO(3). Experimental results on both synthetic and scanned point cloud datasets demonstrate that GLC-HCAN outperforms state-of-the-art methods.
{"title":"Exploring Local and Global Consistent Correlation on Hypergraph for Rotation Invariant Point Cloud Analysis","authors":"Yue Dai;Shihui Ying;Yue Gao","doi":"10.1109/TMM.2024.3521678","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521678","url":null,"abstract":"Rotation invariant point cloud analysis is essential for many real-world applications where objects can appear in arbitrary orientations. Traditional local rotation-invariant methods rely on lossy region descriptors, limiting the global comprehension of 3D objects. Conversely, global features derived from pose alignment can capture complementary information. To leverage both local and global consistency for enhanced accuracy, we propose the Global-Local-Consistent Hypergraph Cross-Attention Network (GLC-HCAN). This framework includes the Global Consistent Feature (GCF) representation branch, the Local Consistent Feature (LCF) representation branch, and the Hypergraph Cross-Attention (HyperCA) network to model complex correlations through the global-local-consistent hypergraph representation learning. Specifically, the GCF branch employs a multi-pose grouping and aggregation strategy based on PCA for improved global comprehension. Simultaneously, the LCF branch uses local farthest reference point features to enhance local region descriptions. To capture high-order and complex global-local correlations, we construct hypergraphs that integrate both features, mutually enhancing and fusing the representations. The inductive HyperCA module leverages attention techniques to better utilize these high-order relations for comprehensive understanding. Consequently, GLC-HCAN offers an effective and robust rotation-invariant point cloud analysis network, suitable for object classification and shape retrieval tasks in SO(3). Experimental results on both synthetic and scanned point cloud datasets demonstrate that GLC-HCAN outperforms state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"186-197"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521696
Meng Yang;Jun Chen;Xin Tian;Longsheng Wei;Jiayi Ma
Finding reliable correspondences between two-view images and recovering the camera poses are key problems in photogrammetry and image signal processing. The multilayer perceptron (MLP) is widely used in two-view correspondence learning because it is good at learning disordered sparse correspondences, but it is susceptible to dominant outliers and requires additional functional blocks to capture context information. A CNN can naturally extract local context information, but it cannot handle disordered data or extract global context and channel information. To overcome the shortcomings of MLPs and CNNs, we design a correspondence learning network based on the Transformer, named Vector Rectifier Transformer (VRTNet). The Transformer is an encoder-decoder structure that can handle disordered sparse correspondences and output sequences of arbitrary length. Therefore, we design two sub-Transformers in VRTNet to achieve mutual conversion between disordered and ordered correspondences. Their self-attention and cross-attention mechanisms allow VRTNet to focus on the global context relations of all correspondences. To capture local context and channel information, we propose a rectifier network (comprising a CNN and a channel attention block) as the backbone of VRTNet, which avoids the complex design of additional blocks. The rectifier network corrects the errors of ordered correspondences to obtain rectified correspondences. Finally, outliers are removed by comparing the original and rectified correspondences. VRTNet performs better than state-of-the-art methods on the tasks of relative pose estimation, outlier removal, and image registration.
{"title":"VRTNet: Vector Rectifier Transformer for Two-View Correspondence Learning","authors":"Meng Yang;Jun Chen;Xin Tian;Longsheng Wei;Jiayi Ma","doi":"10.1109/TMM.2024.3521696","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521696","url":null,"abstract":"Finding reliable correspondences in two-view image and recovering the camera poses are key problems in photogrammetry and image signal processing. Multilayer perceptron (MLP) has a wide application in two-view correspondence learning for which is good at learning disordered sparse correspondences, but it is susceptible to the dominant outliers and requires additional functional blocks to capture context information. CNN can naturally extract local context information, but it cannot handle disordered data and extract global context and channel information. In order to overcome the shortcomings of MLP and CNN, we design a correspondence learning network based on Transformer, named Vector Rectifier Transformer (VRTNet). Transformer is an encoder-decoder structure which can handle disordered sparse correspondences and output sequences of arbitrary length. Therefore, we design two sub-Transformers in VRTNet to achieve the mutual conversion between disordered and ordered correspondences. The self-attention and cross-attention mechanisms in them allow VRTNet to focus on the global context relations of all correspondences. To capture local context and channel information, we propose rectifier network (including CNN and channel attention block) as the backbone of VRTNet, which avoids the complex design of additional blocks. Rectifier network can correct the errors of ordered correspondences to obtain rectified correspondences. Finally, outliers are removed by comparing original and rectified correspondences. VRTNet performs better than the state-of-the-art methods in the tasks of relative pose estimation, outlier removal and image registration.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"515-530"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521789
Xingfeng Li;Yuangang Pan;Yuan Sun;Quansen Sun;Yinghui Sun;Ivor W. Tsang;Zhenwen Ren
Compared to static anchor selection, existing dynamic anchor learning can automatically learn more flexible anchors to improve the performance of large-scale multi-view clustering. Despite improving the flexibility of anchors, these methods do not pay sufficient attention to the alignment and fairness of the learned anchors. Specifically, within each cluster, the positions and quantities of cross-view anchors may not align, and anchors may even be absent in some clusters, leading to severe anchor misalignment and imbalance issues. These issues result in inaccurate graph fusion and a reduction in clustering performance. Besides, in practical applications, missing information caused by sensor malfunctions or data losses can further exacerbate anchor misalignment and imbalance. To overcome such challenges, a novel Incomplete Multi-view Clustering with Paired and Balanced Dynamic Anchor Learning (PBDAL) method is proposed to ensure the alignment and fairness of anchors. Unlike existing unsupervised anchor learning, we first design a paired and balanced dynamic anchor learning scheme that supervises dynamic anchors to be aligned and fair in each cluster. Meanwhile, we develop an enhanced bipartite graph tensor learning scheme to refine the paired and balanced anchors. The superiority, effectiveness, and efficiency of our method are all validated by extensive experiments on multiple public datasets.
{"title":"Incomplete Multi-View Clustering With Paired and Balanced Dynamic Anchor Learning","authors":"Xingfeng Li;Yuangang Pan;Yuan Sun;Quansen Sun;Yinghui Sun;Ivor W. Tsang;Zhenwen Ren","doi":"10.1109/TMM.2024.3521789","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521789","url":null,"abstract":"Compared to static anchor selection, existing dynamic anchor learning could automatically learn more flexible anchors to improve the performance of large-scale multi-view clustering. Despite improving the flexibility of anchors, these methods do not pay sufficient attention to the alignment and fairness of learned anchors. Specifically, within each cluster, the positions and quantities of cross-view anchors may not align, or even anchor absence in some clusters, leading to severe anchor misalignment and imbalance issues. These issues result in inaccurate graph fusion and a reduction in clustering performance. Besides, in practical applications, missing information caused by sensor malfunctions or data losses could further exacerbate anchor misalignment and imbalance. To overcome such challenges, a novel Incomplete Multi-view Clustering with <bold>Paired and Balanced Dynamic Anchor Learning (PBDAL)</b> is proposed to ensure the alignment and fairness of anchors. Unlike existing unsupervised anchor learning, we first design a paired and balanced dynamic anchor learning scheme to supervise dynamic anchors to be aligned and fair in each cluster. Meanwhile, we develop an enhanced bipartite graph tensor learning to refine paired and balanced anchors. Our superiority, effectiveness, and efficiency are all validated by performing extensive experiments on multiple public datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1486-1497"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521795
Zeke Zexi Hu;Xiaoming Chen;Vera Yuk Ying Chung;Yiran Shen
The effective extraction of spatial-angular features plays a crucial role in light field image super-resolution (LFSR) tasks, and the introduction of convolution and Transformers has led to significant improvement in this area. Nevertheless, due to the large 4D data volume of light field images, many existing methods opt to decompose the data into a number of lower-dimensional subspaces and apply Transformers in each subspace individually. As a side effect, these methods inadvertently restrict the self-attention mechanisms to a One-to-One scheme accessing only a limited subset of LF data, preventing comprehensive optimization over all spatial and angular cues. In this paper, we identify this limitation as subspace isolation and introduce a novel Many-to-Many Transformer (M2MT) to address it. M2MT aggregates angular information in the spatial subspace before performing the self-attention mechanism. It enables complete access to all information across all sub-aperture images (SAIs) in a light field image. Consequently, M2MT can comprehensively capture long-range correlation dependencies. With M2MT as the foundational component, we develop a simple yet effective M2MT network for LFSR. Our experimental results demonstrate that M2MT achieves state-of-the-art performance across various public datasets, and it offers a favorable balance between model performance and efficiency, yielding higher-quality LFSR results with substantially lower demand for memory and computation. We further conduct an in-depth analysis using local attribution maps (LAM) to obtain visual interpretability, and the results validate that M2MT is empowered with a truly non-local context in both the spatial and angular subspaces to mitigate subspace isolation and acquire effective spatial-angular representations.
{"title":"Beyond Subspace Isolation: Many-to-Many Transformer for Light Field Image Super-Resolution","authors":"Zeke Zexi Hu;Xiaoming Chen;Vera Yuk Ying Chung;Yiran Shen","doi":"10.1109/TMM.2024.3521795","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521795","url":null,"abstract":"The effective extraction of spatial-angular features plays a crucial role in light field image super-resolution (LFSR) tasks, and the introduction of convolution and Transformers leads to significant improvement in this area. Nevertheless, due to the large 4D data volume of light field images, many existing methods opted to decompose the data into a number of lower-dimensional subspaces and perform Transformers in each sub-space individually. As a side effect, these methods inadvertently restrict the self-attention mechanisms to a One-to-One scheme accessing only a limited subset of LF data, explicitly preventing comprehensive optimization on all spatial and angular cues. In this paper, we identify this limitation as subspace isolation and introduce a novel Many-to-Many Transformer (M2MT) to address it. M2MT aggregates angular information in the spatial subspace before performing the self-attention mechanism. It enables complete access to all information across all sub-aperture images (SAIs) in a light field image. Consequently, M2MT is enabled to comprehensively capture long-range correlation dependencies. With M2MT as the foundational component, we develop a simple yet effective M2MT network for LFSR. Our experimental results demonstrate that M2MT achieves state-of-the-art performance across various public datasets, and it offers a favorable balance between model performance and efficiency, yielding higher-quality LFSR results with substantially lower demand for memory and computation. We further conduct in-depth analysis using local attribution maps (LAM) to obtain visual interpretability, and the results validate that M2MT is empowered with a truly non-local context in both spatial and angular subspaces to mitigate subspace isolation and acquire effective spatial-angular representation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1334-1348"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521662
Yusra Alkendi;Rana Azzam;Sajid Javed;Lakmal Seneviratne;Yahya Zweiri
Moving object segmentation is critical for interpreting scene dynamics in robotic navigation systems operating in challenging environments. Neuromorphic vision sensors are tailored for motion perception due to their asynchronous nature, high temporal resolution, and reduced power consumption. However, their unconventional output requires novel perception paradigms to leverage their spatially sparse and temporally dense nature. In this work, we propose a novel event-based motion segmentation algorithm using a Graph Transformer Neural Network, dubbed GTNN. Our proposed algorithm processes event streams as 3D graphs through a series of nonlinear transformations to unveil local and global spatiotemporal correlations between events. Based on these correlations, events belonging to moving objects are segmented from the background without prior knowledge of the dynamic scene geometry. The algorithm is trained on publicly available datasets, including MOD, EV-IMO, and EV-IMO2, using the proposed training scheme to facilitate efficient training on extensive datasets. Moreover, we introduce the Dynamic Object Mask-aware Event Labeling (DOMEL) approach for generating approximate ground-truth labels for event-based motion segmentation datasets. We use DOMEL to label our own recorded Event dataset for Motion Segmentation (EMS-DOMEL), which we release to the public for further research and benchmarking. Rigorous experiments are conducted on several unseen publicly available datasets, where the results reveal that GTNN outperforms state-of-the-art methods in the presence of dynamic background variations, motion patterns, and multiple dynamic objects with varying sizes and velocities. GTNN achieves significant performance gains with an average increase of 9.4% and 4.5% in motion segmentation accuracy (IoU%) and detection rate (DR%), respectively.
{"title":"Neuromorphic Vision-Based Motion Segmentation With Graph Transformer Neural Network","authors":"Yusra Alkendi;Rana Azzam;Sajid Javed;Lakmal Seneviratne;Yahya Zweiri","doi":"10.1109/TMM.2024.3521662","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521662","url":null,"abstract":"Moving object segmentation is critical to interpret scene dynamics for robotic navigation systems in challenging environments. Neuromorphic vision sensors are tailored for motion perception due to their asynchronous nature, high temporal resolution, and reduced power consumption. However, their unconventional output requires novel perception paradigms to leverage their spatially sparse and temporally dense nature. In this work, we propose a novel event-based motion segmentation algorithm using a Graph Transformer Neural Network, dubbed GTNN. Our proposed algorithm processes event streams as 3D graphs by a series of nonlinear transformations to unveil local and global spatiotemporal correlations between events. Based on these correlations, events belonging to moving objects are segmented from the background without prior knowledge of the dynamic scene geometry. The algorithm is trained on publicly available datasets including MOD, EV-IMO, and EV-IMO2 using the proposed training scheme to facilitate efficient training on extensive datasets. Moreover, we introduce the Dynamic Object Mask-aware Event Labeling (DOMEL) approach for generating approximate ground-truth labels for event-based motion segmentation datasets. We use DOMEL to label our own recorded Event dataset for Motion Segmentation (EMS-DOMEL), which we release to the public for further research and benchmarking. Rigorous experiments are conducted on several unseen publicly-available datasets where the results revealed that GTNN outperforms state-of-the-art methods in the presence of dynamic background variations, motion patterns, and multiple dynamic objects with varying sizes and velocities. GTNN achieves significant performance gains with an average increase of 9.4% and 4.5% in terms of motion segmentation accuracy (<italic>IoU</i>%) and detection rate (<italic>DR</i>%), respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"385-400"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10812712","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521830
Wenyang Liu;Kejun Wu;Tianyi Liu;Yi Wang;Kim-Hui Yap;Lap-Pui Chau
Multimedia file fragment classification (MFFC) aims to identify file fragment types, e.g., image/video, audio, and text, without system metadata. It is of vital importance in multimedia storage and communication. Existing MFFC methods typically treat fragments as 1D byte sequences and emphasize the relations between separate bytes (interbytes) for classification. However, the more informative relations inside bytes (intrabytes) are overlooked and seldom investigated. By looking inside bytes, the bit-level details of file fragments can be accessed, enabling a more accurate classification. Motivated by this, we first propose Byte2Image, a novel visual representation model that incorporates previously overlooked intrabyte information into file fragments and reinterprets these fragments as 2D grayscale images. This model involves a sliding byte window to reveal the intrabyte information and a rowwise stacking of intrabyte n-grams for embedding fragments into a 2D space. Thus, complex interbyte and intrabyte correlations can be mined simultaneously using powerful vision networks. Additionally, we propose an end-to-end dual-branch network, ByteNet, to enhance robust correlation mining and feature representation. ByteNet makes full use of the raw 1D byte sequence and the converted 2D image through a shallow byte branch feature extraction (BBFE) network and a deep image branch feature extraction (IBFE) network. In particular, the BBFE, composed of a single fully-connected layer, adaptively recognizes the co-occurrence of specific bytes within the raw byte sequence, while the IBFE, built on a vision Transformer, effectively mines the complex interbyte and intrabyte correlations from the converted image. Experiments on two representative benchmarks covering 14 cases validate that our proposed method outperforms state-of-the-art approaches by up to 12.2%.
{"title":"ByteNet: Rethinking Multimedia File Fragment Classification Through Visual Perspectives","authors":"Wenyang Liu;Kejun Wu;Tianyi Liu;Yi Wang;Kim-Hui Yap;Lap-Pui Chau","doi":"10.1109/TMM.2024.3521830","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521830","url":null,"abstract":"Multimedia file fragment classification (MFFC) aims to identify file fragment types, e.g., image/video, audio, and text without system metadata. It is of vital importance in multimedia storage and communication. Existing MFFC methods typically treat fragments as 1D byte sequences and emphasize the relations between separate bytes (interbytes) for classification. However, the more informative relations inside bytes (intrabytes) are overlooked and seldom investigated. By looking inside bytes, the bit-level details of file fragments can be accessed, enabling a more accurate classification. Motivated by this, we first propose <bold>Byte2Image</b>, a novel visual representation model that incorporates previously overlooked intrabyte information into file fragments and reinterprets these fragments as 2D grayscale images. This model involves a sliding byte window to reveal the intrabyte information and a rowwise stacking of intrabyte n-grams for embedding fragments into a 2D space. Thus, complex interbyte and intrabyte correlations can be mined simultaneously using powerful vision networks. Additionally, we propose an end-to-end dual-branch network <bold>ByteNet</b> to enhance robust correlation mining and feature representation. ByteNet makes full use of the raw 1D byte sequence and the converted 2D image through a shallow byte branch feature extraction (BBFE) and a deep image branch feature extraction (IBFE) network. In particular, the BBFE, composed of a single fully-connected layer, adaptively recognizes the co-occurrence of several some specific bytes within the raw byte sequence, while the IBFE, built on a vision Transformer, effectively mines the complex interbyte and intrabyte correlations from the converted image. Experiments on the two representative benchmarks, including 14 cases, validate that our proposed method outperforms state-of-the-art approaches on different cases by up to 12.2%.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1305-1319"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521843
Min Liu;Zhu Zhang;Yuan Bian;Xueping Wang;Yeqing Sun;Baida Zhang;Yaonan Wang
Visible-infrared person re-identification (VI-ReID) seeks to identify and match individuals across the visible and infrared ranges within intelligent monitoring environments. Most current approaches explore a two-stream network structure that extracts global or rigidly split part features and introduce an extra modality for image compensation to guide the network in reducing the large differences between the two modalities. However, these methods are sensitive to misalignment caused by pose/viewpoint variations and to the additional noise produced when generating the extra modality. In this article, we address the above issues and propose a Cross-modality Semantic Consistency Learning (CSCL) network to excavate semantically consistent features in different modalities by utilizing human semantic information. Specifically, a Parsing-aligned Attention Module (PAM) is introduced to filter out irrelevant noise with channel-wise attention and dynamically highlight the semantic-aware representations across modalities at different stages of the network. Then, a Semantic-guided Part Alignment Module (SPAM) is introduced, aimed at efficiently producing a collection of semantically aligned fine-grained features. This is achieved by incorporating parsing loss and division loss constraints, ultimately enhancing the overall person representation. Finally, an Identity-aware Center Mining (ICM) loss is presented to reduce the discrepancy between modality centers within classes, thereby further alleviating intra-class modality discrepancies. Extensive experiments indicate that CSCL outperforms the state-of-the-art methods on the SYSU-MM01 and RegDB datasets. Notably, the Rank-1/mAP accuracy on the SYSU-MM01 dataset reaches 75.72%/72.08%.
{"title":"Cross-Modality Semantic Consistency Learning for Visible-Infrared Person Re-Identification","authors":"Min Liu;Zhu Zhang;Yuan Bian;Xueping Wang;Yeqing Sun;Baida Zhang;Yaonan Wang","doi":"10.1109/TMM.2024.3521843","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521843","url":null,"abstract":"Visible-infrared person re-identification (VI-ReID) seeks to identify and match individuals across visible and infrared ranges within intelligent monitoring environments. Most current approaches predominantly explore a two-stream network structure that extract global or rigidly split part features and introduce an extra modality for image compensation to guide networks reducing the huge differences between the two modalities. However, these methods are sensitive to misalignment caused by pose/viewpoint variations and additional noises produced by extra modality generating. Within the confines of this articles, we clearly consider addresses above issues and propose a Cross-modality Semantic Consistency Learning (CSCL) network to excavate the semantic consistent features in different modalities by utilizing human semantic information. Specifically, a Parsing-aligned Attention Module (PAM) is introduced to filter out the irrelevant noises with channel-wise attention and dynamically highlight the semantic-aware representations across modalities in different stages of the network. Then, a Semantic-guided Part Alignment Module (SPAM) is introduced, aimed at efficiently producing a collection of semantic-aligned fine-grained features. This is achieved by incorporating parsing loss and division loss constraints, ultimately enhancing the overall person representation. Finally, an Identity-aware Center Mining (ICM) loss is presented to reduce the distribution between modality centers within classes, thereby further alleviating intra-class modality discrepancies. Extensive experiments indicate that CSCL outperforms the state-of-the-art methods on the SYSU-MM01 and RegDB datasets. Notably, the Rank-1/mAP accuracy on the SYSU-MM01 dataset can achieve 75.72%/72.08%.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"568-580"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521853
Yumo Zhang;Zhanchuan Cai
To provide an image compression method with better compression performance and lower computational complexity, a new image compression algorithm is proposed in this paper. First, a double-layer non-uniform partition algorithm is proposed, which analyzes the texture complexity of image blocks and performs partitioning and merging of the image blocks at different scales to provide a priori information that helps reduce spatial redundancy in the subsequent block-wise compression. Next, by considering multiple transform cores, we propose an adaptive U transform scheme, which performs more targeted coding for different types of image blocks to enhance the coding performance. Finally, to make bit allocation more flexible and accurate, a fully adaptive quantization technique is proposed. It not only formulates the quantization coefficient relationship between image blocks of different sizes but also further refines the quantization coefficient relationship between image blocks under different topologies. Extensive experiments indicate that the compression performance of the proposed algorithm not only significantly surpasses JPEG but also surpasses some state-of-the-art compression algorithms with similar computational complexity. In addition, compared with the JPEG2000 algorithm, which has higher computational complexity, the proposed algorithm still offers certain advantages in compression performance.
{"title":"DNP-AUT: Image Compression Using Double-Layer Non-Uniform Partition and Adaptive U Transform","authors":"Yumo Zhang;Zhanchuan Cai","doi":"10.1109/TMM.2024.3521853","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521853","url":null,"abstract":"To provide an image compression method with better compression performance and lower computational complexity, a new image compression algorithm is proposed in this paper. First, a double-layer non-uniform partition algorithm is proposed, which analyzes the texture complexity of image blocks and performs partitioning and merging of the image blocks at different scales to provide a priori information that helps to reduce the spatial redundancy for subsequent compression against the blocks. Next, by considering the multi-transform cores, we propose an adaptive U transform scheme, which performs more specific coding for different types of image blocks to enhance the coding performance. Finally, in order that the bit allocation can be more flexible and accurate, a fully adaptive quantization technique is proposed. It not only formulates the quantization coefficient relationship between image blocks of different sizes but also further refines the quantization coefficient relationship between image blocks under different topologies. Extensive experiments indicate that the compression performance of the proposed algorithm not only significantly surpasses the JPEG but also surpasses some state-of-the-art compression algorithms with similar computational complexity. In addition, compared with the JPEG2000 compression algorithm, which has greater with higher computational complexity, its compression performance also has certain advantages.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"249-262"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521677
Hao Tan;Zichang Tan;Dunfang Weng;Ajian Liu;Jun Wan;Zhen Lei;Stan Z. Li
Pedestrian attribute recognition has achieved high accuracy by exploring the relations between image regions and attributes. However, existing methods typically adopt features directly extracted from the backbone or utilize a single structure (e.g., a transformer) to explore the relations, leading to inefficient and incomplete relation mining. To overcome these limitations, this paper proposes a comprehensive relationship framework called Vision Transformer with Relation Exploration (ViT-RE) for pedestrian attribute recognition, which includes two novel modules, namely the Attribute and Contextual Feature Projection (ACFP) and the Relation Exploration Module (REM). In ACFP, attribute-specific features and contextual-aware features are learned individually to capture discriminative information tailored for attributes and image regions, respectively. Then, REM employs Graph Convolutional Network (GCN) Blocks and Transformer Blocks to concurrently explore attribute, contextual, and attribute-contextual relations. To enable fine-grained relation mining, a Dynamic Adjacency Module (DAM) is further proposed to construct instance-wise adjacency matrices for the GCN Block. Equipped with comprehensive relation information, ViT-RE achieves promising performance on three popular benchmarks, namely the PETA, RAP, and PA-100K datasets. Moreover, ViT-RE achieved first place in the WACV 2023 UPAR Challenge.
{"title":"Vision Transformer With Relation Exploration for Pedestrian Attribute Recognition","authors":"Hao Tan;Zichang Tan;Dunfang Weng;Ajian Liu;Jun Wan;Zhen Lei;Stan Z. Li","doi":"10.1109/TMM.2024.3521677","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521677","url":null,"abstract":"Pedestrian attribute recognition has achieved high accuracy by exploring the relations between image regions and attributes. However, existing methods typically adopt features directly extracted from the backbone or utilize a single structure (e.g., transformer) to explore the relations, leading to inefficient and incomplete relation mining. To overcome these limitations, this paper proposes a comprehensive relationship framework called Vision Transformer with Relation Exploration (ViT-RE) for pedestrian attribute recognition, which includes two novel modules, namely Attribute and Contextual Feature Projection (ACFP) and Relation Exploration Module (REM). In ACFP, attribute-specific features and contextual-aware features are learned individually to capture discriminative information tailored for attributes and image regions, respectively. Then, REM employs Graph Convolutional Network (GCN) Blocks and Transformer Blocks to concurrently explore attribute, contextual, and attribute-contextual relations. To enable fine-grained relation mining, a Dynamic Adjacency Module (DAM) is further proposed to construct instance-wise adjacency matrix for the GCN Block. Equipped with comprehensive relation information, ViT-RE achieves promising performance on three popular benchmarks, including PETA, RAP, and PA-100 K datasets. Moreover, ViT-RE achieves the first place in the <italic>WACV 2023 UPAR Challenge</i>.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"198-208"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521849
Abdullah Aman Khan;Jie Shao;Sidra Shafiq;Shuyuan Zhu;Heng Tao Shen
Few-shot learning is a crucial aspect of modern machine learning that enables models to recognize and classify objects efficiently with limited training data. The shortage of labeled 3D point cloud data calls for innovative solutions, particularly when novel classes emerge frequently. In this paper, we propose a novel few-shot learning method for recognizing 3D point clouds. More specifically, this paper addresses the challenges of applying few-shot learning to 3D point cloud data, which poses unique difficulties due to the unordered and irregular nature of these data. We propose two new modules for few-shot 3D point cloud classification, i.e., the Soft Interaction Module (SIM) and the Self-Attention Residual Feedforward (SARF) Module. These modules balance and enhance the feature representation by enabling more relevant feature interactions and capturing long-range dependencies between query and support features. To validate the effectiveness of the proposed method, extensive experiments are conducted on benchmark datasets, including ModelNet40, ShapeNetCore, and ScanObjectNN. Our approach demonstrates superior performance in handling the abrupt feature changes that occur during the meta-learning process. The experimental results indicate the superiority of our proposed method by demonstrating its robust generalization ability and better classification performance for 3D point cloud data with limited training samples.
{"title":"Enhancing Few-Shot 3D Point Cloud Classification With Soft Interaction and Self-Attention","authors":"Abdullah Aman Khan;Jie Shao;Sidra Shafiq;Shuyuan Zhu;Heng Tao Shen","doi":"10.1109/TMM.2024.3521849","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521849","url":null,"abstract":"Few-shot learning is a crucial aspect of modern machine learning that enables models to recognize and classify objects efficiently with limited training data. The shortage of labeled 3D point cloud data calls for innovative solutions, particularly when novel classes emerge more frequently. In this paper, we propose a novel few-shot learning method for recognizing 3D point clouds. More specifically, this paper addresses the challenges of applying few-shot learning to 3D point cloud data, which poses unique difficulties due to the unordered and irregular nature of these data. We propose two new modules for few-shot based 3D point cloud classification, i.e., the Soft Interaction Module (SIM) and Self-Attention Residual Feedforward (SARF) Module. These modules balance and enhance the feature representation by enabling more relevant feature interactions and capturing long-range dependencies between query and support features. To validate the effectiveness of the proposed method, extensive experiments are conducted on benchmark datasets, including ModelNet40, ShapeNetCore, and ScanObjectNN. Our approach demonstrates superior performance in handling abrupt feature changes occurring during the meta-learning process. The results of the experiments indicate the superiority of our proposed method by demonstrating its robust generalization ability and better classification performance for 3D point cloud data with limited training samples.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1127-1141"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}