GateInst: instance segmentation with multi-scale gated-enhanced queries in transformer decoder
Chih-Wei Lin, Ye Lin, Shangtai Zhou, Lirong Zhu
Pub Date: 2024-08-20. DOI: 10.1007/s00530-024-01438-1
Recently, a popular query-based end-to-end framework has been used for instance segmentation. However, queries are updated from only a single layer or scale of feature maps at each stage of Transformer decoding, which prevents them from gathering sufficient multi-scale feature information. Querying these features may therefore yield inconsistent information, owing to disparities among feature maps, and lead to erroneous updates. This study proposes a new network called GateInst, which employs a dual-path auto-select mechanism based on gate structures to overcome these issues. Firstly, we design a block-wise multi-scale feature fusion module that combines features of different scales while maintaining low computational cost. Secondly, we introduce a gated-enhanced queries Transformer decoder that uses a gating mechanism to filter and merge the queries generated at different stages, compensating for inaccurate query updates. GateInst addresses the issue of insufficient feature information and compensates for cumulative errors in the queries. Experiments show that GateInst achieves significant gains of 8.4 AP and 5.5 AP_{50} over Mask2Former on the self-collected Tree Species Instance Dataset, and performs well against both Mask2Former-like and non-Mask2Former-like networks on the self-collected and public COCO datasets, with only a small additional computational cost and fast convergence. Code and models are available at https://github.com/FAFU-IMLab/GateInst.
{"title":"Gateinst: instance segmentation with multi-scale gated-enhanced queries in transformer decoder","authors":"Chih-Wei Lin, Ye Lin, Shangtai Zhou, Lirong Zhu","doi":"10.1007/s00530-024-01438-1","DOIUrl":"https://doi.org/10.1007/s00530-024-01438-1","url":null,"abstract":"<p>Recently, a popular query-based end-to-end framework has been used for instance segmentation. However, queries update based on individual layers or scales of feature maps at each stage of Transformer decoding, which makes queries unable to gather sufficient multi-scale feature information. Therefore, querying these features may result in inconsistent information due to disparities among feature maps and leading to erroneous updates. This study proposes a new network called GateInst, which employs a dual-path auto-select mechanism based on gate structures to overcome these issues. Firstly, we design a block-wise multi-scale feature fusion module that combines features of different scales while maintaining low computational cost. Secondly, we introduce the gated-enhanced queries Transformer decoder that utilizes a gating mechanism to filter and merge the queries generated at different stages to compensate for the inaccuracies in updating queries. GateInst addresses the issue of insufficient feature information and compensates for the problem of cumulative errors in queries. Experiments have shown that GateInst achieves significant gains of 8.4 <i>AP</i>, 5.5 <span>(AP_{50})</span> over Mask2Former on the self-collected Tree Species Instance Dataset and performs well compared to non-Mask2Former-like and Mask2Former-like networks on self-collected and public COCO datasets, with only a tiny amount of additional computational cost and fast convergence. Code and models are available at https://github.com/FAFU-IMLab/GateInst.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"13 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SiamS3C: spatial-channel cross-correlation for visual tracking with centerness-guided regression
Jianming Zhang, Wentao Chen, Yufan He, Li-Dan Kuang, Arun Kumar Sangaiah
Pub Date: 2024-08-20. DOI: 10.1007/s00530-024-01450-5
Visual object tracking can be divided into object classification and bounding-box regression tasks, but sharing a single correlation map between them leads to inaccuracy. Siamese trackers compute the correlation map through a cross-correlation operation with high computational cost, and performing this operation either on channels or in the spatial domain results in weak perception of global information. In addition, some Siamese trackers with a centerness branch ignore the association between the centerness branch and the bounding-box regression branch. To alleviate these problems, we propose a visual object tracker based on spatial-channel cross-correlation and centerness-guided regression. Firstly, we propose a spatial-channel cross-correlation module (SC3M) that combines the search-region feature and the template feature both on channels and in the spatial domain, suppressing interference from distractors. As a lightweight module, SC3M computes two independent correlation maps that are fed to different subnetworks. Secondly, we propose a centerness-guided regression subnetwork consisting of the centerness branch and the bounding-box regression branch. The centerness guides the whole regression subnetwork, enhancing the association of the two branches and further suppressing low-quality predicted bounding boxes. Thirdly, we conducted extensive experiments on five challenging benchmarks: GOT-10k, VOT2018, TrackingNet, OTB100 and UAV123. The results show the excellent performance of our tracker, which meets real-time requirements at 48.52 fps.
3D model watermarking using surface integrals of generated random vector fields
Luke Vandenberghe, Chris Joslin
Pub Date: 2024-08-20. DOI: 10.1007/s00530-024-01455-0
We propose a new semi-blind, semi-fragile watermarking algorithm for authenticating triangulated 3D models using surface integrals of generated random vector fields. Watermark data are embedded into the flux of a vector field across the model's surface, and the vertices are shifted through gradient-based optimization to obtain the modified flux values. The watermark can be extracted by recomputing the surface integrals and comparing them using correlation measures. The algorithm is invariant to Euclidean transformations, including rotations and translations, reduces distortion, and achieves improved robustness to additive noise.
{"title":"3D model watermarking using surface integrals of generated random vector fields","authors":"Luke Vandenberghe, Chris Joslin","doi":"10.1007/s00530-024-01455-0","DOIUrl":"https://doi.org/10.1007/s00530-024-01455-0","url":null,"abstract":"<p>We propose a new semi-blind semi-fragile watermarking algorithm for authenticating triangulated 3D models using the surface integrals of generated random vector fields. Watermark data is embedded into the flux of a vector field across the model’s surface and through gradient-based optimization techniques, the vertices are shifted to obtain the modified flux values. The watermark can be extracted through the recomputation of the surface integrals and compared using correlation measures. This algorithm is invariant to Euclidean transformations including rotations and translation, reduces distortion, and achieves improved robustness to additive noise.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"27 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anomaly detection in surveillance videos using Transformer with margin learning
Dicong Wang, Kaijun Wu
Pub Date: 2024-08-16. DOI: 10.1007/s00530-024-01443-4
Weakly supervised video anomaly detection (WSVAD) is a challenging and actively researched problem in image and video processing. Prior studies have typically formulated WSVAD as a multiple-instance learning (MIL) problem. However, many of these methods concentrate primarily on the periods in which anomalies are clearly discernible: to recognize anomalous events they rely solely on detecting significant changes in appearance or motion, ignoring the temporal completeness and continuity that anomalous events naturally possess, and they disregard the subtle correlations at the transitional boundaries between normal and abnormal states. We therefore propose a weakly supervised approach to video anomaly detection based on a Transformer with margin learning. Our network captures temporal changes around the occurrence of anomalies by exploiting Transformer blocks, which are adept at modeling long-range dependencies in anomalous events. To tackle hard cases, i.e., normal events highly similar to anomalous ones, we employ a hard-score memory that stores the anomaly scores of hard samples, enabling iterative optimization on those instances. Additionally, to bolster the model's discriminative capability at the score level, we use pseudo-labels for anomalous events as supplementary supervision. Experiments on two large-scale datasets, ShanghaiTech and UCF-Crime, achieved highly favorable results, demonstrating that the proposed method is sensitive to anomalous events while performing competitively against state-of-the-art methods.
{"title":"Anomaly detection in surveillance videos using Transformer with margin learning","authors":"Dicong Wang, Kaijun Wu","doi":"10.1007/s00530-024-01443-4","DOIUrl":"https://doi.org/10.1007/s00530-024-01443-4","url":null,"abstract":"<p>Weakly supervised video anomaly detection (WSVAD) constitutes a highly research-oriented and challenging project within the domains of image and video processing. In prior studies of WSVAD, it has typically been formulated as a multiple-instance learning (MIL) problem. However, quite a few of these methods tend to primarily concentrate on time periods when anomalies occur discernibly. To recognize anomalous events, they rely solely on detecting significant changes in appearance or motion, ignoring the temporal completeness or continuity that anomalous events possess by nature. In addition, they also disregard the subtle correlations at the transitional boundaries between normal and abnormal states. Therefore, we propose a weakly supervised learning approach based on Transformer with margin learning for video anomaly detection. Specifically, our network effectively captures temporal changes around the occurrence of anomalies by utilizing the benefits of Transformer blocks, which are adept at capturing long-range dependencies in anomalous events. Secondly, to tackle challenging cases, i.e., normal events with high similarity to anomalous events, we employed a hard score memory. The purpose of this memory is to store the anomaly scores of hard samples, enabling iterative optimization training on those hard instances. Additionally, to bolster the discriminative capability of the model at the score level, we utilize pseudo-labels for anomalous events to provide supplementary support in detection. Experiments were conducted on two large-scale datasets, namely the ShanghaiTech dataset and the UCF-Crime dataset, and they achieved highly favorable results. The results of the experiments demonstrate that the proposed method is sensitive to anomalous events while performing competitively against state-of-the-art methods.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"49 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Remote sensing image cloud removal based on multi-scale spatial information perception
Aozhe Dou, Yang Hao, Weifeng Liu, Liangliang Li, Zhenzhong Wang, Baodi Liu
Pub Date: 2024-08-16. DOI: 10.1007/s00530-024-01442-5
Remote sensing imagery is indispensable in diverse domains, including geographic information systems, climate monitoring, agricultural planning, and disaster management. Nonetheless, cloud cover can drastically degrade the utility and quality of these images. Current deep learning-based cloud removal methods rely on convolutional neural networks that extract features at a single scale, which can overlook detailed and global information, resulting in suboptimal cloud removal performance. To overcome these challenges, we develop a cloud removal method that leverages multi-scale spatial information perception. Our technique employs convolution kernels of various sizes, enabling the integration of global semantic information with local detail, and an attention mechanism enhances this process by targeting key areas within the images and dynamically adjusting channel weights to improve feature reconstruction. We compared our method with currently popular cloud removal methods on three datasets; the results show that it improves metrics such as PSNR, SSIM, and cosine similarity, verifying its effectiveness for cloud removal.
{"title":"Remote sensing image cloud removal based on multi-scale spatial information perception","authors":"Aozhe Dou, Yang Hao, Weifeng Liu, Liangliang Li, Zhenzhong Wang, Baodi Liu","doi":"10.1007/s00530-024-01442-5","DOIUrl":"https://doi.org/10.1007/s00530-024-01442-5","url":null,"abstract":"<p>Remote sensing imagery is indispensable in diverse domains, including geographic information systems, climate monitoring, agricultural planning, and disaster management. Nonetheless, cloud cover can drastically degrade the utility and quality of these images. Current deep learning-based cloud removal methods rely on convolutional neural networks to extract features at the same scale, which can overlook detailed and global information, resulting in suboptimal cloud removal performance. To overcome these challenges, we develop a method for cloud removal that leverages multi-scale spatial information perception. Our technique employs convolution kernels of various sizes, enabling the integration of both global semantic information and local detail information. An attention mechanism enhances this process by targeting key areas within the images, and dynamically adjusting channel weights to improve feature reconstruction. We compared our method with current popular cloud removal methods across three datasets, and the results show that our proposed method improves metrics such as PSNR, SSIM, and cosine similarity, verifying the effectiveness of our method in cloud removal.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"11 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting multi-level consistency learning for source-free domain adaptation
Jihong Ouyang, Zhengjie Zhang, Qingyi Meng, Ximing Li, Jinjin Chi
Pub Date: 2024-08-16. DOI: 10.1007/s00530-024-01444-3
Due to data privacy concerns, a more practical task known as Source-free Unsupervised Domain Adaptation (SFUDA) has gained significant attention recently. SFUDA adapts a pre-trained source model to the target domain without access to the source-domain data. Existing SFUDA methods typically rely on per-class cluster structures to refine labels; however, these clusters often contain samples with different ground-truth labels, leading to label noise. To address this issue, we propose a novel Multi-level Consistency Learning (MLCL) method. MLCL focuses on learning discriminative class-wise target feature representations, resulting in more accurate cluster structures. Specifically, at the inter-domain level, we construct pseudo-source-domain data based on an entropy criterion and align each pseudo-labeled target-domain sample with the corresponding pseudo-source-domain prototype through a prototype contrastive loss, which ensures that the model learns discriminative class-wise feature representations. At the intra-domain level, we enforce consistency among different views of the same image via consistency-based self-training, further enhancing the model's feature representation ability. Additionally, we apply information-maximization regularization to facilitate target-sample clustering and promote diversity. Extensive experiments on four benchmark classification datasets demonstrate the superior performance of the proposed MLCL method. The code is here.
{"title":"Exploiting multi-level consistency learning for source-free domain adaptation","authors":"Jihong Ouyang, Zhengjie Zhang, Qingyi Meng, Ximing Li, Jinjin Chi","doi":"10.1007/s00530-024-01444-3","DOIUrl":"https://doi.org/10.1007/s00530-024-01444-3","url":null,"abstract":"<p>Due to data privacy concerns, a more practical task known as Source-free Unsupervised Domain Adaptation (SFUDA) has gained significant attention recently. SFUDA adapts a pre-trained source model to the target domain without access to the source domain data. Existing SFUDA methods typically rely on per-class cluster structure to refine labels. However, these clusters often contain samples with different ground truth labels, leading to label noise. To address this issue, we propose a novel Multi-level Consistency Learning (MLCL) method. MLCL focuses on learning discriminative class-wise target feature representations, resulting in more accurate cluster structures. Specifically, at the inter-domain level, we construct pseudo-source domain data based on the entropy criterion. We align pseudo-labeled target domain sample with corresponding pseudo-source domain prototype by introducing a prototype contrastive loss. This loss ensures that our model can learn discriminative class-wise feature representations effectively. At the intra-domain level, we enforce consistency among different views of the same image by employing consistency-based self-training. The self-training further enhances the feature representation ability of our model. Additionally, we apply information maximization regularization to facilitate target sample clustering and promote diversity. Our extensive experiments conducted on four benchmark datasets for classification demonstrate the superior performance of the proposed MLCL method. The code is here.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"58 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integrate encryption of multiple images based on a new hyperchaotic system and Baker map
Xingbin Liu
Pub Date: 2024-08-14. DOI: 10.1007/s00530-024-01449-y
Image encryption is a crucial means of safeguarding information against unauthorized access during both transmission and storage. This paper introduces an integrated encryption algorithm tailored to multiple images, leveraging a novel hyperchaotic system and the Baker map to enlarge the key space and enhance security. The method follows a permutation-diffusion framework, employing sequences derived from the hyperchaotic system for both the permutation and diffusion operations. Initially, the multiple images are intermixed and consolidated into a single image. The Baker map is then employed to further scramble the amalgamated image, extending the scrambling period. Finally, the ciphertext image is generated through forward–backward diffusion applied to the pixel sequence of the Zigzag-scanned image. Experimental findings substantiate the high security of the proposed scheme, demonstrating resilience against diverse threats.
{"title":"Integrate encryption of multiple images based on a new hyperchaotic system and Baker map","authors":"Xingbin Liu","doi":"10.1007/s00530-024-01449-y","DOIUrl":"https://doi.org/10.1007/s00530-024-01449-y","url":null,"abstract":"<p>Image encryption serves as a crucial means to safeguard information against unauthorized access during both transmission and storage phases. This paper introduces an integrated encryption algorithm tailored for multiple images, leveraging a novel hyperchaotic system and the Baker map to augment the key space and enhance security measures. The methodology encompasses a permutation-diffusion framework, employing sequences derived from the hyperchaotic system for both permutation and diffusion operations. Initially, the multiple images undergo intermixing, consolidating them into a singular image. Subsequently, the Baker map is employed to further scramble this amalgamated image, thereby extending the scrambling period. Ultimately, the ciphertext image is generated through forward–backward diffusion applied to the pixel sequence of the Zigzag scanned image. Experimental findings substantiate the high-security efficacy of the proposed scheme, demonstrating resilience against diverse threats.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"44 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D human pose estimation method based on multi-constrained dilated convolutions
Huaijun Wang, Bingqian Bai, Junhuai Li, Hui Ke, Wei Xiang
Pub Date: 2024-08-14. DOI: 10.1007/s00530-024-01441-6
In recent years, research on 2D-to-3D human pose estimation has gained increasing attention. However, challenges such as depth ambiguity and self-occlusion still need to be addressed. To this end, we propose a 3D human pose estimation method based on multi-constrained dilated convolutions. The approach uses a local constraint based on graph convolution and a global constraint based on a fully connected network, and it utilizes a dilated temporal convolution network to capture long-term temporal correlations of human poses. Taking 2D joint-coordinate sequences as input, the local constraint module constructs cross-joint and equipotential connections for the human skeleton, while the global constraint module encodes global semantic information about posture. Finally, the constraint modules and the temporal-correlation modeling are alternately connected to produce the 3D pose estimate. The method was validated on the public datasets Human3.6M and MPI-INF-3DHP; the results show that it effectively reduces 3D human pose estimation error and demonstrates a degree of generalization ability.
Exploring multi-dimensional interests for session-based recommendation
Yuhan Yang, Jing Sun, Guojia An
Pub Date: 2024-08-13. DOI: 10.1007/s00530-024-01437-2
Session-based recommendation (SBR) aims to recommend the next clicked item to users by mining the user's interaction sequences in the current session. It has received widespread attention recently due to its excellent privacy protection capabilities. However, existing SBR methods have the following limitations: (1) session sequences contain noisy information; (2) it is challenging to simultaneously model users' long-term stable interests and their dynamically changing interests; (3) the internal relationships between different interest representations are often neglected. To address these issues, we propose an Exploring Multi-Dimensional Interests model for session-based recommendation, termed EMDI, which attempts to predict more accurate and complete user intentions from multiple dimensions of user interest. Specifically, EMDI comprises three modules: (1) an interest enhancement module that filters noise and enhances the interest expressions in the user's behavior sequences, providing high-quality item embeddings; (2) an interest mining module that separately mines users' multi-dimensional interests, including static interests, local dynamic interests, and global dynamic interests, to capture users' tendencies in different interest dimensions; (3) an interest fusion module that dynamically aggregates users' interest representations from different dimensions through a novel multi-layer gated fusion network, so that implicit associations between interest representations can be captured. Extensive experimental results show that EMDI performs significantly better than other state-of-the-art methods.
{"title":"Exploring multi-dimensional interests for session-based recommendation","authors":"Yuhan Yang, Jing Sun, Guojia An","doi":"10.1007/s00530-024-01437-2","DOIUrl":"https://doi.org/10.1007/s00530-024-01437-2","url":null,"abstract":"<p>Session-based recommendation (SBR) aims to recommend the next clicked item to users by mining the user’s interaction sequences in the current session. It has received widespread attention recently due to its excellent privacy protection capabilities. However, existing SBR methods have the following limitations: (1) there exists noisy information in session sequences; (2) it is a challenge to simultaneously model both the long-term stable and dynamic changing interests of users; (3) the internal relationships between different interest representations are often neglected. To address the above issues, we propose an <u>E</u>xploring <u>M</u>ulti-<u>D</u>imensional <u>I</u>nterests for session-based recommendation model, termed EMDI, which attempts to predict more accurate and complete user intentions from multiple dimensions of user interests. Specifically, the EMDI contains the following three aspects: (1) the interest enhancement module aims to filter noise and enhance the interest expressions in the user’s behavior sequences, providing high-quality item embeddings; (2) the interest mining module separately mines users’ multi-dimensional interests, including static interests, local dynamic interests, and global dynamic interests, to capture users’ tendencies in different dimensions of interest; (3) the interest fusion module is designed to dynamically aggregate users’ interest representations from different dimensions through a novel multi-layer gated fusion network so that the implicit association between interest representations can be captured. Extensive experimental results show that the EMDI performs significantly better than other state-of-the-art methods.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"82 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PS-YOLO: a small object detector based on efficient convolution and multi-scale feature fusion
Shifeng Peng, Xin Fan, Shengwei Tian, Long Yu
Pub Date: 2024-08-13. DOI: 10.1007/s00530-024-01447-0
Compared with general object detection, research on small object detection has progressed slowly, mainly because of the need to learn appropriate features from the limited information small objects carry, compounded by difficulties such as information loss during the forward propagation of neural networks. To solve this problem, this paper proposes an object detector named PS-YOLO whose model: (1) reconstructs the C2f module to reduce the weakening or loss of small-object features during the deep stacking of the backbone network; (2) optimizes neck feature fusion using the PD module, which fuses features of different levels and sizes to improve the model's feature fusion capability at multiple scales; and (3) designs the multi-channel aggregate receptive field module (MCARF) for downsampling, extending the image receptive field and recognizing more local information. Experimental results on three public datasets show that the algorithm achieves satisfactory accuracy, precision, and recall.
{"title":"PS-YOLO: a small object detector based on efficient convolution and multi-scale feature fusion","authors":"Shifeng Peng, Xin Fan, Shengwei Tian, Long Yu","doi":"10.1007/s00530-024-01447-0","DOIUrl":"https://doi.org/10.1007/s00530-024-01447-0","url":null,"abstract":"<p>Compared to generalized object detection, research on small object detection has been slow, mainly due to the need to learn appropriate features from limited information about small objects. This is coupled with difficulties such as information loss during the forward propagation of neural networks. In order to solve this problem, this paper proposes an object detector named PS-YOLO with a model: (1) Reconstructs the C2f module to reduce the weakening or loss of small object features during the deep superposition of the backbone network. (2) Optimizes the neck feature fusion using the PD module, which fuses features at different levels and sizes to improve the model’s feature fusion capability at multiple scales. (3) Design the multi-channel aggregate receptive field module (MCARF) for downsampling to extend the image receptive field and recognize more local information. The experimental results of this method on three public datasets show that the algorithm achieves satisfactory accuracy, prediction, and recall.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"12 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}