A survey of multimodal federated learning: background, applications, and perspectives
Pub Date: 2024-07-29, DOI: 10.1007/s00530-024-01422-9
Hao Pan, Xiaoli Zhao, Lipeng He, Yicong Shi, Xiaogang Lin
Multimodal Federated Learning (MMFL) is a machine learning paradigm that extends traditional Federated Learning (FL) to support collaborative training of local models on data from multiple modalities. With vast amounts of multimodal data being generated and stored by the internet, sensors, and mobile devices, and with artificial intelligence models iterating rapidly, the demand for multimodal models is growing quickly. Although FL has been widely studied in recent years, most existing research has been conducted in unimodal settings. With the aim of inspiring further applications and research within the MMFL paradigm, we conduct a comprehensive review of the progress and challenges across various aspects of state-of-the-art MMFL. Specifically, we analyze the research motivation for MMFL, propose a new classification of existing work, discuss the available datasets and application scenarios, and offer perspectives on the opportunities and challenges that MMFL faces.
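For readers new to the FL paradigm that MMFL extends, below is a minimal sketch of FedAvg-style server-side aggregation, the baseline most federated (including multimodal) methods build on; the function and variable names are illustrative and not taken from the paper.

```python
import torch

def fedavg_aggregate(client_states, client_sizes):
    """Weighted average of client model state_dicts (FedAvg-style sketch).

    client_states: list of dicts mapping parameter name -> torch.Tensor
    client_sizes:  list of local dataset sizes used as aggregation weights
    MMFL methods typically extend this by aggregating per-modality encoders.
    """
    total = float(sum(client_sizes))
    global_state = {}
    for name in client_states[0]:
        # weight each client's parameter tensor by its share of the total data
        weighted = [state[name].float() * (n / total)
                    for state, n in zip(client_states, client_sizes)]
        global_state[name] = torch.stack(weighted).sum(dim=0)
    return global_state
```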
GAN-based image steganography by exploiting transform domain knowledge with deep networks
Pub Date: 2024-07-29, DOI: 10.1007/s00530-024-01427-4
Xiao Li, Liquan Chen, Jianchang Lai, Zhangjie Fu, Suhui Liu
Image steganography secures the transmission of secret information by hiding it within routine multimedia transmission. When images are generated with a Generative Adversarial Network (GAN), the embedding and recovery of secret bits can rely entirely on deep networks, relieving much manual design effort. However, existing GAN-based methods typically design their deep networks by adapting generic deep learning structures to image steganography. These structures lack feature extraction that is effective for steganography, resulting in low imperceptibility. To address this problem, we propose GAN-based image steganography that exploits transform domain knowledge with deep networks, called EStegTGANs. Unlike existing GAN-based methods, we explicitly introduce transform domain knowledge via the Discrete Wavelet Transform (DWT) and its inverse (IDWT) in the deep networks, ensuring that each network operates on DWT features. Specifically, the encoder embeds secrets and generates stego images with explicit DWT and IDWT operations, while the decoder recovers secrets and the discriminator evaluates feature distributions with the explicit DWT. Using traditional DWT and IDWT, we first propose EStegTGAN-coe, which directly adopts the DWT coefficients of pixels for embedding and recovery. To create more feature redundancy for secrets, we instead extract DWT features from the intermediate features of the deep networks, yielding EStegTGAN-DWT with traditional DWT and IDWT. To rely entirely on deep networks without traditional filters, we further design convolutional DWT and IDWT operations that perform the same feature transformation as the traditional ones, and replace the traditional operations in EStegTGAN-DWT with these convolutional counterparts. Comprehensive experimental results demonstrate that our proposals significantly improve imperceptibility, and that the designed convolutional DWT and IDWT are more effective than traditional DWT and IDWT at distinguishing the high-frequency characteristics of images for steganography.
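The abstract above describes expressing the DWT as a convolution inside the network. Below is a minimal sketch, assuming a one-level Haar transform, of how a fixed stride-2 grouped convolution can reproduce the traditional DWT sub-bands; the class name and sub-band layout are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HaarDWT(nn.Module):
    """One-level 2D Haar DWT expressed as a fixed stride-2 grouped convolution.

    Input (B, C, H, W) -> output (B, 4*C, H/2, W/2): LL, LH, HL, HH sub-bands per channel.
    """
    def __init__(self, channels):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        bank = torch.stack([ll, lh, hl, hh]).unsqueeze(1)   # (4, 1, 2, 2)
        weight = bank.repeat(channels, 1, 1, 1)             # (4*C, 1, 2, 2), one bank per channel
        self.conv = nn.Conv2d(channels, 4 * channels, kernel_size=2,
                              stride=2, groups=channels, bias=False)
        self.conv.weight = nn.Parameter(weight, requires_grad=False)  # fixed wavelet filters

    def forward(self, x):
        return self.conv(x)
```

A learnable variant would simply leave `requires_grad=True`, which is one way a "convolutional DWT" can be trained end-to-end while starting from the traditional transform.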
Coordinate-aligned multi-camera collaboration for active multi-object tracking
Zeyu Fang, Jian Zhao, Mingyu Yang, Zhenbo Lu, Wengang Zhou, Houqiang Li
Pub Date: 2024-07-29, DOI: 10.1007/s00530-024-01420-x
Active Multi-Object Tracking (AMOT) is a task in which cameras are controlled by a centralized system that adjusts their poses automatically and collaboratively so as to maximize the coverage of targets in their shared visual field. In AMOT, each camera receives only partial information from its own observation, which may mislead it into taking locally optimal actions. In addition, the global goal, i.e., maximum coverage of objects, is hard to optimize directly. To address these issues, we propose a coordinate-aligned multi-camera collaboration system for AMOT. In our approach, we regard each camera as an agent and address AMOT with a multi-agent reinforcement learning solution. To represent each agent's observation, we first identify the targets in the camera view with an image detector and then align the targets' coordinates via an inverse projection transformation. We define each agent's reward based on both global coverage and four individual reward terms. The agents' action policy is derived from a value-based Q-network. To the best of our knowledge, we are the first to study the AMOT task. To train and evaluate our system, we build a virtual yet credible 3D environment, named "Soccer Court", to mimic real-world AMOT scenarios. The experimental results show that our system outperforms the baseline and existing methods in various settings, including on real-world datasets.
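The coordinate alignment step above maps per-camera detections into one shared frame. Below is a minimal sketch of such an inverse projection onto the ground plane, assuming a calibrated pinhole camera; the function name and calling convention are illustrative, not the paper's code.

```python
import numpy as np

def pixel_to_ground(u, v, K, R, t):
    """Back-project a pixel (e.g. the bottom-center of a detection box) onto the
    ground plane z = 0, assuming intrinsics K (3x3), rotation R (3x3), translation t (3,).

    Returns (x, y) world coordinates shared by all cameras, so detections from
    different views can be aligned in one coordinate frame.
    """
    H = K @ np.column_stack((R[:, 0], R[:, 1], t))   # homography: ground plane -> image
    p = np.linalg.inv(H) @ np.array([u, v, 1.0])     # image -> ground plane (homogeneous)
    return p[0] / p[2], p[1] / p[2]
```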
SAM-guided contrast based self-training for source-free cross-domain semantic segmentation
Pub Date: 2024-07-26, DOI: 10.1007/s00530-024-01426-5
Qinghua Ren, Ke Hou, Yongzhao Zhan, Chen Wang
Traditional domain adaptive semantic segmentation methods typically assume access to source domain data during training, a paradigm known as source-access domain adaptation for semantic segmentation (SASS). To address data privacy concerns in real-world applications, source-free domain adaptation for semantic segmentation (SFSS) has recently been studied, eliminating the need for direct access to source data. Most SFSS methods primarily use pseudo-labels to regularize the model in either the label space or the feature space. Inspired by the Segment Anything Model (SAM), we propose SAM-guided contrast-based pseudo-label learning for SFSS. Unlike previous methods that rely heavily on noisy pseudo-labels, we leverage the class-agnostic segmentation masks generated by SAM as prior knowledge to construct positive and negative sample pairs, which allows us to shape the feature space directly with contrastive learning. This design ensures the reliable construction of contrastive samples and exploits both intra-class and intra-instance diversity. Our framework is built on a vanilla teacher-student architecture for online pseudo-label learning, so the SFSS model can be jointly regularized in both the feature and label spaces in an end-to-end manner. Extensive experiments demonstrate that our method achieves competitive performance on two challenging SFSS tasks.
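One way to picture the SAM-guided pair construction described above: treat each SAM mask as a region, pool one prototype per region, and contrast pixels against prototypes. A minimal sketch under that assumption follows; the paper's exact pair construction and loss may differ.

```python
import torch
import torch.nn.functional as F

def sam_guided_contrast(feat, sam_masks, temperature=0.1):
    """Pixel-to-prototype contrastive loss guided by class-agnostic SAM masks (sketch).

    feat:      (C, H, W) feature map from the segmentation network
    sam_masks: (M, H, W) boolean masks produced by SAM for one image
    Each pixel is pulled toward the prototype of the SAM segment covering it and
    pushed away from prototypes of the other segments.
    """
    C, H, W = feat.shape
    feat = F.normalize(feat.view(C, -1), dim=0)                  # (C, H*W), unit-norm per pixel
    masks = sam_masks.view(sam_masks.shape[0], -1).float()       # (M, H*W)
    protos = masks @ feat.t() / masks.sum(1, keepdim=True).clamp(min=1)
    protos = F.normalize(protos, dim=1)                          # (M, C) region prototypes
    logits = protos @ feat / temperature                         # (M, H*W) pixel-prototype similarity
    labels = masks.argmax(dim=0)                                 # first covering segment per pixel
    valid = masks.sum(dim=0) > 0                                 # ignore pixels not covered by any mask
    return F.cross_entropy(logits.t()[valid], labels[valid])
```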
RA-RevGAN: region-aware reversible adversarial example generation network for privacy-preserving applications
Jiacheng Zhao, Xiuming Zhao, Zhihua Gan, Xiuli Chai, Tianfeng Ma, Zhen Chen
Pub Date: 2024-07-26, DOI: 10.1007/s00530-024-01425-6
The rise of online sharing platforms has given people diverse and convenient ways to share images. However, these images contain a substantial amount of sensitive user information, which can easily be captured by malicious neural networks. To ensure the secure use of authorized protected data, reversible adversarial attack techniques have emerged. Existing algorithms for generating adversarial examples do not strike a good balance between visibility and attack capability, and the network oscillations that arise during training degrade the quality of the final examples. To address these shortcomings, we propose a novel reversible adversarial network based on generative adversarial networks (RA-RevGAN). The generator produces noise that maps features into image perturbations, while a region selection module confines these perturbations to the specific areas that most affect classification. Furthermore, a robust attack mechanism is integrated into the discriminator to stabilize training by optimizing convergence speed and minimizing time cost. Extensive experiments demonstrate that the proposed method achieves a high image generation rate, strong attack capability, and superior visual quality while maintaining high classification accuracy after image restoration.
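Below is a minimal sketch of the region-confined perturbation idea described above, assuming the generator output is bounded and gated by a region mask before being added to the image; the names and the budget value are illustrative, not the paper's settings.

```python
import torch

def apply_region_perturbation(image, noise, region_mask, epsilon=8.0 / 255.0):
    """Confine a generated perturbation to classification-critical regions (sketch).

    image:       (B, 3, H, W) in [0, 1]
    noise:       (B, 3, H, W) raw generator output
    region_mask: (B, 1, H, W) in [0, 1], 1 = area allowed to be perturbed
    epsilon:     per-pixel perturbation budget (assumed value)
    """
    delta = epsilon * torch.tanh(noise) * region_mask   # bounded, masked perturbation
    adv = (image + delta).clamp(0.0, 1.0)               # keep a valid image
    return adv, delta                                   # delta can be stored for reversal
```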
Application of CLIP for efficient zero-shot learning
Pub Date: 2024-07-26, DOI: 10.1007/s00530-024-01414-9
Hairui Yang, Ning Wang, Haojie Li, Lei Wang, Zhihui Wang
Zero-shot learning (ZSL) addresses the challenging task of recognizing classes absent during training. Existing methodologies focus on transferring knowledge from known to unknown categories by formulating a correlation between the visual and semantic spaces. However, these methods are constrained by the limited discriminability of visual features and the incompleteness of semantic representations. To alleviate these limitations, we propose a novel Collaborative learning Framework for Zero-Shot Learning (CFZSL), which integrates the CLIP architecture into a fundamental zero-shot learner. Specifically, the foundational zero-shot learning model extracts visual features through a set of CNNs and maps them into a domain-specific semantic space, while the CLIP image encoder extracts visual features carrying universal semantics. In this way, the CFZSL framework obtains visual features that are discriminative for both domain-specific and domain-agnostic semantics. Additionally, a more comprehensive semantic space is explored by combining the latent feature space learned by CLIP with the domain-specific semantic space. Notably, we use only the pre-trained parameters of the CLIP model, avoiding the high training cost and potential overfitting associated with fine-tuning. The proposed framework has a simple structure and is trained solely with classification and triplet loss functions. Extensive experiments on three widely recognized benchmark datasets, AwA2, CUB, and SUN, confirm the effectiveness and superiority of our approach.
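A minimal sketch of how a frozen CLIP image feature might be combined with a CNN feature and scored against class semantic embeddings; the fusion-by-addition, projection dimensions, and class names are assumptions for illustration, not the paper's exact design. Training would use the classification and triplet losses mentioned in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFZSLHead(nn.Module):
    """Fuse a CNN feature with a frozen CLIP image feature and score classes by
    cosine similarity to class semantic embeddings (e.g. attribute vectors). Sketch only."""
    def __init__(self, cnn_dim, clip_dim, sem_dim):
        super().__init__()
        self.proj_cnn = nn.Linear(cnn_dim, sem_dim)    # domain-specific branch
        self.proj_clip = nn.Linear(clip_dim, sem_dim)  # domain-agnostic (CLIP) branch

    def forward(self, cnn_feat, clip_feat, class_embeds):
        # clip_feat is assumed to come from a frozen CLIP image encoder
        v = F.normalize(self.proj_cnn(cnn_feat) + self.proj_clip(clip_feat), dim=-1)
        s = F.normalize(class_embeds, dim=-1)          # (num_classes, sem_dim)
        return v @ s.t()                               # (batch, num_classes) compatibility scores
```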
CMLCNet: medical image segmentation network based on convolution capsule encoder and multi-scale local co-occurrence
Pub Date: 2024-07-26, DOI: 10.1007/s00530-024-01430-9
Chendong Qin, Yongxiong Wang, Jiapeng Zhang
Medical images have low contrast and blurred boundaries between different tissues or between tissues and lesions. Because labeling medical images is laborious and requires expert knowledge, labeled data are expensive or simply unavailable. UNet has achieved great success in medical image segmentation, but the pooling layers used in downsampling tend to discard important information such as location, and the locality of the convolution operation makes it difficult to learn global and long-range semantic interactions. The usual remedies are collecting more data or enhancing the training data through augmentation; however, obtaining large medical datasets is difficult, and augmentation can increase the training burden. In this work, we propose a 2D medical image segmentation network with a convolutional capsule encoder and a multi-scale local co-occurrence module. To extract more local detail and contextual information, the capsule encoder learns information about the target location and the relationship between parts and the whole. Multi-scale features are fused by a new attention mechanism that captures global information to selectively emphasize salient, task-relevant features while suppressing background noise; this attention mechanism also preserves information that would otherwise be discarded by the network's pooling layers. In addition, a multi-scale local co-occurrence algorithm is proposed to better learn the context and dependencies between different regions of an image. Experimental results on the Liver, ISIC, and BraTS2019 datasets show that our network is superior to UNet and other previous medical image segmentation networks under the same experimental conditions.
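A minimal sketch of one way to fuse multi-scale features with a global-context channel gate, in the spirit of the attention mechanism described above; the SE-style gate and layer layout are assumptions for illustration, not the paper's module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttentionFusion(nn.Module):
    """Fuse feature maps from several scales with a channel gate computed from
    global context, emphasizing salient channels and damping background noise. Sketch only."""
    def __init__(self, channels, num_scales, reduction=4):
        super().__init__()
        total = channels * num_scales
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                  # global context per channel
            nn.Conv2d(total, total // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(total // reduction, total, 1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(total, channels, 1)

    def forward(self, feats):
        size = feats[0].shape[-2:]
        feats = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                 for f in feats]                      # bring every scale to the finest resolution
        x = torch.cat(feats, dim=1)
        x = x * self.gate(x)                          # channel-wise re-weighting
        return self.fuse(x)
```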
TrafficTrack: rethinking the motion and appearance cue for multi-vehicle tracking in traffic monitoring
Pub Date: 2024-07-25, DOI: 10.1007/s00530-024-01407-8
Hui Cai, Haifeng Lin, Dapeng Liu
Analyzing traffic flow from traffic monitoring data is an essential component of intelligent transportation systems. In most traffic scenarios vehicles are the primary targets, so multi-object tracking of vehicles in traffic monitoring is a critical subject. In view of current difficulties such as complex road conditions, numerous occlusions, and similar vehicle appearances, we propose a detection-based multi-object vehicle tracking algorithm that combines motion and appearance cues. First, to improve motion prediction accuracy, we propose a Kalman filter that adaptively updates its noise according to the motion matching cost and the detection confidence score, combined with an exponential transformation and residuals. We then propose a combined distance that exploits both motion and appearance cues. Finally, we present a trajectory recovery strategy to handle unmatched trajectories and detections. Experimental results on the UA-DETRAC dataset demonstrate that the method achieves excellent tracking performance for vehicle tracking in traffic monitoring views, meeting the practical demands of complex traffic scenarios.
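A minimal sketch of a combined motion-and-appearance cost of the kind described above; the blending scheme and the alpha value are illustrative assumptions, not the paper's exact formulation. The resulting matrix would typically be passed to scipy.optimize.linear_sum_assignment to obtain track-detection matches.

```python
import numpy as np

def combined_cost(motion_cost, track_embs, det_embs, alpha=0.7):
    """Blend a motion cost matrix with an appearance (cosine) cost matrix (sketch).

    motion_cost: (T, D) e.g. 1 - IoU between predicted track boxes and detections
    track_embs:  (T, E) L2-normalized track appearance embeddings
    det_embs:    (D, E) L2-normalized detection appearance embeddings
    alpha:       weight on the motion cue (assumed value)
    """
    appearance_cost = 1.0 - track_embs @ det_embs.T   # (T, D), low when appearances match
    return alpha * motion_cost + (1.0 - alpha) * appearance_cost
```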
Fs-yolo: fire-smoke detection based on improved YOLOv7
Dongmei Wang, Ying Qian, Jingyi Lu, Peng Wang, Zhongrui Hu, Yongkang Chai
Pub Date: 2024-07-24, DOI: 10.1007/s00530-024-01359-z
Fire has emerged as a major threat to the Earth's ecological equilibrium and to human well-being, making fire detection and alert systems essential. Public fire datasets containing examples of fire and smoke in real-world situations are scarce, and existing techniques for recognizing objects in fire smoke are imprecise and unreliable when identifying small objects. We built a dual dataset to evaluate how well a model handles these difficulties and introduce FS-YOLO, a new fire detection model with improved accuracy. Training YOLOv7 can lead to overfitting because of its large number of parameters and the limited fire detection object categories, and YOLOv7 struggles to recognize small, dense objects during feature extraction, resulting in missed detections. We therefore enhance the Swin Transformer module to decrease local feature interdependence, obtain a wider range of parameters, and handle features at several levels; these improvements strengthen the model's robustness and the network's ability to recognize dense, tiny objects. Efficient channel attention is incorporated to reduce false fire detections: localizing the region of interest and extracting meaningful information helps the model identify pertinent areas and minimize false detections. The proposal also uses fire-smoke and real-fire-smoke datasets, the latter of which simulates real-world conditions with occlusions, lens blur, and motion blur to test the model's robustness and adaptability in complex situations. On the two datasets, the mAP of FS-YOLO improves on YOLOv7 by 6.4% and 5.4%, respectively. In the robustness-check experiments, the mAP of FS-YOLO is 4.1% and 3.1% higher than that of the current SOTA models YOLOv8s and DINO.
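The efficient channel attention mentioned above is commonly realized as a 1D convolution over pooled channel descriptors; below is a minimal sketch of that standard ECA formulation (whether FS-YOLO uses exactly this variant is an assumption).

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: a small 1D convolution over channel descriptors,
    so channels interact locally without any dimensionality reduction."""
    def __init__(self, k_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                  # x: (B, C, H, W)
        y = self.pool(x)                                   # (B, C, 1, 1) channel descriptors
        y = self.conv(y.squeeze(-1).transpose(1, 2))       # (B, 1, C): 1D conv across channels
        y = self.sigmoid(y.transpose(1, 2).unsqueeze(-1))  # back to (B, C, 1, 1)
        return x * y                                       # re-weight channels
```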
Insulator defect detection based on BaS-YOLOv5
Pub Date: 2024-07-23, DOI: 10.1007/s00530-024-01413-w
Yu Zhang, Yinke Dou, Kai Yang, Xiaoyang Song, Jin Wang, Liangliang Zhao
Currently, deep learning approaches for detecting defects in transmission line insulators from images obtained through unmanned aerial vehicle inspection suffer from insufficient detection accuracy and speed. This study therefore first introduces the bidirectional feature pyramid network (BiFPN) module into YOLOv5 to achieve high detection speed, combine image features at different scales, enhance information representation, and enable accurate detection of insulator defects at different scales. The BiFPN module is then combined with the simple parameter-free attention module (SimAM) to improve feature representation and object detection accuracy; SimAM also enables fusion of features at multiple scales, further improving insulator defect detection. Finally, multiple experimental controls were designed to verify the effectiveness and efficiency of the proposed model. Experimental results on self-made datasets show that the combined BiFPN and SimAM model (the improved BaS-YOLOv5) outperforms the original YOLOv5 model: precision, recall, average precision, and F1 score increase by 6.2%, 5%, 5.9%, and 6%, respectively. BaS-YOLOv5 thus substantially improves detection accuracy while maintaining a high detection speed, meeting the requirements of real-time insulator defect detection.
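SimAM, referenced above, is parameter-free and has a simple closed form; below is a minimal sketch of that published formulation (the lambda value is the commonly used default and is an assumption here).

```python
import torch

def simam(x, e_lambda=1e-4):
    """SimAM: parameter-free attention that weights each activation by an energy term
    derived from how much it deviates from its channel mean.  x: (B, C, H, W)."""
    b, c, h, w = x.shape
    n = h * w - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # squared deviation from channel mean
    v = d.sum(dim=(2, 3), keepdim=True) / n             # per-channel variance estimate
    e_inv = d / (4 * (v + e_lambda)) + 0.5              # inverse energy per activation
    return x * torch.sigmoid(e_inv)                     # re-weight activations, no parameters
```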