Pub Date: 2026-01-01 | Epub Date: 2025-12-11 | DOI: 10.1016/j.cviu.2025.104607
Zhigang Liu, Fuyuan Xing, Hao Huang, Kexin Wang, Yuxuan Shao
Existing IoU-guided trackers suppress background distractors by weighting the classification scores with IoU predictions, which limits their effectiveness in complex tracking scenarios. In this paper, we propose a Distractor feature suppression Siamese network with Task-aware attention (SiamDT) for visual tracking. First, we design a distractor feature suppression network that uses IoU scores to suppress distractor features in the classification features, achieving distractor suppression at the feature level. Second, we design a task-aware attention network that reconstructs the cross-correlation features using a hybrid attention mechanism, enhancing the semantic representation of the classification and regression branches across the spatial and channel domains. Extensive experiments on benchmarks including OTB2013, OTB2015, UAV123, LaSOT, and GOT10k demonstrate that the proposed SiamDT achieves state-of-the-art tracking performance.
Title: Distractor suppression Siamese network with task-aware attention for visual tracking (Computer Vision and Image Understanding, vol. 263, Article 104607)
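The contrast the abstract draws, weighting classification scores by IoU versus suppressing distractor features, can be sketched as follows. The feature-level function is an illustrative reading of SiamDT's idea (shapes, names, and the scaling rule are assumptions), not the paper's implementation:

```python
import numpy as np

def iou_weighted_scores(cls_scores, iou_preds):
    # Conventional IoU-guided suppression: classification scores are
    # down-weighted wherever the predicted IoU is low.
    return cls_scores * iou_preds

def feature_level_suppression(cls_feat, iou_map):
    # Hypothetical feature-level variant in the spirit of SiamDT:
    # scale each spatial feature vector by its IoU score *before*
    # classification. cls_feat: (C, H, W), iou_map: (H, W) in [0, 1].
    return cls_feat * iou_map[None, :, :]
```

A distractor location with high classification score but low predicted IoU is attenuated in both variants; the feature-level form lets the suppression influence everything computed downstream of the features.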
Pub Date: 2026-01-01 | Epub Date: 2025-12-04 | DOI: 10.1016/j.cviu.2025.104596
Lei Zhang, Yongqiu Huang, Yingjun Du, Fang Lei, Zhiying Yang, Cees G.M. Snoek, Yehui Wang
This paper addresses the challenge of segmenting an object in an image based solely on a textual description, without requiring any training on specific object classes. In contrast to traditional methods that rely on generating numerous mask proposals, we introduce a novel patch-based approach. Our method computes the similarity between small image patches, extracted using a sliding window, and textual descriptions, producing a patch score map that identifies the regions most likely to contain the target object. This score map guides a segment-anything model to generate precise mask proposals. To further improve segmentation accuracy, we refine the textual prompts by generating detailed object descriptions using a multi-modal large language model. Our method’s effectiveness is validated through extensive experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets, where it outperforms state-of-the-art zero-shot referring image segmentation methods. Ablation studies confirm the key contributions of our patch-based segmentation and localized text prompt refinement, demonstrating their significant role in enhancing both precision and robustness.
Title: LoTeR: Localized text prompt refinement for zero-shot referring image segmentation (Computer Vision and Image Understanding, vol. 263, Article 104596)
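The patch-scoring step described above can be sketched as a sliding window over an image feature map, scoring each average-pooled patch against a text embedding by cosine similarity. The encoders (e.g. a CLIP-like model) are abstracted away, and all shapes and names are illustrative assumptions:

```python
import numpy as np

def patch_score_map(image_feats, text_emb, patch=4, stride=4):
    # image_feats: (C, H, W) feature map; text_emb: (C,) text embedding.
    # Returns a grid of cosine similarities, one per window position.
    C, H, W = image_feats.shape
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    scores = np.zeros((rows, cols))
    t = text_emb / np.linalg.norm(text_emb)
    for i in range(rows):
        for j in range(cols):
            y, x = i * stride, j * stride
            # Average-pool the patch into a single feature vector.
            p = image_feats[:, y:y + patch, x:x + patch].mean(axis=(1, 2))
            scores[i, j] = np.dot(p / (np.linalg.norm(p) + 1e-8), t)
    return scores
```

The argmax of this score map would then seed the segment-anything model with a point or box prompt for the most likely region.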
Pub Date: 2026-01-01 | Epub Date: 2025-12-03 | DOI: 10.1016/j.cviu.2025.104573
Zeyang Chen, Chunyu Lin, Yao Zhao, Tammam Tillo
This paper proposes an unsupervised multi-modal domain adaptation approach for semantic segmentation of visible and thermal images. The method addresses data scarcity by transferring knowledge from existing semantic segmentation networks, thereby avoiding the high costs of data labeling. We take changes in temperature and light into account to reduce the intra-domain gap between visible and thermal images captured during the day and at night. Additionally, we narrow the inter-domain gap between visible and thermal images using a self-distillation loss. Our approach enables high-quality semantic segmentation without annotations, even under challenging conditions such as nighttime and adverse weather. Experiments conducted on both visible and thermal benchmarks demonstrate the effectiveness of our method, quantitatively and qualitatively.
Title: Unsupervised multi-modal domain adaptation for RGB-T Semantic Segmentation (Computer Vision and Image Understanding, vol. 263, Article 104573)
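One plausible form of the self-distillation loss mentioned above is a KL divergence between temperature-softened predictions of the visible and thermal branches; the paper's exact formulation may differ, and the temperature value is an assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_distillation_loss(logits_rgb, logits_thermal, T=2.0):
    # KL(p_rgb || p_thermal) on temperature-softened predictions,
    # averaged over the batch dimension. Minimizing this pulls the
    # two modality branches toward consistent outputs.
    p = softmax(logits_rgb / T)
    q = softmax(logits_thermal / T)
    return float(np.sum(p * (np.log(p + 1e-8) - np.log(q + 1e-8))) / p.shape[0])
```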
Pub Date: 2026-01-01 | Epub Date: 2025-12-01 | DOI: 10.1016/j.cviu.2025.104566
HuaDong Li
In high-resolution remote sensing interpretation, object detection is evolving from closed-set to open-set, i.e., generalizing traditional detection models to detect objects described by an open vocabulary. The rapid development of vision-language pre-training in recent years has made research on open-vocabulary detection (OVD) feasible, which is also considered a critical step in the transition from weak to strong artificial intelligence. However, limited by the scarcity of large-scale vision-language paired datasets, research on open-vocabulary detection for high-resolution remote sensing images (RS-OVD) significantly lags behind that for natural images. Additionally, the high scale variability of remote-sensing objects poses further challenges for open-vocabulary object detection. To address these challenges, we disentangle the generalization process into an object-level task transformation problem and a semantic expansion problem, and propose a Cascade Knowledge Distillation model that addresses these problems stage by stage. We evaluate our method on the DIOR and NWPU VHR-10 datasets. The experimental results demonstrate that the proposed method effectively generalizes the object detector to unknown categories.
Title: Open-vocabulary object detection for high-resolution remote sensing images (Computer Vision and Image Understanding, vol. 263, Article 104566)
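Stage-wise (cascade) feature distillation can be sketched, at its simplest, as an L2 regression of the student's features onto a frozen teacher's at each stage; the paper's Cascade Knowledge Distillation model is more involved, and the loss form here is an assumption:

```python
import numpy as np

def cascade_distill_losses(student_feats, teacher_feats):
    # One L2 alignment term per stage of the cascade; downstream code
    # would typically sum or weight these per-stage losses.
    return [float(np.mean((s - t) ** 2))
            for s, t in zip(student_feats, teacher_feats)]
```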
Pub Date: 2026-01-01 | Epub Date: 2025-11-20 | DOI: 10.1016/j.cviu.2025.104571
Bin Xu, Yazhou Zhu, Shidong Wang, Yang Long, Haofeng Zhang
Few-Shot Medical Image Segmentation (FSMIS) aims to achieve precise segmentation of different organs using minimal annotated data. Current prototype-based FSMIS methods primarily extract prototypes from support samples through random sampling or local averaging. However, because boundary features make up an extremely small proportion of the support features, traditional methods have difficulty generating boundary prototypes, resulting in poorly delineated boundaries in segmentation results. Moreover, their reliance on a single support image for segmenting all query images leads to significant performance degradation when substantial discrepancies exist between support and query images. To address these challenges, we propose Boundary-extended Prototypes and Momentum Inference (BePMI), which includes two key modules: a Boundary-extended Prototypes (BePro) module and a Momentum Inference (MoIf) module. BePro constructs boundary prototypes by explicitly clustering internal and external boundary features to alleviate boundary ambiguity. MoIf exploits the spatial consistency of adjacent slices in 3D medical images to dynamically optimize the prototype representation, thereby reducing reliance on a single sample. Extensive experiments on three publicly available medical image datasets demonstrate that our method outperforms the state-of-the-art methods. Code is available at https://github.com/xubin471/BePMI.
Title: Few-shot Medical Image Segmentation via Boundary-extended Prototypes and Momentum Inference (Computer Vision and Image Understanding, vol. 263, Article 104571)
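The prototype pipeline underlying methods like BePMI can be sketched with masked average pooling, nearest-prototype assignment, and a momentum (EMA) update across adjacent slices. BePro's boundary-feature clustering is omitted, and all names and the momentum coefficient are illustrative:

```python
import numpy as np

def masked_average_prototype(feats, mask):
    # feats: (C, H, W) support features; mask: (H, W) binary.
    # Returns the class prototype (C,) via masked average pooling.
    return (feats * mask[None]).sum(axis=(1, 2)) / (mask.sum() + 1e-8)

def prototype_predict(query_feats, prototypes):
    # Assign each query pixel to the nearest prototype by cosine similarity.
    C, H, W = query_feats.shape
    q = query_feats.reshape(C, -1)
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    P = np.stack([p / (np.linalg.norm(p) + 1e-8) for p in prototypes])
    return (P @ q).argmax(axis=0).reshape(H, W)

def momentum_update(proto, new_proto, m=0.9):
    # MoIf-style exponential moving average across adjacent 3D slices,
    # so no single slice dominates the prototype.
    return m * proto + (1 - m) * new_proto
```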
Pub Date: 2026-01-01 | Epub Date: 2025-12-15 | DOI: 10.1016/j.cviu.2025.104610
Yuanxiang Fang, Jingyue Wang, Meiqing Wang, Shujie Zhang, Huimin Liu
Object repositioning in real images remains a challenging task. Existing approaches are typically built upon the DDIM inversion framework, whose sampling initialization tends to preserve strong layout priors in the latent space, thereby leading to object residuals or ghosting artifacts in the vacated region. Additionally, masking low-resolution self-attention maps often results in boundary misjudgments, which impair the inpainting capability. To address these limitations, we propose FreqOR, a training-free framework that integrates sampling initialization optimization with attention-level enhancements. For sampling initialization, high-frequency components of the inverted latent in the vacated region are suppressed to weaken inherited priors, thereby providing a cleaner sampling initialization. For attention enhancement, we incorporate two complementary strategies. The first is Resolution-Aligned Key–Value Interpolation, which achieves precise regional control by enabling pixel-wise masking of attention maps. The second is Query-Guided Consistency, which preserves the identity and texture consistency of the designated object by reusing inversion queries as priors during sampling. Integrated into the energy-based guidance framework, FreqOR is evaluated on the COCO-130 and VOC-100 datasets. The results demonstrate that it effectively suppresses residuals in the vacated region and enhances object consistency.
Title: FreqOR: Frequency-guided sampling initialization with attention enhancements for training-free object repositioning (Computer Vision and Image Understanding, vol. 263, Article 104610)
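The frequency suppression step can be illustrated as a low-pass filter on a 2D latent channel: zero out FFT coefficients beyond some radius from the DC term. FreqOR applies this kind of suppression only inside the vacated region; the hard circular cutoff and the radius here are assumptions for the sketch:

```python
import numpy as np

def suppress_high_freq(latent, keep_radius=4):
    # latent: (H, W) single latent channel. Zero all frequency
    # components farther than keep_radius from DC, then invert.
    H, W = latent.shape
    F = np.fft.fftshift(np.fft.fft2(latent))        # DC moved to the center
    yy, xx = np.mgrid[0:H, 0:W]
    dist = np.hypot(yy - H // 2, xx - W // 2)
    F[dist > keep_radius] = 0                        # drop high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))
```

Suppressing the high-frequency components weakens the fine-grained layout priors inherited from DDIM inversion, which is what lets the vacated region be re-synthesized cleanly.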
Pub Date: 2026-01-01 | Epub Date: 2025-12-17 | DOI: 10.1016/j.cviu.2025.104605
Xingli Zhang, Yameng Liu, Haiyang Yu, Zhihui Wang
Medical image segmentation serves as a critical technique in clinical applications such as disease diagnosis, surgical planning, and image-guided therapy, where segmentation accuracy directly impacts the precision of clinical decisions. However, existing methods still face significant challenges in handling inherent issues of medical images, including blurred boundaries, complex multi-scale structures, and difficulties in fine-grained feature representation. To address these challenges, this paper proposes a medical image segmentation method based on a diffusion probabilistic model, MFDiff, which aims to enhance multi-scale contextual awareness and fine-grained structural modeling capabilities. The method incorporates a frequency-aware attention fusion module that effectively strengthens the model’s ability to represent complex structures and ambiguous boundaries. Additionally, a multi-scale feature enhancement module is introduced to expand the receptive field while maintaining low computational cost, thereby improving the extraction and fusion of multi-scale features. Furthermore, an uncertainty-weighted majority voting fusion strategy is proposed to enhance the robustness and consistency of fused predictions from multiple sampling iterations. The proposed method was validated on five medical image segmentation datasets. Experimental results demonstrate that MFDiff outperforms current mainstream methods across all datasets, exhibiting strong generalization ability and robustness.
Title: MFDiff: Diffusion probabilistic model for medical image segmentation with multi-scale features and frequency-aware attention (Computer Vision and Image Understanding, vol. 263, Article 104605)
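A plausible reading of the uncertainty-weighted voting strategy: fuse per-sample foreground probability maps from repeated diffusion sampling, weighting each vote by its confidence. The specific confidence measure below (distance of each probability from 0.5) is an assumption, not the paper's formula:

```python
import numpy as np

def uncertainty_weighted_vote(prob_maps):
    # prob_maps: list of (H, W) foreground probabilities, one per
    # diffusion sampling run. Confident pixels (far from p = 0.5)
    # get larger voting weight; uncertain pixels get less.
    probs = np.stack(prob_maps)              # (N, H, W)
    conf = np.abs(2.0 * probs - 1.0)         # 0 at p = 0.5, 1 when certain
    fused = (conf * probs).sum(axis=0) / (conf.sum(axis=0) + 1e-8)
    return (fused > 0.5).astype(np.uint8)    # final binary mask
```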
This paper proposes human-in-the-loop adaptation for Group Activity Feature Learning (GAFL) without group activity annotations. This human-in-the-loop adaptation is employed in a group-activity video retrieval framework to improve its retrieval performance. Our method initially pre-trains the GAF space based on the similarity of group activities in a self-supervised manner, unlike prior work that classifies videos into pre-defined group activity classes in a supervised learning manner. Our interactive fine-tuning process updates the GAF space to allow a user to better retrieve videos similar to query videos given by the user. In this fine-tuning, our proposed data-efficient video selection process provides several videos, which are selected from a video database, to the user in order to manually label these videos as positive or negative. These labeled videos are used to update (i.e., fine-tune) the GAF space, so that the positive and negative videos move closer to and farther away from the query videos through contrastive learning. Our comprehensive experimental results on two team sports datasets validate that our method significantly improves the retrieval performance. Ablation studies also demonstrate that several components in our human-in-the-loop adaptation contribute to the improvement of the retrieval performance. Code: https://github.com/chihina/GAFL-FINE-CVIU.
Title: Human-in-the-loop adaptation in group activity feature learning for team sports video retrieval (Computer Vision and Image Understanding, vol. 263, Article 104577)
Chihiro Nakatani, Hiroaki Kawashima, Norimichi Ukita
Pub Date: 2026-01-01 | DOI: 10.1016/j.cviu.2025.104577
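The interactive fine-tuning objective described above can be sketched as an InfoNCE-style loss that pulls user-labeled positive videos toward the query embedding and pushes negatives away; the embedding space and temperature are illustrative, not the paper's exact configuration:

```python
import numpy as np

def contrastive_loss(query, positives, negatives, tau=0.1):
    # query: (D,) query-video embedding; positives/negatives: lists of
    # (D,) embeddings the user labeled during the fine-tuning round.
    def sim(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.array([np.exp(sim(query, p) / tau) for p in positives])
    neg = np.array([np.exp(sim(query, n) / tau) for n in negatives])
    # Lower loss when positives are close to the query and negatives far.
    return float(-np.log(pos.sum() / (pos.sum() + neg.sum())))
```

Minimizing this over the user's labels is what updates (fine-tunes) the GAF space between retrieval rounds.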
Pub Date: 2026-01-01 | Epub Date: 2025-12-04 | DOI: 10.1016/j.cviu.2025.104598
Hongcheng Xue, Tong Gao, Zhan Tang, Yuantian Xia, Longhe Wang, Lin Li
To address the challenge of balancing detection accuracy and efficiency for small objects in complex aerial scenes, we propose a Configurable Global Context Reconstruction Hybrid Detector (GCRH) to enhance overall detection performance. The GCRH framework consists of three key components. First, the Efficient Re-parameterized Encoder (ERE) reduces the computational overhead of multi-head self-attention through re-parameterization while maintaining the integrity and independence of global–local feature interactions. Second, the Global-Aware Feature Pyramid Network (GAFPN) reconstructs and injects global contextual semantics, cascading selective feature fusion to distribute this semantic information across feature layers, thereby alleviating small-object feature degradation and cross-level semantic inconsistency. Finally, two configurable model variants are provided, allowing the control of high-resolution feature layers to balance detection accuracy and inference efficiency. Experiments on the VisDrone2019 and TinyPerson datasets demonstrate that GCRH achieves an effective trade-off between precision and efficiency, validating its applicability to small object detection in aerial imagery. The code is available at: https://github.com/Mundane-X/GCRH.
Title: A configurable global context reconstruction hybrid detector for enhanced small object detection in UAV aerial imagery (Computer Vision and Image Understanding, vol. 263, Article 104598)
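The re-parameterization idea the ERE builds on can be illustrated with the classic structural merge: a parallel 1x1 branch folds exactly into a 3x3 kernel at inference time, so the multi-branch training structure collapses to a single convolution. This shows the general (RepVGG-style) technique, not the paper's specific encoder:

```python
import numpy as np

def conv2d(x, k):
    # Valid-mode single-channel 2D cross-correlation, for demonstration.
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def merge_branches(k3, k1):
    # Fold a parallel 1x1 branch into the 3x3 kernel: zero-pad the
    # 1x1 kernel to 3x3 (i.e., add it at the center) and sum.
    merged = k3.copy()
    merged[1, 1] += k1[0, 0]
    return merged
```

With "same" padding, conv(x, k3) + conv(x, k1) equals conv(x, merge_branches(k3, k1)), so the merged kernel is a drop-in replacement with one branch's compute.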
Pub Date : 2026-01-01Epub Date: 2025-12-17DOI: 10.1016/j.cviu.2025.104615
Jinyi Li , Longyu Yang , Donghyun Kim , Kuniaki Saito , Kate Saenko , Stan Sclaroff , Xiaofeng Zhu , Ping Hu
Recent prompt-driven zero-shot adaptation methods offer a promising way to handle domain shifts in semantic segmentation by learning with features simulated from natural language prompts. However, these methods typically depend on a fixed set of predefined domain descriptions, which limits their capacity to generalize to previously undefined domains and often necessitates retraining when encountering novel environments. To address this challenge, we propose a Generalized Prompt-driven Zero-shot Domain Adaptive Segmentation framework that enables flexible and robust cross-domain segmentation by learning to map target domain features into the source domain space. This allows inference to be performed through a unified and well-optimized source model, without requiring target data-based or prompt-based retraining when encountering novel conditions. Our framework comprises two key modules: a Low-level Feature Rectification (LLFR) module that aligns visual styles using a historical source-style memory bank, and a High-level Semantic Modulation (HLSM) module that applies language-guided affine transformations to align high-level semantics. Together, these modules enable adaptive multi-level feature adaptation that maps target inputs into the source domain space, thus allowing the model to handle unseen domains effectively at test time. Extensive experiments on multiple zero-shot domain adaptation benchmarks are conducted, and the results show that our method consistently outperforms previous approaches.
{"title":"Generalized prompt-driven zero-shot domain adaptive segmentation with feature rectification and semantic modulation","authors":"Jinyi Li , Longyu Yang , Donghyun Kim , Kuniaki Saito , Kate Saenko , Stan Sclaroff , Xiaofeng Zhu , Ping Hu","doi":"10.1016/j.cviu.2025.104615","DOIUrl":"10.1016/j.cviu.2025.104615","url":null,"abstract":"<div><div>Recent prompt-driven zero-shot adaptation methods offer a promising way to handle domain shifts in semantic segmentation by learning with features simulated from natural language prompts. However, these methods typically depend on a fixed set of predefined domain descriptions, which limits their capacity to generalize to previously undefined domains and often necessitates retraining when encountering novel environments. To address this challenge, we propose a Generalized Prompt-driven Zero-shot Domain Adaptive Segmentation framework that enables flexible and robust cross-domain segmentation by learning to map target domain features into the source domain space. This allows inference to be performed through a unified and well-optimized source model, without requiring target data-based or prompt-based retraining when encountering novel conditions. Our framework comprises two key modules: a Low-level Feature Rectification (LLFR) module that aligns visual styles using a historical source-style memory bank, and a High-level Semantic Modulation (HLSM) module that applies language-guided affine transformations to align high-level semantics. Together, these modules enable adaptive multi-level feature adaptation that maps target inputs into the source domain space, thus allowing the model to handle unseen domains effectively at test time. Extensive experiments on multiple zero-shot domain adaptation benchmarks are conducted, and the results show that our method consistently outperforms previous approaches.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104615"},"PeriodicalIF":3.5,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
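The HLSM module in the abstract above applies "language-guided affine transformations" to align high-level semantics. A minimal sketch of that general idea (a FiLM-style per-channel scale and shift predicted from a text embedding; the projection matrices and shapes here are hypothetical, not the paper's):

```python
import numpy as np

def semantic_modulation(feat, text_embed, W_gamma, W_beta):
    """Illustrative language-guided affine modulation: a text embedding
    predicts a per-channel scale (gamma) and shift (beta) applied to a
    (C, H, W) feature map. W_gamma and W_beta are assumed learned
    projections from text-embedding dim to channel dim."""
    gamma = W_gamma @ text_embed  # (C,) per-channel scale
    beta = W_beta @ text_embed    # (C,) per-channel shift
    # Affine transform broadcast over the spatial dimensions
    return feat * gamma[:, None, None] + beta[:, None, None]

rng = np.random.default_rng(1)
feat = rng.standard_normal((16, 8, 8))   # target-domain feature map
text = rng.standard_normal(32)           # prompt/text embedding
Wg = rng.standard_normal((16, 32))
Wb = rng.standard_normal((16, 32))
out = semantic_modulation(feat, text, Wg, Wb)
```

Because the transform is affine in the features, it can re-map target-domain statistics toward the source space without retraining the segmentation backbone, which is the property the framework exploits at test time.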