LCMA-Net: A light cross-modal attention network for streamer re-identification in live video

IF 4.3 | Region 3 (Computer Science) | JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE) | Computer Vision and Image Understanding | Pub Date: 2024-09-19 | DOI: 10.1016/j.cviu.2024.104183

Jiacheng Yao, Jing Zhang, Hui Zhang, Li Zhuo

Jiacheng Yao, Jing Zhang, Hui Zhang, Li Zhuo. LCMA-Net: A light cross-modal attention network for streamer re-identification in live video. Computer Vision and Image Understanding, vol. 249, Article 104183, 2024. https://www.sciencedirect.com/science/article/pii/S1077314224002649

Citations: 0

Abstract


With the rapid expansion of the we-media industry, streamers have increasingly incorporated inappropriate content into live videos to attract traffic and pursue interests. Blacklisted streamers often forge their identities or switch platforms to continue streaming, causing significant harm to the online environment. Consequently, streamer re-identification (re-ID) has become of paramount importance. Streamer biometrics in live videos exhibit multimodal characteristics, including voiceprints, faces, and spatiotemporal information, which complement each other. Therefore, we propose a light cross-modal attention network (LCMA-Net) for streamer re-ID in live videos. First, the voiceprint, face, and spatiotemporal features of the streamer are extracted by RawNet-SA, Π-Net, and STDA-ResNeXt3D, respectively. We then design a light cross-modal pooling attention (LCMPA) module, which, combined with a multilayer perceptron (MLP), aligns and concatenates different modality features into multimodal features within the LCMA-Net. Finally, the streamer is re-identified by measuring the similarity between these multimodal features. Five experiments were conducted on the StreamerReID dataset, and the results demonstrated that the proposed method achieved competitive performance. The dataset and code are available at https://github.com/BJUT-AIVBD/LCMA-Net.
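The pipeline described in the abstract (per-modality features, alignment, concatenation, similarity matching) can be illustrated with a rough NumPy sketch. This is not the authors' LCMPA module, whose internals the abstract does not specify. The one-layer projection, the mean-pooled softmax weighting, the feature widths, and every function name below are assumptions made purely for illustration, and cosine similarity is assumed as the similarity measure.

```python
import numpy as np


def mlp_project(x, w, b):
    """One linear layer with ReLU; a hypothetical stand-in for the MLP
    that aligns a modality feature to a shared width."""
    return np.maximum(x @ w + b, 0.0)


def fuse(features, weights):
    """Align each modality, reweight it, and concatenate.

    features: dict mapping modality name -> 1-D feature vector
    weights:  dict mapping modality name -> (W, b) for mlp_project
    The pooling-based weighting is an assumed placeholder, not LCMPA.
    """
    names = sorted(features)  # fixed order so vectors are comparable
    aligned = np.stack([mlp_project(features[n], *weights[n]) for n in names])
    scores = aligned.mean(axis=1)            # average-pool each modality
    attn = np.exp(scores - scores.max())     # numerically stable softmax
    attn /= attn.sum()
    return (aligned * attn[:, None]).reshape(-1)  # concatenate modalities


def cosine_similarity(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def rank_gallery(query, gallery):
    """Return gallery identities sorted from most to least similar."""
    sims = {k: cosine_similarity(query, v) for k, v in gallery.items()}
    return sorted(sims, key=sims.get, reverse=True)
```

In such a setup, the fused vector of a query clip would be compared against the fused vectors of known (for example, blacklisted) streamers, and the top-ranked identity taken as the re-ID result. The released repository should be consulted for the actual architecture.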


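Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days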

About the journal: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:

• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems

Latest articles from this journal

• Editorial Board
• Multi-Scale Adaptive Skeleton Transformer for action recognition
• Open-set domain adaptation with visual-language foundation models
• Leveraging vision-language prompts for real-world image restoration and enhancement
• RetSeg3D: Retention-based 3D semantic segmentation for autonomous driving