
Latest Publications in IEEE Transactions on Multimedia

LMEye: An Interactive Perception Network for Large Language Models
IF 7.3 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-15 | DOI: 10.1109/tmm.2024.3428317
Yunxin Li, Baotian Hu, Xinyu Chen, Lin Ma, Yong Xu, Min Zhang
{"title":"LMEye: An Interactive Perception Network for Large Language Models","authors":"Yunxin Li, Baotian Hu, Xinyu Chen, Lin Ma, Yong Xu, Min Zhang","doi":"10.1109/tmm.2024.3428317","DOIUrl":"https://doi.org/10.1109/tmm.2024.3428317","url":null,"abstract":"","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"74 1","pages":""},"PeriodicalIF":7.3,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141720781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
AnimeDiff: Customized Image Generation of Anime Characters using Diffusion Model
IF 7.3 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-08 | DOI: 10.1109/tmm.2024.3415357
Yuqi Jiang, Qiankun Liu, Dongdong Chen, Lu Yuan, Ying Fu
{"title":"AnimeDiff: Customized Image Generation of Anime Characters using Diffusion Model","authors":"Yuqi Jiang, Qiankun Liu, Dongdong Chen, Lu Yuan, Ying Fu","doi":"10.1109/tmm.2024.3415357","DOIUrl":"https://doi.org/10.1109/tmm.2024.3415357","url":null,"abstract":"","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"39 1","pages":""},"PeriodicalIF":7.3,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141568317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Toward Efficient Video Compression Artifact Detection and Removal: A Benchmark Dataset
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-03 | DOI: 10.1109/TMM.2024.3414549
Liqun Lin;Mingxing Wang;Jing Yang;Keke Zhang;Tiesong Zhao
Video compression introduces compression artifacts, among which Perceivable Encoding Artifacts (PEAs) degrade user perception. Most existing state-of-the-art Video Compression Artifact Removal (VCAR) methods process all artifacts indiscriminately, leading to over-enhancement in non-PEA regions. Accurate detection and localization of PEAs is therefore crucial. In this paper, we propose the largest-ever Fine-grained PEA database (FPEA). First, we employ the popular video codecs VVC and AVS3, together with their common test settings, to generate four types of spatial PEAs (blurring, blocking, ringing and color bleeding) and two types of temporal PEAs (flickering and floating). Second, we design a labeling platform and recruit sufficient subjects to manually locate all of the above types of PEAs. Third, we propose a voting mechanism combined with feature matching to synthesize all subjective labels into final PEA labels with fine-grained locations. We also provide Mean Opinion Score (MOS) values for all compressed video sequences. Experimental results show the effectiveness of the FPEA database for both VCAR and compressed Video Quality Assessment (VQA). We envision that the FPEA database will benefit the future development of VCAR, VQA and perception-aware video encoders. The FPEA database has been made publicly available.
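The label-synthesis step above can be pictured with a small sketch. The snippet below is only an illustrative assumption of how majority voting over per-subject artifact masks might work; the mask shapes, the `min_votes` threshold, and the function name are invented for illustration and are not the FPEA pipeline.

```python
# Illustrative majority-voting sketch (an assumption, not the FPEA pipeline):
# each subject provides a binary mask marking where they perceived a PEA,
# and a pixel enters the final label only if enough subjects agree.
import numpy as np

def synthesize_pea_labels(subject_masks: np.ndarray, min_votes: int) -> np.ndarray:
    """subject_masks: (num_subjects, H, W) binary maps from the labeling platform.
    Returns an (H, W) binary map of consensus PEA locations."""
    votes = subject_masks.sum(axis=0)             # per-pixel vote count
    return (votes >= min_votes).astype(np.uint8)

rng = np.random.default_rng(0)
masks = (rng.random((5, 4, 4)) > 0.5).astype(np.uint8)   # 5 subjects, a 4x4 frame
print(synthesize_pea_labels(masks, min_votes=3))         # simple majority of 5
```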
Citations: 0
Human-Centric Behavior Description in Videos: New Benchmark and Model
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-07-02 | DOI: 10.1109/TMM.2024.3414263
Lingru Zhou;Yiqi Gao;Manqing Zhang;Peng Wu;Peng Wang;Yanning Zhang
In the domain of video surveillance, describing the behavior of each individual within a video is becoming increasingly essential, especially in complex scenarios with multiple individuals present. Describing each individual's behavior provides more detailed situational analysis, enabling accurate assessment of and response to potential risks and helping to ensure the safety and harmony of public places. Currently, video-level captioning datasets cannot provide fine-grained descriptions of each individual's specific behavior, and descriptions at the video level alone fail to provide an in-depth interpretation of individual behaviors, making it challenging to accurately determine the specific identity of each individual. To address this challenge, we construct a human-centric video surveillance captioning dataset that provides detailed descriptions of the dynamic behaviors of 7,820 individuals. Specifically, we label several aspects of each person, such as location, clothing, and interactions with other elements in the scene; these people are distributed across 1,012 videos. Based on this dataset, we can link individuals to their respective behaviors, allowing for further analysis of each person's behavior in surveillance videos. Beyond the dataset, we propose a novel video captioning approach that describes individual behavior in detail at the person level, achieving state-of-the-art results.
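For concreteness, a person-level annotation record implied by this description might look like the sketch below; the field names and JSON layout are assumptions made for illustration, not the released dataset's schema.

```python
# A hypothetical person-level annotation record (field names are assumptions,
# not the dataset's actual schema), showing how a behavior caption could be
# tied to one identity in one surveillance video.
from dataclasses import dataclass, asdict
from typing import List
import json

@dataclass
class PersonAnnotation:
    video_id: str
    person_id: int
    bbox: List[float]        # person location in the frame: [x, y, w, h]
    clothing: str            # clothing description
    interactions: List[str]  # interactions with other elements in the scene
    behavior_caption: str    # fine-grained, person-level behavior description

record = PersonAnnotation(
    video_id="surveillance_0001",
    person_id=3,
    bbox=[120.0, 80.0, 60.0, 170.0],
    clothing="red jacket, dark trousers",
    interactions=["talks to person 4", "carries a suitcase"],
    behavior_caption="walks toward the exit while talking to another person",
)
print(json.dumps(asdict(record), indent=2))
```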
Citations: 0
Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames
IF 7.3 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-28 | DOI: 10.1109/tmm.2024.3416669
Ning Han, Xun Yang, Ee-Peng Lim, Hao Chen, Qianru Sun
{"title":"Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames","authors":"Ning Han, Xun Yang, Ee-Peng Lim, Hao Chen, Qianru Sun","doi":"10.1109/tmm.2024.3416669","DOIUrl":"https://doi.org/10.1109/tmm.2024.3416669","url":null,"abstract":"","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"78 1","pages":""},"PeriodicalIF":7.3,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141504209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-26 | DOI: 10.1109/TMM.2024.3405724
Peiguang Jing;Xuan Zhao;Fugui Fan;Fan Yang;Yun Li;Yuting Su
Micro-videos, as an increasingly popular form of user-generated content (UGC), naturally include diverse multimodal cues. However, in pursuit of consistent representations, existing methods neglect to simultaneously explore modality discrepancy and preserve modality diversity. In this paper, we propose a multimodal progressive modulation network (MPMNet) for micro-video multi-label classification, which enhances the indicative ability of each modality by gradually regulating various modality biases. In MPMNet, we first leverage a unimodal-centered parallel aggregation strategy to obtain preliminary comprehensive representations. We then integrate a feature-domain disentangled modulation process and a category-domain adaptive modulation process into a unified framework to jointly refine modality-oriented representations. In the former process, we constrain inter-modal dependencies in a latent space to obtain modality-oriented sample representations and introduce a disentangled paradigm to further maintain modality diversity. In the latter process, we construct global-context-aware graph convolutional networks to acquire modality-oriented category representations and develop two instance-level parameter generators to further regulate unimodal semantic biases. Extensive experiments on two micro-video multi-label datasets show that our proposed approach outperforms state-of-the-art methods.
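As a rough intuition for the unimodal-centered parallel aggregation step, the PyTorch sketch below keeps each modality's own representation as the "center" of its branch while gating in context from the other modalities. The gating form, dimensions, and module name are assumptions, not the MPMNet implementation.

```python
# Unimodal-centered parallel aggregation, sketched under assumptions: each
# modality feature stays the center of its own branch and absorbs gated
# context from the other modalities, preserving modality diversity.
from typing import List
import torch
import torch.nn as nn

class UnimodalCenteredAggregation(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # gate conditioned on (center, context)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: List[torch.Tensor]) -> List[torch.Tensor]:
        """feats: one (batch, dim) tensor per modality, e.g. visual/acoustic/textual."""
        outputs = []
        for i, center in enumerate(feats):
            context = torch.stack(
                [f for j, f in enumerate(feats) if j != i]).mean(dim=0)
            g = torch.sigmoid(self.gate(torch.cat([center, context], dim=-1)))
            outputs.append(center + g * self.proj(context))  # center kept, context injected
        return outputs

visual, acoustic, textual = (torch.randn(4, 256) for _ in range(3))
agg = UnimodalCenteredAggregation(256)
v, a, t = agg([visual, acoustic, textual])
print(v.shape, a.shape, t.shape)  # each remains (4, 256)
```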
Citations: 0
Relation-Aware Weight Sharing in Decoupling Feature Learning Network for UAV RGB-Infrared Vehicle Re-Identification
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-21 | DOI: 10.1109/TMM.2024.3400675
Xingyue Liu;Jiahao Qi;Chen Chen;Kangcheng Bin;Ping Zhong
Owing to their capacity for full-time target search, cross-modality vehicle re-identification methods based on unmanned aerial vehicles (UAVs) are gaining attention in both video surveillance and public security. However, this promising and innovative line of research has not been studied sufficiently due to data inadequacy. Meanwhile, the cross-modality discrepancy and orientation discrepancy challenges further aggravate the difficulty of this task. To this end, we pioneer a cross-modality vehicle Re-ID benchmark named UAV Cross-Modality Vehicle Re-ID (UCM-VeID), containing 753 identities with 16,015 RGB and 13,913 infrared images. Moreover, to address the cross-modality discrepancy and orientation discrepancy challenges, we present a hybrid weights decoupling network (HWDNet) to learn shared, discriminative, orientation-invariant features. For the first challenge, we propose a hybrid weights siamese network with a well-designed weight restrainer and its corresponding objective function to learn both modality-specific and modality-shared information. For the second challenge, three effective decoupling structures with two pretext tasks are investigated to flexibly conduct the orientation-invariant feature separation task. Comprehensive experiments validate the effectiveness of the proposed method.
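The "hybrid weights" idea of mixing modality-specific and modality-shared parameters can be sketched as below; the two-branch layout, layer sizes, and class name are assumptions for illustration, not the HWDNet architecture or its weight restrainer.

```python
# A minimal hybrid-weights siamese sketch, assumed for illustration: shallow
# branches with modality-specific weights for RGB and infrared inputs, deeper
# layers with shared weights so both modalities map into one comparable space.
import torch
import torch.nn as nn

class HybridWeightsSiamese(nn.Module):
    def __init__(self, in_dim: int = 2048, hid: int = 512, out: int = 256):
        super().__init__()
        self.rgb_branch = nn.Linear(in_dim, hid)  # modality-specific weights
        self.ir_branch = nn.Linear(in_dim, hid)   # modality-specific weights
        self.shared = nn.Sequential(nn.ReLU(), nn.Linear(hid, out))  # modality-shared weights

    def forward(self, rgb_feat: torch.Tensor, ir_feat: torch.Tensor):
        return self.shared(self.rgb_branch(rgb_feat)), self.shared(self.ir_branch(ir_feat))

model = HybridWeightsSiamese()
rgb, ir = torch.randn(8, 2048), torch.randn(8, 2048)
z_rgb, z_ir = model(rgb, ir)
print(z_rgb.shape, z_ir.shape)  # both (8, 256), directly comparable for Re-ID matching
```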
Citations: 0
Alleviating Over-fitting in Hashing-based Fine-grained Image Retrieval: From Causal Feature Learning to Binary-injected Hash Learning
IF 7.3 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-21 | DOI: 10.1109/tmm.2024.3410136
Xinguang Xiang, Xinhao Ding, Lu Jin, Zechao Li, Jinhui Tang, Ramesh Jain
{"title":"Alleviating Over-fitting in Hashing-based Fine-grained Image Retrieval: From Causal Feature Learning to Binary-injected Hash Learning","authors":"Xinguang Xiang, Xinhao Ding, Lu Jin, Zechao Li, Jinhui Tang, Ramesh Jain","doi":"10.1109/tmm.2024.3410136","DOIUrl":"https://doi.org/10.1109/tmm.2024.3410136","url":null,"abstract":"","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"161 1","pages":""},"PeriodicalIF":7.3,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141504210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-21 | DOI: 10.1109/TMM.2024.3417694
Yahui Xu;Yi Bin;Jiwei Wei;Yang Yang;Guoqing Wang;Heng Tao Shen
We study the task of image retrieval with text feedback, where a reference image and a modification text are composed to retrieve the desired target image. To accomplish this goal, existing methods typically obtain multimodal representations through different feature encoders and then adopt different strategies to model the correlation between the composed inputs and the target image. However, the multimodal query brings additional challenges, as it requires not only a synergistic understanding of the semantics of the heterogeneous multimodal inputs but also the ability to accurately capture the underlying semantic correlation within each inputs-target triplet, i.e., reference image, modification text, and target image. In this paper, we tackle these issues with a novel Align and Retrieve (AlRet) framework. First, we employ a contrastive loss in the feature encoders to learn meaningful multimodal representations while placing the subsequent correlation modeling in a more harmonious space. We then learn the accurate correlation between the composed inputs and the target image in a novel composition-and-decomposition paradigm. Specifically, the composition network couples the reference image and modification text into a joint representation to learn the correlation between this joint representation and the target image. The decomposition network conversely decouples the target image into visual and text subspaces to exploit the underlying correlation between the target image and each query element. The composition-and-decomposition paradigm forms a closed loop in which the two branches can be optimized simultaneously to reinforce each other. Extensive comparison experiments on three real-world datasets confirm the effectiveness of the proposed method.
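To make the composition side of this paradigm concrete, the sketch below fuses a reference-image feature with a modification-text feature into a joint query and ranks a gallery by cosine similarity. The gated-residual fusion and all dimensions are assumptions, not the AlRet code.

```python
# Composition side only, sketched under assumptions: fuse the reference-image
# feature and the modification-text feature into one joint query, then rank
# gallery (target-image) features by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionModule(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([img, txt], dim=-1)
        g = torch.sigmoid(self.gate(joint))                       # how much of the image to keep
        return g * img + (1 - g) * torch.tanh(self.fuse(joint))   # composed query embedding

comp = CompositionModule()
ref_img, mod_txt = torch.randn(4, 512), torch.randn(4, 512)   # 4 composed queries
gallery = torch.randn(100, 512)                               # 100 candidate target images
query = F.normalize(comp(ref_img, mod_txt), dim=-1)
scores = query @ F.normalize(gallery, dim=-1).t()             # (4, 100) cosine similarities
print(scores.argmax(dim=-1))                                  # best-matching target per query
```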
Citations: 0
DeepSpoof: Deep Reinforcement Learning-Based Spoofing Attack in Cross-Technology Multimedia Communication
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-20 | DOI: 10.1109/TMM.2024.3414660
Demin Gao;Liyuan Ou;Ye Liu;Qing Yang;Honggang Wang
Cross-technology communication is essential for the Internet of Multimedia Things (IoMT) applications, enabling seamless integration of diverse media formats, optimized data transmission, and improved user experiences across devices and platforms. This integration drives innovative and efficient IoMT solutions in areas like smart homes, smart cities, and healthcare monitoring. However, this integration of diverse wireless standards within cross-technology multimedia communication increases the susceptibility of wireless networks to attacks. Current methods lack robust authentication mechanisms, leaving them vulnerable to spoofing attacks. To mitigate this concern, we introduce DeepSpoof, a spoofing system that utilizes deep learning to analyze historical wireless traffic and anticipate future patterns in the IoMT context. This innovative approach significantly boosts an attacker's impersonation capabilities and offers a higher degree of covertness compared to traditional spoofing methods. Rigorous evaluations, leveraging both simulated and real-world data, confirm that DeepSpoof significantly elevates the average success rate of attacks.
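At a very high level, the "analyze historical traffic and anticipate future patterns" component could resemble the sequence-model sketch below; the feature layout, the GRU-based predictor, and all sizes are assumptions, and this is not the DeepSpoof system.

```python
# Traffic-pattern prediction sketched under assumptions: a small GRU reads a
# window of historical wireless-traffic features and predicts the next step,
# which an attacker could use to shape and time spoofed frames.
import torch
import torch.nn as nn

class TrafficPredictor(nn.Module):
    def __init__(self, feat_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, feat_dim)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        """history: (batch, time, feat_dim) past observations, e.g. inter-frame
        gaps, packet sizes, channel occupancy."""
        out, _ = self.rnn(history)
        return self.head(out[:, -1])  # predicted feature vector for the next step

model = TrafficPredictor()
history = torch.randn(2, 32, 8)   # 2 traces, 32 past steps, 8 features each
print(model(history).shape)       # torch.Size([2, 8])
```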
Citations: 0