Fengjie Xu, Chang-Hua Zhang, Zhongshu Chen, Zhekai Du, Lei Han, Lin Zuo
Underwater image enhancement is a challenging task due to the degradation of image quality under complicated underwater lighting conditions and scenes. In recent years, most methods have improved the visual quality of underwater images using deep Convolutional Neural Networks and Generative Adversarial Networks. However, the majority of existing methods do not consider that the R, G, and B channels of an underwater image attenuate to different degrees, which leads to sub-optimal performance. Based on this observation, we propose a Channel-wise Multi-scale Residual Dense Network, called CMRD-Net, which learns the weights of the different color channels instead of treating all channels equally. More specifically, a Channel-wise Multi-scale Fusion Residual Attention Block (CMFRAB) is incorporated into CMRD-Net to obtain stronger feature extraction and representation. We evaluate the effectiveness of our model by comparing it with recent state-of-the-art methods. Extensive experimental results show that our method achieves satisfactory performance on a popular public dataset.
{"title":"CMRD-Net: An Improved Method for Underwater Image Enhancement","authors":"Fengjie Xu, Chang-Hua Zhang, Zhongshu Chen, Zhekai Du, Lei Han, Lin Zuo","doi":"10.1145/3469877.3493590","DOIUrl":"https://doi.org/10.1145/3469877.3493590","url":null,"abstract":"Underwater image enhancement is a challenging task due to the degradation of image quality in underwater complicated lighting conditions and scenes. In recent years, most methods improve the visual quality of underwater images by using deep Convolutional Neural Networks and Generative Adversarial Networks. However, the majority of existing methods do not consider that the attenuation degrees of R, G, B channels of the underwater image are different, leading to a sub-optimal performance. Based on this observation, we propose a Channel-wise Multi-scale Residual Dense Network called CMRD-Net, which learns the weights of different color channels instead of treating all the channels equally. More specifically, the Channel-wise Multi-scale Fusion Residual Attention Block (CMFRAB) is involved in the CMRD-Net to obtain a better ability of feature extraction and representation. Notably, we evaluate the effectiveness of our model by comparing it with recent state-of-the-art methods. Extensive experimental results show that our method can achieve a satisfactory performance on a popular public dataset.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114583006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Currently, most adversarial attacks focus on adding perturbations to 2D images. Such attacks, however, cannot easily be mounted against a real-world AI system, since the system will not expose an interface to attackers. It is therefore more practical to add perturbations to the surfaces of real-world 3D objects, i.e., to perform 3D adversarial attacks. The key challenges for 3D adversarial attacks are handling viewpoint changes effectively and maintaining strong transferability across different state-of-the-art networks. In this paper, we focus on improving the robustness and transferability of 3D adversarial examples generated by perturbing the surface textures of 3D objects. To this end, we propose an effective method, named the Momentum Gradient-Filter Sign Method (M-GFSM), to generate 3D adversarial examples. Specifically, momentum is introduced into the generation of 3D adversarial examples, which yields multi-view robustness and high attack efficiency by updating the perturbation and stabilizing the update directions. In addition, a filtering operation is applied to improve the transferability of 3D adversarial examples by selectively filtering gradient images and completing the gradients of pixels neglected due to downsampling in the rendering stage. Experimental results show the effectiveness and good transferability of the proposed method. We also show that the 3D adversarial examples generated by our method remain robust under different illuminations.
{"title":"Towards Transferable 3D Adversarial Attack","authors":"Qiming Lu, Shikui Wei, Haoyu Chu, Yao Zhao","doi":"10.1145/3469877.3493596","DOIUrl":"https://doi.org/10.1145/3469877.3493596","url":null,"abstract":"Currently, most of the adversarial attacks focused on perturbation adding on 2D images. In this way, however, the adversarial attacks cannot easily be involved in a real-world AI system, since it is impossible for the AI system to open an interface to attackers. Therefore, it is more practical to add perturbation on real-world 3D objects’ surface, i.e., 3D adversarial attacks. The key challenges for 3D adversarial attacks are how to effectively deal with viewpoint changing and keep strong transferability across different state-of-the-art networks. In this paper, we mainly focus on improving the robustness and transferability of 3D adversarial examples generated by perturbing the surface textures of 3D objects. Towards this end, we propose an effective method, named Momentum Gradient-Filter Sign Method (M-GFSM), to generate 3D adversarial examples. Specially, the momentum is introduced into the procedure of 3D adversarial examples generation, which results in multiview robustness of 3D adversarial examples and high efficiency of attacking by updating the perturbation and stabilizing the update directions. In addition, filter operation is involved to improve the transferability of 3D adversarial examples by filtering gradient images selectively and completing the gradients of neglected pixels caused by downsampling in the rendering stage. Experimental results show the effectiveness and good transferability of the proposed method. Besides, we show that the 3D adversarial examples generated by our method still be robust under different illuminations.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117037993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beibei Zhang, Fan Yu, Yaqun Fang, Tongwei Ren, Gangshan Wu
The Deep Video Understanding Challenge (DVU) focuses on comprehending long videos that involve many entities. Its main goal is to build a knowledge graph of relationships and interactions between entities in order to answer relevant questions. In this paper, we improve the joint learning method we previously proposed in several aspects, including few-shot learning, optical flow features, entity recognition, and video description matching. We verify the effectiveness of these measures through experiments.
{"title":"Hybrid Improvements in Multimodal Analysis for Deep Video Understanding","authors":"Beibei Zhang, Fan Yu, Yaqun Fang, Tongwei Ren, Gangshan Wu","doi":"10.1145/3469877.3493599","DOIUrl":"https://doi.org/10.1145/3469877.3493599","url":null,"abstract":"The Deep Video Understanding Challenge (DVU) is a task that focuses on comprehending long duration videos which involve many entities. Its main goal is to build relationship and interaction knowledge graph between entities to answer relevant questions. In this paper, we improved the joint learning method which we previously proposed in many aspects, including few shot learning, optical flow feature, entity recognition, and video description matching. We verified the effectiveness of these measures through experiments.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124966874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we explore a tokenized representation of musical scores using the Transformer model to automatically generate musical scores. Thus far, sequence models have yielded fruitful results with note-level (MIDI-equivalent) symbolic representations of music. Although note-level representations carry sufficient information to reproduce music aurally, they do not carry adequate information to represent music visually as notation. Musical scores contain various musical symbols (e.g., clef, key signature, and notes) and attributes (e.g., stem direction, beam, and tie) that enable us to visually comprehend musical content. However, automated estimation of these elements has yet to be comprehensively addressed. In this paper, we first design score token representations corresponding to the various musical elements. We then train the Transformer model to transcribe note-level representations into appropriate music notation. Evaluations on popular piano scores show that the proposed method significantly outperforms existing methods on all 12 musical aspects investigated. We also explore an effective notation-level token representation to work with the model and find that our proposed representation produces the steadiest results.
{"title":"Score Transformer: Generating Musical Score from Note-level Representation","authors":"Masahiro Suzuki","doi":"10.1145/3469877.3490612","DOIUrl":"https://doi.org/10.1145/3469877.3490612","url":null,"abstract":"In this paper, we explore the tokenized representation of musical scores using the Transformer model to automatically generate musical scores. Thus far, sequence models have yielded fruitful results with note-level (MIDI-equivalent) symbolic representations of music. Although the note-level representations can comprise sufficient information to reproduce music aurally, they cannot contain adequate information to represent music visually in terms of notation. Musical scores contain various musical symbols (e.g., clef, key signature, and notes) and attributes (e.g., stem direction, beam, and tie) that enable us to visually comprehend musical content. However, automated estimation of these elements has yet to be comprehensively addressed. In this paper, we first design score token representation corresponding to the various musical elements. We then train the Transformer model to transcribe note-level representation into appropriate music notation. Evaluations of popular piano scores show that the proposed method significantly outperforms existing methods on all 12 musical aspects that were investigated. We also explore an effective notation-level token representation to work with the model and determine that our proposed representation produces the steadiest results.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114563756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social events are common activities in which people interact with each other. During an event, the organizer often hires photographers to take images, which provide rich information about the participants' behaviour. In this work, we propose a method to discover the social graphs among event participants from event images for social network analytics. By studying over 94 events with 32,330 event images, we show that social graphs can be effectively extracted solely from event images. We find that the discovered social graphs exhibit properties similar to those of online social graphs; for instance, the degree distribution follows a power law. The usefulness of the proposed method for social graph discovery from event images is demonstrated through two applications: important participant detection and community detection. To the best of our knowledge, this is the first work to show the feasibility of discovering social graphs from event images alone. As a result, social network analytics such as recommendation become possible even without access to an online social graph.
{"title":"Discovering Social Connections using Event Images","authors":"Ming Cheung, Weiwei Sun, Jiantao Zhou","doi":"10.1145/3469877.3493699","DOIUrl":"https://doi.org/10.1145/3469877.3493699","url":null,"abstract":"Social events are very common activities, where people can interact with each other. During an event, the organizer often hires photographers to take images, which provide rich information about the participants’ behaviour. In this work, we propose a method to discover the social graphs among event participants from the event images for social network analytics. By studying over 94 events with 32,330 event images, it is proven that the social graphs can be effectively extracted solely from event images. It is found that the discovered social graphs follow similar properties of online social graphs; for instance, the degree distribution obeys power law distribution. The usefulness of the proposed method for social graph discovery from event images is demonstrated through two applications: important participants detection and community detection. To the best of our knowledge, it is the first work to show the feasibility of discovering social graphs by utilizing event images only. As a result, social network analytics such as recommendations become possible, even without access to the online social graph.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133102358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiapeng Tang, Yi Fang, Yu Dong, Rong Xie, Xiao Gu, Guangtao Zhai, Li Song
Blind quality assessment of images and videos captured in the wild, known as in-the-wild I/VQA, has attracted growing interest. Prior deep-learning-based approaches have achieved considerable progress in I/VQA but suffer from two intrinsic issues. First, most existing methods fine-tune models pre-trained for image classification owing to the absence of large-scale I/VQA datasets; the task misalignment between I/VQA and image classification, however, degrades generalization performance. Second, existing VQA methods directly apply temporal pooling to the predicted frame-wise scores, resulting in ambiguous inter-frame relation modeling. In this work, we propose a two-stage architecture to separately predict image and video quality in the wild. In the first stage, we resort to supervised contrastive learning to derive quality-aware representations that facilitate the prediction of image quality. Specifically, we propose a novel quality-aware contrastive loss that pulls together samples of similar quality and pushes apart samples of different quality in the embedding space. In the second stage, we develop a Relation-Guided Temporal Attention (RTA) module for video quality prediction, which captures global inter-frame dependencies in the embedding space to learn frame-wise attention weights for frame quality aggregation. Extensive experiments demonstrate that our approach performs favorably against state-of-the-art methods on both authentically distorted image benchmarks and video benchmarks.
{"title":"Blindly Predict Image and Video Quality in the Wild","authors":"Jiapeng Tang, Yi Fang, Yu Dong, Rong Xie, Xiao Gu, Guangtao Zhai, Li Song","doi":"10.1145/3469877.3490588","DOIUrl":"https://doi.org/10.1145/3469877.3490588","url":null,"abstract":"Emerging interests have been brought to blind quality assessment for images/videos captured in the wild, known as in-the-wild I/VQA. Prior deep learning based approaches have achieved considerable progress in I/VQA, but are intrinsically troubled with two issues. Firstly, most existing methods fine-tune the image-classification-oriented pre-trained models for the absence of large-scale I/VQA datasets. However, the task misalignment between I/VQA and image classification leads to degraded generalization performance. Secondly, existing VQA methods directly conduct temporal pooling on the predicted frame-wise scores, resulting in ambiguous inter-frame relation modeling. In this work, we propose a two-stage architecture to separately predict image and video quality in the wild. In the first stage, we resort to supervised contrastive learning to derive quality-aware representations that facilitate the prediction of image quality. Specifically, we propose a novel quality-aware contrastive loss to pull together samples of similar quality and push away quality-different ones in embedding space. In the second stage, we develop a Relation-Guided Temporal Attention (RTA) module for video quality prediction, which captures global inter-frame dependencies in embedding space to learn frame-wise attention weights for frame quality aggregation. Extensive experiments demonstrate that our approach performs favorably against state-of-the-art methods on both authentically distorted image benchmarks and video benchmarks.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133850026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jun Zhang, X. Zhong, Jingling Yuan, Shilei Zhao, Rongbo Zhang, Duxiu Feng, Luo Zhong
In real traffic scenarios, the resolution of captured vehicle images varies considerably with the distance to the vehicle and with the direction and height of the camera. When a resolution difference exists between the probe and the gallery vehicle, a resolution mismatch occurs, which seriously degrades the performance of vehicle re-identification (Re-ID). This problem is known as multi-resolution vehicle Re-ID. An effective strategy is to use image super-resolution to bridge the resolution gap. However, existing methods apply super-resolution to global images rather than to local representations of each image, so much of the generated information is noise arising from the background and illumination variations. In this work, we therefore propose local-enhanced multi-resolution representation learning (LMRL) to address these problems by jointly training a local-enhanced super-resolution (LSR) module and a local-guided contrastive learning (LCL) module. Specifically, we use a parsing network to parse a vehicle into four parts so as to extract a local-enhanced vehicle representation. The LSR module, which consists of two auto-encoders that share parameters, then transforms low-resolution images into high-resolution ones in both the global and local branches. The LCL module learns a discriminative vehicle representation by contrasting local representations of the high-resolution reconstructed image with those of the ground truth. We evaluate our approach on two public datasets that contain vehicle images at a wide range of resolutions, and it shows significant superiority over existing solutions.
{"title":"Local-enhanced Multi-resolution Representation Learning for Vehicle Re-identification","authors":"Jun Zhang, X. Zhong, Jingling Yuan, Shilei Zhao, Rongbo Zhang, Duxiu Feng, Luo Zhong","doi":"10.1145/3469877.3497690","DOIUrl":"https://doi.org/10.1145/3469877.3497690","url":null,"abstract":"In real traffic scenarios, the changes of vehicle resolution that the camera captures tend to be relatively obvious considering the distances to the vehicle, different directions, and height of the camera. When the resolution difference exists between the probe and the gallery vehicle, the resolution mismatch will occur, which will seriously influence the performance of the vehicle re-identification (Re-ID). This problem is also known as multi-resolution vehicle Re-ID. An effective strategy is equivalent to utilize image super-resolution to handle the resolution gap. However, existing methods conduct super-resolution on global images instead of local representation of each image, leading to much more noisy information generated from the background and illumination variations. In our work, a local-enhanced multi-resolution representation learning (LMRL) is therefore proposed to address these problems by combining the training of local-enhanced super-resolution (LSR) module and local-guided contrastive learning (LCL) module. Specifically, we use a parsing network to parse a vehicle into four different parts to extract local-enhanced vehicle representation. And then, the LSR module, which consists of two auto-encoders that share parameters, transforms low-resolution images into high-resolution in both global and local branches. LCL module can learn discriminative vehicle representation by contrasting local representation between the high-resolution reconstructed image and the ground truth. We evaluate our approach on two public datasets that contain vehicle images at a wide range of resolutions, in which our approach shows significant superiority to the existing solution.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115995625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dongliang Shao, Yunhui Shi, Jin Wang, N. Ling, Baocai Yin
Removing undesirable reflections from a single image captured through a glass surface benefits a broad range of image processing and computer vision tasks, but it is an ill-posed and challenging problem. Traditional single-image reflection removal (SIRR) methods are often ineffective because of the limited descriptive ability of handcrafted priors. State-of-the-art learning-based methods, designed as unexplainable black boxes, often suffer from instability. In this paper, we present an explainable approach to SIRR named the model-guided unfolding network (MoG-SIRR), which is unfolded from our proposed reflection removal model with a non-local autoregressive prior and a dereflection prior. To complement the transmission layer and the reflection layer in a single image, we construct a two-stream deep learning framework that integrates reflection removal and non-local regularization into trainable modules. Extensive experiments on public benchmark datasets demonstrate that our method achieves superior performance for single image reflection removal.
{"title":"A Model-Guided Unfolding Network for Single Image Reflection Removal","authors":"Dongliang Shao, Yunhui Shi, Jin Wang, N. Ling, Baocai Yin","doi":"10.1145/3469877.3490607","DOIUrl":"https://doi.org/10.1145/3469877.3490607","url":null,"abstract":"Removing undesirable reflections from a single image captured through a glass surface is of broad application to various image processing and computer vision tasks, but it is an ill-posed and challenging problem. Existing traditional single image reflection removal(SIRR) methods are often less efficient to remove reflection due to the limited description ability of handcrafted priors. State-of-the-art learning based methods often cause instability problems because they are designed as unexplainable black boxes. In this paper, we present an explainable approach for SIRR named model-guided unfolding network(MoG-SIRR), which is unfolded from our proposed reflection removal model with non-local autoregressive prior and dereflection prior. In order to complement the transmission layer and the reflection layer in a single image, we construct a deep learning framework with two streams by integrating reflection removal and non-local regularization into trainable modules. Extensive experiments on public benchmark datasets demonstrate that our method achieves superior performance for single image reflection removal.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"4657 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129369162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alberto Baldrati, M. Bertini, Tiberio Uricchio, A. del Bimbo
Building on recent advances in multimodal zero-shot representation learning, in this paper we explore the use of features obtained from the recent CLIP model to perform conditioned image retrieval. Starting from a reference image and an additive textual description of what the user wants with respect to the reference image, we learn a Combiner network that understands the image content, integrates the textual description, and provides a combined feature used to perform the conditioned image retrieval. Starting from bare CLIP features and a simple baseline, we show that a carefully crafted Combiner network based on such multimodal features is extremely effective and outperforms more complex state-of-the-art approaches on the popular FashionIQ dataset.
{"title":"Conditioned Image Retrieval for Fashion using Contrastive Learning and CLIP-based Features","authors":"Alberto Baldrati, M. Bertini, Tiberio Uricchio, A. del Bimbo","doi":"10.1145/3469877.3493593","DOIUrl":"https://doi.org/10.1145/3469877.3493593","url":null,"abstract":"Building on the recent advances in multimodal zero-shot representation learning, in this paper we explore the use of features obtained from the recent CLIP model to perform conditioned image retrieval. Starting from a reference image and an additive textual description of what the user wants with respect to the reference image, we learn a Combiner network that is able to understand the image content, integrate the textual description and provide combined feature used to perform the conditioned image retrieval. Starting from the bare CLIP features and a simple baseline, we show that a carefully crafted Combiner network, based on such multimodal features, is extremely effective and outperforms more complex state of the art approaches on the popular FashionIQ dataset.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130982013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yang Shi, Xiushan Nie, Quan Zhou, Li Zou, Yilong Yin
Recent studies have verified that learning compact hash codes can facilitate big-data retrieval, and learning a deep hash function can greatly improve retrieval performance. However, existing deep supervised hashing algorithms treat all samples in the same way, which leads to insufficient learning from difficult samples; the similarity relation therefore cannot be learned accurately, making it difficult to achieve satisfactory performance. In light of this, we propose a deep supervised hashing model called deep adaptive attention triple hashing (DAATH), which weights the similarity prediction scores of positive and negative samples in the form of triples, thus giving different degrees of attention to different samples. Compared with the traditional triple loss, it places greater emphasis on difficult triples and dramatically reduces redundant computation. Extensive experiments show that DAATH consistently outperforms state-of-the-art methods, confirming its effectiveness.
{"title":"Deep Adaptive Attention Triple Hashing","authors":"Yang Shi, Xiushan Nie, Quan Zhou, Li Zou, Yilong Yin","doi":"10.1145/3469877.3495646","DOIUrl":"https://doi.org/10.1145/3469877.3495646","url":null,"abstract":"Recent studies have verified that learning compact hash codes can facilitate big data retrieval processing. In particular, learning the deep hash function can greatly improve the retrieval performance. However, the existing deep supervised hashing algorithm treats all the samples in the same way, which leads to insufficient learning of difficult samples. Therefore, we cannot obtain the accurate learning of the similarity relation, making it difficult to achieve satisfactory performance. In light of this, this work proposes a deep supervised hashing model, called deep adaptive attention triple hashing (DAATH), which weights the similarity prediction scores of positive and negative samples in the form of triples, thus giving different degrees of attention to different samples. Compared with the traditional triple loss, it places a greater emphasis on the difficult triple, dramatically reducing the redundant calculation. Extensive experiments have been conducted to show that DAAH consistently outperforms the state-of-the-arts, confirmed its the effectiveness.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130413777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}