
Proceedings of the 2nd ACM International Conference on Multimedia in Asia: Latest Publications

Similar scene retrieval in soccer videos with weak annotations by multimodal use of bidirectional LSTM
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446280
T. Haruyama, Sho Takahashi, Takahiro Ogawa, M. Haseyama
This paper presents a novel method to retrieve similar scenes in soccer videos with weak annotations via multimodal use of bidirectional long short-term memory (BiLSTM). The significant increase in the number of different types of soccer videos with the development of technology brings valuable assets for effective coaching, but it also increases the workload of players and training staff. We tackle this problem with a nontraditional combination of pre-trained models for feature extraction and BiLSTMs for feature transformation. By using the pre-trained models, no training data is required for feature extraction. Effective feature transformation for similarity calculation is then performed by applying a BiLSTM trained with weak annotations. This transformation allows the context of soccer videos to be captured with high accuracy from less annotation work. In this paper, we achieve accurate retrieval of similar scenes by multimodal use of this BiLSTM-based transformer, which is trainable with less human effort. The effectiveness of our method was verified by comparative experiments with state-of-the-art methods on an actual soccer video dataset.
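To make the pipeline above concrete, here is a minimal PyTorch sketch of the general idea: pre-extracted frame features, a BiLSTM that transforms them, and cosine similarity between pooled clip embeddings. The feature dimension, pooling choice and random inputs are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMTransform(nn.Module):
    """Transforms a sequence of pre-extracted frame features into a clip embedding."""
    def __init__(self, feat_dim=2048, hidden_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):            # x: (batch, time, feat_dim)
        out, _ = self.bilstm(x)      # (batch, time, 2 * hidden_dim)
        return out.mean(dim=1)       # temporal average pooling -> (batch, 2 * hidden_dim)

def clip_similarity(model, query_feats, candidate_feats):
    """Cosine similarity between a query clip and a candidate clip."""
    q = model(query_feats)
    c = model(candidate_feats)
    return F.cosine_similarity(q, c, dim=-1)

# toy usage with random stand-ins for pre-extracted features
model = BiLSTMTransform()
query = torch.randn(1, 30, 2048)      # 30 frames of 2048-d features
candidate = torch.randn(1, 30, 2048)
print(clip_similarity(model, query, candidate))
```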
Citations: 4
Low-quality watermarked face inpainting with discriminative residual learning
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446261
Zheng He, Xueli Wei, Kangli Zeng, Zhen Han, Qin Zou, Zhongyuan Wang
Most existing image inpainting methods assume that the location of the repair area (watermark) is known, but this assumption does not always hold. In addition, the actual watermarked face is in a compressed, low-quality form, which is very disadvantageous to repair due to compression distortion effects. To address these issues, this paper proposes a low-quality watermarked face inpainting method based on joint residual learning with a cooperative discriminant network. We first employ residual-learning-based global inpainting and facial-feature-based local inpainting to render clean and clear faces under unknown watermark positions. Because the repair process may distort the genuine face, we further propose a discriminative constraint network to maintain the fidelity of repaired faces. Experimentally, the average PSNR of inpainted face images is increased by 4.16 dB, and the average SSIM is increased by 0.08. TPR is improved by 16.96% at an FPR of 10% in face verification.
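As a rough illustration of the residual-learning idea (the network predicts a correction that is added back to the degraded input instead of regenerating the whole face), a minimal sketch follows; layer counts and widths are placeholders, and the local facial-feature branch and the discriminative constraint network are omitted.

```python
import torch
import torch.nn as nn

class ResidualInpainter(nn.Module):
    """Predicts a residual correction that is added back to the degraded input image."""
    def __init__(self, channels=3, width=64, depth=6):
        super().__init__()
        layers = [nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(width, channels, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, degraded):
        residual = self.body(degraded)        # the network only has to model the correction
        return (degraded + residual).clamp(0, 1)

x = torch.rand(1, 3, 128, 128)                # toy stand-in for a compressed, watermarked face
restored = ResidualInpainter()(x)
print(restored.shape)                         # torch.Size([1, 3, 128, 128])
```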
Citations: 1
Two-stage structure aware image inpainting based on generative adversarial networks
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446260
Jin Wang, Xi Zhang, Chen Wang, Qing Zhu, Baocai Yin
In recent years, image inpainting technology based on deep learning has made remarkable progress and can complete complex image inpainting tasks better than traditional methods. However, most of the existing methods cannot generate a reasonable structure and fine texture details at the same time. To solve this problem, in this paper we propose a two-stage, structure-aware image inpainting method based on Generative Adversarial Networks, which divides the inpainting process into two subtasks, namely image structure generation and image content generation. In the former stage, the network generates the structural information of the missing area; in the latter stage, the network uses this structural information as a prior and combines it with the existing texture and color information to complete the image. Extensive experiments are conducted to evaluate the performance of our proposed method on the Places2, CelebA and Paris Streetview datasets. The experimental results show the superior performance of the proposed method compared with other state-of-the-art methods, both qualitatively and quantitatively.
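The two-stage split described above can be sketched as below, assuming a simple convolutional stand-in for each generator; the adversarial discriminators, losses and actual architectures are omitted, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class StructureGenerator(nn.Module):
    """Stage 1: predicts a 1-channel structure (e.g. edge) map covering the missing region."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(4, 64), conv_block(64, 64),
                                 nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, masked_img, mask):
        return self.net(torch.cat([masked_img, mask], dim=1))

class ContentGenerator(nn.Module):
    """Stage 2: completes the image conditioned on the structure prior."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(5, 64), conv_block(64, 64),
                                 nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, masked_img, mask, structure):
        return self.net(torch.cat([masked_img, mask, structure], dim=1))

img = torch.rand(1, 3, 256, 256)
mask = (torch.rand(1, 1, 256, 256) > 0.8).float()   # 1 = missing pixels
masked = img * (1 - mask)
structure = StructureGenerator()(masked, mask)
completed = ContentGenerator()(masked, mask, structure)
print(completed.shape)
```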
Citations: 0
Attention feature matching for weakly-supervised video relocalization
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446317
Haoyu Tang, Jihua Zhu, Zan Gao, Tao Zhuo, Zhiyong Cheng
Localizing the desired video clip for a given query in an untrimmed video has been a hot research topic for multimedia understanding. Recently, a new task named video relocalization, in which the query is itself a video clip, has been introduced. Some methods have been developed for this task; however, they often require dense annotations of the temporal boundaries inside long videos for training. A more practical solution is the weakly-supervised approach, which only needs the matching information between the query and the video. Motivated by that, we propose a weakly-supervised video relocalization approach based on an attention-based feature matching method. Specifically, it recognizes the video clip by finding the clip whose frames are the most relevant to the query clip frames, based on the matching results of the frame embeddings. In addition, an attention module is introduced to identify the frames containing rich semantic correlations in the query video. Extensive experiments on the ActivityNet dataset demonstrate that our method can consistently outperform several weakly-supervised methods and even achieve performance competitive with supervised baselines.
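A minimal sketch of attention-weighted frame matching in this spirit is shown below: each query frame is matched to its most similar clip frame, and an attention vector re-weights the per-frame scores. The learned attention module is replaced by a random projection here purely for illustration.

```python
import torch
import torch.nn.functional as F

def attention_weights(query_feats, attn_proj):
    """Scores each query frame with a projection vector and normalizes with softmax."""
    scores = query_feats @ attn_proj            # (Tq,)
    return F.softmax(scores, dim=0)

def clip_score(query_feats, clip_feats, attn_proj):
    """Attention-weighted best-match similarity between query frames and clip frames."""
    q = F.normalize(query_feats, dim=-1)        # (Tq, D)
    c = F.normalize(clip_feats, dim=-1)         # (Tc, D)
    sim = q @ c.t()                             # (Tq, Tc) frame-to-frame cosine similarities
    best = sim.max(dim=1).values                # best clip match for each query frame
    w = attention_weights(query_feats, attn_proj)
    return (w * best).sum()

D = 512
attn_proj = torch.randn(D)                      # stand-in for a learned attention vector
query = torch.randn(16, D)                      # 16 query-clip frame embeddings
video = torch.randn(200, D)                     # 200 frames of the untrimmed video
# slide a query-length window over the video and keep the best-scoring clip
scores = [clip_score(query, video[s:s + 16], attn_proj) for s in range(0, 200 - 16)]
print(int(torch.stack(scores).argmax()))        # start frame of the retrieved clip
```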
Citations: 4
Graph-based motion prediction for abnormal action detection
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446316
Yao Tang, Lin Zhao, Zhaoliang Yao, Chen Gong, Jian Yang
Abnormal action detection is the most noteworthy part of anomaly detection, which tries to identify unusual human behaviors in videos. Previous methods typically utilize future frame prediction to detect frames deviating from the normal scenario. While this strategy enjoys success in the accuracy of anomaly detection, critical information such as the cause and location of the abnormality cannot be acquired. This paper proposes human motion prediction for abnormal action detection. We employ a sequence of human poses to represent human motion, and detect irregular behavior by comparing the predicted pose with the actual pose detected in the frame. Hence the proposed method is able to explain why an action is regarded as irregular and to locate where the anomaly happens. Moreover, the pose sequence is robust to noise, complex backgrounds and small targets in videos. Since posture information is non-Euclidean data, a graph convolutional network is adopted for future pose prediction, which leads not only to greater expressive power but also to stronger generalization capability. Experiments are conducted both on the widely used anomaly detection dataset ShanghaiTech and on our newly proposed dataset NJUST-Anomaly, which mainly contains irregular behaviors that happen on campus. Our dataset expands the existing datasets by adding more abnormal actions of public-security concern, which happen in more complex scenes and dynamic backgrounds. Experimental results on both datasets demonstrate the superiority of our method over state-of-the-art methods. The source code and NJUST-Anomaly dataset will be made public at https://github.com/datangzhengqing/MP-GCN.
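A toy sketch of the two ingredients named above, a graph convolution over skeleton joints and an anomaly score that compares the predicted pose with the detected one, might look like the following; the 5-joint chain skeleton, single-layer predictor and threshold-free score are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph convolution over skeleton joints: neighborhood aggregation then a linear map."""
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        self.register_buffer("adj_norm", adj / deg)     # row-normalized adjacency
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                # x: (batch, joints, in_dim)
        return torch.relu(self.linear(self.adj_norm @ x))

def anomaly_score(predicted_pose, detected_pose):
    """Mean per-joint distance between the predicted and the actually detected pose."""
    return (predicted_pose - detected_pose).norm(dim=-1).mean()

# toy skeleton with 5 joints chained together
adj = torch.eye(5)
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
gcn = GraphConv(2, 2, adj)                               # 2-D joint coordinates in, 2-D out
history = torch.randn(1, 5, 2)                           # last observed pose (batch, joints, xy)
predicted = gcn(history)                                 # stand-in for a full pose predictor
detected = torch.randn(1, 5, 2)
print(float(anomaly_score(predicted, detected)))         # a large value would flag the frame
```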
Citations: 5
Multiplicative angular margin loss for text-based person search
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446314
Peng Zhang, Deqiang Ouyang, Feiyu Chen, Jie Shao
Text-based person search aims at retrieving the most relevant pedestrian images from a database in response to a query in the form of a natural language description. Existing algorithms mainly focus on embedding textual and visual features into a common semantic space so that the similarity score of features from different modalities can be computed directly. Softmax loss is widely adopted to classify textual and visual features into the correct category in the joint embedding space. However, softmax loss can only help classify features; it does not increase intra-class compactness or inter-class discrepancy. To this end, we propose a multiplicative angular margin (MAM) loss to learn angularly discriminative features for each identity. The multiplicative angular margin loss penalizes the angle between a feature vector and its corresponding classifier vector to learn more discriminative features. Moreover, to focus more on informative image-text pairs, we propose a pairwise similarity weighting (PSW) loss to assign higher weights to informative pairs. Extensive experimental evaluations have been conducted on the CUHK-PEDES dataset with our proposed losses. The results show the superiority of our proposed method. Code is available at https://github.com/pengzhanguestc/MAM_loss.
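A multiplicative angular margin is commonly realized by multiplying the target-class angle by a margin m before taking its cosine. The sketch below follows that recipe under assumed margin and scale values and omits the piecewise monotonicity correction used by SphereFace-style losses; it is not the released MAM_loss code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiplicativeAngularMarginLoss(nn.Module):
    """Cross-entropy on cosine logits where the target-class angle is multiplied by a margin m."""
    def __init__(self, feat_dim, num_classes, m=2.0, s=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.m, self.s = m, s

    def forward(self, features, labels):
        w = F.normalize(self.weight, dim=1)
        f = F.normalize(features, dim=1)
        cos_theta = (f @ w.t()).clamp(-1 + 1e-7, 1 - 1e-7)   # (batch, num_classes)
        theta = torch.acos(cos_theta)
        target_logit = torch.cos(self.m * theta)              # multiplicative margin on the angle
        one_hot = F.one_hot(labels, cos_theta.size(1)).float()
        logits = self.s * (one_hot * target_logit + (1 - one_hot) * cos_theta)
        return F.cross_entropy(logits, labels)

loss_fn = MultiplicativeAngularMarginLoss(feat_dim=512, num_classes=100)
feats = torch.randn(8, 512)            # joint text/image embeddings for 8 samples
ids = torch.randint(0, 100, (8,))      # identity labels
print(loss_fn(feats, ids))
```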
Citations: 1
Defense for adversarial videos by self-adaptive JPEG compression and optical texture
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446308
Yupeng Cheng, Xingxing Wei, H. Fu, Shang-Wei Lin, Weisi Lin
Despite their demonstrated effectiveness in various computer vision tasks, Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples. Adversarial attacks on DNNs in the image domain, as well as the corresponding defenses, have been intensively studied, and some recent works have started to explore adversarial attacks on DNNs in the video domain. However, the corresponding defense is rarely studied. In this paper, we propose a new two-stage framework for defending against video adversarial attacks. It contains two main components, namely a self-adaptive Joint Photographic Experts Group (JPEG) compression defense and an optical texture based defense (OTD). In the self-adaptive JPEG compression defense, we propose to adaptively choose an appropriate JPEG quality based on an estimation of the moving foreground object, such that the JPEG compression can suppress most of the impact of adversarial noise without losing too much video quality. In OTD, we generate an "optical texture" containing high-frequency information based on the optical flow map, and use it to edit the Y channel (in YCrCb color space) of input frames, thus further reducing the influence of adversarial perturbation. Experimental results on a benchmark dataset demonstrate the effectiveness of our framework in recovering the classification performance on perturbed videos.
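The JPEG part of the defense can be approximated with standard tooling: round-trip each frame through JPEG at a quality chosen from an estimate of the moving foreground. The frame-differencing estimate and the quality mapping below are assumptions for illustration, not the paper's estimator.

```python
import io
import numpy as np
from PIL import Image

def jpeg_compress(frame, quality):
    """Round-trips a uint8 RGB frame through JPEG at the given quality."""
    buf = io.BytesIO()
    Image.fromarray(frame).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.array(Image.open(buf))

def adaptive_quality(prev_frame, frame, q_min=50, q_max=90, thresh=20):
    """Estimates the moving-foreground ratio by frame differencing and maps it to a JPEG
    quality: more motion -> higher quality so the moving object is not over-smoothed."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16)).mean(axis=2)
    moving_ratio = float((diff > thresh).mean())
    return int(q_min + (q_max - q_min) * moving_ratio)

prev_frame = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
frame = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
q = adaptive_quality(prev_frame, frame)
defended = jpeg_compress(frame, q)
print(q, defended.shape)
```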
Citations: 4
Change detection from SAR images based on deformable residual convolutional neural networks
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446320
Junjie Wang, Feng Gao, Junyu Dong
Convolutional neural networks (CNNs) have made great progress in change detection for synthetic aperture radar (SAR) images. However, the sampling locations of traditional convolutional kernels are fixed and cannot be changed according to the actual structure of the SAR images. Besides, objects may appear at different sizes in natural scenes, which requires the network to have stronger multi-scale representation ability. In this paper, a novel Deformable Residual Convolutional Neural Network (DRNet) is designed for SAR image change detection. First, the proposed DRNet introduces deformable convolutional sampling locations, so that the shape of the convolutional kernel can be adaptively adjusted according to the actual structure of ground objects. To create the deformable sampling locations, 2-D offsets are calculated for each pixel according to the spatial information of the input images. The sampling locations of pixels can then adaptively reflect the spatial structure of the input images. Moreover, we propose a novel pooling module that replaces vanilla pooling to utilize multi-scale information effectively, by constructing hierarchical residual-like connections within one pooling layer, which improves the multi-scale representation ability at a granular level. Experimental results on three real SAR datasets demonstrate the effectiveness of the proposed DRNet.
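A deformable residual block in the spirit of the description, with per-pixel 2-D offsets predicted from the input itself, can be sketched with torchvision's DeformConv2d; the channel counts and single-block structure are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableResBlock(nn.Module):
    """Residual block whose convolution samples at offsets predicted from the input itself."""
    def __init__(self, channels=32, k=3):
        super().__init__()
        self.offset_pred = nn.Conv2d(channels, 2 * k * k, k, padding=1)  # (dx, dy) per kernel tap
        self.deform_conv = DeformConv2d(channels, channels, k, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        offsets = self.offset_pred(x)             # sampling locations adapt to image structure
        return self.relu(x + self.deform_conv(x, offsets))

x = torch.randn(1, 32, 64, 64)                    # toy feature map from a pair of SAR images
print(DeformableResBlock()(x).shape)              # torch.Size([1, 32, 64, 64])
```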
Citations: 3
A multimedia solution to motivate childhood cancer patients to keep up with cancer treatment
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446262
Carmen Chai Wang Er, B. Lau, A. Mahmud, Mark Tee Kit Tsun
Childhood cancer is a deadly illness that requires the young patient to adhere to cancer treatment for survival. Sadly, the high burden of treatment side effects can make it difficult for patients to keep up with their treatment. However, childhood cancer patients can manage these treatment side effects through daily self-care to make the process more bearable. This paper outlines the design and development process of a multimedia-based solution to motivate these young patients to adhere to cancer treatment and manage their treatment side effects. Owing to the high appeal of multimedia-based interventions and the proficiency of young children in using mobile devices, the intervention in this study takes the form of a virtual pet serious game developed for mobile. The intervention, which is developed based on the Protection Motivation Theory, includes multiple game modules with the purpose of improving childhood cancer patients' coping appraisal of using cancer treatment to fight cancer and of taking daily self-care to combat treatment side effects. The prototype testing results show that the intervention is well received by the voluntary play testers. Future work of this study includes the evaluation of the developed intervention with childhood cancer patients to determine its effectiveness.
Citations: 2
Fixations based personal target objects segmentation
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446310
Ran Shi, Gongyang Li, Weijie Wei, Zhi Liu
With the development of eye-tracking techniques, fixation has become an emerging interaction mode in many human-computer interaction research fields. For a personal target object segmentation task, although fixations can serve as a novel and more convenient interactive input, they introduce a heavy ambiguity problem in what the input indicates, so that segmentation quality is severely degraded. In this paper, to address this challenge, we develop an "extraction-to-fusion" strategy based iterative lightweight neural network, whose input is composed of an original image, a fixation map and a position map. Our neural network consists of two main parts: the first, extraction part is a concise interlaced structure of standard convolution layers and progressively higher dilated convolution layers, to better extract and integrate local and global features of target objects; the second, fusion part is a convolutional long short-term memory component that refines the extracted features and stores them. Within the iteration framework, the currently extracted features are refined by fusing them with stored features extracted in previous iterations, which is a feature transmission mechanism in our neural network. The current improved segmentation result is then generated to further adjust the fixation map and the position map for the next iteration. Thus, the ambiguity problem induced by the fixations can be alleviated. Experiments demonstrate the better segmentation performance of our method and the effectiveness of each part of our model.
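A toy version of the iterative extraction-to-fusion loop might look like the sketch below, with a minimal ConvLSTM cell storing features across iterations; the map-update rule and all layer sizes are placeholders rather than the paper's design.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell used to fuse features across refinement iterations."""
    def __init__(self, channels):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 4 * channels, 3, padding=1)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

extract = nn.Sequential(nn.Conv2d(5, 16, 3, padding=1), nn.ReLU())   # image + fixation + position
predict = nn.Conv2d(16, 1, 3, padding=1)
fuse = ConvLSTMCell(16)

image = torch.rand(1, 3, 64, 64)
fixation = torch.rand(1, 1, 64, 64)
position = torch.rand(1, 1, 64, 64)
h = c = torch.zeros(1, 16, 64, 64)
for _ in range(3):                                   # iterative refinement
    feats = extract(torch.cat([image, fixation, position], dim=1))
    h, c = fuse(feats, h, c)                         # fuse with features stored from past iterations
    mask = torch.sigmoid(predict(h))
    fixation = fixation * mask                       # stand-in update: re-weight maps by the mask
    position = position * mask
print(mask.shape)
```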
Citations: 1