Bone development is a continuous process; however, discrete labels are usually used to represent bone ages. This inevitably causes a semantic gap between the actual situation and the scope a label can represent. In this paper, we present a novel method, named the overlap classification network, to narrow this semantic gap in bone age assessment. In the proposed network, the discrete bone age labels (such as 0-228 months) are treated as a sequence that is used to generate a series of subsequences. The network then exploits the overlapping information between adjacent subsequences and outputs several bone age ranges at the same time for one case. The overlapping part of these age ranges is taken as the final predicted bone age. Without any preprocessing, the proposed method achieves a much smaller mean absolute error than state-of-the-art methods on a public dataset.
{"title":"Overlap classification mechanism for skeletal bone age assessment","authors":"Pengyi Hao, Xuhang Xie, Tianxing Han, Cong Bai","doi":"10.1145/3444685.3446286","DOIUrl":"https://doi.org/10.1145/3444685.3446286","url":null,"abstract":"The bone development is a continuous process, however, discrete labels are usually used to represent bone ages. This inevitably causes a semantic gap between actual situation and label representation scope. In this paper, we present a novel method named as overlap classification network to narrow the semantic gap in bone age assessment. In the proposed network, discrete bone age labels (such as 0-228 month) are considered as a sequence that is used to generate a series of subsequences. Then the proposed network makes use of the overlapping information between adjacent subsequences and output several bone age ranges at the same time for one case. The overlapping part of these age ranges is considered as the final predicted bone age. The proposed method without any preprocessing can achieve a much smaller mean absolute error compared with state-of-the-art methods on a public dataset.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125905000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Convolutional neural network (CNN) based salient object detection (SOD) has developed rapidly in recent years. However, in some challenging cases, i.e. small-scale salient objects, low-contrast salient objects and cluttered backgrounds, existing salient object detection methods are still not satisfactory. To detect salient objects accurately, SOD networks need to locate the most salient part of an image. Fixation prediction (FP) focuses on the most visually attractive regions, so we believe it can assist in locating salient objects. To the best of our knowledge, few methods jointly consider the SOD and FP tasks. In this paper, we propose a fixation guided salient object detection network (FGNet) to leverage the correlation between SOD and FP. FGNet consists of two branches that deal with fixation prediction and salient object detection respectively. Further, an effective feature cooperation module (FCM) is proposed to fuse complementary information between the two branches. Extensive experiments on four popular datasets and comparisons with twelve state-of-the-art methods show that the proposed FGNet captures the main context of images well and locates salient objects more accurately.
{"title":"Fixation guided network for salient object detection","authors":"Zhe Cui, Li Su, Weigang Zhang, Qingming Huang","doi":"10.1145/3444685.3446288","DOIUrl":"https://doi.org/10.1145/3444685.3446288","url":null,"abstract":"Convolutional neural network (CNN) based salient object detection (SOD) has achieved great development in recent years. However, in some challenging cases, i.e. small-scale salient object, low contrast salient object and cluttered background, existing salient object detect methods are still not satisfying. In order to accurately detect salient objects, SOD networks need to fix the position of most salient part. Fixation prediction (FP) focuses on the most visual attractive regions, so we think it could assist in locating salient objects. As far as we know, there are few methods jointly consider SOD and FP tasks. In this paper, we propose a fixation guided salient object detection network (FGNet) to leverage the correlation between SOD and FP. FGNet consists of two branches to deal with fixation prediction and salient object detection respectively. Further, an effective feature cooperation module (FCM) is proposed to fuse complementary information between the two branches. Extensive experiments on four popular datasets and comparisons with twelve state-of-the-art methods show that the proposed FGNet well captures the main context of images and locates salient objects more accurately.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125992037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stylized text with decorative elements has strong visual appeal and enriches our daily work, study and life. However, it introduces new challenges to text detection and recognition. In this study, we propose a text destylization framework that can transform stylized texts with decorative elements into a form that is easily distinguishable by a detection or recognition model. We arrange and integrate an existing stylistic text dataset to train the destylization network. The new destylized dataset contains English letters and Chinese characters. The proposed approach enables a single framework to handle both Chinese characters and English letters without the need for additional networks. Experiments show that the method is superior to state-of-the-art style-related models.
{"title":"Destylization of text with decorative elements","authors":"Yuting Ma, Fan Tang, Weiming Dong, Changsheng Xu","doi":"10.1145/3444685.3446324","DOIUrl":"https://doi.org/10.1145/3444685.3446324","url":null,"abstract":"Style text with decorative elements has a strong visual sense, and enriches our daily work, study and life. However, it introduces new challenges to text detection and recognition. In this study, we propose a text destylized framework, that can transform the stylized texts with decorative elements into a type that is easily distinguishable by a detection or recognition model. We arranged and integrate an existing stylistic text data set to train the destylized network. The new destylized data set contains English letters and Chinese characters. The proposed approach enables a framework to handle both Chinese characters and English letters without the need for additional networks. Experiments show that the method is superior to the state-of-the-art style-related models.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134048240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The graph convolution network (GCN) is an important method recently developed for few-shot learning. The adjacency matrix in GCN models is constructed from graph node features to represent the relationships between graph nodes, according to which the graph network performs message-passing inference. Therefore, the representation ability of the graph node features is an important factor affecting the learning performance of a GCN. This paper proposes an improved GCN model with node feature optimization using cross attention, named GCN-NFO. Leveraging a cross attention mechanism to associate the image features of the support set and the query set, the proposed model extracts more representative and discriminative salient region features as the initialization features of the graph nodes through information aggregation. Since the graph network can represent the relationships between samples, the optimized graph node features propagate information through the graph network, implicitly enhancing the similarity of intra-class samples and the dissimilarity of inter-class samples, and thus the learning capability of the GCN. Extensive experimental results on image classification tasks using different image datasets show that GCN-NFO is an effective few-shot learning algorithm that significantly improves classification accuracy compared with other existing models.
{"title":"Graph convolution network with node feature optimization using cross attention for few-shot learning","authors":"Ying Liu, Yanbo Lei, Sheikh Faisal Rashid","doi":"10.1145/3444685.3446278","DOIUrl":"https://doi.org/10.1145/3444685.3446278","url":null,"abstract":"Graph convolution network (GCN) is an important method recently developed for few-shot learning. The adjacency matrix in GCN models is constructed based on graph node features to represent the graph node relationships, according to which, the graph network achieves message-passing inference. Therefore, the representation ability of graph node features is an important factor affecting the learning performance of GCN. This paper proposes an improved GCN model with node feature optimization using cross attention, named GCN-NFO. Leveraging on cross attention mechanism to associate the image features of support set and query set, the proposed model extracts more representative and discriminative salient region features as initialization features of graph nodes through information aggregation. Since graph network can represent the relationship between samples, the optimized graph node features transmit information through the graph network, thus implicitly enhances the similarity of intra-class samples and the dissimilarity of inter-class samples, thus enhancing the learning capability of GCN. Intensive experimental results on image classification task using different image datasets prove that GCN-NFO is an effective few-shot learning algorithm which significantly improves the classification accuracy, compared with other existing models.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131151647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yijun Liu, Zhengning Wang, Ruixu Geng, Hao Zeng, Yi Zeng
Low visibility and high-level noise are two challenges for low-light image enhancement. In this paper, by introducing the fractional order differential, we propose an end-to-end conditional generative adversarial network (GAN) to solve these two problems. For the problem of low visibility, we set up a global discriminator to improve the overall reconstruction quality and restore brightness information. For the high-level noise problem, we introduce fractional order differentiation into both the generator and the discriminator. Compared with conventional end-to-end methods, the fractional order can better distinguish noise from high-frequency details, thereby achieving superior noise reduction while preserving details. Finally, experimental results show that the proposed model obtains superior visual effects in low-light image enhancement. By introducing the fractional order differential, we anticipate that our framework will enable high-quality and detailed image recovery not only in low-light enhancement but also in other fields that require fine details.
{"title":"Structure-preserving extremely low light image enhancement with fractional order differential mask guidance","authors":"Yijun Liu, Zhengning Wang, Ruixu Geng, Hao Zeng, Yi Zeng","doi":"10.1145/3444685.3446319","DOIUrl":"https://doi.org/10.1145/3444685.3446319","url":null,"abstract":"Low visibility and high-level noise are two challenges for low-light image enhancement. In this paper, by introducing fractional order differential, we propose an end-to-end conditional generative adversarial network(GAN) to solve those two problems. For the problem of low visibility, we set up a global discriminator to improve the overall reconstruction quality and restore brightness information. For the high-level noise problem, we introduce fractional order differentiation into both the generator and the discriminator. Compared with conventional end-to-end methods, fractional order can better distinguish noise and high-frequency details, thereby achieving superior noise reduction effects while maintaining details. Finally, experimental results show that the proposed model obtains superior visual effects in low-light image enhancement. By introducing fractional order differential, we anticipate that our framework will enable high quality and detailed image recovery not only in the field of low-light enhancement but also in other fields that require details.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122486176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nuclei instance segmentation is essential for cell morphometrics and analysis, playing a crucial role in digital pathology. The variability of nuclei characteristics among diverse cell types makes this task more challenging. Recently, proposal-based segmentation methods with a feature pyramid network (FPN) have shown good performance because the FPN integrates multi-scale features with strong semantics. However, the FPN suffers from information loss in the highest-level feature map and sub-optimal feature fusion strategies. This paper proposes a proposal-based adaptive feature aggregation network (AANet) to make full use of multi-scale features. Specifically, AANet consists of two components: a Context Augmentation Module (CAM) and a Feature Adaptive Selection Module (ASM). In feature fusion, CAM focuses on exploring extensive contextual information and capturing discriminative semantics to reduce the information loss of the feature map at the highest pyramid level. The enhanced features are then sent to ASM to obtain a combined feature representation adaptively over all feature levels for each RoI. Experiments show our model's effectiveness on two publicly available datasets: the Kaggle 2018 Data Science Bowl dataset and the Multi-Organ nuclei segmentation dataset.
{"title":"Adaptive feature aggregation network for nuclei segmentation","authors":"Ruizhe Geng, Zhongyi Huang, Jie Chen","doi":"10.1145/3444685.3446271","DOIUrl":"https://doi.org/10.1145/3444685.3446271","url":null,"abstract":"Nuclei instance segmentation is essential for cell morphometrics and analysis, playing a crucial role in digital pathology. The problem of variability in nuclei characteristics among diverse cell types makes this task more challenging. Recently, proposal-based segmentation methods with feature pyramid network (FPN) has shown good performance because FPN integrates multi-scale features with strong semantics. However, FPN has information loss of the highest-level feature map and sub-optimal feature fusion strategies. This paper proposes a proposal-based adaptive feature aggregation methods (AANet) to make full use of multi-scale features. Specifically, AANet consists of two components: Context Augmentation Module (CAM) and Feature Adaptive Selection Module (ASM). In feature fusion, CAM focus on exploring extensive contextual information and capturing discriminative semantics to reduce the information loss of feature map at the highest pyramid level. The enhanced features are then sent to ASM to get a combined feature representation adaptively over all feature levels for each RoI. The experiments show our model's effectiveness on two publicly available datasets: the Kaggle 2018 Data Science Bowl dataset and the Multi-Organ nuclei segmentation dataset.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116565078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To make full use of the inherent correlation between facial regions and expressions, we propose an attention-constraint facial expression recognition method in which the prior correlation between facial regions and expressions is integrated into the attention weights to extract better representations. The proposed method mainly consists of four components: a feature extractor, a local self attention-constraint learner (LSACL), a global and local attention-constraint learner (GLACL) and a facial expression classifier. Specifically, the feature extractor extracts features from the overall facial image and its corresponding cropped facial regions. The extracted local features from the facial regions are then fed into the local self attention-constraint learner, where prior rank constraints summarized from facial domain knowledge are embedded into the self attention weights. Similarly, the rank correlation constraints between each facial region and a specified expression are further embedded into the global-to-local attention weights when the global feature and the local features from the local self attention-constraint learner are fed into the global and local attention-constraint learner. Finally, the feature from the global and local attention-constraint learner and the original global feature are fused and passed to the facial expression classifier to perform facial expression recognition. Experiments on two benchmark datasets validate the effectiveness of the proposed method.
{"title":"Attention-constraint facial expression recognition","authors":"Qisheng Jiang","doi":"10.1145/3444685.3446307","DOIUrl":"https://doi.org/10.1145/3444685.3446307","url":null,"abstract":"To make full use of existing inherent correlation between facial regions and expression, we propose an attention-constraint facial expression recognition method, where the prior correlation between facial regions and expression is integrated into attention weights for extracting better representation. The proposed method mainly consists of four components: feature extractor, local self attention-constraint learner (LSACL), global and local attention-constraint learner (GLACL) and facial expression classifier. Specifically, feature extractor is mainly used to extract features from overall facial image and its corresponding cropped facial regions. Then, the extracted local features from facial regions are fed into local self attention-constraint learner, where some prior rank constraints summarized from facial domain knowledge are embedded into self attention weights. Similarly, the rank correlation constraints between respective facial region and a specified expression are further embedded into global-to-local attention weights when the global feature and local features from local self attention-constraint learner are fed into global and local attention-constraint learner. Finally, the feature from global and local attention-constraint learner and original global feature are fused and passed to facial expression classifier for conducting facial expression recognition. Experiments on two benchmark datasets validate the effectiveness of the proposed method.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132444820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-lingual image captioning, with its ability to caption an unlabeled image in a target language other than English, is an emerging topic in the multimedia field. To spare the precious human effort of re-writing reference sentences for each target language, in this paper we make an attempt towards annotation-free evaluation of cross-lingual image captioning. Depending on whether we assume the availability of English references, two scenarios are investigated. For the first scenario, with references available, we propose two metrics, i.e., WMDRel and CLinRel. WMDRel measures the semantic relevance between a model-generated caption and a machine translation of an English reference using their Word Mover's Distance. By projecting both captions into a deep visual feature space, CLinRel is a visually oriented cross-lingual relevance measure. For the second scenario, which has zero references and is thus more challenging, we propose CMedRel to compute a cross-media relevance between the generated caption and the image content, in the same visual feature space as used by CLinRel. We have conducted a number of experiments to evaluate the effectiveness of the three proposed metrics. The combination of WMDRel, CLinRel and CMedRel has a Spearman's rank correlation of 0.952 with the sum of BLEU-4, METEOR, ROUGE-L and CIDEr, four standard metrics computed using references in the target language. CMedRel alone has a Spearman's rank correlation of 0.786 with the standard metrics. These promising results show the high potential of the new metrics for evaluation without references in the target language.
{"title":"Towards annotation-free evaluation of cross-lingual image captioning","authors":"Aozhu Chen, Xinyi Huang, Hailan Lin, Xirong Li","doi":"10.1145/3444685.3446322","DOIUrl":"https://doi.org/10.1145/3444685.3446322","url":null,"abstract":"Cross-lingual image captioning, with its ability to caption an unlabeled image in a target language other than English, is an emerging topic in the multimedia field. In order to save the precious human resource from re-writing reference sentences per target language, in this paper we make a brave attempt towards annotation-free evaluation of cross-lingual image captioning. Depending on whether we assume the availability of English references, two scenarios are investigated. For the first scenario with the references available, we propose two metrics, i.e., WMDRel and CLinRel. WMDRel measures the semantic relevance between a model-generated caption and machine translation of an English reference using their Word Mover's Distance. By projecting both captions into a deep visual feature space, CLinRel is a visual-oriented cross-lingual relevance measure. As for the second scenario, which has zero reference and is thus more challenging, we propose CMedRel to compute a cross-media relevance between the generated caption and the image content, in the same visual feature space as used by CLinRel. We have conducted a number of experiments to evaluate the effectiveness of the three proposed metrics. The combination of WMDRel, CLinRel and CMedRel has a Spearman's rank correlation of 0.952 with the sum of BLEU-4, METEOR, ROUGE-L and CIDEr, four standard metrics computed using references in the target language. CMedRel alone has a Spearman's rank correlation of 0.786 with the standard metrics. The promising results show high potential of the new metrics for evaluation with no need of references in the target language.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114484934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Novelty detection is an important research area that mainly addresses the problem of classifying inliers, which usually consist of normal samples, and outliers, which are composed of abnormal samples. The auto-encoder is often used for novelty detection. However, the generalization ability of the auto-encoder may cause undesirable reconstruction of abnormal elements and reduce the discriminative ability of the model. To solve this problem, we focus on better reconstructing the normal samples while retaining their unique information to improve the performance of the auto-encoder for novelty detection. First, we introduce an attention mechanism into the task. Under the action of the attention mechanism, the auto-encoder can pay more attention to the representation of inlier samples through adversarial training. Second, we apply information entropy to the latent layer to make it sparse and to constrain the expression of diversity. Experimental results on three public datasets show that the proposed method achieves comparable performance to previous popular approaches.
{"title":"Improving auto-encoder novelty detection using channel attention and entropy minimization","authors":"Dongyan Guo, Miao Tian, Ying Cui, Xiang Pan, Shengyong Chen","doi":"10.1145/3444685.3446311","DOIUrl":"https://doi.org/10.1145/3444685.3446311","url":null,"abstract":"Novelty detection is a important research area which mainly solves the classification problem of inliers which usually consists of normal samples and outliers composed of abnormal samples. Auto-encoder is often used for novelty detection. However, the generalization ability of the auto-encoder may cause the undesirable reconstruction of abnormal elements and reduce the identification ability of the model. To solve the problem, we focus on the perspective of better reconstructing the normal samples as well as retaining the unique information of normal samples to improve the performance of auto-encoder for novelty detection. Firstly, we introduce attention mechanism into the task. Under the action of attention mechanism, auto-encoder can pay more attention to the representation of inlier samples through adversarial training. Secondly, we apply the information entropy into the latent layer to make it sparse and constrain the expression of diversity. Experimental results on three public datasets show that the proposed method achieves comparable performance compared with previous popular approaches.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115694903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shagun Uppal, Anish Madan, Sarthak Bhagat, Yi Yu, R. Shah
Visual Question Generation (VQG) is the task of generating natural questions based on an image. Popular methods in the past have explored image-to-sequence architectures trained with maximum likelihood, which generate meaningful questions given an image and its associated ground-truth answer. VQG becomes more challenging if the image contains rich contextual information describing its different semantic categories. In this paper, we exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers. Our approach addresses two major shortcomings of existing VQG systems: (i) it minimizes the level of supervision and (ii) it replaces generic questions with category-relevant generations. Most importantly, by eliminating expensive answer annotations, the required supervision is weakened. Using different categories enables us to exploit different concepts, as the inference requires only the image and the category. Mutual information is maximized between the image, the question, and the answer category in the latent space of our VAE. A novel category-consistent cyclic loss is proposed to enable the model to generate consistent predictions with respect to the answer category, reducing redundancies and irregularities. Additionally, we impose supplementary constraints on the latent space of our generative model to provide structure based on categories and to enhance generalization by encapsulating decorrelated features within each dimension. Through extensive experiments, the proposed model, C3VQG, outperforms state-of-the-art VQG methods with weak supervision.
{"title":"C3VQG: category consistent cyclic visual question generation","authors":"Shagun Uppal, Anish Madan, Sarthak Bhagat, Yi Yu, R. Shah","doi":"10.1145/3444685.3446302","DOIUrl":"https://doi.org/10.1145/3444685.3446302","url":null,"abstract":"Visual Question Generation (VQG) is the task of generating natural questions based on an image. Popular methods in the past have explored image-to-sequence architectures trained with maximum likelihood which have demonstrated meaningful generated questions given an image and its associated ground-truth answer. VQG becomes more challenging if the image contains rich contextual information describing its different semantic categories. In this paper, we try to exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers. Our approach solves two major shortcomings of existing VQG systems: (i) minimize the level of supervision and (ii) replace generic questions with category relevant generations. Most importantly, by eliminating expensive answer annotations, the required supervision is weakened. Using different categories enables us to exploit different concepts as the inference requires only the image and the category. Mutual information is maximized between the image, question, and answer category in the latent space of our VAE. A novel category consistent cyclic loss is proposed to enable the model to generate consistent predictions with respect to the answer category, reducing redundancies and irregularities. Additionally, we also impose supplementary constraints on the latent space of our generative model to provide structure based on categories and enhance generalization by encapsulating decorrelated features within each dimension. Through extensive experiments, the proposed model, C3VQG outperforms state-of-the-art VQG methods with weak supervision.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129302118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}