Xiaoyi Zhang, Zheng Wang, Xing Xu, Jiwei Wei, Yang Yang
Scene graph generation intends to build graph-based representation from images, where nodes and edges respectively represent objects and relationships between them. However, scene graph generation today is heavily limited by imbalanced class prediction. Specifically, most of existing work achieves satisfying performance on simple and frequent relation classes (e.g. on), yet leaving poor performance with fine-grained and infrequent ones (e.g. walk on, stand on). To tackle this problem, in this paper, we redesign the framework as two branches, representation learning branch and classifier learning branch, for a more balanced scene graph generator. Furthermore, for representation learning branch, we propose Cross-modal Attention Coordinator (CAC) to gather consistent features from multi-modal using dynamic attention. For classifier learning branch, we first transfer relation classes' knowledge from large scale corpus, then we leverage Multi-Relationship classifier via Graph Attention neTworks (MR-GAT) to bridge the gap between frequent relations and infrequent ones. The comprehensive experimental results on VG200, a challenge dataset, indicate the competitiveness and the significant superiority of our proposed approach.
场景图生成旨在从图像中构建基于图的表示,其中节点和边分别表示对象及其之间的关系。然而,今天的场景图生成受到不平衡的类别预测的严重限制。具体来说,大多数现有的工作在简单和频繁的关系类(例如on)上实现了令人满意的性能,但是在细粒度和不频繁的关系类(例如walk on, stand on)上留下了较差的性能。为了解决这个问题,本文将框架重新设计为两个分支,表示学习分支和分类器学习分支,以获得更平衡的场景图生成器。此外,对于表征学习分支,我们提出了跨模态注意协调器(Cross-modal Attention Coordinator, CAC),利用动态注意从多模态中收集一致的特征。对于分类器学习分支,我们首先从大规模语料库中迁移关系类的知识,然后通过图注意网络(MR-GAT)利用多关系分类器来弥合频繁关系和不频繁关系之间的差距。在挑战数据集VG200上的综合实验结果表明,本文提出的方法具有竞争力和显著的优越性。
{"title":"Scene graph generation via multi-relation classification and cross-modal attention coordinator","authors":"Xiaoyi Zhang, Zheng Wang, Xing Xu, Jiwei Wei, Yang Yang","doi":"10.1145/3444685.3446276","DOIUrl":"https://doi.org/10.1145/3444685.3446276","url":null,"abstract":"Scene graph generation intends to build graph-based representation from images, where nodes and edges respectively represent objects and relationships between them. However, scene graph generation today is heavily limited by imbalanced class prediction. Specifically, most of existing work achieves satisfying performance on simple and frequent relation classes (e.g. on), yet leaving poor performance with fine-grained and infrequent ones (e.g. walk on, stand on). To tackle this problem, in this paper, we redesign the framework as two branches, representation learning branch and classifier learning branch, for a more balanced scene graph generator. Furthermore, for representation learning branch, we propose Cross-modal Attention Coordinator (CAC) to gather consistent features from multi-modal using dynamic attention. For classifier learning branch, we first transfer relation classes' knowledge from large scale corpus, then we leverage Multi-Relationship classifier via Graph Attention neTworks (MR-GAT) to bridge the gap between frequent relations and infrequent ones. The comprehensive experimental results on VG200, a challenge dataset, indicate the competitiveness and the significant superiority of our proposed approach.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114960445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To achieve more accurate 2D human pose estimation, we extend the successful encoder-decoder network, simple baseline network (SBN), in three ways. To reduce the quantization errors caused by the large output stride size, two more decoder modules are appended to the end of the simple baseline network to get full output resolution. Then, the global context blocks (GCBs) are added to the encoder and decoder modules to enhance them with global context features. Furthermore, we propose a novel spatial-attention-based multi-scale feature collection and distribution module (SA-MFCD) to fuse and distribute multi-scale features to boost the pose estimation. Experimental results on the MS COCO dataset indicate that our network can remarkably improve the accuracy of human pose estimation over SBN, our network using ResNet34 as the backbone network can even achieve the same accuracy as SBN with ResNet152, and our networks can achieve superior results with big backbone networks.
{"title":"Full-resolution encoder-decoder networks with multi-scale feature fusion for human pose estimation","authors":"Jie Ou, Mingjian Chen, Hong Wu","doi":"10.1145/3444685.3446282","DOIUrl":"https://doi.org/10.1145/3444685.3446282","url":null,"abstract":"To achieve more accurate 2D human pose estimation, we extend the successful encoder-decoder network, simple baseline network (SBN), in three ways. To reduce the quantization errors caused by the large output stride size, two more decoder modules are appended to the end of the simple baseline network to get full output resolution. Then, the global context blocks (GCBs) are added to the encoder and decoder modules to enhance them with global context features. Furthermore, we propose a novel spatial-attention-based multi-scale feature collection and distribution module (SA-MFCD) to fuse and distribute multi-scale features to boost the pose estimation. Experimental results on the MS COCO dataset indicate that our network can remarkably improve the accuracy of human pose estimation over SBN, our network using ResNet34 as the backbone network can even achieve the same accuracy as SBN with ResNet152, and our networks can achieve superior results with big backbone networks.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115592855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video summarization which potentially fast browses a large amount of emerging video data as well as saves storage cost has attracted tremendous attentions in machine learning and information retrieval. Among existing efforts, determinantal point processes (DPPs) designed for selecting a subset of video frames to represent the whole video have shown great success in video summarization. However, existing methods have shown poor performance to generate fixed-size output summaries for video data, especially when video frames arrive in streaming manner. In this paper, we provide an efficient approach k-seqLS which summarizes streaming video data with a fixed-size k in vein of DPPs. Our k-seqLS approach can fully exploit the sequential nature of video frames by setting a time window and the frames outside the window have no influence on current video frame. Since the log-style of the DPP probability for each subset of frames is a non-monotone submodular function, local search as well as greedy techniques with cardinality constraints are adopted to make k-seqLS fixed-sized, efficient and with theoretical guarantee. Our experiments show that our proposed k-seqLS exhibits higher performance while maintaining practical running time.
{"title":"Fixed-size video summarization over streaming data via non-monotone submodular maximization","authors":"Ganfeng Lu, Jiping Zheng","doi":"10.1145/3444685.3446285","DOIUrl":"https://doi.org/10.1145/3444685.3446285","url":null,"abstract":"Video summarization which potentially fast browses a large amount of emerging video data as well as saves storage cost has attracted tremendous attentions in machine learning and information retrieval. Among existing efforts, determinantal point processes (DPPs) designed for selecting a subset of video frames to represent the whole video have shown great success in video summarization. However, existing methods have shown poor performance to generate fixed-size output summaries for video data, especially when video frames arrive in streaming manner. In this paper, we provide an efficient approach k-seqLS which summarizes streaming video data with a fixed-size k in vein of DPPs. Our k-seqLS approach can fully exploit the sequential nature of video frames by setting a time window and the frames outside the window have no influence on current video frame. Since the log-style of the DPP probability for each subset of frames is a non-monotone submodular function, local search as well as greedy techniques with cardinality constraints are adopted to make k-seqLS fixed-sized, efficient and with theoretical guarantee. Our experiments show that our proposed k-seqLS exhibits higher performance while maintaining practical running time.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121059537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this research, the authors succeeded in creating facial expressions made with the minimum necessary elements for recognizing a face. The elements are two eyes and a mouth made using precise circles, which are transformed to make facial expressions geometrically, through rotation and vertically scaling transformation. The facial expression patterns made by the geometric elements and transformations were composed employing three dimensions of visual information that had been suggested by many previous researches, slantedness of the mouth, openness of the face, and slantedness of the eyes. The authors found that this minimal facial expressions can be classified into 10 emotions: happy, angry, sad, disgust, fear, surprised, angry*, fear*, neutral (pleasant) indicating positive emotion, and neutral (unpleasant) indicating negative emotion. The authors also investigate and report cultural differences of impressions of facial expressions of above-mentioned simplified face.
{"title":"Cross-cultural design of facial expressions for humanoids: is there cultural difference between Japan and Denmark?","authors":"I. Kanaya, Meina Tawaki, Keiko Yamamoto","doi":"10.1145/3444685.3446294","DOIUrl":"https://doi.org/10.1145/3444685.3446294","url":null,"abstract":"In this research, the authors succeeded in creating facial expressions made with the minimum necessary elements for recognizing a face. The elements are two eyes and a mouth made using precise circles, which are transformed to make facial expressions geometrically, through rotation and vertically scaling transformation. The facial expression patterns made by the geometric elements and transformations were composed employing three dimensions of visual information that had been suggested by many previous researches, slantedness of the mouth, openness of the face, and slantedness of the eyes. The authors found that this minimal facial expressions can be classified into 10 emotions: happy, angry, sad, disgust, fear, surprised, angry*, fear*, neutral (pleasant) indicating positive emotion, and neutral (unpleasant) indicating negative emotion. The authors also investigate and report cultural differences of impressions of facial expressions of above-mentioned simplified face.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"241 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117017056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lexuan Sun, Xueliang Liu, Zhenzhen Hu, Richang Hong
Image dehazing is a fundamental task for the computer vision and multimedia and usually in the face of the challenge from two aspects, i) the uneven distribution of arbitrary haze and ii) the distortion of image pixels caused by the hazed image. In this paper, we propose an end-to-end trainable framework, named Weighted-Fusion Network with Poly-Scale Convolution (WFN-PSC), to address these dehazing issues. The proposed method is designed based on the Poly-Scale Convolution (PSConv). It can extract the image feature from different scales without upsampling and downsampled, which avoids the image distortion. Beyond this, we design the spatial and channel weighted-fusion modules to make the WFN-PSC model focus on the hard dehazing parts of image from two dimensions. Specifically, we design three Part Architectures followed by the channel weighted-fusion module. Each Part Architecture consists of three PSConv residual blocks and a spatial weighted-fusion module. The experiments on the benchmark demonstrate the dehazing effectiveness of the proposed method. Furthermore, considering that image dehazing is a low-level task in the computer vision, we evaluate the dehazed image on the object detection task and the results show that the proposed method can be a good pre-processing to assist the high-level computer vision task.
{"title":"WFN-PSC: weighted-fusion network with poly-scale convolution for image dehazing","authors":"Lexuan Sun, Xueliang Liu, Zhenzhen Hu, Richang Hong","doi":"10.1145/3444685.3446292","DOIUrl":"https://doi.org/10.1145/3444685.3446292","url":null,"abstract":"Image dehazing is a fundamental task for the computer vision and multimedia and usually in the face of the challenge from two aspects, i) the uneven distribution of arbitrary haze and ii) the distortion of image pixels caused by the hazed image. In this paper, we propose an end-to-end trainable framework, named Weighted-Fusion Network with Poly-Scale Convolution (WFN-PSC), to address these dehazing issues. The proposed method is designed based on the Poly-Scale Convolution (PSConv). It can extract the image feature from different scales without upsampling and downsampled, which avoids the image distortion. Beyond this, we design the spatial and channel weighted-fusion modules to make the WFN-PSC model focus on the hard dehazing parts of image from two dimensions. Specifically, we design three Part Architectures followed by the channel weighted-fusion module. Each Part Architecture consists of three PSConv residual blocks and a spatial weighted-fusion module. The experiments on the benchmark demonstrate the dehazing effectiveness of the proposed method. Furthermore, considering that image dehazing is a low-level task in the computer vision, we evaluate the dehazed image on the object detection task and the results show that the proposed method can be a good pre-processing to assist the high-level computer vision task.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127935984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yi-Bin Cheng, Xipeng Chen, Dongyu Zhang, Liang Lin
With the development of deep learning, skeleton-based action recognition has achieved great progress in recent years. However, most of the current works focus on extracting more informative spatial representations of the human body, but haven't made full use of the temporal dependencies already contained in the sequence of human action. To this end, we propose a novel transformer-based model called Motion-Transformer to sufficiently capture the temporal dependencies via self-supervised pre-training on the sequence of human action. Besides, we propose to predict the motion flow of human skeletons for better learning the temporal dependencies in sequence. The pre-trained model is then fine-tuned on the task of action recognition. Experimental results on the large scale NTU RGB+D dataset shows our model is effective in modeling temporal relation, and the flow prediction pre-training is beneficial to expose the inherent dependencies in time dimensional. With this pre-training and fine-tuning paradigm, our final model outperforms previous state-of-the-art methods.
{"title":"Motion-transformer: self-supervised pre-training for skeleton-based action recognition","authors":"Yi-Bin Cheng, Xipeng Chen, Dongyu Zhang, Liang Lin","doi":"10.1145/3444685.3446289","DOIUrl":"https://doi.org/10.1145/3444685.3446289","url":null,"abstract":"With the development of deep learning, skeleton-based action recognition has achieved great progress in recent years. However, most of the current works focus on extracting more informative spatial representations of the human body, but haven't made full use of the temporal dependencies already contained in the sequence of human action. To this end, we propose a novel transformer-based model called Motion-Transformer to sufficiently capture the temporal dependencies via self-supervised pre-training on the sequence of human action. Besides, we propose to predict the motion flow of human skeletons for better learning the temporal dependencies in sequence. The pre-trained model is then fine-tuned on the task of action recognition. Experimental results on the large scale NTU RGB+D dataset shows our model is effective in modeling temporal relation, and the flow prediction pre-training is beneficial to expose the inherent dependencies in time dimensional. With this pre-training and fine-tuning paradigm, our final model outperforms previous state-of-the-art methods.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131402574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised deep hashing is a promising technique for large-scale image retrieval, as it equips powerful deep neural networks and has advantage on label independence. However, the unsupervised deep hashing process needs to train a large amount of deep neural network parameters, which is hard to optimize when no labeled training samples are provided. How to maintain the well scalability of unsupervised hashing while exploiting the advantage of deep neural network is an interesting but challenging problem to investigate. With the motivation, in this paper, we propose a simple but effective Inter-image Relation Graph Neural Network Hashing (IRGNNH) method. Different from all existing complex models, we discover the latent inter-image semantic relations without any manual labels and exploit them further to assist the unsupervised deep hashing process. Specifically, we first parse the images to extract latent involved semantics. Then, relation graph convolutional network is constructed to model the inter-image semantic relations and visual similarity, which generates representation vectors for image relations and contents. Finally, adversarial learning is performed to seamlessly embed the constructed relations into the image hash learning process, and improve the discriminative capability of the hash codes. Experiments demonstrate that our method significantly outperforms the state-of-the-art unsupervised deep hashing methods on both retrieval accuracy and efficiency.
{"title":"Efficient inter-image relation graph neural network hashing for scalable image retrieval","authors":"Hui Cui, Lei Zhu, Wentao Tan","doi":"10.1145/3444685.3446321","DOIUrl":"https://doi.org/10.1145/3444685.3446321","url":null,"abstract":"Unsupervised deep hashing is a promising technique for large-scale image retrieval, as it equips powerful deep neural networks and has advantage on label independence. However, the unsupervised deep hashing process needs to train a large amount of deep neural network parameters, which is hard to optimize when no labeled training samples are provided. How to maintain the well scalability of unsupervised hashing while exploiting the advantage of deep neural network is an interesting but challenging problem to investigate. With the motivation, in this paper, we propose a simple but effective Inter-image Relation Graph Neural Network Hashing (IRGNNH) method. Different from all existing complex models, we discover the latent inter-image semantic relations without any manual labels and exploit them further to assist the unsupervised deep hashing process. Specifically, we first parse the images to extract latent involved semantics. Then, relation graph convolutional network is constructed to model the inter-image semantic relations and visual similarity, which generates representation vectors for image relations and contents. Finally, adversarial learning is performed to seamlessly embed the constructed relations into the image hash learning process, and improve the discriminative capability of the hash codes. Experiments demonstrate that our method significantly outperforms the state-of-the-art unsupervised deep hashing methods on both retrieval accuracy and efficiency.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129932555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The past few years have seen an increase in the number of products that use AR and VR as well as the emergence of products in both these categories i.e. Mixed Reality. However, current systems are exclusive to a market that exists in the top 1% of the population in most countries due to the expensive and heavy technology required by these systems. This project showcases a system in the field of Smartphone Based Mixed Reality through an Interior Design Solution that allows the user to visualise their design choices through the lens of a smartphone. Our system uses Image Processing algorithms to perceive room dimensions alongside a GUI which allows a user to create their own blueprints. Navigable 3D models are created from these blueprints, allowing users to view their builds. Following this, Users switch to the mobile application for the purpose of visualising their ideas in their own homes (MR). This System/POC showcases the potential of MR as a field that can be explored for a larger portion of the population through a more efficient medium.
{"title":"Synthesized 3D models with smartphone based MR to modify the PreBuilt environment: interior design","authors":"Anish Bhardwaj, N. Chauhan, R. Shah","doi":"10.1145/3444685.3446251","DOIUrl":"https://doi.org/10.1145/3444685.3446251","url":null,"abstract":"The past few years have seen an increase in the number of products that use AR and VR as well as the emergence of products in both these categories i.e. Mixed Reality. However, current systems are exclusive to a market that exists in the top 1% of the population in most countries due to the expensive and heavy technology required by these systems. This project showcases a system in the field of Smartphone Based Mixed Reality through an Interior Design Solution that allows the user to visualise their design choices through the lens of a smartphone. Our system uses Image Processing algorithms to perceive room dimensions alongside a GUI which allows a user to create their own blueprints. Navigable 3D models are created from these blueprints, allowing users to view their builds. Following this, Users switch to the mobile application for the purpose of visualising their ideas in their own homes (MR). This System/POC showcases the potential of MR as a field that can be explored for a larger portion of the population through a more efficient medium.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124845606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yangchao Wang, Shiyuan He, Xing Xu, Yang Yang, Jingjing Li, Heng Tao Shen
Cross-modal retrieval aims at enabling flexible retrieval across different modalities. The core of cross-modal retrieval is to learn projections for different modalities and make instances in the learned common subspace comparable to each other. Self-supervised learning automatically creates a supervision signal by transformation of input data and learns semantic features by training to predict the artificial labels. In this paper, we proposed a novel method named Self-Supervised Adversarial Learning (SSAL) for Cross-Modal Retrieval, which deploys self-supervised learning and adversarial learning to seek an effective common subspace. A feature projector tries to generate modality-invariant representations in the common subspace that can confuse an adversarial discriminator consists of two classifiers. One of the classifiers aims to predict rotation angle from image representations, while the other classifier tries to discriminate between different modalities from the learned embeddings. By confusing the self-supervised adversarial model, feature projector filters out the abundant high-level visual semantics and learns image embeddings that are better aligned with text modality in the common subspace. Through the joint exploitation of the above, an effective common subspace is learned, in which representations of different modlities are aligned better and common information of different modalities is well preserved. Comprehensive experimental results on three widely-used benchmark datasets show that the proposed method is superior in cross-modal retrieval and significantly outperforms the existing cross-modal retrieval methods.
{"title":"Self-supervised adversarial learning for cross-modal retrieval","authors":"Yangchao Wang, Shiyuan He, Xing Xu, Yang Yang, Jingjing Li, Heng Tao Shen","doi":"10.1145/3444685.3446269","DOIUrl":"https://doi.org/10.1145/3444685.3446269","url":null,"abstract":"Cross-modal retrieval aims at enabling flexible retrieval across different modalities. The core of cross-modal retrieval is to learn projections for different modalities and make instances in the learned common subspace comparable to each other. Self-supervised learning automatically creates a supervision signal by transformation of input data and learns semantic features by training to predict the artificial labels. In this paper, we proposed a novel method named Self-Supervised Adversarial Learning (SSAL) for Cross-Modal Retrieval, which deploys self-supervised learning and adversarial learning to seek an effective common subspace. A feature projector tries to generate modality-invariant representations in the common subspace that can confuse an adversarial discriminator consists of two classifiers. One of the classifiers aims to predict rotation angle from image representations, while the other classifier tries to discriminate between different modalities from the learned embeddings. By confusing the self-supervised adversarial model, feature projector filters out the abundant high-level visual semantics and learns image embeddings that are better aligned with text modality in the common subspace. Through the joint exploitation of the above, an effective common subspace is learned, in which representations of different modlities are aligned better and common information of different modalities is well preserved. Comprehensive experimental results on three widely-used benchmark datasets show that the proposed method is superior in cross-modal retrieval and significantly outperforms the existing cross-modal retrieval methods.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129551921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The bone development is a continuous process, however, discrete labels are usually used to represent bone ages. This inevitably causes a semantic gap between actual situation and label representation scope. In this paper, we present a novel method named as overlap classification network to narrow the semantic gap in bone age assessment. In the proposed network, discrete bone age labels (such as 0-228 month) are considered as a sequence that is used to generate a series of subsequences. Then the proposed network makes use of the overlapping information between adjacent subsequences and output several bone age ranges at the same time for one case. The overlapping part of these age ranges is considered as the final predicted bone age. The proposed method without any preprocessing can achieve a much smaller mean absolute error compared with state-of-the-art methods on a public dataset.
{"title":"Overlap classification mechanism for skeletal bone age assessment","authors":"Pengyi Hao, Xuhang Xie, Tianxing Han, Cong Bai","doi":"10.1145/3444685.3446286","DOIUrl":"https://doi.org/10.1145/3444685.3446286","url":null,"abstract":"The bone development is a continuous process, however, discrete labels are usually used to represent bone ages. This inevitably causes a semantic gap between actual situation and label representation scope. In this paper, we present a novel method named as overlap classification network to narrow the semantic gap in bone age assessment. In the proposed network, discrete bone age labels (such as 0-228 month) are considered as a sequence that is used to generate a series of subsequences. Then the proposed network makes use of the overlapping information between adjacent subsequences and output several bone age ranges at the same time for one case. The overlapping part of these age ranges is considered as the final predicted bone age. The proposed method without any preprocessing can achieve a much smaller mean absolute error compared with state-of-the-art methods on a public dataset.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125905000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}