Bone development is a continuous process; however, discrete labels are usually used to represent bone ages. This inevitably causes a semantic gap between the actual situation and the scope a label can represent. In this paper, we present a novel method, named the overlap classification network, to narrow this semantic gap in bone age assessment. In the proposed network, the discrete bone age labels (such as 0-228 months) are treated as a sequence that is used to generate a series of subsequences. The network then exploits the overlapping information between adjacent subsequences and outputs several bone age ranges at the same time for one case. The overlapping part of these age ranges is taken as the final predicted bone age. Without any preprocessing, the proposed method achieves a much smaller mean absolute error than state-of-the-art methods on a public dataset.
{"title":"Overlap classification mechanism for skeletal bone age assessment","authors":"Pengyi Hao, Xuhang Xie, Tianxing Han, Cong Bai","doi":"10.1145/3444685.3446286","DOIUrl":"https://doi.org/10.1145/3444685.3446286","url":null,"abstract":"The bone development is a continuous process, however, discrete labels are usually used to represent bone ages. This inevitably causes a semantic gap between actual situation and label representation scope. In this paper, we present a novel method named as overlap classification network to narrow the semantic gap in bone age assessment. In the proposed network, discrete bone age labels (such as 0-228 month) are considered as a sequence that is used to generate a series of subsequences. Then the proposed network makes use of the overlapping information between adjacent subsequences and output several bone age ranges at the same time for one case. The overlapping part of these age ranges is considered as the final predicted bone age. The proposed method without any preprocessing can achieve a much smaller mean absolute error compared with state-of-the-art methods on a public dataset.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125905000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Convolutional neural network (CNN) based salient object detection (SOD) has developed rapidly in recent years. However, in some challenging cases, i.e. small-scale salient objects, low-contrast salient objects and cluttered backgrounds, existing salient object detection methods are still not satisfactory. To detect salient objects accurately, SOD networks need to locate the most salient part of an image. Fixation prediction (FP) focuses on the most visually attractive regions, so we believe it can assist in locating salient objects. To the best of our knowledge, few methods jointly consider the SOD and FP tasks. In this paper, we propose a fixation guided salient object detection network (FGNet) to leverage the correlation between SOD and FP. FGNet consists of two branches that deal with fixation prediction and salient object detection respectively. Further, an effective feature cooperation module (FCM) is proposed to fuse complementary information between the two branches. Extensive experiments on four popular datasets and comparisons with twelve state-of-the-art methods show that the proposed FGNet captures the main context of images well and locates salient objects more accurately.
{"title":"Fixation guided network for salient object detection","authors":"Zhe Cui, Li Su, Weigang Zhang, Qingming Huang","doi":"10.1145/3444685.3446288","DOIUrl":"https://doi.org/10.1145/3444685.3446288","url":null,"abstract":"Convolutional neural network (CNN) based salient object detection (SOD) has achieved great development in recent years. However, in some challenging cases, i.e. small-scale salient object, low contrast salient object and cluttered background, existing salient object detect methods are still not satisfying. In order to accurately detect salient objects, SOD networks need to fix the position of most salient part. Fixation prediction (FP) focuses on the most visual attractive regions, so we think it could assist in locating salient objects. As far as we know, there are few methods jointly consider SOD and FP tasks. In this paper, we propose a fixation guided salient object detection network (FGNet) to leverage the correlation between SOD and FP. FGNet consists of two branches to deal with fixation prediction and salient object detection respectively. Further, an effective feature cooperation module (FCM) is proposed to fuse complementary information between the two branches. Extensive experiments on four popular datasets and comparisons with twelve state-of-the-art methods show that the proposed FGNet well captures the main context of images and locates salient objects more accurately.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125992037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stylized text with decorative elements has strong visual appeal and enriches our daily work, study and life. However, it introduces new challenges to text detection and recognition. In this study, we propose a text destylization framework that can transform stylized texts with decorative elements into a form that is easily distinguishable by a detection or recognition model. We arrange and integrate an existing stylistic text dataset to train the destylization network. The new destylized dataset contains English letters and Chinese characters. The proposed approach enables a single framework to handle both Chinese characters and English letters without the need for additional networks. Experiments show that the method is superior to state-of-the-art style-related models.
{"title":"Destylization of text with decorative elements","authors":"Yuting Ma, Fan Tang, Weiming Dong, Changsheng Xu","doi":"10.1145/3444685.3446324","DOIUrl":"https://doi.org/10.1145/3444685.3446324","url":null,"abstract":"Style text with decorative elements has a strong visual sense, and enriches our daily work, study and life. However, it introduces new challenges to text detection and recognition. In this study, we propose a text destylized framework, that can transform the stylized texts with decorative elements into a type that is easily distinguishable by a detection or recognition model. We arranged and integrate an existing stylistic text data set to train the destylized network. The new destylized data set contains English letters and Chinese characters. The proposed approach enables a framework to handle both Chinese characters and English letters without the need for additional networks. Experiments show that the method is superior to the state-of-the-art style-related models.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134048240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The graph convolution network (GCN) is an important method recently developed for few-shot learning. The adjacency matrix in GCN models is constructed from graph node features to represent the relationships between graph nodes, according to which the graph network performs message-passing inference. Therefore, the representation ability of the graph node features is an important factor affecting the learning performance of a GCN. This paper proposes an improved GCN model with node feature optimization using cross attention, named GCN-NFO. Leveraging a cross attention mechanism to associate the image features of the support set and the query set, the proposed model extracts more representative and discriminative salient region features as the initialization features of the graph nodes through information aggregation. Since the graph network can represent the relationships between samples, the optimized graph node features propagate information through the graph network, implicitly enhancing the similarity of intra-class samples and the dissimilarity of inter-class samples, and thus the learning capability of the GCN. Extensive experimental results on image classification tasks using different image datasets show that GCN-NFO is an effective few-shot learning algorithm that significantly improves classification accuracy compared with other existing models.
{"title":"Graph convolution network with node feature optimization using cross attention for few-shot learning","authors":"Ying Liu, Yanbo Lei, Sheikh Faisal Rashid","doi":"10.1145/3444685.3446278","DOIUrl":"https://doi.org/10.1145/3444685.3446278","url":null,"abstract":"Graph convolution network (GCN) is an important method recently developed for few-shot learning. The adjacency matrix in GCN models is constructed based on graph node features to represent the graph node relationships, according to which, the graph network achieves message-passing inference. Therefore, the representation ability of graph node features is an important factor affecting the learning performance of GCN. This paper proposes an improved GCN model with node feature optimization using cross attention, named GCN-NFO. Leveraging on cross attention mechanism to associate the image features of support set and query set, the proposed model extracts more representative and discriminative salient region features as initialization features of graph nodes through information aggregation. Since graph network can represent the relationship between samples, the optimized graph node features transmit information through the graph network, thus implicitly enhances the similarity of intra-class samples and the dissimilarity of inter-class samples, thus enhancing the learning capability of GCN. Intensive experimental results on image classification task using different image datasets prove that GCN-NFO is an effective few-shot learning algorithm which significantly improves the classification accuracy, compared with other existing models.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131151647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yijun Liu, Zhengning Wang, Ruixu Geng, Hao Zeng, Yi Zeng
Low visibility and high-level noise are two challenges for low-light image enhancement. In this paper, by introducing the fractional order differential, we propose an end-to-end conditional generative adversarial network (GAN) to solve these two problems. For the problem of low visibility, we set up a global discriminator to improve the overall reconstruction quality and restore brightness information. For the high-level noise problem, we introduce fractional order differentiation into both the generator and the discriminator. Compared with conventional end-to-end methods, the fractional order can better distinguish noise from high-frequency details, thereby achieving superior noise reduction while preserving details. Finally, experimental results show that the proposed model obtains superior visual effects in low-light image enhancement. By introducing the fractional order differential, we anticipate that our framework will enable high-quality and detailed image recovery not only in low-light enhancement but also in other fields that require fine details.
{"title":"Structure-preserving extremely low light image enhancement with fractional order differential mask guidance","authors":"Yijun Liu, Zhengning Wang, Ruixu Geng, Hao Zeng, Yi Zeng","doi":"10.1145/3444685.3446319","DOIUrl":"https://doi.org/10.1145/3444685.3446319","url":null,"abstract":"Low visibility and high-level noise are two challenges for low-light image enhancement. In this paper, by introducing fractional order differential, we propose an end-to-end conditional generative adversarial network(GAN) to solve those two problems. For the problem of low visibility, we set up a global discriminator to improve the overall reconstruction quality and restore brightness information. For the high-level noise problem, we introduce fractional order differentiation into both the generator and the discriminator. Compared with conventional end-to-end methods, fractional order can better distinguish noise and high-frequency details, thereby achieving superior noise reduction effects while maintaining details. Finally, experimental results show that the proposed model obtains superior visual effects in low-light image enhancement. By introducing fractional order differential, we anticipate that our framework will enable high quality and detailed image recovery not only in the field of low-light enhancement but also in other fields that require details.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122486176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nuclei instance segmentation is essential for cell morphometrics and analysis, playing a crucial role in digital pathology. The variability of nuclei characteristics among diverse cell types makes this task more challenging. Recently, proposal-based segmentation methods with a feature pyramid network (FPN) have shown good performance because the FPN integrates multi-scale features with strong semantics. However, the FPN suffers from information loss in the highest-level feature map and sub-optimal feature fusion strategies. This paper proposes a proposal-based adaptive feature aggregation network (AANet) to make full use of multi-scale features. Specifically, AANet consists of two components: a Context Augmentation Module (CAM) and a Feature Adaptive Selection Module (ASM). In feature fusion, CAM focuses on exploring extensive contextual information and capturing discriminative semantics to reduce the information loss of the feature map at the highest pyramid level. The enhanced features are then sent to ASM to obtain a combined feature representation adaptively over all feature levels for each RoI. Experiments show our model's effectiveness on two publicly available datasets: the Kaggle 2018 Data Science Bowl dataset and the Multi-Organ nuclei segmentation dataset.
{"title":"Adaptive feature aggregation network for nuclei segmentation","authors":"Ruizhe Geng, Zhongyi Huang, Jie Chen","doi":"10.1145/3444685.3446271","DOIUrl":"https://doi.org/10.1145/3444685.3446271","url":null,"abstract":"Nuclei instance segmentation is essential for cell morphometrics and analysis, playing a crucial role in digital pathology. The problem of variability in nuclei characteristics among diverse cell types makes this task more challenging. Recently, proposal-based segmentation methods with feature pyramid network (FPN) has shown good performance because FPN integrates multi-scale features with strong semantics. However, FPN has information loss of the highest-level feature map and sub-optimal feature fusion strategies. This paper proposes a proposal-based adaptive feature aggregation methods (AANet) to make full use of multi-scale features. Specifically, AANet consists of two components: Context Augmentation Module (CAM) and Feature Adaptive Selection Module (ASM). In feature fusion, CAM focus on exploring extensive contextual information and capturing discriminative semantics to reduce the information loss of feature map at the highest pyramid level. The enhanced features are then sent to ASM to get a combined feature representation adaptively over all feature levels for each RoI. The experiments show our model's effectiveness on two publicly available datasets: the Kaggle 2018 Data Science Bowl dataset and the Multi-Organ nuclei segmentation dataset.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116565078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To make full use of the inherent correlation between facial regions and expressions, we propose an attention-constraint facial expression recognition method in which the prior correlation between facial regions and expressions is integrated into the attention weights to extract better representations. The proposed method mainly consists of four components: a feature extractor, a local self attention-constraint learner (LSACL), a global and local attention-constraint learner (GLACL) and a facial expression classifier. Specifically, the feature extractor extracts features from the overall facial image and its corresponding cropped facial regions. The extracted local features from the facial regions are then fed into the local self attention-constraint learner, where prior rank constraints summarized from facial domain knowledge are embedded into the self attention weights. Similarly, the rank correlation constraints between each facial region and a specified expression are further embedded into the global-to-local attention weights when the global feature and the local features from the local self attention-constraint learner are fed into the global and local attention-constraint learner. Finally, the feature from the global and local attention-constraint learner and the original global feature are fused and passed to the facial expression classifier to perform facial expression recognition. Experiments on two benchmark datasets validate the effectiveness of the proposed method.
{"title":"Attention-constraint facial expression recognition","authors":"Qisheng Jiang","doi":"10.1145/3444685.3446307","DOIUrl":"https://doi.org/10.1145/3444685.3446307","url":null,"abstract":"To make full use of existing inherent correlation between facial regions and expression, we propose an attention-constraint facial expression recognition method, where the prior correlation between facial regions and expression is integrated into attention weights for extracting better representation. The proposed method mainly consists of four components: feature extractor, local self attention-constraint learner (LSACL), global and local attention-constraint learner (GLACL) and facial expression classifier. Specifically, feature extractor is mainly used to extract features from overall facial image and its corresponding cropped facial regions. Then, the extracted local features from facial regions are fed into local self attention-constraint learner, where some prior rank constraints summarized from facial domain knowledge are embedded into self attention weights. Similarly, the rank correlation constraints between respective facial region and a specified expression are further embedded into global-to-local attention weights when the global feature and local features from local self attention-constraint learner are fed into global and local attention-constraint learner. Finally, the feature from global and local attention-constraint learner and original global feature are fused and passed to facial expression classifier for conducting facial expression recognition. Experiments on two benchmark datasets validate the effectiveness of the proposed method.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132444820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-lingual image captioning, with its ability to caption an unlabeled image in a target language other than English, is an emerging topic in the multimedia field. To spare the precious human effort of re-writing reference sentences for each target language, in this paper we make an attempt towards annotation-free evaluation of cross-lingual image captioning. Depending on whether we assume the availability of English references, two scenarios are investigated. For the first scenario, with references available, we propose two metrics, i.e., WMDRel and CLinRel. WMDRel measures the semantic relevance between a model-generated caption and a machine translation of an English reference using their Word Mover's Distance. By projecting both captions into a deep visual feature space, CLinRel is a visually oriented cross-lingual relevance measure. For the second scenario, which has zero references and is thus more challenging, we propose CMedRel to compute a cross-media relevance between the generated caption and the image content, in the same visual feature space as used by CLinRel. We have conducted a number of experiments to evaluate the effectiveness of the three proposed metrics. The combination of WMDRel, CLinRel and CMedRel has a Spearman's rank correlation of 0.952 with the sum of BLEU-4, METEOR, ROUGE-L and CIDEr, four standard metrics computed using references in the target language. CMedRel alone has a Spearman's rank correlation of 0.786 with the standard metrics. These promising results show the high potential of the new metrics for evaluation without references in the target language.
{"title":"Towards annotation-free evaluation of cross-lingual image captioning","authors":"Aozhu Chen, Xinyi Huang, Hailan Lin, Xirong Li","doi":"10.1145/3444685.3446322","DOIUrl":"https://doi.org/10.1145/3444685.3446322","url":null,"abstract":"Cross-lingual image captioning, with its ability to caption an unlabeled image in a target language other than English, is an emerging topic in the multimedia field. In order to save the precious human resource from re-writing reference sentences per target language, in this paper we make a brave attempt towards annotation-free evaluation of cross-lingual image captioning. Depending on whether we assume the availability of English references, two scenarios are investigated. For the first scenario with the references available, we propose two metrics, i.e., WMDRel and CLinRel. WMDRel measures the semantic relevance between a model-generated caption and machine translation of an English reference using their Word Mover's Distance. By projecting both captions into a deep visual feature space, CLinRel is a visual-oriented cross-lingual relevance measure. As for the second scenario, which has zero reference and is thus more challenging, we propose CMedRel to compute a cross-media relevance between the generated caption and the image content, in the same visual feature space as used by CLinRel. We have conducted a number of experiments to evaluate the effectiveness of the three proposed metrics. The combination of WMDRel, CLinRel and CMedRel has a Spearman's rank correlation of 0.952 with the sum of BLEU-4, METEOR, ROUGE-L and CIDEr, four standard metrics computed using references in the target language. CMedRel alone has a Spearman's rank correlation of 0.786 with the standard metrics. The promising results show high potential of the new metrics for evaluation with no need of references in the target language.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114484934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Novelty detection is an important research area that mainly addresses the problem of classifying inliers, which usually consist of normal samples, and outliers, which are composed of abnormal samples. The auto-encoder is often used for novelty detection. However, the generalization ability of the auto-encoder may cause undesirable reconstruction of abnormal elements and reduce the discriminative ability of the model. To solve this problem, we focus on better reconstructing the normal samples while retaining their unique information to improve the performance of the auto-encoder for novelty detection. First, we introduce an attention mechanism into the task. Under the action of the attention mechanism, the auto-encoder can pay more attention to the representation of inlier samples through adversarial training. Second, we apply information entropy to the latent layer to make it sparse and to constrain the expression of diversity. Experimental results on three public datasets show that the proposed method achieves comparable performance to previous popular approaches.
{"title":"Improving auto-encoder novelty detection using channel attention and entropy minimization","authors":"Dongyan Guo, Miao Tian, Ying Cui, Xiang Pan, Shengyong Chen","doi":"10.1145/3444685.3446311","DOIUrl":"https://doi.org/10.1145/3444685.3446311","url":null,"abstract":"Novelty detection is a important research area which mainly solves the classification problem of inliers which usually consists of normal samples and outliers composed of abnormal samples. Auto-encoder is often used for novelty detection. However, the generalization ability of the auto-encoder may cause the undesirable reconstruction of abnormal elements and reduce the identification ability of the model. To solve the problem, we focus on the perspective of better reconstructing the normal samples as well as retaining the unique information of normal samples to improve the performance of auto-encoder for novelty detection. Firstly, we introduce attention mechanism into the task. Under the action of attention mechanism, auto-encoder can pay more attention to the representation of inlier samples through adversarial training. Secondly, we apply the information entropy into the latent layer to make it sparse and constrain the expression of diversity. Experimental results on three public datasets show that the proposed method achieves comparable performance compared with previous popular approaches.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115694903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shagun Uppal, Anish Madan, Sarthak Bhagat, Yi Yu, R. Shah
Visual Question Generation (VQG) is the task of generating natural questions based on an image. Popular methods in the past have explored image-to-sequence architectures trained with maximum likelihood, which generate meaningful questions given an image and its associated ground-truth answer. VQG becomes more challenging if the image contains rich contextual information describing its different semantic categories. In this paper, we exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers. Our approach addresses two major shortcomings of existing VQG systems: (i) it minimizes the level of supervision and (ii) it replaces generic questions with category-relevant generations. Most importantly, by eliminating expensive answer annotations, the required supervision is weakened. Using different categories enables us to exploit different concepts, as the inference requires only the image and the category. Mutual information is maximized between the image, the question, and the answer category in the latent space of our VAE. A novel category-consistent cyclic loss is proposed to enable the model to generate consistent predictions with respect to the answer category, reducing redundancies and irregularities. Additionally, we impose supplementary constraints on the latent space of our generative model to provide structure based on categories and to enhance generalization by encapsulating decorrelated features within each dimension. Through extensive experiments, the proposed model, C3VQG, outperforms state-of-the-art VQG methods with weak supervision.
{"title":"C3VQG: category consistent cyclic visual question generation","authors":"Shagun Uppal, Anish Madan, Sarthak Bhagat, Yi Yu, R. Shah","doi":"10.1145/3444685.3446302","DOIUrl":"https://doi.org/10.1145/3444685.3446302","url":null,"abstract":"Visual Question Generation (VQG) is the task of generating natural questions based on an image. Popular methods in the past have explored image-to-sequence architectures trained with maximum likelihood which have demonstrated meaningful generated questions given an image and its associated ground-truth answer. VQG becomes more challenging if the image contains rich contextual information describing its different semantic categories. In this paper, we try to exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers. Our approach solves two major shortcomings of existing VQG systems: (i) minimize the level of supervision and (ii) replace generic questions with category relevant generations. Most importantly, by eliminating expensive answer annotations, the required supervision is weakened. Using different categories enables us to exploit different concepts as the inference requires only the image and the category. Mutual information is maximized between the image, question, and answer category in the latent space of our VAE. A novel category consistent cyclic loss is proposed to enable the model to generate consistent predictions with respect to the answer category, reducing redundancies and irregularities. Additionally, we also impose supplementary constraints on the latent space of our generative model to provide structure based on categories and enhance generalization by encapsulating decorrelated features within each dimension. Through extensive experiments, the proposed model, C3VQG outperforms state-of-the-art VQG methods with weak supervision.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129302118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}