Interactive re-ranking for cross-modal retrieval based on object-wise question answering
Rintaro Yanagi, Ren Togo, Takahiro Ogawa, M. Haseyama
DOI: 10.1145/3444685.3446290
Cross-modal retrieval methods retrieve desired images from a query text by learning relationships between texts and images. This retrieval approach is attractive because of the ease of query preparation. Recent cross-modal retrieval is convenient and accurate when users input a query text that uniquely identifies the desired image. However, users frequently input ambiguous query texts, and these ambiguous queries make it difficult to obtain the desired images. To alleviate these difficulties, in this paper we propose a novel interactive cross-modal retrieval method based on question answering (QA) with users. The proposed method analyses candidate images and asks users about information that can effectively narrow down the retrieval candidates. By simply answering the questions generated by the proposed method, users can reach their desired images even from an ambiguous query text. Experimental results show the effectiveness of the proposed method.
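As a rough illustration of the question-answering loop described above, the following sketch retrieves candidates for a query text and repeatedly filters them using a generated question and the user's answer; the helpers encode_text, encode_image, generate_question, vqa_answer and ask_user are hypothetical placeholders, not the authors' components.

import numpy as np

def interactive_retrieval(query_text, images, encode_text, encode_image,
                          generate_question, vqa_answer, ask_user, rounds=3):
    q = encode_text(query_text)                             # (d,) text embedding
    feats = np.stack([encode_image(im) for im in images])   # (N, d) image embeddings
    scores = feats @ q / (np.linalg.norm(feats, axis=1) * np.linalg.norm(q) + 1e-8)
    candidates = list(np.argsort(-scores))                  # initial ranked candidates

    for _ in range(rounds):
        if len(candidates) <= 1:
            break
        # Ask about information that can split the remaining candidates.
        question = generate_question([images[i] for i in candidates])
        user_answer = ask_user(question)
        # Keep only candidates whose predicted answer agrees with the user's answer.
        candidates = [i for i in candidates
                      if vqa_answer(images[i], question) == user_answer]
    return candidates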
Unsupervised learning of co-occurrences for face images retrieval
Thomas Petit, Pierre Letessier, S. Duffner, Christophe Garcia
DOI: 10.1145/3444685.3446265
Despite a huge leap in the performance of face recognition systems in recent years, some cases remain challenging for them while being trivial for humans. This is because the human brain exploits much more information than facial appearance to identify a person. In this work, we aim to capture the social context of unlabeled observed faces in order to improve face retrieval. In particular, we propose a framework that substantially improves face retrieval by exploiting the faces occurring simultaneously in a query's context to infer a multi-dimensional social context descriptor. Combining this compact structural descriptor with the individual visual face features in a common feature vector considerably increases the correct face retrieval rate and makes it possible to disambiguate a large proportion of query results for different persons that are barely distinguishable visually. To evaluate our framework, we also introduce a new large dataset of faces of French TV personalities, organised by TV show in order to capture the co-occurrence relations between people. On this dataset, our framework improves the mean Average Precision over a set of internal queries from 67.93% (using only facial features extracted with a state-of-the-art pre-trained model) to 78.16% (using both facial features and face co-occurrences), and from 67.88% to 77.36% over a set of external queries.
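The following sketch illustrates the general idea of combining a visual face descriptor with a co-occurrence-based social-context descriptor into a common feature vector for retrieval; the weighting and normalisation are assumptions, not the paper's exact fusion.

import numpy as np

def fuse(face_feat, context_desc, alpha=0.5):
    """Concatenate L2-normalised visual and social-context descriptors."""
    f = face_feat / (np.linalg.norm(face_feat) + 1e-8)
    c = context_desc / (np.linalg.norm(context_desc) + 1e-8)
    return np.concatenate([alpha * f, (1 - alpha) * c])

def retrieve(query_vec, gallery_vecs, top_k=10):
    """Rank gallery vectors by cosine similarity to the fused query vector."""
    sims = gallery_vecs @ query_vec / (
        np.linalg.norm(gallery_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    return np.argsort(-sims)[:top_k]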
Patch assembly for real-time instance segmentation
Yutao Xu, Hanli Wang, Jian Zhu
DOI: 10.1145/3444685.3446281
The sliding-window paradigm has proven effective for visual instance segmentation in many popular research works. However, it still suffers from a bottleneck in inference time. To accelerate existing dense sliding-window instance segmentation approaches, this work introduces a novel approach, called patch assembly, which can be integrated into bounding-box detectors for segmentation without extra up-sampling computations. A well-designed detector named PAMask is proposed to verify the effectiveness of the proposed approach. Benefiting from its simple structure as well as a fusion of multiple representations, PAMask can run in real time while achieving competitive performance. Besides, another effective technique called Center-NMS is designed to reduce the number of boxes involved in intersection-over-union calculation; it can be fully parallelized on device and contributes a 0.6% mAP improvement in both detection and segmentation for free.
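The abstract does not detail Center-NMS, so the sketch below shows only one plausible reading: a cheap center-distance pre-filter that reduces how many box pairs reach the full IoU test. The paper's actual algorithm may differ.

import numpy as np

def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def center_prefilter_nms(boxes, scores, center_thresh=8.0, iou_thresh=0.5):
    """boxes: (N, 4); scores: (N,). Returns indices of kept boxes."""
    order = np.argsort(-scores)
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    keep = []
    for i in order:
        # Cheap test first: drop boxes whose center is very close to a kept box.
        if any(np.linalg.norm(centers[i] - centers[j]) < center_thresh for j in keep):
            continue
        # Full IoU test only for the remaining candidates.
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep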
Graph-based variational auto-encoder for generalized zero-shot learning
Jiwei Wei, Yang Yang, Xing Xu, Yanli Ji, Xiaofeng Zhu, Heng Tao Shen
DOI: 10.1145/3444685.3446283
Zero-shot learning has been a highlighted research topic in both the vision and language areas. Recently, generative methods have emerged as a new trend in zero-shot learning; they synthesize samples of unseen categories via generative models. However, the lack of fine-grained information in the synthesized samples makes it difficult to improve classification accuracy. It is also time-consuming and inefficient to synthesize samples and use them to train classifiers. To address these issues, we propose a novel Graph-based Variational Auto-Encoder for zero-shot learning. Specifically, we adopt a knowledge graph to model the explicit inter-class relationships, and design a full graph convolution auto-encoder framework to generate the classifier from the distribution of the class-level semantic features on individual nodes. The encoder learns the latent representations of individual nodes, and the decoder generates the classifiers from these latent representations. In contrast to synthesizing samples, our proposed method directly generates classifiers from the distribution of the class-level semantic features for both seen and unseen categories, which is more straightforward, accurate and computationally efficient. We conduct extensive experiments and evaluate our method on the widely used large-scale ImageNet-21K dataset. Experimental results validate the efficacy of the proposed approach.
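The following sketch (PyTorch) illustrates how a graph convolutional encoder-decoder over the class graph can map class-level semantic features to classifier weights, with classification reduced to a dot product; layer sizes are assumptions and the variational sampling is omitted for brevity.

import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph convolution: aggregate over the normalised adjacency, then project."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, a_hat):          # x: (C, in_dim), a_hat: (C, C) normalised adjacency
        return self.lin(a_hat @ x)

class GraphClassifierGenerator(nn.Module):
    """Encoder-decoder over the class knowledge graph; outputs per-class classifier weights."""
    def __init__(self, sem_dim=300, latent_dim=128, feat_dim=2048):
        super().__init__()
        self.enc = GraphConv(sem_dim, latent_dim)
        self.dec = GraphConv(latent_dim, feat_dim)

    def forward(self, class_sem, a_hat):
        z = torch.relu(self.enc(class_sem, a_hat))   # latent representations of individual nodes
        return self.dec(z, a_hat)                    # (C, feat_dim) classifier weights

# Usage sketch: logits = image_feats @ model(class_sem, a_hat).T for seen and unseen classes.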
An automated method with anchor-free detection and U-shaped segmentation for nuclei instance segmentation
X. Feng, Lijuan Duan, Jie Chen
DOI: 10.1145/3444685.3446258
Nuclei segmentation plays an important role in cancer diagnosis. Automated methods for digital pathology have become popular due to developments in deep learning and neural networks. However, this task still faces challenges. Most current techniques cannot be applied directly because nuclei appear clustered and in large numbers in these images. Moreover, anchor-based methods for object detection lead to a huge amount of computation, which is even worse on pathological images with high target density. To address these issues, we propose a novel network with anchor-free detection and U-shaped segmentation. An altered feature enhancement module is attached to improve performance in dense target detection. Meanwhile, the U-shaped structure in the segmentation block ensures the aggregation of features of different dimensions generated by the backbone network. We evaluate our work on the Multi-Organ Nuclei Segmentation dataset from the MICCAI 2018 challenge. In comparison with other methods, our proposed method achieves state-of-the-art performance.
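The sketch below shows a minimal U-shaped segmentation head with a skip connection that aggregates features at two resolutions; channel sizes and depth are illustrative, not the paper's configuration.

import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, cin=3, base=16, num_classes=2):
        super().__init__()
        self.down1 = conv_block(cin, base)
        self.down2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.fuse = conv_block(base * 2 + base, base)
        self.head = nn.Conv2d(base, num_classes, 1)

    def forward(self, x):
        f1 = self.down1(x)                            # full-resolution features
        f2 = self.down2(self.pool(f1))                # half-resolution features
        up = self.up(f2)                              # back to full resolution
        out = self.fuse(torch.cat([up, f1], dim=1))   # skip connection aggregates both scales
        return self.head(out)                         # per-pixel class logits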
Integrating aspect-aware interactive attention and emotional position-aware for multi-aspect sentiment analysis
Xiaoye Wang, Xiaowen Zhou, Zan Gao, Peng Yang, Xianbin Wen, Hongyun Ning
DOI: 10.1145/3444685.3446315
Aspect-level sentiment analysis is a fine-grained sentiment analysis task that aims to infer the sentiment polarity associated with each aspect in an opinion sentence. Attention-based neural networks have proven effective in extracting aspect terms, but prior models are context-dependent. Moreover, prior works only attend to aspect terms to detect sentiment words and do not consider sentiment words that may be influenced by domain-specific knowledge. In this work, we propose a novel model integrating Aspect-aware Interactive Attention and an Emotional Position-aware module for multi-aspect sentiment analysis (abbreviated AIAEP), where the aspect-aware interactive attention is utilized to extract aspect terms; it fuses the domain-specific information of an aspect and its context and learns their relationship representations through global-context and local-context attention mechanisms. Specifically, syntactic parsing and the sentiment lexicon are used to add prior domain knowledge. We then propose a novel position-aware fusion scheme to compose aspect-sentiment pairs. It combines the absolute and relative distances between aspect terms and sentiment words, which improves the accuracy of polarity classification. Extensive experimental results on the SemEval-2014 Task 4 restaurant dataset and the AIChallenge2018 dataset demonstrate that AIAEP outperforms state-of-the-art approaches and is very effective for aspect-level sentiment analysis.
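The following sketch illustrates one way to combine absolute and relative distances between an aspect term and surrounding tokens into position weights; the exact weighting used in AIAEP is not given in the abstract, so this is only an assumption.

import numpy as np

def position_weights(num_tokens, aspect_start, aspect_end, beta=0.5):
    """Return a weight per token; tokens closer to the aspect get larger weights."""
    idx = np.arange(num_tokens)
    # Absolute distance in tokens to the nearest aspect token (0 inside the aspect span).
    abs_dist = np.where(idx < aspect_start, aspect_start - idx,
                        np.where(idx > aspect_end, idx - aspect_end, 0))
    # Relative distance normalised by sentence length.
    rel_dist = abs_dist / max(num_tokens - 1, 1)
    # beta balances the relative and (max-normalised) absolute distance terms.
    return 1.0 - (beta * rel_dist + (1 - beta) * abs_dist / max(abs_dist.max(), 1))

# Example: weights = position_weights(10, aspect_start=3, aspect_end=4)
# weighted_context = weights[:, None] * token_embeddings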
A multi-scale language embedding network for proposal-free referring expression comprehension
Taijin Zhao, Hongliang Li, Heqian Qiu, Q. Wu, K. Ngan
DOI: 10.1145/3444685.3446279
Referring expression comprehension (REC) is a task that aims to find the location of an object specified by a language expression. Current solutions for REC can be classified into proposal-based methods and proposal-free methods. Proposal-free methods have become popular recently because of their flexibility and light weight. Nevertheless, existing proposal-free works give little consideration to visual context. As REC is a context-sensitive task, it is hard for current proposal-free methods to comprehend expressions that describe objects by their position relative to surrounding things. In this paper, we propose a multi-scale language embedding network for REC. Our method adopts the proposal-free structure, which directly feeds fused visual-language features into a detection head to predict the bounding box of the target. In the fusion process, we propose a grid fusion module and a grid-context fusion module to compute the similarity between language features and visual features over regions of different sizes. Meanwhile, we additionally add fully interacted vision-language information and position information to strengthen the feature fusion. This novel fusion strategy helps utilize context flexibly, so the network can deal with varied expressions, especially expressions that describe objects by their surroundings. Our proposed method outperforms state-of-the-art methods on the RefCOCO, RefCOCO+ and RefCOCOg datasets.
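The sketch below (PyTorch) illustrates computing language-vision similarity over regions of several sizes by average-pooling the visual feature map before a cosine comparison; the pooling scales and the way the resulting maps are fused downstream are assumptions.

import torch
import torch.nn.functional as F

def multi_scale_similarity(feat_map, lang_feat, scales=(1, 3, 5)):
    """feat_map: (B, C, H, W); lang_feat: (B, C). Returns a list of (B, H, W) similarity maps."""
    lang = F.normalize(lang_feat, dim=1)[:, :, None, None]          # (B, C, 1, 1)
    sims = []
    for k in scales:
        # Average-pool over a k x k neighbourhood to get region-level visual features.
        pooled = F.avg_pool2d(feat_map, kernel_size=k, stride=1, padding=k // 2)
        pooled = F.normalize(pooled, dim=1)
        sims.append((pooled * lang).sum(dim=1))                      # cosine similarity per location
    return sims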
Incremental multi-view object detection from a moving camera
T. Konno, Ayako Amma, Asako Kanezaki
DOI: 10.1145/3444685.3446257
Object detection in a single image is a challenging problem due to clutter, occlusions, and a large variety of viewing locations. This task can benefit from integrating multi-frame information captured by a moving camera. In this paper, we propose a method to incrementally integrate object detection scores extracted from multiple frames captured from different viewpoints. For each frame, we run an efficient end-to-end object detector that outputs object bounding boxes, each of which is associated with scores for categories and poses. The scores of detected objects are then stored in grid locations in 3D space. After observing multiple frames, the object scores stored in each grid location are integrated based on the best object pose hypothesis. This strategy requires consistency of object categories and poses across multiple frames, and thus significantly suppresses misdetections. The performance of the proposed method is evaluated on our newly created multi-class object dataset captured in robot simulation and real environments, as well as on a public benchmark dataset.
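The following sketch illustrates accumulating per-frame category and pose scores in a 3D grid and selecting the best hypothesis per grid cell; the back-projection from detections to grid cells is abstracted away, and the integration rule is an assumption.

import numpy as np

class ScoreGrid:
    def __init__(self, grid_shape, num_classes, num_poses):
        # Accumulated scores per voxel, per category, per pose hypothesis.
        self.scores = np.zeros(tuple(grid_shape) + (num_classes, num_poses))
        self.counts = np.zeros(tuple(grid_shape))

    def add_detection(self, voxel, class_scores, pose_scores):
        """voxel: (i, j, k) cell hit by the back-projected detection."""
        self.scores[voxel] += np.outer(class_scores, pose_scores)
        self.counts[voxel] += 1

    def integrate(self, voxel):
        """Return (class_id, pose_id) of the best hypothesis for a voxel, or None."""
        if self.counts[voxel] == 0:
            return None
        avg = self.scores[voxel] / self.counts[voxel]
        return np.unravel_index(np.argmax(avg), avg.shape)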
Semantic feature augmentation for fine-grained visual categorization with few-sample training
Xiang Guan, Yang Yang, Zheng Wang, Jingjing Li
DOI: 10.1145/3444685.3446264
Small-data challenges have emerged in many learning problems, since the success of deep neural networks often relies on the availability of a huge amount of labeled data that is expensive to collect. We explore a highly challenging task, few-sample training, which uses a small number of labeled images of each category and corresponding textual descriptions to train a model for fine-grained visual categorization. In order to tackle the overfitting caused by small data, in this paper we propose two novel feature augmentation approaches, Semantic Gate Feature Augmentation (SGFA) and Semantic Boundary Feature Augmentation (SBFA). Instead of generating new image instances, we propose to directly synthesize instance features by leveraging semantic information. The main novelties are: (1) The SGFA method reduces the overfitting caused by small data by adding random noise to different regions of the image's feature maps through a gating mechanism. (2) The SBFA approach optimizes the decision boundary of the classifier. Technically, the decision boundary in the image feature space is estimated with the assistance of semantic information, and feature augmentation is then performed by sampling in this region. Experiments on fine-grained visual categorization benchmarks demonstrate that our proposed approach can significantly improve categorization performance.
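The sketch below illustrates SGFA-style gated noise injection on feature maps; conditioning the gate on a semantic (text) embedding is an assumption about how the gating mechanism is driven.

import torch
import torch.nn as nn

class GatedNoiseAugment(nn.Module):
    def __init__(self, channels, sem_dim, noise_std=0.1):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(sem_dim, channels), nn.Sigmoid())
        self.noise_std = noise_std

    def forward(self, feat_map, sem_feat):
        """feat_map: (B, C, H, W); sem_feat: (B, sem_dim) semantic embedding."""
        if not self.training:
            return feat_map                               # augmentation only during training
        g = self.gate(sem_feat)[:, :, None, None]         # per-channel gate in [0, 1]
        noise = torch.randn_like(feat_map) * self.noise_std
        return feat_map + g * noise                       # noise is injected only where the gate opens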
Hungry networks: 3D mesh reconstruction of a dish and a plate from a single dish image for estimating food volume
Shu Naritomi, Keiji Yanai
DOI: 10.1145/3444685.3446275
Dietary calorie management has been an important topic in recent years, and various methods and applications for image-based food calorie estimation have been published in the multimedia community. Most existing methods estimate food calorie amounts using 2D image recognition. In this paper, by contrast, we make inferences based on 3D volume for more accurate estimation. We perform 3D reconstruction of a dish (food and plate) and of the plate alone (without food) from a single image. We succeed in restoring the 3D shapes with high accuracy while maintaining consistency between the plate part of the estimated 3D dish and the estimated 3D plate. To achieve this, this paper makes the following contributions. (1) We propose "Hungry Networks," a new network that generates two kinds of 3D volumes from a single image. (2) We introduce a plate consistency loss that matches the shapes of the plate parts of the two reconstructed models. (3) We create a new dataset of 3D food models obtained by 3D scanning actual foods and plates. We also conduct an experiment to infer the volume of only the food region from the difference between the two reconstructed volumes. The results show that the introduced loss function not only matches the 3D shape of the plate but also contributes to obtaining the volume with higher accuracy. Although some existing studies consider the 3D shapes of foods, this is the first study to generate a 3D mesh volume from a single dish image.
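The following sketch illustrates the volume-difference idea: once both reconstructions are voxelised into occupancy grids (a preprocessing step assumed here, not described in the abstract), the food volume is the dish volume minus the plate volume.

import numpy as np

def food_volume(dish_occupancy, plate_occupancy, voxel_size):
    """Occupancy grids: boolean arrays of identical shape; voxel_size is the edge length in metres."""
    # Voxels occupied by the dish (food + plate) but not by the plate alone belong to the food.
    food_voxels = np.logical_and(dish_occupancy, np.logical_not(plate_occupancy))
    return food_voxels.sum() * voxel_size ** 3   # volume in cubic metres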