Multi-modal learning for gesture recognition
Pub Date: 2015-08-06 | DOI: 10.1109/ICME.2015.7177460
Congqi Cao, Yifan Zhang, Hanqing Lu
With the development of sensing equipment, data from multiple modalities are available for gesture recognition. In this paper, we propose a novel multi-modal learning framework. A coupled hidden Markov model (CHMM) is employed to discover the correlation and complementary information across different modalities. The framework supports two configurations: multi-modal learning with multi-modal testing, where all the modalities used during learning are still available during testing; and multi-modal learning with single-modal testing, where only one modality is available during testing. Experiments on two real-world gesture recognition data sets demonstrate the effectiveness of our multi-modal learning framework, with improvements observed in both multi-modal and single-modal testing.
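The coupled structure can be illustrated with a small sketch. Below is a minimal, assumption-level example (toy state and observation sizes, discrete emissions, randomly initialized parameters; not the authors' implementation) of forward inference in a two-chain CHMM, where each chain's transition depends on the previous hidden states of both modalities:

```python
# Minimal sketch of forward inference in a two-chain coupled HMM (CHMM).
import numpy as np

rng = np.random.default_rng(0)
S, V = 3, 4   # hidden states per chain, observation symbols per modality (toy sizes)

pi = [np.full(S, 1.0 / S) for _ in range(2)]                    # initial state distributions
A = [rng.dirichlet(np.ones(S), size=(S, S)) for _ in range(2)]  # A[c][p, q, i] = P(s_t^c=i | s_{t-1}^0=p, s_{t-1}^1=q)
B = [rng.dirichlet(np.ones(V), size=S) for _ in range(2)]       # B[c][i, o] = P(o | s^c=i)

def chmm_forward(obs0, obs1):
    """Scaled forward pass over the joint state space of the two coupled chains."""
    alpha = np.outer(pi[0] * B[0][:, obs0[0]], pi[1] * B[1][:, obs1[0]])
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, len(obs0)):
        new = np.zeros((S, S))
        for i in range(S):
            for j in range(S):
                # both chains' transitions condition on the previous states of BOTH chains
                new[i, j] = (alpha * A[0][:, :, i] * A[1][:, :, j]).sum()
        new *= np.outer(B[0][:, obs0[t]], B[1][:, obs1[t]])
        loglik += np.log(new.sum())
        alpha = new / new.sum()
    return loglik

# One CHMM would be trained per gesture class; recognition picks the class whose
# model assigns the highest log-likelihood to the synchronized test streams.
print(chmm_forward([0, 1, 2, 3], [1, 1, 0, 2]))
```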
{"title":"Multi-modal learning for gesture recognition","authors":"Congqi Cao, Yifan Zhang, Hanqing Lu","doi":"10.1109/ICME.2015.7177460","DOIUrl":"https://doi.org/10.1109/ICME.2015.7177460","url":null,"abstract":"With the development of sensing equipments, data from different modalities is available for gesture recognition. In this paper, we propose a novel multi-modal learning framework. A coupled hidden Markov model (CHMM) is employed to discover the correlation and complementary information across different modalities. In this framework, we use two configurations: one is multi-modal learning and multi-modal testing, where all the modalities used during learning are still available during testing; the other is multi-modal learning and single-modal testing, where only one modality is available during testing. Experiments on two real-world gesture recognition data sets have demonstrated the effectiveness of our multi-modal learning framework. Improvements on both of the multi-modal and single-modal testing have been observed.","PeriodicalId":146271,"journal":{"name":"2015 IEEE International Conference on Multimedia and Expo (ICME)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124407098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VTouch: Vision-enhanced interaction for large touch displays
Pub Date: 2015-08-06 | DOI: 10.1109/ICME.2015.7177390
Yinpeng Chen, Zicheng Liu, P. Chou, Zhengyou Zhang
We propose a system that augments touch input with visual understanding of the user to improve interaction with a large touch-sensitive display. A commodity color-plus-depth sensor such as the Microsoft Kinect adds the visual modality and enables new interactions beyond touch. Through visual analysis, the system understands where the user is, who the user is, and what the user is doing even before the user touches the display. This information is used to enhance interaction in multiple ways. For example, a user can use simple gestures to bring up menu items such as a color palette and a soft keyboard; menu items can be shown where the user is and can follow the user; hovering can show information to the user before the user commits to a touch; the user can perform different functions (for example, writing and erasing) with different hands; and the user's preference profile can be maintained, distinct from other users. User studies show that users greatly appreciate the value of these and other enhanced interactions.
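As a rough illustration of one of these ideas, the toy sketch below (hypothetical function names and coordinate conventions, not the authors' system) assigns a touch event to the nearest vision-tracked hand so that each hand can trigger a different function:

```python
# Toy sketch: attribute a touch to the nearest tracked hand, assuming hand
# positions have already been calibrated into the display's 2D pixel plane.
import math

HAND_FUNCTIONS = {"right": "write", "left": "erase"}   # assumed per-user mapping

def classify_touch(touch_xy, left_hand_xy, right_hand_xy, max_dist=150.0):
    """Return the function of the hand closest to the touch point (pixels)."""
    d_left = math.dist(touch_xy, left_hand_xy)
    d_right = math.dist(touch_xy, right_hand_xy)
    if min(d_left, d_right) > max_dist:
        return None                     # touch cannot be confidently attributed
    return HAND_FUNCTIONS["left" if d_left < d_right else "right"]

print(classify_touch((400, 300), (700, 320), (390, 310)))   # -> 'write'
```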
{"title":"VTouch: Vision-enhanced interaction for large touch displays","authors":"Yinpeng Chen, Zicheng Liu, P. Chou, Zhengyou Zhang","doi":"10.1109/ICME.2015.7177390","DOIUrl":"https://doi.org/10.1109/ICME.2015.7177390","url":null,"abstract":"We propose a system that augments touch input with visual understanding of the user to improve interaction with a large touch-sensitive display. A commodity color plus depth sensor such as Microsoft Kinect adds the visual modality and enables new interactions beyond touch. Through visual analysis, the system understands where the user is, who the user is, and what the user is doing even before the user touches the display. Such information is used to enhance interaction in multiple ways. For example, a user can use simple gestures to bring up menu items such as color palette and soft keyboard; menu items can be shown where the user is and can follow the user; hovering can show information to the user before the user commits to touch; the user can perform different functions (for example writing and erasing) with different hands; and the user's preference profile can be maintained, distinct from other users. User studies are conducted and the users very much appreciate the value of these and other enhanced interactions.","PeriodicalId":146271,"journal":{"name":"2015 IEEE International Conference on Multimedia and Expo (ICME)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114661975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed cooperative video coding for wireless video broadcast system
Pub Date: 2015-08-06 | DOI: 10.1109/ICME.2015.7177521
Mengyao Sun, Yumei Wang, Hao Yu, Yu Liu
In wireless video broadcast systems, analog joint source-channel coding (JSCC) has shown an advantage over conventional separate digital source/channel coding in that it gracefully avoids the cliff effect. Moreover, analog JSCC requires little computation at the encoder and adapts well to varying channel conditions, which makes it very suitable for wireless cooperative scenarios. In this paper, we therefore propose a distributed cooperative video coding (DCVC) scheme for wireless video broadcast systems. The scheme is based on the transmission structure of Softcast and borrows the basic idea of distributed video coding. Unlike previous cooperative video delivery methods, DCVC uses analog coding and coset coding to avoid the cliff effect and to make the best use of transmission power. The experimental results show that DCVC outperforms the conventional WSVC and H.264/SVC cooperative schemes, especially when the cooperative channel is worse than the original source-terminal channel.
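To make the transmission structure concrete, here is a minimal Softcast-style sketch (an assumption-level illustration of the analog framework DCVC builds on, not the paper's codec). The frame is decorrelated with a 2D DCT, coefficients are grouped into chunks, and each chunk is scaled so that power is allocated in proportion to the inverse fourth root of its variance; the receiver inverts the scaling, so quality degrades gracefully with channel noise instead of exhibiting a cliff effect:

```python
import numpy as np
from scipy.fft import dctn, idctn

def softcast_encode(frame, n_chunks=64, total_power=1.0):
    coeffs = dctn(frame.astype(np.float64), norm="ortho")          # 2D DCT decorrelation
    chunks = np.array_split(coeffs.ravel(), n_chunks)
    lam = np.array([np.mean(c ** 2) + 1e-12 for c in chunks])      # per-chunk variance
    g = lam ** -0.25                                                # Softcast power allocation
    g *= np.sqrt(total_power / np.mean(g ** 2 * lam))               # meet the average power budget
    return [gi * c for gi, c in zip(g, chunks)], g, coeffs.shape

def softcast_decode(received, g, shape):
    coeffs = np.concatenate([r / gi for r, gi in zip(received, g)])  # simple inversion (LLSE omitted)
    return idctn(coeffs.reshape(shape), norm="ortho")

frame = np.random.rand(64, 64)
tx, g, shape = softcast_encode(frame)
rx = [c + np.random.normal(0, 0.01, c.shape) for c in tx]            # AWGN channel
print(np.mean((softcast_decode(rx, g, shape) - frame) ** 2))          # distortion grows smoothly with noise
```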
{"title":"Distributed cooperative video coding for wireless video broadcast system","authors":"Mengyao Sun, Yumei Wang, Hao Yu, Yu Liu","doi":"10.1109/ICME.2015.7177521","DOIUrl":"https://doi.org/10.1109/ICME.2015.7177521","url":null,"abstract":"In wireless video broadcast system, analog joint source-channel coding (JSCC) has shown advantage compared to conventional separate digital source/channel coding in the aspect that it can avoid cliff effect gracefully. What's more, analog JSCC only needs a little calculations at the encoder and has strong adaptability to different channel condition, which is very suitable to the wireless cooperative scenario. Thus in this paper, we propose a distributed cooperative video coding (DCVC) scheme for wireless video broadcast system. The scheme is based on the transmission structure of Softcast and borrows the basic idea of distributed video coding. Different from the former cooperative video delivery methods, DCVC utilizes analog coding and coset coding to avoid cliff effect and to make the best of transmission power. The experimental results show that DCVC outperforms the conventional WSVC and H.264/SVC cooperative schemes, especially when the cooperative channel is worse than the original source-terminal channel.","PeriodicalId":146271,"journal":{"name":"2015 IEEE International Conference on Multimedia and Expo (ICME)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126194355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Flickr circles: Mining socially-aware aesthetic tendency
Pub Date: 2015-08-06 | DOI: 10.1109/ICME.2015.7177384
Luming Zhang, Roger Zimmermann
Aesthetic tendency discovery is a useful and interesting application in social media. This paper proposes to categorize large-scale Flickr users into multiple circles, each containing users with similar aesthetic interests (e.g., landscapes or abstract paintings). We observe that: (1) an aesthetic model should be flexible, as different visual features may be needed to describe different image sets; and (2) the number of photos varies significantly across users, and some users have very few photos. Therefore, a regularized topic model is proposed to quantify each user's aesthetic interest as a distribution in the latent space. A graph is then built to describe the similarity of aesthetic interests among users, in which densely connected users share similar aesthetic interests. An efficient dense subgraph mining algorithm is thus adopted to group users into different circles. Experiments show that our approach accurately detects circles on an image set crawled from over 60,000 Flickr users.
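A rough sketch of the graph-based grouping step is given below (assumed details: cosine similarity between topic distributions and the classic greedy peeling approximation for the densest subgraph stand in for the paper's specific similarity measure and mining algorithm):

```python
import numpy as np

def similarity_graph(topic_dists, threshold=0.8):
    """Edge (i, j) if cosine similarity of users' topic distributions exceeds threshold."""
    X = topic_dists / np.linalg.norm(topic_dists, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, 0.0)
    return (sim > threshold).astype(int)

def densest_subgraph(adj):
    """Greedy peeling: repeatedly drop the minimum-degree node, keep the densest subset seen."""
    work = set(range(len(adj)))
    best, best_density = set(work), adj.sum() / (2 * max(len(work), 1))
    while len(work) > 1:
        deg = {v: sum(adj[v][u] for u in work if u != v) for v in work}
        work.remove(min(deg, key=deg.get))
        edges = sum(adj[v][u] for v in work for u in work if u < v)
        density = edges / len(work)
        if density > best_density:
            best, best_density = set(work), density
    return best

users = np.random.dirichlet(np.ones(10), size=30)   # 30 users, 10 aesthetic topics
print(densest_subgraph(similarity_graph(users)))     # one candidate "circle"
```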
{"title":"Flickr circles: Mining socially-aware aesthetic tendency","authors":"Luming Zhang, Roger Zimmermann","doi":"10.1109/ICME.2015.7177384","DOIUrl":"https://doi.org/10.1109/ICME.2015.7177384","url":null,"abstract":"Aesthetic tendency discovery is a useful and interesting application in social media. This paper proposes to categorize large-scale Flickr users into multiple circles. Each circle contains users with similar aesthetic interests (e.g., landscapes or abstract paintings). We notice that: (1) an aesthetic model should be flexible as different visual features may be used to describe different image sets, and (2) the numbers of photos from different users varies significantly and some users have very few photos. Therefore, a regularized topic model is proposed to quantify user's aesthetic interest as a distribution in the latent space. Then, a graph is built to describe the similarity of aesthetic interests among users. Obviously, densely connected users are with similar aesthetic interests. Thus an efficient dense subgraph mining algorithm is adopted to group users into different circles. Experiments show that our approach accurately detects circles on an image set crawled from over 60,000 Flickr users.","PeriodicalId":146271,"journal":{"name":"2015 IEEE International Conference on Multimedia and Expo (ICME)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128686483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Structure-preserving Image Quality Assessment
Pub Date: 2015-08-06 | DOI: 10.1109/ICME.2015.7177436
Yilin Wang, Qiang Zhang, Baoxin Li
Perceptual Image Quality Assessment (IQA) has many applications. Existing IQA approaches typically work in only one of three scenarios: full-reference, no-reference, or reduced-reference. Techniques that attempt to incorporate image structure information often rely on hand-crafted features, making them difficult to extend to different scenarios. On the other hand, objective metrics such as Mean Square Error (MSE), while easy to compute, are often deemed ineffective for measuring perceptual quality. This paper presents a novel approach to perceptual quality assessment based on an MSE-like metric, which retains the benefits of MSE in terms of inexpensive computation and universal applicability while allowing the structural information of an image to be taken into consideration. The latter is achieved by introducing structure-preserving kernelization into an MSE-like formulation. We show that the method leads to competitive FR-IQA results. Further, by developing a feature coding scheme based on this formulation, we extend the model to improve the performance of NR-IQA methods. We report extensive experiments illustrating the results of both our FR-IQA and NR-IQA algorithms in comparison with existing state-of-the-art methods.
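The general form of such a kernelized, MSE-like distance can be sketched as follows (a hedged illustration only; the Gaussian patch kernel used here is an assumption and is not the paper's structure-preserving kernel):

```python
# The squared distance between images becomes k(x,x) + k(y,y) - 2*k(x,y) in a
# kernel-induced feature space, evaluated here per aligned local patch.
import numpy as np

def patches(img, size=8):
    h, w = img.shape
    return [img[i:i + size, j:j + size].ravel()
            for i in range(0, h - size + 1, size)
            for j in range(0, w - size + 1, size)]

def gaussian_k(a, b, gamma=1e-3):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_mse(ref, dist, size=8):
    """Average kernel-space squared distance over aligned patches (lower = more similar)."""
    d = [gaussian_k(p, p) + gaussian_k(q, q) - 2 * gaussian_k(p, q)
         for p, q in zip(patches(ref, size), patches(dist, size))]
    return float(np.mean(d))

ref = np.random.rand(64, 64)
noisy = ref + 0.05 * np.random.randn(64, 64)
print(kernel_mse(ref, ref), kernel_mse(ref, noisy))   # 0.0 vs. a positive value
```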
{"title":"Structure-preserving Image Quality Assessment","authors":"Yilin Wang, Qiang Zhang, Baoxin Li","doi":"10.1109/ICME.2015.7177436","DOIUrl":"https://doi.org/10.1109/ICME.2015.7177436","url":null,"abstract":"Perceptual Image Quality Assessment (IQA) has many applications. Existing IQA approaches typically work only for one of three scenarios: full-reference, non-reference, or reduced-reference. Techniques that attempt to incorporate image structure information often rely on hand-crafted features, making them difficult to be extended to handle different scenarios. On the other hand, objective metrics like Mean Square Error (MSE), while being easy to compute, are often deemed ineffective for measuring perceptual quality. This paper presents a novel approach to perceptual quality assessment by developing an MSE-like metric, which enjoys the benefit of MSE in terms of inexpensive computation and universal applicability while allowing structural information of an image being taken into consideration. The latter was achieved through introducing structure-preserving kernelization into a MSE-like formulation. We show that the method can lead to competitive FR-IQA results. Further, by developing a feature coding scheme based on this formulation, we extend the model to improve the performance of NR-IQA methods. We report extensive experiments illustrating the results from both our FR-IQA and NR-IQA algorithms with comparison to existing state-of-the-art methods.","PeriodicalId":146271,"journal":{"name":"2015 IEEE International Conference on Multimedia and Expo (ICME)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134368859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A framework of extracting multi-scale features using multiple convolutional neural networks
Pub Date: 2015-08-06 | DOI: 10.1109/ICME.2015.7177449
Kuan-Chuan Peng, Tsuhan Chen
Most works on convolutional neural networks (CNNs) use the traditional CNN framework, which extracts features at only a single scale. We propose multi-scale convolutional neural networks (MSCNN), which not only extract multi-scale features but also address the issues of previous methods that use CNNs to extract multi-scale features. Under the assumption of the label-inheritable (LI) property, we also propose a method to generate exponentially more training examples for MSCNN from the given training set. Our experimental results show that MSCNN outperforms both the state-of-the-art methods and the traditional CNN framework on artist, artistic style, and architectural style classification, supporting the claim that MSCNN outperforms the traditional CNN framework on tasks that at least partially satisfy the LI property.
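A minimal two-scale version of this idea is sketched below (layer sizes, the use of two branches, and the fusion by concatenation are illustrative assumptions, not the exact MSCNN architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleCNN(nn.Module):
    """One small CNN per scale; the coarse branch sees a downsampled copy of the input."""
    def __init__(self, n_classes=10):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fine = branch()      # full-resolution input
        self.coarse = branch()    # half-resolution input
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):
        f_fine = self.fine(x)
        f_coarse = self.coarse(F.interpolate(x, scale_factor=0.5,
                                             mode="bilinear", align_corners=False))
        return self.fc(torch.cat([f_fine, f_coarse], dim=1))   # fuse multi-scale features

model = TwoScaleCNN()
print(model(torch.randn(2, 3, 64, 64)).shape)   # torch.Size([2, 10])
```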
{"title":"A framework of extracting multi-scale features using multiple convolutional neural networks","authors":"Kuan-Chuan Peng, Tsuhan Chen","doi":"10.1109/ICME.2015.7177449","DOIUrl":"https://doi.org/10.1109/ICME.2015.7177449","url":null,"abstract":"Most works related to convolutional neural networks (CNN) use the traditional CNN framework which extracts features in only one scale. We propose multi-scale convolutional neural networks (MSCNN) which can not only extract multi-scale features but also solve the issues of the previous methods which use CNN to extract multi-scale features. With the assumption of label-inheritable (LI) property, we also propose a method to generate exponentially more training examples for MSCNN from the given training set. Our experimental results show that MSCNN outperforms both the state-of-the-art methods and the traditional CNN framework on artist, artistic style, and architectural style classification, supporting that MSCNN outperforms the traditional CNN framework on the tasks which at least partially satisfy LI property.","PeriodicalId":146271,"journal":{"name":"2015 IEEE International Conference on Multimedia and Expo (ICME)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129390990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring feature space with semantic attributes
Pub Date: 2015-08-06 | DOI: 10.1109/ICME.2015.7177441
Junjie Cai, Richang Hong, Meng Wang, Q. Tian
Indexing is a critical step in searching digital images in a large database. To date, designing a discriminative and compact indexing strategy remains challenging, partly due to the well-known semantic gap between user queries and the rich semantics of large-scale datasets. In this paper, we propose to construct a novel joint semantic-visual space by leveraging visual descriptors and semantic attributes, which aims to narrow the semantic gap by bringing both attributes and indexing into one framework. Such a joint space offers the flexibility of conducting Coherent Semantic-Visual Indexing, which employs binary codes to boost retrieval speed with satisfactory accuracy. To solve the proposed model effectively, this paper makes three contributions. First, we propose an iterative optimization method to find the joint space of semantic and visual descriptors. Second, we prove the convergence of our optimization algorithm, which guarantees that the system finds a good solution within a certain number of rounds. Finally, we integrate the semantic-visual joint space with spectral hashing, yielding an efficient solution for searching datasets of up to a million images. Experiments on two standard retrieval datasets, Holidays1M and Oxford5K, show that the proposed method achieves promising performance compared with the state of the art.
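The indexing side can be sketched roughly as follows (a simplified stand-in with assumed details: PCA projections thresholded at their medians approximate spectral-hashing-style binary codes, and retrieval is by Hamming distance, which is what makes million-scale search cheap):

```python
import numpy as np

def train_hash(X, n_bits=32):
    """Learn a simple binary-code mapping from descriptors in the joint space."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    proj = Vt[:n_bits].T                                  # top principal directions
    thresh = np.median((X - mean) @ proj, axis=0)
    return mean, proj, thresh

def encode(X, mean, proj, thresh):
    return ((X - mean) @ proj > thresh).astype(np.uint8)

def search(query_code, db_codes, k=5):
    hamming = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(hamming)[:k]                        # indices of the k nearest codes

db = np.random.randn(1000, 128)                           # e.g., joint semantic-visual descriptors
mean, proj, thresh = train_hash(db)
codes = encode(db, mean, proj, thresh)
print(search(encode(db[:1], mean, proj, thresh)[0], codes))
```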
{"title":"Exploring feature space with semantic attributes","authors":"Junjie Cai, Richang Hong, Meng Wang, Q. Tian","doi":"10.1109/ICME.2015.7177441","DOIUrl":"https://doi.org/10.1109/ICME.2015.7177441","url":null,"abstract":"Indexing is a critical step for searching digital images in a large database. To date, how to design discriminative and compact indexing strategy still remains a challenging issue, partly due to the well-known semantic gap between user queries and rich semantics in the large scale dataset. In this paper, we propose to construct a novel joint semantic-visual space by leveraging visual descriptors and semantic attributes, which aims to narrow down the semantic gap by taking both attribute and indexing into one framework. Such a joint space embraces the flexibility of conducting Coherent Semantic-visual Indexing, which employs binary codes to boost the retrieval speed with satisfying accuracy. To solve the proposed model effectively, three contributions are made in this submission. First, we propose an interactive optimization method to find the joint space of semantic and visual descriptors. Second, we prove the convergence property of our optimization algorithm, which guarantees our system will find a good solution in certain rounds. At last, we integrate the semantic-visual joint space system with spectral hashing, which can find an efficient solution to search up to million scale datasets. Experiments on two standard retrieval datasets i.e., Holidays1M and Oxford5K, show that the proposed method presents promising performance compared with the state-of-the-arts.","PeriodicalId":146271,"journal":{"name":"2015 IEEE International Conference on Multimedia and Expo (ICME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130793893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Single image super-resolution via 2D sparse representation
Pub Date: 2015-08-06 | DOI: 10.1109/ICME.2015.7177485
Na Qi, Yunhui Shi, Xiaoyan Sun, Wenpeng Ding, Baocai Yin
Image super-resolution with a sparsity prior provides promising performance. However, traditional sparsity-based super-resolution methods transform a two-dimensional (2D) image into a one-dimensional (1D) vector, which ignores the intrinsic 2D structure and the spatial correlation inherent in images. In this paper, we propose the first image super-resolution method that reconstructs a high-resolution image from its low-resolution counterpart via a two-dimensional sparse model. Correspondingly, we present a new dictionary learning algorithm that fully exploits the correspondence between the two pairs of 2D dictionaries for low- and high-resolution images. Experimental results demonstrate that our proposed 2D sparse model for image super-resolution outperforms state-of-the-art super-resolution methods based on 1D sparse models in terms of both reconstruction quality and memory usage.
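For reference, a 2D sparse model commonly takes the following form (notation assumed here, not copied from the paper): a patch is coded over a row dictionary and a column dictionary acting on a sparse coefficient matrix, so the patch is never flattened into a vector:

```latex
% Hedged sketch of the standard 2D sparse formulation; D_1, D_2, S, k are assumed notation.
\begin{align}
  \min_{S} \; \| X - D_1 S D_2^{\top} \|_F^2
  \quad \text{s.t.} \quad \|S\|_0 \le k
\end{align}
% For super-resolution, S would be estimated from the low-resolution patch with the
% LR dictionary pair and reused with the coupled HR pair, e.g.
% \hat{Y} = D_1^{h} S (D_2^{h})^{\top}.
```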
{"title":"Single image super-resolution via 2D sparse representation","authors":"Na Qi, Yunhui Shi, Xiaoyan Sun, Wenpeng Ding, Baocai Yin","doi":"10.1109/ICME.2015.7177485","DOIUrl":"https://doi.org/10.1109/ICME.2015.7177485","url":null,"abstract":"Image super-resolution with sparsity prior provides promising performance. However, traditional sparse-based super resolution methods transform a two dimensional (2D) image into a one dimensional (1D) vector, which ignores the intrinsic 2D structure as well as spatial correlation inherent in images. In this paper, we propose the first image super-resolution method which reconstructs a high resolution image from its low resolution counterpart via a two dimensional sparse model. Correspondingly, we present a new dictionary learning algorithm to fully make use of the corresponding relationship of two pairs of 2D dictionaries of low and high resolution images, respectively. Experimental results demonstrate that our proposed image super-resolution with 2D sparse model outperforms state-of-the-art 1D sparse model based super resolution methods in terms of both reconstruction ability and memory usage.","PeriodicalId":146271,"journal":{"name":"2015 IEEE International Conference on Multimedia and Expo (ICME)","volume":"1983 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120847185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimization of the number of rays in interpolation for light field based free viewpoint systems
Pub Date: 2015-08-06 | DOI: 10.1109/ICME.2015.7177463
H. Shidanshidi, F. Safaei, W. Li
Light field (LF) rendering is widely used in free viewpoint video (FVV) systems. Different methods have been proposed that employ depth maps to improve rendering quality; however, depth estimation is often error-prone. In this paper, a new method based on the concept of effective sampling density (ESD) is proposed for evaluating depth-based LF rendering algorithms under different levels of depth-estimation error. In addition, for a given rendering quality, we provide an estimate of the number of rays required in the interpolation algorithm to compensate for the adverse effect of errors in the depth maps. The proposed method is particularly useful for designing a rendering algorithm that achieves the required rendering quality with inaccurate knowledge of depth. Both theoretical study and numerical simulations have verified the efficacy of the proposed method.
{"title":"Optimization of the number of rays in interpolation for light field based free viewpoint systems","authors":"H. Shidanshidi, F. Safaei, W. Li","doi":"10.1109/ICME.2015.7177463","DOIUrl":"https://doi.org/10.1109/ICME.2015.7177463","url":null,"abstract":"Light field (LF) rendering is widely used in free viewpoint video systems (FVV). Different methods have been proposed to employ depth maps to improve the rendering quality. However, estimation of depth is often error-prone. In this paper, a new method based on the concept of effective sampling density (ESD) is proposed for evaluating the depth-based LF rendering algorithms at different levels of errors in the depth estimation. In addition, for a given rendering quality, we provide an estimation of number of rays required in the interpolation algorithm to compensate for the adverse effect caused by errors in depth maps. The proposed method is particularly useful in designing a rendering algorithm with inaccurate knowledge of depth to achieve the required rendering quality. Both the theoretical study and numerical simulations have verified the efficacy of the proposed method.","PeriodicalId":146271,"journal":{"name":"2015 IEEE International Conference on Multimedia and Expo (ICME)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128631951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A probabilistic model for food image recognition in restaurants
Pub Date: 2015-08-06 | DOI: 10.1109/ICME.2015.7177464
Luis Herranz, Ruihan Xu, Shuqiang Jiang
A large number of food photos are taken in restaurants for diverse reasons. Dish recognition in this setting is very challenging, due to different cuisines, cooking styles, and the intrinsic difficulty of modeling food from its visual appearance. Contextual knowledge is crucial to improving recognition in such a scenario. In particular, geographic context has been widely exploited for outdoor landmark recognition. Similarly, we exploit knowledge about restaurant menus and the geolocation of restaurants and test images. We first adapt a framework based on discarding unlikely categories located far from the test image. We then reformulate the problem using a probabilistic model connecting dishes, restaurants, and geolocations, and apply this model to three different tasks: dish recognition, restaurant recognition, and geolocation refinement. Experiments on a dataset of 187 restaurants and 701 dishes show that combining multiple sources of evidence (visual, geolocation, and external knowledge) can boost performance on all tasks.
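The kind of probabilistic combination described above can be sketched as follows (a toy formulation with assumed variable names and a Gaussian distance prior, not the paper's exact model): the posterior over dishes sums over candidate restaurants, combining a visual classifier score, the restaurant's menu prior, and a location prior.

```python
import numpy as np

def dish_posterior(p_dish_visual, menus, rest_locations, photo_location, sigma=100.0):
    """
    p_dish_visual : (n_dishes,) visual classifier probabilities P(dish | image)
    menus         : (n_restaurants, n_dishes) row-normalized menu priors P(dish | restaurant)
    rest_locations: (n_restaurants, 2) restaurant coordinates (e.g., metres)
    photo_location: (2,) geotag of the test photo
    """
    d2 = np.sum((rest_locations - photo_location) ** 2, axis=1)
    p_rest = np.exp(-d2 / (2 * sigma ** 2))
    p_rest /= p_rest.sum()                  # P(restaurant | location)
    p_dish_ctx = p_rest @ menus             # P(dish | location), marginalizing restaurants
    post = p_dish_visual * p_dish_ctx       # combine visual evidence with context
    return post / post.sum()

menus = np.array([[0.5, 0.5, 0.0], [0.0, 0.3, 0.7]])   # 2 restaurants, 3 dishes
rests = np.array([[0.0, 0.0], [500.0, 0.0]])
visual = np.array([0.2, 0.5, 0.3])
print(dish_posterior(visual, menus, rests, np.array([10.0, 5.0])))
```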
{"title":"A probabilistic model for food image recognition in restaurants","authors":"Luis Herranz, Ruihan Xu, Shuqiang Jiang","doi":"10.1109/ICME.2015.7177464","DOIUrl":"https://doi.org/10.1109/ICME.2015.7177464","url":null,"abstract":"A large amount of food photos are taken in restaurants for diverse reasons. This dish recognition problem is very challenging, due to different cuisines, cooking styles and the intrinsic difficulty of modeling food from its visual appearance. Contextual knowledge is crucial to improve recognition in such scenario. In particular, geocontext has been widely exploited for outdoor landmark recognition. Similarly, we exploit knowledge about menus and geolocation of restaurants and test images. We first adapt a framework based on discarding unlikely categories located far from the test image. Then we reformulate the problem using a probabilistic model connecting dishes, restaurants and geolocations. We apply that model in three different tasks: dish recognition, restaurant recognition and geolocation refinement. Experiments on a dataset including 187 restaurants and 701 dishes show that combining multiple evidences (visual, geolocation, and external knowledge) can boost the performance in all tasks.","PeriodicalId":146271,"journal":{"name":"2015 IEEE International Conference on Multimedia and Expo (ICME)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129315154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}