"An Adaptive Inter Mode Decision for Multiview Video Coding"
Wei Zhu, Peng Chen, Yayu Zheng, Jie Feng
2011 IEEE International Symposium on Multimedia (December 2011). DOI: 10.1109/ISM.2011.54

Multiview video coding (MVC) plays an important role in 3D video systems, but its huge computational complexity hinders its application. This paper proposes an adaptive Inter mode decision algorithm to reduce the complexity of MVC. First, the selection of Inter modes is determined from the textural region type of each macroblock (MB). Then, the estimation of the small-size Inter modes (Inter16x8, Inter8x16, and Inter8x8) is decided based on the motion homogeneity of the MB, which is predicted from the motion estimation results of the Inter16x16 mode. Finally, the complexity of Inter8x8 mode estimation is progressively reduced by exploiting the rate-distortion (RD) costs of the already estimated modes. Compared with the full mode decision in the MVC reference software, the proposed algorithm achieves a 71% encoding time saving on average, with a 0.026 dB peak signal-to-noise ratio loss and a 0.74% bit-rate increase.
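The final stage of the algorithm, progressively pruning Inter8x8 estimation by RD cost, can be sketched as follows. The mode names come from the abstract, but the decision rule and the ratio threshold are illustrative assumptions, not the paper's actual criteria:

```python
def decide_modes(rd_costs, motion_homogeneous, ratio=1.05):
    """Sketch of an adaptive Inter mode decision for one macroblock.

    rd_costs: dict mapping already-estimated modes to their RD cost;
              must contain 'Inter16x16'.
    motion_homogeneous: True if the MB's motion field, predicted from
              the Inter16x16 motion estimation result, is homogeneous.
    Returns the list of additional small-size modes worth estimating.
    The ratio threshold is an illustrative assumption.
    """
    if motion_homogeneous:
        # Homogeneous motion: the large partition suffices,
        # so skip all small-size Inter modes.
        return []
    candidates = ['Inter16x8', 'Inter8x16']
    best = min(rd_costs.values())
    # Estimate Inter8x8 only if some smaller partition already improved
    # noticeably on Inter16x16 (progressive reduction by RD cost).
    if best < rd_costs['Inter16x16'] / ratio:
        candidates.append('Inter8x8')
    return candidates
```

In this toy form, the homogeneity test eliminates all small-partition searches at once, and the RD-cost ratio gates the most expensive Inter8x8 search.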
"A Rateless UEP Convolutional Code for Robust SVC/MGS Wireless Broadcasting"
Chung-hsuan Wang, J. Zao, Hsing-Min Chen, Pei-Lun Diao, Chih-Ming Chiu
2011 IEEE International Symposium on Multimedia (December 2011). DOI: 10.1109/ISM.2011.51

Wireless broadcasting of scalable video coding medium-grain scalability (SVC/MGS) bit streams requires unequal erasure protection (UEP) at the transport layer to ensure graceful degradation of playback video quality over a wide range of frame error rates. Modern wireless broadcasting systems even employ rateless fountain codes to aid the receivers in making the inevitable trade-offs among picture quality, channel throughput, and playback latency. Designing a rateless UEP channel code fit for such an application poses distinct engineering challenges, as the necessary protection for the SVC base and enhancement layers differs by orders of magnitude while their intra-dependent groups of pictures fluctuate notably in size. In this paper, we present the design and implementation of a rateless UEP convolutional code that meets these demanding requirements. Use of this UEP channel code, along with rate-distortion-based network abstraction layer (NAL) unit extraction, offers sufficient protection to SVC bit streams under different lossy conditions without the need to re-code the bit stream. We also investigate the differences in playback performance of SVC bit streams protected by rateless codes versus conventional Reed-Solomon codes. The comparison makes clear the advantages and disadvantages of employing rateless codes in protecting wireless video broadcasting.
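The core idea of UEP, protecting the base layer orders of magnitude more strongly than the enhancement layers, can be illustrated with a toy repair-symbol allocation. This is only a budgeting sketch; the paper's actual mechanism is a rateless convolutional code, not a symbol-count allocation, and the ratio values below are invented for illustration:

```python
def allocate_repair(symbols_per_layer, protection_ratio):
    """Toy unequal erasure protection: give each SVC layer a number of
    repair symbols proportional to its (source size x protection ratio).
    The ratios are illustrative; a real system derives them from the
    target residual loss rate of each layer.
    """
    return {layer: max(1, round(n * protection_ratio[layer]))
            for layer, n in symbols_per_layer.items()}
```

For example, a base layer with ratio 1.0 receives full-rate repair while an enhancement layer with ratio 0.1 receives only token protection, mirroring the orders-of-magnitude gap the abstract describes.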
"Camera Deployment for Video Panorama Generation in Wireless Visual Sensor Networks"
Enes Yildiz, K. Akkaya, Esra Sisikoglu, M. Sir, Ismail Guneydas
2011 IEEE International Symposium on Multimedia (December 2011). DOI: 10.1109/ISM.2011.105

In this paper, we tackle the problem of providing coverage for video panorama generation in wireless heterogeneous visual sensor networks (VSNs), where cameras may differ in price, resolution, field of view (FoV), and depth of field (DoF). We utilize multi-perspective coverage (MPC), which refers to the coverage of a point from given disparate perspectives simultaneously. For a given minimum average resolution, area boundaries, and variety of camera sensors, we propose a deployment algorithm that minimizes the total cost while guaranteeing full MPC of the area (i.e., the coverage needed for video panorama generation) and the minimum required resolution. Specifically, the approach is based on a bi-level mixed integer program (MIP) that runs two models, a master problem and a sub-problem, iteratively. The master problem provides coverage for an initial set of identified points while meeting the minimum resolution requirement at minimum cost. The sub-problem then finds an uncovered point, extends the set of points to be covered, and sends this set back to the master problem. The two problems continue to run iteratively until the sub-problem becomes infeasible, which means full MPC has been achieved with the resolution requirements. The numerical results show the superiority of our approach over existing approaches.
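The master/sub-problem iteration described above has the shape of a generic constraint-generation loop, which can be sketched independently of the MIP details. The two callables stand in for the optimization models, which are assumptions here, not the paper's formulations:

```python
def solve_bilevel(master_solve, find_uncovered, initial_points):
    """Generic sketch of the iterative master/sub-problem scheme.

    master_solve(points)    -> a minimum-cost deployment covering the
                               given points (stands in for the master MIP).
    find_uncovered(deploy)  -> a point not yet multi-perspective covered,
                               or None when the sub-problem is infeasible,
                               i.e. full MPC has been reached.
    """
    points = list(initial_points)
    while True:
        deployment = master_solve(points)
        p = find_uncovered(deployment)
        if p is None:          # sub-problem infeasible: full coverage
            return deployment
        points.append(p)       # extend the point set and re-solve
```

Each iteration either terminates or strictly grows the point set, so the loop ends once every point of the area is covered.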
"Developing a Real-Time System for Measuring the Consumption of Seasoning"
Mayumi Ueda, Takuya Funatomi, Atsushi Hashimoto, Takahiro Watanabe, M. Minoh
2011 IEEE International Symposium on Multimedia (December 2011). DOI: 10.1109/ISM.2011.71

In this paper, we propose a real-time system for measuring the consumption of various types of seasonings. In our system, all seasonings are placed on a scale, and we continuously take images of these items using a camera. The system estimates the consumption of each seasoning by calculating the difference between the scale weight when the seasoning was picked up and the weight when it was placed back on the scale. It identifies the type of seasoning that was used by determining whether or not each seasoning is present on the scale. Using our system, users can automatically log their usage of seasonings and then adjust the seasoning according to their desired taste.
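The weight-difference bookkeeping is simple enough to sketch directly. The event format is an assumption (the paper derives pickup/return events from the camera, not from an explicit log):

```python
def log_consumption(events):
    """Estimate per-seasoning consumption from scale readings.

    events: chronological (name, action, scale_weight_in_grams) tuples,
            where action is 'pickup' (weight just before the item left
            the scale) or 'return' (weight just after it came back).
    Consumption is the drop in total scale weight between pickup and
    return, since only the used seasoning got lighter.
    """
    pending, used = {}, {}
    for name, action, weight in events:
        if action == 'pickup':
            pending[name] = weight
        elif action == 'return':
            used[name] = used.get(name, 0.0) + (pending.pop(name) - weight)
    return used
```

A bottle picked up at a 500.0 g total reading and returned at 497.5 g yields 2.5 g of estimated consumption.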
"Delay-Aware Loss-Concealment Strategies for Real-Time Video Conferencing"
Jingxi Xu, B. Wah
2011 IEEE International Symposium on Multimedia (December 2011). DOI: 10.1109/ISM.2011.14

One-way audiovisual quality and mouth-to-ear delay (MED) are two important quality metrics in the design of real-time video-conferencing systems, and their trade-offs have a significant impact on the user-perceived quality. In this paper, we address one aspect of this larger problem by developing efficient loss-concealment schemes that optimize the one-way quality under a given MED and network conditions. Our experimental results show that our approach attains significant improvements over the LARDo reference scheme, which does not consider MED in its optimization.
"Comprehensive Analysis on the Effects of Noise Estimation Strategies on Image Noise Artifact Suppression Performance"
Angus Leigh, A. Wong, David A. Clausi, P. Fieguth
2011 IEEE International Symposium on Multimedia (December 2011). DOI: 10.1109/ISM.2011.24

In this paper, the effects of employing different noise estimation strategies on the performance of noise artifact suppression techniques are investigated. Most literature on the subject uses the true noise level of the noisy image when performing noise artifact suppression. However, this does not reflect how such techniques are used in practice, where the true noise level is unknown, as is common in most image and video processing applications; there, the noise level must first be estimated before a suppression technique can be applied with the estimated level. Through a comprehensive analysis of different noise estimation strategies, using empirical testing on a variety of images with different characteristics, the MAD wavelet noise estimation technique was found to be the overall preferred noise estimator for all of the popular suppression techniques investigated (BM3D, bilateral, NeighShrink, BLS-GSM, and non-local means). Furthermore, BM3D combined with MAD wavelet noise estimation was found to offer the best performance in achieving high image quality when the noise level is unknown and must be estimated. The outcome of this research is a set of clear recommendations that can be used in practice when suppressing noise artifacts in digital imagery and video.
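The recommended MAD wavelet estimator is a standard construction: the noise standard deviation is estimated as the median absolute deviation of the finest diagonal wavelet subband divided by 0.6745 (the MAD of a unit Gaussian). A minimal sketch, using a one-level Haar transform in place of whatever wavelet the paper's experiments used:

```python
import numpy as np

def mad_noise_estimate(img):
    """MAD wavelet noise estimate: sigma ~= median(|HH1|) / 0.6745,
    where HH1 is the finest diagonal detail subband (Haar here)."""
    a = np.asarray(img, dtype=float)
    a = a[:a.shape[0] // 2 * 2, :a.shape[1] // 2 * 2]  # crop to even size
    # One-level orthonormal 2-D Haar transform, diagonal coefficients:
    hh = (a[0::2, 0::2] - a[0::2, 1::2]
          - a[1::2, 0::2] + a[1::2, 1::2]) / 2.0
    return np.median(np.abs(hh)) / 0.6745
```

Because image structure contributes little energy to the finest diagonal band and the median is robust to the outliers that structure does produce, the estimate tracks the true noise level well on natural images.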
"A Subjective Evaluation of 3D IPTV Broadcasting Implementations Considering Coding and Transmission Degradation"
Pierre R. Lebreton, A. Raake, M. Barkowsky, P. Callet
2011 IEEE International Symposium on Multimedia (December 2011). DOI: 10.1109/ISM.2011.89

This paper describes the results of a subjective test assessing current technology used for 3DTV broadcasting. First, the performance of the currently deployed coding schemes was compared with state-of-the-art algorithms. Our results show that downsampling and packing 3D stereoscopic videos according to the so-called Side-by-Side format gives the highest perceived quality for a given bit rate. The second aspect of the study was to investigate how common 2D error concealment algorithms perform in the 3D case, and how their 3D performance compares with the 2D case. The results provide information on whether binocular suppression or binocular rivalry plays the more important role for 3D video quality under transmission errors; they indicate that binocular rivalry and the related visual discomfort are the dominant factors. Finally, the paper compares the test results with results from different labs to evaluate the repeatability of a subjective experiment in the 3D case and to compare the employed test methodologies. Here, the study shows the variation between observers when rating visual discomfort and illustrates the difficulty of evaluating this new dimension.
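The Side-by-Side format mentioned above halves each view's horizontal resolution and packs the two views into one standard frame. A minimal sketch with plain decimation; a real broadcast chain would low-pass filter before subsampling to avoid aliasing:

```python
import numpy as np

def pack_side_by_side(left, right):
    """Half-resolution Side-by-Side frame packing: each view is
    horizontally decimated by 2, then the two half-width views are
    concatenated, preserving the original frame dimensions.
    (Plain column decimation; anti-alias filtering omitted.)"""
    return np.hstack([left[:, ::2], right[:, ::2]])
```

The packed frame can then pass through an ordinary 2D encoder and transport chain, which is why this format dominated deployed 3DTV services.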
"Saliency Detection Using Region-Based Incremental Center-Surround Distance"
Minwoo Park, Mrityunjay Kumar, A. Loui
2011 IEEE International Symposium on Multimedia (December 2011). DOI: 10.1109/ISM.2011.47

A new method to detect salient regions in images is proposed in this paper. The proposed approach, inspired by object-based visual attention theory, segments the input image into coherent regions and measures a region-based center-surround distance (RBCSD), a distance between region attributes, such as color histograms, of each region and its surrounding region. Furthermore, segmented regions are merged such that the RBCSD of the merged region is greater than the individual RBCSDs of the component regions, through a region-based incremental center-surround distance (RBCSD+I) process. Thanks to this process, merged regions may contain incoherent color regions, which improves the robustness of the approach. The key advantages of the proposed algorithm are that (1) it provides a salient region with plausible object boundaries, (2) it is robust to color incoherency within the salient region, and (3) it is computationally efficient. Extensive qualitative and quantitative evaluation on widely used data sets, and comparison with existing saliency detection approaches, clearly indicates the feasibility and efficiency of the proposed approach.
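The center-surround distance between a region's color histogram and its surround's histogram can be sketched with a chi-square distance; the specific distance function is an assumption here, since the abstract names the attributes but not the metric:

```python
import numpy as np

def center_surround_distance(region_hist, surround_hist, eps=1e-12):
    """Sketch of a region-based center-surround distance (RBCSD):
    chi-square distance between the normalized color histogram of a
    region and that of its surrounding region. Ranges from 0
    (identical distributions) to 1 (disjoint support)."""
    p = region_hist / (region_hist.sum() + eps)
    q = surround_hist / (surround_hist.sum() + eps)
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))
```

Under the RBCSD+I idea, two adjacent regions would be merged whenever the distance computed for their union against the union's surround exceeds both individual distances.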
"Scale-Optimized Textons for Image Categorization and Segmentation"
Yousun Kang, A. Sugimoto
2011 IEEE International Symposium on Multimedia (December 2011). DOI: 10.1109/ISM.2011.48

The texton is a representative dense visual word, and it has proven effective in categorizing materials as well as generic object classes. Despite its success and popularity, no prior work has tackled the problem of optimizing its scale for given image data and an associated object category. We propose scale-optimized textons that learn the best scale for each object in a scene, and we incorporate them into image categorization and segmentation. Our textonization process produces a scale-optimized codebook of visual words. We approach the scale-optimization problem by using the scene-context scale in each image, which is the effective scale of local context for classifying an image pixel in a scene. The textonization is performed with a randomized decision forest, a powerful tool with high computational efficiency in vision applications. Our experiments on the MSRC and VOC 2007 segmentation datasets show that scale-optimized textons improve the performance of image categorization and segmentation.
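Textonization maps each pixel's filter-response vector to a discrete visual-word id. The paper does this with a randomized decision forest; as a simpler stand-in, the conventional nearest-center assignment against a learned codebook looks like this:

```python
import numpy as np

def textonize(features, codebook):
    """Conventional textonization sketch (nearest-center assignment;
    the paper instead uses a randomized decision forest for speed):
    map each pixel's filter-response vector to the index of the
    nearest codebook entry, producing a texton id per pixel."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)
```

A decision forest replaces the exhaustive distance computation with a few threshold tests per tree, which is what makes dense per-pixel textonization computationally cheap.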
"Exploiting Text-Related Features for Content-based Image Retrieval"
Georg Schroth, S. Hilsenbeck, Robert Huitl, F. Schweiger, E. Steinbach
2011 IEEE International Symposium on Multimedia (December 2011). DOI: 10.1109/ISM.2011.21

Distinctive visual cues are of central importance for image retrieval applications, in particular in the context of visual location recognition. While in indoor environments typically only a few distinctive features can be found, outdoors, dynamic objects and clutter significantly impair retrieval performance. We present an approach that exploits text, a major source of information for humans during orientation and navigation, without the need for error-prone optical character recognition. To this end, characters are detected and described using robust feature descriptors such as SURF. By quantizing them into several hundred visual words, we consider the distinctive appearance of the characters rather than reducing the set of possible features to an alphabet. Writings in images are transformed into strings of visual words, termed visual phrases, which provide significantly improved distinctiveness compared with individual features. Approximate string matching is performed using N-grams, which can be efficiently combined with an inverted file structure to cope with large datasets. An experimental evaluation on three different datasets shows a significant improvement in retrieval performance while reducing the size of the database by two orders of magnitude compared with the state of the art. The low computational complexity makes the approach particularly suited for mobile image retrieval applications.
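The N-gram matching over visual phrases combines naturally with an inverted file. A minimal sketch, with phrases represented as lists of visual-word ids and a plain shared-N-gram count as the score (the paper's exact scoring is not specified in the abstract):

```python
def ngrams(phrase, n=3):
    """Split a visual phrase (sequence of visual-word ids) into N-grams."""
    return [tuple(phrase[i:i + n]) for i in range(len(phrase) - n + 1)]

def build_inverted_index(phrases):
    """Inverted file: each N-gram points to the set of database images
    whose visual phrases contain it."""
    index = {}
    for doc_id, phrase in phrases.items():
        for g in ngrams(phrase):
            index.setdefault(g, set()).add(doc_id)
    return index

def query(index, phrase):
    """Approximate string matching: score database images by the number
    of N-grams they share with the query phrase."""
    scores = {}
    for g in ngrams(phrase):
        for doc_id in index.get(g, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return max(scores, key=scores.get) if scores else None
```

Because matching is per-N-gram rather than whole-string, a query phrase with a few misquantized characters still retrieves the correct image, which is the robustness the approximate matching is meant to provide.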