In this work, we propose a simple yet effective method for improving the performance of local feature matching between equirectangular cylindrical images, which leads to more stable and complete 3D reconstruction by incremental SfM. The key idea is to explicitly generate synthesized images by rotating the spherical panoramic images, and to detect and describe features only in the less distorted areas of the rectified panoramas. We demonstrate that the proposed method is advantageous for both rotational and translational camera motions compared with standard methods on synthetic data. We also demonstrate that the proposed feature matching is beneficial for incremental SfM through experiments on the Pittsburgh Research dataset.
Taira, H., Inoue, Y., Torii, A. and Okutomi, M.: Robust Feature Matching for Distorted Projection by Spherical Cameras, IPSJ Transactions on Computer Vision and Applications, Vol. 7, pp. 84–88 (2015). DOI: 10.2197/ipsjtcva.7.84
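As a rough illustration of the key idea, rotating an equirectangular panorama amounts to mapping each output pixel to the unit sphere, rotating, and resampling the source image. A minimal numpy sketch with nearest-neighbour sampling (an assumption for brevity; not the authors' implementation):

```python
import numpy as np

def rotate_equirectangular(img, R):
    """Resample an equirectangular panorama under a 3D rotation R.

    Each output pixel is mapped to a direction on the unit sphere,
    carried back into the source frame by R^T, and sampled with
    nearest-neighbour interpolation.
    """
    h, w = img.shape[:2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # pixel centres -> longitude in [-pi, pi), latitude in [-pi/2, pi/2]
    lon = (u + 0.5) / w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (v + 0.5) / h * np.pi
    # spherical -> Cartesian unit vectors
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    # row-vector form of d_src = R^T d_out
    vec = np.stack([x, y, z], axis=-1) @ R
    src_lon = np.arctan2(vec[..., 1], vec[..., 0])
    src_lat = np.arcsin(np.clip(vec[..., 2], -1.0, 1.0))
    # Cartesian -> source pixel coordinates
    su = ((src_lon + np.pi) / (2 * np.pi) * w).astype(int) % w
    sv = ((np.pi / 2 - src_lat) / np.pi * h).astype(int).clip(0, h - 1)
    return img[sv, su]
```

With the identity rotation the resampling is exact, which gives a quick sanity check before using a real rotation matrix.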
Takayoshi Yamashita, Takaya Nakamura, Hiroshi Fukui, Yuji Yamauchi, H. Fujiyoshi
Facial part labeling, which parses a face into semantic components, enables high-level facial image analysis and contributes greatly to face recognition, expression recognition, animation, and synthesis. In this paper, we propose a cost-alleviative learning method that uses a weighted cost function to improve the labeling performance for certain classes. Because the conventional cost function treats the error in all classes equally, the error in a class with a slightly biased prior probability tends not to be propagated. The weighted cost function allows the training coefficient for each class to be adjusted; in addition, the boundaries of each class may be recognized after fewer iterations, which improves performance. In facial part labeling, the recognition performance of the eye class can be significantly improved using cost-alleviative learning.
Yamashita, T., Nakamura, T., Fukui, H., Yamauchi, Y. and Fujiyoshi, H.: Cost-alleviative Learning for Deep Convolutional Neural Network-based Facial Part Labeling, IPSJ Transactions on Computer Vision and Applications, Vol. 7, pp. 99–103 (2015). DOI: 10.2197/ipsjtcva.7.99
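The weighted cost function described above can be illustrated with a class-weighted cross-entropy, where a per-class coefficient amplifies the error signal of an under-represented class. A minimal sketch (the paper's exact weighting scheme may differ):

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Class-weighted cross-entropy loss.

    probs: (N, C) predicted class probabilities
    labels: (N,) integer class labels
    class_weights: (C,) per-class training coefficients; raising the
    weight of a rare class (e.g. 'eye') amplifies its error signal.
    """
    n = labels.shape[0]
    p = probs[np.arange(n), labels]      # probability of the true class
    w = class_weights[labels]            # weight of the true class
    return float(-(w * np.log(p + 1e-12)).mean())
```

Raising the weight of a poorly predicted class increases the loss, and hence the gradient contribution, for exactly those samples.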
Tomoki Matsuzawa, Raissa Relator, Wataru Takei, S. Omachi, Tsuyoshi Kato
Nowadays, the design of image representations is one of the most crucial factors in the performance of visual categorization. A common pipeline employed by most recent research for obtaining an image representation consists of two steps: an encoding step and a pooling step. In this paper, we introduce the Mahalanobis metric into two popular image patch encoding modules, Histogram Encoding and Fisher Encoding, which are used in the Bag-of-Visual-Words method and the Fisher Vector method, respectively. Moreover, for the proposed Fisher Vector method, a closed-form approximation of the Fisher Vector can be derived under the same assumption used in the original Fisher Vector, and the codebook is built without resorting to time-consuming EM (Expectation-Maximization) steps. Experimental evaluation on multi-class classification demonstrates the effectiveness of the proposed encoding methods.
Matsuzawa, T., Relator, R., Takei, W., Omachi, S. and Kato, T.: Mahalanobis Encodings for Visual Categorization, IPSJ Transactions on Computer Vision and Applications, Vol. 7, pp. 69–73 (2015). DOI: 10.2197/ipsjtcva.7.69
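As a sketch of what introducing a Mahalanobis metric into an encoding module means: hard assignment of descriptors to visual words under a Mahalanobis metric is equivalent to whitening both descriptors and codebook, then doing plain Euclidean nearest-neighbour assignment. The covariance is assumed given here; the paper's estimation procedure may differ:

```python
import numpy as np

def mahalanobis_assign(descriptors, codebook, cov):
    """Assign local descriptors to visual words under a Mahalanobis metric.

    Whitening both sides by cov^{-1/2} reduces the Mahalanobis distance
    to the Euclidean one, so ordinary nearest-neighbour search applies.
    """
    # inverse square root of the covariance via eigendecomposition
    vals, vecs = np.linalg.eigh(cov)
    w_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    xw = descriptors @ w_inv_sqrt
    cw = codebook @ w_inv_sqrt
    # squared Euclidean distances in the whitened space
    d2 = ((xw[:, None, :] - cw[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # hard assignment, as in Histogram Encoding
```

With the identity covariance this degenerates to the usual Euclidean assignment, which is a convenient correctness check.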
This paper describes part of an ongoing comprehensive research project aimed at generating MathML from images of mathematical expressions extracted from scanned PDF documents. A MathML representation of a scanned PDF document reduces the document's storage size and encodes the mathematical notation and meaning. The MathML representation then becomes suitable for vocalization and accessible through the use of assistive technologies. In order to achieve an accurate layout analysis of a scanned PDF document, all textual and non-textual components must be recognised, identified and tagged. These components may be text, or mathematical expressions and graphics in the form of images, figures, tables and/or diagrams. Mathematical expressions are among the most significant components of scanned scientific and engineering PDF documents and need to be machine readable for use with assistive technologies. This research is a work in progress and includes several modules: detecting and extracting mathematical expressions, recursive primitive component extraction, non-alphanumerical symbol recognition, structural semantic analysis, and merging primitive components to generate the MathML of the scanned PDF document. An optional module converts MathML to audio using a Text-to-Speech (TTS) engine to make the document accessible to vision-impaired users. Keywords: math recognition, graphics recognition, Mathematical Information Retrieval
Nazemi, A., Murray, I. and McMeekin, D.: Mathematical Information Retrieval (MIR) from Scanned PDF Documents and MathML Conversion, IPSJ Transactions on Computer Vision and Applications, Vol. 6, pp. 132–142 (2014). DOI: 10.2197/ipsjtcva.6.132
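For readers unfamiliar with the target format, the final merging module produces Presentation MathML, where layout elements such as `<msup>` encode structure (here, a superscript). A toy illustration of the output for a single power expression, not the project's actual code:

```python
import xml.etree.ElementTree as ET

def superscript_mathml(base, exponent):
    """Build Presentation MathML for a simple power such as x^2.

    A merged primitive pair (base symbol, raised symbol) maps to the
    <msup> layout element; identifiers use <mi>, numbers use <mn>.
    """
    math = ET.Element("math", xmlns="http://www.w3.org/1998/Math/MathML")
    msup = ET.SubElement(math, "msup")
    ET.SubElement(msup, "mi").text = base
    ET.SubElement(msup, "mn").text = exponent
    return ET.tostring(math, encoding="unicode")
```

A TTS front end can then vocalize the structure ("x to the power of 2") instead of reading a flat pixel image.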
This paper introduces a novel method for image classification using local feature descriptors. The method utilizes linear subspaces of local descriptors to characterize their distribution and extract image features. The extracted features are transformed into more discriminative features by linear discriminant analysis and employed for recognizing the image categories. Experimental results demonstrate that this method is competitive with the Fisher kernel method in terms of classification accuracy.
Takahashi, T. and Kurita, T.: Image Classification Using a Mixture of Subspace Models, IPSJ Transactions on Computer Vision and Applications, Vol. 6, pp. 93–97 (2014). DOI: 10.2197/ipsjtcva.6.93
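As an illustration of the subspace idea, a linear subspace of local descriptors can be obtained from the top singular vectors of the descriptor matrix, and a descriptor's fit to it measured by its projection energy. This is a simplified sketch; the paper's mixture of subspaces and the LDA step are omitted:

```python
import numpy as np

def subspace_basis(descriptors, k):
    """k-dimensional linear subspace of the rows of `descriptors`.

    The top-k right singular vectors span the subspace minimizing the
    total squared residual of the descriptors (uncentered PCA).
    """
    _, _, vt = np.linalg.svd(descriptors, full_matrices=False)
    return vt[:k]

def projection_energy(x, basis):
    """Fraction of descriptor x's energy lying inside the subspace (0..1)."""
    coeffs = basis @ x
    return float((coeffs ** 2).sum() / (x ** 2).sum())
```

Descriptors well explained by a category's subspace score near 1, giving a simple per-category evidence measure.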
K. Kanatani, A. Al-Sharadqah, N. Chernov, Y. Sugaya
The technique of “renormalization” for geometric estimation attracted much attention when it appeared in the early 1990s for having higher accuracy than any other method known at the time. The key fact is that it directly specifies the equations to solve, rather than minimizing some cost function. This paper expounds this “non-minimization approach” in detail and exploits the principle to modify renormalization so that it outperforms standard reprojection error minimization. Carrying out a precise error analysis in the most general situation, we derive a formula that maximizes the accuracy of the solution; we call the result hyper-renormalization. Applying it to ellipse fitting, fundamental matrix computation, and homography computation, we confirm its accuracy and efficiency for sufficiently small noise. Our emphasis is on the general principle rather than on individual methods for particular problems.
Kanatani, K., Al-Sharadqah, A., Chernov, N. and Sugaya, Y.: Hyper-renormalization: Non-minimization Approach for Geometric Estimation, IPSJ Transactions on Computer Vision and Applications, Vol. 6, pp. 143–159 (2014). DOI: 10.2197/ipsjtcva.6.143
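For context, the baseline that renormalization-type methods improve on is plain algebraic least squares. For ellipse fitting this means finding the conic coefficient vector minimizing the algebraic residual; a sketch of that baseline, not of hyper-renormalization itself:

```python
import numpy as np

def fit_conic_lsq(pts):
    """Plain algebraic least-squares conic fit (the statistically biased
    baseline that renormalization and hyper-renormalization improve on).

    Solves min ||M theta|| subject to ||theta|| = 1 for
    theta = (A, B, C, D, E, F) in A x^2 + B x y + C y^2 + D x + E y + F = 0.
    """
    x, y = pts[:, 0], pts[:, 1]
    M = np.stack([x * x, x * y, y * y, x, y, np.ones_like(x)], axis=1)
    # the smallest right singular vector minimizes the algebraic residual
    _, _, vt = np.linalg.svd(M)
    return vt[-1]
```

On noise-free points the smallest singular value is zero and the true conic is recovered exactly; the bias the paper analyzes only appears once noise enters.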
Active 3D measurement methods using a video projector carry the implicit limitation that the projected patterns must be in focus on the target object. This limitation sets a severe constraint on the possible depth range for reconstruction. To overcome the problem, this paper proposes a Depth from Defocus (DfD) method that uses multiple patterns with different in-focus depths to expand the depth range. With this method, not only is the depth range extended, but the shape can also be recovered even if there is an obstacle between the projector and the target, because of the projector's large aperture. Furthermore, since DfD requires no baseline between the cameras and the projector, occlusion does not occur with the method. To verify its effectiveness, several experiments using an actual system were conducted to estimate the depth of several objects.
Masuyama, H., Kawasaki, H. and Furukawa, R.: Depth from Projector's Defocus Based on Multiple Focus Pattern Projection, IPSJ Transactions on Computer Vision and Applications, Vol. 6, pp. 88–92 (2014). DOI: 10.2197/ipsjtcva.6.88
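The depth-blur relation that DfD exploits can be sketched with the thin-lens model: a pattern focused at one depth grows a blur circle whose size varies with the surface depth. An idealized geometric-blur formula, not the calibrated model used in the paper:

```python
def blur_diameter(depth, focus_depth, focal_length, aperture):
    """Thin-lens blur-circle diameter for a projector focused at
    focus_depth, illuminating a surface at depth (all units metres).

    Large projector apertures give a strong depth-blur relation,
    which is what makes projector-side DfD workable.
    """
    # image-side distances from the thin-lens equation 1/f = 1/o + 1/i
    i_focus = 1.0 / (1.0 / focal_length - 1.0 / focus_depth)
    i_obj = 1.0 / (1.0 / focal_length - 1.0 / depth)
    # geometric blur: aperture scaled by the relative defocus
    return aperture * abs(i_obj - i_focus) / i_obj
```

The blur vanishes at the in-focus depth and grows away from it, so patterns with several in-focus depths together cover a wide working range.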
This paper focuses on initializing 3-D reconstruction from scratch, without any prior scene information. Traditionally, this has been done from two-view matching, which is prone to the degeneracy called “imaginary focal lengths.” We overcome this difficulty by using three images, but we do not require three-view matching; all we need is three fundamental matrices separately computed by pairwise image matching. We exploit the redundancy of the three fundamental matrices to optimize the camera parameters and the 3-D structure. The main theme of this paper is an analytical procedure for computing the positions, orientations, and internal parameters of the three cameras from the three fundamental matrices. The emphasis is on resolving the ambiguity of the solution resulting from the sign indeterminacy of the fundamental matrices. Numerical simulation shows that imaginary focal lengths are less likely with our three-view method, resulting in higher accuracy than the conventional two-view method. We also test the degeneracy tolerance of our method on endoscopic intestinal tract images, for which the camera configuration is almost always nearly degenerate. We demonstrate that our method obtains more detailed intestine structures than two-view reconstruction, and we observe how the three-view reconstruction is refined by bundle adjustment. Our method is expected to broaden medical applications of endoscopic images.
Kanazawa, Y., Sugaya, Y. and Kanatani, K.: Decomposing Three Fundamental Matrices for Initializing 3-D Reconstruction from Three Views, IPSJ Transactions on Computer Vision and Applications, Vol. 6, pp. 120–131 (2014). DOI: 10.2197/ipsjtcva.6.120
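As a small building block of such pairwise processing, the epipoles of each fundamental matrix are its left and right null vectors and can be read directly off the SVD. The sign ambiguity the paper resolves is already visible here, since null vectors are determined only up to sign:

```python
import numpy as np

def epipoles(F):
    """Epipoles of a fundamental matrix F (3x3, rank 2).

    The right null vector e satisfies F e = 0 (epipole in image 1);
    the left null vector e' satisfies F^T e' = 0 (epipole in image 2).
    Both are homogeneous, i.e. defined only up to a nonzero scale.
    """
    u, _, vt = np.linalg.svd(F)
    e = vt[-1]          # right singular vector of the smallest singular value
    e_prime = u[:, -1]  # left singular vector of the smallest singular value
    return e, e_prime
```

Any rank-2 matrix works as a test case; a skew-symmetric matrix `[a]_x` has `a` as both null vectors.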
This paper aims at realizing a practical image-based 3D surface capture system for underwater objects. Image-based 3D shape acquisition of objects in water has a wide variety of academic and industrial applications because of its non-contact and non-invasive sensing properties. For example, 3D shape capture of fertilized eggs and young fish can provide a quantitative evaluation method for life science and aquaculture. To realize such a system, we utilize fully calibrated multiview projectors and cameras in water (Fig. 1). Underwater projectors serve as reverse cameras while providing additional texture on poorly textured targets. Although underwater photography involves other complex light events such as scattering [3,16,17], specularity [4], and transparency [13], this paper focuses on the refraction caused by flat housings, because accounting for it is one of the main difficulties in image-based 3D surface estimation in water: flat housings cause epipolar lines to be curved, and hence the local support window for texture matching to vary. To cope with this issue, one can project 3D candidate points in water onto 2D image planes while taking the refraction into account explicitly. However, projecting a 3D point in water to a camera through a flat housing is known to be time-consuming, requiring the solution of a 12th-degree equation for each projection [1]. This fact indicates that 3D shape estimation in water cannot be practical as long as it relies on the analytical projection computation. To solve this problem, we model both the projectors and the cameras with flat housings using the pixel-wise varifocal model [9]. Since this virtual camera model provides an efficient forward (3D-to-2D) projection, it makes the 3D shape estimation process feasible. The key contribution of this paper is twofold. First, we propose a practical method to calibrate underwater projectors with flat housings based on the pixel-wise varifocal model. Second, we present a system for underwater 3D surface capture based on the space carving principle [12], using multiple projectors and cameras in water.
Kawahara, R., Nobuhara, S. and Matsuyama, T.: Underwater 3D Surface Capture Using Multi-view Projectors and Cameras with Flat Housings, IPSJ Transactions on Computer Vision and Applications, Vol. 6, pp. 43–47 (2014). DOI: 10.2197/IPSJTCVA.6.43
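The per-interface refraction that makes analytic projection through a flat housing expensive is Snell's law in vector form; the 12th-degree equation arises from chaining such steps across the housing while searching for the ray that hits a given 3D point. A minimal single-interface sketch (unit vectors assumed):

```python
import numpy as np

def refract(d, n, eta):
    """Refract unit ray direction d at a flat interface with unit normal n
    (pointing against the incoming ray), for index ratio eta = n1 / n2.

    Returns the unit refracted direction, or None on total internal
    reflection. Virtual camera models such as the pixel-wise varifocal
    model exist precisely to avoid solving for this chain per projection.
    """
    cos_i = -np.dot(n, d)
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)  # Snell: sin_t = eta * sin_i
    if sin2_t > 1.0:
        return None  # total internal reflection
    cos_t = np.sqrt(1.0 - sin2_t)
    return eta * d + (eta * cos_i - cos_t) * n
```

At normal incidence the ray passes straight through; at oblique incidence the transverse component shrinks by the index ratio, exactly as Snell's law dictates.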
In this work, we propose what is, to the best of our knowledge, the first stand-alone large-scale image classification system running on an Android smartphone. The objective of this work is to show that mobile large-scale image classification requires no communication with external servers. To that end, we propose a scalar-based compression method for the weight vectors of linear classifiers. As an additional benefit, the proposed method does not need to uncompress the compressed vectors to evaluate the classifiers, which saves recognition time. We have implemented a large-scale image classification system on an Android smartphone that can perform 1000-class classification of a given image in 0.270 seconds. Our experiments show that compressing the weights to 1/8 of their size led to only a 0.80% performance loss for 1000-class classification on the ILSVRC2012 dataset. In addition, the experimental results indicate that weight vectors compressed to low bit depths, even in the binarized case (bit = 1), remain valid for classification of high-dimensional vectors.
Kawano, Y. and Yanai, K.: ILSVRC on a Smartphone, IPSJ Transactions on Computer Vision and Applications, Vol. 6, pp. 83–87 (2014). DOI: 10.2197/ipsjtcva.6.83
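The idea of evaluating a linear classifier directly on compressed weights can be sketched with simple scalar quantization: keep small-integer codes plus one scale per vector, and fold the scale into the dot product. A hypothetical illustration; the paper's exact quantizer and bit packing differ:

```python
import numpy as np

def quantize_weights(w, bits=8):
    """Scalar-quantize a weight vector; with bits=1 this is binarization.

    Returns (q, scale): integer codes and one per-vector scale factor,
    so classifier scores never require reconstructing float weights.
    """
    if bits == 1:
        # binarization: keep only the signs plus one global magnitude
        return np.where(w >= 0, 1, -1).astype(np.int8), float(np.abs(w).mean())
    levels = 2 ** (bits - 1)
    scale = float(np.abs(w).max()) / levels
    q = np.clip(np.round(w / scale), -levels, levels - 1).astype(np.int8)
    return q, scale

def score(q, scale, x):
    """Linear classifier score computed directly on the compressed codes."""
    return scale * float(q.astype(np.float32) @ x)
```

At 8 bits the score is close to the uncompressed one; even the 1-bit variant tends to preserve the sign of the score, which is what matters for ranking classes.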