Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412050
Adarsh Jamadandi, Rishabh Tigadoli, R. Tabib, U. Mudenagudi
In this paper, we propose a method for learning representations in a space of Gaussian-like distributions defined on a novel geometrical space called Kinematic space. Non-Euclidean geometry has recently come into vogue for deep representation learning; in particular, models of hyperbolic geometry such as the Poincaré and Lorentz models have proven useful for learning hierarchical representations. Going beyond manifolds of constant curvature, although it offers better representation capacity, can mean giving up computationally tractable tools such as Riemannian optimization methods. Here, we explore a pseudo-Riemannian auxiliary Lorentzian space called Kinematic space and provide a principled approach for constructing a Gaussian-like distribution on it that is compatible with gradient-based learning methods, yielding a probabilistic word embedding framework. In contrast to mapping lexically distributed representations to single point vectors in Euclidean space, we advocate mapping entities to density-based representations, which provide explicit control over the uncertainty in the representations. We test our framework by embedding the WordNet noun hierarchy, a large lexical database; our experiments show strong, consistent improvements in Mean Rank and Mean Average Precision (MAP) over probabilistic word embedding frameworks defined on Euclidean and hyperbolic spaces, with an average improvement of 72.68% in MAP and 82.60% in Mean Rank compared to the hyperbolic version. Our work serves as evidence of the utility of novel geometrical spaces for learning hierarchical representations.
{"title":"Probabilistic Word Embeddings in Kinematic Space","authors":"Adarsh Jamadandi, Rishabh Tigadoli, R. Tabib, U. Mudenagudi","doi":"10.1109/ICPR48806.2021.9412050","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412050","url":null,"abstract":"In this paper, we propose a method for learning representations in the space of Gaussian-like distribution defined on a novel geometrical space called Kinematic space. The utility of non-Euclidean geometry for deep representation learning has recently been in vogue, specifically models of hyperbolic geometry such as Poincaré and Lorentz models have proven useful for learning hierarchical representations. Going beyond manifolds with constant curvature, albeit has better representation capacity might lead to unhanding of computationally tractable tools like Riemannian optimization methods. Here, we explore a pseudo-Riemannian auxiliary Lorentzian space called Kinematic space and provide a principled approach for constructing a Gaussian-like distribution, which is compatible with gradient-based learning methods, to formulate a probabilistic word embedding framework. Contrary to, mapping lexically distributed representations to a single point vector in Euclidean space, we advocate for mapping entities to density-based representations, as it provides explicit control over the uncertainty in representations. We test our framework by embedding WordNet-Noun hierarchy, a large lexical database, our experiments report strong consistent improvements in Mean Rank and Mean Average Precision (MAP) values compared to probabilistic word embedding frameworks defined on Euclidean and hyperbolic spaces. We show an average improvement of 72.68% in MAP and 82.60% in Rank compared to the hyperbolic version. Our work serves as evidence for the utility of novel geometrical spaces for learning hierarchical representations.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"8 1","pages":"8759-8765"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72926520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412720
S. Azimi, T. Kaur, T. Gandhi
Stress due to water deficiency in plants can significantly lower agricultural yield. It affects many visible plant traits, such as size and surface area, the number of leaves, and their color. In recent years, computer-vision-based plant phenomics has emerged as a promising tool for plant research and management. Such techniques are non-destructive, non-invasive, and fast, and offer high levels of automation. Pulses such as chickpeas play an important role in ensuring food security in poor countries owing to their high protein and nutrition content. In the present work, we build a dataset comprising shoot images of two chickpea varieties under different moisture-stress conditions. Specifically, we propose a BAT-optimized ResNet-18 model for classifying water-deficiency stress from chickpea shoot images. The BAT algorithm identifies the optimal mini-batch size for training, rather than relying on the traditional manual approach of trial and error. Experiments on two crop varieties (JG and Pusa) show that the BAT-optimized approach achieves accuracies of 96% and 91% for the JG and Pusa varieties respectively, 4% better than the traditional method. The experimental results are also compared with state-of-the-art CNN models such as AlexNet, GoogLeNet, and ResNet-50, and the comparison demonstrates that the proposed BAT-optimized ResNet-18 model outperforms these counterparts.
{"title":"BAT Optimized CNN Model Identifies Water Stress in Chickpea Plant Shoot Images","authors":"S. Azimi, T. Kaur, T. Gandhi","doi":"10.1109/ICPR48806.2021.9412720","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412720","url":null,"abstract":"Stress due to water deficiency in plants can significantly lower the agricultural yield. It can affect many visible plant traits such as size and surface area, the number of leaves and their color, etc. In recent years, computer vision-based plant phenomics has emerged as a promising tool for plant research and management. Such techniques have the advantage of being non-destructive, non-evasive, fast, and offer high levels of automation. Pulses like chickpeas play an important role in ensuring food security in poor countries owing to their high protein and nutrition content. In the present work, we have built a dataset comprising of two varieties of chickpea plant shoot images under different moisture stress conditions. Specifically, we propose a BAT optimized ResNet-18 model for classifying stress induced by water deficiency using chickpea shoot images. BAT algorithm identifies the optimal value of the mini-batch size to be used for training rather than employing the traditional manual approach of trial and error. Experimentation on two crop varieties (JG and Pusa) reveals that BAT optimized approach achieves an accuracy of 96% and 91% for JG and Pusa varieties that is better than the traditional method by 4%. The experimental results are also compared with state of the art CNN models like Alexnet, GoogleNet, and ResNet-50. The comparison results demonstrate that the proposed BAT optimized ResNet-18 model achieves higher performance than the comparison counterparts.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"48 1","pages":"8500-8506"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75375285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9413099
Zitang Sun, S. Kamata, Ruojing Wang
Semantic segmentation requires both a large receptive field and accurate spatial information. Although existing methods based on fully convolutional networks have greatly improved accuracy, their predictions remain unsatisfactory on small objects and boundary regions. We propose a refinement algorithm to improve the results generated by a front-end network. Our method uses a modified double-branch network to generate both segmentation masks and semantic boundaries, which serve as the refinement algorithm's input. We introduce information entropy to represent the confidence of the network's prediction at each pixel. Combined with the semantic boundary, the information entropy allows low-confidence pixels to be captured through Monte Carlo sampling. Each selected pixel serves as an initial seed for directed local search and refinement: starting from the seed, we search for neighboring high-confidence regions and re-label the seed based on their high-confidence results. Notably, our method adopts a directed regional search strategy based on gradient descent to find high-confidence regions efficiently, and it can be flexibly embedded into existing encoder backbones at trivial computational cost. Our refinement algorithm further improves the accuracy of state-of-the-art methods on both the Cityscapes and PASCAL VOC datasets, and on small objects it surpasses most state-of-the-art methods.
{"title":"Semantic Segmentation Refinement Using Entropy and Boundary-guided Monte Carlo Sampling and Directed Regional Search","authors":"Zitang Sun, S. Kamata, Ruojing Wang","doi":"10.1109/ICPR48806.2021.9413099","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9413099","url":null,"abstract":"Semantic segmentation requires both a large receptive field and accurate spatial information. Although existing methods based on a fully convolutional network have greatly improved the accuracy, the prediction results still do not show satisfactory when parsing small objects and boundary regions. We propose a refinement algorithm to improve the result generated by the front network. Our method takes a modified double-branches network to generate both segmentation masks and semantic boundaries, which serve as refinement algorithms' input. We creatively introduce information entropy to represent the confidence of the neural network's prediction corresponding to each pixel. The information entropy combined with the semantic boundary can capture those unpredictable pixels with low-confidence through Monte Carlo sampling. Each selected pixel will serve as the initial seed for directed local search and refinement. According to the initial seed, our purpose is tantamount to searching the neighbor high-confidence regions, and the re-labeling approach is based on high-confidence results. Remarkably, our method adopts a directed regional search strategy based on gradient descent to find the high-confidence region effectively. Our method can be flexibly embedded into the existing encoder backbone at a trivial computational cost. Our refinement algorithm can further improve the state of the art method's accuracy both on Cityscapes and PASCAL VOC datasets. In evaluating some small objects, our method surpasses most of the state of the art methods.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"23 1","pages":"3931-3938"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72580366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412618
Jiali Ding, Tie Liu, Qixin Chen, Zejian Yuan, Yuanyuan Shang
Vehicle scale estimation with a single camera is a typical application in intelligent transportation, and it faces challenges from visual computing in which intensity-based and descriptor-based methods must be balanced. This paper proposes a vehicle scale estimation method based on salient object detection to resolve this problem. A regularized intensity matching method is formulated in the Lie algebra to achieve robust and accurate scale estimation, and descriptor matching and intensity matching are combined in the proposed loss function, which is then minimized. A visual attention mechanism is designed to select textured image patches and discard occluded ones; weights are then assigned to the pixels of the selected patches, alleviating the influence of noise-corrupted pixels. Experiments show that the proposed method significantly outperforms state-of-the-art methods in both the robustness and the accuracy of vehicle scale estimation.
{"title":"Visual Saliency Oriented Vehicle Scale Estimation","authors":"Jiali Ding, Tie Liu, Qixin Chen, Zejian Yuan, Yuanyuan Shang","doi":"10.1109/ICPR48806.2021.9412618","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412618","url":null,"abstract":"Vehicle scale estimation with a single camera is a typical application for intelligent transportation and it faces the challenges from visual computing while intensity-based method and descriptor-based method should be balanced. This paper proposed a vehicle scale estimation method based on salient object detection to resolve this problem. The regularized intensity matching method is proposed in Lie Algebra to achieve robust and accurate scale estimation, and descriptor matching and intensity matching are combined to minimize the proposed loss function. The visual attention mechanism is designed to select image patches with texture and remove the occluded image patches. Then the weights are assigned to pixels from the selected image patches which alleviates the influence of noise-corrupted pixels. The experiments show that the proposed method significantly outperforms state-of-the-art methods with regard to the robustness and accuracy of vehicle scale estimation.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"43 1","pages":"1867-1873"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78624629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412451
Andrea D'Eusanio, S. Pini, G. Borghi, R. Vezzani, R. Cucchiara
Human pose estimation is a fundamental task for many applications in the computer vision community and has been widely investigated in the 2D domain, i.e., on intensity images. Consequently, most of the available methods are based on 2D convolutional neural networks and huge manually annotated RGB datasets, achieving stunning results. In this paper, we propose RefiNet, a multi-stage framework that regresses an extremely precise 3D human pose from a given 2D pose and a depth map. The framework consists of three modules, each specialized in a particular refinement and data representation: depth patches, 3D skeletons, and point clouds. Moreover, we present a new dataset, called Baracca, acquired with RGB, depth, and thermal cameras and created specifically for the automotive context. Experimental results confirm the quality of the refinement procedure, which largely improves the human pose estimates of off-the-shelf 2D methods.
{"title":"RefiNet: 3D Human Pose Refinement with Depth Maps","authors":"Andrea D'Eusanio, S. Pini, G. Borghi, R. Vezzani, R. Cucchiara","doi":"10.1109/ICPR48806.2021.9412451","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412451","url":null,"abstract":"Human Pose Estimation is a fundamental task for many applications in the Computer Vision community and it has been widely investigated in the 2D domain, i.e. intensity images. Therefore, most of the available methods for this task are mainly based on 2D Convolutional Neural Networks and huge manually-annotated RGB datasets, achieving stunning results. In this paper, we propose RefiNet, a multi-stage framework that regresses an extremely-precise 3D human pose estimation from a given 2D pose and a depth map. The framework consists of three different modules, each one specialized in a particular refinement and data representation, i.e. depth patches, 3D skeleton and point clouds. Moreover, we present a new dataset, called Baracca, acquired with RGB, depth and thermal cameras and specifically created for the automotive context. Experimental results confirm the quality of the refinement procedure that largely improves the human pose estimations of off-the-shelf 2D methods.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"12 1","pages":"2320-2327"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78627749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412106
Niluthpol Chowdhury Mithun, Ravdeep Pasricha, E. Papalexakis, A. Roy-Chowdhury
In this paper, we address the problem of using web images to train robust joint embedding models for the image-text retrieval task. Prior webly supervised approaches directly leverage weakly annotated web images in the joint embedding learning framework; their objectives suffer significantly when the ratio of noisy and missing tags associated with the web images is very high. We therefore propose a CP-decomposition-based tensor completion framework that refines the tags of web images by modeling the observed ternary inter-relations between the sets of labeled images, tags, and web images as a tensor. To deal effectively with the high ratio of missing entries likely in our case, we incorporate intra-modal correlation as side information in the proposed framework. Combined with existing webly supervised image-text embedding approaches, our tag refinement provides a more principled way to learn joint embedding models in the presence of significant noise from web data and limited clean labeled data. Experiments on benchmark datasets demonstrate that the proposed approach achieves a significant performance gain in image-text retrieval.
{"title":"Webly Supervised Image-Text Embedding with Noisy Tag Refinement","authors":"Niluthpol Chowdhury Mithun, Ravdeep Pasricha, E. Papalexakis, A. Roy-Chowdhury","doi":"10.1109/ICPR48806.2021.9412106","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412106","url":null,"abstract":"In this paper, we address the problem of utilizing web images in training robust joint embedding models for the image-text retrieval task. Prior webly supervised approaches directly leverage weakly annotated web images in the joint embedding learning framework. The objective of these approaches would suffer significantly when the ratio of noisy and missing tags associated with the web images is very high. In this regard, we propose a CP decomposition based tensor completion framework to refine the tags of web images by modeling observed ternary inter-relations between the sets of labeled images, tags, and web images as a tensor. To effectively deal with the high ratio of missing entries likely in our case, we incorporate intra-modal correlation as side information in the proposed framework. Our tag refinement approach combined with existing web supervised image-text embedding approaches provide a more principled way for learning the joint embedding models in the presence of significant noise from web data and limited clean labeled data. Experiments on benchmark datasets demonstrate that the proposed approach helps to achieve a significant performance gain in image-text retrieval.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"27 1","pages":"7454-7461"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78409049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412477
Minh-Hieu Phan, S. L. Phung, A. Bouzerdoum
Depth perception is essential for scene understanding, autonomous navigation, and augmented reality. Depth estimation from a single 2D image is challenging due to the lack of reliable cues, e.g., stereo correspondences and motion. Modern approaches exploit multi-scale feature extraction to provide more powerful representations for deep networks; however, these studies combine the extracted multi-scale features only by simple addition or concatenation. This paper proposes a novel region-based self-attention (rSA) unit for effective feature fusion. The rSA unit recalibrates the multi-scale responses by explicitly modelling the dependency between channels in separate image regions. We discretize continuous depths to formulate an ordinal depth classification problem in which the relative order between categories is preserved. Experiments are performed on a dataset of 4410 RGB-D images captured in outdoor environments on the University of Wollongong's campus. The proposed module improves models on small-sized datasets by 22% to 40%.
{"title":"Ordinal Depth Classification Using Region-based Self-attention","authors":"Minh-Hieu Phan, S. L. Phung, A. Bouzerdoum","doi":"10.1109/ICPR48806.2021.9412477","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412477","url":null,"abstract":"Depth perception is essential for scene understanding, autonomous navigation and augmented reality. Depth estimation from a single 2D image is challenging due to the lack of reliable cues, e.g. stereo correspondences and motions. Modern approaches exploit multi-scale feature extraction to provide more powerful representations for deep networks. However, these studies only use simple addition or concatenation to combine the extracted multi-scale features. This paper proposes a novel region-based self-attention (rSA) unit for effective feature fusions. The rSA recalibrates the multi-scale responses by explicitly modelling the dependency between channels in separate image regions. We discretize continuous depths to formulate an ordinal depth classification problem in which the relative order between categories is preserved. The experiments are performed on a dataset of 4410 RGB-D images, captured in outdoor environments at the University of Wollongong's campus. The proposed module improves the models on small-sized datasets by 22% to 40%.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"84 1","pages":"3620-3627"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75909537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412766
Yueran Pan, Kunjing Cai, Ming Cheng, Xiaobing Zou, Ming Li
Autism spectrum disorder (ASD) is a neuro-developmental disorder that causes deficits in social functioning. Early screening of young children for ASD is important for reducing its impact on their lives. Traditional screening methods rely mainly on protocol-based interviews and subjective evaluations by clinicians and domain experts, which require advanced expertise and intensive labor. To standardize the ASD screening process, we design a “Responsive Social Smile” protocol and the associated experimental setup, and we propose a machine-learning-based assessment framework for early ASD screening. By integrating speech recognition and computer vision technologies, the proposed framework can quantitatively analyze children's behaviors under well-designed protocols. We collect 196 stimulus samples from 41 children with an average age of 23.34 months; the proposed method obtains 85.20% accuracy in predicting stimulus scores and 80.49% accuracy in the final ASD prediction. These results indicate that our model approaches the average level of domain experts on this “Responsive Social Smile” protocol.
{"title":"Responsive Social Smile: A Machine Learning based Multimodal Behavior Assessment Framework towards Early Stage Autism Screening","authors":"Yueran Pan, Kunjing Cai, Ming Cheng, Xiaobing Zou, Ming Li","doi":"10.1109/ICPR48806.2021.9412766","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412766","url":null,"abstract":"Autism spectrum disorder (ASD) is a neuro-developmental disorder, which causes deficits in social lives. Early screening of ASD for young children is important to reduce the impact of ASD on people's lives. Traditional screening methods mainly rely on protocol-based interviews and subjective evaluations from clinicians and domain experts, which requires advanced expertise and intensive labor. To standardize the process of ASD screening, we design a “Responsive Social Smile” protocol and the associated experimental setup. Moreover, we propose a machine learning based assessment framework for early ASD screening. By integrating speech recognition and computer vision technologies, the proposed framework can quantitatively analyze children's behaviors under well-designed protocols. We collect 196 stimulus samples from 41 children with an average age of 23.34 months, and the proposed method obtains 85.20% accuracy for predicting stimulus scores and 80.49% accuracy for the final ASD prediction. This result indicates that our model approaches the average level of domain experts in this “Responsive Social Smile” protocol.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"195 1","pages":"2240-2247"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75889952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412790
Chang Jie Wu, Qing Wang, Jianshu Zhang, Jun Du, Jiaming Wang, Jiajia Wu, Jinshui Hu
Recently, many studies have proposed attention-based encoder-decoder models that convert a sequence of trajectory points into a LaTeX string for online handwritten mathematical expression recognition (OHMER); the recognition performance of these models relies critically on the accuracy of the attention. In this paper, unlike previous methods that employ a soft attention model, we propose a posterior attention model, which modifies the attention probabilities after observing the output probabilities generated by the soft attention model. To further improve the posterior attention mechanism, we propose a stroke average pooling layer that aggregates point-level features from the encoder into stroke-level features. We argue that posterior attention is better applied to stroke-level than to point-level features, since the output probabilities generated per stroke are more reliable than those generated per point, and we support this claim through experimental analysis. Validated on the CROHME competition task, stroke-based posterior attention achieves expression recognition rates of 54.26% on CROHME 2014 and 51.75% on CROHME 2016. Attention visualization analysis empirically demonstrates that the posterior attention mechanism achieves better alignment accuracy than the soft attention mechanism.
{"title":"Stroke Based Posterior Attention for Online Handwritten Mathematical Expression Recognition","authors":"Chang Jie Wu, Qing Wang, Jianshu Zhang, Jun Du, Jiaming Wang, Jiajia Wu, Jinshui Hu","doi":"10.1109/ICPR48806.2021.9412790","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412790","url":null,"abstract":"Recently, many researches propose to employ attention based encoder-decoder models to convert a sequence of trajectory points into a LaTeX string for online handwritten mathematical expression recognition (OHMER), and the recognition performance of these models critically relies on the accuracy of the attention. In this paper, unlike previous methods which basically employ a soft attention model, we propose to employ a posterior attention model, which modifies the attention probabilities after observing the output probabilities generated by the soft attention model. In order to further improve the posterior attention mechanism, we propose a stroke average pooling layer to aggregate point-level features obtained from the encoder into stroke-level features. We argue that posterior attention is better to be implemented on stroke-level features than point-level features as the output probabilities generated by stroke is more convincing than generated by point, and we prove that through experimental analysis. Validated on the CROHME competition task, we demonstrate that stroke based posterior attention achieves expression recognition rates of 54.26% on CROHME 2014 and 51.75% on CROHME 2016. According to attention visualization analysis, we empirically demonstrate that the posterior attention mechanism can achieve better alignment accuracy than the soft attention mechanism.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"34 1","pages":"2943-2949"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75084647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412794
Ciarán Donegan, H. Yous, Saksham Sinha, Jonathan Byrne
The success of deep learning at computer vision tasks has led to an ever-increasing number of applications on edge devices, often with the help of edge AI hardware accelerators like the Intel Movidius Vision Processing Unit (VPU). Performing computer vision tasks on edge devices is challenging: many convolutional neural networks (CNNs) are too complex to run on devices with limited computing power. This has created great interest in designing efficient CNNs, and one promising way of doing so is through neural architecture search (NAS), which aims to automate the design of neural networks. NAS can also optimize multiple objectives together, such as accuracy and efficiency, which is difficult for humans. In this paper, we use a differentiable NAS method to find efficient CNNs for the VPU that achieve state-of-the-art classification accuracy on ImageNet. Our NAS-designed model outperforms MobileNetV2, with almost 1% higher top-1 accuracy while being 13% faster on the Myriad X VPU. To the best of our knowledge, this is the first time a VPU-specific CNN has been designed using a NAS algorithm. Our results also reiterate that efficient networks must be designed for each specific hardware platform: we show that efficient networks targeted at other devices do not perform as well on the VPU.
{"title":"VPU Specific CNNs through Neural Architecture Search","authors":"Ciarán Donegan, H. Yous, Saksham Sinha, Jonathan Byrne","doi":"10.1109/ICPR48806.2021.9412794","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412794","url":null,"abstract":"The success of deep learning at computer vision tasks has led to an ever-increasing number of applications on edge devices. Often with the use of edge AI hardware accelerators like the Intel Movidius Vision Processing Unit (VPU). Performing computer vision tasks on edge devices is challenging. Many Convolutional Neural Networks (CNNs) are too complex to run on edge devices with limited computing power. This has created large interest in designing efficient CNNs and one promising way of doing this is through Neural Architecture Search (NAS). NAS aims to automate the design of neural networks. NAS can also optimize multiple different objectives together, like accuracy and efficiency, which is difficult for humans. In this paper, we use a differentiable NAS method to find efficient CNNs for VPU that achieves state-of-the-art classification accuracy on ImageNet. Our NAS designed model outperforms MobileNetV2, having almost 1% higher top-1 accuracy while being 13% faster on MyriadX VPU. To the best of our knowledge, this is the first time a VPU specific CNN has been designed using a NAS algorithm. Our results also reiterate the fact that efficient networks must be designed for each specific hardware. We show that efficient networks targeted at different devices do not perform as well on the VPU.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"385 1","pages":"9772-9779"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75138058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}