Shape and Morphological Transformation Based Features for Language Identification in Indian Document Images
M. Hangarge, B. V. Dhandra
2008 First International Conference on Emerging Trends in Engineering and Technology, 2008-07-16
DOI: 10.1109/ICETET.2008.177 (https://doi.org/10.1109/ICETET.2008.177)
Citations: 8
Abstract
This paper describes a technique for language identification in document images that discriminates five major Indian languages: Hindi, Marathi, Sanskrit, Assamese and Bengali, which are written in the Devnagari and Bangla scripts. A text block of each language containing at least two text lines is selected and characterized using global and local features. Morphological transformations are used to decompose a text block in two directions at three levels, to capture fine texture primitives. Shape features of connected components are used to retain the local properties of the text block. A combination of these features is then used to classify 500 text blocks of the proposed languages with a binary decision tree and a KNN classifier. The proposed method is quite different from reported methods for non-Indian languages, which are based on shape coding of characters and words and on document vectorization. The method captures word shapes directly, without segmentation, and is tolerant to variations in font style and size. The language identification results are encouraging.
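The abstract does not give the structuring elements, the three levels, or the classifier settings, so the following is only a minimal sketch of the general idea under stated assumptions: a binarized text block is opened with line-shaped structuring elements in two directions (horizontal and vertical) at three lengths, the fraction of foreground pixels that survives each opening is taken as a feature, and a plain k-nearest-neighbour vote assigns a label. The function names, the line lengths `(2, 4, 8)`, and the Euclidean distance are all hypothetical choices, not the authors' method; the connected-component shape features and the binary decision tree are left out.

```python
from collections import Counter
import math


def open_with_line(img, length, horizontal):
    """Binary opening of a 0/1 image with a 1-D line of `length` pixels."""
    h, w = len(img), len(img[0])
    dy, dx = (0, 1) if horizontal else (1, 0)
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            cells = [(r + k * dy, c + k * dx) for k in range(length)]
            # Erosion: the line anchored at (r, c) must fit inside the foreground.
            if all(y < h and x < w and img[y][x] for y, x in cells):
                # Dilation: paint back every pixel covered by the fitting line.
                for y, x in cells:
                    out[y][x] = 1
    return out


def directional_features(img, lengths=(2, 4, 8)):
    """Six features: surviving-pixel ratio for 2 directions x 3 line lengths.

    The lengths are assumed values chosen for illustration only.
    """
    foreground = sum(map(sum, img))
    feats = []
    for horizontal in (True, False):
        for length in lengths:
            opened = open_with_line(img, length, horizontal)
            kept = sum(map(sum, opened))
            feats.append(kept / foreground if foreground else 0.0)
    return feats


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to `query`.

    `train` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In use, `directional_features` would be called once per binarized text block and the resulting vectors, paired with language labels, passed to `knn_predict` as `train`. Ratios rather than raw pixel counts are used here so that the features do not scale with the size of the block, which is one plausible way to approach the font-size tolerance the abstract mentions.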