{"title":"Text extraction from color documents-clustering approaches in three and four dimensions","authors":"T. Perroud, K. Sobottka, H. Bunke, L. Hall","doi":"10.1109/ICDAR.2001.953923","DOIUrl":null,"url":null,"abstract":"Colored paper documents often contain important text information. For automating the retrieval process, identification of text elements is essential. In order to reduce the number of colors in a scanned document, color clustering is usually done first. In this article two histogram-based color clustering algorithms are investigated. The first is based on the RGB color space exclusively, while the second takes spatial information into account, in addition to the colors. Experimental results have shown that the use of spatial information in the clustering algorithm has a positive impact. Thus the automatic retrieval of text information can be improved. The proposed methods for clustering are not restricted to document images. They can also be used for processing Web or video images, for example.","PeriodicalId":277816,"journal":{"name":"Proceedings of Sixth International Conference on Document Analysis and Recognition","volume":"01 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2001-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of Sixth International Conference on Document Analysis and Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2001.953923","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 30
Abstract
Colored paper documents often contain important text information. For automating the retrieval process, identification of text elements is essential. In order to reduce the number of colors in a scanned document, color clustering is usually done first. In this article two histogram-based color clustering algorithms are investigated. The first is based on the RGB color space exclusively, while the second takes spatial information into account, in addition to the colors. Experimental results have shown that the use of spatial information in the clustering algorithm has a positive impact. Thus the automatic retrieval of text information can be improved. The proposed methods for clustering are not restricted to document images. They can also be used for processing Web or video images, for example.