José C. Gutiérrez, Rodolfo Valiente, M. T. Sadaike, Daniel F. Soriano, G. Bressan, W. Ruggiero
Proceedings of the 23rd Brazilian Symposium on Multimedia and the Web. Published 2017-10-17. DOI: 10.1145/3126858.3131594 (https://doi.org/10.1145/3126858.3131594)
Mechanism for Structuring the Data from a Generic Identity Document Image using Semantic Analysis
The enormous variety of identity documents in existence makes it difficult to standardize a system capable of extracting all the information of interest they present. Systems that use templates to classify information by its position are therefore limited by the number of templates they can recognize. This paper presents a novel mechanism intended to automatically classify the major information of interest in generic identity documents. The proposal is designed to be easily adaptable to any system capable of detecting and extracting text from an identity document image. To assign meaning to the extracted text, it structures the data using semantic analysis. The mechanism consists of two main steps: first, all the text data are grouped into sentences or near-sentences based on the Euclidean distance between words; second, the sentences are analyzed to find keywords that allow the information to be structured according to its semantics and presented as abstractions. The proposal is designed to store the data as abstractions of its meaning, which improves the scalability of the system and allows better use of the information by different services, by the end user, or by an automated decision-making process.
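The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the grouping rule, the distance threshold, or the keyword list, so everything named here (the `Word` type, `group_words`, `structure`, `KEYWORDS`, the use of word-centre coordinates, and the value of `max_dist`) is a hypothetical assumption chosen for illustration. Step 1 is approximated by single-link grouping on Euclidean distance; step 2 by a plain keyword lookup.

```python
import math
from dataclasses import dataclass


@dataclass
class Word:
    """One word as a text detector might report it (hypothetical format)."""
    text: str
    x: float  # assumed: x coordinate of the word's centre, in pixels
    y: float  # assumed: y coordinate of the word's centre, in pixels


def group_words(words, max_dist):
    """Step 1 (sketch): group words into sentences or near-sentences.

    Two words end up in the same group when a chain of words connects
    them and each link in the chain is at most max_dist long.
    """
    unvisited = list(words)
    sentences = []
    while unvisited:
        frontier = [unvisited.pop(0)]
        group = []
        while frontier:
            w = frontier.pop()
            group.append(w)
            near = [u for u in unvisited
                    if math.dist((w.x, w.y), (u.x, u.y)) <= max_dist]
            for u in near:
                unvisited.remove(u)
            frontier.extend(near)
        # Reading order: top to bottom, then left to right. This assumes
        # words on one text line share the same y value.
        group.sort(key=lambda w: (w.y, w.x))
        sentences.append(" ".join(w.text for w in group))
    return sentences


# Hypothetical keyword table mapping a label on the document to a field.
KEYWORDS = {"name": "name", "birth": "birth_date"}


def structure(sentences, keywords):
    """Step 2 (sketch): use the first keyword found in each sentence to
    decide which field the rest of that sentence belongs to."""
    record = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            field = keywords.get(token.lower().strip(":"))
            if field and field not in record:
                record[field] = " ".join(tokens[i + 1:])
                break
    return record


words = [
    Word("Name", 10, 10), Word("Ana", 60, 10), Word("Silva", 100, 10),
    Word("Birth", 10, 80), Word("1990-01-01", 70, 80),
]
sentences = group_words(words, max_dist=60)
record = structure(sentences, KEYWORDS)
print(sentences)
print(record)
```

In this toy input the two text lines are 70 pixels apart, so a threshold of 60 keeps them in separate groups while still linking neighbouring words on the same line. A real system would likely need a threshold that scales with font size and would measure distance between bounding-box edges instead of centres.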