Recovering Damaged Documents to Improve Information Retrieval Processes
Authors: Angel L. Garrido, Álvaro Peiró
Journal: Journal of Integrated OMICS
Published: 2018-12-19
DOI: 10.5584/JIOMICS.V8I3.230
Although computer forensics is usually associated with the investigation of computer crimes, it can also be used in civil procedures. One use case is information retrieval from damaged documents, in which words have been altered either accidentally or intentionally. In this paper, we present a new tool that retrieves information from large volumes of documents whose contents have been damaged. We have designed a new approach to recover the original words, composed of two stages: a text cleaning filter, which removes non-relevant information, and a text correction unit, which combines a general-purpose spell checker with an N-gram-based spell checker built specifically for the domain of the documents. The benefits of this combined approach are twofold: on the one hand, the general spell checker lets us leverage the general-purpose techniques commonly used to perform corrections; on the other hand, the N-gram-based model lets us adapt them to the particular domain at hand by exploiting text regularities detected in successfully processed domain documents. The corrected text improves automatic information retrieval from the documents. We have tested the tool on a real data set with an information extraction tool based on semantic technologies, in collaboration with the Spanish company InSynergy Consulting.
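The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the filter rules, the spell checker, or the N-gram order, so everything here is an assumption. The function names (`clean`, `train_bigrams`, `correct`) are hypothetical, string similarity from the standard library's `difflib` stands in for a general-purpose spell checker, and word bigrams counted on undamaged domain text stand in for the domain N-gram model.

```python
import re
from collections import Counter
from difflib import get_close_matches


def clean(text):
    """Stage 1 (assumed filter): lowercase, split into alphanumeric tokens,
    and drop tokens that contain no letter (symbol runs, bare numbers)."""
    tokens = re.findall(r"[^\W_]+", text.lower())
    return [t for t in tokens if any(ch.isalpha() for ch in t)]


def train_bigrams(domain_sentences):
    """Count word bigrams in domain text that is assumed to be undamaged."""
    counts = Counter()
    for sentence in domain_sentences:
        tokens = clean(sentence)
        counts.update(zip(tokens, tokens[1:]))
    return counts


def correct(tokens, vocabulary, bigrams):
    """Stage 2 (assumed design): propose candidates by string similarity,
    then prefer the candidate that most often follows the previous word
    in the domain corpus."""
    vocab = sorted(vocabulary)
    corrected = []
    for token in tokens:
        if token in vocabulary:
            corrected.append(token)
            continue
        # Candidates come back ordered from most to least similar.
        candidates = get_close_matches(token, vocab, n=5, cutoff=0.6)
        if not candidates:
            corrected.append(token)  # nothing plausible: leave unchanged
            continue
        previous = corrected[-1] if corrected else None
        # max() keeps the first (most similar) candidate when counts tie.
        best = max(candidates, key=lambda c: bigrams[(previous, c)])
        corrected.append(best)
    return corrected


# Toy data, invented for illustration only.
domain = [
    "the contract was signed by the buyer",
    "the buyer signed the contract",
]
bigrams = train_bigrams(domain)
vocabulary = {w for s in domain for w in clean(s)} | {"contact", "sign", "signal"}

damaged = "the c0ntract was signd by the buyer ###"
print(correct(clean(damaged), vocabulary, bigrams))
```

In this sketch the similarity step plays the role of the general checker, proposing words regardless of domain, and the bigram counts break ties in favor of sequences actually seen in the domain text. A fuller system would presumably weigh both scores together rather than apply them in sequence.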
About the journal:
JIOMICS provides a forum for the publication of original research papers, letters to the editor, short communications, and critical reviews in all branches of pure and applied –omics subjects, such as proteomics, metabolomics, metallomics and genomics. Special interest is given to papers that cover more than one –omics subject. Papers are evaluated based on scientific novelty and demonstrated scientific applicability. Original research papers on fundamental studies, and on novel sensor and instrumentation development, are especially encouraged. Novel or improved findings in areas such as clinical, medicinal, biological, environmental and materials –omics are welcome.