R. Sitaram, Gopal Datt Joshi, S. Noushath, Pulkit Parikh, Vishal Gupta
2008 The Eighth IAPR International Workshop on Document Analysis Systems, 2008-09-16. DOI: 10.1109/DAS.2008.69
PaperDiff: A Script Independent Automatic Method for Finding the Text Differences Between Two Document Images
In this paper, we introduce a novel concept called PaperDiff and propose an algorithm to implement it. The aim of PaperDiff is to compare two printed (paper) documents using their images and to determine the differences between them in terms of text inserted, deleted, and substituted. This lets an end user compare two documents that are both already printed, or where only one is printed (the other could be in electronic form, such as an MS Word *.doc file). The proposed algorithm is based on word image comparison and is suitable even for symbol strings and for any script or language (including multiple scripts) in the documents, cases where even mature optical character recognition (OCR) technology has had very little success. PaperDiff helps end users such as lawyers and novelists compare new versions of a document with older ones. The method remains suitable when the content is formatted differently in the two input documents, so that the structures of the document images differ (e.g., differing page widths or page structure). In an experiment on single-column text documents, PaperDiff achieved 99.2% accuracy in detecting 135 induced differences across 10 pairs of documents.
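The abstract does not say which word-image features or alignment procedure the authors use, so the following is only an illustrative sketch of the general idea: treat each document as a sequence of word images, decide whether two word images match without recognizing their characters, and align the two sequences to label insertions, deletions, and substitutions. The column ink-count profile, the tolerance `tol`, and the function names `words_match` and `diff_words` are all assumptions made for this example, and the alignment is a standard edit-distance dynamic program rather than the method of the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical feature: number of ink pixels in each column of a word image.
Profile = Sequence[int]


def words_match(a: Profile, b: Profile, tol: float = 1.0) -> bool:
    """Assumed matcher: same width and small mean absolute difference."""
    if len(a) != len(b):
        return False
    if not a:
        return True
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return mean_abs_diff <= tol


def diff_words(old: List[Profile], new: List[Profile]) -> List[Tuple[str, int, int]]:
    """Return (operation, old_index, new_index) tuples; -1 means 'no index'."""
    n, m = len(old), len(new)
    # cost[i][j] = minimum number of edits turning old[:i] into new[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = words_match(old[i - 1], new[j - 1])
            cost[i][j] = min(
                cost[i - 1][j - 1] + (0 if same else 1),  # match or substitute
                cost[i - 1][j] + 1,                       # delete from old
                cost[i][j - 1] + 1,                       # insert from new
            )
    # Walk back from the end of both sequences to recover the edit operations.
    ops: List[Tuple[str, int, int]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = words_match(old[i - 1], new[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + (0 if same else 1):
                if not same:
                    ops.append(("substitute", i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", i - 1, -1))
            i -= 1
        else:
            ops.append(("insert", -1, j - 1))
            j -= 1
    ops.reverse()
    return ops


# Toy profiles standing in for word images cropped from two scanned pages.
A, B, C, D, X = (1, 2, 3), (5, 5, 5, 5), (0, 9), (4, 4, 4), (9, 9, 9, 9)
print(diff_words([A, B, C], [A, X, C, D]))
```

Because the alignment works on the reading-order sequence of words and not on their page positions, a sketch like this is indifferent to line breaks and page width, which is consistent with the abstract's claim about differently formatted inputs. A real system would still need word segmentation and a matcher that tolerates scanning noise and scale changes.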