Fine-grained change detection in structured text documents

Proceedings of the ACM Symposium on Document Engineering, pp. 87-96. Pub Date: 2014-09-16. DOI: 10.1145/2644866.2644880

Hannes Dohrn, D. Riehle


Citations: 4

Abstract

Detecting and understanding changes between document revisions is an important task. The acquired knowledge can be used to classify the nature of a new document revision or to support a human editor in the review process. While purely textual change detection algorithms offer fine-grained results, they do not understand the syntactic meaning of a change. By representing structured text documents as XML documents, we can apply tree-to-tree correction algorithms to identify the syntactic nature of a change.

Many algorithms for change detection in XML documents have been proposed, but most of them focus on the intricacies of generic XML data and emphasize speed over the quality of the result. Structured text requires a change detection algorithm to pay close attention to the content in text nodes; however, recent algorithms treat text nodes as black boxes.

We present an algorithm that combines the advantages of the purely textual approach with the advantages of tree-to-tree change detection by redistributing text from non-overlapping common substrings to the nodes of the trees. This allows us to spot changes not only in the structure but also in the text itself, thus achieving higher quality and a fine-grained result in linear time on average. The algorithm is evaluated by applying it to the corpus of structured text documents that can be found in the English Wikipedia.
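The idea of redistributing text from non-overlapping common substrings to tree nodes can be illustrated with a small sketch. This is not the authors' algorithm: it assumes each revision is reduced to its text nodes in document order, and it uses Python's `difflib.SequenceMatcher` as a stand-in for whatever common-substring method the paper uses. The names `matching_blocks`, `split_nodes`, `redistribute`, and `min_len` are hypothetical.

```python
from difflib import SequenceMatcher


def matching_blocks(old_text, new_text, min_len=2):
    """Return non-overlapping common substrings as (old_start, new_start, length)."""
    matcher = SequenceMatcher(None, old_text, new_text, autojunk=False)
    return [(m.a, m.b, m.size)
            for m in matcher.get_matching_blocks()
            if m.size >= min_len]


def split_nodes(nodes, spans):
    """Split text nodes at the boundaries of the common spans.

    nodes: text-node strings in document order.
    spans: non-overlapping (start, end) offsets into the concatenated text.
    Returns (node_index, segment, is_common) triples.
    """
    cuts = {pos for span in spans for pos in span}
    result = []
    offset = 0
    for index, text in enumerate(nodes):
        end = offset + len(text)
        # Cut points that fall strictly inside this node.
        bounds = [offset] + sorted(c for c in cuts if offset < c < end) + [end]
        for lo, hi in zip(bounds, bounds[1:]):
            if lo == hi:
                continue  # skip empty nodes
            is_common = any(s <= lo and hi <= e for s, e in spans)
            result.append((index, text[lo - offset:hi - offset], is_common))
        offset = end
    return result


def redistribute(old_nodes, new_nodes, min_len=2):
    """Split the text nodes of both revisions along their common substrings."""
    blocks = matching_blocks("".join(old_nodes), "".join(new_nodes), min_len)
    old_spans = [(a, a + n) for a, _, n in blocks]
    new_spans = [(b, b + n) for _, b, n in blocks]
    return split_nodes(old_nodes, old_spans), split_nodes(new_nodes, new_spans)


if __name__ == "__main__":
    old_segments, new_segments = redistribute(["The quick fox"], ["The slow fox"])
    for segment in old_segments + new_segments:
        print(segment)
```

After the split, every segment is either entirely shared between the two revisions or entirely changed, so a tree-matching step could pair shared segments as unchanged leaves and report only the remaining segments as text edits instead of treating a whole text node as a black box.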


