Authors: Özlem Uzuner, Randall Davis
Venue: ACM Digital Rights Management Workshop
Published: 2003-10-27
DOI: https://doi.org/10.1145/947380.947393
Content and expression-based copy recognition for intellectual property protection
Protecting the copyrights and revenues of content owners in the digital world has been gaining importance in recent years. This paper presents a way of fingerprinting text documents that can be used to identify content and expression similarities between documents, as a way of facilitating the tracking of digital copies of works and ensuring proper compensation to content owners. The fingerprints we collected consist of surface, syntactic, and semantic features of documents. Because they mostly reflect how things are said, we call these features stylistic fingerprints. However, how things are said is not independent of what is said; therefore, these features have predictive power with respect to both content and expression. We tested the ability of these stylistic fingerprints to identify content and expression similarities between documents using a corpus of translated novels. On this corpus, using decision trees and ten-fold cross-validation, these fingerprints identified the source of a given book chapter (content) 90% of the time and the translator of the chapter (expression) 67% of the time. In comparison, fingerprints based on the vocabularies of documents recognized the source of a given book chapter 93% of the time and the expression of a particular translator 61% of the time. We believe that the right fingerprints can identify both modified and literal copies of works, securing revenues for content owners. Enabling content owners to secure revenues from the distribution of their works can alleviate the digital copyright problem and reduce the need to prevent distribution, giving a chance to solutions that promote uninhibited distribution and use of works by the public.
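
To make the distinction between stylistic and vocabulary-based fingerprints concrete, here is a minimal Python sketch. It is not the authors' feature set, which the abstract does not enumerate: the three surface features and the function names (`surface_fingerprint`, `vocabulary_fingerprint`, `vocabulary_overlap`) are assumptions chosen for illustration, and the syntactic and semantic features, the decision trees, and the cross-validation step are left out.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z']+")


def surface_fingerprint(text):
    """Return a few illustrative surface features (a hypothetical feature set)."""
    # Naive sentence split on terminal punctuation; real text needs a better splitter.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = WORD_RE.findall(text.lower())
    if not sentences or not words:
        return {"avg_sentence_len": 0.0, "avg_word_len": 0.0, "type_token_ratio": 0.0}
    return {
        # Average number of words per sentence.
        "avg_sentence_len": len(words) / len(sentences),
        # Average number of characters per word.
        "avg_word_len": sum(len(w) for w in words) / len(words),
        # Distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
    }


def vocabulary_fingerprint(text):
    """Return word counts, a simple vocabulary-based fingerprint."""
    return Counter(WORD_RE.findall(text.lower()))


def vocabulary_overlap(a, b):
    """Jaccard overlap between the vocabularies of two texts."""
    va, vb = set(vocabulary_fingerprint(a)), set(vocabulary_fingerprint(b))
    if not va and not vb:
        return 0.0
    return len(va & vb) / len(va | vb)
```

In a setup like the one the abstract describes, feature vectors of this kind would be computed per chapter and passed to a classifier, once with the source book as the label (content) and once with the translator as the label (expression).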