Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings: Latest Articles
Text-mining based journal splitting
Xiaofan Lin
This paper introduces a novel journal splitting algorithm. It takes full advantage of various kinds of information such as text match, layout and page numbers. The core procedure is a highly efficient text-mining algorithm, which detects the matched phrases between the content pages and the title pages of individual articles. Experiments show that this algorithm is robust and able to split a wide range of journals, magazines and books.
DOI: https://doi.org/10.1109/ICDAR.2003.1227822 (published 2003-08-03)
Citations: 17
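The phrase-matching idea in the abstract above can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes that "matched phrases" can be approximated by shared word trigrams, and the names `phrases`, `find_article_starts`, `n` and `min_shared` are hypothetical. Layout and page-number cues, which the abstract also mentions, are left out.

```python
import re

def phrases(text, n=3):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def find_article_starts(toc_entries, pages, n=3, min_shared=1):
    """Map each contents-page entry to the page sharing the most n-grams with it.

    Ties go to the earliest page; entries with fewer than min_shared
    shared n-grams on every page are left unmatched.
    """
    page_phrases = [phrases(p, n) for p in pages]
    starts = {}
    for entry in toc_entries:
        entry_phrases = phrases(entry, n)
        best_page, best_score = None, 0
        for idx, pp in enumerate(page_phrases):
            score = len(entry_phrases & pp)
            if score > best_score:
                best_page, best_score = idx, score
        if best_page is not None and best_score >= min_shared:
            starts[entry] = best_page
    return starts

toc = ["Text-mining based journal splitting",
       "Image segmentation by learning approach"]
pages = ["Editorial notes and news",
         "Text-mining based journal splitting Xiaofan Lin Abstract",
         "continued body text of the paper",
         "Image segmentation by learning approach H. Legal-Ayala"]
print(find_article_starts(toc, pages))
```

The returned page indices mark where each article would begin; a splitter could then cut the volume at those indices.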
Image segmentation by learning approach
H. Legal-Ayala, J. Facon
This article describes a new approach to segmentation by thresholding, based on learning. The method consists in learning to threshold correctly by submitting both an image and its ideal thresholded version. From this stage, a decision matrix is generated for each pixel and each gray level, which is reused when new images are segmented. The new image is thresholded by means of a new strategy based on nearest neighbors that seeks, for each pixel of the new image, the best solution in the decision matrix. Tests performed on handwritten documents showed promising results. In terms of quality of the results, the developed technique is equal or superior to traditional segmentation-by-thresholding techniques, with the advantage that it does not require the use of heuristic parameters.
DOI: https://doi.org/10.1109/ICDAR.2003.1227776 (published 2003-08-03)
Citations: 3
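A rough sketch of learning a threshold from an image and its ideal binarization is given below. It is a simplification, not the authors' decision matrix: it assumes each pixel is described by its gray level and its 3x3 neighborhood mean, and it labels new pixels by a k-nearest-neighbor vote over the training pixels. The function names and the feature choice are assumptions made for illustration only.

```python
import numpy as np

def local_mean(img):
    """Mean of each pixel's 3x3 neighborhood, using edge padding."""
    p = np.pad(img.astype(float), 1, mode="edge")
    h, w = img.shape
    return sum(p[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)) / 9.0

def features(img):
    """One row per pixel: (gray level, 3x3 neighborhood mean)."""
    return np.stack([img.astype(float).ravel(), local_mean(img).ravel()], axis=1)

def learn(train_img, ideal_mask):
    """Store the training pixels' features with their ideal binary labels."""
    return features(train_img), ideal_mask.ravel().astype(bool)

def threshold(new_img, model, k=3):
    """Label each new pixel by majority vote of its k nearest training pixels."""
    train_x, train_y = model
    x = features(new_img)
    # Squared distances between every new pixel and every training pixel;
    # this is quadratic in memory, so it only suits small images.
    d = ((x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = train_y[nearest].sum(axis=1)
    return (votes * 2 > k).reshape(new_img.shape)

train = np.array([[10, 10, 200, 200]] * 4)
model = learn(train, train > 100)
print(threshold(np.array([[20, 20, 190]] * 3), model))
```

No global threshold value is set by hand here; the decision for each pixel comes from the training pair, which is the property the abstract emphasizes.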
Writer identification based on the fractal construction of a reference base
A. Seropian, M. Grimaldi, N. Vincent
Our aim is to achieve writer identification through a fractal analysis of handwriting style. For each writer, a set of characteristics specific to that writer is extracted. Advantage is taken of the autosimilarity properties that are present in one's handwriting. To do so, some invariant patterns characterizing the writing are extracted. During the training step these invariant patterns appear along a fractal compression process; they are then organized in a reference base that can be associated with the writer. This base makes it possible to analyze an unknown writing whose writer has to be identified. A pattern matching process is performed using all the reference bases successively. The results of this analysis are estimated through the signal-to-noise ratio. Thus, the signal-to-noise ratio according to a set of bases identifies the unknown text's writer.
DOI: https://doi.org/10.1109/ICDAR.2003.1227840 (published 2003-08-03)
Citations: 32
An architecture for ink annotations on Web documents
Sriram Ramachandran, R. Kashi
There have been recent improvements in document technologies, such as the standardization of object interfaces to access and manipulate the properties of Web documents. There has also been significant progress in pen-based computing for recognition of digital ink on desktops, tablets and handheld devices. These have created a need for further research on annotation architectures for digital documents, specifically pen-based annotation systems. This paper presents an attempt to leverage the new standards of DHTML and W3C DOM, which are being gradually implemented by popular browsers, to build a prototype of an ink annotation system with common components across browsers. One of the primary goals of this study is to semantically link ink data with underlying document elements such as text and images. The system has three components: a) ink capture and rendering; b) ink understanding, which recognizes and associates ink with the underlying document; and c) ink storage and retrieval.
DOI: https://doi.org/10.1109/ICDAR.2003.1227669 (published 2003-08-03)
Citations: 23
Text identification in noisy document images using Markov random model
Yefeng Zheng, Huiping Li, D. Doermann
In this paper we address the problem of identifying text in noisy documents. We segment and distinguish handwriting from machine printed text because 1) handwriting in a document often indicates corrections, additions or other supplemental information that should be treated differently from the main body content, and 2) the segmentation and recognition techniques for machine printed text and handwriting are significantly different. Our novelty is that we treat noise as a separate class and model noise based on selected features. Trained Fisher classifiers are used to distinguish machine printed text and handwriting from noise. We further exploit context to refine the classification. A Markov random field (MRF) based approach is used to model the geometrical structure of the printed text, handwriting and noise to rectify misclassifications. Experimental results show that our approach is promising and robust, and can significantly improve page segmentation results in noisy documents.
DOI: https://doi.org/10.1109/ICDAR.2003.1227734 (published 2003-08-03)
Citations: 27
Detection of text marks on moving vehicles
R. Kasturi
Vehicle text marks are unique features which are useful for identifying vehicles in video surveillance applications. We propose a method for finding such text marks. An existing text detection algorithm is modified such that detection is increased and made more robust to outdoor conditions. False alarms are reduced by introducing a binary image test which removes detections that are not likely to be caused by text. The method is tested on a captured video of a typical street scene.
DOI: https://doi.org/10.1109/ICDAR.2003.1227696 (published 2003-08-03)
Citations: 5
Individuality of numerals
S. Srihari, C. Tomai, Bin Zhang, Sangjik Lee
The analysis of handwritten documents from the viewpoint of determining their writership has great bearing on the criminal justice system. In many cases, only a limited amount of handwriting is available and sometimes it consists of only numerals. Using a large number of handwritten numeral images extracted from about 3000 samples written by 1000 writers, a study of the individuality of numerals for identification/verification purposes was conducted. The individuality of numerals was studied using cluster analysis. Numeral discriminability was measured for writer verification. The study shows that some numerals present a higher discriminatory power and that their performances for the verification/identification tasks are very different.
DOI: https://doi.org/10.1109/ICDAR.2003.1227826 (published 2003-08-03)
Citations: 24
A scalable solution for integrating illustrated parts drawings into a Class IV Interactive Electronic Technical Manual
Molly L. Boose, D. B. Shema, Lawrence S. Baum
This paper discusses a scalable solution for integrating legacy illustrated parts drawings into a Class IV Interactive Electronic Technical Manual (IETM) (1995). An IETM is an interactive electronic version of a system's technical manuals, such as those for a commercial airplane or a military helicopter. It contains the information a technician needs to do her job, including troubleshooting, vehicle maintenance and repair procedures. A Class IV IETM is an IETM that is authored and managed directly via a database. The end-user system optimizes viewing and navigation, minimizing the need for users to browse and search through large volumes of data. The Boeing Company has hundreds of thousands of illustrated parts drawings for both commercial and military vehicles. As Boeing migrates to Class IV IETM systems, it is necessary to incorporate existing illustrated parts drawings into the new systems. Manually re-authoring the drawings to bring them up to the level of a Class IV IETM is prohibitively expensive. Our solution is to provide a batch-processing system that performs the required modifications to the raster images and automatically updates the IETM database.
DOI: https://doi.org/10.1109/ICDAR.2003.1227679 (published 2003-08-03)
Citations: 7
SAGENT: a novel technique for document modeling for secure access and distribution
Sanaul Hoque, H. Selim, G. Howells, M. Fairhurst, F. Deravi
A novel strategy for the representation and manipulation of distributed documents, potentially complex and heterogeneous, is presented in this paper. The document under the proposed model is represented in a hierarchical structure. Associated metadata describes the flexible hierarchy with the scope of dynamically restructuring the tree at runtime. All useful functionals can also be included within the hierarchy to minimize reliance on external programs in manipulating sensitive data. This gives the proposed model two key properties: generality (capable of representing any document format including future innovations) and autonomy (non-reliance on external programs). The model also allows incorporation of additional features for security and access control. Biometric person authentication measures are introduced. A brief example illustrates the key ideas.
DOI: https://doi.org/10.1109/ICDAR.2003.1227859 (published 2003-08-03)
Citations: 1
Classification of Web documents using a graph model
A. Schenker, Mark Last, H. Bunke, A. Kandel
In this paper we describe work relating to classification of Web documents using a graph-based model instead of the traditional vector-based model for document representation. We compare the classification accuracy of the vector model approach using the k-nearest neighbor (k-NN) algorithm to a novel approach which allows the use of graphs for document representation in the k-NN algorithm. The proposed method is evaluated on three different Web document collections using the leave-one-out approach for measuring classification accuracy. The results show that the graph-based k-NN approach can outperform traditional vector-based k-NN methods in terms of both accuracy and execution time.
DOI: https://doi.org/10.1109/ICDAR.2003.1227666 (published 2003-08-03)
Citations: 102
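As an illustration of k-NN over graphs rather than vectors, the sketch below builds a small word graph per document and ranks training documents by a graph distance. The abstract does not specify the graph construction or the distance, so both are assumptions here: nodes are unique words, edges link adjacent words, and the distance is one minus the size of the common subgraph divided by the size of the larger graph. With uniquely labeled nodes, that common subgraph is simply the intersection of the node and edge sets. All function names are hypothetical.

```python
from collections import Counter

def doc_graph(text):
    """Nodes are unique words; a directed edge links each word to the next one."""
    words = text.lower().split()
    return set(words), set(zip(words, words[1:]))

def size(graph):
    """Graph size counted as number of nodes plus number of edges."""
    nodes, edges = graph
    return len(nodes) + len(edges)

def distance(g1, g2):
    """1 - |common subgraph| / max(|g1|, |g2|); 0 for identical graphs."""
    common = size((g1[0] & g2[0], g1[1] & g2[1]))
    biggest = max(size(g1), size(g2))
    return 1.0 - common / biggest if biggest else 0.0

def knn_classify(query_text, labeled_docs, k=3):
    """Return the majority label among the k training documents nearest to the query."""
    q = doc_graph(query_text)
    ranked = sorted(labeled_docs, key=lambda d: distance(q, doc_graph(d[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

docs = [("stock market prices fall", "finance"),
        ("stock market rally continues", "finance"),
        ("football match ends in draw", "sport"),
        ("football team wins match", "sport")]
print(knn_classify("stock market prices rise", docs, k=3))
```

Unlike a bag-of-words vector, the edge set keeps word order, so two documents sharing the same words in a different order are not treated as identical.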