Proceedings of the 2015 ACM Symposium on Document Engineering: Latest Articles

MSoS: A Multi-Screen-Oriented Web Page Segmentation Approach
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797090
Mira Sarkis, C. Concolato, Jean-Claude Dufourd
In this paper we describe a multiscreen-oriented approach for segmenting web pages. The segmentation is an automatic, hybrid visual and structural method. It aims to create coherent blocks whose functions are determined by the multiscreen environment. It is also characterized by dynamic adaptation to the page content. Experiments are conducted on a set of existing applications that contain multimedia elements, in particular YouTube and video player pages. Results are compared with one segmentation method from the literature and with a manually created ground truth. With 81% precision, MSoS is a promising method capable of producing good segmentation results.
Citations: 5
Proceedings of the 2015 ACM Symposium on Document Engineering
Pub Date : 2015-09-08 DOI: 10.1145/2682571
C. Vanoirbeek, P. Genevès
It is our great pleasure to welcome you to the 2015 ACM Symposium on Document Engineering -- DocEng'15. This year's symposium both continues and innovates in its tradition of being the premier forum for presentation of research results and experience reports on leading-edge issues of document engineering. The mission of the symposium is to share significant results, to evaluate novel approaches and models, and to identify promising directions for future research and development. DocEng gives researchers and practitioners a unique opportunity to share their perspectives with others interested in the various aspects of document engineering. Document engineering is a rapidly developing field that encompasses both traditional topics and also new ideas and challenges related to new technologies and to changes in the ways in which information is created, managed, and disseminated.

This year we issued a new call for papers centered on new hot topics around the notion of document, which has evolved to encompass a broader vision of the field. We therefore took pains to include new program committee members to supplement the overall expertise around these topics. Our call for papers attracted submissions from 25 countries (Algeria, Australia, Austria, Belgium, Brazil, Canada, China, Denmark, Ecuador, Ethiopia, France, Germany, India, Italy, Japan, Netherlands, Portugal, Qatar, Russian Federation, Singapore, Spain, Switzerland, Tunisia, United Kingdom of Great Britain and Northern Ireland, United States of America). All papers were carefully reviewed by a minimum of three program committee members. The program committee accepted 11 of 31 reviewed full paper submissions (35%) and 18 of 51 reviewed short paper submissions (35%) for oral presentations, for a combined acceptance rate of 35%. A further 10 short paper submissions were accepted for poster presentations. This year's program includes two poster sessions during which attendees will be given the opportunity to interact with authors of short papers accepted for poster presentation. The most covered topics this year are analysis, layout, authoring, querying, transformation, validation, management and semantics of documents, as well as related algorithms.

We are happy to feature two keynote talks:
Documents as Data, Data as Documents: what we learned about Semi-Structured Information for our Open World of Cloud & Devices, by Jean Paoli (currently President at Microsoft Open Technologies, Inc.)
The Venice Time Machine, by Frederic Kaplan (currently a professor at EPFL)
Citations: 1
Change Classification in Graphics-Intensive Digital Documents
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797079
Jeremy Svendsen, A. Albu
This paper proposes an approach for the automatic detection and classification of changes occurring in images of documents with identical content, but generated with different software versions, or under different operating platforms. Our work is performed on a database of digitally-born business documents created using financial reporting tools. The proposed method involves a multi-stage process, where the end goal is to present to a human user the reports which have changed and the changes which were detected. Our main contribution is related to matching and comparing of graphical document elements. This paper focuses on detection of local, translation-based changes. Future work will explore other local changes involving size, color, and rotation.
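As a rough illustration of what a local, translation-based change could look like once two graphical elements have already been matched, the sketch below compares their bounding boxes. The `Box` type, the `classify_change` function, the labels, and the pixel tolerance are all hypothetical names introduced here; the abstract does not describe the authors' actual matching or classification procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a graphical element (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int


def classify_change(old: Box, new: Box, tolerance: int = 0) -> str:
    """Label the change between two elements that were already matched.

    Assumption: a pure translation keeps the size and moves the origin.
    Anything that alters the size is reported as "other".
    """
    same_size = old.width == new.width and old.height == new.height
    moved = abs(new.x - old.x) > tolerance or abs(new.y - old.y) > tolerance
    if not same_size:
        return "other"
    return "translation" if moved else "unchanged"
```

For example, `classify_change(Box(10, 10, 50, 20), Box(13, 10, 50, 20))` reports a translation, since only the horizontal position differs.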
Citations: 0
VEDD: A Visual Editor for Creation and Semi-Automatic Update of Derived Documents
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797075
K. Marriott, Mingzheng Shi, Michael Wybrow
Document content is increasingly customised to a particular audience. Such customised documents are typically built by combining content from selected logical content modules and then editing the result to create the custom document. A major difficulty is how to efficiently update these derived documents when the source documents are changed. Here we describe a web-based visual editing tool for both creating and semi-automatically updating derived documents from modules in a source library.
Citations: 0
Automatic Text Document Summarization Based on Machine Learning
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797099
G. Silva, Rafael Ferreira, R. Lins, L. Cabral, Hilário Oliveira, S. Simske, M. Riss
The need for automatic generation of summaries gained importance with the unprecedented volume of information available on the Internet. Automatic systems based on extractive summarization techniques select the most significant sentences of one or more texts to generate a summary. This article makes use of Machine Learning techniques to assess the quality of the twenty most referenced strategies used in extractive summarization, integrating them in a tool. Quantitative and qualitative aspects were considered in this assessment, demonstrating the validity of the proposed scheme. The experiments were performed on the CNN-corpus, possibly the largest and most suitable test corpus today for benchmarking extractive summarization strategies.
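To make "selecting the most significant sentences" concrete, here is a minimal sketch of one classic extractive strategy, word-frequency scoring. It is a generic illustration only: the abstract does not say which twenty strategies were assessed, and the `summarize` function, its tokenization, and its scoring rule are assumptions made for this example.

```python
import re
from collections import Counter


def summarize(text: str, max_sentences: int = 2) -> list[str]:
    """Return the highest-scoring sentences, kept in their original order.

    Assumption: a sentence's score is the mean document-level frequency
    of its words. Real systems typically also remove stop words.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokens = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    freq = Counter(word for words in tokens for word in words)

    def score(words: list[str]) -> float:
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(tokens[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]
```

With the input `"Cats chase mice. Cats like cats. Dogs bark."` and `max_sentences=1`, the second sentence wins because "cats" is the most frequent word in the text.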
Citations: 15
Searching Live Meeting Documents "Show me the Action"
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797082
Laurent Denoue, S. Carter, Matthew L. Cooper
Live meeting documents require different techniques for effectively retrieving important pieces of information. During live meetings, people share web sites, edit presentation slides, and share code editors. A simple approach is to index the shared video frames, or key-frames, with Optical Character Recognition (OCR) and let users retrieve them. Here we show that a more useful approach is to look at what actions users take inside the live document streams. Based on observations of real meetings, we focus on two important signals: text editing and mouse cursor motion. We describe the detection of text and cursor motion, their implementation in our WebRTC (Web Real-Time Communication)-based system, and how users are better able to search live documents during a meeting based on these extracted actions.
Citations: 2
Document Engineering Issues in Document Analysis
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2801033
Charles K. Nicholas, Robert Brandon
We present an overview of the field of malware analysis with emphasis on issues related to document engineering. We will introduce the field with a discussion of the types of malware, including executable binaries, polymorphic malware, malicious PDFs, and exploit kits. We will conclude with our view of important research questions in the field.
Citations: 0
The Delaunay Document Layout Descriptor
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797059
Sébastien Eskenazi, Petra Gomez-Krämer, J. Ogier
Security applications related to document authentication require an exact match between an authentic copy and the original of a document. This implies that the document analysis algorithms that are used to compare two documents (original and copy) should provide the same output. This kind of algorithm includes the computation of layout descriptors from the segmentation result, as the layout of a document is a part of its semantic content. To this end, this paper presents a new layout descriptor that significantly improves the state of the art. The basis of this descriptor is the use of a Delaunay triangulation of the centroids of the document regions. This triangulation is seen as a graph and the adjacency matrix of the graph forms the descriptor. While most layout descriptors have a stability of 0% with regard to an exact match, our descriptor has a stability of 74%, which can be brought up to 100% with the use of an appropriate matching algorithm. It also achieves 100% accuracy and retrieval in a document retrieval scheme on a database of 960 document images. Furthermore, this descriptor is extremely efficient as it performs a search in constant time with respect to the size of the document database and it reduces the size of the index of the database by a factor of 400.
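The core idea, a Delaunay triangulation of region centroids read as a graph whose adjacency matrix is the descriptor, can be sketched in a few lines. The version below is a brute-force illustration under stated assumptions: the function name `delaunay_adjacency` is hypothetical, centroids are taken in input order, and the O(n^4) empty-circumcircle test stands in for whatever triangulation routine the authors actually used.

```python
from itertools import combinations


def delaunay_adjacency(points, eps=1e-9):
    """Adjacency matrix of the Delaunay graph of 2-D points (brute force).

    A triangle is Delaunay when no other point lies strictly inside its
    circumcircle. Assumption: points are in general position; co-circular
    points (for example a perfect square) would yield extra edges.
    """
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i, j, k in combinations(range(n), 3):
        (ax, ay), (bx, by), (cx, cy) = points[i], points[j], points[k]
        d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
        if abs(d) < eps:
            continue  # collinear triple, no circumcircle
        a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
        # Circumcenter (ux, uy) and squared circumradius r2.
        ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
        uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
        r2 = (ax - ux) ** 2 + (ay - uy) ** 2
        empty = all(
            (px - ux) ** 2 + (py - uy) ** 2 >= r2 - eps
            for m, (px, py) in enumerate(points)
            if m not in (i, j, k)
        )
        if empty:
            for u, v in ((i, j), (j, k), (i, k)):
                adj[u][v] = adj[v][u] = 1
    return adj
```

For the kite-shaped centroids `[(0, 0), (3, 1), (6, 0), (3, -1)]`, the triangulation keeps the short diagonal between the second and fourth points and omits the long one between the first and third, so the resulting matrix has zeros only at positions (0, 2) and (2, 0) off the diagonal.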
Citations: 8
Automatic Extraction of Figures from Scholarly Documents
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797085
Sagnik Ray Choudhury, P. Mitra, C. Lee Giles
Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple "figures" such as plots, flow charts and other images which are generated manually to symbolically represent and visually illustrate important concepts, findings and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large-scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of how to build a heuristic-independent trainable model for such an extraction task and how to extract figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80%.
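The abstract names figure-precision, figure-recall, and figure-F1-score without giving their definitions. The sketch below shows one plausible reading, assuming a predicted figure counts as correct when its bounding box overlaps an unmatched ground-truth box with intersection-over-union of at least 0.5. The function names, the box format `(x0, y0, x1, y1)`, the greedy matching, and the threshold are all assumptions, not the paper's definitions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def figure_scores(predicted, truth, threshold=0.5):
    """Return (precision, recall, F1) under greedy one-to-one box matching."""
    unmatched = list(truth)
    true_positives = 0
    for box in predicted:
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)  # each ground-truth figure matches once
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With two tagged figures and three predictions, two of which overlap a tagged figure well enough, this gives a precision of 2/3, a recall of 1, and an F1 of 0.8.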
Citations: 28
Multimedia Document Structure for Distributed Theatre
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797087
Jack Jansen, Michael Frantzis, Pablo César
This paper explores the suitability of structured (and declarative) multimedia document formats for supporting a novel type of performing arts: distributed theatre. In distributed theatre, the actors are split between two (or more) locations, but together deliver a single performance mediated by cameras, the internet, and projection technologies. Based on our efforts to make an actual distributed theatre production happen (the Tempest by Miracle Theatre), this paper reflects on our experience. Our findings are divided into two main areas: workflow and document structure. We conclude that novel types of video-mediated applications, like distributed theatre, require new ways of authoring documents. Moreover, specific extensions to existing document formats are needed in order to accommodate the new requirements imposed by this kind of application.
Citations: 2