Proceedings of the 2015 ACM Symposium on Document Engineering: Latest Papers

Knuth-Plass Revisited: Flexible Line-Breaking for Automatic Document Layout
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797091
Tamir Hassan, Andrew Hunter
There is an inherent flexibility in typesetting a block of text. Traditionally, line breaks would be manually chosen at strategic points in such a way as to minimize the amount of whitespace in each line. Hyphenation would only be used as a last resort. Knuth and Plass automated this optimization procedure, which has been used in various typesetting systems and DTP applications ever since. However, an optimal solution for the line-breaking problem does not necessarily lead us to an optimal document layout on the whole. The flexibility of choosing line breaks enables us, in many cases, to adjust the height of a paragraph by changing the number of lines, without having to make adjustments to font size, leading, etc. In many cases, the word spacing remains within the usual tolerances and visual quality does not noticeably suffer. This paper presents a modification to the Knuth-Plass algorithm to return several results for a given column of text, each corresponding to a different height, and describes steps to quantify the amount of expected flexibility in a given paragraph. We conclude with a discussion on how such "sub-optimal" results can lead to a better overall document layout, particularly in the context of mobile layouts, where flexibility is of key importance.
Citations: 2
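The Hassan and Hunter abstract above describes returning several line-breaking results for one column of text, one per paragraph height. The sketch below illustrates that idea only; it is not the authors' algorithm. It is a simplified dynamic program that assumes monospaced text, no hyphenation, no stretchable or shrinkable glue, and a cost equal to the squared trailing whitespace of every line except the last. The function name `break_by_line_count` and its parameters are hypothetical.

```python
def break_by_line_count(words, width):
    """Return {line_count: (cost, lines)}: the cheapest layout for each feasible line count.

    Simplifying assumptions (not from the paper): monospaced text, single
    spaces between words, cost = squared unused width per line, last line free.
    """
    n = len(words)
    # prefix[i] holds the total character length of words[:i]
    prefix = [0]
    for w in words:
        prefix.append(prefix[-1] + len(w))

    def line_len(i, j):
        # Length of words[i:j] joined by single spaces
        return prefix[j] - prefix[i] + (j - i - 1)

    # best[k][j] = (cost, previous break) for setting words[:j] in exactly k lines
    best = [{0: (0.0, None)}]
    results = {}
    for k in range(1, n + 1):
        layer = {}
        for i, (cost_i, _) in best[k - 1].items():
            for j in range(i + 1, n + 1):
                slack = width - line_len(i, j)
                if slack < 0:
                    break  # adding more words only makes the line longer
                cost = cost_i + (0 if j == n else slack ** 2)
                if j not in layer or cost < layer[j][0]:
                    layer[j] = (cost, i)
        if not layer:
            break  # no layout with k lines can be extended further
        best.append(layer)
        if n in layer:
            # Walk the stored back-pointers to recover the lines
            lines, j = [], n
            for kk in range(k, 0, -1):
                i = best[kk][j][1]
                lines.append(" ".join(words[i:j]))
                j = i
            results[k] = (layer[n][0], lines[::-1])
    return results


if __name__ == "__main__":
    for count, (cost, lines) in break_by_line_count("aaa bb cc ddddd".split(), 6).items():
        print(count, cost, lines)
```

Keeping the line count as an explicit dimension of the table is what lets a caller pick a taller or shorter paragraph and see how much extra cost that choice carries.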
MSoS: A Multi-Screen-Oriented Web Page Segmentation Approach
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797090
Mira Sarkis, C. Concolato, Jean-Claude Dufourd
In this paper we describe a multiscreen-oriented approach for segmenting web pages. The segmentation is an automatic, hybrid visual and structural method. It aims at creating coherent blocks that have different functions determined by the multiscreen environment. It is also characterized by dynamic adaptation to the page content. Experiments are conducted on a set of existing applications that contain multimedia elements, in particular YouTube and video player pages. Results are compared with one segmentation method from the literature and with a manually created ground truth. With 81% precision, MSoS is a promising method that is capable of producing good segmentation results.
Citations: 5
Change Classification in Graphics-Intensive Digital Documents
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797079
Jeremy Svendsen, A. Albu
This paper proposes an approach for the automatic detection and classification of changes occurring in images of documents with identical content, but generated with different software versions, or under different operating platforms. Our work is performed on a database of digitally-born business documents created using financial reporting tools. The proposed method involves a multi-stage process, where the end goal is to present to a human user the reports which have changed and the changes which were detected. Our main contribution is related to matching and comparing of graphical document elements. This paper focuses on detection of local, translation-based changes. Future work will explore other local changes involving size, color, and rotation.
Citations: 0
VEDD: A Visual Editor for Creation and Semi-Automatic Update of Derived Documents
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797075
K. Marriott, Mingzheng Shi, Michael Wybrow
Document content is increasingly customised to a particular audience. Such customised documents are typically built by combining content from selected logical content modules and then editing this to create the custom document. A major difficulty is how to efficiently update these derived documents when the source documents are changed. Here we describe a web-based visual editing tool for both creating and semi-automatically updating derived documents from modules in a source library.
Citations: 0
Automatic Text Document Summarization Based on Machine Learning
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797099
G. Silva, Rafael Ferreira, R. Lins, L. Cabral, Hilário Oliveira, S. Simske, M. Riss
The need for automatic generation of summaries has gained importance with the unprecedented volume of information available on the Internet. Automatic systems based on extractive summarization techniques select the most significant sentences of one or more texts to generate a summary. This article makes use of Machine Learning techniques to assess the quality of the twenty most referenced strategies used in extractive summarization, integrating them in a tool. Quantitative and qualitative aspects were considered in this assessment, demonstrating the validity of the proposed scheme. The experiments were performed on the CNN-corpus, possibly the largest and most suitable test corpus today for benchmarking extractive summarization strategies.
Citations: 15
Searching Live Meeting Documents "Show me the Action"
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797082
Laurent Denoue, S. Carter, Matthew L. Cooper
Live meeting documents require different techniques for effectively retrieving important pieces of information. During live meetings, people share web sites, edit presentation slides, and share code editors. A simple approach is to index the shared video frames, or key-frames, with Optical Character Recognition (OCR) and let users retrieve them. Here we show that a more useful approach is to look at what actions users take inside the live document streams. Based on observations of real meetings, we focus on two important signals: text editing and mouse cursor motion. We describe the detection of text and cursor motion, their implementation in our WebRTC (Web Real-Time Communication)-based system, and how users are better able to search live documents during a meeting based on these extracted actions.
Citations: 2
Document Engineering Issues in Document Analysis
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2801033
Charles K. Nicholas, Robert Brandon
We present an overview of the field of malware analysis with emphasis on issues related to document engineering. We will introduce the field with a discussion of the types of malware, including executable binaries, polymorphic malware, malicious PDFs, and exploit kits. We will conclude with our view of important research questions in the field.
Citations: 0
The Delaunay Document Layout Descriptor
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797059
Sébastien Eskenazi, Petra Gomez-Krämer, J. Ogier
Security applications related to document authentication require an exact match between an authentic copy and the original of a document. This implies that the document analysis algorithms used to compare two documents (original and copy) should provide the same output. This kind of algorithm includes the computation of layout descriptors from the segmentation result, as the layout of a document is a part of its semantic content. To this end, this paper presents a new layout descriptor that significantly improves the state of the art. The basis of this descriptor is a Delaunay triangulation of the centroids of the document regions. This triangulation is seen as a graph, and the adjacency matrix of the graph forms the descriptor. While most layout descriptors have a stability of 0% with regard to an exact match, our descriptor has a stability of 74%, which can be brought up to 100% with the use of an appropriate matching algorithm. It also achieves 100% accuracy and retrieval in a document retrieval scheme on a database of 960 document images. Furthermore, this descriptor is extremely efficient: it performs a search in constant time with respect to the size of the document database, and it reduces the size of the database index by a factor of 400.
Citations: 8
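The Delaunay descriptor abstract above builds a graph from a triangulation of region centroids and uses its adjacency matrix as the descriptor. The sketch below shows that construction under stated assumptions and is not the authors' implementation. It uses a brute-force empty-circumcircle test (roughly O(n^4)), which is only reasonable for a handful of regions. It assumes the centroids are already extracted and in general position, with no four points on one circle. It also assumes a fixed region ordering, which the abstract does not specify. The names `orientation`, `in_circumcircle` and `delaunay_adjacency` are hypothetical.

```python
from itertools import combinations


def orientation(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circle through a, b, c (non-collinear)."""
    if orientation(a, b, c) < 0:
        b, c = c, b  # the determinant test below expects counter-clockwise order
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cx, cy = c[0] - p[0], c[1] - p[1]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    return det > 0


def delaunay_adjacency(centroids):
    """Adjacency matrix of the Delaunay graph of the given 2-D points."""
    n = len(centroids)
    adj = [[0] * n for _ in range(n)]
    for i, j, k in combinations(range(n), 3):
        a, b, c = centroids[i], centroids[j], centroids[k]
        if orientation(a, b, c) == 0:
            continue  # collinear triple: no circumcircle
        others = (centroids[m] for m in range(n) if m not in (i, j, k))
        if any(in_circumcircle(a, b, c, p) for p in others):
            continue  # circumcircle not empty, so not a Delaunay triangle
        for u, v in ((i, j), (j, k), (i, k)):
            adj[u][v] = adj[v][u] = 1
    return adj


if __name__ == "__main__":
    # Four hypothetical region centroids forming a flat kite
    for row in delaunay_adjacency([(0, 0), (3, -1), (6, 0), (3, 1)]):
        print(row)
```

In the kite example the triangulation picks the short diagonal, so the two far-apart centroids are not adjacent. Comparing two documents then reduces to comparing their matrices.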
Automatic Extraction of Figures from Scholarly Documents
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797085
Sagnik Ray Choudhury, P. Mitra, C. Lee Giles
Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple "figures" such as plots, flow charts and other images which are generated manually to symbolically represent and illustrate visually important concepts, findings and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of how to build a heuristic independent trainable model for such an extraction task and how to extract figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80%.
Citations: 28
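The figure-extraction abstract above names figure-precision, figure-recall and figure-F1-score but does not define them. The sketch below is therefore not the authors' definition. It assumes the ordinary precision, recall and F1 formulas applied to bounding boxes, where a predicted box counts as correct if its intersection-over-union with a not-yet-matched ground-truth box reaches a threshold. The 0.5 threshold, the greedy matching and the function names are all assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def figure_scores(predicted, truth, threshold=0.5):
    """Return (precision, recall, f1) for predicted vs. ground-truth figure boxes."""
    unmatched = list(truth)
    true_positives = 0
    for box in predicted:
        # Greedy matching: take the first unmatched ground-truth box that overlaps enough
        hit = next((t for t in unmatched if iou(box, t) >= threshold), None)
        if hit is not None:
            unmatched.remove(hit)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    predicted = [(0, 0, 10, 10), (50, 50, 60, 60)]
    truth = [(0, 0, 10, 9), (20, 20, 30, 30), (100, 100, 110, 110)]
    print(figure_scores(predicted, truth))
```

Greedy matching is the simplest choice here. An optimal one-to-one assignment would behave differently when several predictions overlap the same figure.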
Multimedia Document Structure for Distributed Theatre
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797087
Jack Jansen, Michael Frantzis, Pablo César
This paper explores the suitability of structured (and declarative) multimedia document formats for supporting a novel type of performing arts: distributed theatre. In distributed theatre, the actors are split between two (or more) locations, but together deliver a single performance mediated by cameras, the internet, and projection technologies. Based on our efforts to make an actual distributed theatre production happen (The Tempest by Miracle Theatre), this paper reflects on our experience. Our findings are divided into two main areas: workflow and document structure. We conclude that novel types of video-mediated applications, like distributed theatre, require new ways of authoring documents. Moreover, specific extensions to existing document formats are needed in order to accommodate the new requirements imposed by this kind of application.
Citations: 2