MSoS: A Multi-Screen-Oriented Web Page Segmentation Approach
Mira Sarkis, C. Concolato, Jean-Claude Dufourd. DOI: 10.1145/2682571.2797090
In this paper we describe a multiscreen-oriented approach for segmenting web pages. The segmentation method is automatic and hybrid, combining visual and structural analysis. It aims at creating coherent blocks whose functions are determined by the multiscreen environment, and it adapts dynamically to the page content. Experiments are conducted on a set of existing applications that contain multimedia elements, in particular YouTube and video player pages. Results are compared with one segmentation method from the literature and with a manually created ground truth. With 81% precision, MSoS is a promising method capable of producing good segmentation results.
Proceedings of the 2015 ACM Symposium on Document Engineering
C. Vanoirbeek, P. Genevès. DOI: 10.1145/2682571
It is our great pleasure to welcome you to the 2015 ACM Symposium on Document Engineering -- DocEng'15. This year's symposium both continues and innovates in its tradition of being the premier forum for presentation of research results and experience reports on leading-edge issues of document engineering. The mission of the symposium is to share significant results, to evaluate novel approaches and models, and to identify promising directions for future research and development. DocEng gives researchers and practitioners a unique opportunity to share their perspectives with others interested in the various aspects of document engineering. Document engineering is a rapidly developing field that encompasses both traditional topics and also new ideas and challenges related to new technologies and to changes in the ways in which information is created, managed, and disseminated.
This year we issued a new call for papers centered on hot topics around the notion of document, which has evolved to encompass a broader vision of the field. We therefore took pains to include new program committee members to supplement the overall expertise around these topics. Our call for papers attracted submissions from 25 countries (Algeria, Australia, Austria, Belgium, Brazil, Canada, China, Denmark, Ecuador, Ethiopia, France, Germany, India, Italy, Japan, Netherlands, Portugal, Qatar, Russian Federation, Singapore, Spain, Switzerland, Tunisia, United Kingdom of Great Britain and Northern Ireland, United States of America). All papers were carefully reviewed by a minimum of three program committee members. The program committee accepted 11 of 31 reviewed full paper submissions (35%) and 18 of 51 reviewed short paper submissions (35%) for oral presentations, for a combined acceptance rate of 35%. A further 10 short paper submissions were accepted for poster presentations.
This year's program includes two poster sessions during which attendees will be given the opportunity to interact with authors of short papers accepted for poster presentation. The most covered topics this year are analysis, layout, authoring, querying, transformation, validation, management and semantics of documents, as well as related algorithms. We are happy to feature two keynote talks:
- "Documents as Data, Data as Documents: What We Learned about Semi-Structured Information for our Open World of Cloud & Devices", Jean Paoli (currently President at Microsoft Open Technologies, Inc.)
- "The Venice Time Machine", Frederic Kaplan (currently Professor at EPFL)
Change Classification in Graphics-Intensive Digital Documents
Jeremy Svendsen, A. Albu. DOI: 10.1145/2682571.2797079
This paper proposes an approach for the automatic detection and classification of changes occurring in images of documents with identical content but generated with different software versions or under different operating platforms. Our work is performed on a database of digitally-born business documents created using financial reporting tools. The proposed method involves a multi-stage process, where the end goal is to present to a human user the reports which have changed and the changes which were detected. Our main contribution is related to the matching and comparison of graphical document elements. This paper focuses on the detection of local, translation-based changes. Future work will explore other local changes involving size, color, and rotation.
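The idea of a local, translation-based change can be illustrated with a small sketch. This is not the authors' implementation; the data model (elements matched by a shared identifier, positions as coordinates) is an assumption for illustration only.

```python
# Sketch: detect translation-based changes between two renderings of the
# same document. Elements are matched by a shared id (hypothetical model);
# a change is reported when a matched element's position differs.

def detect_translations(original, revised, tol=0):
    """original/revised: dicts mapping element id -> (x, y) position."""
    changes = []
    for elem_id, (x0, y0) in original.items():
        if elem_id not in revised:
            continue  # disappeared elements are a different change class
        x1, y1 = revised[elem_id]
        dx, dy = x1 - x0, y1 - y0
        if abs(dx) > tol or abs(dy) > tol:
            changes.append((elem_id, dx, dy))
    return changes

# The logo moved 4 units down; the title stayed put:
print(detect_translations({"logo": (10, 10), "title": (50, 10)},
                          {"logo": (10, 14), "title": (50, 10)}))
```

A real pipeline would first have to establish the element correspondence itself (the matching the paper identifies as its main contribution); the dictionary keys stand in for that step here.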
VEDD: A Visual Editor for Creation and Semi-Automatic Update of Derived Documents
K. Marriott, Mingzheng Shi, Michael Wybrow. DOI: 10.1145/2682571.2797075
Document content is increasingly customised to a particular audience. Such customised documents are typically built by combining content from selected logical content modules and then editing this to create the custom document. A major difficulty is how to efficiently update these derived documents when the source documents are changed. Here we describe a web-based visual editing tool for both creating and semi-automatically updating derived documents from modules in a source library.
Automatic Text Document Summarization Based on Machine Learning
G. Silva, Rafael Ferreira, R. Lins, L. Cabral, Hilário Oliveira, S. Simske, M. Riss. DOI: 10.1145/2682571.2797099
The need for automatic generation of summaries gained importance with the unprecedented volume of information available on the Internet. Automatic systems based on extractive summarization techniques select the most significant sentences of one or more texts to generate a summary. This article makes use of Machine Learning techniques to assess the quality of the twenty most referenced strategies used in extractive summarization, integrating them into a tool. Quantitative and qualitative aspects were considered in the assessment, demonstrating the validity of the proposed scheme. The experiments were performed on the CNN-corpus, possibly the largest and most suitable test corpus today for benchmarking extractive summarization strategies.
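Extractive summarization, as the abstract describes, selects whole sentences rather than generating new text. A minimal sketch of one classic strategy (word-frequency sentence scoring) follows; it is one of many such strategies and not the paper's twenty-strategy machine-learned ensemble.

```python
# Sketch of a single classic extractive strategy: score each sentence by
# the average corpus frequency of its words, then keep the top sentences
# in their original order.
import re
from collections import Counter

def summarize(text, n_sentences=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)
    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in top)

print(summarize("Cats sleep. Cats eat fish. Dogs bark loudly sometimes here."))
```

Real systems combine many such signals (position, cue phrases, TF-IDF, graph centrality); the paper's contribution is assessing which combinations work, which a sketch like this cannot show.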
Searching Live Meeting Documents "Show me the Action"
Laurent Denoue, S. Carter, Matthew L. Cooper. DOI: 10.1145/2682571.2797082
Live meeting documents require different techniques for effectively retrieving important pieces of information. During live meetings, people share web sites, edit presentation slides, and share code editors. A simple approach is to index the shared video frames, or key-frames, with Optical Character Recognition (OCR) and let users retrieve them. Here we show that a more useful approach is to look at what actions users take inside the live document streams. Based on observations of real meetings, we focus on two important signals: text editing and mouse cursor motion. We describe the detection of text and cursor motion, their implementation in our WebRTC (Web Real-Time Communication)-based system, and how users are better able to search live documents during a meeting based on these extracted actions.
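The text-editing signal the abstract mentions can be sketched by diffing the OCR text of successive key-frames. This is an illustration only, not the authors' system: frame capture and OCR are assumed to happen elsewhere, and the diffing here uses Python's standard `difflib`.

```python
# Sketch: flag text-editing actions by diffing the OCR text of two
# successive key-frames of a shared document stream.
import difflib

def edit_actions(prev_text, curr_text):
    """Return (op, text) pairs for runs inserted or deleted between frames."""
    sm = difflib.SequenceMatcher(a=prev_text, b=curr_text)
    actions = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("insert", "replace") and j2 > j1:
            actions.append(("insert", curr_text[j1:j2]))
        if op in ("delete", "replace") and i2 > i1:
            actions.append(("delete", prev_text[i1:i2]))
    return actions

# A user typed "bar" into a shared code editor between two frames:
print(edit_actions("def foo():", "def foobar():"))
```

Indexing these inserted/deleted runs, rather than whole frames, is what lets a search target the *action* ("when was this word typed?") instead of every frame in which the word merely appears.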
Document Engineering Issues in Document Analysis
Charles K. Nicholas, Robert Brandon. DOI: 10.1145/2682571.2801033
We present an overview of the field of malware analysis with emphasis on issues related to document engineering. We will introduce the field with a discussion of the types of malware, including executable binaries, polymorphic malware, malicious PDFs, and exploit kits. We will conclude with our view of important research questions in the field.
The Delaunay Document Layout Descriptor
Sébastien Eskenazi, Petra Gomez-Krämer, J. Ogier. DOI: 10.1145/2682571.2797059
Security applications related to document authentication require an exact match between an authentic copy and the original of a document. This implies that the document analysis algorithms used to compare two documents (original and copy) should provide the same output. This kind of algorithm includes the computation of layout descriptors from the segmentation result, as the layout of a document is a part of its semantic content. To this end, this paper presents a new layout descriptor that significantly improves on the state of the art. The basis of this descriptor is a Delaunay triangulation of the centroids of the document regions. This triangulation is seen as a graph, and the adjacency matrix of the graph forms the descriptor. While most layout descriptors have a stability of 0% with regard to an exact match, our descriptor has a stability of 74%, which can be brought up to 100% with the use of an appropriate matching algorithm. It also achieves 100% accuracy and retrieval in a document retrieval scheme on a database of 960 document images. Furthermore, this descriptor is extremely efficient, as it performs a search in constant time with respect to the size of the document database and reduces the size of the database index by a factor of 400.
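The descriptor construction described above (triangulate the region centroids, read the triangulation as a graph, take its adjacency matrix) can be sketched compactly. The sketch assumes the Delaunay triangulation has already been computed, e.g. as the `simplices` index triples produced by `scipy.spatial.Delaunay`; only the adjacency-matrix step is shown.

```python
# Sketch: build the adjacency-matrix layout descriptor from a Delaunay
# triangulation of region centroids, given as triples of point indices
# (the form scipy.spatial.Delaunay exposes via its `simplices` attribute).

def adjacency_descriptor(n_points, triangles):
    adj = [[0] * n_points for _ in range(n_points)]
    for a, b, c in triangles:
        # each triangle contributes its three edges to the graph
        for i, j in ((a, b), (b, c), (a, c)):
            adj[i][j] = adj[j][i] = 1
    return adj

# Four centroids, two triangles sharing the edge (1, 2):
print(adjacency_descriptor(4, [(0, 1, 2), (1, 2, 3)]))
```

Because the matrix records only which regions are neighbours, not where they sit, small pixel-level perturbations of the centroids that leave the triangulation unchanged leave the descriptor bit-for-bit identical, which is what makes an exact-match comparison feasible.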
Automatic Extraction of Figures from Scholarly Documents
Sagnik Ray Choudhury, P. Mitra, C. Lee Giles. DOI: 10.1145/2682571.2797085
Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple "figures" such as plots, flow charts and other images which are generated manually to symbolically represent and visually illustrate important concepts, findings and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large-scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of building a heuristic-independent trainable model for such an extraction task and of extracting figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80%.
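Figure-level precision, recall, and F1 can be computed once each predicted figure region is judged correct or not against the ground truth. The sketch below uses an IoU-threshold matching criterion; that criterion, and the 0.5 threshold, are assumptions for illustration, not necessarily the paper's exact definitions.

```python
# Sketch: figure-level precision/recall/F1 over bounding boxes. A predicted
# region counts as a hit when its IoU with some ground-truth region exceeds
# a threshold (an assumed matching rule, for illustration).

def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def figure_scores(predicted, truth, thresh=0.5):
    hits = sum(any(iou(p, t) >= thresh for t in truth) for p in predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

One design caveat worth noting: this sketch lets two predictions match the same ground-truth box, which can inflate recall; a stricter evaluation would use one-to-one matching.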
Multimedia Document Structure for Distributed Theatre
Jack Jansen, Michael Frantzis, Pablo César. DOI: 10.1145/2682571.2797087
This paper explores the suitability of structured (and declarative) multimedia document formats for supporting a novel type of performing arts: distributed theatre. In distributed theatre, the actors are split between two (or more) locations, but together deliver a single performance mediated by cameras, the internet, and projection technologies. Based on our efforts to make an actual distributed theatre production happen (The Tempest by Miracle Theatre), this paper reflects on our experience. Our findings are divided into two main areas: workflow and document structure. We conclude that novel types of video-mediated applications, like distributed theatre, require new ways of authoring documents. Moreover, specific extensions to existing document formats are needed in order to accommodate the new requirements imposed by such applications.