Annotation Data Management with JeDIS
E. Faessler, U. Hahn. Proceedings of the ACM Symposium on Document Engineering 2018. DOI: 10.1145/3209280.3229102

This paper introduces the Jena Document Information System (JeDIS). The focus lies on its capability to partition annotation graphs into modules. Annotation modules are defined in terms of types from the annotation schema. Modules allow easy manipulation of their annotations (deletion or update) and the creation of alternative annotations of individual documents, even for annotation formalisms that by design do not support this feature.
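The module idea from the abstract above can be illustrated with a minimal sketch (all names here are hypothetical, not JeDIS's actual API): annotations carry a type from the schema, and a module is simply the subset of annotations whose types it covers, which can then be deleted or swapped out independently.

```python
# Hypothetical sketch: annotations as dicts with a schema type and offsets;
# a module groups all annotations of selected types so they can be managed
# (deleted, updated, or replaced by an alternative) on their own.
def partition_into_modules(annotations, module_spec):
    """module_spec maps a module name to the set of annotation types it covers."""
    modules = {name: [] for name in module_spec}
    for ann in annotations:
        for name, types in module_spec.items():
            if ann["type"] in types:
                modules[name].append(ann)
    return modules

annotations = [
    {"type": "Token", "start": 0, "end": 4},
    {"type": "Gene", "start": 0, "end": 4},
    {"type": "Sentence", "start": 0, "end": 20},
]
modules = partition_into_modules(
    annotations,
    {"syntax": {"Token", "Sentence"}, "entities": {"Gene"}},
)
assert len(modules["syntax"]) == 2
assert len(modules["entities"]) == 1
```

An alternative annotation of the same document is then just a second module over the same types, stored alongside the first.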
Document Changes: Modeling, Detection, Storage and Visualization (DChanges 2018)
Gioele Barabucci, Uwe M. Borghoff, A. Iorio, Sonja Schimmler, E. Munson. Proceedings of the ACM Symposium on Document Engineering 2018. DOI: 10.1145/3209280.3232792

The DChanges series of workshops focuses on changes in all their aspects and applications: algorithms to detect changes, models to describe them, and techniques to present them to users are only some of the topics investigated. This year, we would like to focus on collaboration tools for non-textual documents. The workshop is open to researchers and practitioners from industry and academia. We would like to provide a platform to discuss and explore the state of the art in the field of document changes. One of the goals of this year's edition is to review the outcomes of the last four editions and to develop plans for the future.
Workflow Support for Live Object-Based Broadcasting
Jack Jansen, Pablo César, D. Bulterman. Proceedings of the ACM Symposium on Document Engineering 2018. DOI: 10.1145/3209280.3209528

This paper examines the document aspects of object-based broadcasting. Object-based broadcasting augments traditional video and audio broadcast content with additional (temporally-constrained) media objects. The content of these objects, as well as their temporal validity, is determined by the broadcast source, but the actual rendering and placement of these objects can be customized to the needs and constraints of the content viewer(s). The use of object-based broadcasting enables a more tailored end-user experience than the one-size-fits-all of traditional broadcasts: the viewer may be able to selectively turn off overlay graphics (such as statistics) during a sports game, or selectively render them on a secondary device. Object-based broadcasting also holds the potential for supporting presentation adaptivity for accessibility or for device heterogeneity. From a technology perspective, object-based broadcasting resembles a traditional IP media stream, accompanied by a structured multimedia document that contains timed rendering instructions. Unfortunately, the use of object-based broadcasting is severely limited because of the problems it poses for the traditional television production workflow (and in particular, for use in live television production). The traditional workflow places graphics, effects, and replays as immutable components in the main audio/video feed originating from, for example, a production truck outside a sports stadium. This single feed is then delivered near-live to the homes of all viewers.

In order to effectively support dynamic object-based broadcasting, the production workflow will need to retain a familiar creative interface for the production staff, but also allow the insertion and delivery of a differentiated set of objects for selective use at the receiving end. In this paper we present a model and implementation of a dynamic system for supporting object-based broadcasting in the context of a motor sport application. We define a new multimedia document format that supports dynamic modifications during playback; this allows editing decisions by the producer to be activated by agents at the receiving end of the content. We describe a prototype system that allows playback of these broadcasts, and a production system that allows live object-based control within the production workflow. We conclude with an evaluation of a trial using near-live deployment of the environment, using content from our partners, in a sport environment.
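The core document idea — a main feed plus overlay objects with temporal validity, where the receiver decides what to render — can be sketched as follows. This is an illustrative model only; the field names and structure are assumptions, not the paper's actual document format.

```python
# Hypothetical sketch of an object-based broadcast document: a main A/V feed
# plus overlay objects, each with a temporal validity window. The receiver
# (not the producer) decides which active objects to actually render.
broadcast_doc = {
    "main_feed": "udp://production-truck.example/race",
    "objects": [
        {"id": "lap-times", "kind": "graphic", "valid": (10.0, 90.0)},
        {"id": "replay-1", "kind": "video", "valid": (45.0, 60.0)},
    ],
}

def active_objects(doc, t, disabled=()):
    """Objects whose validity window covers time t, minus viewer opt-outs."""
    return [o for o in doc["objects"]
            if o["valid"][0] <= t <= o["valid"][1] and o["id"] not in disabled]

# At t=50 both objects are live; a viewer who turned off graphics sees one.
live = [o["id"] for o in active_objects(broadcast_doc, 50.0)]
assert live == ["lap-times", "replay-1"]
filtered = [o["id"] for o in active_objects(broadcast_doc, 50.0, disabled={"lap-times"})]
assert filtered == ["replay-1"]
```

The same validity windows that drive rendering at the receiver are what the production system would update live, which is why the document must tolerate modification during playback.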
The Causal Graph CRDT for Complex Document Structure
A. Hall, Grant Nelson, Mike Thiesen, Nate Woods. Proceedings of the ACM Symposium on Document Engineering 2018. DOI: 10.1145/3209280.3229110

Commutative Replicated Data Types (CRDTs) are an emerging tool for real-time collaborative editing. Existing work on CRDTs mostly focuses on documents as a list of text content, but large documents (having over 7,000 pages) with complex sectional structure need higher-level organization. We introduce the Causal Graph, which extends the Causal Tree CRDT into a graph of nodes and transitions to represent ordered trees. This data structure is useful in driving document outlines for large collaborative documents, resolving structures with over 100,000 sections in less than a second.
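To ground the Causal Tree starting point mentioned above, here is a minimal, illustrative sketch of the general idea (this is generic Causal-Tree-style convergence, not the paper's Causal Graph): every insert carries a globally unique id and names the node it was inserted after, so every replica resolves concurrent edits to the same ordered tree regardless of delivery order.

```python
# Minimal Causal-Tree-flavoured sketch: each op has a unique id
# (counter, replica) and an "after" pointer; concurrent siblings are
# ordered deterministically by id, so all replicas converge.
def build_tree(ops):
    children = {}
    for op in sorted(ops, key=lambda o: o["id"]):
        children.setdefault(op["after"], []).append(op)
    # Higher-id concurrent inserts win the position closest to their parent.
    for sibs in children.values():
        sibs.sort(key=lambda o: o["id"], reverse=True)

    def flatten(node_id):
        out = []
        for op in children.get(node_id, []):
            out.append(op["value"])
            out.extend(flatten(op["id"]))
        return out

    return flatten(None)  # None is the document root

ops = [
    {"id": (1, "a"), "after": None, "value": "Intro"},
    {"id": (2, "a"), "after": (1, "a"), "value": "Methods"},
    {"id": (2, "b"), "after": (1, "a"), "value": "Background"},  # concurrent edit
]
# Both replicas resolve the two concurrent inserts after "Intro" identically:
assert build_tree(ops) == ["Intro", "Background", "Methods"]
```

The paper's contribution is generalizing this kind of structure from a tree of characters to a graph of nodes and transitions representing ordered section trees at scale.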
OurDirection
Sadra Abrishamkar, J. Huang. Proceedings of the ACM Symposium on Document Engineering 2018. DOI: 10.1145/3209280.3229101

We propose OurDirection, an open-domain dialogue framework specialized in mimicking the Hansard (debate) materials of the Canadian House of Commons. In this framework, we employ two neural network models (a Hierarchical Recurrent Encoder-Decoder (HRED) and an RNN) to generate dialogue responses. Extensive experiments on the Hansard dataset show that the models can learn the structure of the debates and can produce reasonable responses to user entries.
Private Document Editing with some Trust
Aaron MacSween, Caleb James Delisle, P. Libbrecht, Yann Flory. Proceedings of the ACM Symposium on Document Engineering 2018. DOI: 10.1145/3209280.3209535

Document editing has migrated in the last decade from a mostly individual activity to a shared activity among multiple persons. The World Wide Web and other communication means have contributed to this evolution. However, collaboration via the web has shown a tendency to centralize information, making it accessible to subsequent uses and abuses, such as surveillance, marketing, and data theft. Traditionally, access control policies have been enforced by a central authority, usually the server hosting the content, a single point of failure. We describe a novel scheme for collaborative editing in which clients enforce access control through the use of strong encryption. Encryption keys are distributed as the portion of a URI which is not shared with the server, enabling users to adopt a variety of document security workflows. This system separates access to the information ("the key") from the responsibility of hosting the content ("the carrier of the vault"), allowing privacy-conscious editors to enjoy a modern collaborative editing experience without relaxing their requirements. The paper presents CryptPad, an open-source reference implementation which features a variety of editors that employ the described access control methodology. We detail approaches for implementing a variety of features required for user productivity in a manner that satisfies user-defined privacy concerns.
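The "portion of a URI not shared with the server" is the fragment: browsers never transmit anything after `#` in an HTTP request. A small sketch of that mechanism (illustrative only; this is not CryptPad's actual key scheme or URL layout):

```python
import secrets
from urllib.parse import urlsplit

# Sketch of the key-in-the-fragment idea: the URL carries the symmetric key
# to collaborators, while the server only ever sees the path and ciphertext.
def make_pad_url(host, pad_id):
    key = secrets.token_urlsafe(24)  # hypothetical symmetric key
    return f"https://{host}/pad/{pad_id}#{key}", key

def server_visible_part(url):
    """What the hosting server learns: everything except the #fragment."""
    return urlsplit(url)._replace(fragment="").geturl()

url, key = make_pad_url("pads.example.org", "42")
assert key not in server_visible_part(url)   # host never learns the key
assert url.rsplit("#", 1)[1] == key          # collaborators recover it locally
```

Sharing the full URL grants read/write access; sharing only the server-visible part grants nothing, which is what enables the varied security workflows described above.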
Never the Same Stream: netomat, XLink, and Metaphors of Web Documents
Colin Post, Patrick Golden, R. Shaw. Proceedings of the ACM Symposium on Document Engineering 2018. DOI: 10.1145/3209280.3209530

Document engineering employs practices of modeling and representation. Enactment of these practices relies on shared metaphors. However, choices driven by metaphor often receive less attention than those driven by factors critical to developing working systems, such as performance and usability. One way to remedy this issue is to take a historical approach, studying cases without a guiding concern for their ongoing development and maintenance. In this paper, we compare two historical case studies of "failed" designs for hypertext on the Web. The first case is netomat (1999), a Web browser created by the artist Maciej Wisniewski, which responded to search queries with dynamic multimedia streams culled from across the Web and structured by a custom markup language. The second is the XML Linking Language (XLink), a W3C standard to express hypertext links within and between XML documents. Our analysis focuses on the relationship between the metaphors used to make sense of Web documents and the hypermedia structures they compose. The metaphors offered by netomat and XLink stand as alternatives to metaphors of the "page" or the "app." Our intent here is not to argue that any of these metaphors are superior, but to consider how designers' and engineers' metaphorical choices are situated within a complex of already existing factors shaping Web technology and practice.

The results provide insight into underexplored interconnections between art and document engineering at a critical moment in the history of the Web, and demonstrate the value for designers and engineers of studying "paths not taken" during the history of the technologies we work on today.
Main Content Detection in HTML Journal Articles
Alastair R. Rae, Jongwoo Kim, D. Le, G. Thoma. Proceedings of the ACM Symposium on Document Engineering 2018. DOI: 10.1145/3209280.3229115

Web content extraction algorithms have been shown to improve the performance of web content analysis tasks. This is because noisy web page content, such as advertisements and navigation links, can significantly degrade performance. This paper presents a novel and effective layout analysis algorithm for main content detection in HTML journal articles. The algorithm first segments a web page based on rendered line breaks, then based on its column structure, and finally identifies the column that contains the most paragraph text. On a test set of 359 manually labeled HTML journal articles, the proposed layout analysis algorithm was found to significantly outperform an alternative semantic markup algorithm based on HTML5 semantic tags. The precision, recall, and F-score of the layout analysis algorithm were measured to be 0.96, 0.99, and 0.98, respectively.
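The final selection step of such a layout analysis — pick the column with the most paragraph text — is simple to sketch. The segmentation into columns and the paragraph classification are assumed to have happened upstream; the data layout here is illustrative, not the paper's.

```python
# Sketch of the column-selection heuristic: after segmenting a rendered page
# into columns of text blocks, keep the column with the largest total amount
# of paragraph text. Navigation and ads contribute little paragraph text.
def pick_main_column(columns):
    """columns: list of columns, each a list of blocks tagged as paragraph or not."""
    def paragraph_chars(col):
        return sum(len(b["text"]) for b in col if b["is_paragraph"])
    return max(columns, key=paragraph_chars)

nav = [{"text": "Home", "is_paragraph": False},
       {"text": "Current Issue", "is_paragraph": False}]
body = [{"text": "Web content extraction algorithms have been shown to improve "
                 "the performance of web content analysis tasks.", "is_paragraph": True}]
ads = [{"text": "Subscribe now!", "is_paragraph": False}]

assert pick_main_column([nav, body, ads]) is body
```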
Semantically Weighted Similarity Analysis for XML-based Content Components
Jan Oevermann, Christoph Lüth. Proceedings of the ACM Symposium on Document Engineering 2018. DOI: 10.1145/3209280.3229098

Uncontrolled variants and duplicate content are ongoing problems in component content management; they decrease the overall reuse of content components. Similarity analyses can help to clean up existing databases and identify problematic texts; however, the large amount of data and the intentional variants in technical texts make this a challenging task. We tackle this problem by using an efficient cosine similarity algorithm which leverages semantic information from XML-based information models. To verify our approach we built a browser-based prototype which can identify intentional variants by weighting semantic text properties with high performance. The prototype was successfully deployed in an industry project with a large-scale content corpus.
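A weighted cosine similarity of the general kind described above can be sketched in a few lines. The weighting scheme here is illustrative: in the paper's setting the weights would come from semantic properties of the XML information model, whereas below they are supplied by hand.

```python
import math

# Sketch: cosine similarity over term-count vectors, with per-term weights.
# Down-weighting terms that the information model marks as intentional
# variants (e.g. product names) keeps such variants from masking near-duplicates.
def weighted_cosine(a, b, weights):
    terms = set(a) | set(b)
    wa = {t: a.get(t, 0) * weights.get(t, 1.0) for t in terms}
    wb = {t: b.get(t, 0) * weights.get(t, 1.0) for t in terms}
    dot = sum(wa[t] * wb[t] for t in terms)
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Two instruction variants differing only in a product name:
doc1 = {"tighten": 1, "bolt": 1, "modelA": 1}
doc2 = {"tighten": 1, "bolt": 1, "modelB": 1}
plain = weighted_cosine(doc1, doc2, {})
weighted = weighted_cosine(doc1, doc2, {"modelA": 0.1, "modelB": 0.1})
assert weighted > plain  # the intentional variant no longer dominates
```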
Helmholtz Principle on word embeddings for automatic document segmentation
D. Krzemiński, H. Balinsky, A. Balinsky. Proceedings of the ACM Symposium on Document Engineering 2018. DOI: 10.1145/3209280.3229103

Automatic document segmentation is receiving more and more attention in the natural language processing field. The problem is defined as the division of text into lexically coherent fragments. Most realistic documents are not homogeneous, so extracting the underlying structure may improve the performance of various algorithms for problems like topic recognition, document summarization, or document categorization. At the same time, recent advances in word embedding procedures have accelerated the development of various text mining methods. Models such as word2vec or GloVe allow for efficiently learning representations of large textual datasets and thus introduce more robust measures of word similarity. This study proposes a new document segmentation algorithm combining an embedding-based measure of the relation between words with the Helmholtz Principle for text mining. We compare two of the most common word embedding models and show the improvement of our approach on a benchmark dataset.
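The embedding side of such an approach can be sketched as follows (the Helmholtz-Principle significance test itself is omitted; this is only the generic intuition, not the paper's algorithm): adjacent sentence vectors that are nearly orthogonal suggest a topic boundary.

```python
import math

# Toy sketch: place segment boundaries where the cosine similarity between
# adjacent sentence embeddings drops below a threshold.
def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def boundaries(sentence_vectors, threshold=0.3):
    """Indices i such that a new segment starts at sentence i."""
    return [i for i in range(1, len(sentence_vectors))
            if cosine(sentence_vectors[i - 1], sentence_vectors[i]) < threshold]

# Toy 2-d "embeddings": two sport-like sentences, then two finance-like ones.
vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
assert boundaries(vecs) == [2]  # one boundary, at the topic shift
```

Real embeddings are of course high-dimensional and the paper replaces the fixed threshold with a principled significance test, but the coherence-drop intuition is the same.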