Proceedings of the ACM Symposium on Document Engineering 2018: Latest Articles

Annotation Data Management with JeDIS
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229102
E. Faessler, U. Hahn
This paper introduces the Jena Document Information System (JeDIS). The focus lies on its capability to partition annotation graphs into modules. Annotation modules are defined in terms of types from the annotation schema. Modules allow easy manipulation of their annotations (deletion or update) and the creation of alternative annotations of individual documents even for annotation formalisms that by design do not support this feature.
Citations: 0
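As a rough illustration of the module idea in the JeDIS abstract above, the sketch below groups annotations into modules keyed by annotation type, so that one module can be replaced without touching the others. The `Annotation` record, the `MODULES` mapping and the type names are hypothetical; this is not JeDIS's actual data model or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str   # annotation schema type, e.g. "Token" (hypothetical names)
    begin: int  # start offset in the document text
    end: int    # end offset in the document text


# Hypothetical module definitions: module name -> schema types it owns.
MODULES = {"tokens": {"Token"}, "entities": {"Gene", "Protein"}}


def partition(annotations):
    """Split a flat annotation list into per-module lists by type."""
    type_to_module = {t: m for m, types in MODULES.items() for t in types}
    modules = defaultdict(list)
    for ann in annotations:
        modules[type_to_module[ann.type]].append(ann)
    return dict(modules)


doc = [Annotation("Token", 0, 4), Annotation("Gene", 0, 4), Annotation("Token", 5, 9)]
modules = partition(doc)

# Store an alternative entity annotation; the token module is unaffected.
modules["entities"] = [Annotation("Protein", 0, 4)]
```

Keying modules by schema type mirrors the abstract's statement that modules are defined in terms of types from the annotation schema; how JeDIS stores or versions them is not covered here.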
Document Changes: Modeling, Detection, Storage and Visualization (DChanges 2018)
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3232792
Gioele Barabucci, Uwe M. Borghoff, A. Iorio, Sonja Schimmler, E. Munson
The DChanges series of workshops focuses on changes in all their aspects and applications: algorithms to detect changes, models to describe them and techniques to present them to the users are only some of the topics that are investigated. This year, we would like to focus on collaboration tools for non-textual documents. The workshop is open to researchers and practitioners from industry and academia. We would like to provide a platform to discuss and explore the state of the art in the field of document changes. One of the goals of this year's edition is to review the outcomes of the last four editions and to develop plans for the future.
Citations: 0
Workflow Support for Live Object-Based Broadcasting
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3209528
Jack Jansen, Pablo César, D. Bulterman
This paper examines the document aspects of object-based broadcasting. Object-based broadcasting augments traditional video and audio broadcast content with additional (temporally-constrained) media objects. The content of these objects -- as well as their temporal validity -- is determined by the broadcast source, but the actual rendering and placement of these objects can be customized to the needs/constraints of the content viewer(s). The use of object-based broadcasting enables a more tailored end-user experience than the one-size-fits-all approach of traditional broadcasts: the viewer may be able to selectively turn off overlay graphics (such as statistics) during a sports game, or selectively render them on a secondary device. Object-based broadcasting also holds the potential for supporting presentation adaptivity for accessibility or for device heterogeneity. From a technology perspective, object-based broadcasting resembles a traditional IP media stream, accompanied by a structured multimedia document that contains timed rendering instructions. Unfortunately, the use of object-based broadcasting is severely limited because of the problems it poses for the traditional television production workflow (and in particular, for use in live television production). The traditional workflow places graphics, effects and replays as immutable components in the main audio/video feed originating from, for example, a production truck outside a sports stadium. This single feed is then delivered near-live to the homes of all viewers. In order to effectively support dynamic object-based broadcasting, the production workflow will need to retain a familiar creative interface to the production staff, but also allow the insertion and delivery of a differentiated set of objects for selective use at the receiving end. In this paper we present a model and implementation of a dynamic system for supporting object-based broadcasting in the context of a motor sport application. We define a new multimedia document format that supports dynamic modifications during playback; this allows editing decisions by the producer to be activated by agents at the receiving end of the content. We describe a prototype system to allow playback of these broadcasts and a production system that allows live object-based control within the production workflow. We conclude with an evaluation of a trial using near-live deployment of the environment, using content from our partners, in a sport environment.
Citations: 5
The Causal Graph CRDT for Complex Document Structure
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229110
A. Hall, Grant Nelson, Mike Thiesen, Nate Woods
Commutative Replicated Data Types (CRDTs) are an emerging tool for real-time collaborative editing. Existing work on CRDTs mostly focuses on documents as a list of text content, but large documents (having over 7,000 pages) with complex sectional structure need higher-level organization. We introduce the Causal Graph, which extends the Causal Tree CRDT into a graph of nodes and transitions to represent ordered trees. This data structure is useful in driving document outlines for large collaborative documents, resolving structures with over 100,000 sections in less than a second.
Citations: 0
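To make the convergence property behind CRDT-driven outlines concrete, here is a deliberately simplified sketch: each insert carries a parent reference, a logical clock and a site id, and the outline is derived by a deterministic sort, so replicas that receive the same operations in different orders resolve the same ordered tree. The `Node` fields and the `(clock, site)` sibling ordering are assumptions made for illustration; this is not the Causal Graph algorithm from the paper, nor the ordering rule of the Causal Tree CRDT.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    id: str
    parent: Optional[str]  # None marks a top-level section
    clock: int             # logical timestamp of the insert (assumed)
    site: str              # replica id, used only to break clock ties


def outline(nodes):
    """Build a nested outline that does not depend on delivery order."""
    children = {}
    for n in nodes:
        children.setdefault(n.parent, []).append(n)

    def walk(parent):
        # Siblings are ordered by (clock, site), which is a total order
        # as long as each site issues distinct clock values.
        ordered = sorted(children.get(parent, []), key=lambda n: (n.clock, n.site))
        return [(n.id, walk(n.id)) for n in ordered]

    return walk(None)


ops = [
    Node("intro", None, 1, "a"),
    Node("methods", None, 2, "b"),
    Node("setup", "methods", 3, "a"),
    Node("data", "methods", 3, "b"),
]

# A second replica sees the same operations in a different order.
shuffled = ops[:]
random.shuffle(shuffled)
```

Calling `outline(ops)` and `outline(shuffled)` yields the same nested list, which is the property a collaborative outline view relies on. Deletions, moves and concurrent re-parenting are left out of this sketch.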
OurDirection
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229101
Sadra Abrishamkar, J. Huang
We propose OurDirection, an open-domain dialogue framework that specializes in mimicking the Hansard (debate) materials from the Canadian House of Commons. In this framework, we employed two sets of neural network models (Hierarchical Recurrent Encoder-Decoder (HRED) and RNN) to generate the dialogue responses. Extensive experiments on the Hansard dataset show that the models can learn the structure of the debates and can produce reasonable responses to user entries.
Citations: 0
Private Document Editing with some Trust
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3209535
Aaron MacSween, Caleb James Delisle, P. Libbrecht, Yann Flory
Document editing has migrated in the last decade from a mostly individual activity to a shared activity among multiple persons. The World Wide Web and other communication means have contributed to this evolution. However, collaboration via the web has shown a tendency to centralize information, making it accessible to subsequent uses and abuses, such as surveillance, marketing, and data theft. Traditionally, access control policies have been enforced by a central authority, usually the server hosting the content, a single point of failure. We describe a novel scheme for collaborative editing in which clients enforce access control through the use of strong encryption. Encryption keys are distributed as the portion of a URI which is not shared with the server, enabling users to adopt a variety of document security workflows. This system separates access to the information ("the key") from the responsibility of hosting the content ("the carrier of the vault"), allowing privacy-conscious editors to enjoy a modern collaborative editing experience without relaxing their requirements. The paper presents CryptPad, an open-source reference implementation which features a variety of editors which employ the described access control methodology. We will detail approaches for implementing a variety of features required for user productivity in a manner that satisfies user-defined privacy concerns.
Citations: 1
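The key-distribution idea in the abstract above can be sketched under one assumption: that "the portion of a URI which is not shared with the server" is the fragment (the part after `#`), which HTTP clients do not include in the request they send. The host `pad.example`, the path and both helper functions are hypothetical, and this is not CryptPad's actual link format or key derivation.

```python
import secrets
from urllib.parse import urlsplit


def make_share_link(base_url: str) -> str:
    """Append a fresh random key as the URI fragment (illustrative only)."""
    key = secrets.token_urlsafe(32)
    return f"{base_url}#{key}"


def request_target(link: str) -> str:
    """Return the path and query that would appear in an HTTP request line."""
    parts = urlsplit(link)
    return parts.path + (f"?{parts.query}" if parts.query else "")


link = make_share_link("https://pad.example/doc/abc123")
key = urlsplit(link).fragment  # stays on the client; used to encrypt/decrypt
```

Anyone holding the full link holds the key, while the host only ever sees the request target `/doc/abc123`. That separation is what lets the server carry the encrypted content without being able to read it.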
Never the Same Stream: netomat, XLink, and Metaphors of Web Documents
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3209530
Colin Post, Patrick Golden, R. Shaw
Document engineering employs practices of modeling and representation. Enactment of these practices relies on shared metaphors. However, choices driven by metaphor often receive less attention than those driven by factors critical to developing working systems, such as performance and usability. One way to remedy this issue is to take a historical approach, studying cases without a guiding concern for their ongoing development and maintenance. In this paper, we compare two historical case studies of "failed" designs for hypertext on the Web. The first case is netomat (1999), a Web browser created by the artist Maciej Wisniewski, which responded to search queries with dynamic multimedia streams culled from across the Web and structured by a custom markup language. The second is the XML Linking Language (XLink), a W3C standard to express hypertext links within and between XML documents. Our analysis focuses on the relationship between the metaphors used to make sense of Web documents and the hypermedia structures they compose. The metaphors offered by netomat and XLink stand as alternatives to metaphors of the "page" or the "app." Our intent here is not to argue that any of these metaphors are superior, but to consider how designers' and engineers' metaphorical choices are situated within a complex of already existing factors shaping Web technology and practice. The results provide insight into underexplored interconnections between art and document engineering at a critical moment in the history of the Web, and demonstrate the value for designers and engineers of studying "paths not taken" during the history of the technologies we work on today.
Citations: 0
Main Content Detection in HTML Journal Articles
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229115
Alastair R. Rae, Jongwoo Kim, D. Le, G. Thoma
Web content extraction algorithms have been shown to improve the performance of web content analysis tasks. This is because noisy web page content, such as advertisements and navigation links, can significantly degrade performance. This paper presents a novel and effective layout analysis algorithm for main content detection in HTML journal articles. The algorithm first segments a web page based on rendered line breaks, then based on its column structure, and finally identifies the column that contains the most paragraph text. On a test set of 359 manually labeled HTML journal articles, the proposed layout analysis algorithm was found to significantly outperform an alternative semantic markup algorithm based on HTML 5 semantic tags. The precision, recall, and F-score of the layout analysis algorithm were measured to be 0.96, 0.99, and 0.98 respectively.
Citations: 1
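The last step described in the abstract above, choosing the column that contains the most paragraph text, can be sketched as follows. The sketch assumes that earlier layout steps have already assigned each block to a column; the `Block` record and the use of character counts over `p` elements as the measure of "paragraph text" are assumptions, not details taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    column: int  # column index assigned by a prior layout step (assumed)
    tag: str     # HTML tag name of the block
    text: str    # rendered text of the block


def main_column(blocks):
    """Return the index of the column with the most paragraph characters."""
    totals = Counter()
    for b in blocks:
        if b.tag == "p":
            totals[b.column] += len(b.text)
    if not totals:
        return None  # no paragraph text found on the page
    return totals.most_common(1)[0][0]


blocks = [
    Block(0, "a", "Home"),
    Block(0, "p", "Subscribe now"),
    Block(1, "p", "Web content extraction improves downstream analysis tasks."),
    Block(1, "p", "Noisy content such as advertisements degrades performance."),
]
```

Here `main_column(blocks)` selects column 1, since the navigation column contributes only a short paragraph. Segmenting by rendered line breaks and detecting the column structure are the harder parts of the approach and are not shown.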
Semantically Weighted Similarity Analysis for XML-based Content Components
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229098
Jan Oevermann, Christoph Lüth
Uncontrolled variants and duplicate content are ongoing problems in component content management; they decrease the overall reuse of content components. Similarity analyses can help to clean up existing databases and identify problematic texts; however, the large amount of data and intentional variants in technical texts make this a challenging task. We tackle this problem by using an efficient cosine similarity algorithm which leverages semantic information from XML-based information models. To verify our approach we built a browser-based prototype which can identify intentional variants by weighting semantic text properties with high performance. The prototype was successfully deployed in an industry project with a large-scale content corpus.
Citations: 1
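One plausible reading of "semantically weighted" cosine similarity is that tokens count more when they occur in semantically important XML elements. The sketch below follows that reading: term counts are scaled by a per-element weight before the standard cosine formula is applied. The element names, the weight values and the whitespace tokenizer are all hypothetical, and the paper's actual weighting scheme may differ.

```python
import math
from collections import Counter

# Hypothetical weights per semantic element of an XML information model.
WEIGHTS = {"title": 3.0, "warning": 2.0, "body": 1.0}


def weighted_vector(component):
    """Build a term vector from (semantic_element, text) pairs."""
    vec = Counter()
    for element, text in component:
        for token in text.lower().split():
            vec[token] += WEIGHTS.get(element, 1.0)
    return vec


def cosine(a, b):
    """Standard cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


a = weighted_vector([("title", "Replace filter"), ("body", "Open the cover")])
b = weighted_vector([("title", "Replace filter"), ("body", "Close the cover")])
c = weighted_vector([("title", "Clean housing"), ("body", "Open the cover")])
```

With these weights, `a` and `b` (same title, slightly different body) score much higher than `a` and `c` (different title, same body), which is the kind of distinction that helps separate near-duplicates from intentional variants.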
Helmholtz Principle on word embeddings for automatic document segmentation
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229103
D. Krzemiński, H. Balinsky, A. Balinsky
Automatic document segmentation is receiving more and more attention in the natural language processing field. The problem is defined as the division of text into lexically coherent fragments. In fact, most realistic documents are not homogeneous, so extracting the underlying structure might increase the performance of various algorithms in problems like topic recognition, document summarization, or document categorization. At the same time, recent advances in word embedding procedures have accelerated the development of various text mining methods. Models such as word2vec or GloVe allow efficient learning of representations of large textual datasets and thus introduce more robust measures of word similarity. This study proposes a new document segmentation algorithm that combines an embedding-based measure of the relation between words with the Helmholtz Principle for text mining. We compare two of the most common word embedding models and show the improvement of our approach on a benchmark dataset.