
Latest publications from the Proceedings of the ACM Symposium on Document Engineering

Humanist-centric tools for big data: berkeley prosopography services
P. Schmitz, L. Pearce
In this paper, we describe Berkeley Prosopography Services (BPS), a new set of tools for prosopography - the identification of individuals and the study of their interactions - in support of humanities research. Prosopography is an example of "big data" in the humanities, characterized not by the size of the datasets but by the way that computational and data-driven methods can transform scholarly workflows. BPS is based upon re-usable infrastructure, supporting generalized web services for corpus management, social network analysis, and visualization. The BPS disambiguation model is a formal implementation of the traditional heuristics used by humanists, and supports plug-in rules for adaptation to a wide range of domain corpora. A workspace model supports exploratory research and collaboration. We contrast the BPS model of configurable heuristic rules with other approaches to automated text analysis, and explain how our model facilitates interpretation by humanist researchers. We describe the significance of the BPS assertion model, in which researchers assert conclusions or possibilities, allowing them to override automated inference, explore ideas in what-if scenarios, and formally publish and subscribe to asserted annotations among colleagues and/or students. We present an initial evaluation of researchers' experience using the tools to study corpora of cuneiform tablets, and describe plans to expand the application of the tools to a broader range of corpora.
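The abstract's notion of configurable, plug-in disambiguation rules can be illustrated with a small sketch: each rule scores a pair of name attestations for co-reference, and weighted scores are combined. The rule names, weights, and record fields below are invented stand-ins, not the actual BPS API.

```python
# Plug-in heuristic rules for name disambiguation, in the spirit of BPS.
# Each rule maps a pair of attestation records to a score contribution.

def same_role(a, b):
    """Heuristic: matching roles support co-reference."""
    return 1.0 if a["role"] == b["role"] else 0.0

def close_in_time(a, b, window=10):
    """Heuristic: attestations within a few years support co-reference."""
    return 1.0 if abs(a["year"] - b["year"]) <= window else -0.5

def score_pair(a, b, rules):
    """Combine weighted plug-in rules into a single co-reference score."""
    return sum(weight * rule(a, b) for rule, weight in rules)

# Researchers could adapt the rule set and weights to their corpus.
rules = [(same_role, 0.6), (close_in_time, 0.4)]
a = {"name": "Bel-iddin", "role": "scribe", "year": -545}
b = {"name": "Bel-iddin", "role": "scribe", "year": -542}
print(score_pair(a, b, rules))  # high score: both heuristics fire
```

An assertion layer, as described in the abstract, would then let a researcher override whatever conclusion such scoring suggests.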
DOI: 10.1145/2644866.2644870 · Pages: 179-188 · Published: 2014-09-16
Citations: 1
Fine-grained change detection in structured text documents
Hannes Dohrn, D. Riehle
Detecting and understanding changes between document revisions is an important task. The acquired knowledge can be used to classify the nature of a new document revision or to support a human editor in the review process. While purely textual change detection algorithms offer fine-grained results, they do not understand the syntactic meaning of a change. By representing structured text documents as XML documents, we can apply tree-to-tree correction algorithms to identify the syntactic nature of a change. Many algorithms for change detection in XML documents have been proposed, but most of them focus on the intricacies of generic XML data and emphasize speed over the quality of the result. Structured text requires a change detection algorithm to pay close attention to the content in text nodes; recent algorithms, however, treat text nodes as black boxes. We present an algorithm that combines the advantages of the purely textual approach with those of tree-to-tree change detection by redistributing text from non-overlapping common substrings to the nodes of the trees. This allows us to spot changes not only in the structure but also in the text itself, achieving higher quality and a fine-grained result in linear time on average. The algorithm is evaluated by applying it to the corpus of structured text documents found in the English Wikipedia.
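The key ingredient - non-overlapping common substrings between two text revisions - can be sketched with the standard library. Here `difflib.SequenceMatcher` stands in for the paper's own matching, and the redistribution of matched text to tree nodes is elided.

```python
# Find non-overlapping common substrings between two text revisions.
# Matched text can then be "redistributed" to tree nodes, so only the
# genuine edits remain as changes.
from difflib import SequenceMatcher

def common_substrings(old, new, min_len=4):
    """Return the matched spans of `old`, longest-match-first recursion."""
    sm = SequenceMatcher(None, old, new, autojunk=False)
    return [old[b.a:b.a + b.size]
            for b in sm.get_matching_blocks()
            if b.size >= min_len]

old = "The quick brown fox jumps over the lazy dog."
new = "The quick red fox jumps over the lazy cat."
print(common_substrings(old, new))
```

Everything outside the returned spans ("brown"/"red", "dog."/"cat.") is what a fine-grained differ would report as changed text.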
DOI: 10.1145/2644866.2644880 · Pages: 87-96 · Published: 2014-09-16
Citations: 4
Image-based document management: aggregating collections of handwritten forms
J. Barrus, E. L. Schwartz
Many companies still operate critical business processes using paper-based forms, including customer surveys, inspections, contracts and invoices. Converting those handwritten forms to symbolic data is expensive and complicated. This paper presents an overview of the Image-Based Document Management (IBDM) system for analyzing handwritten forms without requiring conversion to symbolic data. Strokes captured in a questionnaire on a tablet are separated into fields that are then displayed in a spreadsheet. Rows represent documents while columns represent corresponding fields across all documents. IBDM allows a process owner to capture and analyze large collections of documents with minimal IT support. IBDM supports the creation of filters and queries on the data. IBDM also allows the user to request symbolic conversion of individual columns of data and permits the user to create custom views by reordering and sorting the columns. In other words, IBDM provides a "writing on paper" experience for the data collector and a web-based database experience for the analyst.
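The stroke-to-spreadsheet aggregation the abstract describes can be sketched as binning captured strokes into named field regions of a form template, one row per document. The template geometry and data layout below are invented for illustration, not IBDM's actual format.

```python
# Toy sketch of IBDM-style aggregation: strokes from one filled-in form
# are binned into field regions; each document becomes a spreadsheet row
# whose columns are the form's fields.
FIELDS = {"name": (0, 0, 200, 40), "date": (210, 0, 300, 40)}  # x0,y0,x1,y1

def field_of(stroke):
    """Assign a stroke (by its first point) to the field region containing it."""
    x, y = stroke[0]
    for name, (x0, y0, x1, y1) in FIELDS.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None

def to_row(strokes):
    """One document's strokes -> one spreadsheet row, keyed by field."""
    row = {name: [] for name in FIELDS}
    for s in strokes:
        f = field_of(s)
        if f:
            row[f].append(s)
    return row

doc = [[(10, 12), (30, 18)], [(250, 20), (260, 22)]]  # two captured strokes
row = to_row(doc)
print({k: len(v) for k, v in row.items()})
```

Filtering, sorting, and on-demand symbolic conversion would then operate column-wise over many such rows.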
DOI: 10.1145/2644866.2644891 · Pages: 117-120 · Published: 2014-09-16
Citations: 0
The virtual splitter: refactoring web applications for the multiscreen environment
Mira Sarkis, C. Concolato, Jean-Claude Dufourd
Creating web applications for the multiscreen environment is still a challenge. One approach is to transform existing single-screen applications, but this has not yet been done automatically or generically. This paper proposes a refactoring system. It consists of a generic and extensible mapping phase that automatically analyzes the application content based on a semantic or visual criterion determined by the author or the user, and prepares it for the splitting process. The system then splits the application and delivers two instrumented applications ready for distribution across devices. At runtime, the system uses a mirroring phase to maintain the functionality of the distributed application and to support a dynamic splitting process. Developed as a Chrome extension, our approach is validated on several web applications, including a YouTube page and a video application from Mozilla.
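The mapping phase can be pictured as walking the page tree and assigning each node to a screen according to a pluggable criterion. The toy DOM and the tag-based rule below are invented stand-ins for the paper's semantic/visual analysis.

```python
# Minimal sketch of a mapping phase: walk a toy DOM and assign each node
# to one of two screens via a pluggable criterion function.
def split(node, criterion, screens=None):
    """Recursively classify nodes into screen buckets."""
    if screens is None:
        screens = {"display": [], "control": []}
    screens[criterion(node)].append(node["tag"])
    for child in node.get("children", []):
        split(child, criterion, screens)
    return screens

page = {"tag": "body", "children": [
    {"tag": "video"}, {"tag": "button"}, {"tag": "nav"}]}

# Invented rule: media goes to the shared display, widgets to the controller.
by_tag = lambda n: "display" if n["tag"] in ("video", "body") else "control"
print(split(page, by_tag))
```

The two resulting buckets correspond to the two instrumented applications; the mirroring phase would keep events flowing between them at runtime.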
DOI: 10.1145/2644866.2644893 · Pages: 139-142 · Published: 2014-09-16
Citations: 8
On automatic text segmentation
Boris Dadachev, A. Balinsky, H. Balinsky
Automatic text segmentation, the task of breaking a text into topically consistent segments, is a fundamental problem in Natural Language Processing, Document Classification and Information Retrieval. Text segmentation can significantly improve the performance of various text mining algorithms by splitting heterogeneous documents into homogeneous fragments, thus facilitating subsequent processing. Applications range from screening of radio communication transcripts to document summarization, from automatic document classification to information visualization, and from automatic filtering to security policy enforcement - all rely on, or can largely benefit from, automatic document segmentation. In this article, a novel approach for automatic text and data stream segmentation is presented and studied. The proposed automatic segmentation algorithm takes advantage of feature extraction and unusual behaviour detection algorithms developed in [4, 5]. It is entirely unsupervised and flexible enough to allow segmentation at different scales, such as short paragraphs and large sections. We also briefly review the most popular and important algorithms for automatic text segmentation and present detailed comparisons of our approach with several of those state-of-the-art algorithms.
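For readers unfamiliar with the problem, the classic lexical-cohesion family of segmenters (TextTiling and relatives, not this paper's feature-extraction method) can be sketched in a few lines: place a boundary wherever similarity between adjacent text blocks drops below a threshold.

```python
# Stdlib-only sketch of similarity-based segmentation: a boundary is
# placed between adjacent sentences whose word-overlap cosine is low.
from collections import Counter
from math import sqrt

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def boundaries(sentences, threshold=0.1):
    """Indices i such that a segment boundary falls before sentence i."""
    bags = [Counter(s.lower().split()) for s in sentences]
    return [i + 1 for i in range(len(bags) - 1)
            if cosine(bags[i], bags[i + 1]) < threshold]

sents = ["the cat sat on the mat",
         "the cat licked the mat",
         "stock prices fell sharply",
         "markets and prices fell again"]
print(boundaries(sents))  # boundary where the topic shifts
```

Unsupervised methods like the paper's differ mainly in which features they compare and how they pick the drop points, but the boundary-at-a-dip intuition carries over.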
DOI: 10.1145/2644866.2644874 · Pages: 73-80 · Published: 2014-09-16
Citations: 9
Personalized document clustering with dual supervision
Yeming Hu, E. Milios, J. Blustein, Shali Liu
The potential for semi-supervised techniques to produce personalized clusters has not been explored, largely because semi-supervised clustering algorithms have traditionally been evaluated using oracles based on underlying class labels. Although oracles allow clustering algorithms to be evaluated quickly and without labor-intensive labeling, they have the key disadvantage of always giving the same answer for the assignment of a document or a feature. Different human users, however, might assign the same document and/or feature differently because of different but equally valid points of view. In this paper, we conduct a user study in which we ask participants (users) to group the same document collection into clusters according to their own understanding; these groupings are then used to evaluate semi-supervised clustering algorithms for user personalization. Through our user study, we observe that different users have their own personalized organizations of the same collection, and that a user's organization changes over time. We therefore propose that document clustering algorithms should be able to incorporate user input and produce personalized clusters based on it. We also confirm that semi-supervised algorithms with noisy user input can still produce organizations that better match user expectations (personalization) than traditional unsupervised ones. Finally, we demonstrate that labeling keywords for clusters at the same time as labeling documents improves clustering performance, with respect to user personalization, beyond labeling documents alone.
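The "dual supervision" idea - user-labeled seed documents plus user-highlighted keywords - can be sketched as seeding clusters with labeled documents and up-weighting keyword features. The seeds, keywords, and weighting below are invented for illustration, not the paper's algorithm.

```python
# Toy sketch of dual supervision: labeled seed documents define clusters,
# and user-chosen keywords are boosted so they dominate the similarity.
from collections import Counter

def vec(doc, keywords, boost=3.0):
    """Bag-of-words vector with user keywords up-weighted."""
    counts = Counter(doc.split())
    return {w: n * (boost if w in keywords else 1.0) for w, n in counts.items()}

def sim(a, b):
    return sum(a.get(w, 0) * b.get(w, 0) for w in a)

def assign(docs, seeds, keywords):
    """Assign each document to the most similar user-labeled seed."""
    seed_vecs = {label: vec(d, keywords) for label, d in seeds.items()}
    return {d: max(seed_vecs, key=lambda L: sim(vec(d, keywords), seed_vecs[L]))
            for d in docs}

seeds = {"pets": "cat dog vet", "finance": "stock market price"}
docs = ["my cat saw the vet", "the stock price rose"]
print(assign(docs, seeds, keywords={"vet", "price"}))
```

Two users supplying different seeds or keywords would get different, equally valid clusterings of the same collection, which is exactly the personalization the paper evaluates.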
DOI: 10.1145/2361354.2361393 · Pages: 161-170 · Published: 2012-09-04
Citations: 14
Just-in-time personalized video presentations
Jack Jansen, Pablo César, R. Guimarães, D. Bulterman
Using high-quality video cameras on mobile devices, it is relatively easy to capture a significant volume of video content for community events such as local concerts or sporting events. A more difficult problem is selecting and sequencing individual media fragments that meet the personal interests of a viewer of such content. In this paper, we consider an infrastructure that supports the just-in-time delivery of personalized content. Based on user profiles and interests, tailored video mash-ups can be created at view-time and then further tailored to user interests via simple end-user interaction. Unlike other mash-up research, our system focuses on client-side compilation based on personal (rather than aggregate) interests. This paper concentrates on a discussion of language and infrastructure issues required to support just-in-time video composition and delivery. Using a high school concert as an example, we provide a set of requirements for dynamic content delivery. We then provide an architecture and infrastructure that meets these requirements. We conclude with a technical and user analysis of the just-in-time personalized video approach.
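The selection-and-sequencing step can be pictured as scoring community-shot fragments against a user profile and playing the per-slot winners in time order. The fragment metadata and scoring below are invented stand-ins for the paper's infrastructure.

```python
# Sketch of view-time mash-up assembly: for each time slot, keep the
# fragment whose tags best match the viewer's interests.
fragments = [
    {"id": "a", "start": 0,  "tags": {"choir", "wide"}},
    {"id": "b", "start": 0,  "tags": {"soloist", "closeup"}},
    {"id": "c", "start": 60, "tags": {"soloist", "wide"}},
    {"id": "d", "start": 60, "tags": {"audience"}},
]

def mashup(frags, interests):
    """Pick the best-scoring fragment per time slot, sequenced by time."""
    slots = {}
    for f in sorted(frags, key=lambda f: f["start"]):
        score = len(f["tags"] & interests)
        if f["start"] not in slots or score > slots[f["start"]][0]:
            slots[f["start"]] = (score, f["id"])
    return [fid for _, (score, fid) in sorted(slots.items())]

print(mashup(fragments, {"soloist", "closeup"}))  # ['b', 'c']
```

Because the selection runs at view time on the client, re-running it with a different interest set yields a different personalized presentation of the same event.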
DOI: 10.1145/2361354.2361368 · Pages: 59-68 · Published: 2012-09-04
Citations: 10
Ad insertion in automatically composed documents
Niranjan Damera-Venkata, José Bento
We consider the problem of automatically inserting advertisements (ads) into machine composed documents. We explicitly analyze the fundamental tradeoff between expected revenue due to ad insertion and the quality of the corresponding composed documents. We show that the optimal tradeoff a publisher can expect may be expressed as an efficient-frontier in the revenue-quality space. We develop algorithms to compose documents that lie on this optimal tradeoff frontier. These algorithms can automatically choose distributions of ad sizes and ad placement locations to optimize revenue for a given quality or optimize quality for given revenue. Such automation allows a market maker to accept highly personalized content from publishers who have no design or ad inventory management capability and distribute formatted documents to end users with aesthetic ad placement. The ad density/coverage may be controlled by the publisher or the end user on a per document basis by simply sliding along the tradeoff frontier. Business models where ad sales precede (ad-pull) or follow (ad-push) document composition are analyzed from a document engineering perspective.
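The efficient frontier in revenue-quality space can be traced with a simple scalarization: sweep a weight over the objective `revenue + λ · quality` and record which candidate layout wins at each λ. The candidate layouts and all numbers below are invented for illustration.

```python
# Sketch of the revenue/quality tradeoff: sweeping the scalarization
# weight recovers the layouts on the efficient frontier.
candidates = [  # (revenue, quality) per composed layout -- invented data
    (0.0, 1.00),   # no ads
    (2.0, 0.85),   # one small ad
    (5.0, 0.60),   # two ads
    (9.0, 0.30),   # dense ads
]

def frontier(cands, lams):
    """Winning (revenue, quality) points as the quality weight grows."""
    best = []
    for lam in lams:
        r, q = max(cands, key=lambda c: c[0] + lam * c[1])
        if (r, q) not in best:
            best.append((r, q))
    return best

lams = [0.0, 13.0, 100.0]
print(frontier(candidates, lams))
```

Note that the (5.0, 0.60) layout never wins: it lies below the convex hull of the other points, so no weight selects it. Sliding along the frontier, as the paper describes, amounts to letting the publisher or end user choose λ.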
DOI: 10.1145/2361354.2361358 · Pages: 3-12 · Published: 2012-09-04
Citations: 2
Receipts2Go: the big world of small documents
Bill Janssen, E. Saund, E. Bier, Patricia Wall, M. Sprague
The Receipts2Go system is about the world of one-page documents: cash register receipts, book covers, cereal boxes, price tags, train tickets, fire extinguisher tags. In that world, we're exploring techniques for extracting accurate information from documents for which we have no layout descriptions -- indeed no initial idea of what the document's genre is -- using photos taken with cell phone cameras by users who aren't skilled document capture technicians. This paper outlines the system and reports on some initial results, including the algorithms we've found useful for cleaning up those document images, and the techniques used to extract and organize relevant information from thousands of similar-but-different page layouts.
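One representative image-cleanup step for phone photos of receipts is global binarization, for which Otsu's threshold is a standard choice; the paper does not specify its algorithms, so this is an illustrative sketch, and the tiny grayscale "image" is fabricated.

```python
# Otsu global thresholding, stdlib-only: pick the gray level that
# maximizes between-class variance, separating ink from paper.
def otsu_threshold(pixels):
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total, best_t, best_var = len(pixels), 0, -1.0
    sum_all = sum(i * h for i, h in enumerate(hist))
    w0 = sum0 = 0  # running weight and intensity sum of the dark class
    for t in range(256):
        w0 += hist[t]
        if w0 in (0, total):
            continue
        sum0 += t * hist[t]
        m0 = sum0 / w0                          # dark-class mean
        m1 = (sum_all - sum0) / (total - w0)    # light-class mean
        var = w0 * (total - w0) * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

pixels = [12, 15, 20, 18, 200, 210, 205, 198]  # dark ink vs. light paper
t = otsu_threshold(pixels)
print(t, [1 if p > t else 0 for p in pixels])
```

After binarization, genre identification and field extraction would run on the cleaned image, which is where the paper's layout-free techniques come in.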
DOI: 10.1145/2361354.2361381 · Pages: 121-124 · Published: 2012-09-04
Citations: 9
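The Receipts2Go abstract mentions algorithms for cleaning up camera-captured document images. One standard cleanup step for such photos is global binarization; a self-contained sketch using Otsu's threshold (a common choice, not the paper's actual pipeline, and the toy pixel data is invented):

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg = 0.0   # running sum of gray levels in the background class
    w_bg = 0       # running pixel count of the background class
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Bimodal toy "receipt photo": dark ink near 40, bright paper near 200.
pixels = [40] * 100 + [45] * 50 + [200] * 300 + [210] * 100
t = otsu_threshold(pixels)
binary = [0 if p <= t else 255 for p in pixels]
```

On a real cell-phone capture this would follow deskewing and lighting normalization; here it only illustrates the thresholding step.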
A methodology for evaluating algorithms for table understanding in PDF documents
Max C. Göbel, Tamir Hassan, Ermelinda Oro, G. Orsi
This paper presents a methodology for the evaluation of table understanding algorithms for PDF documents. The evaluation takes into account three major tasks: table detection, table structure recognition and functional analysis. We provide a general and flexible output model for each task along with corresponding evaluation metrics and methods. We also present a methodology for collecting and ground-truthing PDF documents based on consensus-reaching principles and provide a publicly available ground-truthed dataset.
DOI: 10.1145/2361354.2361365 · Pages: 45-48 · Published: 2012-09-04
Citations: 61
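The evaluation metrics this paper proposes for table structure recognition can be illustrated by comparing neighbor relations between cells rather than raw coordinates. A hedged sketch in that spirit (the cell representation and field layout are illustrative, not the dataset's actual schema):

```python
def adjacency_relations(table):
    """table: dict mapping (row, col) -> cell text.
    Emit (cell text, neighbor text, direction) for horizontal and
    vertical neighbors, so the metric is layout-independent."""
    rels = set()
    for (r, c), text in table.items():
        for (dr, dc), direction in (((0, 1), "h"), ((1, 0), "v")):
            nb = table.get((r + dr, c + dc))
            if nb is not None:
                rels.add((text, nb, direction))
    return rels

def precision_recall_f1(gt, pred):
    tp = len(gt & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gt) if gt else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gt = adjacency_relations({(0, 0): "Year", (0, 1): "Sales",
                          (1, 0): "2011", (1, 1): "42"})
pred = adjacency_relations({(0, 0): "Year", (0, 1): "Sales",
                            (1, 0): "2011", (1, 1): "40"})  # one cell misread
```

A single misread cell invalidates every relation touching it, which is what makes this style of metric sensitive to localized structure errors.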
Copyright © 2023 Book学术 All rights reserved.