
Proceedings of the 2015 ACM Symposium on Document Engineering: Latest Publications

The Browser as a Document Composition Engine
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797063
Tamir Hassan, N. Venkata
Printing has long been a neglected aspect of the Web, and the print function of browsers, when used on documents designed for on-screen consumption, often leads to a poor result. Whereas print CSS goes some way towards optimizing the paper experience, it still does not enable full control over the page layout, which is necessary to obtain a publication-quality print result. Furthermore, its use requires web authors to invest additional resources for a feature that might only be used infrequently. This paper introduces a framework designed to alleviate these issues and improve the print experience on the Web. We describe the technologies that enable us to automatically compose and optimize the layout of a document, and generate a high quality PDF fully within the browser. This functionality can be offered to web publishers in the form of a print button, enabling content to be simultaneously delivered in screen and print formats, and ensuring a publication-quality result that adheres to the publisher's design guidelines.
Citations: 4
Concept Hierarchy Extraction from Textbooks
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797062
Shuting Wang, Chen Liang, Zhaohui Wu, Kyle Williams, B. Pursel, Benjamin Bräutigam, Sherwyn Saul, Hannah Williams, Kyle Bowen, C. Lee Giles
Concept hierarchies have been useful tools for presenting and organizing knowledge. With the rapid growth in the number of online knowledge resources, automatic concept hierarchy extraction is increasingly attractive. Here, we focus on concept extraction from textbooks based on the knowledge in Wikipedia. Given a book, we extract important concepts in each book chapter using Wikipedia as a resource and from this construct a concept hierarchy for that book. We define local and global features that capture both the local relatedness and global coherence embedded in that textbook. In order to evaluate the proposed features and extracted concept hierarchies, we manually construct concept hierarchies for three well used textbooks by labeling important concepts for each book chapter. Experiments show that our proposed local and global features achieve better performance than using only keyphrases to construct the concept hierarchies. Moreover, we observe that incorporating global features can improve the concept ranking precision and reaffirms the global coherence in the book.
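To make the idea of combining local and global signals concrete, here is a minimal Python sketch, not the paper's method: candidate n-grams from a chapter are kept only if they match a (toy) set of Wikipedia article titles, and are scored by within-chapter frequency times the number of chapters in which they occur. The title set, the chapters, and the scoring heuristic are all invented for illustration; the paper's actual local and global features are richer.

```python
# Sketch: extract candidate concepts per chapter by matching n-grams against a
# (toy) set of Wikipedia article titles, then rank them with a crude
# local-frequency x global-occurrence score. All data here is illustrative.
from collections import Counter

wikipedia_titles = {"concept hierarchy", "naive bayes", "topic model"}  # toy stand-in

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_concepts(chapter_text, max_n=3):
    tokens = chapter_text.lower().split()
    cands = []
    for n in range(1, max_n + 1):
        cands.extend(g for g in ngrams(tokens, n) if g in wikipedia_titles)
    return Counter(cands)  # local frequency within the chapter

def score_concepts(chapters):
    # Global signal: in how many chapters does each concept occur?
    local = [candidate_concepts(ch) for ch in chapters]
    global_df = Counter(c for counts in local for c in counts)
    scored = []
    for i, counts in enumerate(local):
        for c, tf in counts.items():
            scored.append((i, c, tf * global_df[c]))  # crude local x global score
    return sorted(scored, key=lambda t: -t[2])

chapters = ["A concept hierarchy organizes knowledge into levels.",
            "Naive Bayes is a simple classifier; a topic model captures themes."]
for chap_idx, concept, score in score_concepts(chapters):
    print(chap_idx, concept, score)
```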
Citations: 51
Developing Web Applications with Document Engineering Technologies and Enjoying It!
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2801034
S. Sire
This tutorial proposes a practical software development method for building web applications using the XQuery and XSLT languages to manipulate semi-structured data. The method captures solutions and practices that we have applied to many projects over the last four years. It can be used on any XML database, as it requires only a thin layer to analyze and route incoming HTTP requests to a simple pipeline that renders the page. We will demonstrate it with a real-world example developed with eXist-DB and the Oppidum lightweight XQuery framework.
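The routing idea itself is language-agnostic. Below is a schematic Python analogue of such a thin layer, invented purely for illustration; the tutorial's actual implementation uses Oppidum with XQuery/XSLT on an XML database such as eXist-DB, and the routes, data step, and rendering step shown here are placeholders.

```python
# Schematic analogue of a thin routing layer: map an incoming request path to a
# small pipeline (fetch data -> render page). In the tutorial this role is
# played by Oppidum on an XML database; this Python version is illustrative only.
from typing import Callable, Dict

def fetch_article(article_id: str) -> dict:
    # Stand-in for querying a semi-structured document store.
    return {"id": article_id, "title": "Sample article", "body": "..."}

def render_article(model: dict) -> str:
    # Stand-in for an XSLT/templating step that renders the page.
    return f"<article><h1>{model['title']}</h1><p>{model['body']}</p></article>"

ROUTES: Dict[str, Callable[[str], str]] = {
    # path prefix -> pipeline composed of a data step and a rendering step
    "/articles/": lambda ident: render_article(fetch_article(ident)),
}

def handle(path: str) -> str:
    for prefix, pipeline in ROUTES.items():
        if path.startswith(prefix):
            return pipeline(path[len(prefix):])
    return "<h1>404</h1>"

print(handle("/articles/42"))
```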
Citations: 0
Automatic Document Classification using Summarization Strategies
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797077
Rafael Ferreira, R. Lins, L. Cabral, F. Freitas, S. Simske, M. Riss
An efficient way to automatically classify documents may be provided by automatic text summarization, the task of creating a shorter text from one or several documents. This paper presents an assessment of the 15 most widely used methods for automatic text summarization from the text classification perspective. A naive Bayes classifier was used showing that some of the methods tested are better suited for such a task.
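As a rough illustration of this kind of pipeline (not the paper's experimental setup), the sketch below summarizes documents with a crude lead-sentence heuristic, standing in for the 15 summarization methods assessed, and classifies the summaries with a naive Bayes classifier from scikit-learn. The corpus and labels are toy placeholders.

```python
# Sketch: classify documents by their (crudely) summarized text with naive Bayes.
# The "summarizer" below just keeps the first sentences; it is a stand-in for the
# summarization methods evaluated in the paper. Data and labels are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def lead_summary(text, n_sentences=2):
    # Naive lead-based summary: keep the first n sentences.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:n_sentences])

docs = [
    "The printer jammed again. Paper feed errors are common. Ink was low.",
    "The match ended in a draw. Both teams scored twice. Fans went home happy.",
]
labels = ["hardware", "sports"]

summaries = [lead_summary(d) for d in docs]
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(summaries, labels)
print(clf.predict([lead_summary("The striker scored a late goal. The crowd cheered.")]))
```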
Citations: 2
The Venice Time Machine
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797071
F. Kaplan
The Venice Time Machine is an international scientific programme launched by the EPFL and the University Ca'Foscari of Venice with the generous support of the Fondation Lombard Odier. It aims at building a multidimensional model of Venice and its evolution covering a period of more than 1000 years. The project's ambition is to reconstruct a large open-access database that can be used for research and education. Thanks to a partnership with the Archivio di Stato in Venice, kilometers of archives are currently being digitized, transcribed and indexed, forming the basis of the largest database ever created on Venetian documents. The State Archives of Venice contain a massive amount of hand-written documentation in languages evolving from medieval times to the 20th century. An estimated 80 km of shelves are filled with over a thousand years of administrative documents, from birth registrations, death certificates and tax statements, all the way to maps and urban planning designs. These documents are often very delicate and are occasionally in a fragile state of conservation. Complementing these primary sources, the content of thousands of monographs has been indexed and made searchable. The documents digitised in the Venice Time Machine programme are intricately interwoven, telling a much richer story when they are cross-referenced. By combining this mass of information, it is possible to reconstruct large segments of the city's past: complete biographies, political dynamics, or even the appearance of buildings and entire neighborhoods. The information extracted from the primary and secondary sources is organized in a semantic graph of linked data and unfolded in space and time in a historical geographical information system. The resulting platform can serve both research and education. About a hundred researchers and students already collaborate on this programme. A doctoral school is organised every year in Venice, and several bachelor and master courses currently use the data produced in the context of the Venice Time Machine. Through all these initiatives, the Venice Time Machine explores how "big data of the past" can change research and education in historical sciences, hopefully paving the way towards a general methodology that could be applied to many other cities and archives.
Citations: 13
Session details: Documents Made Accessible
Pub Date : 2015-09-08 DOI: 10.1145/3256805
M. Hardy
{"title":"Session details: Documents Made Accessible","authors":"M. Hardy","doi":"10.1145/3256805","DOIUrl":"https://doi.org/10.1145/3256805","url":null,"abstract":"","PeriodicalId":106339,"journal":{"name":"Proceedings of the 2015 ACM Symposium on Document Engineering","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124211447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Segmentation of Overlapping Digits through the Emulation of a Hypothetical Ball and Physical Forces
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797080
Alberto N. G. Lopes Filho, C. Mello
This paper presents an algorithm for segmenting pairs of overlapping handwritten digits. Digits can be found overlapping in text depending on writing style and organization; digits in close proximity or with elongated strokes may also overlap with their neighbors. Applications such as automated character recognition are directly affected by overlapping characters and their segmentation. The proposed approach is based on the emulation of inertia and a deformable hypothetical ball. The strokes act as a pathway along which the ball runs and creates the segmentation. The results of the algorithm are evaluated with a digit recognizer, and it is shown that the method performs well and has a lower computational cost than other segmentation approaches.
Citations: 2
Filling the Gaps: Improving Wikipedia Stubs
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797073
Siddhartha Banerjee, P. Mitra
The availability of only a limited number of contributors on Wikipedia cannot ensure consistent growth and improvement of the online encyclopedia. With information being scattered on the web, our goal is to automate the process of generation of content for Wikipedia. In this work, we propose a technique of improving stubs on Wikipedia that do not contain comprehensive information. A classifier learns features from the existing comprehensive articles on Wikipedia and recommends content that can be added to the stubs to improve the completeness of such stubs. We conduct experiments using several classifiers - Latent Dirichlet Allocation (LDA) based model, a deep learning based architecture (Deep belief network) and TFIDF based classifier. Our experiments reveal that the LDA based model outperforms the other models (~6% F-score). Our generation approach shows that this technique is capable of generating comprehensive articles. ROUGE-2 scores of the articles generated by our system outperform the articles generated using the baseline. Content generated by our system has been appended to several stubs and successfully retained in Wikipedia.
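As a simplified illustration of how LDA topic features might be used in such a recommender (the paper's full system additionally trains classifiers and compares an LDA-based model against deep-belief-network and TF-IDF baselines), the sketch below fits a topic model on existing comprehensive articles and ranks candidate passages for a stub by the similarity of their topic distributions. The corpus, stub, and candidate passages are invented placeholders.

```python
# Sketch: fit LDA on comprehensive articles, then rank candidate web passages
# for a stub by cosine similarity of their topic distributions. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

comprehensive_articles = [
    "The city has a long maritime history and extensive canal networks.",
    "The university hosts research programmes in history and archival science.",
]
stub_text = "The city is known for its canal network."
candidates = [
    "Gondolas navigate the canal network that connects the districts.",
    "The football club was founded in 1907 and plays in the second division.",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(comprehensive_articles)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

def topics(text):
    # Topic distribution of a single text under the fitted model.
    return lda.transform(vectorizer.transform([text]))

stub_topics = topics(stub_text)
ranked = sorted(candidates, key=lambda c: -cosine_similarity(stub_topics, topics(c))[0, 0])
print(ranked[0])  # candidate whose topic distribution best matches the stub
```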
Citations: 8
Combining Advanced Information Retrieval and Text-Mining for Digital Humanities
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797067
Antoine Widlöcher, Nicolas Béchet, Jean-Marc Lecarpentier, Yann Mathet, Julia Roger
Digital Humanities are making more and more structured and richly annotated corpora available. Most of this data relies on well-known and established standards, such as TEI, which in particular enable scientists to edit and publish their work. However, one of the remaining problems is giving adequate access to this rich data in order to produce higher-order knowledge. In this paper, we present an integrated environment combining an advanced search engine and text-mining techniques for hermeneutics in the Digital Humanities. Relying on semantic web technologies, the search engine uses full text as well as complex embedding structures and offers a single interface to access rich and heterogeneous data and metadata. Text-mining possibilities enable scholars to exhibit regularities in corpora. Results obtained on the Cartesian corpus illustrate these principles and tools.
Citations: 4
Enhancing Exploration with a Faceted Browser through Summarization
Pub Date : 2015-09-08 DOI: 10.1145/2682571.2797083
Grzegorz Drzadzewski, Frank Wm. Tompa
An enhanced faceted browsing system has been developed to support users' exploration of large multi-tagged document collections. It provides summary measures of document result sets at each step of navigation through a set of representative terms and a diverse set of documents. These summaries are derived from pre-materialized views that allow for quick calculation of centroids for various result sets. The utility and efficiency of the system is demonstrated on the New York Times Annotated Corpus.
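A minimal sketch of the centroid-based summary idea, with invented data rather than the paper's pre-materialized views: per-document term vectors are assumed to be precomputed, the centroid of a result set is their mean, and the highest-weight terms are reported as representative of that set.

```python
# Sketch: summarize a faceted-navigation result set by the centroid of
# precomputed per-document term vectors and report its top-weighted terms.
# Vocabulary, weights, and document IDs are toy placeholders.
import numpy as np

vocabulary = ["election", "senate", "opera", "tennis", "budget"]
doc_vectors = {                      # precomputed tf-idf-like weights per document
    "doc1": np.array([0.8, 0.6, 0.0, 0.0, 0.3]),
    "doc2": np.array([0.7, 0.3, 0.0, 0.1, 0.5]),
    "doc3": np.array([0.0, 0.1, 0.9, 0.0, 0.0]),
}

def representative_terms(result_set, k=2):
    centroid = np.mean([doc_vectors[d] for d in result_set], axis=0)
    top = np.argsort(-centroid)[:k]
    return [vocabulary[i] for i in top]

# Result set after the user narrows by a facet (e.g., tag = "politics"):
print(representative_terms(["doc1", "doc2"]))  # -> ['election', 'senate']
```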
Citations: 2