We describe a tool designed to enhance the visualization of chemical structural formulae in e-book readers. For small formulae, the formula is converted to a vector representation and then enlarged, avoiding the pixelation that comes with zooming a raster image. Large formulae, conversely, are split into sub-images by cutting the image at suitable locations chosen to reduce the parts of the formula that are broken. In both cases the formulae are embedded in a single ePub document that lets users browse the chemical structure on most reading devices.
{"title":"Displaying chemical structural formulae in ePub format","authors":"S. Marinai, Stefano Quiriconi","doi":"10.1145/2361354.2361382","DOIUrl":"https://doi.org/10.1145/2361354.2361382","url":null,"abstract":"We describe one tool designed to enhance the visualization of chemical structural formulae in E-book readers. When dealing with small formulae, to avoid the pixelation effect with zoomed images, the formula is converted to a vectoral representation and then enlarged. On the opposite, large formulae are split in sub-images by cutting the image in suitable locations attempting to reduce the parts of the formula that are broken. In both cases the formulae are embedded in one ePub document that allows users to browse the chemical structure on most reading devices.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"34 1","pages":"125-128"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75245171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guillotine-based page layout is a method for document layout commonly used by newspapers and magazines, where each region of the page either contains a single article or is recursively split vertically or horizontally. Surprisingly, there appears to be little research into algorithms for automatic guillotine-based document layout. In this paper we give efficient algorithms to find optimal solutions to guillotine layout problems of two forms. In fixed-cut layout, the structure of the guillotining is given and we only have to determine the best configuration for each individual article to obtain the optimal total configuration. In free layout, we also have to search for the optimal structure. We give bottom-up and top-down dynamic programming algorithms to solve these problems, and propose a novel interaction model for documents on electronic media. Experiments show that our algorithms are effective for realistic layout problems.
{"title":"Optimal guillotine layout","authors":"G. Gange, K. Marriott, Peter James Stuckey","doi":"10.1145/2361354.2361359","DOIUrl":"https://doi.org/10.1145/2361354.2361359","url":null,"abstract":"Guillotine-based page layout is a method for document layout commonly used by newspapers and magazines, where each region of the page either contains a single article, or is recursively split either vertically or horizontally. Suprisingly there appears to be little research into algorithms for automatic guillotine-based document layout. In this paper we give efficient algorithms to find optimal solutions to guillotine layout problems of two forms. Fixed-cut layout is where the structure of the guillotining is given and we only have to determine the best configuration for each individual article to give the optimal total configuration. Free layout is where we also have to search for the optimal structure. We give bottom-up and top-down dynamic programming algorithms to solve these problems, and propose a novel interaction model for documents on electronic media. Experiments show that our algorithms are effective for realistic layout problems.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"230 1","pages":"13-22"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76927280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Typesetting software is often faced with conflicting aesthetic goals. For example, choosing where to break lines in text might involve minimizing hyphenation, variation in word spacing, and consecutive lines starting with the same word. Typically, automatic layout is modelled as an optimization problem whose goal is to minimize a complex objective function combining various penalty functions, each corresponding to a particular bad feature. Determining how to combine these penalty functions is difficult and very time consuming, and becomes harder each time another penalty is added. Here we present a machine-learning approach to this problem and test it in the context of line breaking. Our approach repeatedly asks the expert typographer which of a pair of layouts is better, and accordingly refines its estimate of how best to weight the penalties in a linear combination. It chooses the layout pairs it queries using a heuristic that maximizes the amount learnt from each answer, reducing the number of combinations the typographer must consider.
{"title":"Learning how to trade off aesthetic criteria in layout","authors":"P. Moulder, K. Marriott","doi":"10.1145/2361354.2361361","DOIUrl":"https://doi.org/10.1145/2361354.2361361","url":null,"abstract":"Typesetting software is often faced with conflicting aesthetic goals. For example, choosing where to break lines in text might involve aiming to minimize hyphenation, variation in word spacing, and consecutive lines starting with the same word. Typically, automatic layout is modelled as an optimization problem in which the goal is to minimize a complex objective function that combines various penalty functions each of which corresponds to a particular bad feature. Determining how to combine these penalty functions is difficult and very time consuming, becoming harder each time we add another penalty. Here we present a machine-learning approach to do this, and test it in the context of line-breaking. Our approach repeatedly queries the expert typographer as to which one of a pair of layouts is better, and accordingly refines the estimate of how best to weight the penalties in a linear combination. It chooses layout pair queries by a heuristic to maximize the amount that can be learnt from them so as to reduce the number of combinations that must be considered by the typographer.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"29 1","pages":"33-36"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81221478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Young-Min Kim, P. Bellot, J. Tavernier, Elodie Faath, Marin Dacos
Automatic bibliographic reference annotation involves tokenizing a reference and identifying its fields. Recent methods use machine-learning techniques such as Conditional Random Fields to tackle this problem. However, state-of-the-art methods are trained and evaluated on well-structured data with a simple format, such as the bibliography at the end of a scientific article, which is why they parse poorly references that deviate from a regular format. In previous work, we established a standard for tokenization and feature selection on less formulaic data such as notes. In this paper, we evaluate our system, BILBO, against other popular online reference parsing tools on new data from an entirely different source. BILBO is built on our own corpora, extracted and annotated from real-world data: digital humanities articles from the Revues.org site (90% in French) of OpenEdition. The robustness of BILBO allows language-independent tagging. We expect this first evaluation to motivate the development of other efficient techniques for scattered and less formulaic bibliographic references.
{"title":"Evaluation of BILBO reference parsing in digital humanities via a comparison of different tools","authors":"Young-Min Kim, P. Bellot, J. Tavernier, Elodie Faath, Marin Dacos","doi":"10.1145/2361354.2361400","DOIUrl":"https://doi.org/10.1145/2361354.2361400","url":null,"abstract":"Automatic bibliographic reference annotation involves the tokenization and identification of reference fields. Recent methods use machine learning techniques such as Conditional Random Fields to tackle this problem. On the other hand, the state of the art methods always learn and evaluate their systems with a well structured data having simple format such as bibliography at the end of scientific articles. And that is a reason why the parsing of new reference different from a regular format does not work well. In our previous work, we have established a standard for the tokenization and feature selection with a less formulaic data such as notes. In this paper, we evaluate our system BILBO with other popular online reference parsing tools on a new data from totally different source. BILBO is constructed with our own corpora extracted and annotated from real world data, digital humanities articles of Revues.org site (90% in French) of OpenEdition. The robustness of BILBO system allows a language independent tagging result. We expect that this first attempt of evaluation will motivate the development of other efficient techniques for the scattered and less formulaic bibliographic references.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. 
ACM Symposium on Document Engineering","volume":"23 1","pages":"209-212"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74535031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effective radical segmentation of handwritten Chinese characters can greatly facilitate subsequent character processing tasks, such as Chinese handwriting recognition/identification and the generation of Chinese handwritten fonts. In this paper, a popular snake model is enhanced with a guided image force and optimized by a Genetic Algorithm, achieving a significant improvement in both accuracy and efficiency when applied to segmenting the radicals in handwritten Chinese characters. The proposed radical segmentation approach consists of three stages: constructing guide information, Genetic Algorithm optimization, and post-embellishment. Test results show that the proposed approach can effectively decompose radicals with overlaps and connections from handwritten Chinese characters with various layout structures. Segmentation accuracy reaches 94.91% on complicated samples with overlapped and connected radicals, at a speed of 0.05 seconds per character. To demonstrate the advantages of the approach, radicals extracted from the user's input samples are reused to construct a personal Chinese handwritten font library. Experiments show that the constructed characters maintain the user's handwriting style well and perform adequately. In this way, the user only needs to write a small number of samples to obtain his or her own handwritten font library. This method greatly reduces the cost of existing solutions and makes it much easier for people to use computers to write letters/e-mails, diaries/blogs, even magazines/books in their own handwriting.
{"title":"Effective radical segmentation of offline handwritten Chinese characters towards constructing personal handwritten fonts","authors":"Zhanghui Chen, Baoyao Zhou","doi":"10.1145/2361354.2361379","DOIUrl":"https://doi.org/10.1145/2361354.2361379","url":null,"abstract":"Effective radical segmentation of handwritten Chinese characters can greatly facilitate the subsequent character processing tasks, such as Chinese handwriting recognition/identification and the generation of Chinese handwritten fonts. In this paper, a popular snake model is enhanced by considering the guided image force and optimized by Genetic Algorithm, such that it achieves a significant improvement in terms of both accuracy and efficiency when applied to segment the radicals in handwritten Chinese characters. The proposed radical segmentation approach consists of three stages: constructing guide information, Genetic Algorithm optimization and post-embellishment. Testing results show that the proposed approach can effectively decompose radicals with overlaps and connections from handwritten Chinese characters with various layout structures. The segmentation accuracy reaches 94.91% for complicated samples with overlapped and connected radicals and the segmentation speed is 0.05 second per character. For demonstrating the advantages of the approach, radicals extracted from the user input samples are reused to construct personal Chinese handwritten font library. Experiments show that the constructed characters well maintain the handwriting style of the user and have good enough performance. In this way, the user only needs to write a small number of samples for obtaining his/her own handwritten font library. 
This method greatly reduces the cost of existing solutions and makes it much easier for people to use computers to write letters/e-mails, diaries/blogs, even magazines/books in their own handwriting.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"19 1","pages":"107-116"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75624042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
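The Genetic Algorithm stage can be illustrated with a toy version: evolve a top-to-bottom cut path (one column index per row) whose fitness rewards avoiding ink and staying smooth. This is only a stand-in for the paper's snake model; the real method adds the guided image force and the post-embellishment stage.

```python
import random

def ga_cut_path(image, pop_size=30, generations=200, seed=0):
    """Evolve a top-to-bottom cut path (one column index per row) that
    avoids ink and stays smooth -- a toy GA stand-in for the paper's
    optimised snake.  `image` holds 0 (white) / 1 (black) pixels."""
    rng = random.Random(seed)
    rows, cols = len(image), len(image[0])

    def fitness(path):                       # lower is better
        ink = sum(image[r][path[r]] for r in range(rows))
        bend = sum(abs(path[r] - path[r - 1]) for r in range(1, rows))
        return 10 * ink + bend               # ink dominates smoothness

    def mutate(path):
        child = path[:]
        r = rng.randrange(rows)              # shift one row's cut by +/-1
        child[r] = min(cols - 1, max(0, child[r] + rng.choice((-1, 1))))
        return child

    pop = [[rng.randrange(cols) for _ in range(rows)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        pop = pop[:pop_size // 2]            # keep the fitter half (elitist)
        pop += [mutate(rng.choice(pop)) for _ in range(pop_size - len(pop))]
    return min(pop, key=fitness)
```

On an image with a clear white corridor between two radicals, the population quickly converges onto a path through the gap; crossover and the paper's guide information would speed this up on realistic characters.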
Corpus linguistics and Natural Language Processing make it necessary to produce and share reference annotations to which linguistic and computational models can be compared. Creating such resources requires a formal framework supporting description of heterogeneous linguistic objects and structures, appropriate representation formats, and adequate manual annotation tools, making it possible to locate, identify and describe linguistic phenomena in textual documents. The Glozz platform addresses all these needs, and provides a highly versatile corpus annotation tool with advanced visualization, querying and evaluation possibilities.
{"title":"The Glozz platform: a corpus annotation and mining tool","authors":"Antoine Widlöcher, Yann Mathet","doi":"10.1145/2361354.2361394","DOIUrl":"https://doi.org/10.1145/2361354.2361394","url":null,"abstract":"Corpus linguistics and Natural Language Processing make it necessary to produce and share reference annotations to which linguistic and computational models can be compared. Creating such resources requires a formal framework supporting description of heterogeneous linguistic objects and structures, appropriate representation formats, and adequate manual annotation tools, making it possible to locate, identify and describe linguistic phenomena in textual documents. The Glozz platform addresses all these needs, and provides a highly versatile corpus annotation tool with advanced visualization, querying and evaluation possibilities.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"21 1","pages":"171-180"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75982826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Museum visitors today can regularly view 500-year-old art by Renaissance masters. Will visitors to museums 500 years in the future be able to see the work of digital artists from the early 21st century? This paper considers the real problem of conserving interactive digital artwork for museum installation in the far distant future by exploring the requirements for creating documentation that will support an artwork's adaptation to future technology. This documentation must survive as long as the artwork itself -- effectively, in perpetuity. A proposal is made for the use of software engineering methodologies as solutions for designing this documentation.
{"title":"500 year documentation","authors":"F. Marchese, Maninder Pal Kaur Shergill","doi":"10.1145/2361354.2361391","DOIUrl":"https://doi.org/10.1145/2361354.2361391","url":null,"abstract":"Museum visitors today can regularly view 500 year old art by Renaissance masters. Will visitors to museums 500 years in the future be able to see the work of digital artists from the early 21st century? This paper considers the real problem of conserving interactive digital artwork for museum installation in the far distant future by exploring the requirements for creating documentation that will support an artwork's adaptation to future technology. In effect, this documentation must survive as long as the artwork itself -- effectively, in perpetuity. A proposal is made for the use of software engineering methodologies as solutions for designing this documentation.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"42 1","pages":"157-160"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80151994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seongchan Kim, Keejun Han, Soon Young Kim, Ying Liu
Tables are ubiquitous in digital libraries and on the Web, used to satisfy many kinds of data delivery and document formatting goals. For example, tables are widely used in scientific documents to present experimental results or statistical data in condensed form. Identifying and organizing tables of different types is necessary for better table understanding and for data sharing and reuse. This paper makes a three-fold contribution: 1) we propose a table functional classification for scientific documents based on the Introduction, Methods, Results, and Discussion (IMRAD) structure; 2) we introduce a fine-grained table taxonomy based on extensive observation and investigation of tables in digital libraries; and 3) we investigate table characteristics and classify tables automatically according to the defined taxonomy. Preliminary experimental results show that our table taxonomy with salient features can significantly improve scientific table classification performance.
{"title":"Scientific table type classification in digital library","authors":"Seongchan Kim, Keejun Han, Soon Young Kim, Ying Liu","doi":"10.1145/2361354.2361384","DOIUrl":"https://doi.org/10.1145/2361354.2361384","url":null,"abstract":"Tables are ubiquitous in digital libraries and on the Web, utilized to satisfy various types of data delivery and document formatting goals. For example, tables are widely used to present experimental results or statistical data in a condensed fashion in scientific documents. Identifying and organizing tables of different types is an absolutely necessary task for better table understanding, and data sharing and reusing. This paper has a three-fold contribution: 1) We propose Introduction, Methods, Results, and Discussion (IMRAD)-based table functional classification for scientific documents; 2) A fine-grained table taxonomy is introduced based on an extensive observation and investigation of tables in digital libraries; and 3) We investigate table characteristics and classify tables automatically based on the defined taxonomy. The preliminary experimental results show that our table taxonomy with salient features can significantly improve scientific table classification performance.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"28 1","pages":"133-136"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91166559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Professional users of medical information often report difficulties when attempting to locate specific information in lengthy documents. Sometimes these difficulties can be attributed to poorly specified section titles which fail to advertise relevant content. In this paper we describe preliminary work on a software plug-in for a document engineering environment that will assist authors when they formulate section-level headings. We describe two different algorithms which can be used to generate section titles. We compare the performance of these algorithms and correlate our experimental results with an evaluation of title quality performed by domain experts.
{"title":"A section title authoring tool for clinical guidelines","authors":"M. Truran, G. Georg, M. Cavazza, Dong Zhou","doi":"10.1145/2361354.2361364","DOIUrl":"https://doi.org/10.1145/2361354.2361364","url":null,"abstract":"Professional users of medical information often report difficulties when attempting to locate specific information in lengthy documents. Sometimes these difficulties can be attributed to poorly specified section titles which fail to advertise relevant content. In this paper we describe preliminary work on a software plug-in for a document engineering environment that will assist authors when they formulate section-level headings. We describe two different algorithms which can be used to generate section titles. We compare the performance of these algorithms and correlate our experimental results with an evaluation of title quality performed by domain experts.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"114 1","pages":"41-44"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77684479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In today's enterprises, most of the data employees produce and consume is content. In this talk we present our approach to creating and deploying content- and document-based applications that improve business processes and user experience.
{"title":"Content and document based approach for digital productivity applications","authors":"Thierry Delprat","doi":"10.1145/2361354.2361372","DOIUrl":"https://doi.org/10.1145/2361354.2361372","url":null,"abstract":"In today's world most of the data produced and consumed by employees is content. In this talk we will present our approach to create and deploy content and document based applications to improve business processes and user experience.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"24 1","pages":"83-84"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80390204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}