Cross-Media Document Linking and Navigation
Ahmed A. O. Tayeh, Payam Ebrahimi, B. Signer
DOI: 10.1145/3209280.3209529

Documents often do not exist in isolation but are implicitly or explicitly linked to parts of other documents. However, due to a multitude of proprietary document formats with rather simple link models, today's possibilities for creating hyperlinks between snippets of information in different document formats are limited. In previous work, we presented a dynamically extensible cross-document link service that overcomes the limitations of the simple link models supported by most existing document formats. Based on a plug-in mechanism, our link service enables linking across different document types. In this paper, we assess the extensibility of our link service by integrating a number of document formats as well as third-party document viewers. We illustrate the flexibility of creating advanced hyperlinks across these document formats and viewers, links that cannot be realised with existing linking solutions or with the link models of existing document formats. A user study further investigates the user experience when creating and navigating cross-document hyperlinks.
Helpfulness Prediction of Online Product Reviews
Md. Enamul Haque, M. E. Tozal, Aminul Islam
DOI: 10.1145/3209280.3229105

The simple question "Was this review helpful to you?" is estimated to generate an additional $2.7B in revenue for Amazon.com annually. In this paper, we propose a solution to the problem of electronic product review accumulation using helpfulness prediction. The popularity of e-commerce sites and online retailers such as Amazon, eBay, Yelp, and TripAdvisor relies largely on the presence of product reviews to attract customers. A major issue with user-submitted reviews is quantifying and evaluating their actual effectiveness across all the reviews of a particular product. Given the varying number and size of reviews for each product, it is quite cumbersome for customers to get hold of their overall helpfulness. We therefore propose a feature extraction technique that can quantify and measure helpfulness for each product based on its user-submitted reviews.
Can Deep Learning Compensate for a Shallow Evaluation?
Gerald Penn
DOI: 10.1145/3209280.3236023

The last ten years have witnessed an enormous increase in the application of "deep learning" methods to both spoken and textual natural language processing. Have they helped? With respect to some well-defined tasks such as language modelling and acoustic modelling, the answer is most certainly affirmative, but those are mere components of the real applications that are driving the increasing interest in our field. In many of these real applications, the answer is, surprisingly, that we cannot be certain, because of the shambolic evaluation standards that have been commonplace (since long before the deep learning renaissance) in the communities that specialized in advancing them. This talk will consider three examples in detail: sentiment analysis, text-to-speech synthesis, and summarization. We will discuss empirical grounding, the use of inferential statistics alongside the usual, more engineering-oriented pattern recognition techniques, and the use of machine learning in the process of conducting an evaluation itself.
Measuring the Centrality of the References in Scientific Papers
Anaïs Ollagnier, S. Fournier, P. Bellot
DOI: 10.1145/3209280.3229104

Citation analysis is one of the major and most popular branches of bibliometrics. It is traditionally based on the assumption that all citations have similar value, and it weights each of them equally. Research fields such as content-based citation analysis (CCA) seek to explain the "how" and "why" of citation behavior. In this paper, we address the "how" by means of a centrality indicator based on factors that are built automatically from the authors' citation behavior. This indicator makes it possible to evaluate the importance of bibliographical references for reading the paper with which the user interacts. Starting from objective quantitative measurements, the factors are computed in order to characterize the level of granularity at which citations are used. By adjusting the centrality indicator's factors, we can highlight citations that tend towards either a partial or a global construction of the authors' discourse. We carry out a pilot study in which we test our approach on a set of papers and discuss the challenges of performing citation analysis in this context. Our results show interesting and consistent correlations between the level of granularity and the significance of citation influences.
Workshop on the Future of Scholarly Publishing
Tamir Hassan
DOI: 10.1145/3209280.3232793

There is currently much discussion and research on topics such as open access, alternative publishing models, semantic publishing, peer review, data sharing, and reproducible science; in short, on how we can bring scholarly publishing in line with modern technologies and expectations. In the past, the document engineering community's participation in defining these future directions has been rather limited, which is surprising, as many document-centric issues related to scientific publishing remain unresolved. Therefore, the main goal of this workshop, which will be held at DocEng 2018, is to stimulate discussion on this topic among experts in the document engineering field and to provide a forum for the exchange of ideas. The second goal of the workshop is more hands-on: for generating the post-proceedings, we will be trialling a new workflow based on some of the technologies discussed. The results will be reported to the DocEng Steering Committee, and recommendations will be made for future conferences.
Automatic Rights Management for Photocopiers
Andreas Girgensohn, L. Wilcox, Qiong Liu
DOI: 10.1145/3209280.3209531

We introduce a system to automatically manage photocopies made from copyrighted printed materials. The system monitors photocopiers to detect the copying of pages from copyrighted publications, and such activity is tallied for billing purposes. Access rights to the materials can be verified to prevent printing. Digital images of the copied pages are checked against a database of copyrighted pages; to preserve privacy when non-copyrighted materials are copied, only digital fingerprints are submitted to the image matching service. A difficulty with such systems is the creation of the database of copyrighted pages. To facilitate this, our system maintains statistics on clusters of similar unknown page images along with their copy sequences. Once such a cluster has grown to a sufficient size, a human inspector can determine whether those page sequences are copyrighted. The system has been tested with hundreds of thousands of pages from conference proceedings and with millions of randomly generated pages. Retrieval accuracy has been around 99%, even with copies of copies or double-page copies.
FormYak
S. Carter, Laurent Denoue, Matthew Cooper, Jennifer Marlow
DOI: 10.1145/3209280.3229108

Historically, people have interacted with companies and institutions through telephone-based dialogue systems and paper-based forms. Now, these interactions are rapidly moving to web- and phone-based chat systems. While converting traditional telephone dialogues to chat is relatively straightforward, converting forms to conversational interfaces can be challenging. In this work, we introduce methods and interfaces to enable the conversion of PDF and web-based documents that solicit user input into chat-based dialogues. Document data is first extracted to associate fields and their textual descriptions using metadata and lightweight visual analysis. The field labels, their spatial layout, and the associated text are further analyzed to group related fields into natural conversational units. These correspond to questions presented to users in chat interfaces to solicit the information needed to complete the original documents and the downstream processes they support. This user-supplied data can be inserted into the source documents and/or into downstream databases. User studies of our tool show that it streamlines form-to-chat conversion and produces conversational dialogues of at least the same quality as a purely manual approach.
Evoq
Antoine Clarinval, Isabelle Linden, Anne Wallemacq, Bruno Dumas
DOI: 10.1145/3209280.3209533

Structural analysis is a text analysis technique that helps uncover the association and opposition relationships between the terms of a text. It is used in particular in the humanities and social sciences. The technique is usually applied by hand, with pen and paper as support. However, as any combination of words in the raw text may be considered an association or opposition relationship, applying the technique by hand in a readable way can quickly prove overwhelming for the analyst. In this paper, we propose Evoq, an application that supports structural analysts in their work. Furthermore, we present interactive visualizations representing the relationships between terms. These visualizations help create alternative representations of a text, as advocated by structural analysts. We conducted two usability evaluations, which showed great potential for Evoq as a structural analysis support tool and for the use of alternative representations of texts in the analysis.
Hash-Grams: Faster N-Gram Features for Classification and Malware Detection
Edward Raff, Charles K. Nicholas
DOI: 10.1145/3209280.3229085

N-grams have long been used as features for classification problems, and their distribution often allows selection of the top-k occurring n-grams as a reliable first pass at feature selection. However, this top-k selection can be a performance bottleneck, especially when dealing with massive item sets and corpora. In this work we introduce Hash-Grams, an approach to performing top-k feature mining for classification problems. We show that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms. Using a malware corpus over 2 TB in size, we show how Hash-Grams retain comparable classification accuracy while dramatically reducing computational requirements.
A Handwritten Japanese Historical Kana Reprint Support System: Development of a Graphical User Interface
Atsushi Yamazaki, Kazuki Sando, Tetsuya Suzuki, A. Aiba
DOI: 10.1145/3209280.3229117

Reprinting Japanese historical manuscripts is time-consuming and requires training, because the manuscripts are handwritten and may contain characters different from those in current use. In previous work, we proposed a framework for assisting the human process of reading Japanese historical manuscripts and implemented part of a system based on this framework as a Web service. In this paper, we present a graphical user interface (GUI) for the system and describe the reprint process through the GUI. We conducted a user test to evaluate the system with the GUI by means of a questionnaire. From the results of the experiment, we confirmed that the GUI can be used intuitively, but we also found points in the GUI that should be improved.