Recommending Colors and Fonts for Cover Page of EPUB Book. Haruka Kawaguchi, Nobutaka Suzuki. DOI: 10.1145/3209280.3229086
Suppose that you write a text, or find an interesting text on the Web, and want to create an e-book from it. When creating an e-book from such a text file, you have to create a cover page for it. However, existing conversion services and tools cannot automatically produce a cover page that reflects the impression of the text. In this paper, to support users in creating "good" cover pages for such texts, we propose a method for recommending colors and fonts for the cover pages of given texts or cover-less EPUB books. In our method, colors and fonts are selected so that they reflect the impression of the contents of the given text or EPUB book.

Improving Short Text Clustering by Similarity Matrix Sparsification. Md. Rashadul Hasan Rakib, Magdalena Jankowska, N. Zeh, E. Milios. DOI: 10.1145/3209280.3229114
Short text clustering is an important but challenging task. We investigate the impact of similarity matrix sparsification on the performance of short text clustering. We show that two sparsification methods (the proposed Similarity Distribution based method, and k-nearest neighbors), which aim to retain a prescribed number of similarity elements per text, improve the hierarchical clustering quality of short texts for various text similarities. Combined with a word-embedding-based similarity, these methods yield results competitive with state-of-the-art methods for short text clustering, especially in the general domain, and are faster than the main state-of-the-art baseline.

Automatic Term Extraction in Technical Domain using Part-of-Speech and Common-Word Features. N. Simon, Vlado Keselj. DOI: 10.1145/3209280.3229100
Extracting key terms from technical documents allows us to write effective documentation that is specific and clear, with minimal ambiguity and confusion caused by nearly synonymous but distinct terms. For instance, to avoid confusion, the same object should not be referred to by two different names (e.g., an object called a "hydraulic oil filter" should always be referred to by that name). In the modern world of commerce, clear terminology is the hallmark of successful RFPs (Requests for Proposal) and is therefore key to the growth of competitive organizations. While Automatic Term Extraction (ATE) is a well-developed area of study, its applications in the technical domain have been sparse and constrained to certain narrow areas such as biomedical research. We present an ATE method for the technical domain based on part-of-speech features and common-word information. The method is evaluated on a C programming language reference manual as well as a manual of aircraft maintenance guidelines, and shows results comparable to or better than reported state-of-the-art results.

Fashioning a Search Engine to Support Humanities Research. Frank Wm. Tompa. DOI: 10.1145/3209280.3209520
Scholarship in the humanities often requires the ability to search curated electronic corpora and to display search results in a variety of formats. Challenges that need to be addressed include transforming the texts into a suitable form, typically XML, and catering to the scholars' search and display needs. We describe our experience in creating such a search and display facility.

Text Mining and Recommender Systems for Predictive Policing. Isabelle Percy, A. Balinsky, H. Balinsky, S. Simske. DOI: 10.1145/3209280.3229112
We present results from a joint project between HP Labs, Cardiff University and Dyfed Powys Police on predictive policing. We demonstrate applications of various techniques from recommender systems and text mining to the problem of crime pattern recognition. Our main idea is to treat crime records for different regions and time periods as a corpus of text documents whose words are crime types. We apply tools from NLP and text document classification to analyse different regions in time and space, and we evaluate the performance of several text similarity measures and document clustering algorithms.

Exploring an AR-based User Interface for Authoring Multimedia Presentations. P. Mendes, R. Azevedo, Ruy Guilherme Silva Gomes de Oliveira, Carlos de Salles Soares Neto. DOI: 10.1145/3209280.3209534
This paper describes the BumbAR approach for composing multimedia presentations and evaluates it through a qualitative study based on the Technology Acceptance Model (TAM). The BumbAR proposal is based on the event-condition-action model of the Nested Context Model (NCM) and explores the use of augmented reality and real-world objects (markers) as an innovative user interface for specifying the behavior of, and relationships between, the media objects in a presentation. The qualitative study aimed at measuring users' attitudes towards using BumbAR and an augmented reality environment for authoring multimedia presentations. The results show that the participants found the BumbAR approach both useful and easy to use, and most of them (66.67%) found the system more convenient than traditional desktop-based authoring tools.

Semantic Interoperability for Electronic Business through a Novel Cross-Context Semantic Document Exchange Approach. Shuo Yang, Ran Wei, A. Shigarov. DOI: 10.1145/3209280.3209523
The e-marketplace is a common venue where entities situated in different contexts conduct business electronically. Since sellers and buyers may be located in areas with different languages, customs and even business standards, business documents may be edited and parsed heterogeneously in different contexts. So far, however, no satisfactory approach has been implemented for transferring a document from one context to another without generating ambiguity, and disputes may arise from differing interpretations of the same document. It is therefore important to guarantee consistent understanding across contexts. This paper proposes the Tabdoc approach, a cross-context semantic document exchange approach and a novel strategy for implementing semantic interoperability. It guarantees consistent understanding of business documents and enables automatic cross-context document processing. Experimental results demonstrate promising performance improvements over state-of-the-art methods.

A Market Analytics Approach to Restaurant Review Data. Olga Tsubiks, Vlado Keselj. DOI: 10.1145/3209280.3209524
We present a novel marketing method for consumer trend detection from online user-generated content, motivated by a gap identified in the market research literature. Existing approaches to trend analysis are generally based on the rating of trends by industry experts through survey questionnaires, interviews, or similar instruments. These methods have proved inherently costly and often suffer from bias. Our approach is based on information extraction techniques for identifying trends in large aggregations of social media data. It is a cost-effective method that reduces the possibility of errors associated with the design of the sample and the research instrument. The effectiveness of the approach is demonstrated in an experiment performed on restaurant review data. The accuracy of the results is at the level of current approaches in both information extraction and market research.

Towards a Universally Editable Portable Document Format. Tamir Hassan. DOI: 10.1145/3209280.3229083
PDF is the established format for the exchange of final-form print-oriented documents on the Web, and for a good reason: it is the only format that guarantees the preservation of layout across different platforms, systems and viewing devices. Its main disadvantage, however, is that a document, once converted to PDF, is very difficult to edit. As of today (2018), there is still no universal format for the exchange of editable formatted text documents on the Web; users can only exchange the application's source files, which do not benefit from the robustness and portability of PDF. This position paper describes how we can engineer such an editable format based on some of the principles of PDF. We begin by analysing the current status quo, and provide a summary of current approaches for editing existing PDFs, other relevant document formats, and ways to embed the document's structure into the PDF itself. We then ask ourselves what it really means for a formatted document to be editable, and discuss the related problem of enabling WYSIWYG direct manipulation even in cases where layout is usually computed or optimized using offline or batch methods (as is common with long-form documents). After defining our goals, we propose a framework for creating such editable portable documents and present a prototype tool that demonstrates our initial steps and serves as a proof of concept. We conclude by providing a roadmap for future work.

Vectorisation of Sketches with Shadows and Shading using COSFIRE filters. Alexandra Bonnici, Dorian Bugeja, G. Azzopardi. DOI: 10.1145/3209280.3209525
Engineering design makes use of freehand sketches to communicate ideas, allowing designers to externalise form concepts quickly and naturally. Such sketches serve as working documents that demonstrate the evolution of the design process. For the product design to progress, however, these sketches are often redrawn using computer-aided design tools to obtain virtual, interactive prototypes of the design. Although there are commercial software packages that extract the required information from freehand sketches, such packages typically do not handle the full complexity of sketched drawings, particularly the visual cues introduced to help the human observer interpret the sketch. In this paper, we tackle one such complexity, namely the use of shading and shadows, which help portray spatial and depth information in the sketch. We propose a vectorisation algorithm based on trainable COSFIRE filters for detecting junction points and subsequently tracing line paths to create a topology graph as a representation of the sketched object form. The vectorisation algorithm is evaluated on 17 sketches containing different shading patterns, drawn by different sketchers specifically for this work. Using these sketches, we show that the vectorisation algorithm can handle drawings with straight or curved contours containing shadow cues, reducing the salient-point error in junction-point location by 91% relative to the off-the-shelf Harris-Stephens corner detector, while the overall vectorial representations of the sketches achieved an average F-score of 0.92 against the ground truth. The results demonstrate the effectiveness of the proposed approach.
