
Proceedings of the ACM Symposium on Document Engineering 2018: Latest Papers

Cross-Media Document Linking and Navigation
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3209529
Ahmed A. O. Tayeh, Payam Ebrahimi, B. Signer
Documents often do not exist in isolation but are implicitly or explicitly linked to parts of other documents. However, due to a multitude of proprietary document formats with rather simple link models, today's possibilities for creating hyperlinks between snippets of information in different document formats are limited. In previous work, we have presented a dynamically extensible cross-document link service that overcomes the limitations of the simple link models supported by most existing document formats. Based on a plug-in mechanism, our link service enables linking across different document types. In this paper, we assess the extensibility of our link service by integrating some document formats as well as third-party document viewers. We illustrate the flexibility of creating advanced hyperlinks across these document formats and viewers that cannot be realised with existing linking solutions or the link models of existing document formats. A user study further investigates the user experience when creating and navigating cross-document hyperlinks.
Citations: 6
Helpfulness Prediction of Online Product Reviews
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229105
Md. Enamul Haque, M. E. Tozal, Aminul Islam
The simple question "Was this review helpful to you?" adds an estimated $2.7B in revenue to Amazon.com annually. In this paper, we propose a solution to the problem of electronic product review accumulation using helpfulness prediction. The popularity of e-commerce and online retailers such as Amazon, eBay, Yelp, and TripAdvisor relies largely on the presence of product reviews to attract more customers. The major issue with user-submitted reviews is quantifying and evaluating their actual effectiveness by combining all the reviews under a particular product. With the varying size of reviews for each product, it is quite cumbersome for customers to get a sense of the overall helpfulness. Therefore, we propose a feature extraction technique that can quantify and measure helpfulness for each product based on user-submitted reviews.
Citations: 21
Can Deep Learning Compensate for a Shallow Evaluation?
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3236023
Gerald Penn
The last ten years have witnessed an enormous increase in the application of "deep learning" methods to both spoken and textual natural language processing. Have they helped? With respect to some well-defined tasks such as language modelling and acoustic modelling, the answer is most certainly affirmative, but those are mere components of the real applications that are driving the increasing interest in our field. In many of these real applications, the answer is surprisingly that we cannot be certain because of the shambolic evaluation standards that have been commonplace --- long before the deep learning renaissance --- in the communities that specialized in advancing them. This talk will consider three examples in detail: sentiment analysis, text-to-speech synthesis, and summarization. We will discuss empirical grounding, the use of inferential statistics alongside the usual, more engineering-oriented pattern recognition techniques, and the use of machine learning in the process of conducting an evaluation itself.
Citations: 0
Measuring the Centrality of the References in Scientific Papers
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229104
Anaïs Ollagnier, S. Fournier, P. Bellot
Citation analysis is considered one of the major and most popular branches of bibliometrics. It is based on the assumption that all citations have similar values, and it weights each of them equally. Specific research fields such as content-based citation analysis (CCA) seek to explain the "how" and "why" of citation behavior. In this paper we attempt to explain the "how" using a centrality indicator based on factors that are built automatically from the authors' citation behavior. This indicator makes it possible to evaluate the importance of bibliographical references for reading the paper with which the user interacts. From objective quantitative measurements, factors are computed in order to characterize the level of granularity at which citations are used. By setting the factors of the centrality indicator, we can highlight citations that tend towards a partial or a global construction of the authors' discourse. We carry out a pilot study in which we test our approach on some papers and discuss the challenges of carrying out citation analysis in this context. Our results show interesting and consistent correlations between the level of granularity and the significance of citation influences.
Citations: 2
Workshop on the Future of Scholarly Publishing
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3232793
Tamir Hassan
There is currently much discussion and research on topics such as open access, alternative publishing models, semantic publishing, peer review, data sharing, reproducible science, etc.; in short, how we can bring scholarly publishing in line with modern technologies and expectations. In the past, the document engineering community's participation in defining the future directions has been rather limited, which is surprising, as many document-centric issues related to scientific publishing remain unresolved. Therefore, the main goal of this workshop, which will be held at DocEng 2018, is to stimulate discussion on this topic among experts in the document engineering field and provide a forum for the exchange of ideas. The second goal of the workshop is more hands-on: for generating the Post-Proceedings, we will be trialling a new workflow, which is based on some of the technologies discussed. The results will be reported to the DocEng Steering Committee and recommendations will be made for future conferences.
Citations: 0
Automatic Rights Management for Photocopiers
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3209531
Andreas Girgensohn, L. Wilcox, Qiong Liu
We introduce a system to automatically manage photocopies made from copyrighted printed materials. The system monitors photocopiers to detect the copying of pages from copyrighted publications. Such activity is tallied for billing purposes. Access rights to the materials can be verified to prevent printing. Digital images of the copied pages are checked against a database of copyrighted pages. To preserve the privacy of the copying of non-copyrighted materials, only digital fingerprints are submitted to the image matching service. A problem with such systems is the creation of the database of copyrighted pages. To facilitate this, our system maintains statistics of clusters of similar unknown page images along with the copy sequence. Once such a cluster has grown to a sufficient size, a human inspector can determine whether those page sequences are copyrighted. The system has been tested with hundreds of thousands of pages from conference proceedings and with millions of randomly generated pages. Retrieval accuracy has been around 99% even with copies of copies or double-page copies.
Citations: 0
FormYak
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229108
S. Carter, Laurent Denoue, Matthew Cooper, Jennifer Marlow
Historically, people have interacted with companies and institutions through telephone-based dialogue systems and paper-based forms. Now, these interactions are rapidly moving to web- and phone-based chat systems. While converting traditional telephone dialogues to chat is relatively straightforward, converting forms to conversational interfaces can be challenging. In this work, we introduce methods and interfaces to enable the conversion of PDF and web-based documents that solicit user input into chat-based dialogues. Document data is first extracted to associate fields and their textual descriptions using metadata and lightweight visual analysis. The field labels, their spatial layout, and associated text are further analyzed to group related fields into natural conversational units. These correspond to questions presented to users in chat interfaces to solicit information needed to complete the original documents and the downstream processes they support. This user-supplied data can be inserted into the source documents and/or downstream databases. User studies of our tool show that it streamlines form-to-chat conversion and produces conversational dialogues of at least the same quality as a purely manual approach.
Citations: 1
Evoq
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3209533
Antoine Clarinval, Isabelle Linden, Anne Wallemacq, Bruno Dumas
Structural analysis is a text analysis technique that helps uncover the association and opposition relationships between the terms of a text. It is used in particular in the humanities and social sciences. This technique is usually applied by hand, with pen and paper as support. However, as any combination of words in the raw text may be considered an association or opposition relationship, applying the technique by hand in a readable way can quickly prove overwhelming for the analyst. In this paper, we propose Evoq, an application that supports structural analysts in their work. Furthermore, we present interactive visualizations representing the relationships between terms. These visualizations help create alternative representations of text, as advocated by structural analysts. We conducted two usability evaluations that showed great potential for Evoq as a structural analysis support tool and for the use of alternative representations of texts in the analysis.
Citations: 6
Hash-Grams: Faster N-Gram Features for Classification and Malware Detection
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229085
Edward Raff, Charles K. Nicholas
N-grams have long been used as features for classification problems, and their distribution often allows selection of the top-k occurring n-grams as a reliable first-pass to feature selection. However, this top-k selection can be a performance bottleneck, especially when dealing with massive item sets and corpora. In this work we introduce Hash-Grams, an approach to perform top-k feature mining for classification problems. We show that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms. Using a malware corpus of over 2 TB in size, we show how Hash-Grams retain comparable classification accuracy, while dramatically reducing computational requirements.
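The abstract above does not spell out the algorithm, so the following is only a minimal sketch of the general idea of hashing n-grams into a fixed-size counter table and keeping the most frequent buckets, rather than counting every distinct n-gram exactly. It is not the authors' implementation. The function names (`byte_ngrams`, `bucket_of`, `hashed_topk`, `featurize`), the use of CRC32 as the hash, and the default table size are all assumptions made for illustration.

```python
import heapq
import zlib


def byte_ngrams(data: bytes, n: int):
    """Yield every contiguous n-byte window of data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def bucket_of(gram: bytes, num_buckets: int) -> int:
    """Map an n-gram to a bucket index (CRC32 is an illustrative choice)."""
    return zlib.crc32(gram) % num_buckets


def hashed_topk(docs, n=4, num_buckets=2**16, k=10):
    """Count n-grams per hash bucket and return up to k busiest buckets.

    Memory is fixed by num_buckets, not by the number of distinct n-grams.
    Distinct n-grams that collide share one counter, so the result is
    approximate.
    """
    counts = [0] * num_buckets
    for doc in docs:
        for gram in byte_ngrams(doc, n):
            counts[bucket_of(gram, num_buckets)] += 1
    top = heapq.nlargest(k, range(num_buckets), key=counts.__getitem__)
    return [b for b in top if counts[b] > 0]


def featurize(doc, top_buckets, n=4, num_buckets=2**16):
    """Binary feature vector: 1 if the document hits a selected bucket."""
    position = {b: i for i, b in enumerate(top_buckets)}
    vec = [0] * len(top_buckets)
    for gram in byte_ngrams(doc, n):
        i = position.get(bucket_of(gram, num_buckets))
        if i is not None:
            vec[i] = 1
    return vec


# Usage: select the single busiest 2-gram bucket, then featurize a new input.
docs = [b"abababab", b"abab"]
top = hashed_topk(docs, n=2, k=1)
print(featurize(b"xxab", top, n=2))
```

The trade-off in this sketch is that a larger `num_buckets` reduces collisions at the cost of memory, while the counting pass stays a single linear scan over the corpus.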
Citations: 12
A Handwritten Japanese Historical Kana Reprint Support System: Development of a Graphical User Interface
Pub Date : 2018-08-28 DOI: 10.1145/3209280.3229117
Atsushi Yamazaki, Kazuki Sando, Tetsuya Suzuki, A. Aiba
Reprinting Japanese historical manuscripts is time-consuming and requires training because they are handwritten and may contain characters different from those currently used. We proposed a framework for assisting the human process of reading Japanese historical manuscripts and implemented part of a system based on the framework as a Web service. In this paper, we present a graphical user interface (GUI) for the system and the reprint process through the GUI. We conducted a user test to evaluate the system with the GUI using a questionnaire. From the results of the experiment, we confirmed that the GUI can be used intuitively, but we also found points to be improved in the GUI.
Citations: 2