
Latest publications: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)

An example of automatic authority control
Pub Date : 2016-06-19 DOI: 10.1145/2910896.2925458
A. Knyazeva, O. Kolobov, I. Turchanovsky
The automatic authority control problem is considered. One possible solution is to apply record linkage between authority and bibliographic records. The main aim of this paper is to determine which concepts and methods are most useful for dealing with our data. An approach based on a machine learning method (classification) is considered, and a comparative study of different distance measures and feature sets is made. The study was carried out on data from several Russian libraries. The data are in the RUSMARC format, a variant of UNIMARC that is popular in Russia.
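To make the classification framing concrete, here is a minimal sketch in Python: candidate authority/bibliographic record pairs are turned into string-distance features and fed to a binary classifier. The feature set, the toy name pairs, and the classifier choice are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch: record linkage between authority and bibliographic
# records framed as binary classification over string-distance features.
# The fields, toy pairs, and labels below are illustrative assumptions.
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(auth_name: str, bib_name: str) -> list[float]:
    """Distance/similarity features for one candidate record pair."""
    a, b = auth_name.lower(), bib_name.lower()
    ratio = SequenceMatcher(None, a, b).ratio()        # edit-based similarity
    tok_a, tok_b = set(a.split()), set(b.split())
    jaccard = len(tok_a & tok_b) / len(tok_a | tok_b)  # token overlap
    prefix = float(a[:4] == b[:4])                     # crude surname-prefix match
    return [ratio, jaccard, prefix]

# Toy labeled pairs: (authority heading, bibliographic author, is_same_person)
pairs = [
    ("Knyazeva, A. A.", "Knyazeva, A.", 1),
    ("Knyazeva, A. A.", "Kolobov, O. S.", 0),
    ("Turchanovsky, I. Y.", "Turchanovskii, I.", 1),
    ("Turchanovsky, I. Y.", "Wu, D.", 0),
]
X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

clf = LogisticRegression().fit(X, y)
print(clf.predict([features("Kolobov, O. S.", "Kolobov, O.")]))  # -> [1], likely a match
```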
Citations: 1
Evaluating link-based recommendations for Wikipedia
Pub Date : 2016-06-19 DOI: 10.1145/2910896.2910908
M. Schwarzer, M. Schubotz, Norman Meuschke, Corinna Breitinger, V. Markl, Bela Gipp
Literature recommender systems support users in filtering the vast and increasing number of documents in digital libraries and on the Web. For academic literature, research has proven the ability of citation-based document similarity measures, such as Co-Citation (CoCit) or Co-Citation Proximity Analysis (CPA), to improve recommendation quality. In this paper, we report on the first large-scale investigation of the performance of the CPA approach in generating literature recommendations for Wikipedia, which is fundamentally different from the academic literature domain. We analyze links instead of citations to generate article recommendations. We evaluate CPA, CoCit, and the Apache Lucene MoreLikeThis (MLT) function, which represents a traditional text-based similarity measure. We use two datasets of 779,716 and 2.57 million Wikipedia articles, the Big Data processing framework Apache Flink, and a ten-node computing cluster. To enable our large-scale evaluation, we derive two quasi-gold standards from the links in Wikipedia's "See also" sections and a comprehensive Wikipedia clickstream dataset. Our results show that the citation-based measures CPA and CoCit have complementary strengths compared to the text-based MLT measure. While MLT performs well in identifying narrowly similar articles that share similar words and structure, the citation-based measures are better able to identify topically related information, such as information on the city of a certain university or on other technical universities in the region. The CPA approach, which consistently outperformed CoCit, is better suited for identifying a broader spectrum of related articles, as well as popular articles that typically exhibit higher quality. Additional benefits of the CPA approach are its lower runtime requirements and its language independence, which allows for cross-language retrieval of articles. We present a manual analysis of exemplary articles to demonstrate and discuss our findings. The raw data and source code of our study, together with a manual on how to use them, are openly available at: https://github.com/wikimedia/citolytics.
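The core difference between CoCit and CPA can be shown in a short sketch: CoCit counts how often two pages are linked from the same article, while CPA additionally weights each co-occurrence by how close the two links appear in the text. The sample articles below are invented, and the 1/distance weight is one common CPA variant, not necessarily the exact formula the authors used.

```python
# Minimal sketch of link-based CoCit vs. CPA scoring over Wikipedia-style
# articles. Each article is reduced to the ordered list of its outgoing
# links; the sample data is invented for illustration.
from collections import defaultdict
from itertools import combinations

# article -> ordered sequence of wiki-links as they appear in the text
articles = {
    "Claude Shannon":     ["Information theory", "MIT", "Boolean algebra"],
    "Information theory": ["Entropy", "Claude Shannon", "Coding theory"],
    "Alan Turing":        ["Computability", "MIT", "Boolean algebra", "Enigma"],
}

cocit = defaultdict(float)  # +1 whenever two pages are linked from the same article
cpa = defaultdict(float)    # weighted by proximity: closer links count more

for links in articles.values():
    for (i, a), (j, b) in combinations(enumerate(links), 2):
        pair = tuple(sorted((a, b)))
        cocit[pair] += 1.0
        cpa[pair] += 1.0 / abs(i - j)  # co-citation proximity weight

# Pages whose links sit side by side get a higher CPA score:
print(sorted(cpa.items(), key=lambda kv: -kv[1])[:3])
```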
Citations: 37
Research on the follow-up actions of college students' mobile search
Pub Date : 2016-06-19 DOI: 10.1145/2910896.2910921
Dan Wu, Shaobo Liang
This paper focuses on the follow-up actions triggered by college students' mobile searches, based on an uncontrolled experiment in which 30 participants were observed over fifteen days. We collected mobile phone usage data with an app called AWARE and combined it with structured diaries and interviews to perform a quantitative and qualitative study. The results showed that there were three categories of follow-up actions, and that the majority of these actions occurred within one hour after the initial search session. We also found that participants often conducted follow-up actions with different apps, and that certain information needs triggered more follow-up actions. Finally, we discuss the characteristics and causes of these actions and outline further studies, including comparing the follow-up actions triggered by mobile search with those of Web search, and building a model of follow-up actions.
Citations: 1
Knowledge curation discussions and activity dynamics in a short lived social Q&A community
Pub Date : 2016-06-19 DOI: 10.1145/2910896.2925432
Hengyi Fu, Besiki Stvilia
Studying the dynamics and lifecycles of online knowledge curation communities is essential for identifying and assembling community-type-specific repertoires of strategies, rules, and actions for community design, governance, content creation, and curation. This paper examines the lifecycle of a short-lived social Q&A community on Stack Exchange through a content analysis of the logs of member discussions and content curation actions.
Citations: 3
Big data processing of school shooting archives
Pub Date : 2016-06-19 DOI: 10.1145/2910896.2925466
M. Farag, P. Nakate, E. Fox
Web archives about school shootings consist of webpages that may or may not be relevant to the events of interest. This work has three main goals. The first is to clean the webpages, which involves removing stop words and non-relevant parts of each page. The second is to select only the webpages relevant to the events of interest. The third is to upload the cleaned, relevant webpages to Apache Solr so that they are easily accessible. We show the details of all the steps required to achieve these goals. The results show that representative Web archives are noisy, with only 2%-40% relevant content. By cleaning the archives, we help researchers focus on relevant content in their analysis.
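To illustrate the final step, here is a minimal sketch that posts cleaned documents to Apache Solr's JSON document update endpoint. The local URL, the core name shooting_archive, and the document fields are assumptions made for illustration, not the authors' actual configuration.

```python
# Minimal sketch: push cleaned, relevant webpages into Apache Solr via
# its JSON document update endpoint. The Solr URL, core name, and
# document fields are illustrative assumptions.
import requests

SOLR_UPDATE = "http://localhost:8983/solr/shooting_archive/update/json/docs"

cleaned_docs = [
    {"id": "page-001", "url": "http://example.org/report", "text": "cleaned body text ..."},
    {"id": "page-002", "url": "http://example.org/timeline", "text": "cleaned body text ..."},
]

for doc in cleaned_docs:
    resp = requests.post(
        SOLR_UPDATE,
        json=doc,
        params={"commit": "true"},  # in real use, commit once per batch, not per doc
    )
    resp.raise_for_status()
print("indexed", len(cleaned_docs), "documents")
```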
Citations: 2
Information extraction for scholarly digital libraries
Pub Date : 2016-06-19 DOI: 10.1145/2910896.2925430
Kyle Williams, Jian Wu, Zhaohui Wu, C. Lee Giles
Scholarly documents contain many data entities, such as titles, authors, affiliations, figures, and tables. These entities can be used to enhance digital library services through enhanced metadata and enable the development of new services and tools for interacting with and exploring scholarly data. However, in a world of scholarly big data, extracting these entities in a scalable, efficient and accurate manner can be challenging. In this tutorial, we introduce the broad field of information extraction for scholarly digital libraries. Drawing on our experience in running the CiteSeerX digital library, which has performed information extraction on over 7 million academic documents, we argue for the need for automatic information extraction, describe different approaches for performing information extraction, present tools and datasets that are readily available, and describe best practices and areas of research interest.
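As a flavor of the extraction tasks covered, here is a toy sketch that guesses the title and author list from the plain text of a paper's first page. Production systems such as CiteSeerX rely on trained extractors rather than hand rules; the heuristics below are purely illustrative.

```python
# Toy sketch of scholarly header extraction: guess title and authors
# from the first lines of a paper's extracted plain text. Real digital
# libraries use trained sequence models; these rules are illustrative only.
import re

first_page = """\
Information extraction for scholarly digital libraries
Kyle Williams, Jian Wu, Zhaohui Wu, C. Lee Giles
Pennsylvania State University
Abstract. Scholarly documents contain many data entities ...
"""

lines = [ln.strip() for ln in first_page.splitlines() if ln.strip()]
title = lines[0]  # heuristic: first non-empty line is the title

# heuristic: the authors line is a comma-separated list of capitalized names
author_line = next(
    (ln for ln in lines[1:]
     if re.fullmatch(r"([A-Z][\w.\-]*( [A-Z][\w.\-]*)*, )+[A-Z][\w.\- ]*", ln)),
    "",
)
authors = [a.strip() for a in author_line.split(",") if a.strip()]

print({"title": title, "authors": authors})
```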
Citations: 4
The energy of delusion: The New York Art Resources Consortium (NYARC) & the digital
Pub Date : 2016-06-19 DOI: 10.1145/2910896.2926742
Stephen J. Bury
Museum libraries came late to the digitization party, primarily because of perceived copyright issues. Since 2010 the three libraries of the New York Art Resources Consortium (NYARC) have embarked on a series of niche, boutique digitization projects that push the boundaries of fair use. They have also embraced the born-digital, establishing a program to capture art-history-rich websites and to provide access to them through an innovative discovery layer that prioritizes web resources in the ranking of results.
Citations: 0
Real-time filtering on interest profiles in Twitter stream
Pub Date : 2016-06-19 DOI: 10.1145/2910896.2925462
Yue Fei, Chao Lv, Yansong Feng, Dongyan Zhao
The advent of Twitter has led to a ubiquitous information overload problem, with a dramatic increase in the number of tweets a user is exposed to. In this paper, we consider real-time tweet filtering with respect to users' interest profiles in the public Twitter stream. While traditional filtering methods mainly focus on judging the relevance of a document, we aim to retrieve relevant and novel documents to address the high redundancy of tweets. An unsupervised approach is proposed to adaptively model the relevance between tweets and different profiles, and a neural network language model is employed to learn semantic representations for tweets. Experiments on the TREC 2015 dataset demonstrate the effectiveness of the proposed approach.
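A minimal sketch of the relevance-plus-novelty filtering idea follows. The embedding function is a self-contained stub (hash-seeded random token vectors) standing in for the paper's neural language model, and the thresholds are assumed values chosen for illustration.

```python
# Minimal sketch of relevance + novelty filtering for a tweet stream.
# Embeddings are stubbed with deterministic random token vectors so the
# sketch stays self-contained; a real system would use a trained model.
import zlib
import numpy as np

DIM = 64

def embed(text: str) -> np.ndarray:
    """Stub embedding: average of per-token pseudo-random unit vectors."""
    vecs = []
    for tok in text.lower().split():
        rng = np.random.default_rng(zlib.crc32(tok.encode()))  # stable per token
        vecs.append(rng.standard_normal(DIM))
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)

profile = embed("digital libraries information retrieval")
REL_MIN, NOV_MAX = 0.2, 0.9   # thresholds: assumed values for illustration
pushed: list[np.ndarray] = [] # embeddings of tweets already delivered

for tweet in ["new paper on digital library retrieval",
              "new paper on digital library retrieval systems",  # near-duplicate
              "what I had for lunch today"]:
    v = embed(tweet)
    relevant = cos(v, profile) >= REL_MIN
    novel = all(cos(v, p) < NOV_MAX for p in pushed)  # reject redundant tweets
    if relevant and novel:
        pushed.append(v)
        print("PUSH:", tweet)
```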
Citations: 1
Glyph miner: A system for efficiently extracting glyphs from early prints in the context of OCR
Pub Date : 2016-06-19 DOI: 10.1145/2910896.2910915
B. Budig, Thomas C. van Dijk, F. Kirchner
While off-the-shelf OCR systems work well on many modern documents, the heterogeneity of early prints poses a significant challenge. To achieve good recognition quality, existing software must be "trained" specifically for each particular corpus. This is a tedious process that involves significant user effort. In this paper we demonstrate a system that generically replaces a common part of the training pipeline with a more efficient workflow: given a set of scanned pages of a historical document, our system uses an efficient user interaction to semi-automatically extract large numbers of occurrences of glyphs indicated by the user. In a preliminary case study, we evaluate the effectiveness of our approach by embedding our system into the workflow at the University Library Würzburg.
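The candidate-detection step can be sketched with off-the-shelf template matching: given one user-marked glyph crop, find visually similar regions on a page. The file names and match threshold below are assumptions, and the actual Glyph Miner pipeline is more involved.

```python
# Minimal sketch: find occurrences of a glyph on a scanned page via
# normalized template matching. File names and the threshold are
# illustrative assumptions; the real Glyph Miner workflow differs.
import cv2
import numpy as np

page = cv2.imread("scan_page.png", cv2.IMREAD_GRAYSCALE)   # assumed input scan
glyph = cv2.imread("glyph_e.png", cv2.IMREAD_GRAYSCALE)    # user-marked template

scores = cv2.matchTemplate(page, glyph, cv2.TM_CCOEFF_NORMED)
ys, xs = np.where(scores >= 0.8)  # keep strong matches only

h, w = glyph.shape
occurrences = [(int(x), int(y), w, h) for x, y in zip(xs, ys)]
print(f"found {len(occurrences)} candidate occurrences")
# In an interactive workflow the user confirms or rejects these candidates,
# and confirmed crops become training samples for the OCR engine.
```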
Citations: 6
How to identify specialized research communities related to a researcher's changing interests
Pub Date : 2016-06-19 DOI: 10.1145/2910896.2925450
Hamed Alhoori
Scholarly events and venues are increasing rapidly in number. This poses a challenge for researchers who seek to identify events and venues related to their work in order to draw more efficiently and comprehensively from published research and to share their own findings more effectively. Such efforts are also hampered by the fact that no rating system yet exists to assist researchers in culling the venues most relevant to their current readings and interests. This study describes a methodology we developed in response to this need, one that recommends scholarly venues related to researchers' specific interests according to personalized social web indicators. Our experiments with the proposed rating and recommendation method show that it outperforms the baseline venue recommendations in terms of accuracy and ranking quality.
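As a toy illustration of interest-based venue ranking (not the authors' actual method, which relies on personalized social web indicators), one can score each venue by the overlap between a researcher's current interest keywords and keywords attached to the venue. All data below is invented.

```python
# Toy sketch of interest-based venue ranking: score each venue by the
# overlap between a researcher's current interest keywords and keywords
# attached to the venue (e.g., from social bookmarks). All data invented.
interests = {"digital libraries", "recommender systems", "altmetrics"}

venue_tags = {
    "JCDL":   {"digital libraries", "metadata", "web archiving"},
    "RecSys": {"recommender systems", "collaborative filtering"},
    "CHI":    {"hci", "user studies"},
}

def score(tags: set[str]) -> float:
    """Jaccard overlap between interests and venue keywords."""
    return len(interests & tags) / len(interests | tags)

ranking = sorted(venue_tags, key=lambda v: score(venue_tags[v]), reverse=True)
print(ranking)  # -> ['RecSys', 'JCDL', 'CHI'] for this toy data
```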
Citations: 6