Title: An example of automatic authority control
Authors: A. Knyazeva, O. Kolobov, I. Turchanovsky
Venue: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)
DOI: https://doi.org/10.1145/2910896.2925458

We consider the problem of automatic authority control. One possible solution is to apply a record linkage approach to authority and bibliographic records. The main aim of this paper is to determine which concepts and methods are most useful for dealing with our data. We consider an approach based on a machine learning method (classification) and present a comparative study of different distance measures and feature sets. The study was carried out on data from several Russian libraries. The data are in the RUSMARC format, a variant of UNIMARC that is popular in Russia.
Title: Evaluating link-based recommendations for Wikipedia
Authors: M. Schwarzer, M. Schubotz, Norman Meuschke, Corinna Breitinger, V. Markl, Bela Gipp
Venue: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)
DOI: https://doi.org/10.1145/2910896.2910908

Literature recommender systems support users in filtering the vast and increasing number of documents in digital libraries and on the Web. For academic literature, research has proven the ability of citation-based document similarity measures, such as Co-Citation (CoCit) or Co-Citation Proximity Analysis (CPA), to improve recommendation quality. In this paper, we report on the first large-scale investigation of the performance of the CPA approach in generating literature recommendations for Wikipedia, which is fundamentally different from the academic literature domain. We analyze links instead of citations to generate article recommendations. We evaluate CPA, CoCit, and the Apache Lucene MoreLikeThis (MLT) function, which represents a traditional text-based similarity measure. We use two datasets of 779,716 and 2.57 million Wikipedia articles, the Big Data processing framework Apache Flink, and a ten-node computing cluster. To enable our large-scale evaluation, we derive two quasi-gold standards from the links in Wikipedia's “See also” sections and a comprehensive Wikipedia clickstream dataset. Our results show that the citation-based measures CPA and CoCit have complementary strengths compared to the text-based MLT measure. While MLT performs well in identifying narrowly similar articles that share similar words and structure, the citation-based measures are better able to identify topically related information, such as information on the city of a certain university or other technical universities in the region. The CPA approach, which consistently outperformed CoCit, is better suited for identifying a broader spectrum of related articles, as well as popular articles that typically exhibit higher quality. Additional benefits of the CPA approach are its lower runtime requirements and its language independence, which allows for cross-language retrieval of articles. We present a manual analysis of exemplary articles to demonstrate and discuss our findings. The raw data and source code of our study, together with a manual on how to use them, are openly available at: https://github.com/wikimedia/citolytics.
Title: Research on the follow-up actions of college students' mobile search
Authors: Dan Wu, Shaobo Liang
Venue: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)
DOI: https://doi.org/10.1145/2910896.2910921

This paper focuses on the follow-up actions triggered by college students' mobile searches, studied in an uncontrolled experiment in which 30 participants took part over fifteen days. We collected mobile phone usage data with an app called AWARE and combined it with structured diaries and interviews to perform a quantitative and qualitative study. The results showed three categories of follow-up actions, the majority of which occurred within one hour of the initial search session. We also found that participants often carried out follow-up actions in different apps, and that certain information needs triggered more follow-up actions. Finally, we discuss the characteristics and causes of these actions and outline further studies, including comparing follow-up actions triggered by mobile search with those triggered by Web search, and building a model of follow-up actions.
Title: Knowledge curation discussions and activity dynamics in a short-lived social Q&A community
Authors: Hengyi Fu, Besiki Stvilia
Venue: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)
DOI: https://doi.org/10.1145/2910896.2925432

Studying the dynamics and lifecycles of online knowledge curation communities is essential for identifying and assembling community-type-specific repertoires of strategies, rules, and actions for community design, governance, content creation, and curation. This paper examines the lifecycle of a short-lived social Q&A community on Stack Exchange by performing a content analysis of the logs of member discussions and content curation actions.
Title: Big data processing of school shooting archives
Authors: M. Farag, P. Nakate, E. Fox
Venue: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)
DOI: https://doi.org/10.1145/2910896.2925466

Web archives about school shootings consist of webpages that may or may not be relevant to the events of interest. This work has three main goals: first, to clean the webpages, which involves removing stop words and non-relevant parts of each page; second, to select only the webpages relevant to the events of interest; and third, to upload the cleaned, relevant webpages to Apache Solr so that they are easily accessible. We show the details of all the steps required to achieve these goals. The results show that representative Web archives are noisy, containing only 2%-40% relevant content. By cleaning the archives, we help researchers focus on relevant content for their analysis.
Title: Information extraction for scholarly digital libraries
Authors: Kyle Williams, Jian Wu, Zhaohui Wu, C. Lee Giles
Venue: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)
DOI: https://doi.org/10.1145/2910896.2925430

Scholarly documents contain many data entities, such as titles, authors, affiliations, figures, and tables. These entities can be used to enhance digital library services through enhanced metadata and enable the development of new services and tools for interacting with and exploring scholarly data. However, in a world of scholarly big data, extracting these entities in a scalable, efficient, and accurate manner can be challenging. In this tutorial, we introduce the broad field of information extraction for scholarly digital libraries. Drawing on our experience in running the CiteSeerX digital library, which has performed information extraction on over 7 million academic documents, we argue for the need for automatic information extraction, describe different approaches for performing it, present tools and datasets that are readily available, and describe best practices and areas of research interest.
Title: The energy of delusion: The New York Art Resources Consortium (NYARC) & the digital
Authors: Stephen J. Bury
Venue: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)
DOI: https://doi.org/10.1145/2910896.2926742

Museum libraries came late to the digitization party, primarily because of perceived copyright issues. Since 2010, the three libraries of the New York Art Resources Consortium (NYARC) have embarked on a series of niche, boutique digitization projects that push the boundaries of fair use. They have also embraced the born-digital, establishing a program to capture art-history-rich websites and to provide access to them through an innovative use of a discovery layer that prioritizes web resources in the ranking of results.
Title: Real-time filtering on interest profiles in Twitter stream
Authors: Yue Fei, Chao Lv, Yansong Feng, Dongyan Zhao
Venue: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)
DOI: https://doi.org/10.1145/2910896.2925462

The advent of Twitter has led to a ubiquitous information overload problem, with a dramatic increase in the number of tweets a user is exposed to. In this paper, we consider real-time tweet filtering with respect to users' interest profiles in the public Twitter stream. While traditional filtering methods mainly focus on judging the relevance of a document, we aim to retrieve documents that are both relevant and novel, to address the high redundancy of tweets. We propose an unsupervised approach that adaptively models the relevance between tweets and different profiles, and we employ a neural network language model to learn semantic representations for tweets. Experiments on the TREC 2015 dataset demonstrate the effectiveness of the proposed approach.
Title: Glyph miner: A system for efficiently extracting glyphs from early prints in the context of OCR
Authors: B. Budig, Thomas C. van Dijk, F. Kirchner
Venue: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)
DOI: https://doi.org/10.1145/2910896.2910915

While off-the-shelf OCR systems work well on many modern documents, the heterogeneity of early prints poses a significant challenge. To achieve good recognition quality, existing software must be “trained” specifically for each particular corpus, a tedious process that involves significant user effort. In this paper we demonstrate a system that generically replaces a common part of the training pipeline with a more efficient workflow: given a set of scanned pages of a historical document, our system uses efficient user interaction to semi-automatically extract large numbers of occurrences of glyphs indicated by the user. In a preliminary case study, we evaluate the effectiveness of our approach by embedding our system into the workflow at the University Library Würzburg.
Title: How to identify specialized research communities related to a researcher's changing interests
Authors: Hamed Alhoori
Venue: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)
DOI: https://doi.org/10.1145/2910896.2925450

Scholarly events and venues are increasing rapidly in number. This poses a challenge for researchers who seek to identify events and venues related to their work in order to draw more efficiently and comprehensively from published research and to share their own findings more effectively. Such efforts are also hampered by the fact that no rating system yet exists to help researchers identify the venues most relevant to their current readings and interests. This study describes a methodology we developed in response to this need, one that recommends scholarly venues related to researchers' specific interests according to personalized social web indicators. Experiments applying our proposed rating and recommendation method show that it outperforms baseline venue recommendations in terms of accuracy and ranking quality.