Problem/project Based Learning (PBL) is a highly effective student-centered teaching method, where student teams learn by solving problems. This paper describes an instance of PBL applied to digital library education. We show the design, implementation, results, and partial evaluation of a Computational Linguistics course that provides students an opportunity to engage in active learning about adding value to digital libraries with large collections of text, i.e., one aspect of "big data." Students are engaging in PBL with the semester long challenge of generating good English summaries of an event, given a large collection from our webpage archives. Six teams, each working with a different type of event, and applying three different summarization methods, learned how to generate good summaries; these have fair precision relative to the Wikipedia page that describes their event.
{"title":"Big Data Text Summarization for Events: A Problem Based Learning Course","authors":"Tarek Kanan, Xuan Zhang, M. Magdy, E. Fox","doi":"10.1145/2756406.2756943","DOIUrl":"https://doi.org/10.1145/2756406.2756943","url":null,"abstract":"Problem/project Based Learning (PBL) is a highly effective student-centered teaching method, where student teams learn by solving problems. This paper describes an instance of PBL applied to digital library education. We show the design, implementation, results, and partial evaluation of a Computational Linguistics course that provides students an opportunity to engage in active learning about adding value to digital libraries with large collections of text, i.e., one aspect of \"big data.\" Students are engaging in PBL with the semester long challenge of generating good English summaries of an event, given a large collection from our webpage archives. Six teams, each working with a different type of event, and applying three different summarization methods, learned how to generate good summaries; these have fair precision relative to the Wikipedia page that describes their event.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116999490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
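The abstract above reports that the summaries have "fair precision relative to the Wikipedia page" but does not say how precision was computed. One simple reading, given here only as an illustrative sketch and not as the course's actual metric, is the share of distinct summary words that also appear in the reference page. The function names and the sample texts below are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def unigram_precision(summary, reference):
    """Fraction of distinct summary words that also occur in the reference.

    This is an assumed, simplified measure; the paper may use another one.
    """
    summary_words = set(tokenize(summary))
    if not summary_words:
        return 0.0
    reference_words = set(tokenize(reference))
    return len(summary_words & reference_words) / len(summary_words)


# Hypothetical summary and reference text, for illustration only.
summary = "A flood hit the town in June"
reference = "In May a severe flood hit the small town and nearby farms"
score = unigram_precision(summary, reference)
print(round(score, 3))
```

A word-overlap measure like this rewards summaries that reuse the reference vocabulary, so it says nothing about fluency or coverage; those would need separate checks.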
This paper describes the results of a large survey designed to quantify the risks and threats to the preservation of the research data in the lab and to determine the mitigating actions of researchers. A total of 724 National Science Foundation awardees completed this survey. Identifying risks and threats to digital preservation has been a significant research stream. Much of this work has been within the context of a preservation technology infrastructure such as data archives for a digital repository. This study looks at the risks and threats to research data prior to its inclusion in a preservation technology infrastructure. The greatest threat to preservation is human error, followed by equipment malfunction, obsolete software, and data corruption. Lost and mislabeled media are not components in the threat taxonomies developed for repositories; however, they do represent an important threat to research data in the lab. Researchers have recognized the need to mitigate the risks inherent in maintaining digital data by implementing data management in their lab environments and have taken their responsibility as data managers seriously; however, they would still prefer to have professional data management support.
{"title":"Before the Repository: Defining the Preservation Threats to Research Data in the Lab","authors":"Stacy T. Kowalczyk","doi":"10.1145/2756406.2756909","DOIUrl":"https://doi.org/10.1145/2756406.2756909","url":null,"abstract":"This paper describes the results of a large survey designed to quantify the risks and threats to the preservation of the research data in the lab and to determine the mitigating actions of researchers. A total of 724 National Science Foundation awardees completed this survey. Identifying risks and threats to digital preservation has been a significant research stream. Much of this work has been within the context of a preservation technology infrastructure such as data archives for a digital repository. This study looks at the risks and threats to research data prior to its inclusion in a preservation technology infrastructure. The greatest threat to preservation is human error, followed by equipment malfunction, obsolete software, and data corruption. Lost and mislabeled media are not components in the threat taxonomies developed for repositories; however, they do represent an important threat to research data in the lab. Researchers have recognized the need to mitigate the risks inherent in maintaining digital data by implementing data management in their lab environments and have taken their responsibility as data managers seriously; however, they would still prefer to have professional data management support.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121463678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This lecture provides an update on the recent developments and activities of the HathiTrust Research Center (HTRC). The HTRC is the research arm of the HathiTrust, an online repository dedicated to the provision of access to a comprehensive body of published works for scholarship and education. The HathiTrust is a partnership of over 100 major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future. Membership is open to institutions worldwide. Over 13.1 million volumes (4.7 billion pages) have been ingested into the HathiTrust digital archive from sources including Google Books, member university libraries, the Internet Archive, and numerous private collections. The HTRC is dedicated to facilitating scholarship by enabling analytic access to the corpus, developing research tools, fostering research projects and communities, and providing additional resources such as enhanced metadata and indices that will assist scholars to more easily exploit the HathiTrust materials. This talk will outline the mission, goals and structure of the HTRC. It will also provide an overview of recent work being conducted on a range of projects, partnerships and initiatives. Projects include the Workset Creation for Scholarly Analysis project (WCSA, funded by the Andrew W. Mellon Foundation) and the HathiTrust + Bookworm project (HT+BW, funded by the National Endowment for the Humanities). HTRC's involvement with the NOVEL(TM) text mining project and the Single Interface for Music Score Searching and Analysis (SIMSSA) project, both funded by the SSHRC Partnership Grant programme, will be introduced. The HTRC's new feature extraction and Data Capsule initiatives, part of its ongoing efforts to enable non-consumptive analyses of the approximately 8 million volumes under copyright restrictions, will also be discussed. The talk will conclude with some suggestions on how the non-consumptive research model might be improved upon and possibly extended beyond the HathiTrust context.
{"title":"The HathiTrust Research Center: Providing analytic access to the HathiTrust Digital Library's 4.7 billion pages","authors":"J. S. Downie","doi":"10.1145/2756406.2771494","DOIUrl":"https://doi.org/10.1145/2756406.2771494","url":null,"abstract":"This lecture provides an update on the recent developments and activities of the HathiTrust Research Center (HTRC). The HTRC is the research arm of the HathiTrust, an online repository dedicated to the provision of access to a comprehensive body of published works for scholarship and education. The HathiTrust is a partnership of over 100 major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future. Membership is open to institutions worldwide. Over 13.1 million volumes (4.7 billion pages) have been ingested into the HathiTrust digital archive from sources including Google Books, member university libraries, the Internet Archive, and numerous private collections. The HTRC is dedicated to facilitating scholarship by enabling analytic access to the corpus, developing research tools, fostering research projects and communities, and providing additional resources such as enhanced metadata and indices that will assist scholars to more easily exploit the HathiTrust materials. This talk will outline the mission, goals and structure of the HTRC. It will also provide an overview of recent work being conducted on a range of projects, partnerships and initiatives. Projects include the Workset Creation for Scholarly Analysis project (WCSA, funded by the Andrew W. Mellon Foundation) and the HathiTrust + Bookworm project (HT+BW, funded by the National Endowment for the Humanities). HTRC's involvement with the NOVEL(TM) text mining project and the Single Interface for Music Score Searching and Analysis (SIMSSA) project, both funded by the SSHRC Partnership Grant programme, will be introduced. The HTRC's new feature extraction and Data Capsule initiatives, part of its ongoing efforts to enable non-consumptive analyses of the approximately 8 million volumes under copyright restrictions, will also be discussed. The talk will conclude with some suggestions on how the non-consumptive research model might be improved upon and possibly extended beyond the HathiTrust context.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115272594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 9 - Archiving, Repositories, and Content","authors":"Maureen Henninger","doi":"10.1145/3260517","DOIUrl":"https://doi.org/10.1145/3260517","url":null,"abstract":"","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121736925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the increasing popularity of e-books and audiobooks provided by public libraries in the U.S., the demand does not seem to be met with sufficient supply, as many popular titles require months of waiting time. In this study, we collected data from the Wisconsin Public Library Consortium's digital libraries service once a day for more than two months for selected popular titles. This data reflects the current supply and demand of popular titles in public libraries' digital library services. Based on our data analysis and observation, we suggest ways to achieve faster circulation, which ultimately allows for better services to library users.
{"title":"Case Study of Waiting List on WPLC Digital Library","authors":"Wooseob Jeong, H. Han, Laura Ridenour","doi":"10.1145/2756406.2756961","DOIUrl":"https://doi.org/10.1145/2756406.2756961","url":null,"abstract":"With the increasing popularity of e-books and audiobooks provided by public libraries in the U.S., the demand does not seem to be met with sufficient supply, as many popular titles require months of waiting time. In this study, we collected data from the Wisconsin Public Library Consortium's digital libraries service once a day for more than two months for selected popular titles. This data reflects the current supply and demand of popular titles in public libraries' digital library services. Based on our data analysis and observation, we suggest ways to achieve faster circulation, which ultimately allows for better services to library users.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117169866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content regarding current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Especially Social Media provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issues of enabling the collection of fresh and relevant Web and Social Web content for a topic of interest through seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content for guiding the crawler.
{"title":"iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling","authors":"Gerhard Gossen, Elena Demidova, T. Risse","doi":"10.1145/2756406.2756925","DOIUrl":"https://doi.org/10.1145/2756406.2756925","url":null,"abstract":"Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content regarding current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Especially Social Media provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issues of enabling the collection of fresh and relevant Web and Social Web content for a topic of interest through seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content for guiding the crawler.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"358 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122728865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
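The iCrawl abstract above says the crawler weighs temporal as well as topical aspects, but it gives no scoring formula. The sketch below is one plausible way to order a crawl frontier under that idea, assuming a relevance score in the range 0 to 1 and an exponential freshness decay with a configurable half-life. The `priority` function, the `Frontier` class, the half-life value and the example URLs are all hypothetical and are not taken from the paper.

```python
import heapq
import math


def priority(relevance, age_hours, half_life_hours=24.0):
    """Combine topical relevance (0..1) with an exponential freshness decay.

    Assumed scoring rule: the score halves every `half_life_hours`.
    """
    freshness = math.exp(-math.log(2) * age_hours / half_life_hours)
    return relevance * freshness


class Frontier:
    """Priority queue of URLs; the highest combined score is crawled first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url, relevance, age_hours):
        # Skip URLs that were already queued once.
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-priority(relevance, age_hours), url))

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
# A highly relevant but ten-day-old page.
frontier.add("https://example.org/old-report", relevance=0.9, age_hours=240)
# A somewhat less relevant link that appeared in a post one hour ago.
frontier.add("https://example.org/tweet-link", relevance=0.6, age_hours=1)
first = frontier.pop()
print(first)
```

With a 24-hour half-life the fresh link outranks the older, more relevant page; a longer half-life would shift the balance back toward topical relevance.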
Exploring the accumulative nature of Internet documents has become a rising issue that requires systematic ways to construct what we need from what we have. Manual and semi-manual document classification techniques have facilitated retrieval and maintenance of document repositories for easy access; however, they are customarily painstaking and labor-intensive. Herein, we propose a document classification model using automatic access of natural language meaning. The model is made up of application, business, and storage layers. The business layer, as a core component, automatically extracts sentences containing keywords from research documents and classifies them using the geometrical similarity of their sentential entailments.
{"title":"Automatic Classification of Research Documents using Textual Entailment","authors":"B. Ojokoh, O. Omisore, O. W. Samuel","doi":"10.1145/2756406.2756960","DOIUrl":"https://doi.org/10.1145/2756406.2756960","url":null,"abstract":"Exploring the accumulative nature of Internet documents has become a rising issue that requires systematic ways to construct what we need from what we have. Manual and semi-manual document classification techniques have facilitated retrieval and maintenance of document repositories for easy access; however, they are customarily painstaking and labor-intensive. Herein, we propose a document classification model using automatic access of natural language meaning. The model is made up of application, business, and storage layers. The business layer, as a core component, automatically extracts sentences containing keywords from research documents and classifies them using the geometrical similarity of their sentential entailments.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123182803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
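The classification abstract above describes a business layer that extracts keyword-bearing sentences and classifies documents by a geometrical similarity measure over their entailments. The abstract does not define that measure, so the sketch below swaps in plain bag-of-words cosine similarity as a stand-in; it illustrates the extract-then-compare flow only, not the authors' entailment model. All function names, class profiles and sample text are hypothetical.

```python
import math
import re
from collections import Counter


def words(text):
    """Lowercase alphabetic tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


def sentences_with_keywords(text, keywords):
    """Keep only the sentences that contain at least one keyword."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    keys = {k.lower() for k in keywords}
    return [s for s in sentences if keys & set(words(s))]


def vector(sentences):
    """Bag-of-words count vector over a list of sentences."""
    return Counter(w for s in sentences for w in words(s))


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, keywords, class_profiles):
    """Return the label whose profile is most similar to the extracted sentences."""
    doc = vector(sentences_with_keywords(text, keywords))
    return max(class_profiles, key=lambda label: cosine(doc, class_profiles[label]))


# Hypothetical class profiles and document, for illustration only.
profiles = {
    "retrieval": vector(["search query ranking index"]),
    "biology": vector(["protein gene cell sequence"]),
}
document = "We build an index for fast search. The weather was nice. Ranking uses query logs."
keywords = ["search", "ranking"]
label = classify(document, keywords, profiles)
print(label)
```

Replacing `cosine` with a score derived from a textual entailment model would bring the sketch closer to what the abstract describes, at the cost of needing a trained model.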
{"title":"Session details: Session 3 - Big Data, Big Resources","authors":"G. Newton","doi":"10.1145/3260511","DOIUrl":"https://doi.org/10.1145/3260511","url":null,"abstract":"","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129981385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efforts to make highly specialized knowledge accessible through scientific digital libraries need to go beyond mere bibliographic metadata, since here information search is mostly entity-centric. Previous work has realized this trend and developed different methods to recognize and (to some degree even automatically) annotate several important types of entities: genes and proteins, chemical structures and molecules, or drug names to name but a few. Moreover, such entities are often cross-referenced with entries in curated databases. However, several questions still remain to be answered: Given a scientific discipline, what are the important entities? How can they be automatically identified? Are all of them really relevant, i.e., do all of them carry deeper semantics for assessing a publication? How can they be represented, described, and subsequently annotated? How can they be used for search tasks? In this work we focus on answering some of these questions. We claim that to bring the use of scientific digital libraries to the next level we must treat topic-specific entities as first class citizens and deeply integrate their semantics into the search process. To support this we propose a novel probabilistic approach that not only successfully provides a solution to the integration problem, but also demonstrates how to leverage the knowledge encoded in entities and provide insights to explore the use of our approach in different scenarios. Finally, we show how our results can benefit information providers.
{"title":"Demystifying the Semantics of Relevant Objects in Scholarly Collections: A Probabilistic Approach","authors":"J. M. Pinto, Wolf-Tilo Balke","doi":"10.1145/2756406.2756923","DOIUrl":"https://doi.org/10.1145/2756406.2756923","url":null,"abstract":"Efforts to make highly specialized knowledge accessible through scientific digital libraries need to go beyond mere bibliographic metadata, since here information search is mostly entity-centric. Previous work has realized this trend and developed different methods to recognize and (to some degree even automatically) annotate several important types of entities: genes and proteins, chemical structures and molecules, or drug names to name but a few. Moreover, such entities are often cross-referenced with entries in curated databases. However, several questions still remain to be answered: Given a scientific discipline, what are the important entities? How can they be automatically identified? Are all of them really relevant, i.e., do all of them carry deeper semantics for assessing a publication? How can they be represented, described, and subsequently annotated? How can they be used for search tasks? In this work we focus on answering some of these questions. We claim that to bring the use of scientific digital libraries to the next level we must treat topic-specific entities as first class citizens and deeply integrate their semantics into the search process. To support this we propose a novel probabilistic approach that not only successfully provides a solution to the integration problem, but also demonstrates how to leverage the knowledge encoded in entities and provide insights to explore the use of our approach in different scenarios. Finally, we show how our results can benefit information providers.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"os-44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127782629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anime is increasingly becoming recognized as an important commercial product and cultural artifact. However, little is known regarding users' information needs and behavior related to anime. This study specifically attempts to improve our understanding of how people seek anime recommendations. We analyzed 546 user questions in natural language, collected from a Korean Q&A website Naver Knowledge-iN, where users are asking for anime recommendations. The findings suggest the importance of establishing robust metadata for the seven commonly used features for anime recommenders (i.e., title, genre, artistic style, story, character description, series title, and mood) in digital libraries, as well as allowing users to specify known anime and series titles as examples for seeking similar items, or examples of the kinds of items to be excluded.
{"title":"Analyzing User Requests for Anime Recommendations","authors":"Jin Ha Lee, Yun-Jeong Shim, Jacob Jett","doi":"10.1145/2756406.2756969","DOIUrl":"https://doi.org/10.1145/2756406.2756969","url":null,"abstract":"Anime is increasingly becoming recognized as an important commercial product and cultural artifact. However, little is known regarding users' information needs and behavior related to anime. This study specifically attempts to improve our understanding of how people seek anime recommendations. We analyzed 546 user questions in natural language, collected from a Korean Q&A website Naver Knowledge-iN, where users are asking for anime recommendations. The findings suggest the importance of establishing robust metadata for the seven commonly used features for anime recommenders (i.e., title, genre, artistic style, story, character description, series title, and mood) in digital libraries, as well as allowing users to specify known anime and series titles as examples for seeking similar items, or examples of the kinds of items to be excluded.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127103118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}