Machine Reading Comprehension (MRC) of a document is a challenging problem that requires discourse-level understanding. Extracting information from scholarly articles is now a critical use case: it lets researchers grasp the underlying research quickly and move forward, especially in this age of infodemic. MRC on research articles can also provide helpful information to reviewers and editors. However, the main bottleneck in building such models is the limited availability of human-annotated data. In this paper, we first introduce a dataset to facilitate question answering (QA) on scientific articles. We prepare the dataset in a semi-automated fashion; it contains more than 100k human-annotated context-question-answer triples. Second, we implement a baseline QA model based on Bidirectional Encoder Representations from Transformers (BERT). Additionally, we implement two further models: the first is based on Science BERT (SciBERT), and the second combines SciBERT with Bi-Directional Attention Flow (Bi-DAF). The best model (i.e., SciBERT) obtains an F1 score of 75.46%. Our dataset is novel, and our work opens up a new avenue for scholarly document processing research by providing a benchmark QA dataset and a standard baseline. We make our dataset and code available at https://github.com/TanikSaikh/Scientific-Question-Answering.
{"title":"<i>ScienceQA</i>: a novel resource for question answering on scholarly articles.","authors":"Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, Pushpak Bhattacharyya","doi":"10.1007/s00799-022-00329-y","DOIUrl":"https://doi.org/10.1007/s00799-022-00329-y","url":null,"abstract":"<p><p>Machine Reading Comprehension (MRC) of a document is a challenging problem that requires discourse-level understanding. Information extraction from scholarly articles nowadays is a critical use case for researchers to understand the underlying research quickly and move forward, especially in this age of infodemic. MRC on research articles can also provide helpful information to the reviewers and editors. However, the main bottleneck in building such models is the availability of human-annotated data. In this paper, firstly, we introduce a dataset to facilitate question answering (QA) on scientific articles. We prepare the dataset in a semi-automated fashion having more than 100k human-annotated context-question-answer triples. Secondly, we implement one baseline QA model based on Bidirectional Encoder Representations from Transformers (BERT). Additionally, we implement two models: the first one is based on Science BERT (SciBERT), and the second is the combination of SciBERT and Bi-Directional Attention Flow (Bi-DAF). The best model (i.e., SciBERT) obtains an F1 score of 75.46%. Our dataset is novel, and our work opens up a new avenue for scholarly document processing research by providing a benchmark QA dataset and standard baseline. We make our dataset and codes available here at https://github.com/TanikSaikh/Scientific-Question-Answering.</p>","PeriodicalId":44974,"journal":{"name":"International Journal on Digital Libraries","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9297303/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40533640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-01-01 Epub Date: 2022-02-03 DOI: 10.1007/s00799-022-00321-6
Afrodite Malliari, Ilias Nitsos, Sofia Zapounidou, Stavros Doropoulos
In Greece, there are many audiovisual resources available on the Internet that interest scientists and the general public. Although these resources are freely available, finding them is often challenging because they are hosted on scattered websites and in different types and formats. These websites usually offer limited search options; at the same time, there is no aggregation service for audiovisual resources, nor a national registry for such content. To meet this need, the Open AudioVisual Archives project was launched, and the first step in its development is to create a dataset of open access audiovisual material. The current research creates such a dataset by applying specific selection criteria in terms of copyright and content, form/use, and process/technical characteristics. The results reported in this paper show that libraries, archives, museums, universities, mass media organizations, and governmental and non-governmental organizations are the main types of providers, but the vast majority of resources are open courses offered by universities under the "Creative Commons" license. Providers differ significantly in their collection management capabilities. Most of them do not own any kind of publishing infrastructure and use commercial streaming services, such as YouTube. In terms of metadata policy, most of the providers use application profiles instead of international metadata schemas.
{"title":"Mapping audiovisual content providers and resources in Greece.","authors":"Afrodite Malliari, Ilias Nitsos, Sofia Zapounidou, Stavros Doropoulos","doi":"10.1007/s00799-022-00321-6","DOIUrl":"https://doi.org/10.1007/s00799-022-00321-6","url":null,"abstract":"<p><p>In Greece, there are many audiovisual resources available on the Internet that interest scientists and the general public. Although freely available, finding such resources often becomes a challenging task, because they are hosted on scattered websites and in different types/formats. These websites usually offer limited search options; at the same time, there is no aggregation service for audiovisual resources, nor a national registry for such content. To meet this need, the Open AudioVisual Archives project was launched and the first step in its development is to create a dataset with open access audiovisual material. The current research creates such a dataset by applying specific selection criteria in terms of copyright and content, form/use and process/technical characteristics. The results reported in this paper show that libraries, archives, museums, universities, mass media organizations, governmental and non-governmental organizations are the main types of providers, but the vast majority of resources are open courses offered by universities under the \"Creative Commons\" license. Providers have significant differences in terms of their collection management capabilities. Most of them do not own any kind of publishing infrastructure and use commercial streaming services, such as YouTube. In terms of metadata policy, most of the providers use application profiles instead of international metadata schemas.</p>","PeriodicalId":44974,"journal":{"name":"International Journal on Digital Libraries","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8811596/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39898216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal-relation classification plays an important role in the field of natural language processing. Various deep learning-based classifiers, which can generate better models using sentence embeddings, have been proposed to address this challenging task. These approaches, however, do not work well because they lack task-related information. To overcome this problem, we propose a novel framework that incorporates prior information: it uses awareness of events and time expressions (time-event entities), with various window sizes, as a filter that focuses on the context words around the entities. We refer to this module as the "question encoder." In our approach, this kind of prior information can extract task-related information from a simple sentence embedding. Our experimental results on the publicly available Timebank-Dense corpus demonstrate that our approach outperforms some state-of-the-art techniques, including CNN-, LSTM-, and BERT-based temporal relation classifiers.
{"title":"CNN-based framework for classifying temporal relations with question encoder.","authors":"Yohei Seki, Kangkang Zhao, Masaki Oguni, Kazunari Sugiyama","doi":"10.1007/s00799-021-00310-1","DOIUrl":"https://doi.org/10.1007/s00799-021-00310-1","url":null,"abstract":"<p><p>Temporal-relation classification plays an important role in the field of natural language processing. Various deep learning-based classifiers, which can generate better models using sentence embedding, have been proposed to address this challenging task. These approaches, however, do not work well due to the lack of task-related information. To overcome this problem, we propose a novel framework that incorporates prior information by employing awareness of events and time expressions (time-event entities) with various window sizes to focus on context words around the entities as a filter. We refer to this module as \"question encoder.\" In our approach, this kind of prior information can extract task-related information from simple sentence embedding. Our experimental results on a publicly available <i>Timebank-Dense</i> corpus demonstrate that our approach outperforms some state-of-the-art techniques, including CNN-, LSTM-, and BERT-based temporal relation classifiers.</p>","PeriodicalId":44974,"journal":{"name":"International Journal on Digital Libraries","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8513567/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39890130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-01-01 Epub Date: 2022-10-05 DOI: 10.1007/s00799-022-00339-w
Christin Katharina Kreutz, Ralf Schenkel
Scientific writing builds upon already published papers. Manual identification of publications to read, cite or consider as related papers relies on a researcher's ability to identify fitting keywords or initial papers from which a literature search can be started. The rapidly increasing number of papers has called for automatic measures to find the desired relevant publications, so-called paper recommendation systems. As the number of publications increases, so does the number of paper recommendation systems. Former literature reviews focused on discussing the general landscape of approaches throughout the years and highlighted the main directions. We refrain from this perspective; instead, we consider only a comparatively small time frame but analyse it fully. In this literature review we discuss the methods used, datasets, evaluations and open challenges encountered in all works first released between January 2019 and October 2021. The goal of this survey is to provide a comprehensive and complete overview of current paper recommendation systems.
{"title":"Scientific paper recommendation systems: a literature review of recent publications.","authors":"Christin Katharina Kreutz, Ralf Schenkel","doi":"10.1007/s00799-022-00339-w","DOIUrl":"https://doi.org/10.1007/s00799-022-00339-w","url":null,"abstract":"<p><p>Scientific writing builds upon already published papers. Manual identification of publications to read, cite or consider as related papers relies on a researcher's ability to identify fitting keywords or initial papers from which a literature search can be started. The rapidly increasing amount of papers has called for automatic measures to find the desired <i>relevant</i> publications, so-called paper recommendation systems. As the number of publications increases so does the amount of paper recommendation systems. Former literature reviews focused on discussing the general landscape of approaches throughout the years and highlight the main directions. We refrain from this perspective, instead we only consider a comparatively small time frame but analyse it fully. In this literature review we discuss used methods, datasets, evaluations and open challenges encountered in all works first released between January 2019 and October 2021. The goal of this survey is to provide a comprehensive and complete overview of current paper recommendation systems.</p>","PeriodicalId":44974,"journal":{"name":"International Journal on Digital Libraries","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9533296/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"33496792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-11-30 DOI: 10.1007/s00799-021-00316-9
Nicholas Vanderschantz, A. Hinze
{"title":"Children’s query formulation and search result exploration","authors":"Nicholas Vanderschantz, A. Hinze","doi":"10.1007/s00799-021-00316-9","DOIUrl":"https://doi.org/10.1007/s00799-021-00316-9","url":null,"abstract":"","PeriodicalId":44974,"journal":{"name":"International Journal on Digital Libraries","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2021-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84577425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-11-29 DOI: 10.1007/s00799-021-00319-6
Elvys Linhares Pontes, Luis Adrián Cabrera-Diego, José G. Moreno, Emanuela Boros, Ahmed Hamdi, Antoine Doucet, Nicolas Sidère, Mickaël Coustaty
{"title":"MELHISSA: a multilingual entity linking architecture for historical press articles","authors":"Elvys Linhares Pontes, Luis Adrián Cabrera-Diego, José G. Moreno, Emanuela Boros, Ahmed Hamdi, Antoine Doucet, Nicolas Sidère, Mickaël Coustaty","doi":"10.1007/s00799-021-00319-6","DOIUrl":"https://doi.org/10.1007/s00799-021-00319-6","url":null,"abstract":"","PeriodicalId":44974,"journal":{"name":"International Journal on Digital Libraries","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2021-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73677985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-11-23 DOI: 10.1007/s00799-021-00317-8
Christin Katharina Kreutz, Michael Wolz, Jascha Knack, B. Weyers, Ralf Schenkel
{"title":"SchenQL: in-depth analysis of a query language for bibliographic metadata","authors":"Christin Katharina Kreutz, Michael Wolz, Jascha Knack, B. Weyers, Ralf Schenkel","doi":"10.1007/s00799-021-00317-8","DOIUrl":"https://doi.org/10.1007/s00799-021-00317-8","url":null,"abstract":"","PeriodicalId":44974,"journal":{"name":"International Journal on Digital Libraries","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2021-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80894407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VeTo+: improved expert set expansion in academia","authors":"Serafeim Chatzopoulos, Thanasis Vergoulis, Theodore Dalamagas, Christos Tryfonopoulos","doi":"10.1007/s00799-021-00318-7","DOIUrl":"https://doi.org/10.1007/s00799-021-00318-7","url":null,"abstract":"","PeriodicalId":44974,"journal":{"name":"International Journal on Digital Libraries","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2021-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79868917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-11-07 DOI: 10.1007/s00799-021-00312-z
T. Saier, Michael Färber, Tornike Tsereteli
{"title":"Cross-lingual citations in English papers: a large-scale analysis of prevalence, usage, and impact","authors":"T. Saier, Michael Färber, Tornike Tsereteli","doi":"10.1007/s00799-021-00312-z","DOIUrl":"https://doi.org/10.1007/s00799-021-00312-z","url":null,"abstract":"","PeriodicalId":44974,"journal":{"name":"International Journal on Digital Libraries","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2021-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76415464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-11-03 DOI: 10.1007/s00799-021-00314-x
Brenda Reyes Ayala
{"title":"Correspondence as the primary measure of information quality for web archives: a human-centered grounded theory study","authors":"Brenda Reyes Ayala","doi":"10.1007/s00799-021-00314-x","DOIUrl":"https://doi.org/10.1007/s00799-021-00314-x","url":null,"abstract":"","PeriodicalId":44974,"journal":{"name":"International Journal on Digital Libraries","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2021-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89782953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}