Andrea Reyes Elizondo, C. Calero-Medina, M. Visser
Abstract Purpose A key question when ranking universities is whether or not to allocate the publication output of affiliated hospitals to universities. This paper presents a method for classifying the varying degrees of interdependency between academic hospitals and universities in the context of the Leiden Ranking. Design/methodology/approach Hospital nomenclatures vary worldwide to denote some form of collaboration with a university; however, they do not correspond to universally standard definitions. Thus, rather than seeking a normative definition of academic hospitals, we propose a three-step workflow that aligns the university-hospital relationship with one of three general models: full integration of the hospital and the medical faculty into a single organization; health science centres in which hospitals and medical faculty remain separate entities albeit within the same governance structure; and structures in which universities and hospitals are separate entities that collaborate with one another. This classification system provides a standard through which publications that mention affiliations with academic hospitals can be better allocated. Findings In the paper we illustrate how the three-step workflow effectively translates the three above-mentioned models into two types of instrumental relationships for the assignment of publications: “associate” and “component”. When a hospital and a medical faculty are fully integrated or when a hospital is part of a health science centre, the relationship is classified as component. When a hospital follows the model of collaboration and support, the relationship is classified as associate. The compilation of data following these standards allows for a more uniform comparison between worldwide educational and research systems. Research limitations The workflow is resource intensive, depends heavily on the information provided by universities and hospitals, and is more challenging for languages that use non-Latin characters. Further, applying the workflow demands a careful evaluation of different types of input, which can result in ambiguity and makes it difficult to automate. Practical implications Determining the type of affiliation an academic hospital has with a university can have a substantial impact on the publication counts for universities. This workflow can also aid in analysing collaborations between the two types of organizations. Originality/value The three-step workflow is a unique way to establish the type of relationship an academic hospital has with a university, accounting for national and regional differences in nomenclature.
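The mapping from the three organizational models to the two relationship types described in the Findings can be sketched in a few lines. This is only an illustration of the final classification rule: the enum and function names are hypothetical, and the three-step workflow that decides which model applies to a given hospital is not reproduced here.

```python
from enum import Enum


class HospitalModel(Enum):
    """The three general models named in the abstract (identifiers are illustrative)."""
    FULL_INTEGRATION = "full integration"
    HEALTH_SCIENCE_CENTRE = "health science centre"
    COLLABORATION = "collaboration and support"


def relationship_type(model: HospitalModel) -> str:
    """Map an organizational model to 'component' or 'associate'."""
    # Full integration and health science centres both count as component.
    if model in (HospitalModel.FULL_INTEGRATION, HospitalModel.HEALTH_SCIENCE_CENTRE):
        return "component"
    # Separate entities that merely collaborate count as associate.
    return "associate"
```

For example, `relationship_type(HospitalModel.COLLABORATION)` yields `"associate"`.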
{"title":"The Three-Step Workflow: A Pragmatic Approach to Allocating Academic Hospitals’ Affiliations for Bibliometric Purposes","authors":"Andrea Reyes Elizondo, C. Calero-Medina, M. Visser","doi":"10.2478/jdis-2022-0006","DOIUrl":"https://doi.org/10.2478/jdis-2022-0006","PeriodicalId":92237,"journal":{"name":"Journal of data and information science (Warsaw, Poland)","volume":"7 1","pages":"20 - 36"},"publicationDate":"2021-05-28","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"44720964"}
Jiao Li, Guojian Xian, Ruixue Zhao, Yongwen Huang, Yuantao Kou, Tingting Luo, Tan Sun
Abstract Purpose The interdisciplinary nature and rapid development of the Semantic Web have led to the mass publication of RDF data in a large number of widely accepted serialization formats, creating a need for RDF data processing tailored to specific purposes. The paper reports on an assessment of the chief RDF data endpoint challenges and introduces RDFAdaptor, a set of plugins for RDF data processing that covers the whole life-cycle with high efficiency. Design/methodology/approach RDFAdaptor is built on the prominent ETL tool Pentaho Data Integration, which provides a user-friendly and intuitive interface and allows connection to various data sources and formats, and it reuses the Java framework RDF4J as middleware for access to data repositories, SPARQL endpoints, and all leading RDF database solutions with SPARQL 1.1 support. It can support services with little effort through various configuration templates in multi-scenario applications, and help extend data processing tasks in other services or tools to complement missing functions. Findings The proposed comprehensive RDF ETL solution, RDFAdaptor, provides an easy-to-use and intuitive interface, supports data integration and federation over multi-source heterogeneous repositories or endpoints, and manages linked data in hybrid storage mode. Research limitations The plugin set supports several application scenarios of RDF data processing, but error detection and checking, as well as interaction with other graph repositories, remain to be improved. Practical implications The plugin set provides a user interface and configuration templates that make it usable in various applications: RDF data generation, multi-format data conversion, remote RDF data migration, and RDF graph update in the semantic query process. Originality/value This is the first attempt to develop components, rather than systems, that can extract, consolidate, and store RDF data on the basis of a data warehousing environment with a mature ecosystem.
{"title":"RDFAdaptor: Efficient ETL Plugins for RDF Data Process","authors":"Jiao Li, Guojian Xian, Ruixue Zhao, Yongwen Huang, Yuantao Kou, Tingting Luo, Tan Sun","doi":"10.2478/jdis-2021-0020","DOIUrl":"https://doi.org/10.2478/jdis-2021-0020","PeriodicalId":92237,"journal":{"name":"Journal of data and information science (Warsaw, Poland)","volume":"6 1","pages":"123 - 145"},"publicationDate":"2021-04-14","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"46308518"}
Abstract Purpose This study aims to construct new models and methods of academic genealogy research based on bibliometrics. Design/methodology/approach This study proposes an academic influence scale for academic genealogy, and introduces the w index for bibliometric scaling of the academic genealogy. We then construct a two-dimensional (academic fecundity versus academic influence) evaluation system of academic genealogy, and validate it on the academic genealogy of a famous Chinese geologist. Findings The two-dimensional evaluation system can characterize the development and evolution of the academic genealogy, compare the academic influences of different genealogies, and evaluate individuals’ contributions to the inheritance and evolution of the academic genealogy. Individual academic influence is mainly indicated by the w index (an improved h index), which avoids the repeated measurements and distorted results that otherwise arise in the academic genealogy. Practical implications The two-dimensional evaluation system can better demonstrate a genealogy’s capacity for reproduction and academic inheritance. Research limitations Using only the w index to characterize academic influence is not comprehensive; scholars’ academic awards, part-time academic positions, and similar factors should also be included. In future work, we will integrate scholars’ academic awards and part-time academic positions into the w index for a more comprehensive reflection of individual scholars’ academic influence. Originality/value This study constructs new models and methods of academic genealogy research based on bibliometrics, which improves the quantitative assessment of academic genealogy and enriches its research and evaluation methods.
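For orientation, the h index that the w index is described as improving on can be computed as below. This is a minimal sketch of the standard h index only: the w index itself is not defined in the abstract, and the function name and sample citation counts are illustrative assumptions.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for one scholar's papers.
example = [10, 8, 5, 4, 3]
print(h_index(example))
```

How citations to papers co-authored within one genealogy are handled is exactly where the abstract says repeated measurement arises; the sketch does not address that.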
{"title":"Bibliometric-based Study of Scientist Academic Genealogy","authors":"R. Lv, Huan Chang","doi":"10.2478/jdis-2021-0021","DOIUrl":"https://doi.org/10.2478/jdis-2021-0021","PeriodicalId":92237,"journal":{"name":"Journal of data and information science (Warsaw, Poland)","volume":"6 1","pages":"146 - 163"},"publicationDate":"2021-04-14","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"42737707"}
Abstract Purpose The ranking lists of highly cited researchers receive much public attention. In common interpretations, highly cited researchers are perceived to have made extraordinary contributions to science. Thus, the metrics of highly cited researchers are often linked to notions of breakthroughs, scientific excellence, and lone geniuses. Design/methodology/approach In this study, we analyze a sample of individuals who appear on Clarivate Analytics’ Highly Cited Researchers list. The main purpose is to juxtapose the characteristics of their research performance against the claim that the list captures a small fraction of the researcher population that contributes disproportionately to extending the frontier and gaining, on behalf of society, knowledge and innovations that make the world healthier, richer, sustainable, and more secure. Findings The study reveals that the highly cited articles of the selected individuals generally have a very large number of authors. Thus, these papers seldom represent individual contributions but rather are the result of large collective research efforts conducted in research consortia. This challenges the common perception of highly cited researchers as individual geniuses who can be singled out for their extraordinary contributions. Moreover, the study indicates that a few of the individuals have not even contributed to highly cited original research but rather to reviews or clinical guidelines. Finally, the large number of authors of the papers implies that the ranking list is very sensitive to the specific method used for allocating papers and citations to individuals. In the “whole count” methodology applied by Clarivate Analytics, each author gets full credit for the papers regardless of the number of additional co-authors. The study shows that the ranking list would look very different using an alternative fractionalised methodology. Research limitations The study is based on a limited part of the total population of highly cited researchers. Practical implications It is concluded that “excellence” understood as highly cited encompasses very different types of research and researchers, many of which do not fit dominant preconceptions. Originality/value The study develops further knowledge on highly cited researchers, addressing questions such as who becomes highly cited and which types of research benefit when excellence is defined in terms of citation scores and specific counting methods.
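The difference between whole and fractionalised counting can be made concrete with a small sketch. It assumes equal fractional counting, in which each of n co-authors receives 1/n of a paper; the abstract does not say which fractional scheme the study used, and the author labels and paper list below are made up for illustration.

```python
from collections import defaultdict


def count_credit(papers, fractional=False):
    """Sum per-author credit over papers, each given as a list of author names.

    Whole counting gives every author 1 per paper; equal fractional
    counting (an assumption here) gives each of n authors 1/n.
    """
    credit = defaultdict(float)
    for authors in papers:
        share = 1.0 / len(authors) if fractional else 1.0
        for author in authors:
            credit[author] += share
    return dict(credit)


# Hypothetical data: one four-author paper, one solo paper, one two-author paper.
papers = [["A", "B", "C", "D"], ["A"], ["B", "C"]]
whole = count_credit(papers)
frac = count_credit(papers, fractional=True)
print(whole)
print(frac)
```

Under whole counting A, B and C tie with 2 papers each, while under fractional counting A (1.25) pulls ahead of B and C (0.75 each), which is the kind of reordering the abstract describes for papers with many authors.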
{"title":"Lone Geniuses or One among Many? An Explorative Study of Contemporary Highly Cited Researchers","authors":"D. Aksnes, K. Aagaard","doi":"10.2478/jdis-2021-0019","DOIUrl":"https://doi.org/10.2478/jdis-2021-0019","PeriodicalId":92237,"journal":{"name":"Journal of data and information science (Warsaw, Poland)","volume":"6 1","pages":"41 - 66"},"publicationDate":"2021-03-08","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"44706468"}
Abstract Purpose Although gender identities influence how people present themselves on social media, previous studies have tested pre-specified dimensions of difference, potentially overlooking other differences and ignoring nonbinary users. Design/methodology/approach Word association thematic analysis was used to systematically check for fine-grained statistically significant gender differences in Twitter profile descriptions between 409,487 UK-based female, male, and nonbinary users in 2020. A series of statistical tests systematically identified 1,474 differences at the individual word level, and a follow-up thematic analysis grouped these words into themes. Findings The results reflect offline variations in interests and in jobs. They also show differences in personal disclosures, as reflected by words, with females mentioning qualifications, relationships, pets, and illnesses much more, nonbinaries discussing sexuality more, and males declaring political and sports affiliations more. Other themes were internally imbalanced, including personal appearance (e.g. male: beardy; female: redhead), self-evaluations (e.g. male: legend; nonbinary: witch; female: feisty), and gender identity (e.g. male: dude; nonbinary: enby; female: queen). Research limitations The methods are affected by linguistic styles and probably under-report nonbinary differences. Practical implications The gender differences found may inform gender theory, and aid social web communicators and marketers. Originality/value The results show a much wider range of gender expression differences than previously acknowledged for any social media site.
{"title":"Male, Female, and Nonbinary Differences in UK Twitter Self-descriptions: A Fine-grained Systematic Exploration","authors":"M. Thelwall, Saheeda Thelwall, Ruth Fairclough","doi":"10.2478/jdis-2021-0018","DOIUrl":"https://doi.org/10.2478/jdis-2021-0018","PeriodicalId":92237,"journal":{"name":"Journal of data and information science (Warsaw, Poland)","volume":"6 1","pages":"1 - 27"},"publicationDate":"2021-03-08","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"47203787"}
Abstract Purpose This paper examines the factors behind payment decisions, as well as the role each factor plays in the causal configurations leading to high payment intention under systematic and heuristic information processing routes. Design/methodology/approach Based on the heuristic-systematic model (HSM), we propose a configurational analytic framework to investigate complex causal relationships between influencing factors and payment decisions. In line with this approach, we use fuzzy-set qualitative comparative analysis (fsQCA) to analyze data crawled from Zhihu.com. Findings The number of previous consultations is a necessary element in all five equivalent configurations that lead to high payment intention. The heuristic processing route plays a core role, while the systematic processing route plays a peripheral role, in the payment decision-making process. Research limitations The research is limited in that the moderating effect of professional fields has not been considered in the framework. Practical implications The configurations in the results can assist managers of knowledge communities and paid Q&A service providers in managing information elements to motivate more payment decisions. Originality/value This paper is one of the few studies to apply HSM theory and the fsQCA method to payment decisions in paid Q&A.
{"title":"A Causal Configuration Analysis of Payment Decision Drivers in Paid Q&A","authors":"Wenyu Chen, Yan Cheng, Jia Li","doi":"10.2478/jdis-2021-0017","DOIUrl":"https://doi.org/10.2478/jdis-2021-0017","PeriodicalId":92237,"journal":{"name":"Journal of data and information science (Warsaw, Poland)","volume":"6 1","pages":"139 - 162"},"publicationDate":"2021-03-08","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"41557970"}
Abstract Purpose This study attempts to disclose the characteristics of knowledge integration in an interdisciplinary field by looking into the content aspect of knowledge. Design/methodology/approach The eHealth field was chosen for the case study. Associated knowledge phrases (AKPs) that are shared between citing papers and their references were extracted from the citation contexts of the eHealth papers by applying a stem-matching method. A classification schema that considers the functions of knowledge in the domain was proposed to categorize the identified AKPs. The source disciplines of each knowledge type were analyzed. Quantitative indicators and a co-occurrence analysis were applied to disclose the integration patterns of different knowledge types. Findings The annotated AKPs provide evidence of the major disciplines supplying each type of knowledge. Different knowledge types have remarkably different integration patterns in terms of knowledge amount, the breadth of source disciplines, and the integration time lag. We also find several frequent co-occurrence patterns of different knowledge types. Research limitations The collected articles of the field are limited to the two leading open access journals. The stem-matching method used to extract AKPs cannot identify phrases that have the same meaning but are expressed in words with different stems. The Research Subject type dominates the recognized AKPs, which calls for an improved classification schema to support better analysis of knowledge integration at the level of knowledge units. Practical implications The methodology proposed in this paper sheds new light on the knowledge integration characteristics of an interdisciplinary field from the content perspective. The findings have practical implications for the future development of research strategies in eHealth and for policies on interdisciplinary research. 
Originality/value This study proposed a new methodology to explore the content characteristics of knowledge integration in an interdisciplinary field.
{"title":"Content Characteristics of Knowledge Integration in the eHealth Field: An Analysis Based on Citation Contexts","authors":"Shiyun Wang, Jin Mao, Jing Tang, Yujie Cao","doi":"10.2478/jdis-2021-0015","DOIUrl":"https://doi.org/10.2478/jdis-2021-0015","journal":{"name":"Journal of data and information science (Warsaw, Poland)","volume":"6 1","pages":"58 - 74"},"publicationDate":"2021-03-02","publicationTypes":"Journal Article"}
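The stem-matching step mentioned in the eHealth abstract can be illustrated with a small sketch. This is not the authors' implementation: the suffix list, the `toy_stem` function, the n-gram limit, and the two example sentences are all assumptions made here for illustration, and a real pipeline would more likely use an established stemmer such as Porter's.

```python
import re

# Hypothetical suffix list for a toy stemmer (a stand-in for a real stemmer).
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "es", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemmed_ngrams(text, n_max=3):
    """Return all stemmed word n-grams (as tuples) up to length n_max."""
    stems = [toy_stem(token) for token in re.findall(r"[a-z]+", text.lower())]
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(stems) - n + 1):
            grams.add(tuple(stems[i:i + n]))
    return grams


def shared_phrases(citation_context, reference_text, n_max=3, min_len=2):
    """Stemmed phrases that occur in both the citation context and the reference."""
    shared = stemmed_ngrams(citation_context, n_max) & stemmed_ngrams(reference_text, n_max)
    return {gram for gram in shared if len(gram) >= min_len}


# Invented example texts, not taken from the study's data.
context = "Mobile health interventions improved medication adherence."
reference = "A mobile health intervention for improving adherence to medications."
for phrase in sorted(shared_phrases(context, reference)):
    print(" ".join(phrase))
```

The sketch also makes the limitation stated in the abstract concrete: phrases such as "telehealth" and "telemedicine" would not be matched, because their stems differ even though their meanings are close.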
Liangping Ding, Zhixiong Zhang, Huan Liu, Jie Li, Gaihong Yu
Abstract Purpose Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we aim to combine the benefits of a sequence labeling formulation and a pretrained language model to propose an automatic keyphrase extraction model for Chinese scientific research. Design/methodology/approach We regard AKE from Chinese text as a character-level sequence labeling task to avoid the segmentation errors of Chinese tokenizers, and we initialize our model with the pretrained language model BERT, which was released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, which contains 100,000 abstracts as the training set, 6,000 abstracts as the development set, and 3,094 abstracts as the test set. As baselines, we use unsupervised keyphrase extraction methods, including term frequency (TF), TF-IDF, and TextRank, and supervised machine learning methods, including Conditional Random Field (CRF), Bidirectional Long Short-Term Memory Network (BiLSTM), and BiLSTM-CRF. Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models. Findings Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1 score of 59.80%, an absolute improvement of 9.64 percentage points. Research limitations We consider only the automatic keyphrase extraction task rather than the keyphrase generation task, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases.
Practical implications We make our character-level IOB format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefit of the research community at: https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction. Originality/value By designing comparative experiments, our study demonstrates that the character-level formulation is more suitable for the Chinese automatic keyphrase extraction task under the general trend of pretrained language models. Our proposed dataset also provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.
{"title":"Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling","authors":"Liangping Ding, Zhixiong Zhang, Huan Liu, Jie Li, Gaihong Yu","doi":"10.2478/jdis-2021-0013","DOIUrl":"https://doi.org/10.2478/jdis-2021-0013","journal":{"name":"Journal of data and information science (Warsaw, Poland)","volume":"6 1","pages":"35 - 57"},"publicationDate":"2021-03-02","publicationTypes":"Journal Article"}
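To make the character-level IOB formulation in the keyphrase extraction abstract more tangible, the sketch below tags each character of a Chinese string as B (beginning of a keyphrase), I (inside a keyphrase), or O (outside). It is an illustration only: the function names, the tagging rules for overlapping matches, and the example sentence are assumptions made here, not the layout or preprocessing of the CAKE dataset.

```python
def char_iob_tags(text, keyphrases):
    """Assign one B/I/O tag per character; longer keyphrases are tagged first.

    Overlapping or nested matches are skipped, which mirrors the abstract's
    remark that nested keyphrases are not handled.
    """
    tags = ["O"] * len(text)
    for phrase in sorted(keyphrases, key=len, reverse=True):
        if not phrase:
            continue
        start = text.find(phrase)
        while start != -1:
            end = start + len(phrase)
            # Only tag spans that are still completely untagged.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B"
                tags[start + 1:end] = ["I"] * (len(phrase) - 1)
            start = text.find(phrase, start + 1)
    return tags


def decode_keyphrases(text, tags):
    """Recover keyphrase strings from a character-level tag sequence."""
    phrases, current = [], ""
    for char, tag in zip(text, tags):
        if tag == "B":
            if current:
                phrases.append(current)
            current = char
        elif tag == "I" and current:
            current += char
        else:
            if current:
                phrases.append(current)
            current = ""
    if current:
        phrases.append(current)
    return phrases


# Invented example sentence and keyphrases, for illustration only.
sentence = "糖尿病患者的血糖控制"
tags = char_iob_tags(sentence, ["糖尿病", "血糖控制"])
print(list(zip(sentence, tags)))
print(decode_keyphrases(sentence, tags))
```

Because every character receives its own tag, no word segmenter is needed before labeling, which is the motivation the abstract gives for the character-level formulation.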
Abstract Purpose This article aims to determine the percentage of “Sparking” articles among the work of the 2020 Nobel Prize winners in medicine, physics, and chemistry. Design/methodology/approach We focus on under-cited influential research among the key publications mentioned by the Nobel Prize Committee for the 2020 Nobel Prize laureates. Specifically, we extracted data from the Web of Science and calculated the Sparking Indices using the formulas proposed by Hu and Rousseau in 2016 and 2017. In addition, we identified another type of article, the “igniting” article, based on the notion introduced in 2017. Findings In medicine and physics, articles with sparking characteristics account for 78.571% and 68.75% of the key publications, respectively, whereas in chemistry 90% of the articles are characterized as “igniting”. Moreover, the two types of articles together account for more than 93% of the Nobel Prize work included in this study. Research limitations Our research did not cover the impact of topic, socio-political factors, and author reputation on the Sparking Indices. Practical implications Our study shows that the Sparking Indices truly reflect the influence of the best research work, so they can be used to detect under-cited influential articles as well as to identify fundamental work. Originality/value Our findings suggest that the Sparking Indices have good applicability for research evaluation.
{"title":"“Sparking” and “Igniting” Key Publications of 2020 Nobel Prize Laureates","authors":"Fangjie Xi, R. Rousseau, Xiaojun Hu","doi":"10.2478/jdis-2021-0016","DOIUrl":"https://doi.org/10.2478/jdis-2021-0016","journal":{"name":"Journal of data and information science (Warsaw, Poland)","volume":"6 1","pages":"28 - 40"},"publicationDate":"2021-03-02","publicationTypes":"Journal Article"}
Abstract Purpose Interdisciplinarity is a hot topic in science and technology policy. However, the concept of interdisciplinarity is both abstract and complex, and therefore difficult to measure using a single indicator. A variety of metrics for measuring the diversity and interdisciplinarity of articles, journals, and fields have been proposed in the literature. In this article, we ask whether institutions can be ranked in terms of their (inter-)disciplinary diversity. Design/methodology/approach We developed a software application (interd_vb.exe) that outputs the values of relevant diversity indicators for any document set or network structure. The software is made available, free to the public, online. The indicators it considers include the advanced diversity indicators Rao-Stirling (RS) diversity and DIV*, as well as standard measures of diversity, such as the Gini coefficient, Shannon entropy, and the Simpson Index. As an empirical demonstration of how the application works, we compared the research portfolios of 42 “Double First-Class” Chinese universities across Web of Science Subject Categories (WCs). Findings The empirical results suggest that DIV* provides results that are more in line with one's intuitive impressions than RS, particularly when the results are based on sample-dependent disparity measures. Furthermore, the scores for diversity are more consistent when based on a global disparity matrix than on a local map. Research limitations “Interdisciplinarity” can be operationalized as bibliographic coupling among (sets of) documents with references to disciplines. At the institutional level, however, diversity may also indicate comprehensiveness. Unlike impact (e.g. citation), diversity and interdisciplinarity are context-specific and therefore provide a second dimension to the evaluation. Policy or practical implications Operationalization and quantification make it necessary for analysts to make their choices and options clear. 
Although the equations used to calculate diversity are often mathematically transparent, specifying them in computer code helps the analyst to make more precise decisions. Although diversity is not necessarily a goal of universities, a high diversity score may inform potential policies concerning interdisciplinarity at the university level. Originality/value This article introduces a non-commercial online application to the public domain that allows researchers and policy analysts to measure “diversity” and “interdisciplinarity” using a set of indicators that is as encompassing as possible, for any document set or network structure (e.g. a network of co-authors). Insofar as we know, such a professional computing tool for evaluating data sets using diversity indicators has not yet been made available online.
{"title":"The Scientometric Measurement of Interdisciplinarity and Diversity in the Research Portfolios of Chinese Universities","authors":"Lin Zhang, L. Leydesdorff","doi":"10.2139/ssrn.3798519","DOIUrl":"https://doi.org/10.2139/ssrn.3798519","journal":{"name":"Journal of data and information science (Warsaw, Poland)","volume":"6 1","pages":"13 - 35"},"publicationDate":"2021-02-25","publicationTypes":"Journal Article"}
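Several of the indicators named in the interdisciplinarity abstract have compact textbook definitions, and the sketch below computes four of them for a portfolio of publication counts per category. It is not the interd_vb.exe application and makes no claim about how that tool is implemented. The function names and the example counts are assumptions, Rao-Stirling is summed here over all ordered pairs i ≠ j (some authors sum over unordered pairs, which halves the value), and DIV* is deliberately left out because its definition is not given in the abstract.

```python
import math


def proportions(counts):
    """Convert raw counts to proportions, dropping empty categories."""
    total = sum(counts)
    return [count / total for count in counts if count > 0]


def shannon_entropy(counts):
    """Shannon entropy H = -sum(p_i * ln p_i), in nats."""
    return -sum(p * math.log(p) for p in proportions(counts))


def simpson_index(counts):
    """Simpson index sum(p_i^2); 1 minus this value is the Gini-Simpson diversity."""
    return sum(p * p for p in proportions(counts))


def gini_coefficient(counts):
    """Gini coefficient of the counts (0 = perfectly even distribution)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    weighted = sum((2 * rank - n - 1) * value for rank, value in enumerate(values, start=1))
    return weighted / (n * total)


def rao_stirling(counts, disparity):
    """Rao-Stirling diversity: sum over ordered pairs i != j of p_i * p_j * d_ij."""
    total = sum(counts)
    p = [count / total for count in counts]
    size = len(p)
    return sum(
        p[i] * p[j] * disparity[i][j]
        for i in range(size)
        for j in range(size)
        if i != j
    )


# Invented portfolio: publication counts in three subject categories.
portfolio = [50, 30, 20]
# Invented disparity matrix (d_ij = 1 - similarity); symmetric with a zero diagonal.
disparity = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.7],
    [0.9, 0.7, 0.0],
]
print(shannon_entropy(portfolio))
print(simpson_index(portfolio))
print(gini_coefficient(portfolio))
print(rao_stirling(portfolio, disparity))
```

The disparity matrix is where the abstract's distinction between a global matrix and a local, sample-dependent map enters: the same counts yield different Rao-Stirling values depending on which matrix is supplied.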