Jan Vogt, Thilo Voigt, Annika Nowak, Jan M. Pawlowski
{"title":"Development of a Job Advertisement Analysis for Assessing Data Science Competencies","authors":"Jan Vogt, Thilo Voigt, Annika Nowak, Jan M. Pawlowski","doi":"10.5334/dsj-2023-033","DOIUrl":"https://doi.org/10.5334/dsj-2023-033","url":null,"abstract":"","PeriodicalId":35375,"journal":{"name":"Data Science Journal","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71068443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In many studies, we want to determine the influence of certain features on a dependent variable. More specifically, we are interested in the strength of the influence (i.e., is the feature relevant?) and, if so, in how the feature influences the dependent variable. Recently, data-driven approaches such as random forest regression have found their way into applications (Boulesteix et al. 2012). These models allow researchers to directly derive measures of feature importance, which are a natural indicator of the strength of the influence. For the relevant features, the correlation or rank correlation between the feature and the dependent variable has typically been used to determine the nature of the influence. More recent methods, some of which can also measure interactions between features, are based on a modeling approach. In particular, when machine learning models are used, SHAP scores are a recent and prominent method to determine these trends (Lundberg et al. 2017). In this paper, we introduce a novel notion of feature importance based on the well-studied Gram-Schmidt decorrelation method. Furthermore, we propose two estimators for identifying trends in the data using random forest regression, the so-called absolute and relative traversal rates. We empirically compare the properties of our estimators with those of well-established estimators on a variety of synthetic and real-world datasets.
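The abstract does not define the proposed importance measure, so the following is only a minimal sketch of the underlying building block, classical Gram-Schmidt decorrelation of centered feature columns. The function names and the synthetic data are hypothetical and are not taken from the paper.

```python
import random

def center(col):
    """Subtract the mean so that orthogonality is equivalent to zero correlation."""
    m = sum(col) / len(col)
    return [v - m for v in col]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt_decorrelate(columns):
    """Center each feature column, then remove its components along all earlier columns."""
    result = []
    for col in columns:
        q = center(col)
        for prev in result:
            denom = dot(prev, prev)
            if denom > 1e-12:  # skip columns that are (numerically) zero
                coef = dot(prev, q) / denom
                q = [v - coef * p for v, p in zip(q, prev)]
        result.append(q)
    return result

def correlation(a, b):
    """Pearson correlation of two columns."""
    a, b = center(a), center(b)
    return dot(a, b) / (dot(a, a) * dot(b, b)) ** 0.5

# Hypothetical data: the second feature is strongly correlated with the first.
random.seed(0)
x1 = [random.gauss(0, 1) for _ in range(500)]
x2 = [0.8 * a + 0.2 * random.gauss(0, 1) for a in x1]

q1, q2 = gram_schmidt_decorrelate([x1, x2])
corr_before = correlation(x1, x2)
corr_after = correlation(q1, q2)
print(f"correlation before: {corr_before:.3f}, after: {corr_after:.3e}")
```

After decorrelation, the second column keeps only the part of `x2` that cannot be explained linearly by `x1`. Note that the result depends on the order in which the columns are processed.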
{"title":"A Notion of Feature Importance by Decorrelation and Detection of Trends by Random Forest Regression","authors":"Yannick Gerstorfer, Max Hahn-Klimroth, Lena Krieg","doi":"10.5334/dsj-2023-042","DOIUrl":"https://doi.org/10.5334/dsj-2023-042","url":null,"abstract":"In many studies, we want to determine the influence of certain features on a dependent variable. More specifically, we are interested in the strength of the influence – i.e., is the feature relevant? And, if so, how the feature influences the dependent variable. Recently, data-driven approaches such as random forest regression have found their way into applications (Boulesteix et al. 2012). These models allow researchers to directly derive measures of feature importance, which are a natural indicator of the strength of the influence. For the relevant features, the correlation or rank correlation between the feature and the dependent variable has typically been used to determine the nature of the influence. More recent methods, some of which can also measure interactions between features, are based on a modeling approach. In particular, when machine learning models are used, SHAP scores are a recent and prominent method to determine these trends (Lundberg et al. 2017). In this paper, we introduce a novel notion of feature importance based on the well-studied Gram-Schmidt decorrelation method. Furthermore, we propose two estimators for identifying trends in the data using random forest regression, the so-called absolute and relative traversal rate. We empirically compare the properties of our estimators with those of well-established estimators on a variety of synthetic and real-world datasets.","PeriodicalId":35375,"journal":{"name":"Data Science Journal","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135784269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ontology-Driven Semantic Enrichment Framework for Open Data Value Creation","authors":"Oarabile Sebubi, Irina Zlotnikova, Hlomani Hlomani","doi":"10.5334/dsj-2023-040","DOIUrl":"https://doi.org/10.5334/dsj-2023-040","url":null,"abstract":"","PeriodicalId":35375,"journal":{"name":"Data Science Journal","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134883791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tomasz Miksa, Paul Walk, Peter Neish, Simon Oblasser, Hollydawn Murray, Tom Renner, Marie-Christine Jacquemot-Perbal, João Cardoso, Trond Kvamme, Maria Praetzellis, Marek Suchánek, Rob W.W. Hooft, Benjamin Faure, Hanne Moa, Adil Hasan, Sarah Jones
This paper presents the application profile for machine-actionable data management plans that allows information from traditional data management plans to be expressed in a machine-actionable way. We describe the methodology and research conducted to define the application profile. We also discuss design decisions made during its development and present systems which have adopted it. The application profile was developed in an open and consensus-driven manner within the DMP Common Standards Working Group of the Research Data Alliance and is its official recommendation.
{"title":"Application Profile for Machine-Actionable Data Management Plans","authors":"Tomasz Miksa, P. Walk, Peter Neish, Simon Oblasser, Hollydawn Murray, Tom Renner, Marie-Christine Jacquemot-Perbal, João Cardoso, T. Kvamme, M. Praetzellis, M. Suchánek, Rob W.W. Hooft, Benjamin Faure, H. Moa, A. Hasan, Sarah Jones","doi":"10.5334/dsj-2021-032","DOIUrl":"https://doi.org/10.5334/dsj-2021-032","url":null,"abstract":"This paper presents the application profile for machine-actionable data management plans that allows information from traditional data management plans to be expressed in a machine-actionable way. We describe the methodology and research conducted to define the application profile. We also discuss design decisions made during its development and present systems which have adopted it. The application profile was developed in an open and consensus-driven manner within the DMP Common Standards Working Group of the Research Data Alliance and is its official recommendation. TOMASZ MIKSA PAUL WALK PETER NEISH SIMON OBLASSER HOLLYDAWN MURRAY TOM RENNER MARIE-CHRISTINE JACQUEMOT-PERBAL JOÃO CARDOSO TROND KVAMME MARIA PRAETZELLIS MAREK SUCHÁNEK ROB HOOFT BENJAMIN FAURE HANNE MOA ADIL HASAN SARAH JONES","PeriodicalId":35375,"journal":{"name":"Data Science Journal","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49529013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Do I-PASS for FAIR? Measuring the FAIR-ness of Research Organizations","authors":"J. Ringersma, M. Miedema","doi":"10.5334/dsj-2021-030","DOIUrl":"https://doi.org/10.5334/dsj-2021-030","url":null,"abstract":"","PeriodicalId":35375,"journal":{"name":"Data Science Journal","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2021-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43886427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Open access, free access, and the public domain are different concepts. The International Nucleotide Sequence Database Collaboration (INSDC) permanently guarantees free and unrestricted access to nucleotide sequence data for all researchers, irrespective of nationality or affiliation. However, recent virus information is primarily distributed via the restricted-access repository known as the Global Initiative on Sharing Avian Flu Data (GISAID) supported by the World Health Organization. As compensation for the restriction, GISAID needs to meet its initial goal of benefit-sharing among countries and to curb ongoing vaccine diplomacy campaigns.
{"title":"Open Access and Data Sharing of Nucleotide Sequence Data","authors":"Masanori Arita","doi":"10.5334/dsj-2021-028","DOIUrl":"https://doi.org/10.5334/dsj-2021-028","url":null,"abstract":"Open access, free access, and the public domain are different concepts. The International Nucleotide Sequence Database Collaboration (INSDC) permanently guarantees free and unrestricted access to nucleotide sequence data for all researchers, irrespective of nationality or affiliation. However, recent virus information is primarily distributed via the restricted-access repository known as the Global Initiative on Sharing Avian Flu Data (GISAID) supported by the World Health Organization. As compensation for the restriction, GISAID needs to meet its initial goal of benefit-sharing among countries and to curb ongoing vaccine diplomacy campaigns.","PeriodicalId":35375,"journal":{"name":"Data Science Journal","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2021-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47342634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. S. Hansen, Signe Gadegaard, Karsten Kryger Hansen, Asger Væring Larsen, S. Møller, Gertrud Stougård Thomsen, Katrine Flindt Holmstrand
Citizen science (CS) projects are part of a new era of data aggregation and harmonisation that facilitates interconnections between different datasets. Increasing the value and reuse of CS data has received growing attention with the appearance of the FAIR principles and systematic research data management (RDM) practices, which are often promoted by university libraries. However, RDM initiatives in CS appear diversified, and whether CS has special needs in terms of RDM is unclear. Therefore, the aim of this article is firstly to identify RDM challenges for CS projects and secondly to discuss how university libraries may help address such challenges. A scoping review and a case study of Danish CS projects were performed to identify RDM challenges. Forty-eight articles were selected for data extraction. Four academic project leaders were interviewed about RDM practices in their CS projects. Challenges and recommendations identified in the review and case study are often not specific to CS. However, finding CS data, engaging specific populations, attributing volunteers and handling sensitive data, including health data, are some of the challenges requiring special attention by CS project managers. Scientific requirements or national practices do not always encompass the nature of CS projects. Based on the identified challenges, it is recommended that university libraries focus their services on 1) identifying legal and ethical issues that the project managers should be aware of in their projects, 2) elaborating these issues in a Terms of Participation that also specifies, for the citizen scientist, how data are handled and shared, and 3) motivating the project manager to adopt good data handling practices. Adhering to the FAIR principles and good RDM practices in CS projects will continuously secure contextualisation and data quality. High data quality increases the value and reuse of the data and, therefore, the empowerment of the citizen scientists.
{"title":"Research Data Management Challenges in Citizen Science Projects and Recommendations for Library Support Services. A Scoping Review and Case Study","authors":"J. S. Hansen, Signe Gadegaard, Karsten Kryger Hansen, Asger Væring Larsen, S. Møller, Gertrud Stougård Thomsen, Katrine Flindt Holmstrand","doi":"10.5334/dsj-2021-025","DOIUrl":"https://doi.org/10.5334/dsj-2021-025","url":null,"abstract":"Citizen science (CS) projects are part of a new era of data aggregation and harmonisation that facilitates interconnections between different datasets. Increasing the value and reuse of CS data has received growing attention with the appearance of the FAIR principles and systematic research data management (RDM) practises, which are often promoted by university libraries. However, RDM initiatives in CS appear diversified and if CS have special needs in terms of RDM is unclear. Therefore, the aim of this article is firstly to identify RDM challenges for CS projects and secondly, to discuss how university libraries may support any such challenges. A scoping review and a case study of Danish CS projects were performed to identify RDM challenges. 48 articles were selected for data extraction. Four academic project leaders were interviewed about RDM practices in their CS projects. Challenges and recommendations identified in the review and case study are often not specific for CS. However, finding CS data, engaging specific populations, attributing volunteers and handling sensitive data including health data are some of the challenges requiring special attention by CS project managers. Scientific requirements or national practices do not always encompass the nature of CS projects. Based on the identified challenges, it is recommended that university libraries focus their services on 1) identifying legal and ethical issues that the project managers should be aware of in their projects, 2) elaborating these issues in a Terms of Participation that also specifies data handling and sharing to the citizen scientist, and 3) motivating the project manager to good data handling practises. Adhering to the FAIR principles and good RDM practices in CS projects will continuously secure contextualisation and data quality. High data quality increases the value and reuse of the data and, therefore, the empowerment of the citizen scientists.","PeriodicalId":35375,"journal":{"name":"Data Science Journal","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2021-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41536545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Principal Component Analysis (PCA) is a commonly used technique that uses the correlation structure of the original variables to reduce the dimensionality of the data. This reduction is achieved by considering only the first few principal components for a subsequent analysis. The usual inclusion criterion is that the proportion of the total variance explained by the retained principal components exceeds a predetermined threshold. We show that in certain classification problems, even an extremely high inclusion threshold can negatively impact the classification accuracy. The omission of small-variance principal components can severely diminish the performance of the models. We noticed this phenomenon in classification analyses using high-dimensional ECG data, where the most common classification methods lost between 1 and 6% of accuracy even when using a 99% inclusion threshold. However, this issue can occur even in low-dimensional data with a simple correlation structure, as our numerical example shows. We conclude that the exclusion of any principal components should be carefully investigated.
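The effect described above can be reproduced on a small synthetic example. The sketch below is not the authors' numerical example or their ECG analysis. It assumes hypothetical two-dimensional data in which the class label lives entirely in the low-variance direction, and it uses a nearest-centroid classifier on variance-scaled PCA scores as a stand-in for the classifiers in the paper.

```python
import math
import random

def make_data(n, rng):
    """Two classes that differ only along a low-variance direction."""
    cos_r, sin_r = math.cos(math.pi / 6), math.sin(math.pi / 6)
    points, labels = [], []
    for _ in range(n):
        label = rng.choice([-1, 1])
        u = rng.gauss(0, 10)                  # high variance, no class information
        v = 0.3 * label + rng.gauss(0, 0.05)  # low variance, carries the class
        # Rotate by 30 degrees so that the observed features are correlated.
        points.append((cos_r * u - sin_r * v, sin_r * u + cos_r * v))
        labels.append(label)
    return points, labels

def fit_pca_2d(points):
    """Return the mean, the two principal directions and their variances."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the major axis
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))
    root = math.hypot((sxx - syy) / 2, sxy)
    variances = ((sxx + syy) / 2 + root, (sxx + syy) / 2 - root)
    return (mx, my), (pc1, pc2), variances

def project(points, mean, directions, variances):
    """Project centered points onto the directions and scale scores to unit variance."""
    return [
        tuple(((p[0] - mean[0]) * d[0] + (p[1] - mean[1]) * d[1]) / math.sqrt(var)
              for d, var in zip(directions, variances))
        for p in points
    ]

def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit class centroids on the training scores and report test accuracy."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
    correct = 0
    for x, y in zip(test_x, test_y):
        pred = min(centroids,
                   key=lambda lab: sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[lab])))
        correct += pred == y
    return correct / len(test_y)

rng = random.Random(0)
train_pts, train_y = make_data(400, rng)
test_pts, test_y = make_data(400, rng)

mean, (pc1, pc2), variances = fit_pca_2d(train_pts)
explained_by_pc1 = variances[0] / sum(variances)

acc_pc1_only = nearest_centroid_accuracy(
    project(train_pts, mean, [pc1], variances[:1]), train_y,
    project(test_pts, mean, [pc1], variances[:1]), test_y)
acc_both = nearest_centroid_accuracy(
    project(train_pts, mean, [pc1, pc2], variances), train_y,
    project(test_pts, mean, [pc1, pc2], variances), test_y)

print(f"variance explained by PC1: {explained_by_pc1:.4f}")
print(f"accuracy with PC1 only: {acc_pc1_only:.3f}, with both components: {acc_both:.3f}")
```

In this construction the first component alone passes a 99% variance threshold, yet it carries almost no class information, so discarding the second component reduces the classifier to near chance level.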
{"title":"On the Application of Principal Component Analysis to Classification Problems","authors":"Jianwei Zheng, C. Rakovski","doi":"10.5334/dsj-2021-026","DOIUrl":"https://doi.org/10.5334/dsj-2021-026","url":null,"abstract":"Principal Component Analysis (PCA) is a commonly used technique that uses the correlation structure of the original variables to reduce the dimensionality of the data. This reduction is achieved by considering only the first few principal components for a subsequent analysis. The usual inclusion criterion is defined by the proportion of the total variance of the principal components exceeding a predetermined threshold. We show that in certain classification problems, even extremely high inclusion threshold can negatively impact the classification accuracy. The omission of small variance principal components can severely diminish the performance of the models. We noticed this phenomenon in classification analyses using high dimension ECG data where the most common classification methods lost between 1 and 6% of accuracy even when using 99% inclusion threshold. However, this issue can even occur in low dimension data with simple correlation structure as our numerical example shows. We conclude that the exclusion of any principal components should be carefully investigated.","PeriodicalId":35375,"journal":{"name":"Data Science Journal","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2021-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48310066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tsaone Swaabow Thapelo, M. Namoshe, O. Matsebe, T. Motshegwa, Mary-Jane M. Bopape
The Southern African Science Service Centre for Climate and Land Management (SASSCAL) was initiated to support regional weather monitoring and climate research in Southern Africa. As a result, several Automatic Weather Stations (AWSs) were implemented to provide numerical weather data within the collaborating countries. However, access to the SASSCAL weather data is limited to a number of records that are retrieved via a series of clicks. Currently, end users cannot efficiently extract the desired weather values, so the data is not fully utilised. This work contributes an open-source Web Scraping Application Programming Interface (WebSAPI), delivered through an interactive dashboard. The objective is to extend the functionality of the SASSCAL Weathernet to data extraction, statistical data analysis and visualisation. The SASSCAL WebSAPI was developed using the R statistical environment. It deploys web scraping and data wrangling techniques to support access to SASSCAL weather data. This WebSAPI reduces the risk of human error and the effort researchers spend generating the desired data sets. The proposed framework for the SASSCAL WebSAPI can be modified for other weather data banks while taking into consideration the legality and ethics of the toolkit.
{"title":"SASSCAL WebSAPI: A Web Scraping Application Programming Interface to Support Access to SASSCAL’s Weather Data","authors":"Tsaone Swaabow Thapelo, M. Namoshe, O. Matsebe, T. Motshegwa, Mary-Jane M. Bopape","doi":"10.5334/dsj-2021-024","DOIUrl":"https://doi.org/10.5334/dsj-2021-024","url":null,"abstract":"The Southern African Science Service Centre for Climate and Land Management (SASSCAL) was initiated to support regional weather monitoring and climate research in Southern Africa. As a result, several Automatic Weather Stations (AWSs) were implemented to provide numerical weather data within the collaborating countries. Meanwhile, access to the SASSCAL weather data is limited to a number of records that are achieved via a series of clicks. Currently, end users can not efficaciously extract the desired weather values. Thus, the data is not fully utilised by end users. This work contributes with an open source Web Scraping Application Programming Interface (WebSAPI) through an interactive dashboard. The objective is to extend functionalities of the SASSCAL Weathernet for: data extraction, statistical data analysis and visualisation. The SASSCAL WebSAPI was developed using the R statistical environment. It deploys web scraping and data wrangling techniques to support access to SASSCAL weather data. This WebSAPI reduces the risk of human error, and the researcher’s effort of generating desired data sets. The proposed framework for the SASSCAL WebSAPI can be modified for other weather data banks while taking into consideration the legality and ethics of the toolkit.","PeriodicalId":35375,"journal":{"name":"Data Science Journal","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2021-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42327269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}