Data Science Journal: Latest Articles

Development of a Job Advertisement Analysis for Assessing Data Science Competencies
Q2 Computer Science Pub Date : 2023-01-01 DOI: 10.5334/dsj-2023-033
Jan Vogt, Thilo Voigt, Annika Nowak, Jan M. Pawlowski
Citations: 0
A Notion of Feature Importance by Decorrelation and Detection of Trends by Random Forest Regression
Q2 Computer Science Pub Date : 2023-01-01 DOI: 10.5334/dsj-2023-042
Yannick Gerstorfer, Max Hahn-Klimroth, Lena Krieg
In many studies, we want to determine the influence of certain features on a dependent variable. More specifically, we are interested in the strength of the influence – i.e., is the feature relevant? And, if so, how the feature influences the dependent variable. Recently, data-driven approaches such as random forest regression have found their way into applications (Boulesteix et al. 2012). These models allow researchers to directly derive measures of feature importance, which are a natural indicator of the strength of the influence. For the relevant features, the correlation or rank correlation between the feature and the dependent variable has typically been used to determine the nature of the influence. More recent methods, some of which can also measure interactions between features, are based on a modeling approach. In particular, when machine learning models are used, SHAP scores are a recent and prominent method to determine these trends (Lundberg et al. 2017). In this paper, we introduce a novel notion of feature importance based on the well-studied Gram-Schmidt decorrelation method. Furthermore, we propose two estimators for identifying trends in the data using random forest regression, the so-called absolute and relative traversal rate. We empirically compare the properties of our estimators with those of well-established estimators on a variety of synthetic and real-world datasets.
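The abstract names Gram-Schmidt decorrelation as the basis of its importance notion but does not spell out the procedure. The following is only a minimal sketch of the decorrelation step itself, under stated assumptions: the function name `gram_schmidt_decorrelate`, the synthetic data, and the choice to center columns before orthogonalizing are illustrative and are not taken from the paper.

```python
import numpy as np

def gram_schmidt_decorrelate(X):
    """Center the columns of X, then orthogonalize them one after another.

    Hypothetical helper for illustration; not the authors' estimator.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)           # centered columns: orthogonal <=> uncorrelated
    Q = np.zeros_like(Xc)
    for j in range(Xc.shape[1]):
        v = Xc[:, j].copy()
        for k in range(j):
            denom = Q[:, k] @ Q[:, k]
            if denom > 1e-12:         # skip (near-)zero directions
                v -= (Q[:, k] @ v) / denom * Q[:, k]
        Q[:, j] = v                   # part of feature j not explained by features 0..j-1
    return Q

# Synthetic example: the second feature is mostly a copy of the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=500)
X = np.column_stack([x1, x2])

Q = gram_schmidt_decorrelate(X)
corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(Q, rowvar=False)[0, 1]
print(f"correlation before: {corr_before:.3f}, after: {corr_after:.2e}")
```

Note that the result depends on column order: each feature keeps only the part not already explained by the features before it, so any importance measure built on such columns would need to account for that ordering.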
Citations: 1
Ontology-Driven Semantic Enrichment Framework for Open Data Value Creation
Q2 Computer Science Pub Date : 2023-01-01 DOI: 10.5334/dsj-2023-040
Oarabile Sebubi, Irina Zlotnikova, Hlomani Hlomani
Citations: 0
A Framework for Data-Driven Solutions with COVID-19 Illustrations
Q2 Computer Science Pub Date : 2021-11-18 DOI: 10.5334/dsj-2021-036
Kassim S. Mwitondi, Raed A. Said
Data-driven solutions have long been keenly sought after as tools for driving the world's fast-changing business environment, with business leaders seeking to enhance decision-making processes within their organisations. In the current era of Big Data, applications of data tools in addressing global, regional and national challenges have steadily grown in almost all fields across the globe. However, working in silos has continued to impede research progress, creating knowledge gaps and challenges across geographical borders, legislations, sectors and fields. There are many examples of the challenges the world faces in tackling global issues, including the complex interactions of the 17 Sustainable Development Goals (SDGs) and the spatio-temporal variations of the impact of the ongoing COVID-19 pandemic. Both challenges can be seen as non-orthogonal, strongly correlated and requiring an interdisciplinary approach to address. We present a generic framework for filling such gaps, based on two data-driven algorithms that combine data, machine learning and interdisciplinarity to bridge societal knowledge gaps. The novelty of the algorithms derives from their robust built-in mechanics for handling data randomness. Animation applications on structured COVID-19 related data obtained from the European Centre for Disease Prevention and Control (ECDC) and the UK Office for National Statistics exhibit great potential for decision-support systems. Predictive findings are based on unstructured data: a large COVID-19 X-ray dataset of 3181 image files obtained from GitHub and Kaggle. Our results exhibit consistent performance across samples, resonating with cross-disciplinary discussions on novel paths for data-driven interdisciplinary research. © 2021, Ubiquity Press. All rights reserved.
Citations: 2
Application Profile for Machine-Actionable Data Management Plans
Q2 Computer Science Pub Date : 2021-10-26 DOI: 10.5334/dsj-2021-032
Tomasz Miksa, Paul Walk, Peter Neish, Simon Oblasser, Hollydawn Murray, Tom Renner, Marie-Christine Jacquemot-Perbal, João Cardoso, Trond Kvamme, Maria Praetzellis, Marek Suchánek, Rob W.W. Hooft, Benjamin Faure, Hanne Moa, Adil Hasan, Sarah Jones
This paper presents the application profile for machine-actionable data management plans, which allows information from traditional data management plans to be expressed in a machine-actionable way. We describe the methodology and research conducted to define the application profile. We also discuss design decisions made during its development and present systems that have adopted it. The application profile was developed in an open and consensus-driven manner within the DMP Common Standards Working Group of the Research Data Alliance and is its official recommendation.
Citations: 5
Do I-PASS for FAIR? Measuring the FAIR-ness of Research Organizations
Q2 Computer Science Pub Date : 2021-10-07 DOI: 10.5334/dsj-2021-030
J. Ringersma, M. Miedema
Citations: 0
Open Access and Data Sharing of Nucleotide Sequence Data
Q2 Computer Science Pub Date : 2021-09-15 DOI: 10.5334/dsj-2021-028
Masanori Arita
Open access, free access, and the public domain are different concepts. The International Nucleotide Sequence Database Collaboration (INSDC) permanently guarantees free and unrestricted access to nucleotide sequence data for all researchers, irrespective of nationality or affiliation. However, recent virus information is primarily distributed via the restricted-access repository known as the Global Initiative on Sharing Avian Flu Data (GISAID) supported by the World Health Organization. As compensation for the restriction, GISAID needs to meet its initial goal of benefit-sharing among countries and to curb ongoing vaccine diplomacy campaigns.
Citations: 4
Research Data Management Challenges in Citizen Science Projects and Recommendations for Library Support Services. A Scoping Review and Case Study
Q2 Computer Science Pub Date : 2021-08-18 DOI: 10.5334/dsj-2021-025
J. S. Hansen, Signe Gadegaard, Karsten Kryger Hansen, Asger Væring Larsen, S. Møller, Gertrud Stougård Thomsen, Katrine Flindt Holmstrand
Citizen science (CS) projects are part of a new era of data aggregation and harmonisation that facilitates interconnections between different datasets. Increasing the value and reuse of CS data has received growing attention with the appearance of the FAIR principles and systematic research data management (RDM) practices, which are often promoted by university libraries. However, RDM initiatives in CS appear diversified, and whether CS has special needs in terms of RDM is unclear. Therefore, the aim of this article is, firstly, to identify RDM challenges for CS projects and, secondly, to discuss how university libraries may help address such challenges. A scoping review and a case study of Danish CS projects were performed to identify RDM challenges. Forty-eight articles were selected for data extraction. Four academic project leaders were interviewed about RDM practices in their CS projects. Challenges and recommendations identified in the review and case study are often not specific to CS. However, finding CS data, engaging specific populations, attributing volunteers and handling sensitive data, including health data, are some of the challenges requiring special attention by CS project managers. Scientific requirements or national practices do not always encompass the nature of CS projects. Based on the identified challenges, it is recommended that university libraries focus their services on 1) identifying legal and ethical issues that the project managers should be aware of in their projects, 2) elaborating these issues in a Terms of Participation that also specifies data handling and sharing to the citizen scientist, and 3) motivating the project manager to adopt good data handling practices. Adhering to the FAIR principles and good RDM practices in CS projects will continuously secure contextualisation and data quality. High data quality increases the value and reuse of the data and, therefore, the empowerment of the citizen scientists.
Citations: 5
On the Application of Principal Component Analysis to Classification Problems
Q2 Computer Science Pub Date : 2021-08-18 DOI: 10.5334/dsj-2021-026
Jianwei Zheng, C. Rakovski
Principal Component Analysis (PCA) is a commonly used technique that uses the correlation structure of the original variables to reduce the dimensionality of the data. This reduction is achieved by considering only the first few principal components for a subsequent analysis. The usual inclusion criterion is that the proportion of the total variance explained by the retained principal components exceeds a predetermined threshold. We show that in certain classification problems, even an extremely high inclusion threshold can negatively impact the classification accuracy. The omission of small-variance principal components can severely diminish the performance of the models. We noticed this phenomenon in classification analyses using high-dimensional ECG data, where the most common classification methods lost between 1% and 6% of accuracy even when using a 99% inclusion threshold. However, this issue can occur even in low-dimensional data with a simple correlation structure, as our numerical example shows. We conclude that the exclusion of any principal components should be carefully investigated.
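The effect described in this abstract can be reproduced on toy data. The sketch below is not the authors' numerical example: the two-feature synthetic dataset, the standard deviations, the helper names, and the use of a simple Fisher linear discriminant are all assumptions chosen for illustration. The class signal is placed entirely in a low-variance direction, so a 99% variance threshold keeps only the uninformative component.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
y = rng.integers(0, 2, size=n)
# Feature 0: large variance, unrelated to the class label.
# Feature 1: small variance, but it carries all of the class signal.
X = np.column_stack([
    rng.normal(0.0, 15.0, size=n),
    (2 * y - 1) + rng.normal(0.0, 0.3, size=n),
])

def pca_fit(X):
    """Return the mean, eigenvalues and eigenvectors sorted by decreasing variance."""
    mean = X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X - mean, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    return mean, eigvals[order], eigvecs[:, order]

def n_components_for(eigvals, threshold):
    """Smallest number of leading components whose variance share reaches threshold."""
    ratio = np.cumsum(eigvals) / eigvals.sum()
    return min(int(np.searchsorted(ratio, threshold)) + 1, len(eigvals))

def lda_accuracy(Z_train, y_train, Z_test, y_test):
    """Two-class Fisher linear discriminant with pooled covariance; returns test accuracy."""
    mu0 = Z_train[y_train == 0].mean(axis=0)
    mu1 = Z_train[y_train == 1].mean(axis=0)
    centered = np.where(y_train[:, None] == 1, Z_train - mu1, Z_train - mu0)
    cov = np.atleast_2d(np.cov(centered, rowvar=False))
    w = np.linalg.solve(cov, mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1)
    pred = (Z_test @ w + b > 0).astype(int)
    return float((pred == y_test).mean())

# Fit PCA on the first half, evaluate on the second half.
train = np.arange(n) < n // 2
test = ~train
mean, eigvals, eigvecs = pca_fit(X[train])

def project(X, k):
    return (X - mean) @ eigvecs[:, :k]

k = n_components_for(eigvals, 0.99)   # components kept under a 99% threshold
acc_reduced = lda_accuracy(project(X[train], k), y[train], project(X[test], k), y[test])
acc_full = lda_accuracy(project(X[train], 2), y[train], project(X[test], 2), y[test])
print(f"components kept at 99%: {k}")
print(f"accuracy with kept components: {acc_reduced:.3f}")
print(f"accuracy with all components:  {acc_full:.3f}")
```

In this constructed case the discarded component explains well under 1% of the variance yet holds nearly all of the discriminative information, which is the situation the abstract warns about. Real data will rarely be this extreme.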
Citations: 1
SASSCAL WebSAPI: A Web Scraping Application Programming Interface to Support Access to SASSCAL's Weather Data
Q2 Computer Science Pub Date : 2021-07-28 DOI: 10.5334/dsj-2021-024
Tsaone Swaabow Thapelo, M. Namoshe, O. Matsebe, T. Motshegwa, Mary-Jane M. Bopape
The Southern African Science Service Centre for Climate and Land Management (SASSCAL) was initiated to support regional weather monitoring and climate research in Southern Africa. As a result, several Automatic Weather Stations (AWSs) were installed to provide numerical weather data within the collaborating countries. However, access to the SASSCAL weather data is limited to a number of records that are retrieved via a series of clicks. Currently, end users cannot efficiently extract the desired weather values, so the data is not fully utilised. This work contributes an open-source Web Scraping Application Programming Interface (WebSAPI), exposed through an interactive dashboard. The objective is to extend the functionality of the SASSCAL Weathernet for data extraction, statistical data analysis and visualisation. The SASSCAL WebSAPI was developed using the R statistical environment. It deploys web scraping and data wrangling techniques to support access to SASSCAL weather data. This WebSAPI reduces the risk of human error and the researcher's effort in generating desired data sets. The proposed framework for the SASSCAL WebSAPI can be modified for other weather data banks while taking into consideration the legality and ethics of the toolkit.
Citations: 1