Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-11-10 DOI: 10.3390/data8110170

Sascha Wolfer, Alexander Koplenig, Marc Kupietz, Carolin Müller-Spitzer

We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
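
The vocabulary-growth case study described above can be illustrated with a minimal sketch. It assumes each fold is available as a mapping from a token (or n-gram) to its frequency; that layout, the function name `vocabulary_growth`, and the toy data are illustrative assumptions and do not reflect the actual DeReKoGram file format or the authors' scripts.

```python
from collections import Counter


def vocabulary_growth(folds):
    """Return (n_folds, vocabulary_size, hapax_count) after adding each fold.

    `folds` is a list of dicts mapping a token (or n-gram) to its frequency
    in that fold. This layout is an assumption made for illustration.
    """
    totals = Counter()
    growth = []
    for n, fold in enumerate(folds, start=1):
        totals.update(fold)  # add this fold's frequencies to the running totals
        # A hapax legomenon is a type whose cumulative frequency is exactly 1.
        hapaxes = sum(1 for freq in totals.values() if freq == 1)
        growth.append((n, len(totals), hapaxes))
    return growth


# Toy data standing in for three folds (not taken from DeReKoGram).
toy_folds = [
    {"der": 5, "Hund": 1, "bellt": 1},
    {"der": 4, "Hund": 2, "Katze": 1},
    {"der": 6, "miaut": 1},
]

for row in vocabulary_growth(toy_folds):
    print(row)
```

Because the counts are cumulative, a type that is a hapax in one fold stops being one as soon as it appears again in a later fold, which is why the hapax count has to be recomputed from the running totals rather than summed per fold.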


Citations: 0

Machine Learning for Credit Risk Prediction: A Systematic Literature Review

Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-11-07 DOI: 10.3390/data8110169

Jomark Pablo Noriega, Luis Antonio Rivera, José Alfredo Herrera

This systematic literature review examines the use of Machine Learning (ML) for credit risk prediction and argues that financial institutions need Artificial Intelligence (AI) and ML to assess credit risk when analyzing large volumes of information. We posed research questions about the algorithms, metrics, results, datasets, variables, and related limitations in predicting credit risk. To answer them, we searched well-known databases and identified 52 relevant studies within the microfinance credit industry. We identified challenges and approaches in credit risk prediction with ML models, including difficulties such as the black-box nature of the models, the need for explainable artificial intelligence, the importance of selecting relevant features, multicollinearity, and imbalance in the input data. By answering the research questions, we found that the Boosted Category is the most researched family of ML models; the most commonly used evaluation metrics are Area Under Curve (AUC), Accuracy (ACC), Recall, the F1 measure (F1), and Precision. Research mainly uses public datasets to compare models and private ones to generate new knowledge when applied to the real world. The most significant limitation identified is the representativeness of reality, and the variables primarily used in the microcredit industry relate to demographic, operation, and payment behavior. This study aims to guide developers of credit risk management tools and software towards the existing ML methods, metrics, and techniques used to forecast credit risk, thereby minimizing possible losses due to default and guiding risk appetite.
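
The evaluation metrics named in the abstract can be made concrete with a small, dependency-free sketch. The function names, the 0.5 decision threshold, and the toy labels and scores are assumptions for illustration only; none of them come from the reviewed studies. AUC is computed here through its pairwise-ranking interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one).

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and model scores (hypothetical, for illustration only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(classification_metrics(y_true, y_pred))
print(auc(y_true, scores))
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, while AUC is threshold-free, which is one reason studies on imbalanced credit data often report several of these metrics together.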


Citations: 0

Draft Genome Sequence Data of Lysinibacillus sphaericus Strain 1795 with Insecticidal Properties

Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-11-03 DOI: 10.3390/data8110167

Maria N. Romanenko, Maksim A. Nesterenko, Anton E. Shikov, Anton A. Nizhnikov, Kirill S. Antonets

Lysinibacillus sphaericus is of significant agricultural importance because it can produce insecticidal toxins and chemical moieties with varying antibacterial and fungicidal activities. In this study, the genome of the L. sphaericus strain 1795 is presented. The genome was assembled with the SPAdes v3.15.4 software from Illumina short reads sequenced on the HiSeq X platform. The genome size, based on the cumulative length of 23 contigs, was 4.74 Mb, with an N50 of 1.34 Mb. The assembled genome carried 4672 genes, including 4643 protein-encoding ones, 5 of which represented loci coding for insecticidal toxins active against the orders Diptera, Lepidoptera, and Blattodea. We also revealed biosynthetic gene clusters responsible for the synthesis of secondary metabolites with predicted antibacterial, fungicidal, and growth-promoting properties. The genomic data provided will help deepen our understanding of the genetic markers that determine the efficient application of the L. sphaericus strain 1795, primarily for biocontrol purposes in veterinary and medical applications against several groups of blood-sucking insects.
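
The N50 statistic reported for the assembly follows a simple definition: it is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly size. The sketch below shows that definition; the function name `n50` and the toy contig lengths are illustrative assumptions and are not the contigs of strain 1795.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            # This contig is the one that pushes coverage past 50%.
            return length
    raise ValueError("no contigs given")


# Toy contig lengths in base pairs (hypothetical, for illustration only).
toy_contigs = [100, 80, 50, 30, 20, 10]
print(n50(toy_contigs))
```

A larger N50 relative to the total assembly size indicates that most of the genome sits in a few long contigs, which is the usual reason the value is reported alongside the contig count.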


Citations: 0

Applying Eye Tracking with Deep Learning Techniques for Early-Stage Detection of Autism Spectrum Disorders

Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-11-03 DOI: 10.3390/data8110168

Zeyad A. T. Ahmed, Eid Albalawi, Theyazn H. H. Aldhyani, Mukti E. Jadhav, Prachi Janrao, Mansour Ratib Mohammad Obeidat

Autism spectrum disorder (ASD) poses a complex challenge to researchers and practitioners, with its multifaceted etiology and varied manifestations. Timely intervention is critical in enhancing the developmental outcomes of individuals with ASD. This paper underscores the significance of early detection and diagnosis as a precursor to effective intervention. To this end, the integration of advanced technological tools, specifically eye-tracking technology and deep learning algorithms, is investigated for its potential to discriminate between children with ASD and their typically developing (TD) peers. By employing these methods, the research aims to contribute to refining early detection strategies and support mechanisms. This study introduces deep learning models grounded in convolutional neural network (CNN) and recurrent neural network (RNN) architectures, trained on an eye-tracking dataset. The bidirectional long short-term memory (BiLSTM) model achieved an accuracy of 96.44%, the gated recurrent unit (GRU) 97.49%, the CNN-LSTM hybrid 97.94%, and the LSTM the highest accuracy of 98.33%. These outcomes underscore the efficacy of the applied methodologies and the potential of advanced computational frameworks in achieving substantial accuracy levels in ASD detection and classification.


Citations: 0

A Scalable Data Structure for Efficient Graph Analytics and In-Place Mutations

Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-11-03 DOI: 10.3390/data8110166

Soukaina Firmli, Dalila Chiadmi

The graph model enables a broad range of analyses; thus, graph processing (GP) is an invaluable tool in data analytics. At the heart of every GP system lies a concurrent graph data structure that stores the graph. Such a data structure needs to be highly efficient for both graph algorithms and queries. Due to the continuous evolution, the sparsity, and the scale-free nature of real-world graphs, GP systems face the challenge of providing an appropriate graph data structure that enables both fast analytical workloads and fast, low-memory graph mutations. Existing graph structures offer a hard tradeoff among read-only performance, update friendliness, and memory consumption upon updates. In this paper, we introduce CSR++, a new graph data structure that removes these tradeoffs and enables both fast read-only analytics, and quick and memory-friendly mutations. CSR++ combines ideas from CSR, the fastest read-only data structure, and adjacency lists (ALs) to achieve the best of both worlds. We compare CSR++ to CSR, ALs from the Boost Graph Library (BGL), and the following state-of-the-art update-friendly graph structures: LLAMA, STINGER, GraphOne, and Teseo. In our evaluation, which is based on popular GP algorithms executed over real-world graphs, we show that CSR++ remains close to CSR in read-only concurrent performance (within 10% on average) while significantly outperforming CSR (by an order of magnitude) and LLAMA (by almost 2×) with frequent updates. We also show that both CSR++’s update throughput and analytics performance exceed those of several state-of-the-art graph structures while maintaining low memory consumption when the workload includes updates.
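
To make the tradeoff in the abstract concrete, here is a minimal sketch of plain CSR (compressed sparse row), the read-only baseline that CSR++ builds on. It is not CSR++ itself, whose internals the abstract does not describe; the function names and the toy edge list are illustrative assumptions.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, targets) from directed (src, dst) edges."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1            # count the out-degree of each vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]     # prefix sum turns counts into offsets
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()         # next free slot for each vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbors(offsets, targets, v):
    """Out-neighbors of v are one contiguous slice of the targets array."""
    return targets[offsets[v]:offsets[v + 1]]


# Toy directed graph with 4 vertices (hypothetical, for illustration only).
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
offsets, targets = build_csr(4, toy_edges)
print(offsets, targets)
print(neighbors(offsets, targets, 0))
```

The contiguous layout is what makes scans fast, and it is also what makes updates costly: adding one edge to a vertex in the middle means shifting every later entry of `targets` and incrementing every later offset. Adjacency lists avoid that shifting at the price of pointer chasing, which is the tension the abstract says CSR++ is designed to resolve.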


Citations: 0

Can We Mathematically Spot the Possible Manipulation of Results in Research Manuscripts Using Benford’s Law?

Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-10-31 DOI: 10.3390/data8110165

Teddy Lazebnik, Dan Gorlitsky

The reproducibility of academic research has long been a persistent issue, contradicting one of the fundamental principles of science. Recently, an increasing number of false claims have been found in academic manuscripts, casting doubt on the validity of reported results. In this paper, we use an adapted version of Benford’s law, a statistical phenomenon that describes the distribution of leading digits in naturally occurring datasets, to identify the potential manipulation of results in research manuscripts, using solely the aggregated data presented in those manuscripts rather than the commonly unavailable raw datasets. Our methodology applies the principles of Benford’s law to analyses commonly employed in academic manuscripts, thus reducing the need for the raw data itself. To validate our approach, we employed 100 open-source datasets, and our rules predicted 79% of them correctly. Moreover, we tested the proposed method on known retracted manuscripts, showing that around half (48.6%) can be detected. Additionally, we analyzed 100 manuscripts published in the last two years across ten prominent economic journals, with 10 manuscripts randomly sampled from each journal. Our analysis predicted a 3% occurrence of results manipulation with a 96% confidence level. Our findings show that Benford’s law, adapted for aggregated data, can be an initial tool for identifying data manipulation; however, it is not a silver bullet, and each flagged manuscript requires further investigation because of the relatively low prediction accuracy.
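
For readers unfamiliar with Benford’s law, the sketch below shows the standard first-digit form, in which a leading digit d is expected with probability log10(1 + 1/d). It is not the authors' adapted version for aggregated data, which the abstract does not specify; the function names and the mean-absolute-deviation score are illustrative assumptions, and no decision threshold is implied.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def observed_distribution(values):
    """Observed share of each leading digit, ignoring zeros."""
    # Scientific notation puts the leading digit first, e.g. '3.140000e+02'.
    digits = [int(f"{abs(v):e}"[0]) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


def mean_abs_deviation(values):
    """Average absolute gap between observed and expected digit shares."""
    expected = benford_expected()
    observed = observed_distribution(values)
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9


# Powers of two are a classic sequence that follows Benford's law closely.
powers_of_two = [2 ** k for k in range(1, 101)]
# One value per leading digit gives a uniform, non-Benford distribution.
uniform_digits = list(range(1, 10))

print(mean_abs_deviation(powers_of_two))
print(mean_abs_deviation(uniform_digits))
```

A small deviation is consistent with the law and a large one is a reason to look closer, not proof of manipulation, which matches the abstract's caution that flagged manuscripts need further investigation.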


Citations: 0

Information Competences and Academic Achievement: A Dataset

Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-10-27 DOI: 10.3390/data8110164

Jacqueline Köhler, Roberto González-Ibáñez

Information literacy (IL) is becoming fundamental in the modern world. Although several IL standards and assessments have been developed for secondary and higher education, there is still no agreement about the possible associations between IL and both academic achievement and student dropout rates. In this article, we present a dataset including IL competences measurements, as well as academic achievement and socioeconomic indicators for 153 Chilean first- and second-year engineering students. The dataset is intended to allow researchers to use machine learning methods to study to what extent, if any, IL and academic achievement are related.


Citations: 0

The Development of a Water Resource Monitoring Ontology as a Research Tool for Sustainable Regional Development

Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-10-26 DOI: 10.3390/data8110162

Assel Ospan, Madina Mansurova, Vladimir Barakhnin, Aliya Nugumanova, Roman Titkov

Developing knowledge graphs about water resources as a tool for studying the sustainable development of a region is currently an urgent task, because the growing deterioration of water bodies affects the ecology, economy, and health of the population of the region. This study presents a new ontological approach to water resource monitoring in Kazakhstan that provides data integration from heterogeneous sources, semantic analysis, decision support, querying and searching, and the presentation of new knowledge in the field of water monitoring. The contribution of this work is the integration of table extraction and understanding, the Semantic Web Rule Language, the semantic sensor network, and time ontology methods, together with the inclusion of a module of socioeconomic indicators that reveal the impact of water quality on the quality of life of the population. Using machine learning methods, the study derived six ontological rules to establish new knowledge about water resource monitoring. The query results demonstrate the effectiveness of the proposed method and its potential to improve water monitoring practices, promote sustainable resource management, and support decision-making processes in Kazakhstan; the ontology can also be integrated into the ontology of water resources at the scale of Central Asia.


Citations: 0

A Large-Scale Dataset of Search Interests Related to Disease X Originating from Different Geographic Regions

Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-10-26 DOI: 10.3390/data8110163

Nirmalya Thakur, Shuqi Cui, Kesha A. Patel, Isabella Hall, Yuvraj Nihal Duggal

The World Health Organization (WHO) added Disease X to their shortlist of blueprint priority diseases to represent a hypothetical, unknown pathogen that could cause a future epidemic. During different virus outbreaks of the past, such as COVID-19, Influenza, Lyme Disease, and Zika virus, researchers from various disciplines utilized Google Trends to mine multimodal components of web behavior to study, investigate, and analyze the global awareness, preparedness, and response associated with these respective virus outbreaks. As the world prepares for Disease X, a dataset on web behavior related to Disease X would be crucial to contribute towards the timely advancement of research in this field. Furthermore, none of the prior works in this field have focused on the development of a dataset to compile relevant web behavior data, which would help to prepare for Disease X. To address these research challenges, this work presents a dataset of web behavior related to Disease X, which emerged from different geographic regions of the world, between February 2018 and August 2023. Specifically, this dataset presents the search interests related to Disease X from 94 geographic regions. These regions were chosen for data mining as these regions recorded significant search interests related to Disease X during this timeframe. The dataset was developed by collecting data using Google Trends. The relevant search interests for all these regions for each month in this time range are available in this dataset. This paper also discusses the compliance of this dataset with the FAIR principles of scientific data management. Finally, an analysis of this dataset is presented to uphold the applicability, relevance, and usefulness of this dataset for the investigation of different research questions in the interrelated fields of Big Data, Data Mining, Healthcare, Epidemiology, and Data Analysis with a specific focus on Disease X.



Citations: 0

Fabaceae: South African Medicinal Plant Species Used in the Treatment and Management of Sexually Transmitted and Related Opportunistic Infections Associated with HIV-AIDS

Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data

Pub Date : 2023-10-24 DOI: 10.3390/data8110160

Nkoana Ishmael Mongalo, Maropeng Vellry Raletsena

The use of medicinal plants, particularly in the treatment of sexually transmitted and related infections, is ancient. These plants may serve as alternative and complementary medicine to a variety of antibiotics whose usefulness is limited mainly by growing antimicrobial resistance. Several electronic literature databases, including ScienceDirect, Scopus, Scielo, PubMed, and Google Scholar, were used to retrieve information on Fabaceae species used in the treatment and management of sexually transmitted and related infections in South Africa. Further information was sourced from academic dissertations, theses, and botanical books. A total of 42 medicinal plant species belonging to the Fabaceae family, used in the treatment of sexually transmitted and related opportunistic infections associated with HIV-AIDS, have been documented. Trees were the most reported life form (47.62%), while Senna and Vachellia were the most frequently cited genera, with six and three species, respectively. Peltophorum africanum Sond. was the most preferred medicinal plant, with a frequency of citation of 14, while Vachellia karoo (Hayne) Banfi and Glasso and Elephantorrhiza burkei Benth. had 12 citations each. The most frequently used plant parts were roots (57.14%), and most of the plant species were administered orally after boiling (51.16%) until the infection subsided. Notably, many of the medicinal plant species are recommended for treating impotence (29.87%), while the most common STIs, such as chlamydia (7.79%), gonorrhea (6.49%), syphilis (5.19%), and genital warts (2.60%), as well as many other unidentified STIs that may include “Makgoma” and “Divhu”, were less cited. Although there is widespread in vitro evidence on the use of Fabaceae species in the treatment of sexually transmitted and related infections, in vivo studies are needed to further ascertain the use of these species as a possible complementary and alternative medicine to the currently used antibiotics in both developing and underdeveloped countries. Furthermore, the toxicological profiles of many of these species need to be further explored. The safety and efficacy of over-the-counter pharmaceutical products developed using these species also need to be explored.



Citations: 0

Latest articles in Data