
Journal of Big Data: Latest Publications

DAPS diagrams for defining Data Science projects
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-04-12 | DOI: 10.1186/s40537-024-00916-7
Jeroen de Mast, Joran Lokkerbol

Background

Models for structuring big-data and data-analytics projects typically start with a definition of the project’s goals and the business value they are expected to create. The literature identifies proper project definition as crucial for a project’s success, and also recognizes that the translation of business objectives into data-analytic problems is a difficult task. Unfortunately, common project structures, such as CRISP-DM, provide little guidance for this crucial stage when compared to subsequent project stages such as data preparation and modeling.

Contribution

This paper contributes structure to the project-definition stage of data-analytic projects by proposing the Data-Analytic Problem Structure (DAPS). The diagrammatic technique facilitates the collaborative development of a consistent and precise definition of a data-analytic problem, and the articulation of how it contributes to the organization’s goals. In addition, the technique helps to identify important assumptions and to break down large ambitions into manageable subprojects.

Methods

The semi-formal specification technique took other models for problem structuring — common in fields such as operations research and business analytics — as a point of departure. The proposed technique was applied in 47 real data-analytic projects and refined based on the results, following a design-science approach.
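The abstract does not spell out the exact elements of a DAPS diagram, but the underlying idea of a precise, decomposable problem definition can be sketched as a small data structure. The field names below are illustrative assumptions for a hypothetical problem record, not DAPS's actual notation:

```python
from dataclasses import dataclass, field

@dataclass
class ProblemDefinition:
    """Hypothetical record capturing the kind of elements a DAPS-style
    problem definition collects (field names are illustrative)."""
    business_goal: str       # the organizational objective served
    analytic_question: str   # the data-analytic problem derived from it
    unit_of_analysis: str    # e.g. customer, transaction, machine
    target_variable: str     # what the analysis predicts or estimates
    assumptions: list = field(default_factory=list)
    subproblems: list = field(default_factory=list)  # nested ProblemDefinition

    def flatten(self):
        """Depth-first list of this problem and all of its subproblems,
        mirroring the breakdown of a large ambition into subprojects."""
        out = [self]
        for sub in self.subproblems:
            out.extend(sub.flatten())
        return out
```

A nested definition then makes the decomposition explicit: a top-level churn-reduction goal, say, can carry two subprojects that are reviewed and scoped independently.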

Citations: 0
B-CAT: a model for detecting botnet attacks using deep attack behavior analysis on network traffic flows
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-04-10 | DOI: 10.1186/s40537-024-00900-1
Muhammad Aidiel Rachman Putra, Tohari Ahmad, Dandy Pramana Hostiadi

Threats on computer networks have been increasing rapidly, and irresponsible parties constantly try to exploit network vulnerabilities for various dangerous purposes. One way to exploit vulnerabilities in a computer network is by employing malware. Botnets are a type of malware that infects and attacks targets in groups. Botnets develop quickly; attacks that were initially sporadic have become periodic and simultaneous. This rapid development shows that botnets are sophisticated and require more attention and proper handling. Many studies have introduced detection models for botnet attack activity on computer networks. Apart from detecting the presence of botnet attacks, those studies have attempted to explore botnet characteristics such as attack intensity, relationships between activities, and time segments. However, no research has explicitly detected those characteristics. Meanwhile, each botnet characteristic requires different handling, and recognizing the characteristics of a botnet can help network administrators make appropriate decisions. For these reasons, this research builds a detection model that recognizes botnet characteristics using sequential traffic mining and similarity analysis. The proposed method consists of two main processes: training, which builds a knowledge base, and testing, which detects botnet activity and attack characteristics. It involves dynamic thresholds to improve the model's sensitivity in recognizing attack characteristics through similarity analysis. The novelty lies in developing and combining the analytical techniques of sequential traffic mining, similarity analysis, and dynamic thresholds to detect and recognize the characteristics of botnet attacks explicitly from actual behavior in network traffic. Extensive experiments have been conducted on three different datasets; the results show better performance than existing methods.
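The abstract does not give B-CAT's formulas. As a rough, hypothetical illustration of similarity analysis with a dynamic threshold, one can derive the threshold from pairwise similarities within the knowledge base of attack profiles and flag a new traffic flow whose best match clears it (the cosine measure and the mean-minus-k-sigma rule are assumptions, not the paper's method):

```python
import math

def cosine(a, b):
    """Cosine similarity between two flow feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dynamic_threshold(profiles, k=1.0):
    """Threshold adapts to how tightly the known attack profiles cluster,
    instead of being a fixed constant."""
    sims = [cosine(p, q)
            for i, p in enumerate(profiles) for q in profiles[i + 1:]]
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
    return mean - k * std

def is_botnet_like(flow, profiles, k=1.0):
    """Flag the flow if its best match against any known attack profile
    exceeds the data-driven threshold."""
    return max(cosine(flow, p) for p in profiles) >= dynamic_threshold(profiles, k)
```

With two tightly clustered attack profiles, a flow resembling them is flagged while an orthogonal flow is not.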

Citations: 0
Computer aided technology based on graph sample and aggregate attention network optimized for soccer teaching and training
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-04-05 | DOI: 10.1186/s40537-024-00893-x
Guanghui Yang, Xinyuan Feng

Football is the most popular game in the world and has significant influence on various aspects of society, including politics, economy, and culture. The experience of football-developed nations has shown that the steady growth of youth football is crucial for elevating a nation's overall football proficiency. It is essential to develop techniques and strategies adapted to players' individual physical features in order to address their lack of practice in various areas. In this manuscript, computer-aided technology based on a Graph Sample and Aggregate Attention Network Optimized for Soccer Teaching and Training (CAT-GSAAN-STT) is proposed to improve the efficiency of soccer teaching and training. The proposed method contains four stages: data collection, data preprocessing, prediction, and optimization. Initially, the input data are collected by a Microsoft Kinect V2 smart camera. The collected data are then preprocessed using improved graph collaborative filtering. After preprocessing, the data are passed to a motion recognition layer, where prediction is performed using a Graph Sample and Aggregate Attention Network (GSAAN) to improve the effectiveness of soccer teaching and training. To enhance the accuracy of the system, the GSAAN is optimized using Artificial Rabbits Optimization. The proposed CAT-GSAAN-STT method is implemented in Python, and its efficiency is examined with different metrics, such as accuracy, computation time, learning activity analysis, student performance ratio, and teaching evaluation analysis. The simulation outcomes prove that the proposed technique attains 28.33%, 31.60%, and 25.63% higher recognition accuracy and 33.67%, 38.12%, and 27.34% lower evaluation time compared with existing techniques, namely the computer-aided teaching system based upon artificial intelligence in football teaching with training (STT-IOT-CATS), the Computer Aided Teaching System for Football Teaching and Training Based on Video Image (CAT-STT-VI), and the method for enhancing football coaching quality using artificial intelligence and metaverse empowerment in a mobile internet environment (SI-STQ-AI-MIE), respectively.
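The GSAAN architecture itself is not specified in the abstract, but the "graph sample and aggregate" idea it builds on (GraphSAGE-style neighborhood aggregation) can be sketched without learned weights or attention. This is a simplified mean aggregator over a fixed-size neighbor sample, not the paper's model:

```python
import random

def sample_and_aggregate(node, adj, feats, num_samples=2, rng=None):
    """One GraphSAGE-style step: sample up to num_samples neighbors of the
    node and mean-aggregate their feature vectors with the node's own.
    (GSAAN would additionally weight neighbors via attention.)"""
    rng = rng or random.Random(0)
    neigh = adj.get(node, [])
    sampled = neigh if len(neigh) <= num_samples else rng.sample(neigh, num_samples)
    vecs = [feats[node]] + [feats[n] for n in sampled]
    dim = len(feats[node])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
```

Fixed-size sampling is what makes this family of models scale: the cost per node is bounded regardless of the true neighborhood size.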

Citations: 0
Adapting transformer-based language models for heart disease detection and risk factors extraction
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-04-04 | DOI: 10.1186/s40537-024-00903-y
Essam H. Houssein, Rehab E. Mohamed, Gang Hu, Abdelmgeid A. Ali

Efficiently treating cardiac patients before the onset of a heart attack relies on the precise prediction of heart disease. Identifying and detecting the risk factors for heart disease such as diabetes mellitus, Coronary Artery Disease (CAD), hyperlipidemia, hypertension, smoking, familial CAD history, obesity, and medications is critical for developing effective preventative and management measures. Although Electronic Health Records (EHRs) have emerged as valuable resources for identifying these risk factors, their unstructured format poses challenges for cardiologists in retrieving relevant information. This research proposes employing transfer learning techniques to automatically extract heart disease risk factors from EHRs. Transfer learning, a deep learning technique, has demonstrated significant performance in various clinical natural language processing (NLP) applications, particularly in heart disease risk prediction. This study explored the application of transformer-based language models, specifically pre-trained architectures such as BERT (Bidirectional Encoder Representations from Transformers), RoBERTa, BioClinicalBERT, XLNet, and BioBERT, for heart disease detection and the extraction of related risk factors from clinical notes, using the i2b2 dataset. These transformer models are pre-trained on an extensive corpus of medical literature and clinical records to gain a deep understanding of contextualized language representations. The adapted models are then fine-tuned using annotated datasets specific to heart disease, such as the i2b2 dataset, enabling them to learn patterns and relationships within the domain. These models have demonstrated superior performance in extracting semantic information from EHRs, automating high-performance heart disease risk factor identification, and performing downstream NLP tasks within the clinical domain. This study fine-tuned five widely used transformer-based models, namely BERT, RoBERTa, BioClinicalBERT, XLNet, and BioBERT, using the 2014 i2b2 clinical NLP challenge dataset. The fine-tuned models surpass conventional approaches in predicting the presence of heart disease risk factors with impressive accuracy. The RoBERTa model achieved the highest performance, with a micro F1-score of 94.27%, while the BERT, BioClinicalBERT, XLNet, and BioBERT models provided competitive performances with micro F1-scores of 93.73%, 94.03%, 93.97%, and 93.99%, respectively. Finally, a simple ensemble of the five transformer-based models is proposed, which outperformed most existing methods in heart disease risk factor identification, achieving a micro F1-score of 94.26%. This study demonstrates the efficacy of transfer learning using transformer-based models in enhancing risk prediction and facilitating early intervention for heart disease prevention.
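The micro F1-scores reported above pool true positives, false positives, and false negatives over all risk-factor labels before computing precision and recall, which weights frequent labels more heavily than macro averaging. A minimal sketch of the metric (the label names in the usage are hypothetical):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over multi-label annotations: counts are pooled
    across all documents and labels before precision/recall are computed."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels correctly extracted
        fp += len(p - g)   # spurious labels
        fn += len(g - p)   # missed labels
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

For example, with gold annotations [{"diabetes", "smoking"}, {"obesity"}] and predictions [{"diabetes"}, {"obesity", "CAD"}], the pooled counts are tp=2, fp=1, fn=1, giving micro F1 of 2/3.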

Citations: 0
Gene selection via improved nuclear reaction optimization algorithm for cancer classification in high-dimensional data
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-04-03 | DOI: 10.1186/s40537-024-00902-z

Abstract

RNA Sequencing (RNA-Seq) has been considered a revolutionary technique in gene profiling and quantification. It offers a comprehensive view of the transcriptome, making it a more expansive technique than microarrays. Genes that discriminate between malignant and normal samples can be deduced using quantitative gene expression. However, this data is a high-dimensional dense matrix; each sample has a dimension of more than 20,000 genes, which poses challenges. This paper proposes RBNRO-DE (Relief Binary NRO based on Differential Evolution) to handle the gene selection strategy on (rnaseqv2 illuminahiseq rnaseqv2 un edu Level 3 RSEM genes normalized) data with more than 20,000 genes, picking the most informative genes and assessing them across 22 cancer datasets. The k-nearest Neighbor (k-NN) and Support Vector Machine (SVM) classifiers are applied to assess the quality of the selected genes. Binary versions of the most common meta-heuristic algorithms were compared with the proposed RBNRO-DE algorithm. In most of the 22 cancer datasets, the RBNRO-DE algorithm based on k-NN and SVM classifiers achieved optimal convergence and classification accuracy up to 100%, combined with a feature reduction size down to 98%, a clear advantage over its counterparts according to Wilcoxon’s rank-sum test (5% significance level).
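RBNRO-DE itself is not reproduced here, but the wrapper evaluation that binary gene-selection meta-heuristics rely on, scoring a candidate 0/1 gene mask by classifier accuracy plus feature reduction, can be sketched with a leave-one-out 1-NN. The 1-NN choice and the weighting factor alpha are simplifying assumptions, not the paper's exact setup:

```python
def one_nn_accuracy(X, y, mask):
    """Leave-one-out 1-NN accuracy using only the genes where mask == 1."""
    idx = [j for j, m in enumerate(mask) if m]
    correct = 0
    for i in range(len(X)):
        best, best_d = None, float("inf")
        for k in range(len(X)):
            if k == i:
                continue
            d = sum((X[i][j] - X[k][j]) ** 2 for j in idx)  # squared Euclidean
            if d < best_d:
                best, best_d = k, d
        correct += y[best] == y[i]
    return correct / len(X)

def fitness(X, y, mask, alpha=0.99):
    """Wrapper objective: trade accuracy against the fraction of genes kept,
    so masks that drop uninformative genes score higher."""
    reduction = 1 - sum(mask) / len(mask)
    return alpha * one_nn_accuracy(X, y, mask) + (1 - alpha) * reduction
```

On a toy dataset where the first gene separates the classes and the second is noise, the mask keeping only the informative gene outscores the mask keeping both, which is exactly the pressure a binary meta-heuristic like RBNRO-DE exploits.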

Citations: 0
The role of classifiers and data complexity in learned Bloom filters: insights and recommendations
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-03-27 | DOI: 10.1186/s40537-024-00906-9
Dario Malchiodi, Davide Raimondi, Giacomo Fumagalli, Raffaele Giancarlo, Marco Frasca

Bloom filters, since their introduction over 50 years ago, have become a pillar for handling membership queries in small space, with relevant applications in Big Data Mining and Stream Processing. Further improvements have recently been proposed with the use of Machine Learning techniques: learned Bloom filters. The latter considerably complicate the proper parameter setting of this multi-criteria data structure, in particular the choice of one of its key components (the classifier) and the accounting for the classification complexity of the input dataset. Given this state of the art, our contributions are as follows. (1) A novel methodology, supported by software, for designing, analyzing and implementing learned Bloom filters that accounts for their multi-criteria nature, in particular concerning classifier type choice and data classification complexity. Extensive experiments show the validity of the proposed methodology and, since our software is public, we offer a valid tool to practitioners interested in using learned Bloom filters. (2) Further contributions of great practical relevance to the advancement of the state of the art are the following: (a) the classifier inference time should not be taken as a proxy for the filter reject time; (b) of the many classifiers we have considered, only two offer good performance; this result agrees with and further strengthens early findings in the literature; (c) the sandwiched Bloom filter, already known as one of the references of this area, is further shown here to have the remarkable property of robustness to data complexity and classifier performance variability.
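The basic learned Bloom filter design the paper analyzes pairs a classifier with a backup Bloom filter: keys the classifier scores below a threshold go into the backup, so no key is ever falsely rejected. A minimal sketch (the toy scoring function, filter size, and hash count below are assumptions for illustration):

```python
import hashlib

class BloomFilter:
    """Classic Bloom filter: k hash positions per item over an m-bit array."""
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _hashes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._hashes(item))

class LearnedBloomFilter:
    """Classifier + backup filter: only keys scored below the threshold are
    stored in the backup, so membership queries have no false negatives."""
    def __init__(self, score, threshold, keys):
        self.score, self.threshold = score, threshold
        self.backup = BloomFilter()
        for key in keys:
            if score(key) < threshold:
                self.backup.add(key)

    def __contains__(self, key):
        return self.score(key) >= self.threshold or key in self.backup
```

The paper's point (a) is visible in this structure: a query rejected by the classifier still pays for the backup-filter lookup, so classifier inference time alone does not determine the reject time.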

引用次数: 0
Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-26 DOI: 10.1186/s40537-024-00905-w
Huanjing Wang, Qianxin Liang, John T. Hancock, Taghi M. Khoshgoftaar

In the context of high-dimensional credit card fraud data, researchers and practitioners commonly utilize feature selection techniques to enhance the performance of fraud detection models. This study presents a comparison of model performance using the most important features selected by SHAP (SHapley Additive exPlanations) values and the model’s built-in feature importance list. Both methods rank features and choose the most significant ones for model assessment. To evaluate the effectiveness of these feature selection techniques, classification models are built using five classifiers: XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest. The Area under the Precision-Recall Curve (AUPRC) serves as the evaluation metric. All experiments are executed on the Kaggle Credit Card Fraud Detection Dataset. The experimental outcomes and statistical tests indicate that feature selection methods based on importance values outperform those based on SHAP values across classifiers and various feature subset sizes. For models trained on larger datasets, it is recommended to use the model’s built-in feature importance list as the primary feature selection method over SHAP. This suggestion is based on the rationale that computing SHAP feature importance is a distinct activity, while models naturally provide built-in feature importance as part of the training process, requiring no additional effort. Consequently, opting for the model’s built-in feature importance list can offer a more efficient and practical approach for larger datasets and more intricate models.
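The "rank features, keep the top k" step both compared methods share can be sketched as below. To keep the sketch dependency-free, absolute Pearson correlation stands in for the importance score — an assumption for illustration only; the study itself ranks by SHAP values and by the model's built-in importances.

```python
from statistics import mean

def abs_corr(xs, ys):
    """Absolute Pearson correlation, a stand-in importance score."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return abs(cov / (vx * vy)) if vx and vy else 0.0

def select_top_k(X, y, k):
    """Score every feature column, rank them, keep the top-k indices."""
    n_features = len(X[0])
    scores = [abs_corr([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# Feature 1 tracks the label exactly; feature 0 is noise; feature 2 is constant.
X = [[5, 0, 1], [1, 1, 1], [1, 0, 1], [5, 1, 1]]
y = [0, 1, 0, 1]
print(select_top_k(X, y, 2))  # feature 1 is ranked first
```

In practice the scoring line is the only part that changes between the two strategies: swap in `model.feature_importances_` or mean absolute SHAP values, retrain on the selected subset, and compare AUPRC as the study does.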

Multi-sample ζ-mixup: richer, more realistic synthetic samples from a p-series interpolant
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-23 DOI: 10.1186/s40537-024-00898-6

Abstract

Modern deep learning training procedures rely on model regularization techniques such as data augmentation methods, which generate training samples that increase the diversity of data and richness of label information. A popular recent method, mixup, uses convex combinations of pairs of original samples to generate new samples. However, as we show in our experiments, mixup can produce undesirable synthetic samples, where the data is sampled off the manifold and can contain incorrect labels. We propose ζ-mixup, a generalization of mixup with provably and demonstrably desirable properties that allows convex combinations of T ≥ 2 samples, leading to more realistic and diverse outputs that incorporate information from T original samples by using a p-series interpolant. We show that, compared to mixup, ζ-mixup better preserves the intrinsic dimensionality of the original datasets, which is a desirable property for training generalizable models. Furthermore, we show that our implementation of ζ-mixup is faster than mixup, and extensive evaluation on controlled synthetic and 26 diverse real-world natural and medical image classification datasets shows that ζ-mixup outperforms mixup, CutMix, and traditional data augmentation techniques. The code will be released at https://github.com/kakumarabhishek/zeta-mixup.
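The p-series weighting the abstract describes can be sketched as follows: each of the T samples receives a rank via a random permutation, is weighted proportionally to rank^(-γ), and the weights are normalized to a convex combination. The γ value here is an illustrative assumption; the paper derives the admissible range itself.

```python
import random

def zeta_mixup_weights(T, gamma=2.8, rng=random):
    """p-series weights: assign each of the T samples a rank via a
    random permutation, weight it by rank**(-gamma), then normalize."""
    ranks = list(range(1, T + 1))
    rng.shuffle(ranks)
    norm = sum(r ** -gamma for r in range(1, T + 1))
    return [r ** -gamma / norm for r in ranks]

def zeta_mixup(samples, gamma=2.8, rng=random):
    """Convex combination of T >= 2 samples (each a flat feature list)."""
    w = zeta_mixup_weights(len(samples), gamma, rng)
    dim = len(samples[0])
    return [sum(wi * s[d] for wi, s in zip(w, samples)) for d in range(dim)]

random.seed(0)
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = zeta_mixup(samples)
print(mixed)
```

With a sufficiently large γ the top-ranked sample keeps more than half the total weight, which is what keeps the synthetic point close to one original sample — and hence near the data manifold — while still blending in information from the other T − 1 samples.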

Learning manifolds from non-stationary streams
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-23 DOI: 10.1186/s40537-023-00872-8
Suchismit Mahapatra, Varun Chandola

Streaming adaptations of manifold-learning-based dimensionality reduction methods, such as Isomap, are based on the assumption that a small initial batch of observations is enough for exact learning of the manifold, while remaining streaming data instances can be cheaply mapped to this manifold. However, there are no theoretical results to show that this core assumption is valid. Moreover, such methods typically assume that the underlying data distribution is stationary and are not equipped to detect, or handle, sudden changes or gradual drifts in the distribution that may occur when the data is streaming. We present theoretical results to show that the quality of a manifold asymptotically converges as the size of data increases. We then show that a Gaussian Process Regression (GPR) model, that uses a manifold-specific kernel function and is trained on an initial batch of sufficient size, can closely approximate the state-of-the-art streaming Isomap algorithms, and the predictive variance obtained from the GPR prediction can be employed as an effective detector of changes in the underlying data distribution. Results on several synthetic and real data sets show that the resulting algorithm can effectively learn lower dimensional representation of high dimensional data in a streaming setting, while identifying shifts in the generative distribution. For instance, key findings on a Gas sensor array data set show that our method can detect changes in the underlying data stream, triggered due to real-world factors, such as introduction of a new gas in the system, while efficiently mapping data on a low-dimensional manifold.
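The change-detection idea — GPR predictive variance grows for queries far from the training data — can be illustrated with a tiny 1-D sketch. This uses a plain RBF kernel rather than the paper's manifold-specific kernel, and the lengthscale and noise values are assumed toy settings.

```python
import math

def rbf(a, b, ls=1.0):
    return math.exp(-((a - b) ** 2) / (2 * ls ** 2))

def solve(A, b):
    """Gaussian elimination with partial pivoting (A: n x n, b: length n)."""
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gpr_variance(train_x, query_x, noise=1e-6, ls=1.0):
    """Posterior predictive variance at query_x; a large value flags a
    query far from the training data, i.e. a possible distribution shift."""
    K = [[rbf(a, b, ls) + (noise if i == j else 0.0)
          for j, b in enumerate(train_x)] for i, a in enumerate(train_x)]
    k_star = [rbf(a, query_x, ls) for a in train_x]
    v = solve(K, k_star)                      # K^{-1} k_*
    return rbf(query_x, query_x, ls) - sum(ks * vi for ks, vi in zip(k_star, v))

train_x = [0.0, 0.5, 1.0, 1.5, 2.0]
print(gpr_variance(train_x, 1.1), gpr_variance(train_x, 8.0))
```

A query inside the span of the initial batch yields near-zero variance, while one far outside yields variance close to the kernel's prior value — thresholding this quantity is what turns the GPR into a drift detector for the stream.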

An adaptive hybrid african vultures-aquila optimizer with Xgb-Tree algorithm for fake news detection
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-19 DOI: 10.1186/s40537-024-00895-9
Amr A. Abd El-Mageed, Amr A. Abohany, Asmaa H. Ali, Khalid M. Hosny

Online platforms and social networks have proliferated in recent years. They are now a major news source worldwide, leading to the online proliferation of Fake News (FNs). These FNs are alarming because they fundamentally reshape public opinion, which may cause users to leave these online platforms, threatening the reputations of several organizations and industries. This rapid dissemination of FNs makes it imperative for automated systems to detect them, encouraging many researchers to propose various systems to classify news articles and detect FNs automatically. In this paper, a Fake News Detection (FND) methodology is presented based on an effective IBAVO-AO algorithm, a hybridization of the African Vultures Optimization (AVO) and Aquila Optimization (AO) algorithms, with an extreme gradient boosting Tree (Xgb-Tree) classifier. The suggested methodology involves three main phases: initially, the unstructured FNs dataset is analyzed, and the essential features are extracted by tokenizing, encoding, and padding the input news words into a sequence of integers utilizing the GloVe approach. Then, the extracted features are filtered using the effective Relief algorithm to select only the appropriate ones. Finally, the recovered features are used to classify the news items using the suggested IBAVO-AO algorithm based on the Xgb-Tree classifier. Hence, the suggested methodology is distinguished from prior models in that it performs automatic data pre-processing, optimization, and classification tasks. The proposed methodology is carried out on the ISOT-FNs dataset, containing more than 44 thousand news articles divided into truthful and fake. We validated the proposed methodology’s reliability by examining numerous evaluation metrics involving accuracy, fitness values, the number of selected features, Kappa, Precision, Recall, F1-score, Specificity, Sensitivity, ROC_AUC, and MCC. 
Then, the proposed methodology is compared against the most common meta-heuristic optimization algorithms utilizing the ISOT-FNs. The experimental results reveal that the suggested methodology achieved optimal classification accuracy and F1-score and successfully categorized more than 92.5% of news articles compared to its peers. This study will assist researchers in expanding their understanding of meta-heuristic optimization algorithms applications for FND.
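The tokenize → encode → pad preprocessing step of the first phase can be sketched as follows. The regex tokenizer and the toy vocabulary are illustrative assumptions; the methodology itself maps the resulting integer sequences through GloVe embeddings.

```python
import re

def build_vocab(texts, pad="<pad>", unk="<unk>"):
    """Map each word to an integer id; 0 = padding, 1 = out-of-vocabulary."""
    vocab = {pad: 0, unk: 1}
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, max_len):
    """Tokenize, map words to ids, then pad/truncate to a fixed length."""
    ids = [vocab.get(w, vocab["<unk>"])
           for w in re.findall(r"[a-z']+", text.lower())]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

news = ["Breaking: markets rally today", "Markets fall amid fears"]
vocab = build_vocab(news)
seqs = [encode(t, vocab, max_len=6) for t in news]
print(seqs)
```

Fixed-length integer sequences like these are what the later phases consume: the Relief filter prunes the feature set, and the IBAVO-AO-tuned Xgb-Tree classifier makes the final truthful/fake decision.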

Graphical Abstract

Journal of Big Data