{"title":"Generalized Estimating Equations Boosting (GEEB) machine for correlated data","authors":"","doi":"10.1186/s40537-023-00875-5","DOIUrl":null,"url":null,"abstract":"<h3>Abstract</h3> <p>Rapid development in data science enables machine learning and artificial intelligence to be the most popular research tools across various disciplines. While numerous articles have shown decent predictive ability, little research has examined the impact of complex correlated data. We aim to develop a more accurate model under repeated measures or hierarchical data structures. Therefore, this study proposes a novel algorithm, the Generalized Estimating Equations Boosting (GEEB) machine, to integrate the gradient boosting technique into the benchmark statistical approach that deals with the correlated data, the generalized Estimating Equations (GEE). Unlike the previous gradient boosting utilizing all input features, we randomly select some input features when building the model to reduce predictive errors. The simulation study evaluates the predictive performance of the GEEB, GEE, eXtreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM) across several hierarchical structures with different sample sizes. Results suggest that the new strategy GEEB outperforms the GEE and demonstrates superior predictive accuracy than the SVM and XGBoost in most situations. An application to a real-world dataset, the Forest Fire Data, also revealed that the GEEB reduced mean squared errors by 4.5% to 25% compared to GEE, XGBoost, and SVM. This research also provides a freely available R function that could implement the GEEB machine effortlessly for longitudinal or hierarchical data.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"29 1 1","pages":""},"PeriodicalIF":8.6000,"publicationDate":"2024-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Big Data","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1186/s40537-023-00875-5","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Abstract
Rapid development in data science has made machine learning and artificial intelligence the most popular research tools across various disciplines. While numerous articles have shown decent predictive ability, little research has examined the impact of complex correlated data. We aim to develop a more accurate model under repeated-measures or hierarchical data structures. Therefore, this study proposes a novel algorithm, the Generalized Estimating Equations Boosting (GEEB) machine, which integrates the gradient boosting technique into the benchmark statistical approach for correlated data, the Generalized Estimating Equations (GEE). Unlike previous gradient boosting approaches that utilize all input features, we randomly select a subset of input features when building each model to reduce predictive error. The simulation study evaluates the predictive performance of GEEB, GEE, eXtreme Gradient Boosting (XGBoost), and the Support Vector Machine (SVM) across several hierarchical structures with different sample sizes. Results suggest that the new strategy, GEEB, outperforms GEE and achieves higher predictive accuracy than SVM and XGBoost in most situations. An application to a real-world dataset, the Forest Fire Data, also revealed that GEEB reduced mean squared errors by 4.5% to 25% compared to GEE, XGBoost, and SVM. This research also provides a freely available R function that implements the GEEB machine effortlessly for longitudinal or hierarchical data.
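To make the idea concrete, the following is a minimal, hypothetical R sketch of the boosting-with-GEE scheme the abstract describes: at each round, pseudo-residuals are fit with a GEE base learner (here geepack::geeglm) on a randomly sampled subset of features, and the ensemble prediction is updated with a shrinkage factor. The function name geeb_sketch, its arguments, the squared-error pseudo-residual update, and the exchangeable working correlation are illustrative assumptions, not the authors' released implementation.

```r
# Minimal sketch of the GEEB idea, assuming a squared-error loss and a GEE
# base learner from the geepack package. Not the authors' released R function.
library(geepack)

geeb_sketch <- function(data, outcome, features, id,
                        n_rounds = 50, learn_rate = 0.1, subset_frac = 0.5) {
  # Observations from the same cluster are assumed to be contiguous in `data`,
  # as geeglm delimits clusters by changes in the id vector.
  pred <- rep(mean(data[[outcome]]), nrow(data))   # initialize with the grand mean
  models <- vector("list", n_rounds)

  for (m in seq_len(n_rounds)) {
    fit_data <- data
    fit_data$.resid <- data[[outcome]] - pred       # pseudo-residuals (squared-error loss)
    fit_data$.id <- data[[id]]                      # cluster identifier for the GEE fit

    # Randomly sample a subset of input features for this boosting round
    feats <- sample(features, max(1, floor(subset_frac * length(features))))
    fml <- reformulate(feats, response = ".resid")

    # GEE base learner with an exchangeable working correlation within clusters
    fit <- geeglm(fml, id = .id, data = fit_data,
                  family = gaussian, corstr = "exchangeable")

    # Shrunken additive update of the ensemble prediction
    X <- model.matrix(fml, fit_data)
    pred <- pred + learn_rate * as.vector(X %*% coef(fit))
    models[[m]] <- fit
  }
  list(models = models, fitted = pred)
}

# Hypothetical usage on a clustered data frame `df` with cluster column "site":
# res <- geeb_sketch(df, outcome = "y", features = c("x1", "x2", "x3"), id = "site")
```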
About the journal:
The Journal of Big Data publishes high-quality, scholarly research papers, methodologies, and case studies covering a broad spectrum of topics, from big data analytics to data-intensive computing and all applications of big data research. It addresses challenges facing big data today and in the future, including data capture and storage, search, sharing, analytics, technologies, visualization, architectures, data mining, machine learning, cloud computing, distributed systems, and scalable storage. The journal serves as a seminal source of innovative material for academic researchers and practitioners alike.