
Frontiers in Big Data: Latest Articles

PHTFNet-RPM: a probabilistic hybrid network with RPM for tobacco root disease forecasting.
IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-11-10 eCollection Date: 2025-01-01 DOI: 10.3389/fdata.2025.1705587
Yunhong Bu, Tingshan Yao, Shaowu Geng, Renjie Huang

Introduction: Tobacco growers usually face particular challenges in predicting the risks of tobacco root diseases due to complex pathogenesis, concealed early symptoms, and heterogeneous farm conditions.

Methods: To address this problem, we propose a flexible Probabilistic Hybrid Temporal Fusion Network with Random Period Mask (PHTFNet-RPM). The model is designed to forecast multi-day disease incidences and indices. Its hybrid input structure handles configurable static management variables together with time-series data of weather factors and disease metrics, and uses the RPM to simulate diverse absences of historical observations. The model's hierarchically aggregated internal modules learn cross-variable and cross-temporal feature representations to model complex non-linear relationships. Furthermore, uncertainty quantification based on probability theory is included to enhance the model's credibility and reliability.
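The abstract does not spell out how the Random Period Mask is built, so the following is only a minimal sketch of the general idea: hide a few randomly placed contiguous periods of a historical series so that a model sees varied patterns of missing observations during training. The function name `random_period_mask`, its parameters, and the choice of zero as the fill value are all assumptions made for illustration, not details from the paper.

```python
import random


def random_period_mask(series, max_periods=2, max_len=3, rng=None):
    """Return (masked_series, mask); mask[i] == 0 marks a hidden time step.

    Hypothetical illustration: hides up to `max_periods` contiguous periods,
    each at most `max_len` steps long, and fills hidden steps with 0.0.
    """
    rng = rng or random.Random()
    n = len(series)
    mask = [1] * n
    for _ in range(rng.randint(0, max_periods)):
        length = rng.randint(1, min(max_len, n))
        start = rng.randint(0, n - length)
        for i in range(start, start + length):
            mask[i] = 0
    masked = [x if m else 0.0 for x, m in zip(series, mask)]
    return masked, mask


if __name__ == "__main__":
    history = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # toy values, not survey data
    print(random_period_mask(history, rng=random.Random(0)))
```

Returning the mask alongside the masked series lets a downstream model distinguish a hidden observation from a genuine zero.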

Results: The proposed PHTFNet-RPM was validated using a large-scale time-series dataset of tobacco root diseases, organized from 20-year meteorological and disease survey records in Chuxiong Prefecture, Yunnan Province. Extensive comparative experiments demonstrated that our model achieves a 4.44%-16.43% lower mean absolute error (MAE) than existing models (including LR, SVR, CNN-LSTM, and LSTM-Attention).

Discussion: The results confirm that the model can reliably forecast disease progression trends under different configurations, even when relying solely on historical weather observations. The integration of uncertainty quantification provides a robust tool for assessing prediction reliability, offering significant practical value for disease management.

Citations: 0
Towards the neuromorphic Cyber-Twin: an architecture for cognitive defense in digital twin ecosystems.
IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-11-04 eCollection Date: 2025-01-01 DOI: 10.3389/fdata.2025.1659757
Nida Nasir, Hussam Al Hamadi

Introduction: As cyber-physical systems become increasingly virtualized, digital twins have emerged as essential components for real-time monitoring, simulation, and control. However, their growing complexity and exposure to dynamic network environments make them vulnerable to sophisticated cyber threats. Traditional rule-based and machine-learning-based security models often fail to adapt in real time to evolving attack patterns, particularly in decentralized and resource-constrained settings.

Methods: This study introduces the Neuromorphic Cyber-Twin (NCT), a brain-inspired architectural framework that integrates spiking neural networks (SNNs) and event-driven cognition to enhance adaptive cyber defense. The NCT leverages neuromorphic principles such as sparse coding, temporal encoding, and spike-timing-dependent plasticity (STDP) to transform telemetry data from the digital-twin layer into spike-based sensory inputs. A layered cognitive architecture continuously monitors behavioral deviations, infers anomalies, and autonomously adapts its defensive responses in alignment with system dynamics.
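As a rough illustration of two of the neuromorphic principles named above, the sketch below shows one common form of temporal (latency) encoding and the textbook pair-based STDP rule. It is not taken from the NCT prototype: the function names, the time window `t_max`, and the constants `a_plus`, `a_minus`, and `tau` are assumed values chosen only for the example.

```python
import math


def latency_encode(values, t_max=10.0):
    """Map telemetry values normalized to [0, 1] onto spike times.

    Larger values spike earlier; out-of-range inputs are clipped.
    """
    return [t_max * (1.0 - min(max(v, 0.0), 1.0)) for v in values]


def stdp_delta(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Pair-based STDP weight change for one pre/post spike pair."""
    dt = t_post - t_pre
    if dt > 0:  # pre fires before post: strengthen the synapse
        return a_plus * math.exp(-dt / tau)
    if dt < 0:  # post fires before pre: weaken the synapse
        return -a_minus * math.exp(dt / tau)
    return 0.0


if __name__ == "__main__":
    spikes = latency_encode([0.0, 1.0, 0.5])  # toy telemetry readings
    print(spikes, stdp_delta(spikes[1], spikes[2]))
```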

Results: Lightweight prototype simulations demonstrate the feasibility of NCT-based event-driven anomaly detection and adaptive defense. The results highlight advantages in low-latency detection, contextual awareness, and energy efficiency compared with conventional machine-learning models.

Discussion: The NCT framework represents a biologically inspired paradigm for scalable, self-evolving cybersecurity in virtualized ecosystems. Potential applications include infrastructure monitoring, autonomous transportation, and industrial control systems. Comprehensive benchmarking and large-scale validation are identified as future research directions.

Citations: 0
Urban mobility and crime: causal inference using street closures as an instrumental variable.
IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-10-31 eCollection Date: 2025-01-01 DOI: 10.3389/fdata.2025.1579332
Karl Vachuska

The advent of widely available cell phone mobility data in the United States has rapidly expanded the study of everyday mobility patterns in social science research. A wide range of existing literature finds ambient population (e.g., visitor) estimates of an area to be predictive of crime. Much of the past research frames neighborhood visitor flows in predictive terms without necessarily indicating or implying a causal effect. Using two causal inference approaches (conventional two-way fixed effects and a novel instrumental variable approach), this brief research report explicitly formulates the causal effect of visitors in counterfactual terms. This study addresses this gap by explicitly estimating the causal effect of visitor flows on crime rates. Using high-resolution mobility and crime data from New York City for the year 2019, I estimate the additive effect of visitors on multiple measures of criminal activity. While two-way fixed effects models show a significant effect of visitors on a wide array of crime forms, instrumental variable estimates indicate no statistically significant causal impact, with large standard errors indicating substantial uncertainty in visitors' effect on crime rates.
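The contrast between a naive regression slope and an instrumental variable estimate can be illustrated with a single-instrument (Wald-type) estimator. The sketch below is a simplification that omits the fixed effects and standard errors used in the study, and the four data points are fabricated toy values built so that a hidden confounder drives both visitors and crime. The names `ols_slope` and `iv_slope` are hypothetical.

```python
def _cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def ols_slope(x, y):
    """Naive regression slope of y on x."""
    return _cov(x, y) / _cov(x, x)


def iv_slope(z, x, y):
    """Single-instrument IV estimate: cov(z, y) / cov(z, x)."""
    return _cov(z, y) / _cov(z, x)


if __name__ == "__main__":
    # Toy data (not from the study): z is the instrument, u an unobserved
    # confounder, x the treatment (visitors), y the outcome (crime).
    z = [0, 0, 1, 1]
    u = [0, 1, 0, 1]
    x = [zi + ui for zi, ui in zip(z, u)]
    y = [2 * ui for ui in u]  # the true effect of x on y is zero
    print(ols_slope(x, y), iv_slope(z, x, y))
```

In this constructed case the naive slope is 1.0 while the IV estimate is 0.0, because the instrument is uncorrelated with the confounder by design.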

Citations: 0
Finding the needle in the haystack-An interpretable sequential pattern mining method for classification problems.
IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-10-24 eCollection Date: 2025-01-01 DOI: 10.3389/fdata.2025.1604887
Alexander Grote, Anuja Hariharan, Christof Weinhardt

Introduction: The analysis of discrete sequential data, such as event logs and customer clickstreams, is often challenged by the vast number of possible sequential patterns. This complexity makes it difficult to identify meaningful sequences and derive actionable insights.

Methods: We propose a novel feature selection algorithm that integrates unsupervised sequential pattern mining with supervised machine learning. Unlike existing interpretable machine learning methods, we determine important sequential patterns during the mining process, eliminating the need for post-hoc classification to assess their relevance. In contrast to existing interestingness measures, we introduce a local, class-specific interestingness measure that is inherently interpretable.
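The abstract does not define the proposed interestingness measure. To make the idea of a class-specific score concrete, the sketch below uses a simple stand-in: the difference between a pattern's support inside a target class and its support in all other classes. This stand-in, the function names, and the toy clickstream data are assumptions for illustration and should not be read as the authors' measure.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` as an ordered subsequence
    (items need not be adjacent)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def class_support_gap(sequences, labels, pattern, target):
    """Support of `pattern` within `target` minus its support elsewhere.

    Assumes both the target class and the remaining classes are non-empty.
    """
    inside = [s for s, y in zip(sequences, labels) if y == target]
    outside = [s for s, y in zip(sequences, labels) if y != target]
    sup_in = sum(contains(s, pattern) for s in inside) / len(inside)
    sup_out = sum(contains(s, pattern) for s in outside) / len(outside)
    return sup_in - sup_out


if __name__ == "__main__":
    # Toy clickstreams, invented for the example.
    seqs = [["login", "search", "cancel"], ["login", "cancel"],
            ["login", "buy"], ["search", "buy"]]
    labels = ["churn", "churn", "stay", "stay"]
    print(class_support_gap(seqs, labels, ["login", "cancel"], "churn"))
```

A score near 1 means the pattern appears almost only in the target class, which is easy to read off without a separate classifier.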

Results: We evaluated the algorithm on three diverse datasets - churn prediction, malware sequence analysis, and a synthetic dataset - covering different sizes, application domains, and feature complexities. Our method achieved classification performance comparable to established feature selection algorithms while maintaining interpretability and reducing computational costs.

Discussion: This study demonstrates a practical and efficient approach for uncovering important sequential patterns in classification tasks. By combining interpretability with competitive predictive performance, our algorithm provides practitioners with an interpretable and efficient alternative to existing methods, paving the way for new advances in sequential data analysis.

Citations: 0
Study on coal and gas outburst prediction technology based on multi-model fusion.
IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-10-20 eCollection Date: 2025-01-01 DOI: 10.3389/fdata.2025.1623883
Qian Xie, Junsheng Yan, Zhenhua Dai, Wengang Du, Xuefei Wu

The rapid advancement of artificial intelligence (AI) and machine learning (ML) technologies has opened up novel avenues for predicting coal and gas outbursts in coal mines. This study proposes a novel prediction framework that integrates advanced AI methodologies through a multi-model fusion strategy based on ensemble learning and model stacking. The proposed model leverages the diverse data interpretation capabilities and distinct training mechanisms of various algorithms, thereby capitalizing on the complementary strengths of each constituent learner. Specifically, a stacking-based ensemble model is constructed, incorporating Support Vector Machines (SVM), Random Forests (RF), and k-Nearest Neighbors (KNN) as base learners. An attention mechanism is then employed to adaptively weight the outputs of these base learners. The meta-learner, primarily built upon the XGBoost algorithm, integrates these weighted outputs to generate the final prediction. The model's performance is evaluated using real-world coal and gas outburst data collected from a mine in Pingdingshan, China, with evaluation metrics including the F1-score and other standard classification indicators. The results reveal that individual models, such as XGBoost, SVM, and RF, can quantify the importance of input features using their inherent mechanisms. Furthermore, the ensemble model significantly outperforms single-model approaches, particularly when the base learners are both strong and mutually uncorrelated. The proposed ensemble framework achieves a markedly higher F1-score, demonstrating its robustness and effectiveness in the complex task of coal and gas outburst prediction.
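The weighting step between the base learners and the meta-learner can be sketched as a softmax over per-learner scores. How the paper derives its attention scores is not stated in the abstract, so the scores below are hypothetical fixed numbers, and the function names are invented for this example. The base learners and the XGBoost meta-learner themselves are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weighted_features(base_probs, scores):
    """Scale each base learner's predicted probability by its attention weight.

    Returns (weighted_outputs, weights); the weighted outputs would then be
    passed to a meta-learner as its input features.
    """
    weights = softmax(scores)
    return [w * p for w, p in zip(weights, base_probs)], weights


if __name__ == "__main__":
    # Hypothetical outburst probabilities from SVM, RF, and KNN for one sample,
    # with made-up attention scores.
    feats, weights = attention_weighted_features([0.9, 0.7, 0.4], [1.2, 0.8, 0.1])
    print(feats, weights)
```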

Citations: 0
Research on fault-tolerant decision algorithm for data security automation.
IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-10-20 eCollection Date: 2025-01-01 DOI: 10.3389/fdata.2025.1600540
Jianxin Li, Ruchun Jia, Ning Xiang, Yizhun Tian

Introduction: Traditional operation and maintenance decision algorithms often ignore the analysis of data source security, making them highly susceptible to noise, time-consuming in execution, and lacking in rationality.

Methods: In this study, we design an automated operation and maintenance decision algorithm based on data source security analysis. A multi-angle learning algorithm is adopted to establish a noise data model, introduce relaxation variables, and compare sharing factors with noise data characteristics to determine whether the data source is secure. Taking the ideal power shortage and minimum maintenance cost as the objective function, we construct a classical particle swarm optimization model and derive the expressions for particle search velocity and position. To address the problem of local optima, a niche mechanism is incorporated: the obtained automated data is treated as the population, a reasonable number of iterations is determined, individual fitness is stored, and the optimal state is obtained through a continuous iterative update strategy.
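For readers unfamiliar with particle swarm optimization, the velocity and position updates mentioned above usually take the standard form v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x), followed by x = x + v. The sketch below implements that classical scheme on a toy sphere function. It omits the niche mechanism and the paper's actual objective (power shortage and maintenance cost), and the parameter values and the name `pso_minimize` are assumptions made for illustration.

```python
import random


def pso_minimize(f, dim, n_particles=20, iters=100,
                 w=0.7, c1=1.5, c2=1.5, bound=5.0, seed=0):
    """Classical PSO: returns (best_position, best_value) found for f."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-bound, bound) for _ in range(dim)]
          for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]              # each particle's best position
    pbest_val = [f(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # swarm-wide best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + cognitive pull + social pull.
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]        # position update
            val = f(xs[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = xs[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = xs[i][:], val
    return gbest, gbest_val


if __name__ == "__main__":
    sphere = lambda x: sum(v * v for v in x)  # toy objective, minimum at 0
    print(pso_minimize(sphere, dim=2))
```

A niche mechanism would typically add a step that keeps sub-populations apart so the swarm does not collapse onto one local optimum; that step is not shown here.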

Results: Experimental results show that the proposed strategy can shorten operation and maintenance time, enhance the rationality of decision-making, improve algorithm convergence, and avoid falling into local optima.

Discussion: In addition, fault-tolerant analysis is performed on data source security, effectively eliminating bad data, preventing interference from malicious data, and further improving convergence performance.

Citations: 0
Analyzing student mental health with RoBERTa-Large: a sentiment analysis and data analytics approach.
IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-10-17 eCollection Date: 2025-01-01 DOI: 10.3389/fdata.2025.1615788
Hikmat Ullah Khan, Anam Naz, Fawaz Khaled Alarfaj, Naif Almusallam

The mental health of students plays an important role in their overall wellbeing and academic performance. Growing pressure from academics, co-curricular activities such as sports, and personal challenges highlights the need for modern methods of monitoring mental health. Traditional approaches, such as self-reported surveys and psychological evaluations, can be time-consuming and subject to bias. With advances in artificial intelligence (AI), particularly in natural language processing (NLP), sentiment analysis has emerged as an effective technique for identifying mental health patterns in textual data. However, analyzing students' mental health remains a challenging task due to the intensity of emotional expressions, linguistic variations, and context-dependent sentiments. In this study, our primary objective was to investigate the mental health of students by conducting sentiment analysis using advanced deep learning models. To accomplish this task, state-of-the-art Large Language Model (LLM) approaches, such as RoBERTa (a robustly optimized BERT approach), RoBERTa-Large, and ELECTRA, were used for empirical analysis. RoBERTa-Large, an expanded architecture derived from Google's BERT, captures complex patterns and performs more effectively on various NLP tasks. Among the applied algorithms, RoBERTa-Large achieved the highest accuracy of 97%, while ELECTRA yielded 91% accuracy on a multi-class classification task with seven diverse mental health status labels. These results demonstrate the potential of LLM-based approaches for predicting students' mental health, particularly in relation to the effects of academic and physical activities.

Citations: 0
Research on optimization of personalized recommendation method based on RFMQ model- taking outdoor sports products in cross-border e-commerce as an example.
IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-10-14 eCollection Date: 2025-01-01 DOI: 10.3389/fdata.2025.1680669
Qianlan Chen, Chupeng Chen, Zubai Jiang, Chaoling Li, Yangxizi Tan, Niannian Li, Bolin Zhou, Bingxian Yang

With the rapid growth of the global digital economy, cross-border e-commerce has expanded quickly and become a crucial bridge connecting global markets. This research focuses on cross-border e-commerce for outdoor sports products. In response to common problems in the field, such as "information overload" and "insufficient recommendation accuracy," it proposes a personalized recommendation optimization framework that integrates customer value segmentation with collaborative filtering. Building on the classic RFM model, a purchase quantity indicator (Quantity) is introduced to construct the RFMQ model, which characterizes user behavior more comprehensively. Customer value stratification is then achieved using an indicator segmentation method and the K-means clustering algorithm, and a differentiated collaborative filtering recommendation mechanism is designed for the resulting segments. A five-fold cross-validation experiment shows that the proposed method significantly outperforms the traditional collaborative filtering model on the TOPN recommendation task. Specifically, when the number of recommended products is between 3 and 7, the RFMQ recommendation model based on indicator segmentation achieves the best F1 score (for example, at TOPN = 5 the F1 value rises from 0.1709 to 0.3093), and the method based on K-means clustering also shows a stable improvement (reaching an F1 value of 0.267 at the same setting). The results indicate that the indicator segmentation method has a significant advantage when the number of recommendations is small. This study verifies the effectiveness of the RFMQ model for customer segmentation and recommendation performance optimization. It provides an operational solution for e-commerce platforms to implement precise marketing and enhance user stickiness and commercial competitiveness, and it is particularly suitable for the low-cost, high-efficiency personalized recommendation scenarios of small and medium-sized enterprises.
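
The abstract does not spell out how the RFMQ indicators or the segments are computed, so the following is only a minimal sketch of one plausible reading. It assumes that Recency is the number of days since the last order, that Frequency, Monetary, and Quantity are simple per-customer totals, and that "indicator segmentation" compares each indicator with the population mean. The order records, the function names (`rfmq`, `indicator_segments`, `f1_at_n`), and the mean threshold are all hypothetical and are not taken from the paper. The K-means variant is not shown.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order records: (customer_id, order_date, amount, quantity).
ORDERS = [
    ("c1", date(2025, 1, 5), 120.0, 3),
    ("c1", date(2025, 3, 1), 80.0, 1),
    ("c2", date(2024, 11, 20), 40.0, 1),
    ("c3", date(2025, 2, 10), 300.0, 6),
    ("c3", date(2025, 2, 25), 150.0, 2),
    ("c3", date(2025, 3, 8), 90.0, 2),
]


def rfmq(orders, today):
    """Return {customer: (recency_days, frequency, monetary, quantity)}."""
    last_seen = {}
    totals = defaultdict(lambda: [0, 0.0, 0])  # frequency, monetary, quantity
    for cust, day, amount, qty in orders:
        last_seen[cust] = max(last_seen.get(cust, day), day)
        totals[cust][0] += 1
        totals[cust][1] += amount
        totals[cust][2] += qty
    return {
        cust: ((today - last_seen[cust]).days, f, m, q)
        for cust, (f, m, q) in totals.items()
    }


def indicator_segments(table):
    """Label each customer with a 4-character R/F/M/Q code.

    '1' means better than the population mean (lower recency, higher
    frequency/monetary/quantity), '0' otherwise. Using the mean as the
    cut-off is an assumption of this sketch, not the paper's stated rule.
    """
    n = len(table)
    means = [sum(row[i] for row in table.values()) / n for i in range(4)]
    codes = {}
    for cust, row in table.items():
        bits = ["1" if row[0] < means[0] else "0"]
        bits += ["1" if row[i] > means[i] else "0" for i in (1, 2, 3)]
        codes[cust] = "".join(bits)
    return codes


def f1_at_n(recommended, relevant, n):
    """F1 score of the top-n recommended items against the relevant set."""
    top = recommended[:n]
    hits = len(set(top) & set(relevant))
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


TABLE = rfmq(ORDERS, today=date(2025, 3, 10))
SEGMENTS = indicator_segments(TABLE)
```

In a pipeline like the one the abstract describes, the segment codes would decide which group's purchase history feeds the collaborative filter, and `f1_at_n` would be averaged over users and folds for each TOPN value.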

Citations: 0
AI biases as asymmetries: a review to guide practice.
IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-10-13 eCollection Date: 2025-01-01 DOI: 10.3389/fdata.2025.1532397
Gabriella Waters, Phillip Honenberger

The understanding of bias in AI is currently undergoing a revolution. Often assumed to be errors or flaws, biases are increasingly recognized as integral to AI systems and sometimes preferable to less biased alternatives. In this paper we review the reasons for this changed understanding and provide new guidance on three questions: First, how should we think about and measure biases in AI systems, consistent with the new understanding? Second, what kinds of bias in an AI system should we accept or even amplify, and why? And, third, what kinds should we attempt to minimize or eliminate, and why? In answer to the first question, we argue that biases are "violations of a symmetry standard" (following Kelly). Per this definition, many biases in AI systems are benign. This raises the question of how to identify biases that are problematic or undesirable when they occur. To address this question, we distinguish three main ways that asymmetries in AI systems can be problematic or undesirable (erroneous representation, unfair treatment, and violation of process ideals) and highlight places in the pipeline of AI development and application where bias of these types can occur.
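
To make the idea of measuring a bias as a departure from a symmetry standard more concrete, here is a small illustrative sketch. The symmetry standard it uses (equal rates of positive decisions across groups) is an assumption chosen for the example, not a definition taken from the review, and the data and function names are hypothetical.

```python
def selection_rates(decisions, groups):
    """Share of positive decisions (1) within each group label."""
    counts, positives = {}, {}
    for decision, group in zip(decisions, groups):
        counts[group] = counts.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + decision
    return {g: positives[g] / counts[g] for g in counts}


def asymmetry(decisions, groups):
    """Largest gap in selection rate between any two groups.

    A value of 0 means the assumed symmetry standard holds exactly.
    """
    rates = selection_rates(decisions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical binary decisions for two groups of four people each.
DECISIONS = [1, 1, 1, 0, 1, 0, 0, 0]
GROUPS = ["a", "a", "a", "a", "b", "b", "b", "b"]
GAP = asymmetry(DECISIONS, GROUPS)
```

A nonzero gap only shows that an asymmetry exists. Whether it is a problem would still depend on the three criteria the abstract names: erroneous representation, unfair treatment, or violation of process ideals.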

Citations: 0
Predicting deep vein thrombosis using machine learning and blood routine analysis.
IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-10-06 eCollection Date: 2025-01-01 DOI: 10.3389/fdata.2025.1605258
Jie Su, Yuechao Tang, Yanan Wang, Chao Chen, Biao Song

Objective: Lower limb deep vein thrombosis (DVT) is a serious health problem, causing local discomfort and hindering walking. It can lead to severe complications, including pulmonary embolism, chronic post-thrombotic syndrome, and limb amputation, posing risks of death or severe disability. This study aims to develop a diagnostic model for DVT using routine blood analysis and evaluate its effectiveness in early diagnosis.

Methods: This study retrospectively analyzed patient medical records from January 2022 to June 2023, including 658 DVT patients (case group) and 1,418 healthy subjects (control group). SHAP (SHapley Additive exPlanations) analysis was employed for feature selection to identify key blood indices significantly impacting DVT risk prediction. Based on the selected features, six machine learning models were constructed: k-Nearest Neighbors (kNN), Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Artificial Neural Network (ANN). Model performance was assessed using the area under the curve (AUC).

Results: SHAP analysis identified ten key blood routine indices. The six models constructed using these indices demonstrated strong predictive performance, with AUC values exceeding 0.8, accuracy above 70%, and sensitivity and specificity over 70%. Notably, the RF model exhibited superior performance in assessing the risk of DVT.

Conclusions: Our study successfully developed machine learning models for predicting DVT risk using routine blood tests. These models achieved high predictive performance, suggesting their potential for early DVT diagnosis without additional medical burden on patients. Future research will focus on further validation and refinement of these models to enhance their clinical applicability.
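
The evaluation metrics named in the abstract (AUC, sensitivity, specificity) can be sketched in a few lines of plain Python. This is an illustration under assumptions: the labels and scores are made-up toy values rather than study data, the 0.5 decision threshold is arbitrary, and the function names are hypothetical. AUC is computed in its pairwise (Mann-Whitney) form, which a library routine would normally replace.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Return (sensitivity, specificity) at the given score threshold.

    Assumes both classes are present in `labels`.
    """
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: 1 = DVT case, 0 = control; scores are model risk estimates.
LABELS = [1, 1, 1, 0, 0, 0, 0]
SCORES = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
AUC_VALUE = auc(LABELS, SCORES)
SENS, SPEC = sensitivity_specificity(LABELS, SCORES)
```

Because AUC is threshold-free while sensitivity and specificity depend on the chosen cut-off, reporting all three, as the abstract does, gives a fuller picture than any single number.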

Citations: 0