Pub Date: 2025-11-14 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1621526
Stan Kachnowski, Asif H Khan, Shadé Floquet, Kendal K Whitlock, Juan Pablo Wisnivesky, Daniel B Neill, Irene Dankwa-Mullan, Gezzer Ortega, Moataz Daoud, Raza Zaheer, Maia Hightower, Paul Rowe
The prevalence of immune diseases is rising, imposing burdens on patients, healthcare providers, and society. Addressing the future impact of immune diseases requires "big data" on global distribution/prevalence, patient demographics, risk factors, biomarkers, and prognosis to inform prevention, diagnosis, and treatment strategies. Big data offer promise by integrating diverse real-world data sources with artificial intelligence (AI) and big data analytics (BDA), yet cautious implementation is vital due to the potential to perpetuate and exacerbate biases. In this review, we outline some of the key challenges associated with achieving health equity through the use of big data, AI, and BDA in immune diseases and present potential solutions. For example, political/institutional will and stakeholder engagement are essential, requiring evidence of return on investment, a clear definition of success (including key metrics), and improved communication of unmet needs, disparities in treatments and outcomes, and the benefits of AI and BDA in achieving health equity. Broad representation and engagement are required to foster trust and inclusivity, involving patients and community organizations in study design, data collection, and decision-making processes. Enhancing technical capabilities and accountability with AI and BDA is also crucial to address data quality and diversity issues, ensuring datasets are of sufficient quality and representative of minoritized populations. Lastly, mitigating biases in AI and BDA is imperative, necessitating robust and iterative fairness assessments, continuous evaluation, and strong governance. Collaborative efforts to overcome these challenges are needed to leverage AI and BDA effectively, including an infrastructure for sharing harmonized big data, to advance health equity in immune diseases through transparent, fair, and impactful data-driven solutions.
"Achieving health equity in immune disease: leveraging big data and artificial intelligence in an evolving health system landscape." Frontiers in Big Data, 8:1621526. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12660090/pdf/
Pub Date: 2025-11-12 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1676477
Janis Kampars, Guntis Mosans, Tushar Jogi, Franz Roters, Napat Vajragupta
The management of vast, heterogeneous, and multidisciplinary data presents a critical challenge across scientific domains, hindering interoperability and slowing scientific progress. This paper addresses this challenge by presenting a pragmatic extension to the NeOn iterative ontology engineering framework, a well-established methodology for collaborative ontology design, which integrates Large Language Models (LLMs) to accelerate key tasks while retaining domain expert-in-the-loop validation. The methodology was applied within the HyWay project, an EU-funded research initiative on hydrogen-materials interactions, to develop the Hydrogen-Material Interaction Ontology (HMIO), a domain-specific ontology covering 29 experimental methods and 14 simulation types for assessing interactions between hydrogen and advanced metallic materials. A key result is the successful integration of the HMIO into a Data and Knowledge Management Platform (DKMP), where it drives the automated generation of data entry forms, ensuring that all captured data is Findable, Accessible, Interoperable, and Reusable (FAIR) and HMIO compliant by design. The validation of this approach demonstrates that this hybrid human-machine workflow for ontology engineering and further integration with the DKMP is an effective and efficient strategy for creating and operationalising complex scientific ontologies, thereby providing a scalable solution to advance data-driven research in materials science and other complex scientific domains.
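The ontology-driven generation of data entry forms described above can be sketched as a mapping from class properties to form fields. The class/property structure below is hypothetical (the HMIO schema itself is not shown in the abstract); it illustrates the "FAIR and compliant by design" idea of deriving the form from the ontology rather than hand-writing it:

```python
def form_fields_from_ontology(cls):
    # Map each ontology property to a form field: enumerated ranges
    # become dropdowns, everything else a free-text input.
    fields = []
    for prop in cls["properties"]:
        fields.append({
            "name": prop["name"],
            "input": "select" if prop.get("enum") else "text",
            "required": prop.get("required", False),
            "options": prop.get("enum", []),
        })
    return fields

# Hypothetical HMIO-style class description for one experimental method.
tensile_test = {
    "class": "TensileTest",
    "properties": [
        {"name": "specimenMaterial", "enum": ["steel", "aluminium"], "required": True},
        {"name": "hydrogenPressureMPa", "required": True},
    ],
}
fields = form_fields_from_ontology(tensile_test)
```

Because the form is derived from the ontology, any data captured through it is necessarily expressible in the ontology's vocabulary, which is what makes downstream interoperability automatic.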
"LLM-supported collaborative ontology design for data and knowledge management platforms." Frontiers in Big Data, 8:1676477. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12646930/pdf/
Pub Date: 2025-11-12 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1666962
Xiaoxiao Du, Haomiao Yu, Hao Zhang, Xiangyun Liu, Xinling Yu, Tao Xie, Wenli Diao
Objective: To compare the application of the ARIMA model, the Long Short-Term Memory (LSTM) model and the ARIMA-LSTM model in forecasting foodborne disease incidence.
Methods: Monthly case data of foodborne diseases in Liaoning Province from January 2015 to December 2023 were used to construct ARIMA, LSTM, and ARIMA-LSTM models. These three models were then applied to forecast the monthly incidence of foodborne diseases in 2024, and their predictions were compared with those of a baseline model. Model performance was evaluated by comparing the predicted and observed values using root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE), allowing identification of the optimal model. The best-performing model was subsequently employed to predict the monthly incidence for 2025.
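The three evaluation metrics have standard definitions, and the ARIMA-LSTM hybrid can be viewed as a linear forecast plus a learned residual correction. A minimal sketch, with stand-in numbers in place of the paper's actual ARIMA and LSTM outputs:

```python
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    # Mean absolute percentage error, in percent.
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def hybrid_forecast(linear_pred, residual_pred):
    # Hybrid decomposition: linear component (e.g. ARIMA) plus a
    # nonlinear correction learned on its residuals (e.g. LSTM).
    return [l + r for l, r in zip(linear_pred, residual_pred)]

y_true = [100.0, 120.0, 90.0]          # observed monthly case counts (toy)
linear = [95.0, 125.0, 85.0]           # stand-in ARIMA forecast
resid_correction = [4.0, -4.0, 4.0]    # stand-in LSTM residual forecast
combined = hybrid_forecast(linear, resid_correction)
```

The point of the decomposition is that the residual model only needs to capture what the linear model misses, which is why the hybrid can outperform either component alone.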
Results: The ARIMA-LSTM model was identified as the optimal model. Specifically, the ARIMA(2,0,0)(0,1,1)₁₂ model produced RMSE = 300.03, MAE = 187.11, and MAPE = 16.38%, while the LSTM model yielded RMSE = 408.71, MAE = 226.03, and MAPE = 17.21%. In contrast, the ARIMA-LSTM model achieved RMSE = 0.44, MAE = 0.44, and MAPE = 0.08%, representing a dramatic improvement over the baseline model (RMSE = 204.17, MAE = 146.75, MAPE = 15.62%), with reductions of 99.5%, 99.7%, and 99.4% in RMSE, MAE, and MAPE, respectively. Based on the ARIMA-LSTM model, the predicted monthly cases of foodborne diseases for 2025 are: 214.62 (Jan), 260.84 (Feb), 462.92 (Mar), 590.92 (Apr), 800.88 (May), 965.11 (Jun), 2410.36 (Jul), 2651.36 (Aug), 1711.15 (Sep), 941.22 (Oct), 628.21 (Nov), and 465.05 (Dec).
Conclusion: The ARIMA-LSTM model is considered the optimal model for predicting foodborne disease incidence in Liaoning Province in 2025.
"Application and comparison of ARIMA, LSTM, and ARIMA-LSTM models for predicting foodborne diseases in Liaoning Province." Frontiers in Big Data, 8:1666962. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12646886/pdf/
Pub Date: 2025-11-10 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1705587
Yunhong Bu, Tingshan Yao, Shaowu Geng, Renjie Huang
Introduction: Tobacco growers face particular challenges in predicting the risks of tobacco root diseases due to complex pathogenesis, concealed early symptoms, and heterogeneous farm conditions.
Methods: To address this problem, we proposed a flexible Probabilistic Hybrid Temporal Fusion Network with Random Period Mask (PHTFNet-RPM). This model is designed to forecast future multi-day disease incidences and indices. It incorporates a hybrid input structure with RPM to handle configurable static management variables and time-series data of weather factors and disease metrics, using the RPM to simulate diverse absences of historical observations. The model's internal hierarchically aggregated modules learn cross-variable and cross-temporal feature representations to model the complex non-linear relationships. Furthermore, probabilistic theory-based uncertainty quantification is designed to enhance the model's credibility and reliability.
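The Random Period Mask can be read as hiding a random contiguous window of the input history during training, so the model learns to cope with gaps in observations. The abstract does not specify the mask's exact mechanics, so the following is an interpretation of that idea, not the paper's implementation:

```python
import random

def random_period_mask(series, max_len, rng):
    # Replace one random contiguous window with None to mimic a run of
    # missing historical observations; returns the masked copy plus the
    # window so a training loop could also supply an availability flag.
    n = len(series)
    length = rng.randint(1, max_len)
    start = rng.randint(0, n - length)
    masked = list(series)
    for i in range(start, start + length):
        masked[i] = None
    return masked, start, length

rng = random.Random(42)
masked, start, length = random_period_mask(list(range(12)), 4, rng)
```

Training against many such randomly masked views is what lets the deployed model keep producing forecasts "even when relying solely on historical weather observations", as the Discussion notes.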
Results: The proposed PHTFNet-RPM was validated using a large-scale time-series dataset of tobacco root diseases, organized from 20-year meteorological and disease survey records in Chuxiong Prefecture, Yunnan Province. Extensive comparative experiments demonstrated that our model achieves a 4.44%-16.43% lower mean absolute error (MAE) than existing models (including LR, SVR, CNN-LSTM, and LSTM-Attention).
Discussion: The results confirm that the model can reliably forecast disease progression trends under different configurations, even when relying solely on historical weather observations. The integration of uncertainty quantification provides a robust tool for assessing prediction reliability, offering significant practical value for disease management.
"PHTFNet-RPM: a probabilistic hybrid network with RPM for tobacco root disease forecasting." Frontiers in Big Data, 8:1705587. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12640811/pdf/
Pub Date: 2025-11-04 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1659757
Nida Nasir, Hussam Al Hamadi
Introduction: As cyber-physical systems become increasingly virtualized, digital twins have emerged as essential components for real-time monitoring, simulation, and control. However, their growing complexity and exposure to dynamic network environments make them vulnerable to sophisticated cyber threats. Traditional rule-based and machine-learning-based security models often fail to adapt in real time to evolving attack patterns, particularly in decentralized and resource-constrained settings.
Methods: This study introduces the Neuromorphic Cyber-Twin (NCT), a brain-inspired architectural framework that integrates spiking neural networks (SNNs) and event-driven cognition to enhance adaptive cyber defense. The NCT leverages neuromorphic principles such as sparse coding, temporal encoding, and spike-timing-dependent plasticity (STDP) to transform telemetry data from the digital-twin layer into spike-based sensory inputs. A layered cognitive architecture continuously monitors behavioral deviations, infers anomalies, and autonomously adapts its defensive responses in alignment with system dynamics.
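Spike-timing-dependent plasticity, one of the neuromorphic principles named above, has a canonical pairwise form: a synapse is strengthened when the presynaptic spike precedes the postsynaptic one and weakened otherwise, with both effects decaying exponentially in the spike-time gap. The NCT's exact learning rule is not given in the abstract; this sketch shows the textbook version with illustrative parameter values:

```python
import math

def stdp_delta_w(dt_ms, a_plus=0.01, a_minus=0.012, tau_ms=20.0):
    # dt_ms = t_post - t_pre. Positive dt: pre fired before post, so the
    # synapse is potentiated; negative dt: depressed. Magnitude decays
    # exponentially with |dt| on timescale tau_ms.
    if dt_ms > 0:
        return a_plus * math.exp(-dt_ms / tau_ms)
    return -a_minus * math.exp(dt_ms / tau_ms)
```

In an anomaly-detection setting, a rule like this lets synapses that consistently predict normal telemetry patterns strengthen online, so deviations from learned timing structure stand out without retraining.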
Results: Lightweight prototype simulations demonstrate the feasibility of NCT-based event-driven anomaly detection and adaptive defense. The results highlight advantages in low-latency detection, contextual awareness, and energy efficiency compared with conventional machine-learning models.
Discussion: The NCT framework represents a biologically inspired paradigm for scalable, self-evolving cybersecurity in virtualized ecosystems. Potential applications include infrastructure monitoring, autonomous transportation, and industrial control systems. Comprehensive benchmarking and large-scale validation are identified as future research directions.
"Towards the neuromorphic Cyber-Twin: an architecture for cognitive defense in digital twin ecosystems." Frontiers in Big Data, 8:1659757. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12623207/pdf/
Pub Date: 2025-10-31 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1579332
Karl Vachuska
The advent of widely available cell phone mobility data in the United States has rapidly expanded the study of everyday mobility patterns in social science research. A wide range of existing literature finds ambient population (e.g., visitor) estimates of an area to be predictive of crime. Much of this past research frames neighborhood visitor flows in predictive terms without indicating or implying a causal effect. This brief research report addresses that gap by explicitly formulating the effect of visitors in counterfactual terms, using two causal inference approaches: conventional two-way fixed effects and a novel instrumental variable design based on street closures. Using high-resolution mobility and crime data from New York City for 2019, I estimate the additive effect of visitors on multiple measures of criminal activity. While two-way fixed effects models show a significant effect of visitors on a wide array of crime forms, instrumental variable estimates indicate no statistically significant causal impact, with large standard errors indicating substantial uncertainty in visitors' effect on crime rates.
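With a single instrument (here, street closures shifting visitor flows), the instrumental variable estimate reduces to the Wald/2SLS ratio cov(z, y) / cov(z, x). A minimal sketch with toy numbers, not the paper's data:

```python
def iv_slope(z, x, y):
    # Wald/2SLS estimate with a single instrument z: cov(z, y) / cov(z, x).
    # Valid only if z shifts x (relevance) and affects y through x alone
    # (the exclusion restriction).
    n = len(z)
    mz, mx, my = sum(z) / n, sum(x) / n, sum(y) / n
    cov_zy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    cov_zx = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x))
    return cov_zy / cov_zx

# Toy data: the instrument moves x (visitors) by 2 units and y (crime)
# by 2 units, implying a unit causal slope of y on x.
z = [0, 1, 0, 1]
x = [2.0, 4.0, 2.0, 4.0]
y = [1.0, 3.0, 1.0, 3.0]
```

The report's headline finding corresponds to this ratio having a confidence interval wide enough to include zero: the instrument isolates exogenous variation, but at the cost of statistical precision.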
"Urban mobility and crime: causal inference using street closures as an instrumental variable." Frontiers in Big Data, 8:1579332. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12615182/pdf/
Pub Date: 2025-10-24 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1604887
Alexander Grote, Anuja Hariharan, Christof Weinhardt
Introduction: The analysis of discrete sequential data, such as event logs and customer clickstreams, is often challenged by the vast number of possible sequential patterns. This complexity makes it difficult to identify meaningful sequences and derive actionable insights.
Methods: We propose a novel feature selection algorithm that integrates unsupervised sequential pattern mining with supervised machine learning. Unlike existing interpretable machine learning methods, we determine important sequential patterns during the mining process, eliminating the need for post-hoc classification to assess their relevance. Compared to existing interestingness measures, we introduce a local, class-specific interestingness measure that is inherently interpretable.
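One simple form a local, class-specific interestingness measure can take is the pattern's support within the target class minus its support elsewhere. The paper's actual measure is not reproduced in the abstract, so the score below is an illustrative stand-in:

```python
def contains_pattern(seq, pattern):
    # True if `pattern` occurs as an ordered (not necessarily
    # contiguous) subsequence of `seq`.
    it = iter(seq)
    return all(p in it for p in pattern)

def class_interestingness(sequences, labels, pattern, target):
    # Support of the pattern within the target class minus its support
    # in all other classes; +1 means perfectly class-discriminative.
    in_cls = [s for s, l in zip(sequences, labels) if l == target]
    out_cls = [s for s, l in zip(sequences, labels) if l != target]
    sup_in = sum(contains_pattern(s, pattern) for s in in_cls) / len(in_cls)
    sup_out = sum(contains_pattern(s, pattern) for s in out_cls) / max(len(out_cls), 1)
    return sup_in - sup_out

# Toy clickstreams with churn labels.
clicks = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["c"]]
churned = [1, 1, 0, 0]
score = class_interestingness(clicks, churned, ["a", "c"], target=1)
```

A score like this is readable directly as "how much more common is this pattern among churners", which is the sense in which such a measure is inherently interpretable.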
Results: We evaluated the algorithm on three diverse datasets - churn prediction, malware sequence analysis, and a synthetic dataset - covering different sizes, application domains, and feature complexities. Our method achieved classification performance comparable to established feature selection algorithms while maintaining interpretability and reducing computational costs.
Discussion: This study demonstrates a practical and efficient approach for uncovering important sequential patterns in classification tasks. By combining interpretability with competitive predictive performance, our algorithm provides practitioners with an interpretable and efficient alternative to existing methods, paving the way for new advances in sequential data analysis.
"Finding the needle in the haystack-An interpretable sequential pattern mining method for classification problems." Frontiers in Big Data, 8:1604887. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12604564/pdf/
Pub Date: 2025-10-20 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1623883
Qian Xie, Junsheng Yan, Zhenhua Dai, Wengang Du, Xuefei Wu
The rapid advancement of artificial intelligence (AI) and machine learning (ML) technologies has opened up novel avenues for predicting coal and gas outbursts in coal mines. This study proposes a novel prediction framework that integrates advanced AI methodologies through a multi-model fusion strategy based on ensemble learning and model stacking. The proposed model leverages the diverse data interpretation capabilities and distinct training mechanisms of various algorithms, thereby capitalizing on the complementary strengths of each constituent learner. Specifically, a stacking-based ensemble model is constructed, incorporating Support Vector Machines (SVM), Random Forests (RF), and k-Nearest Neighbors (KNN) as base learners. An attention mechanism is then employed to adaptively weight the outputs of these base learners, thereby harnessing their complementary strengths. The meta-learner, primarily built upon the XGBoost algorithm, integrates these weighted outputs to generate the final prediction. The model's performance is rigorously evaluated using real-world coal and gas outburst data collected from a mine in Pingdingshan, China, with evaluation metrics including the F1-score and other standard classification indicators. The results reveal that individual models, such as XGBoost, SVM, and RF, can effectively quantify the importance of input features using their inherent mechanisms. Furthermore, the ensemble model significantly outperforms single-model approaches, particularly when the base learners are both strong and mutually uncorrelated. The proposed ensemble framework achieves a markedly higher F1-score, demonstrating its robustness and effectiveness in the complex task of coal and gas outburst prediction.
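The attention step described above amounts to softmax-normalizing learned scores into weights over the base learners' outputs before the meta-learner sees them. A minimal sketch with stand-in probabilities and scores (the paper's scoring network is not reproduced here):

```python
import math

def softmax(scores):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_combine(base_probs, scores):
    # Adaptively weight base-learner outputs; in the full stacking
    # pipeline the weighted values would feed the XGBoost meta-learner.
    weights = softmax(scores)
    return sum(w * p for w, p in zip(weights, base_probs))

# Stand-in outburst probabilities from SVM, RF, and KNN base learners,
# with stand-in attention scores favoring the first model.
p = attention_combine([0.9, 0.6, 0.3], [2.0, 1.0, 0.5])
```

Because the weights are data-dependent rather than fixed, the ensemble can lean on whichever base learner is most reliable for a given input, which is where the gain over static averaging comes from.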
Title: "Study on coal and gas outburst prediction technology based on multi-model fusion."
Frontiers in Big Data, 8:1623883. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12580147/pdf/
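The stacking design described above (heterogeneous base learners whose out-of-fold predictions feed a boosted-tree meta-learner) can be sketched with scikit-learn. This is an illustrative reconstruction, not the authors' code: `GradientBoostingClassifier` stands in for XGBoost, the attention-based weighting of base-learner outputs is omitted, and the data are synthetic stand-ins for the mine's monitoring features.

```python
# Stacking sketch: SVM, RF, and KNN base learners; boosted trees as meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for outburst indicators (gas pressure, gas content, etc.).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

base_learners = [
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=GradientBoostingClassifier(random_state=0),
    stack_method="predict_proba",  # meta-learner sees class probabilities
    cv=5,                          # out-of-fold predictions avoid leakage
)
stack.fit(X_train, y_train)
f1 = f1_score(y_test, stack.predict(X_test))
print(f"stacked F1: {f1:.3f}")
```

The `cv=5` setting matters: the meta-learner is trained on out-of-fold base-learner predictions, which is what makes stacking more than a simple vote.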
Pub Date: 2025-10-20. eCollection Date: 2025-01-01. DOI: 10.3389/fdata.2025.1600540
Jianxin Li, Ruchun Jia, Ning Xiang, Yizhun Tian
Introduction: Traditional operation and maintenance decision algorithms often ignore the analysis of data source security, making them highly susceptible to noise, time-consuming in execution, and lacking in rationality.
Methods: In this study, we design an automated operation and maintenance decision algorithm based on data source security analysis. A multi-angle learning algorithm establishes a noise data model, introduces relaxation variables, and compares sharing factors against noise data characteristics to determine whether a data source is secure. Taking ideal power shortage and minimum maintenance cost as the objective function, we construct a classical particle swarm optimization model and derive expressions for particle search velocity and position. To avoid local optima, a niche mechanism is incorporated: the collected automated data are treated as the population, a suitable number of iterations is set, individual fitness values are stored, and the optimal state is reached through a continuous iterative update strategy.
Results: Experimental results show that the proposed strategy can shorten operation and maintenance time, enhance the rationality of decision-making, improve algorithm convergence, and avoid falling into local optima.
Discussion: In addition, fault-tolerant analysis is performed on data source security, effectively eliminating bad data, preventing interference from malicious data, and further improving convergence performance.
Title: "Research on fault-tolerant decision algorithm for data security automation."
Frontiers in Big Data, 8:1600540. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12580102/pdf/
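The particle swarm update described in the Methods (velocity driven by inertia plus pulls toward each particle's personal best and the swarm's global best) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the objective is a stand-in sphere function rather than the power-shortage/maintenance-cost objective, and the niche mechanism is omitted.

```python
# Minimal particle swarm optimization (PSO) sketch.
import numpy as np

rng = np.random.default_rng(0)

def sphere(x):
    """Stand-in objective; the paper minimizes power shortage and maintenance cost."""
    return float(np.sum(x ** 2))

def pso(objective, dim=4, n_particles=30, iters=200,
        w=0.7, c1=1.5, c2=1.5, bound=5.0):
    pos = rng.uniform(-bound, bound, (n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()                                   # personal bests
    pbest_val = np.array([objective(p) for p in pos])
    g = pbest[np.argmin(pbest_val)].copy()               # global best
    g_val = pbest_val.min()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # velocity: inertia + cognitive pull (pbest) + social pull (gbest)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos = np.clip(pos + vel, -bound, bound)
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        if vals.min() < g_val:
            g, g_val = pos[vals.argmin()].copy(), vals.min()
    return g, g_val

best, best_val = pso(sphere)
print(f"best value: {best_val:.6f}")
```

A niche mechanism, as the paper uses it, would additionally penalize or re-seed particles that crowd the same region, preserving diversity so the swarm does not collapse onto a single local optimum.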
Pub Date: 2025-10-17. eCollection Date: 2025-01-01. DOI: 10.3389/fdata.2025.1615788
Hikmat Ullah Khan, Anam Naz, Fawaz Khaled Alarfaj, Naif Almusallam
The mental health of students plays an important role in their overall wellbeing and academic performance. Growing pressure from academic work, co-curricular activities such as sports, and personal challenges highlights the need for modern methods of monitoring mental health. Traditional approaches, such as self-reported surveys and psychological evaluations, can be time-consuming and subject to bias. With advances in artificial intelligence (AI), particularly in natural language processing (NLP), sentiment analysis has emerged as an effective technique for identifying mental health patterns in textual data. However, analyzing students' mental health remains challenging due to the intensity of emotional expressions, linguistic variation, and context-dependent sentiment. In this study, our primary objective was to investigate the mental health of students by conducting sentiment analysis using advanced deep learning models. To accomplish this task, state-of-the-art transformer-based language models, such as RoBERTa (a robustly optimized BERT approach), RoBERTa-Large, and ELECTRA, were used for empirical analysis. RoBERTa-Large, an expanded architecture derived from Google's BERT, captures complex patterns and performs more effectively on various NLP tasks. Among the applied models, RoBERTa-Large achieved the highest accuracy of 97%, while ELECTRA yielded 91% accuracy on a multi-class task with seven diverse mental health status labels. These results demonstrate the potential of such approaches for predicting students' mental health, particularly in relation to the effects of academic and physical activities.
Title: "Analyzing student mental health with RoBERTa-Large: a sentiment analysis and data analytics approach."
Frontiers in Big Data, 8:1615788. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12575187/pdf/
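The study fine-tunes RoBERTa-Large, which requires the pretrained model and substantial compute; as a lightweight illustration of the same seven-way text-classification setup, the sketch below uses a classical TF-IDF + logistic regression baseline instead. The label names and example texts are hypothetical stand-ins, not drawn from the study's dataset.

```python
# Seven-class mental-health text classification, sketched with a classical
# baseline (the paper itself uses RoBERTa-Large, not this pipeline).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical seven status labels, mirroring the paper's multi-class setup.
LABELS = ["normal", "anxiety", "depression", "stress",
          "suicidal", "bipolar", "personality_disorder"]

train_texts = [
    "I feel calm and enjoyed my classes today",
    "Slept well and finished my assignments on time",
    "I am so worried about the exam I cannot stop shaking",
    "My heart races every time I think about the presentation",
    "Nothing feels worth doing anymore, I just stay in bed",
    "I feel empty and hopeless most days",
    "Deadlines are piling up and I feel completely overwhelmed",
    "Too many commitments, I am stretched far too thin",
    "I keep thinking about ending it all",
    "I do not want to be alive anymore",
    "Last week I felt unstoppable, now I can barely move",
    "My moods swing from euphoric highs to crushing lows",
    "I push everyone away and then panic when they leave",
    "My sense of who I am changes from day to day",
]
train_labels = ["normal", "normal", "anxiety", "anxiety",
                "depression", "depression", "stress", "stress",
                "suicidal", "suicidal", "bipolar", "bipolar",
                "personality_disorder", "personality_disorder"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
pred = clf.predict(["I am terrified about tomorrow's exam"])[0]
print(pred)
```

A transformer such as RoBERTa-Large replaces the TF-IDF features with contextual embeddings, which is what lets it handle the linguistic variation and context-dependent sentiment the abstract highlights.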