Pub Date: 2025-12-02 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1677331
Steve Nwaiwu
The successful application of large-scale transformer models in Natural Language Processing (NLP) is often hindered by the substantial computational cost and data requirements of full fine-tuning. This challenge is particularly acute in low-resource settings, where standard fine-tuning can lead to catastrophic overfitting and model collapse. To address this, Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a promising solution. However, a direct comparative analysis of their trade-offs under unified low-resource conditions is lacking. This study provides a rigorous empirical evaluation of three prominent PEFT methods: Low-Rank Adaptation (LoRA), Infused Adapter by Inhibiting and Amplifying Inner Activations (IA3), and a Representation Fine-Tuning (ReFT) strategy. Using a DistilBERT base model on low-resource versions of the AG News and Amazon Reviews datasets, the present work compares these methods against a full fine-tuning baseline across accuracy, F1 score, trainable parameters, and GPU memory usage. The findings reveal that while all PEFT methods dramatically outperform the baseline, LoRA consistently achieves the highest F1 scores (0.909 on Amazon Reviews). Critically, ReFT delivers nearly identical performance (~98% of LoRA's F1 score) while training only ~3% of the parameters, establishing it as the most efficient method. This research demonstrates that PEFT is not merely an efficiency optimization, but a necessary tool for robust generalization in data-scarce environments, providing practitioners with a clear guide to navigate the performance-efficiency trade-off. By unifying these evaluations under controlled conditions, this study advances beyond fragmented prior research and offers a systematic framework for selecting PEFT strategies.
{"title":"Parameter-efficient fine-tuning for low-resource text classification: a comparative study of LoRA, IA<sup>3</sup>, and ReFT.","authors":"Steve Nwaiwu","doi":"10.3389/fdata.2025.1677331","DOIUrl":"https://doi.org/10.3389/fdata.2025.1677331","url":null,"abstract":"<p><p>The successful application of large-scale transformer models in Natural Language Processing (NLP) is often hindered by the substantial computational cost and data requirements of full fine-tuning. This challenge is particularly acute in low-resource settings, where standard fine-tuning can lead to catastrophic overfitting and model collapse. To address this, Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a promising solution. However, a direct comparative analysis of their trade-offs under unified low-resource conditions is lacking. This study provides a rigorous empirical evaluation of three prominent PEFT methods: Low-Rank Adaptation (LoRA), Infused Adapter by Inhibiting and Amplifying Inner Activations (IA<sup>3</sup>), and a Representation Fine-Tuning (ReFT) strategy. Using a DistilBERT base model on low-resource versions of the AG News and Amazon Reviews datasets, the present work compares these methods against a full fine-tuning baseline across accuracy, F1 score, trainable parameters, and GPU memory usage. The findings reveal that while all PEFT methods dramatically outperform the baseline, LoRA consistently achieves the highest F1 scores (0.909 on Amazon Reviews). Critically, ReFT delivers nearly identical performance (~98% of LoRA's F1 score) while training only ~3% of the parameters, establishing it as the most efficient method. This research demonstrates that PEFT is not merely an efficiency optimization, but a necessary tool for robust generalization in data-scarce environments, providing practitioners with a clear guide to navigate the performance-efficiency trade-off. By unifying these evaluations under controlled conditions, this study advances beyond fragmented prior research and offers a systematic framework for selecting PEFT strategies.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"8 ","pages":"1677331"},"PeriodicalIF":2.4,"publicationDate":"2025-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12705377/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145776710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-27 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1697478
Urvashi Khekare, Rajay Vedaraj I S
Accurate estimation of the remaining driving range after charging is critical for electric vehicles, as it has direct implications for drivers' range anxiety and thus for large-scale EV adoption. Traditional machine learning approaches to range prediction rely heavily on large amounts of vehicle-specific data and are therefore neither scalable nor adaptable. In this paper, a deep reinforcement learning framework is proposed, utilizing big data from 103 EV models produced by 31 manufacturers. This dataset combines several operational variables (state of charge, voltage, current, temperature, vehicle speed, and discharge characteristics) that reflect highly dynamic driving states. Outliers in this heterogeneous data were first reduced through a hybrid fuzzy k-means clustering approach, enhancing the quality of the training data. A pathfinder meta-heuristic was then applied to optimize the reward function of the deep Q-learning algorithm, accelerating convergence and improving accuracy. Experimental validation reveals that the proposed framework halves the range error to [-0.28, 0.40] for independent testing and [-0.23, 0.34] under 10-fold cross-validation. The proposed approach outperforms traditional machine learning and transformer-based approaches in Mean Absolute Error (by 61.86% and 4.86%, respectively) and in Root Mean Square Error (by 6.36% and 3.56%, respectively). This highlights the robustness of the proposed framework under complex, dynamic EV data and its ability to enable scalable, intelligent range prediction, supporting innovation in infrastructure and climate-conscious mobility.
{"title":"Adaptive deep Q-networks for accurate electric vehicle range estimation.","authors":"Urvashi Khekare, Rajay Vedaraj I S","doi":"10.3389/fdata.2025.1697478","DOIUrl":"10.3389/fdata.2025.1697478","url":null,"abstract":"<p><p>It is critical that electric vehicles estimate the remaining driving range after charging, as this has direct implications for drivers' range anxiety and thus for large-scale EV adoption. Traditional approaches to predicting range using machine learning rely heavily on large amounts of vehicle-specific data and therefore are not scalable or adaptable. In this paper, a deep reinforcement learning framework is proposed, utilizing big data from 103 different EV models from 31 different manufacturers. This dataset combines several operational variables (state of charge, voltage, current, temperature, vehicle speed, and discharge characteristics) that reflect highly dynamic driving states. Some outliers in this heterogeneous data were reduced through a hybrid fuzzy k-means clustering approach, enhancing the quality of the data used in training. Secondly, a pathfinder meta-heuristics approach has been applied to optimize the reward function of the deep Q-learning algorithm, and thus accelerate convergence and improve accuracy. Experimental validation reveals that the proposed framework halves the range error to [-0.28, 0.40] for independent testing and [-0.23, 0.34] at 10-fold cross-validation. The proposed approach outperforms traditional machine learning and transformer-based approaches in Mean Absolute Error (outperforming by 61.86% and 4.86%, respectively) and in Root Mean Square Error (outperforming by 6.36% and 3.56%, respectively). This highlights the robustness of the proposed framework under complex, dynamic EV data and its ability to enable scalable intelligent range prediction, which engenders innovation in infrastructure and climate conscious mobility.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"8 ","pages":"1697478"},"PeriodicalIF":2.4,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12695611/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145758301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-24 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1704189
Kexin Ning, Qingguo Lü, Xiaofeng Liao
In this study, we focus on investigating a nonsmooth convex optimization problem involving the l1-norm under a non-negative constraint, with the goal of developing an inverse-problem solver for image deblurring. Research focused on solving this problem has garnered extensive attention and has had a significant impact on the field of image processing. However, existing optimization algorithms often suffer from overfitting and slow convergence, particularly when working with ill-conditioned data or noise. To address these challenges, we propose a momentum-based proximal scaled gradient projection (M-PSGP) algorithm. The M-PSGP algorithm, which is based on the proximal operator and scaled gradient projection (SGP) algorithm, integrates an improved Barzilai-Borwein-like step-size selection rule and a unified momentum acceleration framework to achieve a balance between performance optimization and convergence rate. Numerical experiments demonstrate the superiority of the M-PSGP algorithm over several seminal algorithms in image deblurring tasks, highlighting the significance of our improved step-size strategy and momentum-acceleration framework in enhancing convergence properties.
{"title":"M-PSGP: a momentum-based proximal scaled gradient projection algorithm for nonsmooth optimization with application to image deblurring.","authors":"Kexin Ning, Qingguo Lü, Xiaofeng Liao","doi":"10.3389/fdata.2025.1704189","DOIUrl":"10.3389/fdata.2025.1704189","url":null,"abstract":"<p><p>In this study, we focus on investigating a nonsmooth convex optimization problem involving the <i>l</i> <sub>1</sub>-norm under a non-negative constraint, with the goal of developing an inverse-problem solver for image deblurring. Research focused on solving this problem has garnered extensive attention and has had a significant impact on the field of image processing. However, existing optimization algorithms often suffer from overfitting and slow convergence, particularly when working with ill-conditioned data or noise. To address these challenges, we propose a momentum-based proximal scaled gradient projection (M-PSGP) algorithm. The M-PSGP algorithm, which is based on the proximal operator and scaled gradient projection (SGP) algorithm, integrates an improved Barzilai-Borwein-like step-size selection rule and a unified momentum acceleration framework to achieve a balance between performance optimization and convergence rate. Numerical experiments demonstrate the superiority of the M-PSGP algorithm over several seminal algorithms in image deblurring tasks, highlighting the significance of our improved step-size strategy and momentum-acceleration framework in enhancing convergence properties.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"8 ","pages":"1704189"},"PeriodicalIF":2.4,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12682648/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145716468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-21 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1677167
Shoichiro Inokuchi, Takumi Tajima
Introduction: Migraine is a prevalent neurological disorder with a substantial socioeconomic burden, underscoring the need for continued identification of therapeutic targets. Given the significant role of genetic factors in migraine pathogenesis, a genetic-based approach is considered effective for identifying potential therapeutic targets. This study aimed to identify candidate treatments for migraine by integrating genome-wide association study (GWAS) data, perturbagen profiles, and a large-scale claims database.
Methods: We used published GWAS data to impute disease-specific gene expression profiles using a transcriptome-wide association study approach. The imputed gene signatures were cross-referenced with perturbagen signatures from the LINCS Connectivity Map to identify candidate compounds capable of reversing the disease-associated gene expression. A real-world claims database was subsequently utilized to assess the clinical efficacy of the identified perturbagens on acute migraine, employing a cohort study design and mixed-effects log-linear models with the frequency of prescribed acute migraine medications as the outcome.
Results: Eighteen approved drugs were identified as candidate therapeutics based on the perturbagen profiles. Real-world analysis using the claims database demonstrated potential inhibitory effects of metformin (relative risk [RR]: 0.81; 95% confidence interval [CI]: 0.77-0.86), statins (RR: 0.94; 95% CI: 0.92-0.96), thiazolidines (RR: 0.84; 95% CI: 0.73-0.97), and angiotensin receptor neprilysin inhibitors (RR: 0.69; 95% CI: 0.61-0.77) on migraine attacks.
Conclusion: This multidisciplinary approach highlights a cost-effective framework for drug repositioning in migraine treatment by integrating genetic, pharmacological, and real-world clinical databases.
{"title":"Integrated analysis for drug repositioning in migraine using genetic evidence and claims database.","authors":"Shoichiro Inokuchi, Takumi Tajima","doi":"10.3389/fdata.2025.1677167","DOIUrl":"10.3389/fdata.2025.1677167","url":null,"abstract":"<p><strong>Introduction: </strong>Migraine is a prevalent neurological disorder with a substantial socioeconomic burden, underscoring the need for continued identification of therapeutic targets. Given the significant role of genetic factors in migraine pathogenesis, a genetic-based approach is considered effective for identifying potential therapeutic targets. This study aimed to identify candidate treatments for migraine by integrating genome-wide association study (GWAS) data, perturbagen profiles, and a large-scale claims database.</p><p><strong>Methods: </strong>We used published GWAS data to impute disease-specific gene expression profiles using a transcriptome-wide association study approach. The imputed gene signatures were cross-referenced with perturbagen signatures from the LINCS Connectivity Map to identify candidate compounds capable of reversing the disease-associated gene expression. A real-world claims database was subsequently utilized to assess the clinical efficacy of the identified perturbagens on acute migraine, employing a cohort study design and mixed-effects log-linear models with the frequency of prescribed acute migraine medications as the outcome.</p><p><strong>Results: </strong>Eighteen approved drugs were identified as candidate therapeutics based on the perturbagen profiles. Real-world analysis using the claims database demonstrated potential inhibitory effects of metformin (relative risk [RR]: 0.81; 95% confidence interval [CI]: 0.77-0.86), statins (RR: 0.94; 95% CI: 0.92-0.96), thiazolidines (RR: 0.84; 95% CI: 0.73-0.97), and angiotensin receptor neprilysin inhibitors (RR: 0.69; 95% CI: 0.61-0.77) on migraine attacks.</p><p><strong>Conclusion: </strong>This multidisciplinary approach highlights a cost-effective framework for drug repositioning for migraine treatment by integrating genetic, pharmacological, and real-world clinical database.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"8 ","pages":"1677167"},"PeriodicalIF":2.4,"publicationDate":"2025-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12678156/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145702971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-20 | DOI: 10.3389/fdata.2025.1667284
Xiaobin Liang, Yonghong Deng, Yibin Wang, Hongtao Li, Weifeng Ma, Ke Wang, Junjie Ren, Ruijiao Ma, Shuai Zhang, Jiawei Liu, Wei Wu
Due to long-term use, natural disasters, and human factors, pipelines may leak or rupture, with serious consequences, so real-time monitoring and detection of pipeline leaks is of great significance. Current mainstream methods for pipeline leak monitoring mostly rely on a single signal and have significant limitations: a temperature-only signal is susceptible to environmental temperature interference and prone to misjudgment, while a vibration-only signal is affected by pipeline operating noise. To address this, this research built a distributed optical fiber system as an experimental platform for temperature and vibration monitoring, obtaining 3,530 sets of synchronized, real-time spatial-temporal temperature and vibration signals. A dual-parameter fusion residual neural network was constructed to extract characteristic signals from the raw spatial-temporal temperature and vibration data produced by this monitoring system, achieving a classification accuracy of 92.16% for pipeline leak status and a leak localization accuracy of 1 m. This addresses the insufficient feature extraction and weak anti-interference ability of single-signal monitoring: by fusing the raw temperature and vibration signals, more leakage features can be extracted. Compared with single-signal monitoring, this study therefore improves the accuracy of leak identification and localization, reduces misjudgments caused by single-signal interference, and provides a basis for pipeline leak monitoring and real-time warning in the oil industry.
Title: Intelligent leak monitoring of oil pipeline based on distributed temperature and vibration fiber signals. Frontiers in Big Data, 8:1667284. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12675210/pdf/
{"title":"Robust detection framework for adversarial threats in Autonomous Vehicle Platooning.","authors":"Stephanie Ness","doi":"10.3389/fdata.2025.1617978","DOIUrl":"https://doi.org/10.3389/fdata.2025.1617978","url":null,"abstract":"<p><strong>Introduction: </strong>The study addresses adversarial threats in Autonomous Vehicle Platooning (AVP) using machine learning.</p><p><strong>Methods: </strong>A novel method integrating active learning with RF, GB, XGB, KNN, LR, and AdaBoost classifiers was developed.</p><p><strong>Results: </strong>Random Forest with active learning yielded the highest accuracy of 83.91%.</p><p><strong>Discussion: </strong>The proposed framework significantly reduces labeling efforts and improves threat detection, enhancing AVP system security.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"8 ","pages":"1617978"},"PeriodicalIF":2.4,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12672304/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145679388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-19 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1686479
Emanuel Casmiry, Neema Mduma, Ramadhani Sinde
In the face of increasing cyberattacks, Structured Query Language (SQL) injection remains one of the most common and damaging types of web threats, accounting for over 20% of global cyberattack costs. However, due to its dynamic and variable nature, current detection methods often suffer from high false-positive rates and low accuracy. This study proposes an enhanced SQL injection detection approach using Chi-square feature selection (FS) and machine learning models. A combined dataset was assembled by merging a custom dataset with the SQLiV3.csv file from the Kaggle repository. A Jensen-Shannon Divergence (JSD) analysis revealed moderate domain variation (overall JSD = 0.5775), with class-wise divergence of 0.1340 for SQLi and 0.5320 for benign queries. Term Frequency-Inverse Document Frequency (TF-IDF) was used to convert SQL queries into feature vectors, followed by Chi-square feature selection to retain the most statistically significant features. Five classifiers, namely multinomial Naïve Bayes, support vector machine, logistic regression, decision tree, and K-nearest neighbor, were tested before and after feature selection. The results reveal that Chi-square feature selection improves classification performance across all models by reducing noise and eliminating redundant features. Notably, the Decision Tree and K-Nearest Neighbors (KNN) models, which initially performed poorly, showed substantial improvements after feature selection. The Decision Tree improved from being the second-worst performer before feature selection to the best classifier afterward, achieving the highest accuracy of 99.73%, precision of 99.72%, recall of 99.70%, F1-score of 99.71%, a false-positive rate (FPR) of 0.25%, and a misclassification rate of 0.27%. These findings highlight the crucial role of feature selection in high-dimensional data environments. Future research will investigate how feature selection impacts deep learning architectures, explore adaptive feature selection and incremental learning approaches, assess robustness against adversarial attacks, and evaluate model transferability across production web environments to ensure real-time detection reliability, establishing feature selection as a vital step in developing reliable SQL injection detection systems.
{"title":"Enhanced SQL injection detection using chi-square feature selection and machine learning classifiers.","authors":"Emanuel Casmiry, Neema Mduma, Ramadhani Sinde","doi":"10.3389/fdata.2025.1686479","DOIUrl":"https://doi.org/10.3389/fdata.2025.1686479","url":null,"abstract":"<p><p>In the face of increasing cyberattacks, Structured Query Language (SQL) injection remains one of the most common and damaging types of web threats, accounting for over 20% of global cyberattack costs. However, due to its dynamic and variable nature, the current detection methods often suffer from high false positive rates and lower accuracy. This study proposes an enhanced SQL injection detection using Chi-square feature selection (FS) and machine learning models. A combined dataset was assembled by merging a custom dataset with the SQLiV3.csv file from the Kaggle repository. A Jensen-Shannon Divergence (JSD) analysis revealed moderate domain variation (overall JSD = 0.5775), with class-wise divergence of 0.1340 for SQLi and 0.5320 for benign queries. Term Frequency-Inverse Document Frequency (TF-IDF) was used to convert SQL queries into feature vectors, followed by the Chi-square feature selection to retain the most statistically significant features. Five classifiers, namely multinomial Naïve Bayes, support vector machine, logistic regression, decision tree, and K-nearest neighbor, were tested before and after feature selection. The results reveal that Chi-square feature selection improves classification performance across all models by reducing noise and eliminating redundant features. Notably, Decision Tree and K-Nearest Neighbors (KNN) models, which initially performed poorly, showed substantial improvements after feature selection. The Decision Tree improved from being the second-worst performer before feature selection to the best classifier afterward, achieving the highest accuracy of 99.73%, precision of 99.72%, recall of 99.70%, F1-score of 99.71%, a false positive rate (FPR) of 0.25%, and a misclassification rate of 0.27%. These findings highlight the crucial role of feature selection in high-dimensional data environments. Future research will investigate how feature selection impacts deep learning architectures, adaptive feature selection, incremental learning approaches, robustness against adversarial attacks, and evaluate model transferability across production web environments to ensure real-time detection reliability, establishing feature selection as a vital step in developing reliable SQL injection detection systems.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"8 ","pages":"1686479"},"PeriodicalIF":2.4,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12672241/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145679438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-18 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1669488
Shanmin Yang, Hui Guo, Shu Hu, Bin Zhu, Ying Fu, Siwei Lyu, Xi Wu, Xin Wang
Deepfake technology represents a serious risk to safety and public confidence. While current detection approaches perform well in identifying manipulations within datasets that utilize identical deepfake methods for both training and validation, they experience notable declines in accuracy when applied to cross-dataset situations, where unfamiliar deepfake techniques are encountered during testing. To tackle this issue, we propose a Deep Information Decomposition (DID) framework to improve Cross-dataset Deepfake Detection (CrossDF). Distinct from most existing deepfake detection approaches, our framework emphasizes high-level semantic attributes instead of focusing on particular visual anomalies. More specifically, it intrinsically decomposes facial representations into deepfake-relevant and unrelated components, leveraging only the deepfake-relevant features for classification between genuine and fabricated images. Furthermore, we introduce an adversarial mutual information minimization strategy that enhances the separability between these two types of information through decorrelation learning. This significantly improves the model's robustness to irrelevant variations and strengthens its generalization capability to previously unseen manipulation techniques. Extensive experiments demonstrate the effectiveness and superiority of our proposed DID framework for cross-dataset deepfake detection. It achieves an AUC of 0.779 in cross-dataset evaluation from FF++ to CDF2 and improves the state-of-the-art AUC significantly from 0.669 to 0.802 on the diffusion-based Text-to-Image dataset.
{"title":"CrossDF: improving cross-domain deepfake detection with deep information decomposition.","authors":"Shanmin Yang, Hui Guo, Shu Hu, Bin Zhu, Ying Fu, Siwei Lyu, Xi Wu, Xin Wang","doi":"10.3389/fdata.2025.1669488","DOIUrl":"10.3389/fdata.2025.1669488","url":null,"abstract":"<p><p>Deepfake technology represents a serious risk to safety and public confidence. While current detection approaches perform well in identifying manipulations within datasets that utilize identical deepfake methods for both training and validation, they experience notable declines in accuracy when applied to cross-dataset situations, where unfamiliar deepfake techniques are encountered during testing. To tackle this issue, we propose a Deep Information Decomposition (DID) framework to improve Cross-dataset Deepfake Detection (CrossDF). Distinct from most existing deepfake detection approaches, our framework emphasizes high-level semantic attributes instead of focusing on particular visual anomalies. More specifically, it intrinsically decomposes facial representations into deepfake-relevant and unrelated components, leveraging only the deepfake-relevant features for classification between genuine and fabricated images. Furthermore, we introduce an adversarial mutual information minimization strategy that enhances the separability between these two types of information through decorrelation learning. This significantly improves the model's robustness to irrelevant variations and strengthens its generalization capability to previously unseen manipulation techniques. Extensive experiments demonstrate the effectiveness and superiority of our proposed DID framework for cross-dataset deepfake detection. It achieves an AUC of 0.779 in cross-dataset evaluation from FF++ to CDF2 and improves the state-of-the-art AUC significantly from 0.669 to 0.802 on the diffusion-based Text-to-Image dataset.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"8 ","pages":"1669488"},"PeriodicalIF":2.4,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12674592/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145679359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-18 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1690955
Anastasia Petrovna Pamova, Yuriy Aleksandrovich Vasilev, Tatyana Mikhaylovna Bobrovskaya, Anton Vyacheslavovich Vladzimirskyy, Olga Vasilyevna Omelyanskaya, Kirill Mikhailovich Arzamasov
Background: The rapid integration of artificial intelligence (AI) into mammography necessitates robust quality control methods. The lack of standardized methods for establishing decision thresholds on Receiver Operating Characteristic (ROC) curves makes it challenging to judge AI performance. This study aims to develop a method for determining the decision threshold for AI in screening mammography that ensures as many women with breast pathology as possible are detected.
Methods: Three AI models were retrospectively evaluated using a dataset of digital mammograms. The dataset consisted of screening mammography examinations obtained from 663,606 patients over the age of 40. Our method estimates the decision threshold using a novel approach to net benefit (NB) analysis. Our approach to setting the cutoff threshold was compared with the threshold determined by Youden's index using McNemar's test.
Results: Replacing the Youden index with our method across the three AI models resulted in a threefold reduction in false-positive rates, a twofold reduction in false-negative rates, and a twofold increase in true-positive rates. The sensitivity at the cutoff threshold determined by NB thus increased to a maximum of 99%, compared with a maximum of 72% at the threshold determined by Youden's index. Correspondingly, the specificity with our method decreased to a minimum of 48%, compared with a minimum of 75% with the Youden's index method.
Conclusions: We propose using AI as the initial reader together with our novel method for determining the decision threshold in screening with double reading. This approach enhances the AI sensitivity and improves timely breast cancer diagnosis.
{"title":"Implementation of a net benefit parameter in ROC curve decision thresholds for AI-powered mammography screening.","authors":"Anastasia Petrovna Pamova, Yuriy Aleksandrovich Vasilev, Tatyana Mikhaylovna Bobrovskaya, Anton Vyacheslavovich Vladzimirskyy, Olga Vasilyevna Omelyanskaya, Kirill Mikhailovich Arzamasov","doi":"10.3389/fdata.2025.1690955","DOIUrl":"10.3389/fdata.2025.1690955","url":null,"abstract":"<p><strong>Background: </strong>The rapid integration of artificial intelligence (AI) into mammography necessitates robust quality control methods. The lack of standardized methods for establishing decision thresholds on the Receiver Operating Characteristic (ROC) curves makes it challenging to judge the AI performance. This study aims to develop a method for determining the decision threshold for AI in screening mammography to ensure the widest possible population of women with a breast pathology is diagnosed.</p><p><strong>Methods: </strong>Three AI models were retrospectively evaluated using a dataset of digital mammograms. The dataset consisted of screening mammography examinations obtained from 663,606 patients over the age of 40. Our method estimates the decision threshold using a novel approach to net benefit (NB) analysis. Our approach to setting the cutoff threshold was compared with the threshold determined by Youden's index using McNemar's test.</p><p><strong>Results: </strong>Replacing the Youden index with our method across three AI models, resulted in a threefold reduction in false-positive rates, twofold reduction in false-negative rates, and twofold increase in true-positive rates. Thus, the sensitivity at the cutoff threshold determined by NB increased to 99% (maximum) compared to the sensitivity determined by Youden's index threshold (72% maximum). Correspondingly, the specificity when using our method decreased to 48% (minimum), compared to 75% (minimum) with the Youden's index method.</p><p><strong>Conclusions: </strong>We propose using AI as the initial reader together with our novel method for determining the decision threshold in screening with double reading. This approach enhances the AI sensitivity and improves timely breast cancer diagnosis.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"8 ","pages":"1690955"},"PeriodicalIF":2.4,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12669008/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145670990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-14 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdata.2025.1682984
A H M Shahariar Parvez, Md Samiul Islam, Fahmid Al Farid, Tashida Yeasmin, Md Monirul Islam, Md Shafiul Azam, Jia Uddin, Hezerul Abdul Karim
Bangla Handwritten Character Recognition (BHCR) remains challenging due to the complexity of the alphabet and variations in handwriting. In this study, we present a comparative evaluation of three deep learning architectures, Vision Transformer (ViT), VGG-16, and ResNet-50, on the CMATERdb 3.1.2 dataset comprising 24,000 images of 50 basic Bangla characters. Our work highlights the effectiveness of ViT in capturing global context and long-range dependencies, leading to improved generalization. Experimental results show that ViT achieves a state-of-the-art accuracy of 98.26%, outperforming VGG-16 (94.54%) and ResNet-50 (93.12%). We also analyze model behavior, discuss overfitting in CNNs, and provide insights into character-level misclassifications. This study demonstrates the potential of transformer-based architectures for robust BHCR and offers a benchmark for future research.
{"title":"Enhancing Bangla handwritten character recognition using Vision Transformers, VGG-16, and ResNet-50: a performance analysis.","authors":"A H M Shahariar Parvez, Md Samiul Islam, Fahmid Al Farid, Tashida Yeasmin, Md Monirul Islam, Md Shafiul Azam, Jia Uddin, Hezerul Abdul Karim","doi":"10.3389/fdata.2025.1682984","DOIUrl":"https://doi.org/10.3389/fdata.2025.1682984","url":null,"abstract":"<p><p>Bangla Handwritten Character Recognition (BHCR) remains challenging due to complex alphabets, and handwriting variations. In this study, we present a comparative evaluation of three deep learning architectures-Vision Transformer (ViT), VGG-16, and ResNet-50-on the CMATERdb 3.1.2 dataset comprising 24,000 images of 50 basic Bangla characters. Our work highlights the effectiveness of ViT in capturing global context and long-range dependencies, leading to improved generalization. Experimental results show that ViT achieves a state-of-the-art accuracy of 98.26%, outperforming VGG-16 (94.54%) and ResNet-50 (93.12%). We also analyze model behavior, discuss overfitting in CNNs, and provide insights into character-level misclassifications. This study demonstrates the potential of transformer-based architectures for robust BHCR and offers a benchmark for future research.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"8 ","pages":"1682984"},"PeriodicalIF":2.4,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12660064/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145650071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}