Large-scale groundwater quality assessments are often hindered by the limited availability of hydrochemical data. Synthetic data generation offers a way to augment small datasets; however, the reliability of these methods and their implications for predictive modeling remain underexplored in environmental studies, particularly in the context of groundwater sustainability. We systematically evaluated six approaches (bootstrap sampling, Gaussian noise perturbation, Monte Carlo sampling, SMOGN, CTGAN, and TVAE) using a groundwater quality dataset from southern India. The similarity of each synthetic dataset to the real data was assessed using the Kolmogorov–Smirnov test, the Wasserstein distance, moment differences, Pearson correlation, kernel density estimation plots, and principal component analysis. The practical utility of the synthetic data was tested by training a Random Forest model to predict total dissolved solids (TDS) from major ions. Model performance on the real dataset was assessed using R², RMSE, and MAE. Bootstrap delivered near-perfect agreement with the real data (R² = 0.999, Nash–Sutcliffe efficiency (NSE) = 0.999, RMSE = 41.5 mg L⁻¹), with SMOGN being competitive. Gaussian perturbation was acceptable, while TVAE was moderate. Monte Carlo and CTGAN performed poorly, with negative NSE values indicating performance worse than predicting the mean. SHAP-based feature importance analysis confirmed that the best-performing synthetic methods preserved the dominant hydrochemical drivers. Overall, traditional resampling approaches (Bootstrap, SMOGN) outperformed complex deep generative models on small-sample groundwater datasets. This methodology can support risk assessments by improving the accuracy of water-quality predictive models, thereby facilitating effective resource management and pollution control.
This study provides practical guidance for assessing and managing groundwater quality by recommending synthetic data augmentation strategies tailored to dataset characteristics, particularly in data-limited regions.
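
As a rough illustration of the two simplest augmentation approaches named above (bootstrap sampling and Gaussian noise perturbation), together with a two-sample Kolmogorov–Smirnov statistic and the NSE metric, the following standard-library Python sketch may help. The TDS values, the 5% noise scale, the random seed, and all function names are illustrative assumptions; they are not the study's data or implementation.

```python
import bisect
import random
import statistics


def bootstrap_sample(values, n, rng):
    """Draw n values with replacement from the observed data."""
    return [rng.choice(values) for _ in range(n)]


def gaussian_perturb(values, rel_sd, rng):
    """Add zero-mean Gaussian noise whose SD is rel_sd times the sample SD."""
    sd = statistics.stdev(values)
    return [v + rng.gauss(0.0, rel_sd * sd) for v in values]


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(
            bisect.bisect_right(a_sorted, x) / len(a_sorted)
            - bisect.bisect_right(b_sorted, x) / len(b_sorted)
        )
        for x in points
    )


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - SSE / squared deviations from the mean."""
    mean_obs = statistics.fmean(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


# Hypothetical TDS values in mg/L (made up for illustration only).
tds = [320.0, 450.0, 510.0, 780.0, 1020.0, 1340.0, 610.0, 890.0]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
boot = bootstrap_sample(tds, len(tds), rng)
noisy = gaussian_perturb(tds, 0.05, rng)  # 5% noise scale is an assumption

ks_boot = ks_statistic(tds, boot)
ks_noisy = ks_statistic(tds, noisy)

# Reference points for interpreting NSE.
nse_perfect = nse(tds, tds)
nse_mean = nse(tds, [statistics.fmean(tds)] * len(tds))
nse_bad = nse(tds, [0.0] * len(tds))

print(f"KS (bootstrap) = {ks_boot:.3f}, KS (Gaussian) = {ks_noisy:.3f}")
print(f"NSE perfect = {nse_perfect:.3f}, mean = {nse_mean:.3f}, bad = {nse_bad:.3f}")
```

Under this definition, a model that always predicts the observed mean scores an NSE of 0, so a negative NSE means the predictions are worse than that baseline. A smaller KS statistic indicates that the synthetic sample's empirical distribution lies closer to the real one.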
