Ying Wang, Jinchu Dong, Yunchi Zhou, Yinghao Cheng, Xiaoli Zhao, Willie J. G. M. Peijnenburg, Martina G. Vijver, Kenneth M. Y. Leung, Wenhong Fan, Fengchang Wu
Title: Addressing the Data Scarcity Problem in Ecotoxicology via Small Data Machine Learning Methods
DOI: 10.1021/acs.est.5c00510 (https://doi.org/10.1021/acs.est.5c00510)
Journal: Environmental Science & Technology
Publication date: 2025-03-20
Publication type: Journal Article
Impact factor: 10.8 (JCR Q1, Engineering, Environmental)
Citations: 0
Abstract
Figure 1. Workflow for the application of small data machine learning (SDML) methods in ecotoxicology.

Step 1. Data collection and analysis: Acquire quality ecotoxicological data, physicochemical properties of chemical and/or nanomaterial substances, and species information through toxicity experiments, online databases, and pertinent literature. Preprocess the data according to the Findability, Accessibility, Interoperability, and Reuse (FAIR) Principles or the Klimisch scoring system for data quality. (11)

Step 2. Choosing the right data augmentation method or learning strategy: Choose the machine learning (ML) methods best suited to small data sets based on the research objective (classification or regression). For regression tasks, generating virtual data with the synthetic minority oversampling technique (SMOTE) and Gaussian mixture models (GMM) is recommended, followed by filtering based on established data selection principles. For classification problems, strategies that strengthen learning across multiple tasks, such as meta-learning and multi-task learning, are advised for optimizing parameters.

Step 3. ML algorithm selection and modeling: Select ML modeling algorithms that are better suited to small data samples and can quickly capture data characteristics, such as support vector machines (SVM), random forests (RF), gradient boosting machines (GBM), decision trees, and extreme gradient boosting (XGBoost).

Step 4. Model performance evaluation and interpretability: Test model performance and credibility using internal and external validation methods, and use the increase in mean square error (MSE), SHapley Additive exPlanations (SHAP), partial dependence plots (PDP), and local interpretable model-agnostic explanations (LIME) to make the model satisfactorily interpretable.

This workflow has the potential to predict ecotoxicity across the practically unlimited combinations of chemicals or nanomaterials and species, for which in vivo testing is tedious and practically infeasible.
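The virtual-data generation described in Step 2 can be illustrated with a minimal sketch. It assumes a SMOTE-style interpolation adapted to a continuous target, which is an illustrative choice and not necessarily the authors' procedure. The function names, the toy descriptor values, and the near-duplicate filter (a stand-in for the unspecified "data selection principles") are all hypothetical.

```python
import math
import random


def smote_like_regression(X, y, n_new, k=3, seed=0):
    """Create n_new virtual (features, target) pairs by interpolating
    between a random real sample and one of its k nearest neighbours."""
    rng = random.Random(seed)
    virtual_X, virtual_y = [], []
    for _ in range(n_new):
        i = rng.randrange(len(X))
        # Indices of the k real samples closest to sample i (excluding i itself)
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        j = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        # Interpolate features and target with the same weight
        virtual_X.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
        virtual_y.append(y[i] + lam * (y[j] - y[i]))
    return virtual_X, virtual_y


def filter_virtual(X, virtual_X, virtual_y, min_dist=1e-3):
    """Hypothetical filter: drop virtual points that nearly duplicate a real one."""
    kept = [
        (vx, vy)
        for vx, vy in zip(virtual_X, virtual_y)
        if min(math.dist(vx, x) for x in X) >= min_dist
    ]
    return [vx for vx, _ in kept], [vy for _, vy in kept]


# Toy data (made up): two descriptors per substance and a toxicity endpoint
X_small = [[0.5, 1.2], [0.7, 1.0], [1.5, 2.1], [1.8, 2.4], [3.0, 0.4]]
y_small = [1.1, 1.3, 2.0, 2.2, 0.6]

vX, vy = smote_like_regression(X_small, y_small, n_new=10, k=2, seed=42)
vX, vy = filter_virtual(X_small, vX, vy)
print(len(vX), "virtual samples kept")
```

Because every virtual point lies on a line segment between two real samples, the augmented set stays inside the range of the measured data. That keeps the sketch conservative, but it also means it cannot extend the model's applicability domain; a GMM-based generator would sample from a fitted density instead.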
SDML models are more sensitive to anomalous data (which are more common in toxicological trials), so toxicological data collected from databases or the literature must be processed with particular care. When the task is highly variable or the data distribution shifts, SDML models often have difficulty learning and predicting effectively across different domains. SDML methods are therefore better suited to prediction tasks with a single goal and low complexity. SDML methods also pose significant scalability challenges for large-scale applications because of their high model complexity and computational costs. As with the ECOlogical Structure–Activity Relationship (ECOSAR) predictive model of the U.S. Environmental Protection Agency, results generated by SDML models should be used primarily for screening-level assessments of ecological risk, given the insufficiency of quality experimental data, and professional judgment is required to determine the appropriateness and applicability of the predictions.

Dr. Fengchang Wu is the director of the State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences (CRAES). His research interests mainly cover environmental toxicology and chemistry through multidisciplinary approaches, including environmental biogeochemistry, chemistry, toxicology, and risk assessment. He has made remarkable contributions to the protection of the aquatic ecosystem in China through his advancement of environmental science, standards, technologies, and engineering. Dr. Wu was elected an academician of the Chinese Academy of Engineering in 2017 in recognition of his work on developing the environmental criteria/standards system in China.

Dr. Wenhong Fan is a Professor in the School of Materials Science and Engineering at Beihang University.
Her research focuses on the environmental behavior, ecological effects, and remediation of metal and nanoparticle pollutants in aqueous environments, as well as the exposure, toxicity, and risk assessment of emerging pollutants. To date, she has published more than 150 international peer-reviewed papers. She serves as an Associate Editor of Aquatic Toxicology and an Editor of Bulletin of Environmental Contamination and Toxicology. She received a 2022 Carbon Research Best Paper award and a First Class Beijing Science and Technology Award from the Beijing Government.

This work was supported by the National Key R&D Program of China (2022YFC3204800), the National Natural Science Foundation of China (42177240, 42330710, and 42430714), the Beijing Natural Science Foundation (8242033), and the Fundamental Research Funds for the Central Universities. M.G.V. received funding from the European Union’s ERC-consolidator Grant Agreement 101002123.

This article references 15 other publications. This article has not yet been cited by other publications.
About the journal:
Environmental Science & Technology (ES&T) is an academic and technical journal co-sponsored by the Hubei Provincial Environmental Protection Bureau and the Hubei Provincial Academy of Environmental Sciences.
Environmental Science & Technology (ES&T) is listed as a Chinese core journal and as a source journal for Chinese scientific papers, the Chinese Science Citation Database, and the Chinese Academic Journal Comprehensive Evaluation Database. The journal focuses on the academic field of environmental protection and features articles on environmental protection and related technical advances.