Addressing the Data Scarcity Problem in Ecotoxicology via Small Data Machine Learning Methods

Environmental Science & Technology | IF 10.8 | CAS Zone 1 (Environmental Science and Ecology) | JCR Q1 (ENGINEERING, ENVIRONMENTAL) | Pub Date: 2025-03-20 | DOI: 10.1021/acs.est.5c00510
Ying Wang, Jinchu Dong, Yunchi Zhou, Yinghao Cheng, Xiaoli Zhao, Willie J. G. M. Peijnenburg, Martina G. Vijver, Kenneth M. Y. Leung, Wenhong Fan, Fengchang Wu

Abstract

Figure 1. Workflow for the application of small data machine learning (SDML) methods in ecotoxicology.

Step 1. Data collection and analysis. Acquire high-quality ecotoxicological data, physicochemical properties of chemicals and/or nanomaterials, and species information from toxicity experiments, online databases, and the pertinent literature, and preprocess the data following the Findability, Accessibility, Interoperability, and Reusability (FAIR) Principles or a data-quality scoring method such as the Klimisch system. (11)

Step 2. Choosing the right data augmentation method or learning strategy. Choose the ML methods best suited to small data sets based on the research objective (classification or regression). For regression tasks, generating virtual data with the synthetic minority oversampling technique (SMOTE) and Gaussian mixture models (GMM) is recommended, followed by filtering based on established data selection principles. For classification problems, strategies that strengthen learning across multiple tasks, such as meta-learning and multi-task learning, are advised for optimizing parameters.

Step 3. ML algorithm selection and modeling. Select ML algorithms that suit small data samples and capture data characteristics quickly, such as support vector machines (SVM), random forests (RF), gradient boosting machines (GBM), decision trees, and extreme gradient boosting (XGBoost).

Step 4. Model performance evaluation and interpretability. Model performance and credibility should be tested with internal and external validation. The increase in mean square error (MSE), SHapley Additive exPlanations (SHAP), partial dependence plots (PDP), and local interpretable model-agnostic explanations (LIME) help ensure that the model is satisfactorily interpretable.

This workflow has the potential to predict the ecotoxicity of the effectively unlimited combinations of chemicals/nanomaterials and species, for which in vivo testing is tedious and practically infeasible.
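The validation and interpretability ideas in Step 4 can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration and does not come from the article: the data are synthetic, a small k-nearest-neighbour regressor stands in for the SVM/RF/GBM models named in Step 3, and the function names (`knn_predict`, `loo_mse`, `mse_increase`) are hypothetical. It shows leave-one-out cross-validation as one form of internal validation, a held-out set as external validation, and the rise in held-out MSE after shuffling a feature as a simple "MSE increase" importance measure.

```python
import random
from statistics import mean


def knn_predict(train_X, train_y, x, k=3):
    """Average the targets of the k training points closest to x."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return mean([train_y[i] for i in order[:k]])


def mse(train_X, train_y, X, y, k=3):
    """Mean square error of the k-NN stand-in model on (X, y)."""
    return mean([(knn_predict(train_X, train_y, x, k) - t) ** 2
                 for x, t in zip(X, y)])


def loo_mse(X, y, k=3):
    """Internal validation: leave-one-out cross-validated MSE."""
    errors = []
    for i in range(len(X)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        errors.append((knn_predict(rest_X, rest_y, X[i], k) - y[i]) ** 2)
    return mean(errors)


def mse_increase(train_X, train_y, X, y, feature, rng, repeats=20, k=3):
    """Mean rise in held-out MSE after shuffling one feature column."""
    base = mse(train_X, train_y, X, y, k)
    rises = []
    for _ in range(repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)
        shuffled = [row[:feature] + [v] + row[feature + 1:]
                    for row, v in zip(X, column)]
        rises.append(mse(train_X, train_y, shuffled, y, k) - base)
    return mean(rises)


# Synthetic small data set (assumption): feature 0 drives the response,
# feature 1 is pure noise.
rng = random.Random(0)
X = [[rng.uniform(0, 10), rng.uniform(0, 1)] for _ in range(40)]
y = [2.0 * x0 + rng.gauss(0, 0.5) for x0, _ in X]
train_X, train_y = X[:30], y[:30]   # used for fitting and internal validation
test_X, test_y = X[30:], y[30:]     # held out for external validation

internal = loo_mse(train_X, train_y)
external = mse(train_X, train_y, test_X, test_y)
importance = [mse_increase(train_X, train_y, test_X, test_y, j, rng)
              for j in range(2)]

print("internal (leave-one-out) MSE:", round(internal, 3))
print("external (held-out) MSE:", round(external, 3))
print("MSE increase per feature:", [round(v, 3) for v in importance])
```

In this toy setting the informative feature should show a much larger MSE increase than the noise feature. With real toxicity data, the same pattern could be applied to whichever model is actually selected in Step 3.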
SDML models are more sensitive to anomalous data (which are more common in toxicological trials), so toxicological data collected from databases or the literature must be processed with particular care. When the task is highly variable or the data distribution shifts, SDML models often have difficulty learning and predicting effectively across domains. SDML methods are therefore more suitable for prediction tasks with a single goal and low complexity. SDML methods also pose significant scalability challenges for large-scale applications because of their high model complexity and computational cost. Just as in the case of the ECOlogical Structure–Activity Relationship (ECOSAR) predictive model of the U.S. Environmental Protection Agency, results generated by SDML models should be used primarily for screening-level assessments of ecological risk, given the insufficiency of quality experimental data, and professional judgment is required to determine the appropriateness and applicability of the predictions.

Dr. Fengchang Wu is the director of the State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences (CRAES). His research interests mainly cover environmental toxicology and chemistry, pursued through multidisciplinary approaches including environmental biogeochemistry, chemistry, toxicology, and risk assessment. He has made remarkable contributions to the protection of the aquatic ecosystem in China through his advancement of environmental science, standards, technologies, and engineering. Dr. Wu was elected as an academician of the Chinese Academy of Engineering in 2017 in recognition of his work in the development of the environmental criteria/standards system in China.

Dr. Wenhong Fan is a Professor in the School of Materials Science and Engineering at Beihang University. Her research focuses on the environmental behavior, ecological effects, and remediation of metal and nanoparticle pollutants in aqueous environments, as well as the exposure, toxicity, and risk assessment of emerging pollutants. To date, she has published more than 150 international peer-reviewed papers. She serves as an Associate Editor of Aquatic Toxicology and an Editor of Bulletin of Environmental Contamination and Toxicology. She received a 2022 Carbon Research Best Paper award and a First Class Beijing Science and Technology Award from the Beijing Government.

This work was supported by the National Key R&D Program of China (2022YFC3204800), the National Natural Science Foundation of China (42177240, 42330710, and 42430714), the Beijing Natural Science Foundation (8242033), and the Fundamental Research Funds for the Central Universities. M.G.V. received funding from the European Union’s ERC-consolidator Grant Agreement 101002123.

This article references 15 other publications. This article has not yet been cited by other publications.
