
Chemometrics and Intelligent Laboratory Systems: Latest Articles

Text mining-based profiling of chemical environments in protein–ligand binding assays across analytical techniques
IF 3.8; Zone 2, Chemistry; Q2 AUTOMATION & CONTROL SYSTEMS; Pub Date: 2026-02-05; DOI: 10.1016/j.chemolab.2026.105659
Erdem Önal, Zeynep Kalaycıoğlu
Protein–ligand binding studies are critical in drug discovery and development, as they offer valuable insights into molecular interactions that underlie biological function, disease mechanisms, and therapeutic effects. The potential of combining text mining with cheminformatics to explore trends in protein–ligand binding studies across a range of analytical techniques was evaluated in this study. Six widely used analytical techniques were selected to reveal important patterns. Utilizing an open-source Python platform (SCOPE), we analyzed over 33,000 scientific articles and more than 1.3 million chemical entities. The resulting data were visualized as two-dimensional hexbin plots, revealing trends in hydrophobicity (log P)–molecular weight (Da) for each technique. Instead of focusing solely on ligands, this study aims to characterize the overall chemical environments—including solvents, buffers, and supporting agents—associated with protein–ligand binding assays. By analyzing the physicochemical properties of compounds reported across different analytical techniques, we highlight how method-specific preferences shape the experimental design landscape. The analysis integrates unsupervised K-means clustering, multivariate principal component analysis (PCA), and nonparametric statistical testing to quantitatively compare technique-associated chemical spaces. Moreover, this study offers a data-driven perspective on methodologies and historical trends in protein–ligand binding research. It is positioned as a data-driven, method-centric literature analysis rather than a traditional narrative review.
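As a rough illustration of the nonparametric comparison mentioned in the abstract, the sketch below computes a Mann-Whitney U statistic for log P values from two techniques. The abstract does not name the test that was used, so the choice of Mann-Whitney is an assumption, and the function name and the data are hypothetical.

```python
from itertools import product


def mann_whitney_u(sample_a, sample_b):
    """Return the Mann-Whitney U statistic for sample_a versus sample_b.

    U counts the pairs (a, b) in which a exceeds b; a tie counts one half.
    """
    u = 0.0
    for a, b in product(sample_a, sample_b):
        if a > b:
            u += 1.0
        elif a == b:
            u += 0.5
    return u


# Hypothetical log P values for compounds reported with two techniques.
logp_technique_1 = [1.2, 2.5, 3.1, 0.8]
logp_technique_2 = [-0.5, 0.3, 1.2, -1.0]

u_stat = mann_whitney_u(logp_technique_1, logp_technique_2)
print(u_stat)
```

The two directions of the statistic always sum to the number of pairs, so a value close to either extreme suggests that one technique tends to be reported with more hydrophobic compounds than the other.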
Volume 271, Article 105659
Citations: 0
Fiducial inference for random-effects calibration models: Advancing reliable quantification in environmental analytical chemistry
IF 3.8; Zone 2, Chemistry; Q2 AUTOMATION & CONTROL SYSTEMS; Pub Date: 2026-02-05; DOI: 10.1016/j.chemolab.2026.105652
Soumya Sahu, Thomas Mathew, Robert Gibbons, Dulal K. Bhaumik
This article addresses calibration challenges in analytical chemistry by employing a random-effects calibration curve model and its generalizations to capture variability in analyte concentrations. The model is motivated by specific issues in analytical chemistry, where measurement errors remain constant at low concentrations but increase proportionally as concentrations rise. To account for this, the model permits the parameters of the calibration curve, which relate instrument responses to true concentrations, to vary across different laboratories, thereby reflecting the potential variability in measurement processes. The calibration curve that accurately captures the heteroscedastic nature of the data results in more reliable estimates across diverse laboratory conditions. Noting that traditional large-sample interval estimation methods are inadequate for small samples, an alternative approach, namely the fiducial approach, is explored in this work. It turns out that the fiducial approach, when used to construct a confidence interval for an unknown concentration, outperforms all other available approaches in terms of maintaining the coverage probabilities. Applications considered include the determination of the presence of an analyte and the interval estimation of an unknown true analyte concentration. The proposed method is demonstrated for both simulated and real interlaboratory data, including examples involving copper and cadmium in distilled water.
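The error behaviour described in the abstract (roughly constant at low concentration, proportional at high concentration) can be sketched with a two-component error model. The abstract does not give the model equation, so the form used here, response = intercept + slope * x * exp(eta) + eps with eta and eps normally distributed, is an assumption, and all names and parameter values are hypothetical.

```python
import math


def response_sd(x, slope, sigma_eps, sigma_eta):
    """Standard deviation of the response at true concentration x.

    Assumed model: y = intercept + slope * x * exp(eta) + eps, where
    eta ~ N(0, sigma_eta^2) is a multiplicative (proportional) error and
    eps ~ N(0, sigma_eps^2) is an additive (constant) error.
    """
    # Variance of exp(eta) for a normal eta (lognormal variance).
    lognormal_var = math.exp(sigma_eta ** 2) * (math.exp(sigma_eta ** 2) - 1.0)
    return math.sqrt(sigma_eps ** 2 + (slope * x) ** 2 * lognormal_var)


# Hypothetical parameters: the additive term dominates near zero,
# the proportional term dominates at high concentration.
for conc in (0.0, 1.0, 10.0, 100.0):
    print(conc, round(response_sd(conc, slope=2.0, sigma_eps=0.5, sigma_eta=0.1), 4))
```

In a random-effects version of this sketch, the intercept and slope would be drawn separately for each laboratory, which is what lets the calibration curve vary across laboratories.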
Volume 271, Article 105652
Citations: 0
Near-infrared spectroscopic prediction of gasoline olefin content: A systematic approach using continuous region feature selection and region-sensitive ensemble learning
IF 3.8; Zone 2, Chemistry; Q2 AUTOMATION & CONTROL SYSTEMS; Pub Date: 2026-02-05; DOI: 10.1016/j.chemolab.2026.105661
Jiaxue Cui, Dawei Zhang, Banglian Xu, Jianzhong Fan, Xianglong Cao
This study addresses the challenges of high-dimensional collinearity and regional information heterogeneity in near-infrared spectroscopy for gasoline olefin content prediction by proposing a systematic optimization approach combining a Continuous Region Utilizing Integrated Spectral Evaluation for Near-Infrared (CRUISE-NIR) algorithm with a Region-Sensitive Adaptive Ensemble Learning (RAEL) framework. The CRUISE-NIR algorithm shifts spectral analysis from a “point” to a “region” perspective, fully considering the physical correlation of adjacent wavelengths and chemical prior knowledge, reducing 4443 original variables to 16 key features. Meanwhile, the RAEL framework dynamically adjusts prediction weights according to sample performance characteristics in different spectral regions, achieving sample-specific precision prediction. Experimental results demonstrate that the proposed method achieves a root mean square error (RMSE) of 0.2795 and a coefficient of determination (R²) of 0.9646 on the test set, significantly outperforming traditional methods in prediction accuracy and fitting capability. Furthermore, the robustness of the framework was successfully validated on heterogeneous matrices including SWRI Diesel, IDRC Tablets, and Soil, demonstrating robust generalizability across diverse liquid and solid physical states. Experimental results indicate that prioritizing high-quality feature selection over variable quantity significantly enhances model performance. The proposed systematic framework demonstrates robust analytical capabilities for high-dimensional spectral data across diverse and complex molecular systems.
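For readers unfamiliar with the two figures of merit quoted in the abstract, the following minimal sketch shows how RMSE and the coefficient of determination are usually computed. The reference and predicted values are hypothetical and are not taken from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical olefin contents (reference) and model predictions.
reference = [10.0, 12.0, 14.0, 16.0]
predicted = [10.5, 11.5, 14.5, 15.5]

print(rmse(reference, predicted), r_squared(reference, predicted))
```

RMSE is expressed in the units of the predicted property, whereas R² is dimensionless, so the two are normally reported together.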
Volume 271, Article 105661
Citations: 0
Prediction of consolidation behavior of modified clayey soil reinforced with artificial geo-fibers using explainable artificial intelligence
IF 3.8; Zone 2, Chemistry; Q2 AUTOMATION & CONTROL SYSTEMS; Pub Date: 2026-02-04; DOI: 10.1016/j.chemolab.2026.105654
Mohammed Faisal Noaman, Moinul Haq, Sanjog Chhetri Sapkota, Mehboob Anwer Khan, Kausar Ali, Hesam Kamyab
This study presents an integrated experimental, machine learning (ML), and explainable artificial intelligence (XAI) framework for predicting the swelling pressure and consolidation characteristics of clayey soil reinforced with polypropylene geo-fiber (PPGF). A dataset of laboratory consolidation tests was compiled that included PPGF content, coefficient of consolidation (Cv), coefficient of compressibility (av), compression index (Cc), coefficient of volume change (mv), settlement (S), and swelling pressure (ps). The experimental observations revealed that Cc, mv, and S decreased on average by about 39.5%, 45.31%, and 90%, respectively, at the optimum PPGF content of 0.3%, demonstrating the effectiveness of the reinforcing fibers in restraining time-dependent deformation. Six machine learning models (KNN, SVM, ANN, DT, RF, and XGB) were developed using five-fold cross-validation. The XGB regressor gave the best predictive performance, with an R² of 0.994 (RMSE of 3.14) on the training set and an R² of 0.913 (RMSE of 14.05) on the test set. The remaining models performed comparatively worse: ANN and DT exhibited pronounced overfitting, while KNN and SVM failed to adequately capture the nonlinear swelling response of the gels. The XAI analysis using SHAP indicates that polypropylene geofiber content is the most influential factor governing swelling pressure, followed by mv and soil compressibility. An interactive graphical user interface was built on the optimized XGB model to predict and visualize swelling pressure in real time from user inputs. The proposed model integrates experimental validation with robust predictive capability and interpretability, and is complemented by a user-friendly interface and a reliable decision-support system for geotechnical design and soil improvement.
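The five-fold cross-validation mentioned in the abstract can be sketched as a simple index split. This is a minimal illustration that uses contiguous folds without shuffling; the abstract does not say how the folds were formed, and the function name and sample count are hypothetical.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) pairs for contiguous k-fold CV.

    When n_samples is not divisible by n_folds, the first folds receive
    one extra sample each so that every sample is tested exactly once.
    """
    remainder = n_samples % n_folds
    start = 0
    for fold in range(n_folds):
        size = n_samples // n_folds + (1 if fold < remainder else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Hypothetical dataset of 12 consolidation tests split into five folds.
for train_idx, test_idx in k_fold_indices(12, n_folds=5):
    print(len(train_idx), test_idx)
```

Each model would be fitted on the training indices of a fold and scored on its test indices, and the five scores averaged.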
Volume 271, Article 105654
Citations: 0
A graph-based soft sensor using feature expansion and multi-hop attention for melt index prediction
IF 3.8; Zone 2, Chemistry; Q2 AUTOMATION & CONTROL SYSTEMS; Pub Date: 2026-02-02; DOI: 10.1016/j.chemolab.2026.105656
Jingwen Ou, Yuhong Wang
Polypropylene is a fundamental material in consumer products and advanced technological applications, and accurate melt index (MI) prediction is critical for quality control in its polymerization. Existing offline analyses of MI are time-consuming and costly, so the development of MI soft sensors has become an active research topic. The variables in the propylene polymerization process form complex nonlinear relationships through the polymerization reaction. Graph convolutional networks can capture the spatial dependence between variables well, but they suffer from a fixed structure and insufficient propagation depth. To this end, this work proposes a Feature Expansion Multi-hop Graph Attention Network (FMGAT) framework that considers receptive field enhancement and multi-level feature capture. The novelty of this framework lies in its integrated design for MI soft sensing, combining established attention and feature expansion mechanisms in a configuration tailored for polymerization processes. Unconnected nodes are connected by attention diffusion, which increases the receptive field of each layer. FMGAT uses multi-subspace parallel computing to extract features, which effectively reduces the homogenization of features. A Marginally Regression Conditional Tabular Generative Adversarial Network (MRCTGAN) is introduced to generate samples during data processing. Statistical and regression evaluation metrics are developed to comprehensively study the performance of MRCTGAN and FMGAT on an industrial dataset. Results show that MRCTGAN achieves the best histogram intersection dissimilarity among the sample generation methods compared. Models trained on MRCTGAN-augmented data achieve on average 8.2% lower root mean square error (RMSE) than models trained on the original data. FMGAT significantly outperforms the baselines, reducing RMSE to 0.4643 g/10 min. FMGAT establishes an interpretable, robust paradigm for complex industrial process modeling.
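The idea that attention diffusion connects nodes without a direct edge can be sketched with powers of a one-hop attention matrix. The abstract does not give the diffusion formula, so the geometrically decaying weights alpha * (1 - alpha)^k used here are an assumption, and the matrix values, function names, and parameters are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffuse_attention(attention, alpha=0.2, hops=2):
    """Return sum_{k=0..hops} alpha * (1 - alpha)^k * attention^k (truncated)."""
    n = len(attention)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A^0
    diffused = [[0.0] * n for _ in range(n)]
    for k in range(hops + 1):
        theta = alpha * (1.0 - alpha) ** k
        for i in range(n):
            for j in range(n):
                diffused[i][j] += theta * power[i][j]
        power = matmul(power, attention)  # advance to A^(k+1)
    return diffused


# Hypothetical one-hop attention on a chain 0 - 1 - 2 (rows sum to 1).
one_hop = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
multi_hop = diffuse_attention(one_hop, alpha=0.2, hops=2)
print(one_hop[0][2], multi_hop[0][2])
```

Node 0 has no direct attention to node 2 in the one-hop matrix, but the two-hop term gives it a non-zero weight, which is the sense in which diffusion enlarges the receptive field of a layer.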
Volume 271, Article 105656
Citations: 0
A directional multi-LSTM framework integrated BERT for S-sulfhydration sites prediction
IF 3.8; Zone 2, Chemistry; Q2 AUTOMATION & CONTROL SYSTEMS; Pub Date: 2026-01-30; DOI: 10.1016/j.chemolab.2026.105653
Zhanchang Zhang, Qiao Ning, Xulun Shi, Shikai Guo, Hui Li
Protein S-sulfhydration is an important post-translational modification that regulates signaling pathways in animal cells by influencing protein activity and function. It also plays a crucial role in regulating plant metabolism and morphogenesis. Therefore, the identification of S-sulfhydration sites is crucial for cellular biology research. In this study, we propose a deep learning framework utilizing a directional multi-LSTM (Long Short-Term Memory) network to predict protein S-sulfhydration sites. Initially, protein sequence data are preprocessed via an improved BERT strategy to extract high-dimensional sequence features. Hypothesizing that S-sulfhydration modification exhibits directionality, we partition sequences around cysteine residues and extract features using the directional multi-LSTM, simulating the enzymatic reaction conditions. Subsequently, a convolutional neural network (CNN) is employed to capture deep local information features. On an independent test set, the accuracy, sensitivity, specificity, Matthews correlation coefficient, area under the curve, and precision are 76.76%, 85.45%, 67.21%, 53.77%, 76.33%, and 74.11%, respectively. The results demonstrate that the directional multi-LSTM deep learning framework is an effective tool for predicting protein S-sulfhydration. The source code is available at https://github.com/endeavor-zzc/Multi-LSTM.
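Most of the evaluation measures listed in the abstract follow directly from a binary confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and are not the paper's results, and the function name is illustrative. Area under the curve is omitted because it needs ranked scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Hypothetical counts for a small independent test set.
metrics = binary_metrics(tp=8, fp=4, tn=6, fn=2)
for name, value in metrics.items():
    print(name, round(value, 4))
```

A sensitivity well above the specificity, as reported in the abstract, indicates a classifier that favours calling a site modified at the cost of more false positives.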
Volume 271, Article 105653
Citations: 0
Time-resolved simulation of hybrid nano-milk flow in an electromagnetic vibration channel with parabolic thermal ramping: A Python AI approach
IF 3.8; Zone 2, Chemistry; Q2 AUTOMATION & CONTROL SYSTEMS; Pub Date: 2026-01-26; DOI: 10.1016/j.chemolab.2026.105647
Sanatan Das, Poly Karmakar
This research paper explores the application of artificial intelligence (AI) to understanding the behavior of silver and magnesium oxide nanoparticles within milk flow. The study uses a specially designed vibrating electromagnetic channel to observe the effects under controlled parabolic thermal ramping and oscillatory pressure variations. The framework couples essential physical mechanisms (radiative emission, thermal sinks, and porous matrix interactions), with Darcy's law quantifying the permeability-driven viscous drag. The mechanics of milk flow through an electromagnetically activated channel are formulated and solved using mathematical and computational methods, with the Laplace transform (LT) technique facilitating a streamlined solution to the equations. The analysis concentrates on flow metrics, presenting results through detailed graphical representations. Significant findings include the enhancement of thermal conductivity and flow viscosity due to the nanoparticles, which improves heat transport efficiency and modifies flow patterns. The operational control of milk flow dynamics shows dual dependencies: momentum is amplified by electromagnetic intensity (Hartmann number) and suppressed by electrode spacing, while thermal management reveals frequency-dependent shear stress (SS) augmentation and rate of heat transfer (RHT) enhancement through an optimized heat uptake parameter. An artificial neural network (ANN) is calibrated to emulate the LT solver's outputs for wall SS and RHT. The ANN achieves high fidelity (R² > 0.99) in predicting these metrics across the parameter space explored in the LT simulations, but its generalization to experimental or real dairy systems remains unvalidated and is a focus of future work. The key findings demonstrate the potential of integrating advanced materials and AI technologies to improve product characteristics and processing efficiency.
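The Hartmann number referred to in the abstract is the standard dimensionless ratio of electromagnetic to viscous forces. The sketch below uses its textbook definition; the abstract does not state how the authors non-dimensionalize their equations, so the function name, the choice of length scale, and the parameter values are hypothetical.

```python
import math


def hartmann_number(magnetic_field, length, conductivity, viscosity):
    """Ha = B * L * sqrt(sigma / mu).

    magnetic_field: B in tesla
    length:         characteristic length L in metres
    conductivity:   electrical conductivity sigma in S/m
    viscosity:      dynamic viscosity mu in Pa*s
    """
    return magnetic_field * length * math.sqrt(conductivity / viscosity)


# Hypothetical values chosen only to make the arithmetic easy to follow.
ha = hartmann_number(magnetic_field=2.0, length=0.5, conductivity=4.0, viscosity=1.0)
print(ha)
```

Doubling the field strength or the characteristic length doubles Ha, whereas the conductivity and viscosity enter only through a square root.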
Citations: 0
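The abstract above reports surrogate fidelity as R² > 0.99 between the ANN and the LT solver. As a minimal sketch of how such a score is typically computed, the snippet below evaluates the coefficient of determination for two lists of values. The function name `r_squared` and all numbers are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def r_squared(reference, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(reference) != len(predicted) or len(reference) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_ref = sum(reference) / len(reference)
    # Residual sum of squares: surrogate error against the reference solver.
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    # Total sum of squares: spread of the reference values around their mean.
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    if ss_tot == 0.0:
        raise ValueError("reference values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical wall shear stress values (illustrative only).
solver_out = [0.10, 0.25, 0.40, 0.55, 0.70]   # stands in for LT solver outputs
ann_out = [0.11, 0.24, 0.41, 0.54, 0.70]      # stands in for ANN predictions
score = r_squared(solver_out, ann_out)
print(f"R^2 = {score:.4f}")
```

A high score here only says the surrogate reproduces the solver on the sampled points, which matches the abstract's caveat that generalization to real systems is unvalidated.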
Noise-robust contrastive ensemble learning for flotation process monitoring
IF 3.8 Zone 2 (Chemistry) Q2 AUTOMATION & CONTROL SYSTEMS Pub Date: 2026-01-23 DOI: 10.1016/j.chemolab.2026.105649
Mingxi Ai, Jin Zhang, Zhaohui Tang, Yongfang Xie
Froth flotation is a widely used mineral beneficiation technique, where effective process monitoring is essential for optimizing mineral separation. However, in industrial practice, manual labeling is noisy, so a significant portion of the data is incorrectly labeled. Although deep learning monitoring models are powerful in capturing complex visual patterns, their high capacity makes them vulnerable to overfitting noisy labels, hindering robust model development. To address this challenge, this study proposes a noise-robust contrastive ensemble learning method for practical industrial process monitoring. The method first constructs multiple diverse monitoring models in distinct representation spaces using a novel disparity contrastive learning strategy. Then, clean and mislabeled data for each sub-model are distinguished by measuring the inter-model consensus and intra-model uncertainty of its peer models. Finally, a structure-consistency-based semi-supervised learning strategy is proposed to refine these sub-models by treating mislabeled data as unlabeled, encouraging representation-aligned predictions through mutual information maximization. Through iterative noisy-label identification and semi-supervised refinement, robust monitoring models are obtained even with heavily corrupted training data. Extensive experiments on industrial froth flotation data demonstrate the effectiveness and advantages of the proposed method compared to existing state-of-the-art noise-robust learning techniques.
Citations: 0
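The abstract above separates clean from mislabeled samples using the consensus and uncertainty of peer models. The sketch below shows one simple way that idea could look; it is not the authors' algorithm. It assumes a hypothetical rule: a sample is suspect when all peer models predict the same class, each peer's predictive entropy is below a threshold, and that class differs from the given label. The function names, the threshold `max_entropy`, and the probabilities are all made up for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flag_suspect_labels(peer_probs, labels, max_entropy=0.5):
    """Return indices of samples whose given label looks wrong.

    peer_probs[m][i] is the class-probability list from peer model m
    for sample i. The decision rule is an illustrative assumption.
    """
    suspects = []
    for i, label in enumerate(labels):
        # Predicted class (argmax) of every peer model for this sample.
        votes = [max(range(len(p[i])), key=p[i].__getitem__) for p in peer_probs]
        unanimous = len(set(votes)) == 1                                  # inter-model consensus
        confident = all(entropy(p[i]) <= max_entropy for p in peer_probs)  # low intra-model uncertainty
        if unanimous and confident and votes[0] != label:
            suspects.append(i)
    return suspects


# Hypothetical outputs of two peer models on three samples, two classes.
peer_a = [[0.90, 0.10], [0.10, 0.90], [0.55, 0.45]]
peer_b = [[0.95, 0.05], [0.15, 0.85], [0.40, 0.60]]
given_labels = [0, 0, 1]
suspects = flag_suspect_labels([peer_a, peer_b], given_labels)
print(suspects)
```

In a pipeline like the one the abstract describes, the flagged samples would then be treated as unlabeled during semi-supervised refinement rather than discarded.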
Evaluating calibration models in isotope geochemistry: Lessons from carbonates and sulfides
IF 3.8 Zone 2 (Chemistry) Q2 AUTOMATION & CONTROL SYSTEMS Pub Date: 2026-01-23 DOI: 10.1016/j.chemolab.2026.105640
Alban Petitjean, Olivier Musset, Ludovic Duponchel, Christophe Thomazo
Geology routinely employs isotopic geochemistry with the main objective of measuring radiogenic or stable isotopic compositions to reconstruct the history of the Earth. A critical aspect of this analytical process lies in verifying the accuracy and reliability of the measurements performed. To this end, standards or reference materials are repeatedly analyzed, enabling calibration or adjustment of experimental instruments. To ensure a strong correlation between the reference values and the averaged measurements, linear regression is the most widely adopted approach. Among the available methodologies, this work advocates the use of models compliant with the ISO 28037:2010 standard, which is specifically designed to perform linear regression in a statistically robust manner. The guidelines established by this standard are, regrettably, not always implemented correctly, and the statistical nature of the measurements is frequently overlooked. This study provides a detailed examination of the methodologies advocated by the standard, with the objective of facilitating their application to geochemical problems, specifically issues related to isotopic measurement, by revisiting the underlying theoretical principles, the assumptions, and the advantages and limitations inherent to each approach. To facilitate implementation and adherence to the recommendations, we propose a software application developed in Python 3.14. This computational tool has been tested and validated using experimental datasets obtained from isotopic analyses of carbon, oxygen, and sulfur, elements of fundamental interest in geological studies. The objective of this study is therefore to clearly and practically illustrate the challenges involved in geochemical calibration and adjustment.
Citations: 0
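The abstract above concerns straight-line calibration in which the statistical nature of the measurements is taken into account. As a minimal sketch of that idea, the snippet below fits a line by weighted least squares, assuming known standard uncertainties in the measured values only and exact reference values. This is just one simple uncertainty structure; it is not the authors' software and does not cover cases such as uncertainties in both variables. The function name `weighted_line_fit` and the data are hypothetical.

```python
import math


def weighted_line_fit(x, y, u_y):
    """Fit y = a + b*x by weighted least squares with weights 1/u_y**2.

    Returns intercept a, slope b, their standard uncertainties, and the
    chi-squared value of the fit. Assumes x is exact (an assumption).
    """
    w = [1.0 / u ** 2 for u in u_y]
    sw = sum(w)
    # Weighted centroids of the data.
    x0 = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y0 = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x0) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x0) * (yi - y0) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = y0 - b * x0
    u_b = math.sqrt(1.0 / sxx)
    u_a = math.sqrt(1.0 / sw + x0 ** 2 / sxx)
    # Chi-squared: compare against n - 2 degrees of freedom to judge the model.
    chi2 = sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    return a, b, u_a, u_b, chi2


# Hypothetical reference values (x) and measured values (y) with uncertainties.
ref = [0.0, 1.0, 2.0, 3.0]
meas = [1.0, 3.0, 5.0, 7.0]          # lies exactly on y = 1 + 2x
u_meas = [0.1, 0.1, 0.1, 0.1]
a, b, u_a, u_b, chi2 = weighted_line_fit(ref, meas, u_meas)
print(a, b, u_a, u_b, chi2)
```

Reporting `u_a` and `u_b` alongside the fitted parameters, and checking `chi2` against the degrees of freedom, is what distinguishes this from an unweighted fit judged by correlation alone.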
Explainable AI for secure and accurate prediction of bacteriophage virion proteins using NLP descriptors and transformer-guided ideal proximity matrix reconstruction
IF 3.8 Zone 2 (Chemistry) Q2 AUTOMATION & CONTROL SYSTEMS Pub Date: 2026-01-23 DOI: 10.1016/j.chemolab.2026.105648
Naif Almusallam, Maqsood Hayat
The biological functions of bacteria are significantly affected by bacteriophage virion proteins (BVPs), the structural proteins of bacterial viruses. BVPs play a major role in phage therapy and genetic engineering. Secure and accurate identification of these proteins is essential for understanding phage-host interactions and for bioinformatics and medical applications. However, ensuring privacy and robustness in computational models is challenging, especially when handling complex biological data. Previous works relied on wet-lab experiments and had limited scalability, incomplete feature coverage, and low generalization ability. In this study, we introduce a privacy-preserving and adversarially robust deep learning framework. It integrates natural language processing (NLP) descriptors with transformer-guided ideal proximity matrix reconstruction to capture rich information from protein sequences. For post-hoc interpretability, we use SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME). These techniques increase transparency and confidence in predictions. SHAP analyzes the dataset to identify the most significant proximity-based and NLP-derived descriptors at global and class levels. LIME provides instance-specific explanations, emphasizing local decision boundaries for particular predictions. The proposed model achieved 95.75% and 90.27% accuracy on the training and independent datasets, respectively. We calculated statistical measures, such as chi-square and p-value, for each dataset to demonstrate reliability. Our model improves predictive outcomes, transparency, and security. The empirical results validate its performance against existing models while preserving security and explainability. This makes it suitable and reliable for real-world applications in proteomics and bioinformatics.
Citations: 0