Pub Date: 2026-04-15; Epub Date: 2026-02-05; DOI: 10.1016/j.chemolab.2026.105659
Erdem Önal, Zeynep Kalaycıoğlu
Protein–ligand binding studies are critical in drug discovery and development, as they offer valuable insights into molecular interactions that underlie biological function, disease mechanisms, and therapeutic effects. The potential of combining text mining with cheminformatics to explore trends in protein–ligand binding studies across a range of analytical techniques was evaluated in this study. Six widely used analytical techniques were selected to reveal important patterns. Utilizing an open-source Python platform (SCOPE), we analyzed over 33,000 scientific articles and more than 1.3 million chemical entities. The resulting data were visualized as two-dimensional hexbin plots, revealing trends in hydrophobicity (log P)–molecular weight (Da) for each technique. Instead of focusing solely on ligands, this study aims to characterize the overall chemical environments—including solvents, buffers, and supporting agents—associated with protein–ligand binding assays. By analyzing the physicochemical properties of compounds reported across different analytical techniques, we highlight how method-specific preferences shape the experimental design landscape. The analysis integrates unsupervised K-means clustering, multivariate principal component analysis (PCA), and nonparametric statistical testing to quantitatively compare technique-associated chemical spaces. Moreover, this study offers a data-driven perspective on methodologies and historical trends in protein–ligand binding research. It is positioned as a data-driven, method-centric literature analysis rather than a traditional narrative review.
Title: Text mining-based profiling of chemical environments in protein–ligand binding assays across analytical techniques
Journal: Chemometrics and Intelligent Laboratory Systems, Volume 271, Article 105659; Journal Article; Impact Factor 3.8
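The clustering step mentioned in the abstract can be illustrated with a minimal sketch. This is not the SCOPE pipeline: it is a plain K-means (Lloyd's algorithm) in standard-library Python, applied to a handful of hypothetical (log P, molecular weight) pairs invented for illustration. The function names and the choice of the first k points as initial centroids are assumptions made to keep the example deterministic.

```python
import math

def standardize(points):
    """Scale each column to zero mean and unit (population) standard deviation."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    stds = [math.sqrt(sum((p[d] - means[d]) ** 2 for p in points) / n) or 1.0
            for d in range(dims)]
    return [tuple((p[d] - means[d]) / stds[d] for d in range(dims)) for p in points]

def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))

def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm; the first k points are used as initial centroids."""
    centroids = list(points[:k])
    labels = None
    for _ in range(n_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical (log P, molecular weight in Da) pairs, not data from the study.
compounds = [(-0.8, 60.0), (3.5, 420.0), (-1.1, 120.0),
             (-0.3, 90.0), (4.1, 510.0), (2.9, 380.0)]
labels, centroids = kmeans(standardize(compounds), k=2)
print(labels)
```

Standardizing first matters here because molecular weight spans hundreds of daltons while log P spans a few units; without scaling, the distance would be dominated by molecular weight alone.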
Pub Date: 2026-04-15; Epub Date: 2026-02-02; DOI: 10.1016/j.chemolab.2026.105656
Jingwen Ou, Yuhong Wang
Polypropylene is a fundamental material in consumer products and advanced technological applications, and accurate melt index (MI) prediction is critical for quality control in its polymerization. Existing offline analyses of MI are time-consuming and costly, so the development of MI soft sensors has become an active research topic. The variables in the propylene polymerization process form a complex nonlinear relationship through the polymerization reaction. Graph convolutional networks can capture the spatial dependence between variables, but they suffer from a fixed structure and insufficient propagation depth. To this end, this work proposes a Feature Expansion Multi-hop Graph Attention Network (FMGAT) framework that enhances the receptive field and captures features at multiple levels. The novelty of this framework lies in its integrated design for MI soft sensing, combining established attention and feature expansion mechanisms in a configuration tailored for polymerization processes. Unconnected nodes are connected by attention diffusion, which increases the receptive field of each layer. FMGAT uses multi-subspace parallel computing to extract features, which effectively reduces the homogenization of features. A Marginally Regression Conditional Tabular Generative Adversarial Network (MRCTGAN) is introduced to generate samples during data processing. Statistical and regression evaluation metrics are used to study the performance of MRCTGAN and FMGAT on an industrial dataset. Results show that MRCTGAN achieves the best histogram intersection dissimilarity among the sample generation methods compared. Models trained on MRCTGAN-augmented data achieve on average 8.2% lower root mean square error (RMSE) than models trained on the original data. FMGAT significantly outperforms the baselines, reducing RMSE to 0.4643 g/10 min. FMGAT establishes an interpretable, robust paradigm for complex industrial process modeling.
Title: A graph-based soft sensor using feature expansion and multi-hop attention for melt index prediction
Journal: Chemometrics and Intelligent Laboratory Systems, Volume 271, Article 105656; Journal Article; Impact Factor 3.8
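The two evaluation quantities named in the abstract can be sketched briefly. RMSE is standard. For the histogram intersection dissimilarity, the abstract gives no formula, so the definition below (one minus the sum of bin-wise minima of two normalized histograms) is an assumption, as are the function names and the toy numbers.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def histogram(values, lo, hi, n_bins):
    """Normalized equal-width histogram on [lo, hi]; out-of-range values are clamped."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, n_bins - 1))  # clamp into the first/last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]

def hist_intersection_dissimilarity(real, synthetic, lo, hi, n_bins):
    """Assumed definition: 1 - sum_i min(p_i, q_i); 0 means identical histograms."""
    p = histogram(real, lo, hi, n_bins)
    q = histogram(synthetic, lo, hi, n_bins)
    return 1.0 - sum(min(a, b) for a, b in zip(p, q))

# Toy values for illustration only (not the industrial dataset).
real_mi = [2.0, 2.5, 3.0, 3.5]
generated_mi = [2.1, 2.4, 3.1, 3.6]
dissimilarity = hist_intersection_dissimilarity(real_mi, generated_mi, 2.0, 4.0, 4)
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(dissimilarity, error)
```

Under this definition a lower value means the generated samples reproduce the marginal distribution of the real variable more closely, which is how "best" would be read when ranking generators.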
Pub Date: 2026-04-15; Epub Date: 2026-02-11; DOI: 10.1016/j.chemolab.2026.105660
Andrew T. Karl
SVEMnet is an R package for fitting Self-Validated Ensemble Models (SVEM) with elastic-net base learners and performing multi-response optimization in small-sample mixture–process design-of-experiments (DOE) studies with numeric, categorical, and mixture factors. SVEMnet wraps elastic-net and relaxed elastic-net models for Gaussian and binomial responses from glmnet in a fractional random-weight (FRW) resampling scheme with anti-correlated train/validation weights; penalties are selected by validation-weighted AIC- and BIC-type criteria, and predictions are averaged across replicates to stabilize fits near the interpolation boundary. In addition to the core SVEM engine, the package provides deterministic high-order formula expansion, a permutation-based whole-model test heuristic, and a mixture-constrained random-search optimizer that combines Derringer–Suich desirability functions, bootstrap-based uncertainty summaries, and optional mean-level specification-limit probabilities to generate scored candidate tables and diverse exploitation and exploration medoids for sequential fit–score–run–refit workflows. A simulated lipid nanoparticle (LNP) formulation study illustrates these tools, and simulation experiments based on sparse quadratic response surfaces benchmark SVEMnet against repeated cross-validated elastic-net baselines.
Title: SVEMnet: An R package for self-validated elastic-net ensembles and multi-response optimization in small-sample mixture–process experiments
Journal: Chemometrics and Intelligent Laboratory Systems, Volume 271, Article 105660; Journal Article; Impact Factor 3.8
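The Derringer–Suich desirability functions mentioned in the abstract can be sketched in a few lines. SVEMnet itself is an R package; the Python below is only an illustration of the standard one-sided desirability forms and their geometric-mean combination, not the package's API. The function names, the response names, and the limits are all hypothetical.

```python
def desirability_max(y, low, target, weight=1.0):
    """Larger-is-better: 0 at or below `low`, 1 at or above `target`."""
    if y <= low:
        return 0.0
    if y >= target:
        return 1.0
    return ((y - low) / (target - low)) ** weight

def desirability_min(y, target, high, weight=1.0):
    """Smaller-is-better: 1 at or below `target`, 0 at or above `high`."""
    if y <= target:
        return 1.0
    if y >= high:
        return 0.0
    return ((high - y) / (high - target)) ** weight

def overall_desirability(values):
    """Geometric mean of individual desirabilities; any zero gives zero overall."""
    product = 1.0
    for d in values:
        product *= d
    return product ** (1.0 / len(values))

# Hypothetical predicted responses for one candidate formulation.
d_potency = desirability_max(75.0, low=50.0, target=100.0)   # maximize potency
d_size = desirability_min(150.0, target=100.0, high=200.0)   # minimize particle size
score = overall_desirability([d_potency, d_size])
print(d_potency, d_size, score)
```

The geometric mean is the usual choice because a candidate that fails one response outright (desirability 0) is scored 0 overall, regardless of how well it does on the others.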
Pub Date: 2026-04-15; Epub Date: 2026-02-05; DOI: 10.1016/j.chemolab.2026.105652
Soumya Sahu, Thomas Mathew, Robert Gibbons, Dulal K. Bhaumik
This article addresses calibration challenges in analytical chemistry by employing a random-effects calibration curve model and its generalizations to capture variability in analyte concentrations. The model is motivated by specific issues in analytical chemistry, where measurement errors remain constant at low concentrations but increase proportionally as concentrations rise. To account for this, the model permits the parameters of the calibration curve, which relate instrument responses to true concentrations, to vary across different laboratories, thereby reflecting the potential variability in measurement processes. The calibration curve that accurately captures the heteroscedastic nature of the data results in more reliable estimates across diverse laboratory conditions. Noting that traditional large-sample interval estimation methods are inadequate for small samples, an alternative approach, namely the fiducial approach, is explored in this work. It turns out that the fiducial approach, when used to construct a confidence interval for an unknown concentration, outperforms all other available approaches in terms of maintaining the coverage probabilities. Applications considered include the determination of the presence of an analyte and the interval estimation of an unknown true analyte concentration. The proposed method is demonstrated for both simulated and real interlaboratory data, including examples involving copper and cadmium in distilled water.
Title: Fiducial inference for random-effects calibration models: Advancing reliable quantification in environmental analytical chemistry
Journal: Chemometrics and Intelligent Laboratory Systems, Volume 271, Article 105652; Journal Article; Impact Factor 3.8
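The error behaviour described in the abstract (roughly constant at low concentrations, proportional at high ones) has the shape of a two-component error model such as the Rocke and Lorenzato form, y = a + b·x·exp(η) + ε. That the paper uses exactly this form is an assumption; the sketch below only shows how the standard deviation of the response behaves under it, with hypothetical parameter values.

```python
import math

def response_sd(x, slope, sigma_add, sigma_mult):
    """SD of y = a + slope * x * exp(eta) + eps,
    with eta ~ N(0, sigma_mult^2) and eps ~ N(0, sigma_add^2), independent.

    Var(exp(eta)) = exp(s^2) * (exp(s^2) - 1) for a lognormal with log-SD s.
    """
    s2 = sigma_mult ** 2
    var_mult = (slope * x) ** 2 * math.exp(s2) * (math.exp(s2) - 1.0)
    return math.sqrt(sigma_add ** 2 + var_mult)

# Hypothetical parameters: unit slope, additive SD 0.5, multiplicative log-SD 0.1.
sd_at_zero = response_sd(0.0, slope=1.0, sigma_add=0.5, sigma_mult=0.1)
rel_sd_1000 = response_sd(1000.0, 1.0, 0.5, 0.1) / 1000.0
rel_sd_2000 = response_sd(2000.0, 1.0, 0.5, 0.1) / 2000.0
print(sd_at_zero, rel_sd_1000, rel_sd_2000)
```

At zero concentration only the additive term remains, while at high concentrations the relative standard deviation levels off at a constant, which is the heteroscedastic pattern the abstract describes.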
Pub Date: 2026-04-15; Epub Date: 2026-02-09; DOI: 10.1016/j.chemolab.2026.105663
Dennis Silva Ferreira, Robson Almeida Silva, Gustavo Macedo Pacheco, Edenir Rodrigues Pereira-Filho, Fabiola Manhas Verbi Pereira
X-ray fluorescence (XRF) techniques have been integrated with chemometrics, enabling more robust qualitative and quantitative analysis across increasingly complex matrices. Energy-dispersive XRF (ED-XRF), despite intrinsic limitations such as matrix effects and low sensitivity to light elements, has benefited from multivariate modelling, including stacked generalization, metaheuristic variable selection, and supervised classification, improving soil fertility prediction, food authentication, and material screening. Data fusion strategies combining ED-XRF with laser-induced breakdown spectroscopy (LIBS), Raman, Fourier transform infrared spectroscopy (FTIR), near-infrared spectroscopy (NIR), or ultraviolet-visible spectroscopy (UV-Vis) further mitigate spectral redundancy and enhance the detection of light elements, supporting applications in cultural heritage, environmental monitoring, biomedical diagnostics, and forensic classification. Advances in micro- and synchrotron-based XRF have expanded analytical resolution, necessitating chemometric tools such as principal component analysis (PCA), multivariate curve resolution-alternating least squares (MCR-ALS), self-organizing map with relational perspective map (SOM-RPM), and partial least squares discriminant analysis (PLS-DA) to decompose hyperspectral datasets, validate conservation treatments, identify phase transformations, and characterize biological tissues. Total reflection XRF (TXRF) and particle-induced x-ray emission (PIXE) likewise demonstrate improved discrimination and biomarker discovery when coupled with variable-selection strategies and multivariate classification. Emerging approaches in wavelength-dispersive XRF (WDXRF), including the exploitation of valence-to-core transitions and scattering spectra with partial least squares (PLS) modelling, provide promising routes for evaluating light-element content and fuel quality. 
Overall, chemometrics has become indispensable for extracting meaningful chemical information from XRF data, thereby enhancing interpretability and applicability across scientific domains.
Title: A review of multivariate modelling for x-ray fluorescence techniques
Journal: Chemometrics and Intelligent Laboratory Systems, Volume 271, Article 105663; Journal Article; Impact Factor 3.8
Pub Date: 2026-04-15; Epub Date: 2026-02-05; DOI: 10.1016/j.chemolab.2026.105661
Jiaxue Cui, Dawei Zhang, Banglian Xu, Jianzhong Fan, Xianglong Cao
This study addresses the challenges of high-dimensional collinearity and regional information heterogeneity in near-infrared spectroscopy for gasoline olefin content prediction by proposing a systematic optimization approach combining a Continuous Region Utilizing Integrated Spectral Evaluation for Near-Infrared (CRUISE-NIR) algorithm with a Region-Sensitive Adaptive Ensemble Learning (RAEL) framework. The CRUISE-NIR algorithm shifts spectral analysis from a “point” to a “region” perspective, taking into account the physical correlation of adjacent wavelengths and chemical prior knowledge, and reduces 4443 original variables to 16 key features. Meanwhile, the RAEL framework dynamically adjusts prediction weights according to sample performance characteristics in different spectral regions, achieving sample-specific prediction. Experimental results demonstrate that the proposed method achieves a root mean square error (RMSE) of 0.2795 and a coefficient of determination (R²) of 0.9646 on the test set, significantly outperforming traditional methods in prediction accuracy and fitting capability. Furthermore, the framework was validated on heterogeneous matrices including SWRI Diesel, IDRC Tablets, and Soil, demonstrating generalizability across liquid and solid physical states. The results indicate that prioritizing high-quality feature selection over variable quantity significantly enhances model performance. The proposed systematic framework demonstrates robust analytical capabilities for high-dimensional spectral data across diverse and complex molecular systems.
Title: Near-infrared spectroscopic prediction of gasoline olefin content: A systematic approach using continuous region feature selection and region-sensitive ensemble learning
Journal: Chemometrics and Intelligent Laboratory Systems, Volume 271, Article 105661; Journal Article; Impact Factor 3.8
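The two ideas in the abstract, treating contiguous wavelength regions as features and weighting region-level predictions per sample, can be illustrated generically. The sketch below is not CRUISE-NIR or RAEL: region averaging and inverse-error weighting are stand-in techniques chosen for simplicity, and every name and number is hypothetical.

```python
def region_means(spectrum, regions):
    """Collapse each contiguous index range [start, stop) into its mean value."""
    return [sum(spectrum[a:b]) / (b - a) for a, b in regions]

def inverse_error_weights(errors, eps=1e-9):
    """Weights proportional to 1 / error, normalized to sum to 1."""
    inverse = [1.0 / (e + eps) for e in errors]
    total = sum(inverse)
    return [v / total for v in inverse]

def weighted_prediction(predictions, weights):
    """Combine per-region model predictions with the given weights."""
    return sum(p * w for p, w in zip(predictions, weights))

# Toy spectrum of six absorbance values split into two contiguous regions.
features = region_means([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [(0, 2), (2, 6)])

# Hypothetical local errors of two region-specific models near one sample,
# and the predictions those models make for that sample.
weights = inverse_error_weights([0.1, 0.3])
combined = weighted_prediction([10.0, 14.0], weights)
print(features, weights, combined)
```

The point of the sketch is the shape of the computation: a region-level feature replaces many collinear wavelength points, and the region whose model has been more reliable for similar samples receives the larger share of the final prediction.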
Pub Date: 2026-04-15; Epub Date: 2026-02-04; DOI: 10.1016/j.chemolab.2026.105654
Mohammed Faisal Noaman, Moinul Haq, Sanjog Chhetri Sapkota, Mehboob Anwer Khan, Kausar Ali, Hesam Kamyab
The present study presents a framework that integrates experiments, machine learning (ML), and explainable artificial intelligence (XAI) for predicting the swelling pressure and consolidation characteristics of polypropylene geo-fiber (PPGF) reinforced clayey soil. A dataset of laboratory consolidation tests was compiled, comprising PPGF content, coefficient of consolidation (Cv), coefficient of compressibility (av), compression index (Cc), coefficient of volume change (mv), settlement (S), and swelling pressure (ps). The experimental observations revealed that Cc, mv, and S decreased on average by about 39.5%, 45.31%, and 90%, respectively, at the optimum PPGF content of 0.3%, demonstrating the effectiveness of reinforcing fibers in restraining time-dependent deformation. Six machine learning models (KNN, SVM, ANN, DT, RF, and XGB) were developed using five-fold cross-validation. The XGB regressor gave the best predictive performance, with an R² of 0.994 (RMSE 3.14) on the training set and an R² of 0.913 (RMSE 14.05) on the test set. The remaining models performed comparatively worse, with ANN and DT exhibiting pronounced overfitting, while KNN and SVM failed to adequately capture the nonlinear swelling response of the gels. The XAI analysis using SHAP indicates that polypropylene geofiber content is the most influential factor governing swelling pressure, followed by mv and soil compressibility. An interactive graphical user interface was built on the optimized XGB model to predict and visualize swelling pressure in real time from user inputs. The proposed model combines experimental validation with robust predictive capability and interpretability, and is complemented by a user-friendly interface and a reliable decision-support system for geotechnical design and soil improvement.
Title: Prediction of consolidation behavior of modified clayey soil reinforced with artificial geo-fibers using explainable artificial intelligence
Journal: Chemometrics and Intelligent Laboratory Systems, Volume 271, Article 105654; Journal Article; Impact Factor 3.8
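The validation scheme and the R² figures quoted in the abstract can be made concrete with a small sketch of five-fold index splitting and the coefficient of determination. This is a generic illustration in standard-library Python, not the authors' code; the function names are hypothetical, and the folds here are contiguous, whereas in practice the samples would usually be shuffled first.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

folds = list(k_fold_indices(12, k=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

A model that predicts the mean of the observed values for every sample scores an R² of 0 under this definition, which gives a reference point for reading the gap between the training and test values reported above.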
Pub Date: 2026-04-15; Epub Date: 2026-01-30; DOI: 10.1016/j.chemolab.2026.105653
Zhanchang Zhang , Qiao Ning , Xulun Shi , Shikai Guo , Hui Li
Protein S-sulfhydration is an important post-translational modification that regulates signaling pathways in animal cells by influencing protein activity and function. It also plays a crucial role in regulating plant metabolism and morphogenesis. Therefore, the identification of S-sulfhydration sites is crucial for cellular biology research. In this study, we propose a deep learning framework utilizing a directional multi-LSTM (Long Short-Term Memory) network to predict protein S-sulfhydration sites. Initially, protein sequence data is preprocessed via an improved BERT strategy to extract high-dimensional sequence features. Hypothesizing that S-sulfhydration modification exhibits directionality, we partition sequences around cysteine residues and extract features using directional multi-LSTM, simulating the enzymatic reaction conditions. Subsequently, a convolutional neural network (CNN) is employed to capture deep local information features. On an independent test set, the accuracy, sensitivity, specificity, Matthews correlation coefficient, area under the curve, and precision are 76.76%, 85.45%, 67.21%, 53.77%, 76.33% and 74.11%, respectively. The results demonstrate that the directional multi-LSTM deep learning framework is an effective tool for predicting protein S-sulfhydration. The source code is available on the website https://github.com/endeavor-zzc/Multi-LSTM.
{"title":"A directional multi-LSTM framework integrated BERT for S-sulfhydration sites prediction","authors":"Zhanchang Zhang , Qiao Ning , Xulun Shi , Shikai Guo , Hui Li","doi":"10.1016/j.chemolab.2026.105653","DOIUrl":"10.1016/j.chemolab.2026.105653","url":null,"abstract":"<div><div>Protein S-sulfhydration is an important post-translational modification that regulates signaling pathways in animal cells by influencing protein activity and function. It also plays a crucial role in regulating plant metabolism and morphogenesis. Therefore, the identification of S-sulfhydration sites is crucial for cellular biology research. In this study, we propose a deep learning framework utilizing a directional multi-LSTM (Long Short-Term Memory) network to predict protein S-sulfhydration sites. Initially, protein sequence data is preprocessed via an improved BERT strategy to extract high-dimensional sequence features. Hypothesizing that S-sulfhydration modification exhibits directionality, we partition sequences around cysteine residues and extract features using directional multi-LSTM, simulating the enzymatic reaction conditions. Subsequently, a convolutional neural network (CNN) is employed to capture deep local information features. On an independent test set, the accuracy, sensitivity, specificity, Matthews correlation coefficient, area under the curve, and precision are 76.76%, 85.45%, 67.21%, 53.77%, 76.33% and 74.11%, respectively. The results demonstrate that the directional multi-LSTM deep learning framework is an effective tool for predicting protein S-sulfhydration. 
The source code is available on the website <span><span>https://github.com/endeavor-zzc/Multi-LSTM</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"271 ","pages":"Article 105653"},"PeriodicalIF":3.8,"publicationDate":"2026-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146147387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
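The step of partitioning sequences around cysteine residues, described in the abstract above, could look roughly like the sketch below. It is not taken from the authors' repository: the function name, the flank length of 5, and the padding symbol "X" are all assumptions made for illustration, and the abstract does not say how each half is fed to the directional LSTM branches.

```python
def cysteine_windows(sequence, flank=5, pad="X"):
    """Return (position, upstream, downstream) for every cysteine in a sequence.

    `flank` (residues kept on each side) and `pad` (filler used near the
    sequence ends) are illustrative assumptions, not values from the paper.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for position, residue in enumerate(sequence):
        if residue != "C":
            continue
        center = position + flank  # index of this cysteine in the padded string
        upstream = padded[center - flank:center]
        downstream = padded[center + 1:center + 1 + flank]
        windows.append((position, upstream, downstream))
    return windows


if __name__ == "__main__":
    # Hypothetical peptide, for illustration only.
    for position, upstream, downstream in cysteine_windows("MKTCAYLLCGS", flank=3):
        print(position, upstream, "C", downstream)
```

Each half could then go to its own sequence model, so that the residues before and after the candidate site are read as separate directional contexts.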
Pub Date: 2026-04-15; Epub Date: 2026-01-28; DOI: 10.1016/j.chemolab.2026.105650
Nuno Costa , João Lourenço
Theoretical solutions for multiresponse problems may not yield the expected results when implemented in practice at the process and/or product level. Causes that have been overlooked and lead to such discrepancies in problems developed under the Response Surface Methodology framework are the magnitude of prediction errors in some regions of the solution space, unreplicated experimental runs, and the responses' sensitivity to changes in the values of response model variables. This discrepancy must be minimized, and it can be managed when optimal solutions are generated. Therefore, to improve the reproducibility of theoretical solutions, a new desirability-based function is proposed. This objective function makes it possible to balance response bias, prediction quality, robustness, and resilience according to the decision maker's preferences. Two case studies demonstrate its flexibility and usefulness.
{"title":"Solutions reproducibility in multiresponse optimization problems: A new desirability-based objective function","authors":"Nuno Costa , João Lourenço","doi":"10.1016/j.chemolab.2026.105650","DOIUrl":"10.1016/j.chemolab.2026.105650","url":null,"abstract":"<div><div>Theoretical solutions for multiresponse problems may not yield the expected results when implemented in practice at the process and/or product level. Causes that have been overlooked and lead to such discrepancies in problems developed under the Response Surface Methodology framework are the magnitude of prediction errors in some regions of the solution space, unreplicated experimental runs, and the responses' sensitivity to changes in the values of response model variables. This discrepancy must be minimized, and it can be managed when optimal solutions are generated. Therefore, to improve the reproducibility of theoretical solutions, a new desirability-based function is proposed. This objective function makes it possible to balance response bias, prediction quality, robustness, and resilience according to the decision maker's preferences. Two case studies demonstrate its flexibility and usefulness.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"271 ","pages":"Article 105650"},"PeriodicalIF":3.8,"publicationDate":"2026-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146186945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
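For readers unfamiliar with desirability-based optimization, the sketch below shows the classical Derringer-Suich form: each response is mapped to a score in [0, 1] and the scores are combined by a geometric mean. This is the textbook baseline only, not the new objective function proposed in the abstract above, whose form is not given there. The function names, bounds, and example responses are assumptions.

```python
import math


def desirability_larger_is_better(y, lower, target, shape=1.0):
    """Classical one-sided desirability: 0 at or below `lower`, 1 at or above `target`."""
    if y <= lower:
        return 0.0
    if y >= target:
        return 1.0
    return ((y - lower) / (target - lower)) ** shape


def overall_desirability(scores):
    """Geometric mean of individual desirabilities; any zero score gives zero overall."""
    return math.prod(scores) ** (1.0 / len(scores))


if __name__ == "__main__":
    # Two hypothetical predicted responses, for illustration only.
    d_yield = desirability_larger_is_better(82.0, lower=70.0, target=90.0)
    d_purity = desirability_larger_is_better(97.5, lower=95.0, target=99.0)
    print("overall desirability:", overall_desirability([d_yield, d_purity]))
```

Because the geometric mean collapses to zero when any single response is unacceptable, a candidate solution cannot compensate for one failed response with good values elsewhere.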
Pub Date: 2026-03-15; Epub Date: 2026-01-21; DOI: 10.1016/j.chemolab.2026.105630
Mengyu Wu , Yuan Cao , Ruiyang Wang , Chongxuan Tian , Yang Li , Zunsong Wang
Background
Percutaneous renal biopsy faces three major challenges in clinical management: inherent procedural risks, inability to serially monitor disease activity, and sampling variability. These limitations underscore the demand for safer, repeatable diagnostic tools.
Objective
Our objective was to explore the potential of a liquid biopsy strategy utilizing paired blood and urine analysis via Raman spectroscopy and a 1D-CNN to facilitate the differentiation of common glomerular diseases from each other and from healthy individuals.
Methods
From January 2021 to January 2025, we collected serum and first-void morning urine from 170 biopsy-confirmed patients (81 membranous nephropathy, 36 IgA nephropathy, 33 diabetic nephropathy, 20 focal segmental glomerulosclerosis) and 21 healthy volunteers. Spectra were acquired on an Attenuated Total Reflection-8300 (ATR-8300) instrument (785 nm excitation) and preprocessed via third-order polynomial baseline correction and 13-point Savitzky–Golay smoothing. A 1D-CNN was trained on the combined spectral data; performance was assessed by accuracy, sensitivity, specificity, and Receiver Operating Characteristic - Area Under the Curve (ROC-AUC).
Results
The 1D-CNN model achieved 80.0 % accuracy, 76.2 % sensitivity, and 81.3 % specificity in five-class classification. ROC-AUCs ranged from 0.81 (FSGS) to 0.85 (IgA nephropathy), confirming robust discrimination across disease subtypes and controls. Characteristic Raman bands, e.g. phenylalanine (∼1003 cm⁻¹), Amide I (∼1655 cm⁻¹), and C–H stretching (2800–3000 cm⁻¹), differed systematically among cohorts, reflecting underlying biochemical alterations.
Conclusions
Raman spectroscopy of paired blood and urine, coupled with deep learning, provides a rapid, label-free approach for minimally invasive classification of glomerular diseases. This integrated liquid biopsy strategy may enable early detection and precise stratification in nephrology, reducing reliance on invasive biopsy and informing personalized therapy.
{"title":"Non-invasive diagnosis of common glomerular diseases via Raman spectroscopy and machine learning: an integrated blood and urine analysis approach","authors":"Mengyu Wu , Yuan Cao , Ruiyang Wang , Chongxuan Tian , Yang Li , Zunsong Wang","doi":"10.1016/j.chemolab.2026.105630","DOIUrl":"10.1016/j.chemolab.2026.105630","url":null,"abstract":"<div><h3>Background</h3><div>Percutaneous renal biopsy faces three major challenges in clinical management: inherent procedural risks, inability to serially monitor disease activity, and sampling variability. These limitations underscore the demand for safer, repeatable diagnostic tools.</div></div><div><h3>Objective</h3><div>Our objective was to explore the potential of a liquid biopsy strategy utilizing paired blood and urine analysis via Raman spectroscopy and a 1D-CNN to facilitate the differentiation of common glomerular diseases from each other and from healthy individuals.</div></div><div><h3>Methods</h3><div>From January 2021 to January 2025, we collected serum and first-void morning urine from 170 biopsy-confirmed patients (81 membranous nephropathy, 36 IgA nephropathy, 33 diabetic nephropathy, 20 focal segmental glomerulosclerosis) and 21 healthy volunteers. Spectra were acquired on an Attenuated Total Reflection-8300 (ATR-8300) instrument (785 nm excitation) and preprocessed via third-order polynomial baseline correction and 13-point Savitzky–Golay smoothing. A 1D-CNN was trained on the combined spectral data; performance was assessed by accuracy, sensitivity, specificity, and Receiver Operating Characteristic - Area Under the Curve (ROC-AUC).</div></div><div><h3>Results</h3><div>The 1D-CNN model achieved 80.0 % accuracy, 76.2 % sensitivity, and 81.3 % specificity in five-class classification. ROC-AUCs ranged from 0.81 (FSGS) to 0.85 (IgA nephropathy), confirming robust discrimination across disease subtypes and controls. Characteristic Raman bands—e.g. 
phenylalanine (∼1003 cm<sup>−1</sup>), Amide I (∼1655 cm<sup>−1</sup>), and C–H stretching (2800–3000 cm<sup>−1</sup>)—differed systematically among cohorts, reflecting underlying biochemical alterations.</div></div><div><h3>Conclusions</h3><div>Raman spectroscopy of paired blood and urine, coupled with deep learning, provides a rapid, label-free approach for minimally invasive classification of glomerular diseases. This integrated liquid biopsy strategy may enable early detection and precise stratification in nephrology, reducing reliance on invasive biopsy and informing personalized therapy.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"270 ","pages":"Article 105630"},"PeriodicalIF":3.8,"publicationDate":"2026-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146075362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
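The 13-point Savitzky–Golay smoothing mentioned in the Methods above can be sketched in pure Python using the closed-form weights for a quadratic/cubic local fit. The polynomial order of the smoothing filter is not stated in the abstract, so cubic is an assumption here, as are the function names and the choice to leave the first and last six points unsmoothed. The third-order polynomial baseline correction is a separate step and is not shown.

```python
def savgol_coefficients(half_width):
    """Closed-form Savitzky-Golay smoothing weights for a quadratic/cubic fit.

    The window has 2 * half_width + 1 points; half_width=6 gives 13 points.
    """
    m = half_width
    denom = (2 * m + 3) * (2 * m + 1) * (2 * m - 1)
    return [3 * (3 * m * m + 3 * m - 1 - 5 * i * i) / denom
            for i in range(-m, m + 1)]


def savgol_smooth(signal, half_width=6):
    """Smooth a list of intensities; edge points are copied unchanged (assumption)."""
    weights = savgol_coefficients(half_width)
    smoothed = list(signal)
    for center in range(half_width, len(signal) - half_width):
        window = signal[center - half_width:center + half_width + 1]
        smoothed[center] = sum(w * v for w, v in zip(weights, window))
    return smoothed


if __name__ == "__main__":
    # Hypothetical noisy intensities, for illustration only.
    spectrum = [10 + (i % 3) * 0.5 + 0.01 * i for i in range(40)]
    print(savgol_smooth(spectrum)[6:12])
```

A filter built this way reproduces any cubic trend exactly at interior points, which is why it suppresses noise while distorting narrow Raman bands less than a plain moving average would.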