Pub Date : 2025-10-10  DOI: 10.1016/j.chemolab.2025.105541
Nicolás Hernández, Yoonsun Choi, Tom Fearn
We propose a novel Bayesian optimization framework for interval selection in Partial Least Squares (PLS) regression. Unlike traditional iPLS variants that rely on fixed or grid-based intervals, our approach adaptively searches over the discrete space of interval positions of a pre-defined width using a Gaussian Process surrogate model and an acquisition function. This enables the selection of one or more informative spectral regions without exhaustive enumeration or manual tuning. Through synthetic and real-world spectroscopic datasets, we demonstrate that the proposed method consistently identifies chemically relevant intervals, reduces model complexity, and improves predictive accuracy compared to full-spectrum PLS and stepwise interval selection techniques. A Monte Carlo study further confirms the robustness and convergence of the algorithm across varying signal complexities and uncertainty levels. This flexible, data-efficient approach offers an interpretable and computationally scalable alternative for chemometric applications.
Title: "Bayesian optimization for interval selection in PLS models" (Chemometrics and Intelligent Laboratory Systems, vol. 267, Article 105541)
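The adaptive search described in the abstract can be sketched as a loop over discrete interval start positions. This is a minimal illustration under stated assumptions, not the authors' algorithm: the Gaussian Process surrogate is replaced by a cheap nearest-neighbour stand-in with a distance-based exploration bonus, and `cv_rmse` is a toy objective standing in for the cross-validated PLS error.

```python
import math
import random

def cv_rmse(start, width=20):
    # Toy stand-in for the cross-validated PLS error of the interval
    # [start, start + width); smooth, with a single broad minimum.
    return 1.0 + 0.5 * math.cos(start / 15.0) + (start - 60) ** 2 / 5000.0

def bo_interval_search(n_channels=180, width=20, n_init=5, n_iter=20, seed=0):
    rng = random.Random(seed)
    evaluated = {}  # interval start -> objective value
    for s in rng.sample(range(n_channels - width), n_init):
        evaluated[s] = cv_rmse(s, width)
    for _ in range(n_iter):
        best_y = min(evaluated.values())

        def acquisition(s):
            # Cheap surrogate: mean taken from the nearest evaluated point,
            # with an uncertainty bonus that grows with distance to it
            # (a stand-in for the GP posterior used in the paper).
            d, y = min((abs(s - t), v) for t, v in evaluated.items())
            return (best_y - y) + 0.05 * d

        candidates = [s for s in range(n_channels - width) if s not in evaluated]
        nxt = max(candidates, key=acquisition)
        evaluated[nxt] = cv_rmse(nxt, width)
    best = min(evaluated, key=evaluated.get)
    return best, evaluated[best]

start, err = bo_interval_search()
```

With a real objective, `cv_rmse` would refit a PLS model on the selected channels and return its cross-validation error; the loop structure stays the same.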
Pub Date : 2025-10-09  DOI: 10.1016/j.chemolab.2025.105549
Zhen Li, Jiang Zhang
Leather derived from different animal sources exhibits significant differences in both performance and value. Traditional leather identification methods suffer from subjectivity, inefficiency, and high costs, motivating the need for rapid, objective, and cost-effective alternatives. To achieve rapid, non-destructive classification, our study introduces a novel combination of Raman spectroscopy and a one-dimensional convolutional neural network (1D-CNN) enhanced with a self-attention mechanism to efficiently capture subtle spectral differences among leather types. A total of 1066 Raman spectra from cow, sheep, pig, and crocodile leathers were collected. Spectral data underwent smoothing, baseline correction, and normalization. Seven samples from each leather class were randomly assigned to the training set, while the remaining three samples per class were designated as an independent validation set. Data augmentation was performed by adding Gaussian noise and applying slight spectral shifts to simulate real-world variability, expanding the training set to 3810 samples. The proposed 1D-CNN model incorporates a self-attention mechanism to extract key spectral features and is compared with classical machine learning models and with 1D-CNN models that lack attention mechanisms. Experimental results demonstrate that our method outperforms existing approaches: after incorporating the self-attention mechanism, the model maintained high accuracy during cross-validation, while its average classification accuracy on the independent test set increased from 92.11 % to 96.28 %. This demonstrates enhanced generalization under different data partitioning schemes. This efficient, non-destructive, and reliable method not only enables accurate leather species identification and luxury-goods authentication, but also shows promise for broader material classification and quality control applications.
Title: "High-accuracy leather species identification via Raman spectroscopy and attention-enhanced 1D-CNN" (Chemometrics and Intelligent Laboratory Systems, vol. 267, Article 105549)
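The augmentation step the abstract describes (Gaussian noise plus slight spectral shifts) can be sketched as follows; the noise level, shift range, and circular (wrap-around) shifting are illustrative assumptions, not the paper's settings.

```python
import random

def augment(spectrum, n_copies=3, noise_sd=0.01, max_shift=2, seed=0):
    # Expand one spectrum (a list of intensities) into noisy, slightly
    # shifted variants. Shifts wrap around at the edges here, which is a
    # simplification; a real pipeline might pad or truncate instead.
    rng = random.Random(seed)
    variants = []
    for _ in range(n_copies):
        k = rng.randint(-max_shift, max_shift)
        shifted = spectrum[-k:] + spectrum[:-k] if k else list(spectrum)
        variants.append([x + rng.gauss(0.0, noise_sd) for x in shifted])
    return variants

augmented = augment([float(i) for i in range(10)])
```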
Pub Date : 2025-10-08  DOI: 10.1016/j.chemolab.2025.105543
Harun Uslu, Bihter Das, Huseyin Alperen Dagdogen, Yunus Santur, Seval Yılmaz, Ibrahim Turkoglu, Resul Das
The discovery of novel therapeutic molecules against the Human Immunodeficiency Virus (HIV) remains a critical research priority due to the persistent global impact of the disease. Traditional drug discovery processes are often time-consuming, costly, and limited in predictive capacity at early stages. In this study, we propose a three-stage AI-supported framework that integrates deep learning and molecular docking to accelerate candidate identification. First, a customized Autoencoder–Long Short-Term Memory (LSTM) model was employed to generate novel molecular structures consistent with key pharmacokinetic rules. Second, a Geometric Deep Learning (GDL) model was designed to evaluate interactions with major HIV-1 targets, including integrase, protease, and reverse transcriptase. Finally, in silico docking simulations assessed binding affinities and inhibition constants. The framework generated molecules that not only complied with pharmacokinetic and drug-likeness criteria (e.g., QED, ADME, SAScore) but also demonstrated favorable binding properties, particularly towards HIV-1 reverse transcriptase. These findings highlight the potential of the proposed approach to complement early-stage drug discovery and to contribute to the design of promising lead compounds for further experimental validation.
Title: "Discovery of new anti-HIV candidate molecules with an AI-based multi-stage system approach using molecular docking and ADME predictions" (Chemometrics and Intelligent Laboratory Systems, vol. 267, Article 105543)
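As one concrete example of the "key pharmacokinetic rules" the abstract mentions, a Lipinski rule-of-five screen over generated candidates might look like this; the descriptor field names and the two candidate molecules are hypothetical.

```python
def passes_rule_of_five(mol):
    # Lipinski-style drug-likeness screen on precomputed descriptors:
    # molecular weight <= 500, logP <= 5, H-bond donors <= 5, acceptors <= 10.
    return (mol["mw"] <= 500.0
            and mol["logp"] <= 5.0
            and mol["h_donors"] <= 5
            and mol["h_acceptors"] <= 10)

# Hypothetical generated candidates with hypothetical descriptor values.
candidates = [
    {"name": "gen-001", "mw": 412.5, "logp": 3.1, "h_donors": 2, "h_acceptors": 6},
    {"name": "gen-002", "mw": 587.9, "logp": 6.2, "h_donors": 4, "h_acceptors": 9},
]
kept = [m["name"] for m in candidates if passes_rule_of_five(m)]
```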
Pub Date : 2025-10-08  DOI: 10.1016/j.chemolab.2025.105548
Jaume Béjar-Grimalt, Ángel Sánchez-Illana, Guillermo Quintás, Hugh J. Byrne, David Pérez-Guaita
Infrared and Raman spectroscopy hold great promise for clinical applications. However, the inherent complexity of the associated spectral data necessitates advanced machine learning techniques which, while powerful at extracting biological information, often operate as black-box models. Combined with the absence of standardized datasets, this hinders model optimization, interpretability, and the systematic benchmarking of the growing number of newly developed machine learning methods. To address this, we propose a simulation-based framework for generating fully synthetic spectral datasets for benchmarking, using Monte Carlo approaches. The artificial datasets mimic a wide range of realistic scenarios, including overlapping spectral markers and non-discriminant features, and can be adjusted to simulate the effect of different parameters, such as instrumental noise, number of interferences, and sample size. These spectra are simulated by generating Lorentzian bands across the mid-infrared range, without specific reference to experimental data or chemical structures. We used the proposed methodology to compare different spectral marker identification protocols in partial least squares discriminant analysis (PLS-DA), showing that the orthogonal PLS-DA (OPLS-DA) approach, when combined with marker selection based on VIP scores or the regression vector, yielded higher sensitivity, specificity, and interpretability than standard PLS-DA using the same selection criteria. This framework was further used to benchmark the classification capabilities of commonly employed machine learning algorithms, incorporating both linear and non-linear markers reflective of compositional variations across the target classes. Key findings were validated using real infrared spectra from human blood serum and saliva collected in the frame of a clinical study. Overall, the proposed approach provides a versatile sandbox environment for the systematic evaluation of data analysis strategies in vibrational spectroscopy, helping both experimentalists to better interpret spectral markers and data analysts to benchmark and validate new algorithms.
Title: "Monte Carlo peaks: Simulated datasets to benchmark machine learning algorithms for clinical spectroscopy" (Chemometrics and Intelligent Laboratory Systems, vol. 267, Article 105548)
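A minimal version of the spectrum generator described above (randomly placed Lorentzian bands on a mid-infrared axis plus additive instrumental noise) could look like this; the band-parameter ranges are illustrative assumptions, not the paper's settings.

```python
import random

def lorentzian(x, center, width, height):
    # Lorentzian line shape with half-width-at-half-maximum `width`.
    return height * width ** 2 / ((x - center) ** 2 + width ** 2)

def simulate_spectrum(n_points=1000, n_bands=8, noise_sd=0.002, seed=0):
    # Mid-IR-like axis (600-4000 cm^-1) with randomly placed Lorentzian
    # bands plus Gaussian instrumental noise.
    rng = random.Random(seed)
    axis = [600 + i * (4000 - 600) / (n_points - 1) for i in range(n_points)]
    bands = [(rng.uniform(600, 4000),   # band center
              rng.uniform(5, 50),       # half-width
              rng.uniform(0.1, 1.0))    # height
             for _ in range(n_bands)]
    spectrum = [sum(lorentzian(x, c, w, h) for c, w, h in bands)
                + rng.gauss(0.0, noise_sd)
                for x in axis]
    return axis, spectrum

axis, spectrum = simulate_spectrum()
```

Classes could then be built by sharing most bands between groups and reserving a few "marker" bands for one class only, as the framework describes.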
Pub Date : 2025-10-03  DOI: 10.1016/j.chemolab.2025.105546
Muhammad Ali, Nasir Abbas, Shabbir Ahmad, Tahir Mahmood, Muhammad Riaz
Early detection of shifts in the process mean is crucial for maintaining product quality and operational integrity in the chemical industry. This paper proposes a new cumulative sum control chart, named the CMD chart, which leverages an auxiliary variable for robust and efficient monitoring. The CMD chart is designed through various parameters, with control limits calibrated to ensure a desired in-control average run length. Its performance is assessed using multiple run-length metrics, including average run length, standard deviation of run length, expected average run length, extra quadratic loss, relative average run length, and performance comparison index. An R Shiny app is also developed to enhance usability and to simplify calibration and evaluation across different parameter combinations. Through extensive simulation across a broad range of shifts, the CMD chart consistently outperformed existing charts in quickly detecting shifts while minimizing false alarms. A practical case study on a polymerization reactor further highlighted the effectiveness of the CMD chart, demonstrating earlier, more accurate, and more frequent detection of subtle shifts than competing methods. Overall, the CMD chart proves to be a robust and high-performing tool for process monitoring, making it highly relevant for modern chemical-engineering applications.
Title: "On designing robust and efficient CUSUM chart for mean monitoring: An application in chemical engineering for polymerization reactors" (Chemometrics and Intelligent Laboratory Systems, vol. 267, Article 105546)
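For orientation, the snippet below implements the standard tabular two-sided CUSUM that the CMD chart builds on, not the auxiliary-variable CMD chart itself: `k` is the reference value (allowance) and `h` the decision interval.

```python
def cusum_signal(samples, target, k=0.5, h=5.0):
    # Tabular two-sided CUSUM on standardized observations: accumulate
    # deviations above/below `target` beyond the allowance `k`, and signal
    # when either statistic exceeds the decision interval `h`.
    # Returns the index of the first out-of-control signal, or None.
    hi = lo = 0.0
    for i, x in enumerate(samples):
        hi = max(0.0, hi + (x - target) - k)
        lo = max(0.0, lo + (target - x) - k)
        if hi > h or lo > h:
            return i
    return None

in_control = [0.0] * 30
shifted = [0.0] * 20 + [1.5] * 10   # mean shifts by 1.5 sigma after sample 20
```

With `k = 0.5`, each post-shift observation adds 1.0 to the upper statistic, so the chart signals on the sixth observation after the shift (index 25).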
Pub Date : 2025-10-03  DOI: 10.1016/j.chemolab.2025.105545
Bin Li, Eizo Taira, Tetsuya Inagaki
Near-infrared spectroscopy (NIRS) calibration transfer faces significant challenges when deploying models across multiple instruments from different manufacturers, particularly because the inherently low molar absorptivity makes spectral data highly sensitive to minor variations in optical setup. This study presents two enhanced calibration transfer methods (ICTWM1 and ICTWM2) operating within the PLS latent variable space, utilizing dimensionality reduction to preserve analytically relevant variance while reducing noise interference. ICTWM1 employs spectral space transformation (SST) to correct PLS component scores between different instruments, while ICTWM2 selectively corrects the regression coefficients of the principal components with the highest variability.
The methods were validated using wheat protein analysis (7 secondary instruments from manufacturers A and B) and industrial sugarcane Brix determination (8 secondary instruments across geographically distributed facilities). ICTWM1 demonstrated superior performance, achieving 79.3 % of the primary-instrument model's performance using only 10 standardization samples on the wheat dataset, with improved cross-instrument consistency (standard deviation of 6.9 %) compared with traditional methods (>15 %). The method exhibited no manufacturer-dependent performance bias and maintained consistent performance across standardization sample sizes ranging from 10 to 110. On a severely constrained sugarcane dataset with only 5 training samples, both ICTWM1 and ICTWM2 performed well, with mean RMSEP values of 0.14 °Bx and 0.15 °Bx, respectively, outperforming traditional calibration transfer methods.
ICTWM1 demonstrates improved sample efficiency and cross-manufacturer robustness through optimized transformation within the PLS subspace. These characteristics make it a practical method for industrial NIRS applications requiring reliable calibration transfer with minimal standardization samples.
Title: "Enhanced PLS subspace-based calibration transfer method for multiple spectrometers using small standardization sample sets" (Chemometrics and Intelligent Laboratory Systems, vol. 267, Article 105545)
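A drastically simplified stand-in for this kind of standardization-sample-based transfer is a per-wavelength slope/offset correction fitted from samples measured on both instruments. This sketch is illustrative only; it is not ICTWM1 or ICTWM2, which operate in the PLS latent-variable space rather than channel by channel.

```python
def fit_channelwise_transfer(primary, secondary):
    # Fit, for each wavelength channel, a least-squares slope/offset that
    # maps secondary-instrument readings onto the primary instrument, using
    # standardization samples measured on both (rows = samples).
    n_chan = len(primary[0])
    coefs = []
    for j in range(n_chan):
        xs = [row[j] for row in secondary]
        ys = [row[j] for row in primary]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                 / sum((x - mx) ** 2 for x in xs))
        coefs.append((slope, my - slope * mx))
    return coefs

def apply_transfer(spectrum, coefs):
    # Correct one secondary-instrument spectrum channel by channel.
    return [a * x + b for x, (a, b) in zip(spectrum, coefs)]

# Synthetic check: the primary instrument reads 2x + 1 of the secondary.
secondary_std = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
primary_std = [[2.0 * v + 1.0 for v in row] for row in secondary_std]
coefs = fit_channelwise_transfer(primary_std, secondary_std)
corrected = apply_transfer([4.0, 5.0], coefs)
```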
Pub Date : 2025-10-03  DOI: 10.1016/j.chemolab.2025.105544
Developing robust and valuable quantitative structure-activity relationship (QSAR) models has become increasingly significant in modern drug design. These models play a crucial role by enabling the determination of molecular properties of compounds and predicting their bioactivities against therapeutic targets. QSAR models utilize various machine learning methods, such as support vector machines (SVM), multiple linear regression (MLR), and artificial neural networks (ANNs). These widely applicable methods have substantial implications for developing more precise medicines. The effectiveness of QSAR research depends heavily on how each step of the process is conducted and how the analysis is carried out. This paper discusses the essential steps in developing and validating QSAR models using machine learning. A case study is presented as a clear example, focusing on 121 compounds acting as potent nuclear factor-κB (NF-κB) inhibitors. The study compares multiple predictive QSAR models based primarily on linear and non-linear regression techniques.
Title: "Advancing QSAR models in drug discovery for best practices, theoretical foundations, and applications in targeting nuclear factor-κB inhibitors - A bright future in pharmaceutical chemistry" by Nour-El-Houda Hammoudi, Oussama Lalaoui, Widad Sobhi, Alessandro Erto, Luca Micoli, Byong-Hun Jeon, Yacine Benguerba, Walid Elfalleh, Mohamed A.M. Ali, Nasir A. Ibrahim, Hichem Tahraoui, Abdeltif Amrane (Chemometrics and Intelligent Laboratory Systems, vol. 267, Article 105544)
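Validation is one of the essential steps the paper discusses. As a self-contained illustration, a leave-one-out cross-validated q² for a univariate linear model can be computed as below; the descriptor/activity data are synthetic, and real QSAR models would use many descriptors.

```python
def loo_q2(x, y):
    # Leave-one-out cross-validated q^2, a standard internal-validation
    # statistic in QSAR: 1 - PRESS / SS_tot, where each point is predicted
    # from a univariate linear fit on the remaining points.
    n = len(y)
    mean_y = sum(y) / n
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    press = 0.0
    for i in range(n):
        xs = [v for j, v in enumerate(x) if j != i]
        ys = [v for j, v in enumerate(y) if j != i]
        m = len(xs)
        mx, my = sum(xs) / m, sum(ys) / m
        slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
                 / sum((a - mx) ** 2 for a in xs))
        press += (y[i] - (my + slope * (x[i] - mx))) ** 2
    return 1.0 - press / ss_tot

# Synthetic, nearly linear descriptor/activity pairs.
q2 = loo_q2([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.0, 8.1, 9.9])
```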
Pub Date : 2025-09-28  DOI: 10.1016/j.chemolab.2025.105542
Xiaojing Chen, Zhonghao Xie, Roma Tauler, Yong He, Pengcheng Nie, Yankun Peng, Liang Shu, Shujat Ali, Guangzao Huang, Wen Shi, Xi Chen, Leiming Yuan
Preprocessing plays a vital role in the analysis of near-infrared spectroscopy (NIRS) data, as it aims to remove unintended artifacts. This process involves a series of steps, each targeting a particular artifact. However, given the diverse range of NIRS applications, selecting the optimal combination of preprocessing methods remains a challenge. To address this, we propose an automated preprocessing framework that can quickly identify the optimal preprocessing strategy. The framework first constructs a workflow comprising multiple types of preprocessing methods. A genetic algorithm (GA) is then used to search for the best pipeline, avoiding exhaustive enumeration. In addition, we impose a penalty on the GA's loss function to obtain a parsimonious solution. Results on three real-world datasets demonstrate that our approach outperforms several state-of-the-art ensemble preprocessing methods in terms of prediction error. Compared to the raw data, the optimal preprocessing pipeline improves model performance by at least 48%. Furthermore, our framework identifies the most effective preprocessing methods included in the best pipeline. The source code for our approach is available on GitHub and can be easily integrated with other existing preprocessing techniques.
{"title":"An automated preprocessing framework for near infrared spectroscopic data","authors":"Xiaojing Chen , Zhonghao Xie , Roma Tauler , Yong He , Pengcheng Nie , Yankun Peng , Liang Shu , Shujat Ali , Guangzao Huang , Wen Shi , Xi Chen , Leiming Yuan","doi":"10.1016/j.chemolab.2025.105542","DOIUrl":"10.1016/j.chemolab.2025.105542","url":null,"abstract":"<div><div>Preprocessing plays a vital role in the analysis of Near-infrared spectroscopy (NIRS) data as it aims to remove unintended artifacts. This process involves a series of steps, each with a specific focus on a particular artifact. However, due to the diverse range of NIRS applications, selecting the optimal combination of preprocessing methods remains a challenge. To address this issue, we propose an automated preprocessing framework that can quickly identify the optimal preprocessing strategy. The framework initially constructs a workflow consisting of multiple types of preprocessing methods. Then, a genetic algorithm (GA) technique is used to optimize the best pipeline, avoiding exhaustive searches. In addition, we impose a penalty for the loss function of the GA process to obtain a parsimonious solution. Results on three real-world datasets demonstrate that our approach outperforms several state-of-the-art ensemble preprocessing methods in terms of prediction error. Compared to the raw data, the optimal preprocessing method can improve model performance by at least 48%. Furthermore, our framework enables the identification of the most effective preprocessing methods included in the best pipeline. 
The source code for our approach is available on GitHub and can be easily integrated with other existing preprocessing techniques.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"267 ","pages":"Article 105542"},"PeriodicalIF":3.8,"publicationDate":"2025-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145217363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
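The penalized GA search the abstract describes can be sketched as follows. This is a hedged illustration, not the authors' released code: the candidate step names, the truncation-selection GA, and the penalty weight are all illustrative assumptions. A pipeline is encoded as a binary mask over candidate preprocessing steps, and the fitness adds a parsimony penalty proportional to the number of active steps.

```python
import random

# Candidate preprocessing steps (illustrative names, not the paper's exact set).
STEPS = ["snv", "msc", "detrend", "sg_smooth", "sg_derivative", "mean_center"]

def fitness(mask, error_fn, penalty=0.05):
    """Penalized loss: prediction error plus a parsimony term
    proportional to the number of active steps."""
    return error_fn(mask) + penalty * sum(mask)

def mutate(mask, rate=0.1):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if random.random() < rate else b for b in mask]

def crossover(a, b):
    """Single-point crossover of two pipeline masks."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def ga_search(error_fn, pop_size=20, generations=40, seed=0):
    """Evolve binary pipeline masks; keep the better half each
    generation (truncation selection) and refill with offspring."""
    random.seed(seed)
    pop = [[random.randint(0, 1) for _ in STEPS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, error_fn))
        elite = pop[: pop_size // 2]
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    best = min(pop, key=lambda m: fitness(m, error_fn))
    return best, [s for s, b in zip(STEPS, best) if b]
```

In practice `error_fn` would be a cross-validated prediction error of a model fit on data preprocessed by the masked steps; here any callable mapping a mask to a loss works.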
Pub Date : 2025-09-23 DOI: 10.1016/j.chemolab.2025.105539
Yangha Chung , Johan Lim , Xinlei Wang , Soohyun Ahn
Quality control procedures are crucial for ensuring the reliability of mass spectrometry (MS) data, which are vital for biomarker discovery and for understanding complex biological systems. However, existing methods often concentrate solely on either sample or peak outlier detection, rely on subjective criteria, and employ overly uniform thresholds based on asymptotic distributions, thereby failing to adequately capture the characteristics of the data. In this paper, we introduce CPOD (Conformal Prediction for Outlier Detection), a novel approach that leverages conformal prediction for outlier detection in MS data analysis. CPOD simultaneously identifies outlier samples and peaks based on data-driven, distribution-free principles. Rigorous numerical evaluations and comparisons with existing methods demonstrate superior diagnostic performance. Application to real LC-MRM data underscores its practical utility, enhancing data reliability and reproducibility in MS studies.
{"title":"Conformalized outlier detection for mass spectrometry data","authors":"Yangha Chung , Johan Lim , Xinlei Wang , Soohyun Ahn","doi":"10.1016/j.chemolab.2025.105539","DOIUrl":"10.1016/j.chemolab.2025.105539","url":null,"abstract":"<div><div>Quality control procedures are crucial for ensuring the reliability of mass spectrometry (MS) data, vital in biomarker discovery and understanding complex biological systems. However, existing methods often concentrate solely on either sample or peak outlier detection, rely on subjective criteria, and employ overly uniform thresholds based on asymptotic distributions, thereby failing to adequately capture the characteristics of the data. In this paper, we introduce a novel approach, CPOD (Conformal Prediction for Outlier Detection), leveraging conformal prediction for outlier detection in MS data analysis. CPOD simultaneously identifies outlier samples and peaks based on data-driven and distribution-free principles. Rigorous numerical evaluations and comparisons with existing methods demonstrate superior diagnostic performance. Application to real LC-MRM data underscores practical utility, enhancing data reliability and reproducibility in MS studies.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"267 ","pages":"Article 105539"},"PeriodicalIF":3.8,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145155368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
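The conformal-prediction mechanism behind this kind of outlier detection can be sketched in a few lines. This is a generic split-conformal sketch under a simple distance-to-centre nonconformity score, an assumption for illustration; CPOD's actual score and its simultaneous sample/peak treatment are more involved.

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """Split-conformal p-value for each test score: the fraction of
    calibration scores at least as extreme, with the +1 correction
    that makes the p-value valid under exchangeability."""
    cal = np.asarray(cal_scores)
    n = cal.size
    return np.array([(1 + np.sum(cal >= s)) / (n + 1)
                     for s in np.atleast_1d(test_scores)])

def flag_outliers(X_cal, X_test, alpha=0.05):
    """Score each sample by its Euclidean distance from the calibration
    centre (an illustrative nonconformity score) and flag samples whose
    conformal p-value falls below alpha."""
    mu = X_cal.mean(axis=0)
    score = lambda X: np.linalg.norm(X - mu, axis=1)
    p = conformal_pvalues(score(X_cal), score(X_test))
    return p, p < alpha
```

The key property is distribution-free validity: if a test sample is exchangeable with the calibration set, its p-value is (super-)uniform, so the false-alarm rate is controlled at alpha without any asymptotic threshold.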
Pub Date : 2025-09-20 DOI: 10.1016/j.chemolab.2025.105535
Xiaoqing Zheng, Bo Peng, Anke Xue, Ming Ge, Yaguang Kong, Aipeng Jiang
In modern industry, soft sensors provide real-time predictions of quality variables that are difficult to measure directly with physical sensors. However, in industrial processes, changes in material properties, catalyst deactivation, and other factors often lead to shifts in the data distribution. Existing soft sensor models often overlook the impact of these distribution changes on performance. To address the resulting performance degradation, this paper proposes a self-attention-based Difference Long Short-Term Memory (SA-DLSTM) network for soft sensor modeling. Self-attention refines the raw industrial data to facilitate the extraction of nonlinear features, reducing modeling difficulty. A Difference Channel is designed to perform correlation analysis, select significant features from the raw data, and then extract difference information that reveals changes in the data distribution. The SA-DLSTM soft sensor model is established and validated on two benchmark industrial datasets, the Debutanizer Column and the Sulfur Recovery Unit. Comparisons with benchmark and state-of-the-art models show that SA-DLSTM achieves the best performance across all evaluation metrics, demonstrating the effectiveness of the proposed model.
{"title":"Self-attention based Difference Long Short-Term Memory Network for Industrial Data-driven Modeling","authors":"Xiaoqing Zheng, Bo Peng, Anke Xue, Ming Ge, Yaguang Kong, Aipeng Jiang","doi":"10.1016/j.chemolab.2025.105535","DOIUrl":"10.1016/j.chemolab.2025.105535","url":null,"abstract":"<div><div>In modern industry, soft sensors provide real-time predictions of quality variables that are difficult to measure directly with physical sensors. However, in industrial processes, changes in material properties, catalyst deactivation, and other factors often lead to shifts in data distribution. Existing soft sensor models often overlook the impact of these distribution changes on performance. To address the issue of performance degradation due to changes in data distribution, this paper proposes a self-attention based Difference Long Short-Term Memory (SA-DLSTM) network for soft sensor modeling. By employing self-attention, industrial raw data is refined to facilitate the extraction of nonlinear features, thereby reducing the difficulty in modeling. A Difference Channel is designed to perform correlation analysis and select significant features from the raw data, followed by extracting the difference information that can reveal changes in the data distribution. The SA-DLSTM soft sensor model is established and validated on two benchmark industrial datasets: Debutanizer Column and Sulfur Recovery Unit. 
Comparisons with benchmark models, and state-of-the-art models show that SA-DLSTM achieves the best performance across all evaluation metrics, demonstrating the effectiveness of the proposed model.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"267 ","pages":"Article 105535"},"PeriodicalIF":3.8,"publicationDate":"2025-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145109706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
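The Difference Channel idea, selecting target-correlated raw features and appending their first-order differences so the model can see distribution shifts, can be sketched as below. This is one hedged reading of the abstract, not the authors' architecture: the correlation-based selection, the `top_k` cutoff, and the zero-padding of the first difference row are illustrative assumptions, and the self-attention and LSTM stages are omitted.

```python
import numpy as np

def difference_channel(X, y, top_k=3):
    """Select the top_k raw features most correlated with the target,
    then append their first-order differences along the time axis,
    which carry information about shifts in the data distribution.

    X: (T, d) time-ordered process data; y: (T,) quality variable.
    Returns (features, idx) where features has shape (T, 2 * top_k).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Absolute Pearson correlation of each column with the target.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0)
                                * np.linalg.norm(yc) + 1e-12)
    idx = np.argsort(corr)[::-1][:top_k]
    sel = X[:, idx]
    # First-order differences, zero-padded so shapes line up at t = 0.
    diff = np.vstack([np.zeros((1, len(idx))), np.diff(sel, axis=0)])
    return np.hstack([sel, diff]), idx
```

The augmented matrix (levels plus differences) would then feed the downstream sequence model, so that a drift in the raw signal shows up explicitly as a nonzero difference feature rather than being absorbed silently into the level features.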