Chemometrics and Intelligent Laboratory Systems: Latest Articles, Page 8

HEnsem_DTIs: A heterogeneous ensemble learning model for drug-target interactions prediction

IF 3.7 Zone 2 (Chemistry) Q2 AUTOMATION & CONTROL SYSTEMS

Chemometrics and Intelligent Laboratory Systems

Pub Date : 2024-09-02 DOI: 10.1016/j.chemolab.2024.105224

Mohammad Reza Keyvanpour , Yasaman Asghari , Soheila Mehrmolaei

Drug discovery is the process of identifying new drugs, and drug-target interaction prediction is a major part of it. Producing new drugs is time-consuming and expensive because it requires extensive human and laboratory resources. Computational methods have recently been used to predict interactions, which avoids blindly examining every candidate interaction. Experience with these methods shows that no single algorithm suits all applications; hence, ensemble learning is used. Although various ensemble methods have been proposed, it is still not easy to find a suitable one for a particular dataset. In general, the algorithms used in the aggregation and combination method are selected manually, based on experience. Reinforcement learning is one way to meet this challenge. High-dimensional feature spaces and class imbalance are further challenges in drug-target interaction prediction. This paper proposes HEnsem_DTIs, a heterogeneous ensemble model for predicting drug-target interactions that uses dimensionality reduction and concepts from recommender systems to address these challenges. HEnsem_DTIs is configured with reinforcement learning. Dimensionality reduction handles the high-dimensional feature space, and recommender-system concepts improve under-sampling to address class imbalance. Six datasets are used to evaluate the proposed model, and the results show that HEnsem_DTIs performs better than other models in this field. On the first dataset, 10-fold cross-validation yields a sensitivity of 0.896, specificity of 0.954, GM of 0.924, AUC of 0.930, and AUPR of 0.935.
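The abstract does not define GM. Assuming it denotes the geometric mean of sensitivity and specificity, as is common in class-imbalance studies, the reported figures can be cross-checked with a few lines of Python. The function name is illustrative and not taken from the paper.

```python
import math


def geometric_mean(sensitivity: float, specificity: float) -> float:
    """G-mean, assumed here to be sqrt(sensitivity * specificity)."""
    return math.sqrt(sensitivity * specificity)


# Values reported for the first dataset (10-fold cross-validation).
gm = geometric_mean(0.896, 0.954)
print(f"{gm:.4f}")
```

Under this assumption the result lies within 0.001 of the reported 0.924. A small gap of this kind could arise if GM was averaged per fold rather than computed from the averaged sensitivity and specificity, but the abstract does not say which was done.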

Volume : 253 Pages : Article 105224

Random projection ensemble conformal prediction for high-dimensional classification

Pub Date : 2024-09-02 DOI: 10.1016/j.chemolab.2024.105225

Xiaoyu Qian , Jinru Wu , Ligong Wei , Youwu Lin

In classification problems, many models with superior performance fail to provide confidence estimates or intervals for each prediction. This lack of reliability poses risks in real-world applications and makes these models difficult to trust. Conformal prediction methods, which are distribution-free and model-free approaches with a finite-sample coverage guarantee, have recently been widely used to construct prediction sets for classification models. However, traditional conformal prediction methods only produce set-valued results without specifying a definitive predicted class. Particularly in complex settings, these methods do not help models address challenges such as high dimensionality, resulting in ambiguous prediction sets with low statistical efficiency, i.e., prediction sets that contain many false classes. In this study, a novel ensemble conformal prediction algorithm based on random projection and a designed voting strategy, RPECP, is developed to tackle these challenges. First, a procedure for selecting approximately oracle random projections and classifiers is executed to best leverage the internal information and structure of the data. Next, based on these projections and the underlying classifiers, conformal prediction is performed on new test samples in a lower-dimensional space, resulting in multiple independent prediction sets. Finally, an accurate predicted class and a precise prediction set with high coverage and statistical efficiency are produced through the designed voting strategy. Compared with several base classifiers, RPECP obtains higher classification accuracy; compared with other conformal prediction algorithms, it achieves less ambiguous prediction sets with fewer false classes while guaranteeing high coverage. For illustration, this paper demonstrates RPECP's superiority over other methods in four cases: two high-dimensional settings and two real-world datasets.
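The set-valued output that the abstract attributes to traditional conformal prediction can be illustrated with a minimal split-conformal sketch. This is the generic baseline, not RPECP: the random projections, classifier selection, and voting strategy are not reproduced. The nonconformity score (one minus the predicted probability of the true class), the toy numbers, and all function names are assumptions made for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of the calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few calibration points: keep every class
    return sorted(cal_scores)[k - 1]


def prediction_set(class_probs, threshold):
    """Keep every class whose score 1 - p does not exceed the threshold."""
    return {label for label, p in class_probs.items() if 1 - p <= threshold}


# Toy calibration scores: 1 - (predicted probability of the true class).
cal_scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.55, 0.60, 0.70, 0.80]
qhat = conformal_threshold(cal_scores, alpha=0.2)

confident = prediction_set({"A": 0.90, "B": 0.07, "C": 0.03}, qhat)
ambiguous = prediction_set({"A": 0.45, "B": 0.40, "C": 0.15}, qhat)
print(qhat, sorted(confident), sorted(ambiguous))
```

The second test sample receives a two-class set, which is the kind of ambiguity (a set containing a false class) that the abstract describes as low statistical efficiency.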

Volume : 253 Pages : Article 105225

G-CovSel: Covariance oriented variable clustering

Pub Date : 2024-08-29 DOI: 10.1016/j.chemolab.2024.105223

Jean-Michel Roger , Alessandra Biancolillo , Bénédicte Favreau , Federico Marini

Dimensionality reduction is an essential step in the processing of analytical chemistry data. When this reduction is carried out by variable selection, it can enable the identification of biochemical pathways. CovSel has been developed to meet this requirement, through a parsimonious selection of non-redundant variables. This article presents the g-CovSel method, which modifies the CovSel algorithm to produce highly complementary groups containing highly correlated variables. This modification requires the theoretical definition of the groups' construction and of the deflation of the data with respect to the selected groups. Two applications, on two extreme case studies, are presented. The first, based on near-infrared spectra related to four chemicals, demonstrates the relevance of the selected groups and the method's ability to handle highly correlated variables. The second, based on genomic data, demonstrates the method's ability to handle very highly multivariate data. Most of the groups formed can be interpreted from a functional point of view, making g-CovSel a tool of choice for biomarker identification in omics. Further work will be carried out to generalize g-CovSel to multi-block and multi-way data.
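The selection-and-deflation loop that the abstract builds on can be sketched as follows. This is a rough pure-Python rendering of a CovSel-style procedure as it is commonly described (pick the variable with the largest squared covariance with the response, then project the data orthogonally to it), for a single response. It does not reproduce g-CovSel's group construction or group-wise deflation, and the data and function names are made up for illustration.

```python
def center(columns):
    """Mean-center each column (a list of floats)."""
    return [[v - sum(col) / len(col) for v in col] for col in columns]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def deflate(vec, pivot):
    """Remove from vec its projection onto pivot."""
    denom = dot(pivot, pivot)
    if denom == 0.0:
        return vec[:]
    coef = dot(vec, pivot) / denom
    return [v - coef * p for v, p in zip(vec, pivot)]


def covsel(columns, y, n_select):
    """Greedy selection by squared covariance, with deflation after each pick."""
    X = center(columns)
    r = center([y])[0]
    selected = []
    for _ in range(n_select):
        scores = [-1.0 if j in selected else dot(col, r) ** 2
                  for j, col in enumerate(X)]
        best = max(range(len(X)), key=scores.__getitem__)
        selected.append(best)
        pivot = X[best]
        r = deflate(r, pivot)                    # deflate the response
        X = [deflate(col, pivot) for col in X]   # deflate every variable
    return selected


# Column 1 is an exact multiple of column 0 (redundant);
# column 2 carries complementary information.
columns = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [1, -1, 1, -1, 1]]
y = [4, 5, 10, 11, 16]  # equals 3 * column 0 + column 2
selected = covsel(columns, y, n_select=2)
print(selected)
```

After the first pick, deflation zeroes out the redundant column, so the second pick goes to the complementary variable. This is the parsimony property the abstract refers to; g-CovSel instead gathers such highly correlated variables into one group.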

Volume : 254 Pages : Article 105223
Open access PDF: https://www.sciencedirect.com/science/article/pii/S0169743924001631/pdfft?md5=52fb71b18968f61fe29df549f8fc05f7&pid=1-s2.0-S0169743924001631-main.pdf

Enhancing quantitative 1H NMR model generalizability on honey from different years through partial least squares subspace and optimal transport based unsupervised domain adaptation

Pub Date : 2024-08-28 DOI: 10.1016/j.chemolab.2024.105221

Peng Shan , Hongming Xiao , Xiang Li , Ruige Yang , Lin Zhang , Yuliang Zhao

Honey is a nourishing, natural food product favored by a diverse group of consumers. Proton nuclear magnetic resonance (¹H NMR) is a powerful tool for the quantitative analysis of honey and plays a crucial role in ensuring its quality. The ¹H NMR technique requires multivariate calibration models for the quantitative analysis of key compounds in honey. However, maintaining consistent measurement conditions across different years is scarcely possible, which can significantly shift the distribution of test spectra relative to training spectra and ultimately reduce the performance of predictive models. Unsupervised domain adaptation (UDA) methods have gained considerable attention for their ability to match distribution differences between the labeled source spectra and the unlabeled target spectra without costly annotation. To enhance the generalizability of quantitative models on honey from different years, we propose a UDA method known as partial least squares subspace and optimal transport-based UDA (PLSS-OT-UDA). This approach eliminates distribution differences between the source subspace and the target subspace via partial least squares (PLS) dimensionality reduction and optimal transport (OT). First, the optimal latent variable weight matrix of the source domain (i.e., labeled ¹H NMR data from 2017) is extracted with PLS. Next, the dimension of both the source and target domains (the latter being unlabeled ¹H NMR data from 2018) is reduced, and their corresponding subspaces are obtained with the weight matrix of the source domain. Finally, OT is employed to align the distributions of the source and target domains within the subspace. Experimental results on the honey dataset demonstrate that PLSS-OT-UDA outperforms traditional methods, including transfer component analysis (TCA), optimal transport for domain adaptation (OTDA), domain adaptation based on principal component analysis and optimal transport (PCA-OTDA), and subspace alignment (SA), with respect to generalization performance on three components: Baumé degree, sugar content, and water content.
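The three steps described above (source PLS weights, projection of both domains, OT alignment in the subspace) can be sketched in a deliberately reduced form. The sketch assumes a single latent variable, equally sized source and target sets with uniform sample weights, and centering of both domains by the source means; under those assumptions the one-dimensional OT plan reduces to matching samples by rank. The actual method uses several latent variables and a general OT solver, so this is an illustration only, and all names and numbers are hypothetical.

```python
import math


def column_means(X):
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(len(X[0]))]


def first_pls_weight(X, y):
    """First PLS1 weight vector: proportional to Xc^T yc, scaled to unit length."""
    mx = column_means(X)
    my = sum(y) / len(y)
    w = [sum((row[j] - mx[j]) * (yi - my) for row, yi in zip(X, y))
         for j in range(len(mx))]
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w], mx


def project(X, w, mx):
    """Scores of the rows of X on the weight vector, centered by mx."""
    return [sum((row[j] - mx[j]) * w[j] for j in range(len(w))) for row in X]


def ot_map_1d(source, target):
    """Equal-size, uniform-weight 1-D OT: send each source point to the
    target point of the same rank."""
    assert len(source) == len(target)
    order = sorted(range(len(source)), key=lambda i: source[i])
    sorted_target = sorted(target)
    mapped = [0.0] * len(source)
    for rank, i in enumerate(order):
        mapped[i] = sorted_target[rank]
    return mapped


# Toy "spectra": labeled source year and drifted, unlabeled target year.
X_source = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y_source = [1.0, 2.0, 3.0, 4.0]
X_target = [[2.0, 3.1], [3.1, 1.9], [4.2, 5.0], [5.1, 4.1]]

w, mx = first_pls_weight(X_source, y_source)
s_scores = project(X_source, w, mx)
t_scores = project(X_target, w, mx)   # same source weights for the target
mapped = ot_map_1d(s_scores, t_scores)
print(mapped)
```

A calibration model fitted on `mapped` together with the source labels would then operate in the target distribution; that final regression step is left out here.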

Volume : 254 Pages : Article 105221

Analyzing topological descriptors of guar gum and its derivatives for predicting physical properties in carbohydrates

Pub Date : 2024-08-24 DOI: 10.1016/j.chemolab.2024.105203

Xiujun Zhang , Shamaila Yousaf , Anisa Naeem , Ferdous M. Tawfiq , Adnan Aslam

Guar gum is a non-ionic polysaccharide found in abundance in nature. It may be used as a thickening agent, stabilizer, or emulsifier in pharmaceutical formulations, food products, or cosmetics. Its ability to form viscous solutions makes it useful in drug delivery systems, in controlled release formulations, and as a matrix for oral drug delivery. The investigation of chemical structures through graph invariants is of great interest. Topological descriptors are numerical values associated with a molecular structure that can predict certain physical and chemical properties of that structure. In this paper, we calculate the harmonic index, the inverse sum indeg index, the third Zagreb index, the hyper Zagreb index, the sigma index, the reformulated first Zagreb index, the reformulated multiplicative first Zagreb index, the harmonic–arithmetic index, and the atom-bond sum-connectivity index of guar gum and its chemical derivatives. Finally, the chemical applicability of these topological descriptors is checked for different carbohydrates (monosaccharides, disaccharides, and polysaccharides) using linear, quadratic, and logarithmic regression models. These topological descriptors are found to be useful for predicting two physical properties, namely density and molecular weight.
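Three of the degree-based descriptors named above have short standard definitions as sums over the edges uv of the molecular graph: the harmonic index sums 2/(d_u + d_v), the hyper Zagreb index sums (d_u + d_v)^2, and the inverse sum indeg index sums d_u d_v/(d_u + d_v), where d denotes vertex degree. The sketch below evaluates them on the hydrogen-suppressed graph of n-butane, chosen only because it is small; it is not one of the guar gum structures from the paper, and the function name is illustrative.

```python
from collections import Counter


def degree_based_indices(edges):
    """Compute three degree-based topological indices from an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {
        "harmonic": sum(2 / (deg[u] + deg[v]) for u, v in edges),
        "hyper_zagreb": sum((deg[u] + deg[v]) ** 2 for u, v in edges),
        "inverse_sum_indeg": sum(deg[u] * deg[v] / (deg[u] + deg[v])
                                 for u, v in edges),
    }


# Hydrogen-suppressed graph of n-butane: a path on four carbon atoms.
butane = [(0, 1), (1, 2), (2, 3)]
indices = degree_based_indices(butane)
print(indices)
```

For this path the vertex degrees are 1, 2, 2, 1, so the three edges contribute 2/3 + 1/2 + 2/3 to the harmonic index, 9 + 16 + 9 to the hyper Zagreb index, and 2/3 + 1 + 2/3 to the inverse sum indeg index.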

Volume : 253 Pages : Article 105203

Interpretation of high dimensional definitive screening designs assisted by bootstrapped partial least squares regression

Pub Date : 2024-08-24 DOI: 10.1016/j.chemolab.2024.105218

Knut Dyrstad , Frank Westad

Definitive screening design (DSD) has become a widely used type of design of experiments (DOE) for chemical, pharmaceutical, and biopharmaceutical process and product development because of its optimization properties: it estimates main, interaction, and squared variable effects with a minimum number of experiments. These high-dimensional DOEs, with more variables than samples and with partly correlated variables, frequently make statistical interpretation challenging. The purpose of the study was to test bootstrap PLSR with a heredity procedure for selecting the variable subset that is finally evaluated by MLR. The heredity selection was applied to bootstrap T values, obtained by dividing the original PLSR coefficients (B) by the bootstrap-estimated standard deviation. The investigated fractional weighted and non-parametric bootstrap PLSR methods resulted in the same variable selection outcome and final models in this study.

A simulation study with 7 main variables, together with 12 real-data DSDs from the literature with 4, 5, 7, and 8 main variables, showed improved model performance for small and particularly for large DSDs for the bootstrap PLSR MLR methods compared with two common DSD reference methods: DSD fit definitive screening and AICc forward stepwise regression (AICc FSR). Variable selection accuracy and predictive ability were significantly improved by the investigated method in 6 of the 13 DSDs compared with the best model from either of the two reference methods. The remaining 7 DSDs gave the same model as the best reference model. Strong heredity was found to provide the best models for all real data in this study. Applying the heredity procedure to the percent non-zero SVEM FSR variable effects, followed by MLR, showed promising results. AICc Lasso regression was among the other methods partially tested and was found to set almost all variable effects to zero when tested on three large minimum DSDs. While the DSD fit definitive screening method may often be the first choice for DSD, heredity bootstrap PLSR MLR and heredity SVEM FSR MLR may be alternative methods for improving variable selection and model precision.
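The two ingredients named above, a bootstrap T value (original coefficient divided by the bootstrap standard deviation) and a strong-heredity filter (an interaction or squared term is kept only if all of its parent main effects are kept), can be sketched as follows. The PLSR fitting and resampling are not shown; the coefficients, the bootstrap replicates, the `*` notation for higher-order terms, and the cutoff of 2.0 are all hypothetical choices made for illustration.

```python
import statistics


def bootstrap_t_values(coefficients, bootstrap_replicates):
    """T = original coefficient / standard deviation across bootstrap fits."""
    return {term: coefficients[term] / statistics.stdev(bootstrap_replicates[term])
            for term in coefficients}


def strong_heredity_selection(t_values, threshold):
    """Keep significant main effects, then significant higher-order terms
    whose parents are all among the kept main effects."""
    mains = {term for term, t in t_values.items()
             if "*" not in term and abs(t) >= threshold}
    selected = set(mains)
    for term, t in t_values.items():
        if "*" in term and abs(t) >= threshold:
            parents = set(term.split("*"))  # "A*A" has the single parent "A"
            if parents <= mains:
                selected.add(term)
    return selected


# Hypothetical coefficients from a fit on the original data.
coefficients = {"A": 2.0, "B": 1.5, "C": 0.1, "A*B": 1.2, "A*C": 1.4, "A*A": 0.9}
# Hypothetical bootstrap replicates: three refits per term, spread of 0.5.
replicates = {term: [c - 0.5, c, c + 0.5] for term, c in coefficients.items()}

t_values = bootstrap_t_values(coefficients, replicates)
selected = strong_heredity_selection(t_values, threshold=2.0)
print(sorted(selected))
```

In this toy case the A*C interaction has a large T value but is dropped because C itself is not selected, which is the behavior that distinguishes strong heredity from selecting terms purely on their own T values.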

Volume : 253 Pages : Article 105218

Citations: 0

NIR and MIR spectral feature information fusion strategy for multivariate quantitative analysis of tobacco components

IF 3.7 | Zone 2 (Chemistry) | Q2 AUTOMATION & CONTROL SYSTEMS

Chemometrics and Intelligent Laboratory Systems

Pub Date : 2024-08-23 DOI: 10.1016/j.chemolab.2024.105222

Honghong Wang , Qiong Wu , Wuye Yang , Jie Yu , Ting Wu , Zhixin Xiong , Yiping Du

Determining the total nicotine, total sugar, reducing sugar and total nitrogen contents of tobacco is of great significance for tobacco quality evaluation and formulation design. To detect these four components rapidly, two NIR-MIR spectral fusion techniques were studied using near-infrared (NIR) and mid-infrared (MIR) spectra from 129 solid tobacco powder samples provided by Shanghai Tobacco Group Co., Ltd. Fusion technology 1 selects variables from each spectrum separately and then fuses the selected feature variables to build the model. Fusion technology 2 first fuses the NIR and MIR spectral data and then selects variables to build the model. Both fusion technologies use the successive projections algorithm (SPA), competitive adaptive reweighted sampling (CARS), backward interval PLS (biPLS), forward interval PLS (fiPLS), synergy interval PLS (siPLS), and interval interaction moving window partial least squares (iMWPLS) to select wavelength variables. For total nicotine and total sugar, the PLSR model built with fusion technology 2 combined with iMWPLS performed best: compared with the full-spectrum fusion method, its RMSEP decreased from 0.2314 and 1.3225 to 0.0821 and 0.8079, respectively, and it outperformed the single NIR and MIR models and NIR-MIR fusion technology 1. For reducing sugar, the simple full-spectrum fusion model had the best analytical ability and the lowest RMSEP, outperforming the single NIR and MIR models and all models built by the two spectral fusion techniques combined with the six wavelength selection algorithms. For total nitrogen, fusion technology 1 combined with iMWPLS predicted significantly better than the single NIR and MIR models and NIR-MIR fusion technology 2, with an RMSEP of 0.0634.
The results showed that the two NIR-MIR spectral fusion techniques make full use of the complementary information provided by NIR and MIR spectroscopy and were successfully applied to the rapid detection of total nicotine, total sugar, reducing sugar and total nitrogen content in tobacco, providing a new method and approach for the rapid detection of tobacco components.
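
The difference between the two fusion orders can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the selector below ranks variables by absolute Pearson correlation with the target, a deliberately simple stand-in for SPA, CARS or the interval PLS methods, and all function names and the toy data are hypothetical.

```python
from math import sqrt

def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / sqrt(sxx * syy)

def select_top_k(X, y, k):
    """Sorted indices of the k columns of X (a list of rows) most correlated with y.

    Stand-in selector (assumption): NOT one of the algorithms named in the abstract.
    """
    n_cols = len(X[0])
    scores = [abs_corr([row[j] for row in X], y) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def take_columns(X, cols):
    """Keep only the listed columns of X."""
    return [[row[j] for j in cols] for row in X]

def fuse(A, B):
    """Concatenate two spectral blocks sample by sample."""
    return [a + b for a, b in zip(A, B)]

def fusion_1(nir, mir, y, k):
    """Select k variables within each block first, then fuse the selections."""
    return fuse(take_columns(nir, select_top_k(nir, y, k)),
                take_columns(mir, select_top_k(mir, y, k)))

def fusion_2(nir, mir, y, k):
    """Fuse the full blocks first, then select k variables from the fused matrix."""
    fused = fuse(nir, mir)
    return take_columns(fused, select_top_k(fused, y, k))

# Toy data: 4 samples, 3 "NIR" variables, 2 "MIR" variables (made up).
nir = [[1.0, 5.0, 0.2], [2.0, 3.0, 0.1], [3.0, 6.0, 0.4], [4.0, 2.0, 0.3]]
mir = [[0.5, 9.0], [0.4, 7.9], [0.9, 7.1], [0.1, 6.0]]
y = [1.1, 1.9, 3.2, 3.9]

f1 = fusion_1(nir, mir, y, k=1)  # one variable per block, two in total
f2 = fusion_2(nir, mir, y, k=2)  # two variables from the fused matrix
```

In fusion technology 1 each block is guaranteed its own share of selected variables, whereas in fusion technology 2 the NIR and MIR variables compete for the same slots, so the two orders can end up with different variable sets on real spectra. The selected matrix would then be passed to a regression model such as PLSR.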

Citations: 0

Joint state and process inputs estimation for state-space models with Student’s t-distribution

IF 3.7 | Zone 2 (Chemistry) | Q2 AUTOMATION & CONTROL SYSTEMS

Chemometrics and Intelligent Laboratory Systems

Pub Date : 2024-08-23 DOI: 10.1016/j.chemolab.2024.105220

Hang Ci, Chengxi Zhang, Shunyi Zhao

This paper proposes a discrete-time method for jointly estimating the states and unknown inputs (UIs) of industrial processes represented by a state-space model. To cope with outliers in process data, the measurement noise is modeled with the Student’s t-distribution. The UIs are identified through the recursive expectation-maximization (REM) approach. Specifically, in the E-step, a recursively calculated Q-function is formulated under the maximum likelihood criterion, and the states and the variance scale factor are estimated iteratively. In the M-step, the UIs are updated analytically, while the degrees of freedom are updated approximately. The effectiveness of the proposed algorithm is validated on a quadruple water tank process and a continuous stirred tank reactor. The results show that the proposed method significantly enhances the robustness and accuracy of state and UI estimation in industrial processes, handling outliers effectively and reducing computational demands for real-time applications.
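
Why a Student's t noise model resists outliers can be seen in a much simpler setting than the paper's. The sketch below is an assumption-laden toy, not the proposed REM algorithm: it has no state-space dynamics and no recursion, the degrees of freedom `nu` and the scale `sigma` are held fixed, and it estimates only a constant level from scalar measurements. It uses the standard scale-mixture result that the expected variance scale factor of a sample is `(nu + 1) / (nu + r**2 / sigma**2)`, where `r` is the residual. The function name and data are hypothetical.

```python
def t_location_em(y, nu=3.0, sigma=0.5, n_iter=50):
    """EM estimate of a constant level under Student's t measurement noise.

    Simplified illustration only: nu and sigma are fixed (assumption),
    and there is no process model.
    """
    mu = sum(y) / len(y)  # start from the ordinary mean
    w = [1.0] * len(y)
    for _ in range(n_iter):
        # E-step: expected scale factor per sample; large residuals get small weights
        w = [(nu + 1.0) / (nu + ((v - mu) / sigma) ** 2) for v in y]
        # M-step: weighted mean using those scale factors
        mu = sum(wi * v for wi, v in zip(w, y)) / sum(w)
    return mu, w

# Four consistent readings and one gross outlier (made-up data).
data = [1.0, 1.1, 0.9, 1.0, 10.0]
plain_mean = sum(data) / len(data)
robust_mu, weights = t_location_em(data)
```

The ordinary mean is dragged toward the outlier, while the weighted estimate stays near the bulk of the readings because the outlier's scale factor shrinks toward zero. The method in the abstract applies the same kind of weighting inside a recursive state and input estimator.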

Citations: 0

Combining algorithm techniques with mechanical and acoustic profiles for the prediction of apples sensory attributes

IF 3.7 | Zone 2 (Chemistry) | Q2 AUTOMATION & CONTROL SYSTEMS

Chemometrics and Intelligent Laboratory Systems

Pub Date : 2024-08-22 DOI: 10.1016/j.chemolab.2024.105217

Riccardo Ricci , Annachiara Berardinelli , Flavia Gasperi , Isabella Endrizzi , Farid Melgani , Eugenio Aprea

This work shows the potential of advanced linear and nonlinear learning algorithms for predicting the texture sensory attributes of apples: “hardness”, “crunchiness”, “flouriness”, “fibrousness”, and “graininess”. Starting from the information contained in the entire mechanical and acoustic curves acquired during sample compression tests, the prediction performances of five different statistical tools, including Partial Least Squares regression (PLS), Multilayer Perceptron (MLP), Support Vector Regression (SVR) and Gaussian Process Regression (GPR), are shown and discussed.

Validation of all predictive models shows the best accuracies for the texture sensory attributes “hardness” and “crunchiness” and, in general, for the GPR learning algorithm. By combining mechanical and acoustic profiles, 5-fold cross-validation yields coefficients of determination R² of up to 0.885 (GPR) and 0.840 (GPR) for “hardness” and “crunchiness”, respectively. These results are comparable to those obtained when a large number of mechanical and acoustic parameters extracted from the acquired profiles are used as predictive factors, and they point to a new and reliable way of predicting the texture sensory attributes of apples. The proposed approach removes the need to define in advance the number and type of features to be calculated from instrumental texture profiles, and it can be easily implemented in an automatic process.
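
A cross-validated R² of the kind reported above can be computed as in the following sketch. It is illustrative only: a one-variable least-squares line stands in for PLS, MLP, SVR or GPR, folds are assigned by interleaving rather than random shuffling, R² is pooled over all held-out predictions (per-fold averaging is another common convention), and the names and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def kfold_r2(x, y, k=5):
    """R² computed from predictions made only on held-out samples."""
    n = len(x)
    preds = [0.0] * n
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        slope, intercept = fit_line([x[i] for i in train_idx],
                                    [y[i] for i in train_idx])
        for i in test_idx:
            preds[i] = slope * x[i] + intercept
    my = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, preds))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Made-up data: an exact line, and the same line with small alternating noise.
x = [float(i) for i in range(10)]
y_exact = [2.0 * v + 1.0 for v in x]
y_noisy = [2.0 * v + 1.0 + (0.1 if i % 2 == 0 else -0.1) for i, v in enumerate(x)]

r2_exact = kfold_r2(x, y_exact, k=5)
r2_noisy = kfold_r2(x, y_noisy, k=5)
```

Because every prediction comes from a model that never saw that sample, the resulting R² estimates performance on new data rather than goodness of fit on the training set.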

Citations: 0

Combination of machine learning and COSMO-RS thermodynamic model in predicting solubility parameters of coformers in production of cocrystals for enhanced drug solubility

IF 3.7 | Zone 2 (Chemistry) | Q2 AUTOMATION & CONTROL SYSTEMS

Chemometrics and Intelligent Laboratory Systems

Pub Date : 2024-08-22 DOI: 10.1016/j.chemolab.2024.105219

Wael A. Mahdi , Ahmad J. Obaidullah

In this study, we develop predictive models for three target variables, denoted δ_d, δ_p, and δ_h, using a dataset with 86 features and 181 samples. The response parameters, which are Hansen solubility parameters, were correlated with the input parameters via several machine learning techniques. The input features are molecular descriptors of coformers, calculated with the COSMO-RS thermodynamic model and a group contribution approach. The analysis includes outlier detection via Cook's distance, normalization with a min-max scaler, and feature selection through L1-based methods. Three regression models, Gaussian Process Regression (GPR), Passive Aggressive Regression (PAR), and Polynomial Regression (PR), are employed, with hyperparameters optimized using Transient Search Optimization (TSO). For δ_d, the PAR model outperforms the others with an R² of 0.885, RMSE of 0.607, MAE of 0.524, and a maximum error of 1.294. The GPR model performs slightly worse on δ_d, with an R² of 0.872, RMSE of 0.816, MAE of 0.579, and a maximum error of 2.755. The PR model reaches an R² of 0.814, RMSE of 0.923, MAE of 0.597, and a maximum error of 2.814 on δ_d. For δ_p, the GPR model performs best, achieving an R² of 0.821, RMSE of 1.693, MAE of 1.391, and a maximum error of 3.457. The PAR model reaches an R² of 0.740, RMSE of 2.025, MAE of 1.980, and a maximum error of 6.609 on δ_p.
The PR model predicts δ_p with an R² of 0.7, RMSE of 2.329, MAE of 2.02, and a maximum error of 6.366. For δ_h, the GPR model again performs best, with an R² of 0.983, RMSE of 1.243, MAE of 1.005, and a maximum error of 2.577. The PAR model predicts δ_h with an R² of 0.924, RMSE of 2.713, MAE of 2.416, and a maximum error of 6.307, and the PR model with an R² of 0.927, RMSE of 2.757, MAE of 2.334, and a maximum error of 8.064. These results highlight the efficacy of the chosen models and optimization techniques in accurately predicting the specified outputs and show considerable potential for application in related predictive modeling tasks.
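
For readers comparing the figures above, the four reported scores and the min-max normalization step follow standard definitions, sketched here in plain Python. The helper names and the example numbers are hypothetical and are not taken from the study's data.

```python
from math import sqrt

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: map everything to 0
    return [(v - lo) / (hi - lo) for v in values]

def regression_metrics(y_true, y_pred):
    """R², RMSE, MAE and maximum absolute error for paired predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "max_error": max(abs(e) for e in errors),
    }

# Made-up example values.
scaled = min_max_scale([2.0, 4.0, 6.0])
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

RMSE is never smaller than MAE, and the gap between them, together with the maximum error, indicates how much a few large misses contribute to the overall error.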

Citations: 0