
Journal of Chemometrics: Latest Articles

Being Aware of Data Leakage and Cross-Validation Scaling in Chemometric Model Validation
IF 2.1 Zone 4, Chemistry Q1 SOCIAL WORK Pub Date: 2025-04-01 DOI: 10.1002/cem.70026
Péter Király, Gergely Tóth
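Chemometrics is one of the most elaborate fields of data science. For several decades it has been, and still is, a pioneer in the use of novel machine learning methods. The literature on chemometric modeling is enormous; there are many guidance documents, software packages, and other descriptions of how to perform a careful analysis. On the other hand, the literature is often contradictory and inconsistent. In many studies, results on specific datasets are generalized without justification, and the generalized idea is later cited without its original limits. In some cases, differences in the nomenclature of methods cause misinterpretations. As in every field of science, some preferences among methods rest on the strength of research groups rather than on a flexible and genuinely scientific approach to selecting among the possibilities. There is also some inconsistency between the practical approach of chemometrics and theoretical statistics, where unrealistic assumptions and limits are often studied.

The extensive know-how of chemometrics brings some rigidity to the field. Chemometrics adapts slowly to some trends in data science. One example is thinking exclusively within the bias-variance trade-off when building models [1] instead of using models in the double descent region for large datasets [2-4]. Another problematic issue is data leakage. To this day, chemometric models are built and often validated on datasets that suffer from data leakage.

In our investigations, we met cases where the huge body of literature created large inertia against correcting misinterpretations. In 2021 we found that leave-one-out and leave-many-out cross-validation (LMO-CV) parameters can be scaled to each other [5]. Furthermore, we showed that the two approaches have about the same uncertainty in multiple linear regression (MLR) calculations [6]. Therefore, the choice between these methods should be based on computational practicality rather than preconceptions. We received some formal and informal criticism for omitting the results of some well-cited studies.

In this article, we present some examples to encourage rethinking of some traditional solutions in chemometrics. Some calculations show how data leakage arises in chemometric tasks. Our other calculations focus on the scaling law in order to rehabilitate leave-one-out cross-validation.

In machine learning, data leakage means using information during model building that biases the assessment of the model's predictions or that will not be available during real predictive application of the model. A typical and easy-to-detect example is when cases very similar to training cases are present in the test set. A different form of leakage occurs when the explanatory variables include variables or classes that are too closely related to the response variables. Data leakage causes problems in model performance assessment similar to those of overfitting, but the two differ in their definitions and in the source of the validation difficulties. They can occur independently; all combinations are possible, for example, strong overfitting without data leakage or strong data leakage without overfitting. Their common effect is that they reduce the validity of model validation, except in the limit of infinite training and test set sizes for models of close-to-optimal complexity. In this limit, the effects of data leakage and overfitting on the performance parameters tend to zero.

Typical case leakage occurs between the training and test sets. An optimal test set is never used during training, nor in any decision on model selection or hyperparameter optimization. The test set should represent the intended application of the model. If the dataset is large enough to be split into training and test sets, and the latter represents the intended application domain well, the test set can be selected from the existing data before model building starts. If the intended application has large variability that the starting dataset lacks, one or more optimal test sets should be obtained in new measurement campaigns. Thus, an independent test set can be obtained either by splitting the data before modeling begins or from new measurements later. Sampling can follow two approaches: simple statistical sampling, when there is no preference over the ranges of the predictors or responses during selection, or designed sampling based on different sampling theories. Ref. [7], for example, describes these possibilities in detail.

For models with hyperparameters, the simplest train/test split is not enough. At a minimum, the training set must be divided into a temporary training set and a validation set [26]. In the simplest approach, the model with given hyperparameters is parametrized on the temporary training set and evaluated on the validation set. The choice among models with different hyperparameters is based on their performance on the validation set. The final model is usually re-parametrized on the aggregated temporary training and validation sets. Data leakage between the temporary training and validation sets occurs mainly at this aggregation stage. It inherently leads to biased model selection. There may be further leakage if validation parameters are used in hyperparameter optimization. If a given validation parameter is used in selecting the hyperparameters, it becomes overoptimistic for the final model compared with other validation parameters. We can call this effect parameter leakage. Parameter leakage can also appear in variable selection.

The OECD QSAR guidance [8] divides validation into internal and external procedures. Internal validation means that the validation parameters of model performance are calculated from the data used in model building and model selection. External validation means that the validation parameters are calculated on a test set (an optimal test set as defined above). The sole purpose of external validation should be to assess the predictivity of the final model. The purpose of internal validation is to assess the goodness of fit on the training set and the robustness of the model. The latter is mostly handled by cross-validation methods and sometimes by bootstrapping. The OECD guidance does not detail how hyperparameter optimization should be carried out.

Using cross-validation does not remove the need for external (test) validation. Cross-validation has a role in hyperparameter optimization, especially when a separate validation set cannot be set aside for the task. Moreover, for small datasets where an independent test set is not possible, cross-validation is also an approximate tool for estimating predictivity. In any case, we must keep in mind the conclusion of Ref. [7] that cross-validation is only a suboptimal simulation of test set validation.

The OECD guidance is rarely followed in all aspects, especially regarding the requirement for an independent test set. Instead, over the past three decades the chemometric literature has placed great emphasis on how to perform cross-validation [9-12, 25]. Various tasks are listed for it, such as the selection of models, hyperparameters, and variables, and it is often used to give a realistic estimate of the "predictive" ability of a model. Contrary to the clear trend in data science, there is a debate in chemometrics on whether predictive ability can be determined using cross-validation methods alone [7, 13-19]. There are several cross-validation methods; among them, repeated and double cross-validation provide stable validation parameters, albeit with data leakage [20, 21, 12]. In double cross-validation schemes (sometimes called nested schemes), one of the loops usually splits the data into "test" and validation + temporary training sets, but a closer look shows that these "test" sets do not meet the leakage-free requirement described above. Some developers justify the lack of a real external test set with the idea that a single test set is no test set, because validation parameters calculated on a single set have large variance [23, 24]. In any case, the optimal solution is a very large test set that covers all the variability of the intended application. If that is not possible, a good solution is to use several independent test sets, for example from different measurement campaigns, to exemplify the variability of later applications.

There are three main methods of non-nested cross-validation. Their names are used inconsistently, which causes misunderstandings about their capabilities. In our previous studies, we followed the naming convention of the OECD guidance. We call leave-one-out cross-validation (LOO-CV) the following procedure: if ntrain cases are used to optimize the basic model parameters, models are built on ntrain-1 observations. In total, ntrain models are calculated, so that every case is omitted exactly once. The validation parameters are calculated over all training cases, but using only the model predictions obtained from the model in whose training the given case was not used. We call LMO-CV the procedure (following the OECD) that works like LOO-CV but omits m cases at a time. In total, ntrain/m models are built and every case is omitted exactly once. The validation parameters are calculated similarly over all ntrain cases, using only the predictions from models that did not use the given case in training. This method is sometimes called m-fold cross-validation. The third type of non-nested cross-validation splits the training set into a validation set of nv elements and a disjoint set of nc = ntrain-nv elements, on which the temporary model is trained. The validation parameters are calculated on the nv set. Usually, the split into nv and nc is repeated several times, and the validation parameters are averaged over the repetitions. We call this procedure repeated cross-validation (REP-CV). In the literature, some authors call it leave-many-out cross-validation or LMO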
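The three non-nested schemes named in this abstract (LOO-CV, LMO-CV, and REP-CV) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' code: the model is a one-predictor least-squares line, the validation parameter is the mean squared prediction error, the LMO blocks are consecutive index ranges (real data would normally be shuffled first), and all function names and the small dataset are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; stands in for any regression model."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def squared_errors(xs, ys, train_idx, held_out_idx):
    """Fit on train_idx only, return squared prediction errors on held_out_idx."""
    a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    return [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out_idx]

def lmo_folds(n_train, m):
    """Blocks of m cases, each case omitted exactly once; m = 1 gives LOO-CV."""
    return [list(range(s, min(s + m, n_train))) for s in range(0, n_train, m)]

def partition_cv_mse(xs, ys, m):
    """LOO-CV (m = 1) or LMO-CV: pool the out-of-model errors of all cases."""
    n = len(xs)
    errors = []
    for held_out in lmo_folds(n, m):
        omitted = set(held_out)
        train = [i for i in range(n) if i not in omitted]
        errors.extend(squared_errors(xs, ys, train, held_out))
    return sum(errors) / n

def rep_cv_mse(xs, ys, n_v, repeats, seed=0):
    """REP-CV: repeat a random n_v / n_c split and average the validation MSE."""
    rng = random.Random(seed)
    n = len(xs)
    scores = []
    for _ in range(repeats):
        val = rng.sample(range(n), n_v)
        chosen = set(val)
        train = [i for i in range(n) if i not in chosen]
        errs = squared_errors(xs, ys, train, val)
        scores.append(sum(errs) / n_v)
    return sum(scores) / repeats

# Made-up data for illustration only (not taken from the article).
xs = [float(i) for i in range(10)]
ys = [1.0, 3.2, 4.9, 7.1, 9.0, 10.8, 13.1, 15.0, 16.9, 19.2]
print(partition_cv_mse(xs, ys, 1))            # LOO-CV
print(partition_cv_mse(xs, ys, 5))            # LMO-CV with m = 5
print(rep_cv_mse(xs, ys, n_v=3, repeats=20))  # REP-CV
```

In the first two schemes every case contributes exactly one out-of-model prediction, whereas in REP-CV a case may be validated several times or never, which is one reason the names should not be used interchangeably.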
Citations: 0
Green and Rapid Quantification of Ciprofloxacin Hydrochloride and Tylosin Tartrate in Veterinary Formulation using UV Spectrophotometric Method: A Comparative Study of Nature-Inspired Algorithms for Feature Selection
IF 2.1 Zone 4, Chemistry Q1 SOCIAL WORK Pub Date: 2025-03-29 DOI: 10.1002/cem.70023
Mostafa M. Eraqi, Ayman M. Algohary, Youssef O. Al-Ghamdi, Ahmed M. Ibrahim

Rapid and accurate quantification of ciprofloxacin hydrochloride (CIP) and tylosin tartrate (TYZ) in veterinary formulations is crucial for ensuring product quality and therapeutic efficacy. This study introduces a green and cost-effective analytical method that combines the simplicity of UV spectrophotometry with the optimization power of nature-inspired algorithms for the simultaneous determination of CIP and TYZ in a veterinary tablet formulation. Fourteen nature-inspired algorithms were compared using root average squared error (RASE), average absolute error (AAE), and the coefficient of determination (R²). The Corona virus optimization (CVO) algorithm and the bat algorithm performed best for CIP and TYZ, respectively. For the calibration set, the CVO algorithm, optimized for CIP, gave RASE, AAE, and R² values of 0.37, 0.27, and 0.998, respectively, while the bat algorithm, tailored for TYZ, gave RASE, AAE, and R² values of 0.54, 0.41, and 0.984. The test sets gave RASE, AAE, and R² values of 0.55, 0.46, and 0.991 for CIP and 0.20, 0.15, and 0.995 for TYZ, respectively, confirming the algorithms' predictive ability. Validation was performed using the accuracy profile approach. The limits of detection (LODs) were 0.86 μg mL⁻¹ for CIP and 0.36 μg mL⁻¹ for TYZ, while the limits of quantification (LOQs) were 2.88 μg mL⁻¹ for CIP and 1.21 μg mL⁻¹ for TYZ. The method's environmental impact was assessed using the Green Solvent Selection Tool (GSST), the National Environmental Methods Index (NEMI), a modified Eco-Scale, the Modified GAPI (MoGAPI), and a complementary whiteness evaluation via the RGBfast algorithm, confirming its eco-friendly profile. The proposed method demonstrated superior greenness, as reflected in its elevated GSST scores and favorable NEMI assessment. Specifically, the method achieved a modified Eco-Scale score of 84, a MoGAPI score of 81, and a whiteness index of 61, as determined by the RGBfast algorithm. These results confirm the method's environmentally sustainable profile and reinforce its suitability for green analytical applications. Compared with conventional chromatographic techniques, this approach offers significant advantages in cost, speed, and environmental sustainability, paving the way for more efficient and greener analytical methods in pharmaceutical quality control. Furthermore, this study highlights the integration of UV spectroscopy with nature-inspired algorithms, which represents a significant advance over conventional UV methodologies for pharmaceutical analysis.
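The three figures of merit quoted in this abstract can be computed as sketched below. The definitions are assumptions, since the abstract does not state them: RASE is taken as the square root of the mean squared error, AAE as the mean absolute error, and R² as one minus the ratio of the residual to the total sum of squares (some studies report the squared correlation coefficient instead). The function names and the concentrations are hypothetical and are not data from the study.

```python
import math

def rase(y_true, y_pred):
    """Root average squared error: sqrt of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def aae(y_true, y_pred):
    """Average absolute error: mean of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed definition)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical reference and predicted concentrations (illustration only).
reference = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
print(rase(reference, predicted), aae(reference, predicted), r2(reference, predicted))
```

Reporting the same metrics separately for the calibration and test sets, as the abstract does, only supports a claim of predictive ability if the test samples played no part in the feature selection.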

Citations: 0
Foreword for Special Issue Devoted to the 14th Winter Symposium on Chemometrics (2024)
IF 2.1 Zone 4, Chemistry Q1 SOCIAL WORK Pub Date: 2025-03-25 DOI: 10.1002/cem.70022
Anastasiia Surkova, Dmitry Kirsanov

The 14th Winter Symposium on Chemometrics (WSC14) was held in Tsaghkadzor (Armenia) from 26 February to 1 March 2024. The WSC is a biennial international meeting series that started in Russia in 2002. Since then, WSC has become an important event, well known among chemometric meetings for its friendly and relaxed atmosphere, rich social program, and consistently high quality of scientific presentations. The scope of WSC meetings covers all relevant topics in modern chemometrics, in both theoretical developments and practical applications. In 2024, the conference was held under the auspices of the Armenian Academy of Sciences. Thirty-six participants from eight countries took part in the meeting, and the scientific program contained six lectures, 16 talks, and 17 poster presentations. The invited lectures were delivered by Prof. Douglas N. Rutledge (France), Prof. Stefan Tsakovski (Bulgaria), Prof. Hadi Parastar (Iran), and Prof. Xihui Bian (China). Key lectures were presented by Dr. Alexey Pomerantsev and Dr. Oxana Rodionova. The presentation topics included applications of near-infrared spectrometry, hyperspectral imaging, QSPR, aquaphotomics, multiblock data analysis, machine learning, and deep learning.

The conference venue was in a spectacular location near the Tsaghkadzor ski resort, and as part of the sports program the participants were able to enjoy skiing in the beautiful Armenian mountains. Traditional evening gatherings, the so-called “scores and loadings,” were held every conference evening with guitar playing, singing, and informal discussions on all possible topics, either highly scientific or deeply prosaic. The last day of the conference was devoted to guided tours of Lake Sevan, with the ancient Sevanavank monastery, and of Yerevan, the capital of hospitable Armenia.

The WSC meetings are always very friendly to young scientists and offer a Best Young Scientist award; this year the prize was the registration for CAC-2024 (Chemometrics in Analytical Chemistry) in Argentina. The respected jury of senior chemometricians gave the award to Dr. Ekaterina Boichenko for her talk “Near-infrared spectroscopy and chemometrics: a promising combination for real-time and nondestructive classification of urinary stones.” Three best poster prizes were awarded to Anastasia Sholokhova, Dr. Maria Khaydukova, and Dr. Larisa Lvova. If the feedback from participants is to be believed, it was, all in all, an enjoyable event. The place and time of WSC15 will be announced soon.

Organizing committee of the 14th WSC.

Citations: 0
Multi-Block Chemometric Approaches to the Unsupervised Spectral Characterization of Geological Samples
IF 2.1 Zone 4, Chemistry Q1 SOCIAL WORK Pub Date: 2025-03-16 DOI: 10.1002/cem.70010
Beatriz Galindo-Prieto, Ian S. Mudway, Johan Linderholm, Paul Geladi

As an example of the potential use of multi-block chemometric methods to provide improved unsupervised characterization of compositionally complex materials through the integration of multi-modal spectrometric data sets, we analysed spectral data derived from five field instruments (one XRF, two NIR, and two FT-Raman), collected on 76 bedrock samples of diverse composition. These data were analysed by single- and multi-block latent variable models based on principal component analysis (PCA) and partial least squares (PLS). For the single-block approach, PCA and PLS models were generated, whilst hierarchical partial least squares (HPLS) regression was applied for the multi-block modelling. Using the variable influence on projection (VIP) feature selection method, we also tested whether dimensionality reduction resulted in a more computationally efficient multi-block HPLS model with enhanced interpretability and geological characterization power.

The results showed differences in the characterization power of the five spectrometer data sets for the bedrock samples based on their mineral composition and geological properties; moreover, some spectroscopic techniques under-performed in distinguishing samples by composition. The multi-block HPLS model and its VIP-strengthened variant yielded a more complete unsupervised geological grouping of the samples in a single parsimonious model. We conclude that multi-block HPLS models are effective at combining multi-modal spectrometric data to provide a more comprehensive characterization of compositionally complex samples, and that VIP can reduce HPLS model complexity while increasing data interpretability. These approaches have been applied here to a geological data set, but are amenable to a broad range of applications across chemical and biomedical disciplines.

Citations: 0
Fast Partition-Based Cross-Validation With Centering and Scaling for $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$
IF 2.1 Zone 4, Chemistry Q1 SOCIAL WORK Pub Date: 2025-03-13 DOI: 10.1002/cem.70008
Ole-Christian Galbo Engstrøm, Martin Holm Jensen

We present algorithms that substantially accelerate partition-based cross-validation for machine learning models that require matrix products $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$. Our algorithms have applications in model selection for, for example, principal component analysis (PCA), principal component regression (PCR), ridge regression (RR), ordinary least squares (OLS), and partial least squares (PLS). Our algorithms support all combinations of column-wise centering and scaling of $\mathbf{X}$ and $\mathbf{Y}$, and we demonstrate in our accompanying implementation that this adds only a manageable, practical constant over efficient variants without preprocessing. We prove the correctness of our algorithms under a fold-based partitioning scheme and show that the running time is independent of the number of folds; that is, they have the same time complexity as that of computing $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$ [...]

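Specifically, we show how to manipulate $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$ using only samples from the validation partition to obtain the preprocessed training-partition $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$. To our knowledge, we are the first to derive correct and efficient cross-validation algorithms for any of the 16 combinations of column-wise centering and scaling, and we also prove that only 12 of them give distinct matrix products.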
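The idea of obtaining training-partition products from the validation samples alone can be illustrated for the centering-only case. The sketch below is not the authors' algorithm or implementation; it only demonstrates the underlying identities, assuming that the full-data products and column sums are computed once: the training product equals the full product minus the validation product, and centering with the training means subtracts n_t times the outer product of those means. Scaling is omitted, and all names are hypothetical.

```python
def gram(rows_a, rows_b):
    """Return A^T B for matrices given as lists of rows (n x p and n x q)."""
    p, q = len(rows_a[0]), len(rows_b[0])
    out = [[0.0] * q for _ in range(p)]
    for ra, rb in zip(rows_a, rows_b):
        for i in range(p):
            for j in range(q):
                out[i][j] += ra[i] * rb[j]
    return out

def col_sums(rows):
    """Column sums of a matrix given as a list of rows."""
    return [sum(col) for col in zip(*rows)]

def centered_train_products(X, Y, val_idx, xtx, xty, sx, sy):
    """Centered training X^T X and X^T Y, touching only the validation rows.

    xtx, xty, sx, sy are the full-data products and column sums, computed once.
    Assumes a non-empty validation fold.
    """
    Xv = [X[i] for i in val_idx]
    Yv = [Y[i] for i in val_idx]
    n_t = len(X) - len(val_idx)
    vxx, vxy = gram(Xv, Xv), gram(Xv, Yv)
    # Training means come from training rows only, so no validation
    # information leaks into the preprocessing.
    mx = [(s - v) / n_t for s, v in zip(sx, col_sums(Xv))]
    my = [(s - v) / n_t for s, v in zip(sy, col_sums(Yv))]
    p, q = len(mx), len(my)
    txx = [[xtx[i][j] - vxx[i][j] - n_t * mx[i] * mx[j] for j in range(p)]
           for i in range(p)]
    txy = [[xty[i][j] - vxy[i][j] - n_t * mx[i] * my[j] for j in range(q)]
           for i in range(p)]
    return txx, txy
```

Because each fold only needs the products of its own validation rows, the total work over all folds is on the order of one pass over the data, which is consistent with the abstract's claim that the running time does not depend on the number of folds.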
{"title":"Fast Partition-Based Cross-Validation With Centering and Scaling for \u0000 \u0000 \u0000 \u0000 \u0000 X\u0000 \u0000 \u0000 T\u0000 \u0000 \u0000 X\u0000 \u0000 $$ {mathbf{X}}^{mathbf{T}}mathbf{X} $$\u0000 and \u0000 \u0000 \u0000 \u0000 \u0000 X\u0000 \u0000 \u0000 T\u0000 \u0000 \u0000 Y\u0000 \u0000 $$ {mathbf{X}}^{mathbf{T}}mathbf{Y} $$","authors":"Ole-Christian Galbo Engstrøm,&nbsp;Martin Holm Jensen","doi":"10.1002/cem.70008","DOIUrl":"10.1002/cem.70008","url":null,"abstract":"<p>We present algorithms that substantially accelerate partition-based cross-validation for machine learning models that require matrix products <span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <msup>\u0000 <mrow>\u0000 <mi>X</mi>\u0000 </mrow>\u0000 <mrow>\u0000 <mi>T</mi>\u0000 </mrow>\u0000 </msup>\u0000 <mi>X</mi>\u0000 </mrow>\u0000 <annotation>$$ {mathbf{X}}^{mathbf{T}}mathbf{X} $$</annotation>\u0000 </semantics></math> and <span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <msup>\u0000 <mrow>\u0000 <mi>X</mi>\u0000 </mrow>\u0000 <mrow>\u0000 <mi>T</mi>\u0000 </mrow>\u0000 </msup>\u0000 <mi>Y</mi>\u0000 </mrow>\u0000 <annotation>$$ {mathbf{X}}^{mathbf{T}}mathbf{Y} $$</annotation>\u0000 </semantics></math>. Our algorithms have applications in model selection for, for example, principal component analysis (PCA), principal component regression (PCR), ridge regression (RR), ordinary least squares (OLS), and partial least squares (PLS). 
Our algorithms support all combinations of column-wise centering and scaling of <span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <mi>X</mi>\u0000 </mrow>\u0000 <annotation>$$ mathbf{X} $$</annotation>\u0000 </semantics></math> and <span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <mi>Y</mi>\u0000 </mrow>\u0000 <annotation>$$ mathbf{Y} $$</annotation>\u0000 </semantics></math>, and we demonstrate in our accompanying implementation that this adds only a manageable, practical constant over efficient variants without preprocessing. We prove the correctness of our algorithms under a fold-based partitioning scheme and show that the running time is independent of the number of folds; that is, they have the same time complexity as that of computing <span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <msup>\u0000 <mrow>\u0000 <mi>X</mi>\u0000 </mrow>\u0000 <mrow>\u0000 <mi>T</mi>\u0000 </mrow>\u0000 </msup>\u0000 <mi>X</mi>\u0000 </mrow>\u0000 <annotation>$$ {mathbf{X}}^{mathbf{T}}mathbf{X} $$</annotation>\u0000 </semantics></math> and <span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <msup>\u0000 <mrow>\u0000 <mi>X</mi>\u0000 <","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 3","pages":""},"PeriodicalIF":2.1,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70008","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143602456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
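The core idea of reusing full-data products can be illustrated with a small NumPy sketch. It covers only one case (column-wise centering of $\mathbf{X}$, no scaling, no $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$) and is not the authors' implementation; the function name `centered_training_gram` and the toy data are assumptions made for illustration.

```python
import numpy as np

def centered_training_gram(XtX, col_sum, n, X_val):
    """Centered X_train^T X_train from full-data quantities and validation rows only.

    XtX and col_sum are computed once on all n samples (hypothetical helper,
    not the paper's API); X_val holds the rows of the current validation fold.
    """
    n_train = n - X_val.shape[0]
    # Remove the validation rows' contribution from the full product.
    XtX_train = XtX - X_val.T @ X_val
    # Training mean follows from the full column sums minus the validation sums.
    mean_train = (col_sum - X_val.sum(axis=0)) / n_train
    # sum (x - m)(x - m)^T = sum x x^T - n_train * m m^T
    return XtX_train - n_train * np.outer(mean_train, mean_train)

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))
XtX, col_sum = X.T @ X, X.sum(axis=0)   # computed once for all folds

val_idx = np.array([0, 1, 2])
fast = centered_training_gram(XtX, col_sum, X.shape[0], X[val_idx])

# Direct computation on the training rows, for comparison.
X_train = np.delete(X, val_idx, axis=0)
X_train_c = X_train - X_train.mean(axis=0)
direct = X_train_c.T @ X_train_c
```

Each fold touches only its own validation rows, so the total cost over all folds does not grow with the number of folds, which is consistent with the complexity claim in the abstract.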
Citations: 0
Getting Insights Into Chromatographic Properties of HILIC and Mixed-Mode Homemade Stationary Phases Using Principal Component and Cluster Analyses
IF 2.1 Zone 4 Chemistry Q1 SOCIAL WORK Pub Date: 2025-03-12 DOI: 10.1002/cem.70019
A. Shemiakina, M. Khrisanfov, N. Chikurova, A. Samokhin, A. Chernobrovkina

In this work, we compared the chromatographic properties of 27 homemade monomer- and polymer-modified stationary phases synthesized via the Ugi reaction for hydrophilic interaction liquid chromatography (HILIC). These stationary phases along with the unmodified substrate were characterized by retention factors of 33 polar biologically active compounds belonging to various classes (nucleobases/nucleosides, sugars, carboxylic acids, and water-soluble vitamins). Additionally, the widely used Tanaka HILIC test was performed. The experimental data from both characterization approaches were processed using several chemometric techniques, including principal component analysis (PCA), hierarchical cluster analysis (HCA), and K-means algorithm. It was initially expected that polymer-modified phases would differ significantly from monomer-modified ones due to their mixed-mode properties. It was confirmed by the clear separation of these two types of stationary phases on the PCA score plot obtained for binary logarithms of selectivities (calculated from all 33 retention factors). Dissimilarities observed among some monomer-modified stationary phases resulted in insights into Ugi reaction conditions suitable for obtaining adsorbents with distinct chromatographic properties. Each class of test compounds required specific mobile phase composition to achieve reasonable chromatographic characteristics, such as retention times and peak shapes. To exclude the long-lasting re-equilibration stage associated with mobile phase changes, a smaller set of only three test compounds was proposed, yielding nearly the same clustering results as the complete dataset. This simplified procedure can facilitate the rapid characterization of newly synthesized stationary phases and allow for comparison with previously studied phases.
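A minimal sketch of the PCA step on binary logarithms of selectivities, using NumPy only. The abstract does not say how selectivities were formed, so the ratio to a single reference compound used here is an assumption, and the retention factors are invented toy values, not data from the study.

```python
import numpy as np

def log2_selectivities(k, ref=0):
    """log2(k_j / k_ref) for every compound j except the reference.

    k has shape (phases, compounds). Using one reference compound is an
    assumption made for this sketch.
    """
    k = np.asarray(k, dtype=float)
    alpha = np.delete(k, ref, axis=1) / k[:, [ref]]
    return np.log2(alpha)

def pca_scores(X, n_components=2):
    """PCA scores and explained-variance ratios via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]

# Toy retention factors: rows 0-1 mimic one type of phase, rows 2-3 another.
k = np.array([[1.0, 2.0, 4.0],
              [1.0, 2.1, 4.2],
              [1.0, 8.0, 0.5],
              [1.0, 7.5, 0.55]])
scores, explained = pca_scores(log2_selectivities(k))
```

In this toy case the two groups of phases fall on opposite sides of the first principal component, which is the kind of separation a score plot would show.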

Citations: 0
Can One Recover the Underlying Spectral Data Matrix From a Given Borgen Plot?
IF 2.1 Zone 4 Chemistry Q1 SOCIAL WORK Pub Date: 2025-03-08 DOI: 10.1002/cem.70016
Martina Beese, Tomass Andersons, Mathias Sawall, Hamid Abdollahi, Klaus Neymeyr

In multivariate curve resolution (MCR), Borgen plots represent the regions of feasible pure component profiles underlying spectral mixture data. A Borgen plot can be constructed geometrically in the low-dimensional $U$- and $V$-spaces if the so-called outer polygon (representing nonnegativity constraints) and the inner polygon (i.e., the convex hull of the data representing points) are given. This paper asks whether it is possible to construct spectral data from the data representing points spanning the polygons and thus reconstruct the data from the associated Borgen plot. A partially positive answer is given.

Citations: 0
Assessing Classification Models of Pharmaceuticals With Conformal Prediction
IF 2.1 Zone 4 Chemistry Q1 SOCIAL WORK Pub Date: 2025-03-06 DOI: 10.1002/cem.70017
Karl S. Booksh, Caelin P. Celani, Nicole M. Ralbovsky, Joseph P. Smith

Conformal predictions transform a measurable, heuristic notion of uncertainty into statistically valid confidence intervals such that, for a future sample, the true class prediction will be included in the conformal prediction set at a predetermined confidence. In a Bayesian perspective, common estimates of uncertainty in multivariate classification, namely p-values, only provide the probability that the data fits the presumed class model, P(D|M). Conformal predictions, on the other hand, address the more meaningful probability that a model fits the data, P(M|D). Herein, two methods to perform inductive conformal predictions are investigated—the traditional Split Conformal Prediction that uses an external calibration set and a novel Bagged Conformal Prediction, closely related to Cross Conformal Predictions, that utilizes bagging to calibrate the heuristic notions of uncertainty. Methods for preprocessing the conformal prediction scores to improve performance are discussed and investigated. These conformal prediction strategies are applied to identifying four non-steroidal anti-inflammatory drugs (NSAIDs) from hyperspectral Raman imaging data. In addition to assigning meaningful confidence intervals on the model results, we herein demonstrate how conformal predictions can add additional diagnostics for model quality and method stability.
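The Split Conformal Prediction idea mentioned above can be sketched for a four-class problem as follows. This is a generic textbook-style construction, not the authors' code: the nonconformity score (one minus the predicted probability of the true class), the function names, and the toy calibration values are all assumptions made for illustration.

```python
import math
import numpy as np

def split_conformal_threshold(cal_probs, cal_labels, alpha):
    """Score threshold from a held-out calibration set.

    Score = 1 - predicted probability of the true class (an assumed choice).
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = math.ceil((n + 1) * (1 - alpha))   # finite-sample corrected rank
    if k > n:
        return math.inf                    # too few calibration samples: keep every class
    return np.sort(scores)[k - 1]

def prediction_set(probs, q):
    """All classes whose nonconformity score does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= q]

# Toy calibration set: 7 samples, 4 classes (values invented for illustration).
cal_labels = np.array([0, 1, 2, 3, 0, 1, 2])
true_p = np.array([0.9, 0.8, 0.7, 0.95, 0.6, 0.3, 0.2])
cal_probs = np.empty((7, 4))
cal_probs[:] = ((1.0 - true_p) / 3.0)[:, None]   # spread the rest evenly
cal_probs[np.arange(7), cal_labels] = true_p

q = split_conformal_threshold(cal_probs, cal_labels, alpha=0.25)
pred = prediction_set([0.50, 0.45, 0.03, 0.02], q)
```

Under the usual exchangeability assumption, sets built this way contain the true class with probability at least 1 - alpha on average; an ambiguous sample yields a set with more than one class rather than a single forced label.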

Citations: 0
Application of ATR-FTIR Spectrum Combined With Ensemble Learning and Deep Learning for Identification of Amomum tsao-ko at Different Drying Temperatures
IF 2.1 Zone 4 Chemistry Q1 SOCIAL WORK Pub Date: 2025-03-05 DOI: 10.1002/cem.70018
Gang He, Shao-bing Yang, Yuan-zhong Wang

Amomum tsao-ko Crevost et Lemaire (A. tsao-ko) is an important medicinal plant and flavoring spice. A. tsao-ko dried at different temperatures has different nutritional and medicinal values, so substandard products appear on the market from time to time. In this study, attenuated total reflection–Fourier transform infrared spectroscopy (ATR-FTIR) data were preprocessed with SD, normalization, EWMA, and SNV to compare their effects on the recognition ability of SVM, RF, XGBoost, and CatBoost models. Meanwhile, full-band and local-band 2DCOS profiles were obtained to characterize the differences in chemical features of A. tsao-ko dried at different temperatures and were classified with a ResNet model. The results show that although traditional machine learning can obtain good classification results, the classification efficiency is very unsatisfactory, and the correct classification rate improves to 97% after derivative (SD) preprocessing. The 2DCOS images visualize the feature information in the samples and, combined with the ResNet model, yield 100% classification accuracy with excellent generalization ability and convergence. This study provides new ideas for the quality evaluation of A. tsao-ko.
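Of the preprocessing steps listed, SNV is the simplest to show. The sketch below assumes SNV stands for standard normal variate in its usual row-wise form (each spectrum is centered and divided by its own standard deviation); the function name `snv` and the toy spectrum are illustrative and not taken from the study.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# Toy example: the second spectrum is the first with a gain and an offset applied,
# mimicking multiplicative and additive scatter effects.
base = np.array([0.10, 0.35, 0.80, 0.40, 0.15])
spectra = np.vstack([base, 2.0 * base + 3.0])
corrected = snv(spectra)
```

After SNV the two rows coincide, which shows why the transform is used to remove per-spectrum baseline shifts and scaling before classification.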

Citations: 0
Multidimensional Patterns of Gas Sensors for Assessing the Microbiological Indicators of Raw Milk
IF 2.1 Zone 4 Chemistry Q1 SOCIAL WORK Pub Date: 2025-03-04 DOI: 10.1002/cem.70007
Anastasiia Shuba, Tatiana Kuchmenko, Ruslan Umarkhanov, Ekaterina Bogdanova, Ekaterina Anokhina, Inna Burakova

The paper discusses the use of chemometric methods for processing the output data of sensors with polycomposite coatings to analyze the gas phase of raw milk and obtain analytical information about its total microbiological contamination, the content of yeast and mold, and the presence of pathogenic microorganisms. To predict microbiological indicators of milk quality, partial least squares regression and quadratic discriminant analysis were used. The initial data matrix included both an optimized set of sensor output data and calculated parameters at various data fusion levels. It is shown that multidimensional patterns of sensor output data differ depending on the task. A model for predicting the microbiological contamination of milk (QMAFAnM) with an error of 0.342 log CFU was obtained. It was shown that the sensitivity of classification of milk samples by the presence or absence of pathogenic microorganisms using discriminant analysis is 67%, and the specificity is 100% when using the calculated parameters of the sensor array. The proposed approaches can be applied to processing data from various types of sensors when analyzing real objects with complex compositions.
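For readers less familiar with the two classification metrics quoted above, the following sketch shows how sensitivity and specificity are computed from binary labels. The labels are invented toy values chosen for illustration; they are not the study's data, and the function name is hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Labels: 1 = pathogen present, 0 = pathogen absent.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy labels: one of three contaminated samples is missed, no false alarms.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

A high specificity with a lower sensitivity, as in this toy case, means clean samples are rarely flagged while some contaminated samples can go undetected.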

Citations: 0