
Journal of Chemometrics: Latest Articles

Correction to “Fast Partition-Based Cross-Validation With Centering and Scaling for $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$”
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-04-28 DOI: 10.1002/cem.70034

Galbo Engstrøm, O.-C. and Holm Jensen, M. (2025), Fast Partition-Based Cross-Validation With Centering and Scaling for $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$. Journal of Chemometrics, 39: e70008, https://doi.org/10.1002/cem.70008.

On line 27 in Algorithm 7 on page 10, the text to the right reads “Obtain $\mathbf{X}^{\mathbf{csT}}\mathbf{Y}^{\mathbf{csT}}$” but should read “Obtain $\mathbf{X}^{\mathbf{csT}}\mathbf{Y}^{\mathbf{cs}}$”.

In Proposition 15 on page 11, the last equality contains a double hat over $\mathbf{x}_{\mathbf{s}}^{\mathbf{T}}$. It should have been a single hat.

On pages 3 and 4, $\mathcal{P}$ has been written multiple times when $\mathcal{P}[n]$ was intended. Likewise, $\mathcal{V}$ has been written multiple times when $\mathcal{V}[p]$ was intended.

We apologize for the confusion.

Citations: 0
HiBBKA: A Hybrid Method With Resampling and Heuristic Feature Selection for Class-Imbalanced Data in Chemometrics
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-04-20 DOI: 10.1002/cem.70029
Ying Guo, Ying Kou, Lun-Zhao Yi, Guang-Hui Fu

In critical domains including medicinal chemistry, biomedicine, metabolomics, and computational toxicology, class imbalance in datasets and poor recognition accuracy for minority classes remain persistent challenges. While previous studies have employed resampling and feature selection techniques to address data imbalance and enhance classification performance, most approaches have focused on single-algorithm solutions rather than hybrid methodologies. Hybrid algorithms integrate the strengths of multiple techniques and can therefore handle imbalanced data more comprehensively and efficiently. This study proposes HiBBKA, a novel hybrid algorithm combining radial-based under-sampling with SMOTE (RBU-SMOTE) and an improved binary black-winged kite algorithm (iBBKA) for feature selection. The proposed framework operates in two phases. First, the RBU-SMOTE resampling method integrates radial-based under-sampling (RBU) with the synthetic minority oversampling technique (SMOTE), addressing the class-imbalanced distribution while enhancing the quality of the synthesized samples. Second, the enhanced iBBKA feature selection algorithm identifies the most discriminative features for the classification task. We comprehensively evaluate RBU-SMOTE and HiBBKA using multiple classifiers across 16 imbalanced datasets, including real-world medical datasets, with particular emphasis on minority-class performance. Experimental results demonstrate that RBU-SMOTE achieves competitive performance compared to existing resampling methods, while the complete HiBBKA framework significantly outperforms state-of-the-art algorithms in overall classification metrics, particularly in minority-class recognition.
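
The oversampling half of the resampling phase can be illustrated with plain SMOTE, which places each synthetic point on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below is a minimal illustration of that interpolation step only; it is not RBU-SMOTE or iBBKA, and the function name `smote_sketch`, its parameters, and the toy data are assumptions introduced here.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by plain SMOTE interpolation.

    Hypothetical helper for illustration; needs at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Toy minority class with two features (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies inside the bounding box of the minority class. The quality of the synthesized samples therefore depends on which neighbours are eligible, which is the part a method such as RBU-SMOTE would refine.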

Citations: 0
Geographical Influence on Metabolite Profiles of Cupressus torulosa: UPLC-QTOF-MS (Positive Mode) and Chemometric Insights
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-04-14 DOI: 10.1002/cem.70031
Radhika Khanna, Khushaboo Bhadoriya, Gaurav Pandey, V. K. Varshney

C. torulosa, known as the Himalayan or Bhutan cypress, is a significant evergreen conifer that typically reaches heights between 20 and 45 m. This species is primarily found in the Himalayan regions of Bhutan, northern India, Nepal, and Tibet. In this study, we used ultra-performance liquid chromatography coupled with quadrupole time-of-flight mass spectrometry (UPLC-QTOF-MS) in positive ion mode, along with chemometric analysis, to investigate the metabolomic profiles of C. torulosa needles collected from 14 geographically distinct areas in Uttarakhand and Himachal Pradesh. Various statistical and visualization techniques, including ANOVA, Principal Component Analysis (PCA), Hierarchical Cluster Analysis (HCA), violin plots, scatter plots, box-and-whisker plots, and heatmaps, were employed to illustrate the relative quantitative differences among compounds based on their peak intensities across these regions. Our investigation revealed 34 marker compounds consistently detected across all samples (locations). These compounds were screened using rigorous filtering criteria, incorporating a moderated t-test and multiple testing adjustments using the Benjamini–Hochberg false discovery rate (FDR) approach. Furthermore, we pioneered the identification of the phenylpropanoid and flavonoid biosynthesis pathways in C. torulosa, providing new insights into its metabolic profile. This work establishes a foundational reference for future research into the species' metabolome, helping guide studies in areas such as genetic diversity, ecological adaptations, and climate resilience in C. torulosa. Mapping these pathways deepens scientific knowledge of C. torulosa's metabolic processes, contributing to a clearer understanding of its unique biochemical makeup.
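
The Benjamini–Hochberg adjustment mentioned above can be sketched in a few lines. This is the standard step-up procedure written out for illustration; the p-values are made up and are not taken from the study, and the function name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five candidate compounds.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
q_values = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in q_values]
```

With these made-up inputs, the third and fourth compounds have raw p-values below 0.05 but adjusted values above it, which is the kind of candidate that an FDR-controlled screen removes.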

Citations: 0
Comprehensive Anomaly Score Rank Based Unsupervised Sample Selection Method
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-04-08 DOI: 10.1002/cem.70028
Zhongjiang He, Zhonghai He, Xiaofang Zhang

Selecting representative samples is crucial for establishing an accurate calibration model. To enhance the representativeness of the samples, a sample selection method that uses the degree of anomaly as the evaluation criterion is proposed. First, anomaly scores from several detection methods are obtained to ensure a comprehensive evaluation. These scores are then normalized by the lower confidence limit to establish a consistent scoring criterion. Next, the weights of the detection methods are determined through eigenvector centrality analysis of a graph in which the methods serve as nodes and their similarities act as weighted edges. Finally, the comprehensive anomaly scores are computed as the weighted sum of the normalized scores and then sorted. Representative samples are selected from the sorted list using uniformly spaced sampling, with the spacing determined by the predefined number of samples to select. The efficacy of the method is validated across different sample sets.
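
The pipeline described above (normalize, weight by eigenvector centrality, combine, sort, pick at uniform spacing) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the similarity between detection methods is taken to be the absolute Pearson correlation of their score vectors, the per-method `limits` stand in for the confidence limits, and all names and numbers are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length score vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def eigenvector_centrality(adj, iters=200):
    """Power iteration on a non-negative adjacency matrix; result sums to 1."""
    n = len(adj)
    v = [1.0 / n] * n
    for _ in range(iters):
        w = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        v = [x / total for x in w]
    return v


def select_samples(scores, limits, n_select):
    """Rank samples by a weighted comprehensive score and pick at uniform spacing."""
    # Normalize each detector's scores by its (assumed) confidence limit.
    normed = [[s / lim for s in col] for col, lim in zip(scores, limits)]
    m, n = len(normed), len(normed[0])
    # Graph: detectors are nodes, |correlation| of their scores is the edge weight.
    adj = [[abs(pearson(normed[i], normed[j])) if i != j else 0.0
            for j in range(m)] for i in range(m)]
    weights = eigenvector_centrality(adj)
    combined = [sum(w * col[i] for w, col in zip(weights, normed)) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: combined[i])
    # Uniformly spaced positions along the ranked list, endpoints included.
    step = (n - 1) / (n_select - 1) if n_select > 1 else 0.0
    picks = [ranked[round(k * step)] for k in range(n_select)]
    return picks, weights


# Hypothetical anomaly scores: 3 detectors x 8 samples.
scores = [
    [0.10, 0.20, 0.30, 0.45, 0.60, 0.80, 1.10, 1.50],
    [1.0, 1.5, 2.5, 3.0, 4.5, 5.0, 7.0, 9.0],
    [0.02, 0.03, 0.05, 0.06, 0.08, 0.12, 0.15, 0.30],
]
limits = [1.0, 6.0, 0.2]
picks, weights = select_samples(scores, limits, n_select=4)
```

Spacing the picks evenly along the ranked list means the selection covers the whole range from the most typical to the most anomalous sample, rather than clustering where the samples are densest.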

Citations: 0
Data Quality: Importance of the ‘before analysis’ domain (Theory of Sampling, TOS)
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-04-06 DOI: 10.1002/cem.70025

Data analysts and chemometricians are part of a scientific collegium covering three distinct domains, i) sampling, ii) analysis and iii) data modelling, which collectively influence ‘data quality’. There is much more to data quality than analytical uncertainty. In many situations the analysis concerns heterogeneous materials, batches, lots or flowing streams, which must be sampled appropriately before analysis, following an often long and complex pathway ‘from-lot-to-aliquot’. In most cases, sampling and sub-sampling will dominate the total Measurement Uncertainty budget ($\mathrm{MU}_{\mathrm{total}}$). Left-out $\mathrm{MU}_{\mathrm{sampling}}$ contributions may easily exceed the Total Analytical Error (TAE) uncertainty by factors of 5, 10, 25 or higher, depending on the specific heterogeneity characteristics of the materials and systems targeted and on the sampling procedure used (grab vs. composite sampling). The focus here is on the consequences of unwittingly ignoring the uncertainties originating in these domains, which, for example, adversely affect bilinear component directions (reducing model accuracy) as well as RMSE estimates reflecting precision (analyte concentration prediction, classification, time series prediction). Along the way, the article also clears up an evergreen mistake: contrary to many beliefs, ‘more data’ will not automatically reduce the magnitude of an unsatisfactory performance RMSE. It is shown how the Theory of Sampling (TOS) is the only guarantor of representative sampling in the critical ‘before analysis’ domain. This article introduces the essential minimum TOS competence which must be mastered by stakeholders from all three domains. The conceptual elements in the TOS system can be visualised as a graphic overview.

Kim H. Esbensen has been a professor at three institutions: the National Geological Survey of Denmark and Greenland (2010–2015), Aalborg University, Denmark (2001–2010) and Telemark Institute of Technology, Norway (1990–2000). He was also professeur associé at the Université du Québec à Chicoutimi before becoming an independent consultant in 2015. He is a member of several scientific societies and has published widely across several scientific fields. He is the author of a widely used textbook on Multivariate Data Analysis (chemometrics), and in 2020 published “Introduction to the Theory and Practice of Sampling”. He was chairman of the taskforce responsible for the world's first horizontal (matrix-independent) sampling standard, DS 3077:2024. Esbensen is the founding editor of “Sampling Science and Technology (SST)” (https://www.sst-magazine.info/issues/). He can be reached via his homepage, https://kheconsult.com/

Citations: 0
Data Quality: Importance of the ‘Before Analysis’ Domain [Theory of Sampling (TOS)]
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-04-06 DOI: 10.1002/cem.70021
Kim H. Esbensen

Data Quality: what is it, where does it originate, how does it influence data modelling, and what can chemometricians do about it? The ‘before analysis’ domain is prone to sampling errors, and the resulting uncertainties influence the quality of both analysis and data analysis/data modelling. Nonrepresentative sampling of heterogeneous materials, batches, lots and process streams ‘before analysis’ contributes significantly to the total measurement uncertainty, $\mathrm{MU}_{\mathrm{total}} = \mathrm{MU}_{\mathrm{sampling}} + \mathrm{MU}_{\mathrm{analysis}}$. The total sampling error (TSE) can dominate the total analytical error (TAE) by factors of 5, 10 or higher, depending on the degree of material heterogeneity encountered and the specific sampling procedure employed to produce the final analytical aliquot, which is the only material actually analysed. The analytical aliquot is the physical manifestation of crossing the boundary from the ‘before analysis’ (sampling) domain to the domain of analysis. Representativity of the analytical aliquot, and thus of the analytical results with respect to the original target batch/lot/process stream, can only be guaranteed by invoking the sampling domain competence stipulated by the theory of sampling (TOS). Primary sampling is the most important stage in the full lot-to-analysis pathway, quantitatively dominating $\mathrm{MU}_{\mathrm{total}}$ (but subsequent subsampling stages can also be significant). If the sources of adverse sampling error effects have not been eliminated, the sampling process is biased and $\mathrm{MU}_{\mathrm{total}}$ will be unnecessarily inflated. TOS offers ways and means to deal actively with a potential sampling bias (which is fundamentally different from the analytical bias). Overlooking sampling effects, or deliberately neglecting to deal with them appropriately, constitutes a lack of due diligence, which bears critically on the QC/QA demands on both analysis and data analysis/modelling. This article presents all uncertainty contributions in the lot-to-analysis-to-data modelling pathway, which must be identified and managed, eliminated or maximally reduced, to be able to document a fully minimised $\mathrm{MU}_{\mathrm{total}}$. Data analysts/chemometricians are part of a scientific collegium covering all three domains (sampling, analysis and data modelling), which are collectively responsible for ‘data quality’. This comprehensive scope has serious implications for the current PAT paradigm, the foundation of which turns out to need significant reform regarding a key process sampling aspect, regardless of whether physical samples or PAT sensor technology spectra are extracted/acquired. This article introduces the essential minimum TOS competence that must be mastered by stakeholders from all three domains.
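
The practical weight of the uncertainty budget can be shown with a small calculation. The sketch below assumes that the sampling and analytical contributions are independent and combine as variances, and that the quoted factors apply to standard deviations. Both are assumptions made here for illustration and are not statements from the article, and the numerical values are arbitrary.

```python
import math


def total_uncertainty(u_sampling, u_analysis):
    """Combine two independent standard uncertainties in quadrature."""
    return math.sqrt(u_sampling ** 2 + u_analysis ** 2)


u_analysis = 1.0  # hypothetical analytical standard uncertainty (arbitrary units)
budget = {}
for factor in (5, 10):
    u_sampling = factor * u_analysis
    u_total = total_uncertainty(u_sampling, u_analysis)
    # Fraction of the total variance that comes from sampling alone.
    budget[factor] = (u_total, u_sampling ** 2 / u_total ** 2)
    print(f"factor {factor}: u_total = {u_total:.2f}, "
          f"sampling share of variance = {budget[factor][1]:.1%}")
```

Under these assumptions, a sampling contribution five times the analytical one already accounts for more than 96% of the total variance, so removing the analytical error entirely would lower the total by only about 2%.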

Citations: 0
Expandable Diffusion Map–Based Weighted k-Nearest Neighbor Technique for Multimode Batch Process Monitoring
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date : 2025-04-05 DOI: 10.1002/cem.70020
Liwei Feng, Yifei Wu, Shaofeng Guo, Yu Xing, Yuan Li

The diffusion map–based k-nearest neighbor (DM-kNN) rule faces two challenges in multimode batch process monitoring. First, the DM method has difficulty projecting new samples: features must be re-extracted from the training samples each time, which is time-consuming, and faulty samples may be merged into the normal samples and modeled together, which defeats the purpose of fault detection. Second, DM-kNN monitors poorly when the modes differ significantly in variance. This paper proposes the expandable DM–based weighted k-nearest neighbor (EDM-WkNN) technique to solve both issues. The expandable DM constructs a local projection matrix to project new samples. Weighted distances in the monitoring statistic eliminate the effect of variance differences between modes. We compare EDM-WkNN with classical fault detection methods on numerical examples and on the fed-batch fermentation penicillin (FBFP) process. Our experiments confirm that EDM-WkNN effectively detects faults in multimode batch processes.
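
The variance problem named in this abstract can be illustrated with a small sketch. It is not the authors' EDM-WkNN: the diffusion-map projection is left out entirely, and the weighting used here (dividing each squared neighbour distance by that neighbour's own mean squared kNN distance) is an assumption chosen for illustration. The grid data, the helper names, and the use of the training maximum as a control limit are likewise hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn(x, train, k, skip=None):
    """(squared distance, index) pairs of the k training rows nearest to x."""
    pairs = sorted((sq_dist(x, row), i) for i, row in enumerate(train) if i != skip)
    return pairs[:k]

def local_scales(train, k):
    """Mean squared kNN distance of each training row, the row itself excluded."""
    return [sum(d for d, _ in knn(row, train, k, skip=i)) / k
            for i, row in enumerate(train)]

def knn_stat(x, train, k, scales=None, skip=None):
    """Sum of squared kNN distances; with scales, each term is divided by
    the neighbour's local scale (assumed weighting, not the paper's)."""
    return sum(d / (scales[i] if scales else 1.0) for d, i in knn(x, train, k, skip))

def grid(centre, step):
    """5 x 5 grid of 2-D points: a deterministic stand-in for one operating mode."""
    return [(centre + step * i, centre + step * j)
            for i in range(-2, 3) for j in range(-2, 3)]

# Mode A is tight (spacing 0.1), mode B is wide (spacing 2.0).
train = grid(0.0, 0.1) + grid(10.0, 2.0)
k = 3
scales = local_scales(train, k)

# Control limits: largest leave-self-out statistic among normal samples (assumed rule).
limit_plain = max(knn_stat(r, train, k, skip=i) for i, r in enumerate(train))
limit_weighted = max(knn_stat(r, train, k, scales, skip=i) for i, r in enumerate(train))

fault = (1.0, 1.0)  # far outside mode A relative to A's spread
plain_alarm = knn_stat(fault, train, k) > limit_plain
weighted_alarm = knn_stat(fault, train, k, scales) > limit_weighted
print(plain_alarm, weighted_alarm)
```

In this toy setting the plain limit is set by the wide mode, so the fault near the tight mode goes unnoticed, while the weighted statistic flags it. A real monitoring scheme would normally use a quantile or density-based limit rather than the training maximum.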

Citations: 0
Smart Monitoring Solutions for Real-Time Water pH Regulation in Aquatic Ecotoxicology
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date : 2025-04-03 DOI: 10.1002/cem.70024
Usman Ibrahim, Nasir Abbas, Muhammad Riaz, Tahir Mahmood

This study designs a statistical process control tool that effectively detects small and moderate shifts in process parameters, addressing a common challenge in quality monitoring. The proposed control chart employs advanced statistical detection techniques to increase sensitivity while reducing false alarms, improving detection performance across applications. The methodology is applied in an aquatic ecotoxicology laboratory, where daily monitoring of water pH is essential for safeguarding the health of sensitive aquatic organisms such as mysids. The laboratory environment is meticulously controlled to simulate natural conditions, and the proposed control chart ensures that any deviation from the optimal pH level is detected promptly, maintaining water quality and supporting the reliability of experimental outcomes. The paper comprehensively evaluates the performance of the proposed control chart under both zero-state and steady-state conditions, offering valuable insights for practitioners. We present empirical evidence that the proposed control chart significantly outperforms traditional control charts, including Shewhart, CUSUM, and EWMA, particularly in detecting small to moderate shifts in water pH. Furthermore, we provide optimal parameter settings tailored to specific monitoring scenarios, enhancing the applicability of the proposed control chart for quality control in laboratory environments.
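
The abstract does not specify the proposed chart, so the sketch below shows only one of the classical baselines it names, the EWMA chart, next to a 3-sigma Shewhart rule. The pH readings, the target of 8.0, the standard deviation of 0.1, and the settings `lam=0.2` and `width=3.0` are all made-up values for illustration.

```python
import math

def ewma_chart(xs, mu0, sigma, lam=0.2, width=3.0):
    """Classical EWMA chart with time-varying limits.
    Returns one (z_t, lower, upper, signal) tuple per observation."""
    z = mu0
    rows = []
    for t, x in enumerate(xs, start=1):
        z = lam * x + (1 - lam) * z
        half = width * sigma * math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
        rows.append((z, mu0 - half, mu0 + half, abs(z - mu0) > half))
    return rows

def first_signal(flags):
    """1-based index of the first out-of-control flag, or None."""
    for t, flag in enumerate(flags, start=1):
        if flag:
            return t
    return None

# Hypothetical daily pH readings: 10 in-control days, then a 1.5-sigma upward shift.
mu0, sigma = 8.0, 0.1
readings = [7.95, 8.05] * 5 + [8.15] * 10

ewma_first = first_signal(row[3] for row in ewma_chart(readings, mu0, sigma))
shewhart_first = first_signal(abs(x - mu0) > 3 * sigma for x in readings)
print(ewma_first, shewhart_first)
```

With these numbers the EWMA chart signals a few observations after the shift begins, while the 3-sigma rule never fires, which is the kind of small-shift sensitivity the abstract is concerned with.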

Citations: 0
Being Aware of Data Leakage and Cross-Validation Scaling in Chemometric Model Validation
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date : 2025-04-01 DOI: 10.1002/cem.70026
Péter Király, Gergely Tóth
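Chemometrics is one of the most elaborated fields of data science. It has been, and still is, a pioneer in the use of novel machine learning methods over several decades. The literature of chemometric modeling is enormous; there are many guidance documents, software packages, and other descriptions of how to perform a careful analysis. On the other hand, the literature is often contradictory and inconsistent. In many studies, results on specific datasets are generalized without justification, and the generalized idea is later cited without its original limits. In some cases, differences in the nomenclature of methods cause misinterpretations. As in every field of science, some methodological preferences rest on the standing of research groups rather than on a flexible, genuinely scientific choice among the possibilities. There is also some inconsistency between the practical approach of chemometrics and theoretical statistics, where unrealistic assumptions and limits are often studied.

The widely elaborated know-how of chemometrics brings some rigidity to the field. Chemometrics adapts slowly to some trends in data science. One example is thinking exclusively in terms of the bias-variance trade-off in model building [1] instead of using models in the double descent region for large datasets [2-4]. Another problematic question is data leakage. To this day, chemometric models are often built and validated on datasets that suffer from data leakage.

In our investigations, we met cases where the sheer volume of literature created considerable inertia against correcting misinterpretations. In 2021 we found that leave-one-out (LOO-CV) and leave-many-out cross-validation (LMO-CV) parameters can be scaled to each other [5]. Furthermore, we showed that the two approaches have about the same uncertainty in multiple linear regression (MLR) calculations [6]. The choice between them should therefore rest on computational practicality rather than on preconceptions. We received formal and informal criticism for omitting results of some well-cited studies.

In this article, we present examples to encourage rethinking of some traditional solutions in chemometrics. We show calculations of how data leakage arises in chemometric tasks. Our other calculations focus on the scaling law in order to rehabilitate leave-one-out cross-validation.

In machine learning, data leakage means using information during model building that biases the assessment of the model's predictions or that will not be available when the model is applied for real prediction. A typical and easily detected example is when cases very similar to training cases are present in the test set. A different form of leakage occurs when the explanatory variables include variables or classes that are too closely related to the response variables. Data leakage causes problems in model performance assessment similar to those of overfitting, but the two differ in definition and in the source of the validation difficulty. They can occur independently, and all combinations are possible, for example strong overfitting without data leakage, or strong data leakage without overfitting. Their common effect is to reduce the validity of model validation, except when the training and test set sizes approach infinity for a model of near-optimal complexity. In this limit, the effects of data leakage and overfitting on the performance parameters tend to zero.

Typical case leakage occurs between the training and test sets. An optimal test set is never used during training, nor in any decision about model selection or hyperparameter optimization. The test set should represent the intended application of the model. If the dataset is large enough to be split into training and test sets, and the latter represents the intended application domain well, the test set can be selected from the existing data before model building starts. If the intended application shows large variability that the starting dataset lacks, one or more optimal test sets should be obtained in new measurement campaigns. Independent test sets can therefore be obtained either by splitting the data before modeling starts or later from new measurements. Sampling can follow two approaches: simple statistical sampling, with no preference for predictor or response ranges during selection, or designed sampling based on different sampling theories. Ref. [7], for example, details these possibilities.

For models with hyperparameters, the simplest train/test split is not sufficient. At the least, the training set must be divided into a temporary training set and a validation set [26]. In the simplest approach, a model with given hyperparameters is parameterized on the temporary training set and evaluated on the validation set. The choice among differently hyperparameterized models is based on their performance on the validation set. The final model is usually re-parameterized on the merged temporary training and validation sets. Data leakage between the temporary training set and the validation set occurs mainly at this merging stage, and it inherently biases model selection. Further leakage is possible if validation parameters are used in hyperparameter optimization: a validation parameter used to select hyperparameters becomes over-optimistic for the final model compared with the other validation parameters. We may call this effect parameter leakage. Such parameter leakage can also arise in variable selection.

The OECD QSAR guidance [8] divides validation into internal and external procedures. Internal validation computes validation parameters on the data used for model building and model selection. External validation computes them on a test set (an optimal test set as defined above). The sole purpose of external validation should be to assess the predictivity of the final model. Internal validation aims to assess the goodness of fit on the training set and the robustness of the model; the latter is handled mainly by cross-validation and sometimes by bootstrapping. The OECD guidance does not specify how hyperparameter optimization should be carried out.

Using cross-validation does not remove the need for external (test) validation. Cross-validation has its role in hyperparameter optimization, especially when a separate validation set cannot be set aside for that task. For small datasets where no independent test set is possible, it is also an approximate tool for estimating predictivity. In any case, we must keep in mind the conclusion of Ref. [7] that cross-validation is only a suboptimal simulation of test set validation. The OECD guidance is rarely followed in all respects, especially regarding the requirement for an independent test set. Instead, over the past three decades the chemometric literature has placed much emphasis on how to perform cross-validation [9-12, 25]. Different tasks are listed there, such as the selection of models, hyperparameters, and variables, and cross-validation is often used to give a realistic estimate of a model's "predictive" ability. Contrary to the clear trend in data science, there is a debate in chemometrics on whether predictive ability can be determined by cross-validation alone [7, 13-19].

There are several cross-validation methods; among them, repeated and double cross-validation give stable validation parameters despite data leakage [20, 21, 12]. In double cross-validation schemes (sometimes called nested schemes), one of the loops typically splits the data into a "test" set and a validation + temporary training set, but on closer inspection these "test" sets do not meet the leakage-free requirement described above. Some developers justify the lack of a true external test set with the idea that "one test set is no test set", because validation parameters computed on a single set have large variance [23, 24]. In any case, the optimal solution is a very large test set that shows all the variability of the intended application. If that is not possible, a good solution is to use several independent test sets, for example from different measurement campaigns, to sample the variability of the later application.

There are three main methods in non-nested cross-validation. Their names are used inconsistently, which leads to misunderstandings about their power. In our previous studies, we followed the naming convention of the OECD guidance. We call leave-one-out cross-validation (LOO-CV) the following procedure: if n_train cases are used to optimize the basic model parameters, models are built on n_train − 1 observations. In total, n_train models are computed, so that every case is omitted exactly once. The validation parameters are computed over all training cases, using for each case only the prediction from the model that did not use it in training. We call LMO-CV the procedure (OECD) that works like LOO-CV but omits m cases at a time. In total, n_train/m models are built and each case is omitted exactly once. The validation parameters are again computed over all n_train cases, using only predictions from models that did not use the given case in training. This method is sometimes called m-fold cross-validation. The third non-nested method splits the training set into a validation set of n_v elements and a disjoint set of n_c = n_train − n_v elements on which the temporary model is trained. The validation parameters are computed on the n_v set. The split into n_v and n_c is usually repeated several times, and the validation parameters are averaged over the repetitions. We call this procedure repeated cross-validation (REP-CV). In the literature, some authors call it leave-many-out cross-validation or LMO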
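
The LOO-CV and LMO-CV procedures defined above can be written down in a few lines. The sketch below is a minimal illustration under stated assumptions, not the authors' code: it uses a one-predictor least-squares line instead of MLR to stay dependency-free, it reports Q2 as 1 − PRESS/SS_tot with SS_tot taken around the mean of all training responses (one of several definitions in use), and the data and the names `fit_line` and `cv_q2` are hypothetical. It does not implement the scaling law between the two parameters.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def cv_q2(xs, ys, m=1):
    """Q2 from leave-m-out CV: n_train / m models, every case omitted exactly once.
    m=1 is LOO-CV. Each case is predicted only by the model that did not see it."""
    n = len(xs)
    assert n % m == 0 and m < n, "sketch assumes m divides n_train"
    n_models = n // m
    press = 0.0
    for f in range(n_models):
        held = [i for i in range(n) if i % n_models == f]  # m interleaved cases
        kept = [i for i in range(n) if i % n_models != f]
        a, b = fit_line([xs[i] for i in kept], [ys[i] for i in kept])
        press += sum((ys[i] - (a + b * xs[i])) ** 2 for i in held)
    mean_y = sum(ys) / n
    return 1.0 - press / sum((y - mean_y) ** 2 for y in ys)

# Hypothetical data: a straight line plus small fixed perturbations.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, -0.2, 0.1]
xs = [float(i) for i in range(12)]
ys = [1.0 + 2.0 * x + e for x, e in zip(xs, noise)]

q2_loo = cv_q2(xs, ys, m=1)   # 12 models, 11 cases each
q2_lmo = cv_q2(xs, ys, m=3)   # 4 models, 9 cases each
print(q2_loo, q2_lmo)
```

Any step that uses all cases before the loop (for example centering, scaling, or variable selection on the full training set) would let held-out cases influence the models, which is one route to the leakage discussed in the text; the sketch avoids this by fitting everything inside the loop.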
Citations: 0
Green and Rapid Quantification of Ciprofloxacin Hydrochloride and Tylosin Tartrate in Veterinary Formulation using UV Spectrophotometric Method: A Comparative Study of Nature-Inspired Algorithms for Feature Selection
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date : 2025-03-29 DOI: 10.1002/cem.70023
Mostafa M. Eraqi, Ayman M. Algohary, Youssef O. Al-Ghamdi, Ahmed M. Ibrahim

Rapid and accurate quantification of ciprofloxacin hydrochloride (CIP) and tylosin tartrate (TYZ) in veterinary formulations is crucial for ensuring product quality and therapeutic efficacy. This study introduces a green and cost-effective analytical method that combines the simplicity of UV spectrophotometry with the optimization power of nature-inspired algorithms for the simultaneous determination of CIP and TYZ in a veterinary tablet formulation. Fourteen nature-inspired algorithms were compared using root average squared error (RASE), average absolute error (AAE), and the coefficient of determination (R²). The Corona virus optimization (CVO) algorithm and the Bat algorithm performed best for CIP and TYZ, respectively. For the calibration set, the CVO algorithm, optimized for CIP, gave RASE, AAE, and R² values of 0.37, 0.27, and 0.998, while the Bat algorithm, tailored for TYZ, gave 0.54, 0.41, and 0.984. For the test sets, RASE, AAE, and R² were 0.55, 0.46, and 0.991 for CIP and 0.20, 0.15, and 0.995 for TYZ, confirming the algorithms' predictive ability. Validation was performed using the accuracy profile approach. The limits of detection (LODs) were 0.86 μg mL⁻¹ for CIP and 0.36 μg mL⁻¹ for TYZ, and the limits of quantification (LOQs) were 2.88 μg mL⁻¹ for CIP and 1.21 μg mL⁻¹ for TYZ. The method's environmental impact was comprehensively assessed using the Green Solvent Selection Tool (GSST), the National Environmental Methods Index (NEMI), a modified Eco-Scale, the Modified GAPI (MoGAPI), and a complementary whiteness evaluation via the RGBfast algorithm, confirming its eco-friendly profile. The proposed method demonstrated superior greenness, as reflected in its elevated GSST scores and favorable NEMI assessment. Specifically, it achieved a modified Eco-Scale score of 84, a MoGAPI score of 81, and a whiteness index of 61, as determined by the RGBfast algorithm. These results confirm the method's environmentally sustainable profile and its suitability for green analytical applications. Compared with conventional chromatographic techniques, this approach offers significant advantages in cost, speed, and environmental sustainability, paving the way for more efficient and greener analytical methods in pharmaceutical quality control. Furthermore, this study highlights the innovative integration of UV spectroscopy with nature-inspired algorithms, demonstrating significant advances over conventional UV methodologies for pharmaceutical analysis.
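
For readers unfamiliar with the three figures of merit named in this abstract, a minimal sketch follows. It assumes the usual readings: RASE as the square root of the mean squared error, AAE as the mean absolute error, and R² as 1 − SS_res/SS_tot (the paper may instead use the squared correlation coefficient). The concentrations are invented for illustration and are not data from the study.

```python
import math

def rase(y_true, y_pred):
    """Root average squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def aae(y_true, y_pred):
    """Average absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical nominal vs. predicted concentrations (same arbitrary unit).
nominal = [2.0, 4.0, 6.0, 8.0, 10.0]
predicted = [2.5, 3.5, 6.0, 8.5, 9.5]

print(rase(nominal, predicted), aae(nominal, predicted), r_squared(nominal, predicted))
```

RASE and AAE carry the unit of the response, so values such as those quoted in the abstract are only comparable between analytes measured on the same concentration scale.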

Citations: 0