This paper describes some statistical tests for comparing the predictive performance of two or more prediction rules. It covers the cases of both quantitative and qualitative predictions, that is, both regression and classification problems. Worked examples are included for both cases.
Testing differences in predictive ability: A tutorial. Tom Fearn. Journal of Chemometrics 38(8), published 2024-04-25. DOI: 10.1002/cem.3549. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.3549
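The abstract above does not name the tests it covers. As one widely used example for the classification case, the sketch below applies an exact McNemar test to two prediction rules scored on the same test samples. The function name, labels, and predictions are hypothetical and chosen only for illustration; this is not taken from the paper.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test for two rules scored on the same samples."""
    # b: rule A correct and rule B wrong; c: rule A wrong and rule B correct.
    b = sum(1 for t, a, z in zip(y_true, pred_a, pred_b) if a == t and z != t)
    c = sum(1 for t, a, z in zip(y_true, pred_a, pred_b) if a != t and z == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the two rules never disagree in correctness
    # Under the null hypothesis each discordant sample favours A or B
    # with probability 1/2, so min(b, c) follows a Binomial(n, 1/2) tail.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
pred_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
pred_b = [0, 0, 0, 1, 1, 1, 1, 0, 1, 1]
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)  # 6 0 0.03125
```

Only the samples on which exactly one rule is correct enter the test. This is why a paired test of this kind is generally preferred to comparing two accuracy figures as if they came from independent test sets.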
Linearly dependent concentration profiles of a chemical reaction can result in a spectral data matrix with a chemical rank smaller than the number of absorbing chemical species. Such a rank deficiency is problematic for a factor analysis because some information on the pure component spectra cannot be recovered from the mixture data. Matrix augmentation can break rank deficiencies and enable successful pure component recovery. In contrast, an artificial breakdown of a rank deficiency can be caused by a finite-precision numerical simulation of the underlying kinetic model and can fake a successful multivariate curve resolution (MCR) analysis. This work discusses the problem and points out some remedies.
A note on rank deficiency and numerical modeling. Klaus Neymeyr, Mathias Sawall, Tomass Andersons. Journal of Chemometrics 38(8), published 2024-04-23. DOI: 10.1002/cem.3550. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.3550
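The rank deficiency described in the abstract above can be illustrated with a small sketch. It assumes a hypothetical first-order reaction A -> B + C in which B and C form in equal amounts, so two of the three concentration profiles coincide. A tiny perturbation of one profile stands in for the error of a finite-precision kinetic simulation. The rate constant, time grid, perturbation size, and tolerances are all assumptions, and the rank routine is a plain Gaussian elimination rather than anything used in the paper.

```python
import math


def numerical_rank(matrix, tol):
    """Rank by Gaussian elimination with partial pivoting.

    Pivots with absolute value below tol are treated as zero.
    """
    rows = [list(r) for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) < tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            f = rows[r][col] / rows[rank][col]
            rows[r] = [x - f * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank


# Hypothetical kinetics: c_A decays exponentially, c_B and c_C are identical.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
k = 0.5
c_a = [math.exp(-k * t) for t in times]
c_b = [1.0 - a for a in c_a]
exact = [[a, b, b] for a, b in zip(c_a, c_b)]
# Stand-in for simulation error: a tiny perturbation of c_C that is not a
# linear combination of the other two profiles.
noisy = [[a, b, b + 1e-7 * t] for a, b, t in zip(c_a, c_b, times)]

print(numerical_rank(exact, tol=1e-10))  # 2: three species, rank two
print(numerical_rank(noisy, tol=1e-10))  # 3: deficiency artificially broken
print(numerical_rank(noisy, tol=1e-4))   # 2: deficiency visible again
```

Because the spectral data matrix is the product of the concentration matrix and the pure component spectra, its rank cannot exceed the rank of the concentration matrix. The example shows that the apparent rank of simulated data depends on how perturbations at the level of the simulation error are treated.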
Liwei Feng, Shaofeng Guo, Yifei Wu, Yu Xing, Yuan Li
Multi-stage processes with dynamic and nonlinear behavior are hard to monitor effectively. To address this, a time-space neighborhood standardization (TSNS) method is proposed and combined with partial least squares (PLS) to give the TSNS-PLS method for process fault detection. TSNS can transform multi-stage data into single-stage data that approximately follow a standard normal distribution, remove the temporal correlation between consecutive samples in the process data, and separate online fault samples. The transformed data satisfy the requirements that the PLS method places on process data, which can significantly improve the fault detection rate of PLS. Finally, the performance of TSNS-PLS was examined in fault detection experiments designed on a numerical simulation process and the penicillin fermentation process.
Application of time-space neighborhood standardization technology to complex multi-stage process fault detection. Liwei Feng, Shaofeng Guo, Yifei Wu, Yu Xing, Yuan Li. Journal of Chemometrics 38(8), published 2024-04-21. DOI: 10.1002/cem.3546.
Puneet Mishra, Michela Albano-Gaglio, Maria Font-i-Furnols
This study demonstrates a new approach to processing hyperspectral images in which both contextual spatial information and spectral information are used to predict sample properties. The deep contextual spatial information is extracted with a pretrained ResNet-18 deep learning architecture, while the spectral information is readily available as the average pixel values. To fuse the information in a complementary way, a multiblock modeling approach called sequential orthogonalized partial least squares is used. The sequential model guarantees that the information learned from the spatial and spectral domains is complementary. The potential of the approach is demonstrated by predicting several physical and chemical properties of pork bellies.
A short note on deep contextual spatial and spectral information fusion for hyperspectral image processing: Case of pork belly properties prediction. Puneet Mishra, Michela Albano-Gaglio, Maria Font-i-Furnols. Journal of Chemometrics 38(8), published 2024-04-18. DOI: 10.1002/cem.3552. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.3552
Pumpkin seeds are nutritious and have some medicinal value. However, mold and sprouting can develop while pumpkin seeds are in storage, and food safety and quality problems may arise if affected seeds are not removed before processing. Traditional testing methods are cumbersome and complex to operate and require destructive sample preparation. Therefore, terahertz time-domain spectroscopy (THz-TDS) was proposed for detecting the internal quality of pumpkin seeds. First, pumpkin seed samples of five different qualities were prepared: moldy for 3 days, moldy for 6 days, sprouted and moldy, sprouted, and normal. The seeds were then dried, ground, and pressed, and their spectral data were collected. The terahertz spectra of the five sample types were significantly different. Support vector machine (SVM), random forest (RF), and convolutional neural network (CNN) qualitative discriminant models were each established with the raw absorbance spectra, the preprocessed absorbance spectra, and the preprocessed and band-screened absorbance spectra. The CNN model based on the raw spectral data had the highest classification accuracy, 96%. The CNN models do not require prior spectral preprocessing, which simplifies the spectral analysis process, and they achieve the best classification accuracy compared with traditional chemometric models. The combination of CNN and THz-TDS has great potential for the detection of agricultural products and provides a new method for agricultural product quality detection.
Detection the quality of pumpkin seeds based on terahertz coupled with convolutional neural network. Zhaoxiang Sun, Bin Li, Akun Yang, Yande Liu. Journal of Chemometrics 38(7), published 2024-04-18. DOI: 10.1002/cem.3547.
Ahmed Faried Abdel Hakiem, John M. Boushra, Deena A. M. Noureldeen, Adel S. Lashien, Tamer Z. Attia
The antiviral agents Favipiravir (FAV) and Remdesivir (REM) were introduced in the last few years, alone or as a combination regimen, for management of the rapidly spreading coronavirus pandemic. A rapid and sensitive high-performance liquid chromatographic (HPLC) method has been developed for the simultaneous determination of their mixture. First, one-factor-at-a-time (OFAT) optimization was applied. Afterwards, a quality by design (QbD) approach was used, with a Box-Behnken experimental design (BBD) of four independent and nine dependent variables, to refine the optimized parameters further. The established model gave optimum resolution at an acetonitrile percentage of 52.66, a mobile phase pH of 2.91, a triethylamine percentage of 0.15, and a flow rate of 1.30 mL/min. The proposed method was validated according to the USP 31 NF 26 guidelines. Good linearity was obtained from 5.00 to 50.00 μg/mL for FAV and from 2.00 to 60.00 μg/mL for REM. Excellent relative standard deviation values (not more than 1.40) were obtained in the accuracy, precision, and robustness studies. The method was successfully applied to the investigated drugs in their pharmaceutical formulations and in spiked plasma samples, with good recoveries from 99.00% to 106.00%. The proposed method is considered suitable for quality control laboratories as well as for in vivo determination of both analytes.
Response surface experimental design for simultaneous chromatographic determination of two antiviral agents “Favipiravir and Remdesivir” in pharmaceuticals and spiked plasma samples. Ahmed Faried Abdel Hakiem, John M. Boushra, Deena A. M. Noureldeen, Adel S. Lashien, Tamer Z. Attia. Journal of Chemometrics 38(8), published 2024-04-18. DOI: 10.1002/cem.3548.
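The Box-Behnken design mentioned in the abstract above can be sketched in coded units. In the standard construction for three to five factors, each pair of factors is run at all four combinations of -1 and +1 while the remaining factors are held at 0, and centre points are added. The number of centre points used here (3) is an assumption, since the abstract does not state it, and the mapping from coded levels to real settings is left out because the factor ranges are not given.

```python
from itertools import combinations, product


def box_behnken(n_factors, n_center=3):
    """Coded Box-Behnken runs: each factor pair at +/-1, other factors at 0."""
    if not 3 <= n_factors <= 5:
        # Larger designs use a different block structure than all pairs.
        raise ValueError("pairwise construction is used for 3 to 5 factors")
    runs = []
    for i, j in combinations(range(n_factors), 2):
        for a, b in product((-1, 1), repeat=2):
            run = [0] * n_factors
            run[i], run[j] = a, b
            runs.append(tuple(run))
    # Centre points: every factor at its middle level.
    runs.extend([(0,) * n_factors] * n_center)
    return runs


# Four factors, e.g. acetonitrile %, pH, triethylamine %, flow rate.
design = box_behnken(4)
print(len(design))  # 27 = 6 factor pairs x 4 runs + 3 centre points
```

Each non-centre run changes only two factors at a time, so the design avoids the extreme corners of the factor space while still supporting a quadratic response surface model.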
Gyöngyi Vastag, Suzana Apostolov, Špiro Ivošević, Rebeka Rudolf
Monitoring the corrosion of alloys in real conditions often yields extensive data characterized by complex interdependence but also by a large degree of mutual deviation. First, the large dispersion of the results makes it very difficult to draw accurate conclusions about the real influence of the tested parameters on the corrosion behavior of alloys. Second, in many cases the high interdependence between corrosion factors burdens the analyzed system and makes the main influence significantly harder to recognize. Multivariate analysis, especially principal component analysis, is becoming increasingly popular for processing this type of data because of its ability to recognize and eliminate redundant data. The aim of this study was to examine whether multivariate analysis methods can be used to process corrosion test results obtained under real conditions. The results show that the multivariate method used, in combination with energy dispersive spectrometer analysis, can successfully identify the most important corrosion factors (type of corrosion environment, exposure time, and technological production processes) and their influence on the degradation of the tested TiNi alloys under the given conditions.
Chemometrics as a tool for monitoring corrosion degradation of the selected alloys in real conditions. Gyöngyi Vastag, Suzana Apostolov, Špiro Ivošević, Rebeka Rudolf. Journal of Chemometrics 38(7), published 2024-04-16. DOI: 10.1002/cem.3551.
Nooshin Arabi, Mohammad Reza Torabi, Afshin Fassihi, Fahimeh Ghasemi
Angiogenesis, a crucial process in tumor growth, is widely recognized as a key factor in cancer progression. The vascular endothelial growth factor (VEGF) signaling pathway plays a pivotal role in promoting angiogenesis. The primary objective of this study was to identify a powerful classifier for distinguishing compounds as active or inactive inhibitors of VEGF receptors. To build the machine learning model, compounds were sourced from the BindingDB database. A variety of common feature selection techniques, including both filter-based and wrapper-based methods, were applied to reduce dimensionality and, consequently, the risk of overfitting. Robust and accurate tree-based classifiers were employed in the classification procedure. Application of the extra-tree classifier with the MultiSURF* feature selection method provided a model with superior accuracy (83.7%) compared with other feature selection techniques. High-throughput molecular docking, followed by accurate docking and comprehensive analysis of the results, was performed to identify the best possible inhibitors of these receptors. Comprehensive analysis of the docking results revealed successful prediction of molecules with VEGFR1 and VEGFR2 inhibitory activity. These results emphasized that the extra-tree model coupled with MultiSURF* feature selection outperformed other methods in identifying chemical compounds targeting specific VEGF receptors.
Identification of potential vascular endothelial growth factor receptor inhibitors via tree-based learning modeling and molecular docking simulation. Nooshin Arabi, Mohammad Reza Torabi, Afshin Fassihi, Fahimeh Ghasemi. Journal of Chemometrics 38(7), published 2024-04-01. DOI: 10.1002/cem.3545.
Ian A. Gough, Sarah Rassenberg, Claire Velikonja, Brandon Corbett, David R. Latulippe, Prashant Mhaskar
Real-time selective protein quantification is an integral component of operating continuous chromatography processes. Partial least squares models fit with spectroscopic UV-Vis absorbance data have demonstrated the ability to selectively quantify proteins. With standard continuous chromatography equipment that is only capable of measuring absorbance at a few user-defined wavelengths, the problem of selecting appropriate wavelengths that maximize the measurement capability of the instrument remains unaddressed. Therefore, we propose a method for selecting wavelengths for continuous chromatography equipment. We illustrate our method using sets of protein mixtures composed of bovine serum albumin and lysozyme. The first step is to refine the raw wavelength set with a statistical t-test and an absorbance magnitude test. Then, the wavelengths within the refined spectroscopic range are ranked. Three existing techniques are evaluated – sequential forward search, variable importance to projection scores, and the least absolute shrinkage and selection operator. The best technique (in this case, sequential forward search) determines a subset of three wavelengths for further evaluation on the BioSMB PD. We use an exhaustive approach to determine the final wavelength set. We show that soft sensor models trained from the method's wavelength selections can quantify the two proteins more accurately than from the wavelength set of 230, 260 and 280 nm, by a factor of four. The method is shown to determine appropriate wavelengths for different path lengths and protein concentration ranges. Overall, we provide a tool that alleviates the analytical bottleneck for practitioners seeking to develop advanced monitoring and control methods on standard equipment.
Selective protein quantification on continuous chromatography equipment with limited absorbance sensing: A partial least squares and statistical wavelength selection solution. Ian A. Gough, Sarah Rassenberg, Claire Velikonja, Brandon Corbett, David R. Latulippe, Prashant Mhaskar. Journal of Chemometrics 38(7), published 2024-03-28. DOI: 10.1002/cem.3541. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.3541
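Sequential forward search, one of the ranking techniques named in the abstract above, can be sketched generically: starting from an empty set, it repeatedly adds the candidate that most improves a score. In the study the score would presumably come from the fitted soft sensor model. Here a made-up toy score is used instead, so the wavelengths, weights, and redundancy penalty below are hypothetical and carry no experimental meaning.

```python
def sequential_forward_search(candidates, score, k):
    """Greedily grow a subset, adding the candidate that maximises score."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda c: score(selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy score (higher is better): each wavelength has a made-up usefulness,
# and pairs closer than 10 nm are penalised as redundant.
INFO = {220: 0.9, 225: 0.85, 260: 0.5, 280: 0.7}


def toy_score(subset):
    gain = sum(INFO[w] for w in subset)
    close_pairs = sum(
        1
        for i, a in enumerate(subset)
        for b in subset[i + 1:]
        if abs(a - b) <= 10
    )
    return gain - 0.5 * close_pairs


chosen = sequential_forward_search(INFO, toy_score, k=3)
print(chosen)  # [220, 280, 260]
```

The search skips 225 nm although it has the second-highest individual score, because it is nearly redundant with 220 nm. A greedy search of this kind is not guaranteed to find the best subset, which is consistent with the abstract's use of an exhaustive step to fix the final wavelength set.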
François Stevens, Beatriz Carrasco, Vincent Baeten, Juan A. Fernández Pierna
The t-distributed stochastic neighbour embedding algorithm, or t-SNE, is a non-linear dimension reduction method used to visualise multivariate data. It enables a high-dimensional dataset, such as a set of infrared spectra, to be represented on a single, typically two-dimensional graph, revealing its global and local structure. t-SNE is very popular in the machine learning community and has been applied in many fields, generally with the aim of visualising large datasets. In vibrational spectroscopy, t-SNE is gaining attention, but principal component analysis (PCA) remains by far the reference method for exploratory analysis and dimension reduction. However, t-SNE may be a real aid in the analysis of vibrational spectroscopic datasets. It provides an at-a-glance global view of the dataset, making it possible to distinguish the main factors influencing the spectral signal and the hierarchy between these factors, and it gives an indication of whether predictive modelling is feasible. It can also provide great support in the choice of pre-processing, by rapidly comparing different general pre-processing approaches according to their effect on the variable of interest. Here we illustrate these advantages using different datasets. We also propose an approach based on a synergy between the t-SNE and PCA methods, allowing the respective advantages of each to be exploited.
Use of t-distributed stochastic neighbour embedding in vibrational spectroscopy. François Stevens, Beatriz Carrasco, Vincent Baeten, Juan A. Fernández Pierna. Journal of Chemometrics 38(4), published 2024-03-23. DOI: 10.1002/cem.3544.