Nooshin Arabi, Mohammad Reza Torabi, Afshin Fassihi, Fahimeh Ghasemi
Angiogenesis, a crucial process in tumor growth, is widely recognized as a key factor in cancer progression. The vascular endothelial growth factor (VEGF) signaling pathway plays a pivotal role in promoting angiogenesis. The primary objective of this study was to identify a powerful classifier for distinguishing compounds as active or inactive inhibitors of VEGF receptors. To build the machine learning model, compounds were sourced from the BindingDB database. A variety of common feature selection techniques, including both filter-based and wrapper-based methods, were applied to reduce dimensionality and, consequently, the risk of overfitting. Robust and accurate tree-based classifiers were employed in the classification procedure. The extra-tree classifier combined with the MultiSURF* feature selection method provided a model with superior accuracy (83.7%) compared with the other feature selection techniques. High-throughput molecular docking, followed by more accurate docking and comprehensive analysis of the results, was performed to identify the most promising inhibitors of these receptors. Analysis of the docking results revealed successful prediction of molecules with VEGFR1 and VEGFR2 inhibitory activity. These results emphasize that the extra-tree model, coupled with MultiSURF* feature selection, surpassed the other methods in identifying chemical compounds targeting specific VEGF receptors.
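The pipeline described here (filter-based feature selection feeding a tree-based classifier) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the paper used MultiSURF* (available in the third-party `skrebate` package), for which a mutual-information filter stands in below so the example needs only scikit-learn, and the data are synthetic stand-ins for molecular descriptors.

```python
# Sketch of the abstract's pipeline: filter-based feature selection followed
# by an extra-trees classifier. A mutual-information filter is used here as a
# stand-in for the paper's MultiSURF* selector (assumption, for portability).
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for descriptor vectors of active/inactive compounds.
X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = make_pipeline(
    SelectKBest(mutual_info_classif, k=20),      # dimensionality reduction step
    ExtraTreesClassifier(n_estimators=200, random_state=0),
)
model.fit(X_tr, y_tr)
acc = accuracy_score(y_te, model.predict(X_te))
print(f"test accuracy: {acc:.3f}")
```

Swapping the `SelectKBest` step for `skrebate.MultiSURFstar` would reproduce the paper's selector within the same pipeline structure.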
{"title":"Identification of potential vascular endothelial growth factor receptor inhibitors via tree-based learning modeling and molecular docking simulation","authors":"Nooshin Arabi, Mohammad Reza Torabi, Afshin Fassihi, Fahimeh Ghasemi","doi":"10.1002/cem.3545","DOIUrl":"10.1002/cem.3545","url":null,"abstract":"<p>Angiogenesis, a crucial process in tumor growth, is widely recognized as a key factor in cancer progression. The vascular endothelial growth factor (VEGF) signaling pathway is important for its pivotal role in promoting angiogenesis. The primary objective of this study was to identify a powerful classifier for distinguishing compounds as active or inactive inhibitors of VEGF receptors. To build the machine learning model, compounds were sourced from the BindingDB database. A variety of common feature selection techniques, including both filter-based and wrapper-based methods, were applied to reduce dimensionality, subsequently, overfitting problem. Robust and accurate tree-based classifiers were employed in the classification procedure. Application of the extra-tree classifier using the MultiSURF* feature selection method provided a model with superior accuracy (83.7%) compared with other feature selection techniques. High-throughput molecular docking followed by an accurate docking and comprehensive analysis of the results was performed to provide the best possible inhibitors of these receptors. Comprehensive analysis of the docking results revealed successful prediction of molecules with VEGFR1 and VEGFR2 inhibitory activity. 
These results emphasized that the performance of the extra-tree model, coupled with MultiSURF* feature selection, surpassed other methods in identifying chemical compounds targeting specific VEGF receptors.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140602313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ian A. Gough, Sarah Rassenberg, Claire Velikonja, Brandon Corbett, David R. Latulippe, Prashant Mhaskar
Real-time selective protein quantification is an integral component of operating continuous chromatography processes. Partial least squares models fitted to spectroscopic UV-Vis absorbance data have demonstrated the ability to selectively quantify proteins. With standard continuous chromatography equipment that can measure absorbance at only a few user-defined wavelengths, the problem of selecting the wavelengths that maximize the measurement capability of the instrument remains unaddressed. We therefore propose a wavelength selection method for continuous chromatography equipment and illustrate it using sets of protein mixtures composed of bovine serum albumin and lysozyme. The first step refines the raw wavelength set with a statistical t-test and an absorbance magnitude test. The wavelengths within the refined spectroscopic range are then ranked; three existing techniques are evaluated for this purpose: sequential forward search, variable importance to projection scores, and the least absolute shrinkage and selection operator. The best technique (in this case, sequential forward search) determines a subset of three wavelengths for further evaluation on the BioSMB PD, and an exhaustive approach determines the final wavelength set. We show that soft sensor models trained on the method's wavelength selections quantify the two proteins roughly four times more accurately than models built on the standard wavelength set of 230, 260, and 280 nm. The method also determines appropriate wavelengths for different path lengths and protein concentration ranges. Overall, we provide a tool that alleviates the analytical bottleneck for practitioners seeking to develop advanced monitoring and control methods on standard equipment.
{"title":"Selective protein quantification on continuous chromatography equipment with limited absorbance sensing: A partial least squares and statistical wavelength selection solution","authors":"Ian A. Gough, Sarah Rassenberg, Claire Velikonja, Brandon Corbett, David R. Latulippe, Prashant Mhaskar","doi":"10.1002/cem.3541","DOIUrl":"10.1002/cem.3541","url":null,"abstract":"<p>Real-time selective protein quantification is an integral component of operating continuous chromatography processes. Partial least squares models fit with spectroscopic UV-Vis absorbance data have demonstrated the ability to selectively quantify proteins. With standard continuous chromatography equipment that is only capable of measuring absorbance at a few user-defined wavelengths, the problem of selecting appropriate wavelengths that maximize the measurement capability of the instrument remains unaddressed. Therefore, we propose a method for selecting wavelengths for continuous chromatography equipment. We illustrate our method using sets of protein mixtures composed of bovine serum albumin and lysozyme. The first step is to refine the raw wavelength set with a statistical <i>t</i>-test and an absorbance magnitude test. Then, the wavelengths within the refined spectroscopic range are ranked. Three existing techniques are evaluated – sequential forward search, variable importance to projection scores, and the least absolute shrinkage and selection operator. The best technique (in this case, sequential forward search) determines a subset of three wavelengths for further evaluation on the BioSMB PD. We use an exhaustive approach to determine the final wavelength set. We show that soft sensor models trained from the method's wavelength selections can quantify the two proteins more accurately than from the wavelength set of 230, 260 and 280 nm, by a factor of four. The method is shown to determine appropriate wavelengths for different path lengths and protein concentration ranges. 
Overall, we provide a tool that alleviates the analytical bottleneck for practitioners seeking to develop advanced monitoring and control methods on standard equipment.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.3541","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140324396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
François Stevens, Beatriz Carrasco, Vincent Baeten, Juan A. Fernández Pierna
The t-distributed stochastic neighbour embedding algorithm, or t-SNE, is a non-linear dimension reduction method used to visualise multivariate data. It enables a high-dimensional dataset, such as a set of infrared spectra, to be represented on a single, typically two-dimensional graph, revealing its global and local structure. t-SNE is very popular in the machine learning community and has been applied in many fields, generally with the aim of visualising large datasets. In vibrational spectroscopy, t-SNE is gaining attention, but principal component analysis (PCA) remains by far the reference method for exploratory analysis and dimension reduction. Nevertheless, t-SNE can be a real aid in the analysis of vibrational spectroscopic datasets. It provides an at-a-glance global view of the dataset, making it possible to distinguish the main factors influencing the spectral signal and the hierarchy between them, and it indicates whether predictive modelling is likely to be feasible. It can also support the choice of pre-processing, by rapidly comparing different general pre-processing approaches according to their effect on the variable of interest. Here we illustrate these advantages using different datasets. We also propose an approach based on a synergy between the t-SNE and PCA methods, allowing the respective advantages of each to be exploited.
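One common way to combine the two methods, consistent with the synergy described above, is to compress the spectra with PCA first and then embed the scores with t-SNE. The sketch below assumes synthetic "spectra" from two groups rather than real vibrational data, and the specific component counts and perplexity are illustrative choices, not the authors' settings.

```python
# Sketch of a PCA/t-SNE combination: PCA compresses 200 spectral channels to
# a few components (denoising, speed), then t-SNE embeds the scores in 2-D.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(50, 200))   # synthetic "spectra", group A
group_b = rng.normal(0.6, 1.0, size=(50, 200))   # shifted baseline = group effect
X = np.vstack([group_a, group_b])

scores = PCA(n_components=10, random_state=0).fit_transform(X)
embedding = TSNE(n_components=2, init="pca", perplexity=30,
                 random_state=0).fit_transform(scores)
print(embedding.shape)
```

Plotting `embedding` coloured by group (or by a candidate pre-processing variant) gives the at-a-glance view of dataset structure discussed above.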
{"title":"Use of t-distributed stochastic neighbour embedding in vibrational spectroscopy","authors":"François Stevens, Beatriz Carrasco, Vincent Baeten, Juan A. Fernández Pierna","doi":"10.1002/cem.3544","DOIUrl":"10.1002/cem.3544","url":null,"abstract":"<p>The <i>t-distributed stochastic neighbour embedding</i> algorithm or <i>t-SNE</i> is a non-linear dimension reduction method used to visualise multivariate data. It enables a high-dimensional dataset, such as a set of infrared spectra, to be represented on a single, typically two-dimensional graph, revealing its global and local structure. t-SNE is very popular in the machine learning community and has been applied in many fields, generally with the aim of visualising large datasets. In vibrational spectroscopy, t-SNE is gaining notoriety but principal component analysis (PCA) remains by far the reference method for exploratory analysis and dimension reduction. However, t-SNE may represent a real aid in the analysis of vibrational spectroscopic datasets. It provides an at-a-glance global view of the dataset allowing to distinguish the main factors influencing the spectral signal and the hierarchy between these factors, and gives an indication on the possibility of performing predictive modelling. It can also provide great support in the choice of the pre-processing, by comparing rapidly different general pre-processing approaches according to their effect on the variable of interest. Here we propose to illustrate these advantages using different datasets. 
We also propose an approach based on a synergy between the t-SNE and PCA methods, allowing respective advantages of each to be exploited.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140199820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study illustrates the at-line application of hyperspectral imaging in the visible range for quality control of large-scale offset printing. In particular, the measurement stability of a competing hyperspectral device is assessed and compared with that of traditional handheld and desktop spectrophotometers. The performance of the commercially available instruments was assessed based on the collected spectra and their corresponding L*, a*, and b* values. The printing process was described by hyperspectral images (in the visible range) of selected regions from template color fields acquired on 17 sampling occasions. The spectra constituting the hyperspectral images were visualized and evaluated in the space of the significant principal components obtained from principal component analysis. Furthermore, confidence ellipses were constructed for each set of spectra characterizing a specific moment of the printing process. Comparing their mutual locations, shapes, orientations, and sizes enabled effective visualization of process variability and proved more comprehensive than the classic approach based on the information provided by desktop and handheld spectrometers.
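The confidence-ellipse construction mentioned above can be sketched directly from 2-D PC scores: the ellipse centre, axis lengths, and orientation follow from the score mean and covariance, scaled by a chi-square quantile. This is a generic sketch, not the authors' code, and the scores below are simulated.

```python
# Minimal 95% confidence ellipse from 2-D PCA scores of one sampling
# occasion: centre = score mean; axes and orientation from the eigen-
# decomposition of the score covariance, scaled by the chi-square(2) quantile.
import numpy as np
from scipy.stats import chi2

def confidence_ellipse(scores_2d, level=0.95):
    """Return centre, semi-axis lengths (ascending), and rotation angle (rad)."""
    centre = scores_2d.mean(axis=0)
    cov = np.cov(scores_2d, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    scale = chi2.ppf(level, df=2)                # 95% quantile of chi-square(2)
    semi_axes = np.sqrt(scale * eigvals)
    angle = np.arctan2(eigvecs[1, -1], eigvecs[0, -1])   # major-axis direction
    return centre, semi_axes, angle

rng = np.random.default_rng(0)
scores = rng.multivariate_normal([0, 0], [[4, 1], [1, 1]], size=300)
centre, semi_axes, angle = confidence_ellipse(scores)
print(centre, semi_axes, angle)
```

Overlaying one such ellipse per sampling occasion and comparing their locations, shapes, orientations, and sizes gives the process-variability view the study describes.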
{"title":"Toward more efficient and effective color quality control for the large-scale offset printing process","authors":"Pawel Dziki, Lukasz Pieszczek, Michal Daszykowski","doi":"10.1002/cem.3543","DOIUrl":"10.1002/cem.3543","url":null,"abstract":"<p>This study illustrates at-line application of hyperspectral imaging in the visible range for quality control of large-scale offset printing. In particular, the measurement stability of a competitive device is assessed and compared to traditional handheld and desktop spectrophotometers. The performance of the commercially available instruments was assessed based on collected spectra and their corresponding L*, a*, and b* values. The printing process was described by hyperspectral images (in visible range) of selected regions from template color fields acquired at 17 sampling occasions. Spectra constituting hyperspectral images were visualized and evaluated in the space of significant principal components obtained from the principal component analysis. Furthermore, confidence ellipses were constructed for each set of spectra characterizing a specific moment of the printing process. Comparing their mutual locations, shapes, orientations, and sizes enabled effective visualization of process variability and was more comprehensive regarding the classic approach based on information provided by desktop and handheld spectrometers.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140154995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
B. Borkovits, E. Kontsek, A. Pesti, P. Gordon, S. Gergely, I. Csabai, A. Kiss, P. Pollner
In this project, we used formalin-fixed paraffin-embedded (FFPE) tissue samples to measure thousands of spectra per tissue core with Fourier transform mid-infrared spectroscopy using an FT-IR imaging system. The cores comprised normal colon (NC) and colorectal primer carcinoma (CRC) tissues. We created a database to manage all the multivariate data obtained from the measurements and then applied classifier algorithms to identify the tissue type from its spectra. For classification, we used random forest, support vector machine, XGBoost, and linear discriminant analysis methods, as well as three deep neural networks. We compared two data manipulation techniques with these models and then applied filtering. Finally, we compared model performances via the sum of ranking differences (SRD).
{"title":"Classification of colorectal primer carcinoma from normal colon with mid-infrared spectra","authors":"B. Borkovits, E. Kontsek, A. Pesti, P. Gordon, S. Gergely, I. Csabai, A. Kiss, P. Pollner","doi":"10.1002/cem.3542","DOIUrl":"10.1002/cem.3542","url":null,"abstract":"<p>In this project, we used formalin-fixed paraffin-embedded (FFPE) tissue samples to measure thousands of spectra per tissue core with Fourier transform mid-infrared spectroscopy using an FT-IR imaging system. These cores varied between normal colon (NC) and colorectal primer carcinoma (CRC) tissues. We created a database to manage all the multivariate data obtained from the measurements. Then, we applied classifier algorithms to identify the tissue based on its yielded spectra. For classification, we used the random forest, a support vector machine, XGBoost, and linear discriminant analysis methods, as well as three deep neural networks. We compared two data manipulation techniques using these models and then applied filtering. In the end, we compared model performances via the sum of ranking differences (SRD).</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.3542","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140126719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modeling near-infrared (NIR) spectral data to predict fresh fruit properties is a challenging task. The difficulty lies in creating generalized models that work across fruits of different cultivars, seasons, and even multiple fruit commodities. Due to intrinsic differences in spectral properties, NIR models often fail in testing, resulting in high bias and prediction errors. One current solution for achieving generalized models is to use large calibration sets measured over multiple cultivars and harvest seasons. However, current practice primarily focuses on calibration sets for single fruit commodities, disregarding the rich information available from other commodities. This study demonstrates the potential of locally weighted partial least squares, an example of just-in-time (JIT) modeling, to develop real-time models based on calibration sets consisting of multiple fruit commodities. The study also explores JIT modeling for leveraging relevant information from other fruit commodities and for adapting the model based on new samples. The application demonstrated here predicts dry matter in fresh fruit using portable NIR spectroscopy. The results show that JIT modeling is particularly effective for multiple fruit commodities in a single calibration set: the JIT models achieved a root mean squared error of prediction (RMSEP) of 0.69% fresh weight (FW), whereas traditional partial least squares (PLS) modeling achieved an RMSEP of 0.93% FW. JIT modeling can be particularly beneficial when the user has multiple fruit datasets and wants to combine them into a single dataset to utilize all the relevant information available.
{"title":"Developing multifruit global near-infrared model to predict dry matter based on just-in-time modeling","authors":"Puneet Mishra","doi":"10.1002/cem.3540","DOIUrl":"10.1002/cem.3540","url":null,"abstract":"<p>Modeling near-infrared (NIR) spectral data to predict fresh fruit properties is a challenging task. The difficulty lies in creating generalized models that can work on fruits of different cultivars, seasons, and even multiple commodities of fruit. Due to intrinsic differences in spectral properties, NIR models often fail in testing, resulting in high bias and prediction errors. One current solution for achieving generalized models is to use large calibration sets measured over multiple cultivars and harvest seasons. However, current practice primarily focuses on calibration sets for single fruit commodities, disregarding the rich information available from other fruit commodities. This study aims to demonstrate the potential of locally weighted partial least-squares an example of just-in-time (JIT) modeling to develop real-time models based on calibration sets consisting of multiple fruit commodities. The study also explores JIT modeling for leveraging relevant information from other fruit commodities or adapting the model based on new samples. The application demonstrated here predicts the dry matter in fresh fruit using portable NIR spectroscopy. The results show that JIT modeling is particularly effective for multiple fruit commodities in a single calibration set. The JIT models achieved a root mean squared error of prediction (RMSEP) of 0.69% fresh weight (FW), while the traditional partial least squares (PLS) modeling RMSEP was 0.93% FW. 
JIT modeling can be particularly beneficial when the user has multiple fruit datasets and wants to combine them into a single dataset to utilize all the relevant information available.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.3540","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140043948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The rapid advancement of industrialization and urbanization has made air pollution a global problem. Air quality declines due to pollutants in the air, including carcinogenic gases and particles that cause adverse health effects. Estimating the concentration of air pollutants is therefore of great interest, as it can provide accurate information about air quality and support the planning of future activities. This study considers Istanbul, a province with a high concentration of industry, population, and vehicle traffic. Particulate matter (PM), one of the most basic air pollutants, consists of microscopic solids or liquid droplets small enough to be inhaled and cause serious health problems. A hybrid model combining the discrete wavelet transform (DWT) with the deep learning method long short-term memory (LSTM) is therefore applied to predict PM10 concentrations, and models capable of such prediction were developed within the scope of this study. Furthermore, combining LSTM with the most appropriate discrete wavelet type distinguishes this study from the existing literature. The ability of these methods to make successful future predictions helps institutions and organizations that can take precautions act at the right time; in addition, the deep learning methods used contribute to the development of sustainable smart environmental systems. At a time when air pollution is increasing and threatening human health, any precaution that can be taken improves the quality of life for all living things, reduces health issues and deaths caused by air pollution, and thus raises the degree of well-being. These findings might offer reliable scientific evidence for Istanbul's air pollution management, which can serve as an example for other regions.
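The DWT stage of such a hybrid can be sketched without heavy dependencies: a one-level Haar transform splits an hourly PM10 series into a smooth approximation and a detail component, which the paper's approach would then feed to an LSTM (omitted here; the Haar wavelet and the synthetic series are illustrative assumptions, and the study selects among several wavelet types).

```python
# One-level Haar discrete wavelet transform of an hourly PM10 series:
# the preprocessing stage of a DWT + LSTM hybrid (LSTM stage omitted).
import numpy as np

def haar_dwt(signal):
    """One-level Haar transform: returns (approximation, detail) coefficients."""
    x = np.asarray(signal, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # smooth, low-frequency component
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency fluctuations
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt (perfect reconstruction)."""
    even = (approx + detail) / np.sqrt(2.0)
    odd = (approx - detail) / np.sqrt(2.0)
    out = np.empty(2 * approx.size)
    out[0::2], out[1::2] = even, odd
    return out

# Synthetic hourly PM10 series (µg/m³); a real study would use measured data.
pm10 = np.abs(np.random.default_rng(0).normal(40, 15, size=64))
a, d = haar_dwt(pm10)
assert np.allclose(haar_idwt(a, d), pm10)  # the transform loses no information
```

In practice one would forecast each component separately (e.g., with an LSTM) and invert the transform to recombine the predictions; the `pywt` package provides the full family of wavelet types the study compares.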
{"title":"Optimizing air quality predictions: A discrete wavelet transform and long short-term memory approach with wavelet-type selection for hourly PM10 concentrations","authors":"Gökçe Nur Taşağıl Arslan, Serpil Kılıç Depren","doi":"10.1002/cem.3539","DOIUrl":"10.1002/cem.3539","url":null,"abstract":"<p>The rapid advancement of industrialization and urbanization has led to the global problem of air pollution. Air quality can decrease due to pollutants in the air, including types of gases and particles that are carcinogenic, causing adverse health effects. Therefore, estimating the concentration of air pollutants is of great interest as it can provide accurate information about air quality with proper planning of future activities. In this manner, this study considers Istanbul, a province with a high concentration of industry, population, and vehicle traffic. Particulate matter (PM), one of the most basic air pollutants, is stated to contain microscopic solids or liquid droplets that are small enough to be inhaled and cause serious health problems. Thus, it is recommended to apply discrete wavelet transform (DWT) and deep learning method long short-term memory (LSTM) as a hybrid model to predict the concentration of PM<sub>10</sub>. Using the mentioned methods, they can predict air pollution to have been developed within the scope of this study. Furthermore, the hybrid approach with LSTM by selecting the most appropriate discrete wavelet type emphasizes the difference of this study from the existing literature. The ability of these developed methods to make successful future predictions helps institutions and organizations that can take precautions on the subject to take action at the right time; in addition, the deep learning methods used contribute to the development of sustainable smart environmental systems. 
In today's environment when air pollution is increasing and threatening human health, any precaution that can be taken would improve the quality of life for all living things, reduce health issues and deaths caused by air pollution, and thus raise the degree of well-being. These findings might offer a reliable scientific evidence for Istanbul City's air pollution management, which can serve as an example for other regions.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140044044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jing Wang, Yi Liu, Dongping Zhang, Lei Xie, Jiusun Zeng
Motivated by the industrial setting in which different operating modes share the same system configuration and control structure, this article proposes a novel structured discriminant Gaussian graph learning method for multimode process monitoring. The proposed method accounts not only for the sparsity of the graphical model but also for the measurement of data variation based on a mismatched graph and the common node support shared between different graphical structures. The objective function involves two sets of regularization terms: the trace terms for mismatched measurements and the