Inhibition of the hERG potassium channel is closely associated with QT interval prolongation, and assessing this risk could greatly contribute to the development of safer therapeutic compounds. In the hit-to-lead optimization stage of drug development, quantitative prediction of hERG inhibitory activity is crucial for designing drug candidates without cardiotoxicity risk. Here, we developed a hERG regression model that combines support vector regression (SVR) with descriptor selection by the non-dominated sorting genetic algorithm (NSGA-II), based on the AMED cardiotoxicity database, which contains hERG blocking information built by integrating public and commercial databases. To construct the regression model, 6,561 compounds with IC50 and/or Ki values were taken from the AMED cardiotoxicity database and randomly split into a training set (70%) for model building and a test set (30%) for performance evaluation. To avoid overfitting caused by many irrelevant explanatory variables, NSGA-II, a variant of the genetic algorithm for multi-objective optimization, was used for descriptor selection to simultaneously maximize Q2 and minimize RMSE in 5-fold cross-validation while minimizing the number of descriptors used. The prediction performance was then compared with that of ADMET Predictor, commercial software that provides various ADMET property predictions. The SVR model achieved an R2 of 0.594 and an RMSE of 0.604 on the test set, clearly outperforming ADMET Predictor (0.134 and 0.690, respectively). The regression model is available at our home page (https://drugdesign.riken.jp/hERG).
{"title":"Quantitative prediction of hERG inhibitory activities using support vector regression and the integrated hERG dataset in AMED cardiotoxicity database","authors":"Tomohiro Sato, Hitomi Yuki, T. Honma","doi":"10.1273/cbij.21.70","DOIUrl":"https://doi.org/10.1273/cbij.21.70","journal":{"name":"Chem-Bio Informatics Journal"},"publicationDate":"2021-10-01","publicationTypes":"Journal Article"}
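The descriptor selection described in this abstract balances several objectives at once (higher Q2, lower RMSE, fewer descriptors). The central idea of non-dominated sorting, keeping only candidates that no other candidate beats on every objective, can be sketched as below. This is a minimal illustration of first-front extraction only, not the full NSGA-II procedure and not the authors' implementation; the function names and all candidate scores are hypothetical.

```python
from typing import List, Tuple

# Each candidate descriptor subset is scored as (q2, rmse, n_descriptors).
# All numbers below are invented for illustration; none come from the paper.
Candidate = Tuple[float, float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and strictly better on one.

    Objectives: maximize q2, minimize rmse, minimize n_descriptors.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def first_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


candidates = [
    (0.60, 0.60, 120),
    (0.58, 0.62, 40),
    (0.55, 0.65, 150),  # worse than the first candidate on all three objectives
    (0.50, 0.70, 10),
]
front = first_front(candidates)
print(front)
```

The surviving candidates represent different trade-offs between accuracy and model size; a full genetic algorithm would repeat this ranking over many generations of mutated and recombined descriptor subsets.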
In recent years, finding seed compounds faster and reducing the cost of pharmaceutical research have become a necessity. In silico drug discovery methods, which predict new drug candidates using physicochemical features and substructure fingerprints of compounds, are therefore expected to contribute. Selecting seed compounds without conducting experiments could reduce the time and cost required for drug development. However, estimating the characteristics of compounds in the body using a simple linear model alone is unsatisfactory, because the effects and distribution of compounds are determined by the environment in the body and by their interactions with other molecules. More complex models have therefore been developed to estimate compound characteristics with high predictive accuracy. Thus, it is increasingly important to evaluate predictive performance correctly when selecting models appropriate for a research purpose. The coefficient of determination, known as R2, is one of the best-known statistical measures for evaluating regression models. However, this measure cannot be used to evaluate nonlinear models. In this paper, the difficulty of using the coefficient of determination is explained, and appropriate statistical measures are suggested for two situations: mean squared error (MSE) for cross-validation, and MSE together with the correlation coefficient between observed and predicted values for test data. Because statistical measures must be understood and used appropriately, the suggested measures will support the effective selection of promising seed compounds and accelerate drug discovery.
{"title":"Appropriate Evaluation Measurements for Regression Models","authors":"Tsuyoshi Esaki","doi":"10.1273/cbij.21.59","DOIUrl":"https://doi.org/10.1273/cbij.21.59","journal":{"name":"Chem-Bio Informatics Journal"},"publicationDate":"2021-09-15","publicationTypes":"Journal Article"}
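The measures named in this abstract can be computed in a few lines. The sketch below is an illustration under stated assumptions rather than the paper's own procedure: the function names and the toy data are hypothetical, and R2 is computed here as 1 - SS_res/SS_tot, which is only one of several definitions in use. The data are chosen so that predictions are offset from the observations by a constant, a case where the correlation coefficient alone would look perfect.

```python
import math
from typing import Sequence


def mse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def pearson_r(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def r2_from_residuals(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """R2 defined as 1 - SS_res / SS_tot (an assumed definition for this sketch)."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: every prediction is too high by exactly 2.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [3.0, 4.0, 5.0, 6.0, 7.0]

print("MSE:", mse(observed, predicted))
print("r:  ", pearson_r(observed, predicted))
print("R2: ", r2_from_residuals(observed, predicted))
```

On these toy data the correlation is perfect while the MSE exposes the systematic offset and this definition of R2 falls below zero, which illustrates why reporting MSE alongside the correlation coefficient gives a fuller picture than either number alone.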
The identification of molecular descriptors that embody the chemical information for druglikeness will be a step forward in data-driven drug discovery and development. In this study, over 4000 Dragon-type molecular properties were generated for approximately 2000 known drugs and 2000 surrogate nondrugs. Logistic Regression (LogR) and Random Forest (RF) techniques were applied to identify the crucial molecular descriptors that can adequately classify a compound as drug or nondrug. Ten one-variable LogR models each demonstrated at least 70% prediction accuracy. A two-variable model consisting of HVcpx and MDDD correctly classified 85% of the test compounds. The best LogR model, with 89.0% prediction accuracy, identified the five most influential descriptors for druglikeness: an information index, HVcpx; a topological index, MDDD; a ring descriptor, NNRS; the average connectivity index of order 2, X2A; and a walk and path count, SRW05. The best RF model, involving 10 descriptors that are only weakly correlated, was found to be 92.5% accurate and on par with the RF and LogR models that consisted of over 200 variables. The model featured: molecular weight, MW; average molecular weight, AMW; rotatable bond fraction, RBF; percentage carbon, C%; maximal electrotopological negative variation, MAXDN; all-path Wiener index, Wap; structural information content index (neighborhood symmetry of order 1), SIC1; number of nitrogen atoms, nN; 2D Petitjean shape index, PJI2; and self-returning walk count of order 5, SRW05. Many of these descriptors have straightforward chemical interpretability and future applicability as druglikeness filters in virtual high-throughput drug discovery.
{"title":"Logistic regression and random forest unveil key molecular descriptors of druglikeness","authors":"L. T. Billones, Nadia B. Morales, J. Billones","doi":"10.1273/CBIJ.21.39","DOIUrl":"https://doi.org/10.1273/CBIJ.21.39","journal":{"name":"Chem-Bio Informatics Journal","pages":"39-58"},"publicationDate":"2021-09-08","publicationTypes":"Journal Article"}
Pub Date: 2021-07-26 DOI: 10.21203/rs.3.rs-737867/v1
Shuhei Kimura, Yahiro Takeda, M. Tokuhisa, Mariko Okada
Background: Among the various methods so far proposed for genetic network inference, this study focuses on the random-forest-based methods. Confidence values are assigned to all of the candidate regulations when taking the random-forest-based approach. To our knowledge, all of the random-forest-based methods make the assignments using the standard variable importance measure defined in tree-based machine learning techniques. We think, however, that this measure has drawbacks in the inference of genetic networks. Results: In this study we therefore propose an alternative measure, which we call "the random-input variable importance measure," and design a new inference method that uses the proposed measure in place of the standard measure in the existing random-forest-based inference method. We show, through numerical experiments, that the use of the random-input variable importance measure improves the performance of the existing random-forest-based inference method by as much as 45.5% with respect to the area under the recall-precision curve (AURPC). Conclusion: This study proposed the random-input variable importance measure for the inference of genetic networks. The use of our measure improved the performance of the random-forest-based inference method. In this study, we checked the performance of the proposed measure only on several genetic network inference problems. However, the experimental results suggest that the proposed measure will work well in other applications of random forests.
{"title":"Inference of genetic networks using random forests: performance improvement using a new variable importance measure","authors":"Shuhei Kimura, Yahiro Takeda, M. Tokuhisa, Mariko Okada","doi":"10.21203/rs.3.rs-737867/v1","DOIUrl":"https://doi.org/10.21203/rs.3.rs-737867/v1","journal":{"name":"Chem-Bio Informatics Journal"},"publicationDate":"2021-07-26","publicationTypes":"Journal Article"}
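The abstract scores inference methods by the area under the recall-precision curve computed from confidence values assigned to candidate regulations. The sketch below shows one common step-wise approximation of that area (average precision). It is an assumption that this matches how the authors integrate the curve; the function name, the confidence values, and the ground-truth labels are all hypothetical, and ties between confidence values are not treated specially.

```python
from typing import List, Sequence


def average_precision(confidences: Sequence[float], is_true: Sequence[bool]) -> float:
    """Step-wise approximation of the area under the recall-precision curve.

    Candidate regulations are ranked by decreasing confidence; precision is
    sampled at each rank where a true regulation is recovered, then averaged.
    """
    ranked = sorted(zip(confidences, is_true), key=lambda pair: pair[0], reverse=True)
    n_true = sum(1 for _, truth in ranked if truth)
    if n_true == 0:
        raise ValueError("at least one true regulation is required")
    hits = 0
    total = 0.0
    for rank, (_, truth) in enumerate(ranked, start=1):
        if truth:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_true


# Hypothetical confidence values for four candidate regulations and whether
# each one is present in the (assumed known) reference network.
confidences: List[float] = [0.9, 0.8, 0.7, 0.6]
is_true: List[bool] = [True, False, True, False]

score = average_precision(confidences, is_true)
print(score)
```

A ranking that places every true regulation ahead of every false one would score 1.0, so improvements in the importance measure show up directly as a higher value.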
Yudai Yamashita, Kotaro Watanabe, S. Murata, I. Kawamata
We introduce an automated procedure for coarse-grained molecular dynamics simulation of DNA nanostructures, which have great potential for realizing molecular robotics. As DNA origami is now a standardized technology for fabricating DNA nanostructures with high precision, various computer-aided design software has been developed. For example, a design tool called caDNAno, with a simple and intuitive interface, is widely used for designing DNA origami structures. Further, a simulation tool called oxDNA is used to predict the behavior of such nanostructures based on coarse-grained molecular dynamics. These tools, however, are not linked directly; thus, repeating the cycle of design and simulation is cumbersome for the user. Moreover, the computer skills required to set up, launch, and run an oxDNA simulation are a potential barrier for non-experts. In our proposal, an oxDNA simulation can be launched on a web server simply by providing a caDNAno file; the web server then analyzes the simulation results and provides a visual response. The validity of the proposal is demonstrated using an example. The advantages of our proposed method compared with conventional methods are also described. This simple-to-use interface for user-friendly simulation of DNA origami reduces the burden on users and accelerates the design process of complicated DNA nanostructures such as wireframe architectures.
{"title":"Web Server with a Simple Interface for Coarse-grained Molecular Dynamics of DNA Nanostructures","authors":"Yudai Yamashita, Kotaro Watanabe, S. Murata, I. Kawamata","doi":"10.1273/CBIJ.21.28","DOIUrl":"https://doi.org/10.1273/CBIJ.21.28","journal":{"name":"Chem-Bio Informatics Journal"},"publicationDate":"2021-04-30","publicationTypes":"Journal Article"}
The viral infection caused by the dengue virus (DENV) is one of the most challenging diseases in the tropical regions of the world. The absence of drugs for dengue to date calls for intense efforts to discover and develop the much-coveted therapeutics for this mosquito-borne disease. One of the most attractive antiviral targets is the DENV RNA-dependent RNA polymerase (RdRp), which catalyzes the de novo initiation as well as elongation of the flavivirus RNA genome. In this work, almost 5000 natural products were docked to DENV RdRp. The top 197 molecules with greater binding energies than the known ligand of the target were further clustered to furnish 35 classes of molecular structures. These compounds, with satisfactory predicted drug properties and known natural origin, can be further explored to pave the way for the first anti-dengue drug.
{"title":"In Silico Discovery of Natural Products Against Dengue RNA-Dependent RNA Polymerase Drug Target","authors":"J. Billones, N. A. B. Clavio","doi":"10.1273/CBIJ.21.11","DOIUrl":"https://doi.org/10.1273/CBIJ.21.11","journal":{"name":"Chem-Bio Informatics Journal"},"publicationDate":"2021-03-15","publicationTypes":"Journal Article"}
Y. Kawashima, Natsumi Mori, N. Kawashita, Yu-Shi Tian, T. Takagi
Fragment molecular orbital (FMO) calculation is a useful ab initio method for analyzing protein–ligand interactions in current structure-based drug design. When multiple ligands exist for one receptor, a post-FMO calculation tool is required because of the large number of interaction energy decomposition terms calculated using this method. In this study, a method that combines self-organizing maps (SOM) and hierarchical clustering analysis (HCA) was proposed to analyze the results of the FMO energy components. This method could effectively compress the high-dimensional energy terms and is expected to be useful for analyzing the interaction between proteins and ligands. A case study of the anti-type 2 diabetes mellitus target DPP-IV and its inhibitors was analyzed to verify the feasibility of the proposed method. After performing dimensional compression using SOM and further grouping using HCA, we obtained superclasses of the inhibitors based on the dispersion energy (DI), which showed consistency with structural information, indicating that further analyses of detailed energies per superclass can be an effective approach for obtaining important ligand–protein interactions.
{"title":"Combining self-organizing maps and hierarchical clustering for protein–ligand interaction analysis in post-fragment molecular orbital calculation","authors":"Y. Kawashima, Natsumi Mori, N. Kawashita, Yu-Shi Tian, T. Takagi","doi":"10.1273/CBIJ.21.1","DOIUrl":"https://doi.org/10.1273/CBIJ.21.1","journal":{"name":"Chem-Bio Informatics Journal"},"publicationDate":"2021-01-29","publicationTypes":"Journal Article"}
Tsuyoshi Esaki, Takaaki Horinouchi, Yayoi Natsume-Kitatani, Yosui Nojima, I. Sakane, H. Matsui
The emergence of antibiotic-resistant bacteria is a serious public health concern. Understanding the relationships between antibiotic compounds and phenotypic changes related to the acquisition of resistance is important for estimating the effective characteristics of drug seeds. To analyze the relationships between phenotypic changes and compound structures, we performed a canonical correlation analysis (CCA) on high-dimensional phenotypic and compound structure datasets. CCA requires the number of samples to be larger than the number of features; however, collecting a large amount of data can sometimes be difficult. Thus, we combined the CCA with consensus clustering to group and reduce features. The CCA was performed using the clustered features, and it revealed relationships between the features of chemical substructures and the expression levels of genes related to several types of antibiotic resistance.
{"title":"Estimation of relationships between chemical substructures and antibiotic resistance-related gene expression in bacteria: Adapting a canonical correlation analysis for small sample data of gathered features using consensus clustering","authors":"Tsuyoshi Esaki, Takaaki Horinouchi, Yayoi Natsume-Kitatani, Yosui Nojima, I. Sakane, H. Matsui","doi":"10.1273/CBIJ.20.58","DOIUrl":"https://doi.org/10.1273/CBIJ.20.58","journal":{"name":"Chem-Bio Informatics Journal"},"publicationDate":"2020-09-30","publicationTypes":"Journal Article"}
Skin sensitization is an important aspect of occupational and consumer safety. Because of the ban on animal testing for skin sensitization in Europe, in silico approaches to predict skin sensitizers are needed. Recently, several machine learning approaches, such as the gradient boosting decision tree (GBDT) and deep neural networks (DNNs), have been applied to chemical reactivity prediction, showing remarkable accuracy. Herein, we performed a study on DNN- and GBDT-based modeling to investigate their potential for use in predicting skin sensitizers. We separately input two types of chemical properties (physical and structural properties) in the form of one-hot labeled vectors into single- and dual-input models. All the trained dual-input models achieved higher accuracy than single-input models, suggesting that a multi-input machine learning model with different types of chemical properties has excellent potential for skin sensitizer classification.
{"title":"Skin sensitizer classification using dual-input machine learning model","authors":"K. Matsumura","doi":"10.1273/cbij.20.54","DOIUrl":"https://doi.org/10.1273/cbij.20.54","journal":{"name":"Chem-Bio Informatics Journal"},"publicationDate":"2020-09-11","publicationTypes":"Journal Article"}
Although the open-field test has been widely used, its reliability and compatibility are frequently questioned. Many indicating parameters were introduced for this test; however, they did not take data distributions into consideration. This oversight may have caused the problems mentioned above. Here, an exploratory approach for the analysis of video records of tests of elderly mice was taken that described the distributions using the least number of parameters. The locomotor activity of the animals was separated into two clusters: dash and search. The accelerations found in each of the clusters were distributed normally. The speed and the duration of the clusters exhibited an exponential distribution. Although the exponential model includes a single parameter, an additional parameter that indicated instability of the behaviour was required in many cases for fitting to the data. As this instability parameter exhibited an inverse correlation with speed, the function of the brain that maintained stability would be required for a better performance. According to the distributions, the travel distance, which has been regarded as an important indicator, was not a robust estimator of the animals’ condition.
{"title":"A distribution-dependent analysis of open-field test movies","authors":"T. Konishi, Haruna Ohrui","doi":"10.1273/cbij.20.44","DOIUrl":"https://doi.org/10.1273/cbij.20.44","journal":{"name":"Chem-Bio Informatics Journal"},"publicationDate":"2020-08-31","publicationTypes":"Journal Article"}
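The abstract reports that the speed and duration of the locomotor clusters followed an exponential distribution, which has a single rate parameter. A minimal sketch of fitting that parameter by maximum likelihood is given below. It covers only the plain one-parameter model; the additional instability parameter mentioned in the abstract is not modelled here, and the function names and the sample durations are hypothetical.

```python
import math
from typing import Sequence


def fit_exponential_rate(samples: Sequence[float]) -> float:
    """Maximum-likelihood rate of an exponential distribution: 1 / sample mean."""
    if not samples or any(s < 0 for s in samples):
        raise ValueError("samples must be non-empty and non-negative")
    mean = sum(samples) / len(samples)
    return 1.0 / mean


def survival(rate: float, t: float) -> float:
    """Probability that a value exceeds t under the fitted exponential model."""
    return math.exp(-rate * t)


# Hypothetical bout durations in seconds; not data from the study.
durations = [0.5, 1.0, 1.5, 2.0]
rate = fit_exponential_rate(durations)

print("rate:", rate)
print("P(duration > 1 s):", survival(rate, 1.0))
```

Comparing the fitted survival curve against the empirical fraction of bouts longer than each threshold is one simple way to see where a single-parameter model stops fitting and an extra parameter becomes necessary.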