Benedict W J Irwin, T. Whitehead, Scott Rowland, Samar Y. Mahmoud, G. Conduit, M. Segall
Applied AI Letters, 2021-01-20. DOI: https://doi.org/10.22541/AU.161111205.55340339/V2
Deep Imputation on Large-Scale Drug Discovery Data
More accurate predictions of the biological properties of chemical
compounds would guide the selection and design of new compounds in drug
discovery and help to address the enormous cost and low success-rate of
pharmaceutical R&D. However, this domain presents a significant
challenge for AI methods due to the sparsity of compound data and the
noise inherent in results from biological experiments. In this paper, we
demonstrate how data imputation using deep learning provides substantial
improvements over quantitative structure-activity relationship (QSAR)
machine learning models that are widely applied in drug discovery. We
present the largest successful application to date of deep-learning
imputation, on datasets comparable in size to the corporate
data repository of a pharmaceutical company (678,994 compounds by 1,166
endpoints). We demonstrate this improvement for three areas of practical
application linked to distinct use cases: (i) target activity data
compiled from a range of drug discovery projects; (ii) a high-value and
heterogeneous dataset covering complex absorption, distribution,
metabolism and elimination properties; and (iii) high-throughput
screening data, testing the algorithm’s limits on early-stage noisy and
very sparse data. Achieving median coefficients of determination,
R², of 0.69, 0.36 and 0.43, respectively, across these
applications, the deep-learning imputation method offers an unambiguous
improvement over random forest QSAR methods, which achieve median
R² values of 0.28, 0.19 and 0.23, respectively. We also
demonstrate that robust estimates of the uncertainties in the predicted
values correlate strongly with the accuracies in prediction, enabling
greater confidence in decision-making based on the imputed values.
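
The headline figures above are median coefficients of determination taken over many endpoints in a sparse compound-by-endpoint matrix. As a rough illustration of how such a summary could be computed, the following minimal sketch scores each endpoint only on the compounds where a measurement exists and then takes the median across endpoints. The abstract does not describe the evaluation protocol, so the function names (`r2_score`, `median_r2`), the use of `None` for missing measurements, the rule for skipping endpoints, and the toy data are all assumptions made for illustration, not the authors' procedure.

```python
from statistics import median


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot


def median_r2(y_true, y_pred):
    """Median per-endpoint R^2 for a sparse compounds-by-endpoints matrix.

    Rows are compounds, columns are endpoints. A missing measurement in
    y_true is encoded as None (an assumption of this sketch) and is skipped.
    """
    scores = []
    for j in range(len(y_true[0])):
        pairs = [
            (row_t[j], row_p[j])
            for row_t, row_p in zip(y_true, y_pred)
            if row_t[j] is not None
        ]
        # R^2 is undefined with fewer than two distinct observed values.
        if len({obs for obs, _ in pairs}) < 2:
            continue
        observed, predicted = zip(*pairs)
        scores.append(r2_score(observed, predicted))
    return median(scores)


# Toy data (hypothetical): 4 compounds by 3 endpoints, with gaps.
y_true = [
    [1.0, None, 5.0],
    [2.0, 4.0, None],
    [3.0, 6.0, 7.0],
    [None, 8.0, 9.0],
]
y_pred = [
    [1.0, 0.0, 5.0],
    [2.0, 4.0, 0.0],
    [3.0, 6.0, 8.0],
    [0.0, 7.0, 9.0],
]

print(median_r2(y_true, y_pred))  # 0.875
```

Reporting the median rather than the mean keeps a handful of very poorly predicted endpoints, whose R² can be strongly negative, from dominating the summary.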