Applied AI letters: latest articles

Methods and Standards for Research on Explainable Artificial Intelligence: Lessons from Intelligent Tutoring Systems

Applied AI letters

Pub Date : 2021-06-08 DOI: 10.22541/AU.162317004.45114437/V1

Robert Hoffman, W. Clancey

We reflect on the progress in the area of Explainable AI (XAI) Program relative to previous work in the area of intelligent tutoring systems (ITS). A great deal was learned about explanation, and many challenges uncovered, in research that is directly relevant to XAI. We suggest opportunities for future XAI research deriving from ITS methods, as well as the challenges shared by both ITS and XAI in using AI to assist people in solving difficult problems effectively and efficiently.

Citations: 13

Adapting natural language processing for technical text

Applied AI letters

Pub Date : 2021-06-02 DOI: 10.1002/ail2.33

Alden Dima, Sarah Lukens, Melinda Hodkiewicz, Thurston Sexton, Michael P. Brundage

Despite recent dramatic successes, natural language processing (NLP) is not ready to address a variety of real-world problems. Its reliance on large standard corpora, a training and evaluation paradigm that favors the learning of shallow heuristics, and large computational resource requirements, makes domain-specific application of even the most successful NLP techniques difficult. This paper proposes technical language processing (TLP) which brings engineering principles and practices to NLP specifically for the purpose of extracting actionable information from language generated by experts in their technical tasks, systems, and processes. TLP envisages NLP as a socio-technical system rather than as an algorithmic pipeline. We describe how the TLP approach to meaning and generalization differs from that of NLP, how data quantity and quality can be addressed in engineering technical domains, and the potential risks of not adapting NLP for technical use cases. Engineering problems can benefit immensely from the inclusion of knowledge from unstructured data, currently unavailable due to issues with out of the box NLP packages. We illustrate the TLP approach by focusing on maintenance in industrial organizations as a case-study.

Citations: 17

Issue Information

Applied AI letters

Pub Date : 2021-06-01 DOI: 10.1002/ail2.13

Citations: 0

Deep imputation on large-scale drug discovery data

Applied AI letters

Pub Date : 2021-05-20 DOI: 10.1002/ail2.31

Benedict W. J. Irwin, Thomas M. Whitehead, Scott Rowland, Samar Y. Mahmoud, Gareth J. Conduit, Matthew D. Segall

More accurate predictions of the biological properties of chemical compounds would guide the selection and design of new compounds in drug discovery and help to address the enormous cost and low success-rate of pharmaceutical R&D. However, this domain presents a significant challenge for AI methods due to the sparsity of compound data and the noise inherent in results from biological experiments. In this paper, we demonstrate how data imputation using deep learning provides substantial improvements over quantitative structure-activity relationship (QSAR) machine learning models that are widely applied in drug discovery. We present the largest-to-date successful application of deep-learning imputation to datasets which are comparable in size to the corporate data repository of a pharmaceutical company (678 994 compounds by 1166 endpoints). We demonstrate this improvement for three areas of practical application linked to distinct use cases; (a) target activity data compiled from a range of drug discovery projects, (b) a high value and heterogeneous dataset covering complex absorption, distribution, metabolism, and elimination properties, and (c) high throughput screening data, testing the algorithm's limits on early stage noisy and very sparse data. Achieving median coefficients of determination, R², of 0.69, 0.36, and 0.43, respectively, across these applications, the deep learning imputation method offers an unambiguous improvement over random forest QSAR methods, which achieve median R² values of 0.28, 0.19, and 0.23, respectively. We also demonstrate that robust estimates of the uncertainties in the predicted values correlate strongly with the accuracies in prediction, enabling greater confidence in decision-making based on the imputed values.

Citations: 0

Generating and Evaluating Explanations of Attended and Error-Inducing Input Regions for VQA Models

Applied AI letters

Pub Date : 2021-03-26 DOI: 10.22541/au.162464902.28050142/v1

Arijit Ray, Michael Cogswell, Xiaoyu Lin, Kamran Alipour, Ajay Divakaran, Yi Yao, Giedrius Burachas

Attention maps, a popular heatmap-based explanation method for Visual Question Answering (VQA), are supposed to help users understand the model by highlighting portions of the image/question used by the model to infer answers. However, we see that users are often misled by current attention map visualizations that point to relevant regions despite the model producing an incorrect answer. Hence, we propose Error Maps that clarify the error by highlighting image regions where the model is prone to err. Error maps can indicate when a correctly attended region may be processed incorrectly leading to an incorrect answer, and hence, improve users’ understanding of those cases. To evaluate our new explanations, we further introduce a metric that simulates users’ interpretation of explanations to evaluate their potential helpfulness to understand model correctness. We finally conduct user studies to see that our new explanations help users understand model correctness better than baselines by an expected 30% and that our proxy helpfulness metrics correlate strongly (rho > 0.97) with how well users can predict model correctness.

Citations: 1

Hierarchical spline for time series prediction: An application to naval ship engine failure rate

Applied AI letters

Pub Date : 2021-03-24 DOI: 10.1002/ail2.22

Hyunji Moon, Jinwoo Choi

Predicting equipment failure is important because it could improve availability and cut down the operating budget. Previous literature has attempted to model failure rate with bathtub-formed function, Weibull distribution, Bayesian network, or analytic hierarchy process. But these models perform well with a sufficient amount of data and could not incorporate the two salient characteristics: imbalanced category and sharing structure. Hierarchical model has the advantage of partial pooling. The proposed model is based on Bayesian hierarchical B-spline. Time series of the failure rate of 99 Republic of Korea Naval ships are modeled hierarchically, where each layer corresponds to ship engine, engine type, and engine archetype. As a result of the analysis, the suggested model predicted the failure rate of an entire lifetime accurately in multiple situational conditions, such as prior knowledge of the engine.

Citations: 2

Cognitive analysis in sports: Supporting match analysis and scouting through artificial intelligence

Applied AI letters

Pub Date : 2021-03-14 DOI: 10.1002/ail2.21

Joe Pavitt, Dave Braines, Richard Tomsett

In elite sports, there is an opportunity to take advantage of rich and detailed datasets generated across multiple threads of the sporting business. Challenges currently exist due to time constraints to analyse the data, as well as the quantity and variety of data available to assess. Artificial Intelligence (AI) techniques can be a valuable asset in assisting decision makers in tackling such challenges, but deep AI skills are generally not held by those with rich experience in sporting domains. Here, we describe how certain commonly available AI services can be used to provide analytic assistance to sports experts in exploring, and gaining insights from, typical data sources. In particular, we focus on the use of Natural Language Processing and Conversational Interfaces to provide users with an intuitive and time-saving toolkit to explore their datasets and the conclusions arising from analytics performed on them. We show the benefit of presenting powerful AI and analytic techniques to domain experts, showing the potential for impact not only at the elite level of sports, where AI and analytic capabilities may be more available, but also at a more grass-roots level where there is generally little access to specialist resources. The work described in this paper was trialled with Leatherhead Football Club, a semi-professional team that, at the time, were based in the English 7th tier of football.

Citations: 4

Towards an affordable magnetomyography instrumentation and low model complexity approach for labour imminency prediction using a novel multiresolution analysis

Applied AI letters

Pub Date : 2021-02-09 DOI: 10.22541/AU.161289481.19912239/V1

E. Nsugbe, I. Sanusi

The ability to predict the onset of labour is seen to be an important tool in a clinical setting. Magnetomyography has shown promise in the area of labour imminency prediction, but its clinical application remains limited due to high resource consumption associated with its broad number of channels. In this study, five electrode channels, which account for 3.3% of the total, are used alongside a novel signal decomposition algorithm and low complexity classifiers (logistic regression and linear-SVM) to classify between labour imminency due within 0–48 hrs and >48 hrs. The results suggest that the parsimonious representation comprising of five electrode channels and novel signal decomposition method alongside the candidate classifiers could allow for greater affordability and hence clinical viability of the magnetomyography-based prediction model, which carries a good degree of model interpretability.

Citations: 14

Deep Imputation on Large-Scale Drug Discovery Data

Applied AI letters

Pub Date : 2021-01-20 DOI: 10.22541/AU.161111205.55340339/V2

Benedict W J Irwin, T. Whitehead, Scott Rowland, Samar Y. Mahmoud, G. Conduit, M. Segall

More accurate predictions of the biological properties of chemical compounds would guide the selection and design of new compounds in drug discovery and help to address the enormous cost and low success-rate of pharmaceutical R&D. However this domain presents a significant challenge for AI methods due to the sparsity of compound data and the noise inherent in results from biological experiments. In this paper, we demonstrate how data imputation using deep learning provides substantial improvements over quantitative structure-activity relationship (QSAR) machine learning models that are widely applied in drug discovery. We present the largest-to-date successful application of deep-learning imputation to datasets which are comparable in size to the corporate data repository of a pharmaceutical company (678,994 compounds by 1166 endpoints). We demonstrate this improvement for three areas of practical application linked to distinct use cases; i) target activity data compiled from a range of drug discovery projects, ii) a high value and heterogeneous dataset covering complex absorption, distribution, metabolism and elimination properties and, iii) high throughput screening data, testing the algorithm’s limits on early-stage noisy and very sparse data. Achieving median coefficients of determination, R², of 0.69, 0.36 and 0.43 respectively across these applications, the deep learning imputation method offers an unambiguous improvement over random forest QSAR methods, which achieve median R² values of 0.28, 0.19 and 0.23 respectively. We also demonstrate that robust estimates of the uncertainties in the predicted values correlate strongly with the accuracies in prediction, enabling greater confidence in decision-making based on the imputed values.

Citations: 5

Heritage connector: A machine learning framework for building linked open data from museum collections

Applied AI letters

Pub Date : 2021-01-06 DOI: 10.22541/au.160994838.81187546/v1

Kalyan Dutia, John Stack

As with almost all data, museum collection catalogues are largely unstructured, variable in consistency and overwhelmingly composed of thin records. The form of these catalogues means that the potential for new forms of research, access and scholarly enquiry that range across multiple collections and related datasets remains dormant. In the project Heritage Connector: Transforming text into data to extract meaning and make connections, we are applying a battery of digital techniques to connect similar, identical and related items within and across collections and other publications. In this paper we describe a framework to create a Linked Open Data knowledge graph (KG) from digital museum catalogues, connect entities within this graph to Wikidata, and create new connections in this graph from text. We focus on the use of machine learning to create these links at scale with a small amount of labelled data, on a mid-range laptop or a small cloud virtual machine. We publish open-source software providing tools to perform the tasks of KG creation, entity matching and named entity recognition under these constraints.

Citations: 7
