
Latest publications from Nature Machine Intelligence

Pre-training with fractional denoising to enhance molecular property prediction
IF 23.8, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-09-18. DOI: 10.1038/s42256-024-00900-z
Yuyan Ni, Shikun Feng, Xin Hong, Yuancheng Sun, Wei-Ying Ma, Zhi-Ming Ma, Qiwei Ye, Yanyan Lan

Deep learning methods have been considered promising for accelerating molecular screening in drug discovery and material design. Due to the limited availability of labelled data, various self-supervised molecular pre-training methods have been presented. Although many existing methods utilize common pre-training tasks in computer vision and natural language processing, they often overlook the fundamental physical principles governing molecules. In contrast, applying denoising in pre-training can be interpreted as an equivalent force learning, but the limited noise distribution introduces bias into the molecular distribution. To address this issue, we introduce a molecular pre-training framework called fractional denoising, which decouples noise design from the constraints imposed by force learning equivalence. In this way, the noise becomes customizable, allowing for incorporating chemical priors to substantially improve the molecular distribution modelling. Experiments demonstrate that our framework consistently outperforms existing methods, establishing state-of-the-art results across force prediction, quantum chemical properties and binding affinity tasks. The refined noise design enhances force accuracy and sampling coverage, which contribute to the creation of physically consistent molecular representations, ultimately leading to superior predictive performance.
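To make the "denoising as force learning" equivalence mentioned above concrete, the sketch below shows the standard Gaussian coordinate-denoising baseline that the paper builds on and generalizes; it is not the authors' fractional-denoising framework. The model predicts the noise added to 3D atomic coordinates, which, for a Gaussian perturbation, is proportional to the score of the noised distribution and hence to an approximate force. The `CoordDenoiser` architecture, noise scale and random "conformers" are placeholder assumptions for illustration only.

```python
import torch
from torch import nn

class CoordDenoiser(nn.Module):
    """Toy per-atom denoiser: predicts the noise added to 3D coordinates.

    A real molecular model would use an SE(3)-aware GNN; a plain MLP over
    flattened coordinates is used here only to keep the sketch self-contained.
    """

    def __init__(self, n_atoms: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_atoms * 3, hidden), nn.SiLU(),
            nn.Linear(hidden, n_atoms * 3),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        b, n, _ = coords.shape
        return self.net(coords.reshape(b, -1)).reshape(b, n, 3)

def denoising_loss(model: nn.Module, coords: torch.Tensor, sigma: float = 0.2):
    """Gaussian coordinate denoising: regress the added noise.

    Because the noised distribution is Gaussian around the input structure,
    the regression target is proportional to the score (negative gradient of
    the log-density), i.e. an approximate force on each atom.
    """
    noise = torch.randn_like(coords) * sigma
    pred = model(coords + noise)
    return ((pred - noise) ** 2).mean()

# Hypothetical usage on a batch of 16-atom "conformers".
model = CoordDenoiser(n_atoms=16)
coords = torch.randn(8, 16, 3)
loss = denoising_loss(model, coords)
loss.backward()
```

Fractional denoising, as described in the abstract, decouples the choice of noise distribution from this force-learning equivalence; the baseline above is only the reference point for that comparison.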

Citations: 0
Sparse learned kernels for interpretable and efficient medical time series processing
IF 23.8, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-09-18. DOI: 10.1038/s42256-024-00898-4
Sully F. Chen, Zhicheng Guo, Cheng Ding, Xiao Hu, Cynthia Rudin

Rapid, reliable and accurate interpretation of medical time series signals is crucial for high-stakes clinical decision-making. Deep learning methods offered unprecedented performance in medical signal processing but at a cost: they were compute intensive and lacked interpretability. We propose sparse mixture of learned kernels (SMoLK), an interpretable architecture for medical time series processing. SMoLK learns a set of lightweight flexible kernels that form a single-layer sparse neural network, providing not only interpretability but also efficiency, robustness and generalization to unseen data distributions. We introduce parameter reduction techniques to reduce the size of SMoLK networks and maintain performance. We test SMoLK on two important tasks common to many consumer wearables: photoplethysmography artefact detection and atrial fibrillation detection from single-lead electrocardiograms. We find that SMoLK matches the performance of models orders of magnitude larger. It is particularly suited for real-time applications using low-power devices, and its interpretability benefits high-stakes situations.
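As a rough illustration of the "single-layer sparse network of learned kernels" idea described above (not the authors' SMoLK implementation), the sketch below convolves a 1D signal with a small bank of short learnable kernels, pools each kernel's rectified response over time and feeds the pooled features to a linear read-out. The kernel count, kernel length and toy photoplethysmography task are assumed for illustration.

```python
import torch
from torch import nn

class LearnedKernelBank(nn.Module):
    """Toy single-layer model: a few short 1D kernels plus a linear read-out.

    Interpretability comes from the fact that each learned kernel can be
    plotted directly against the waveform patterns it responds to.
    """

    def __init__(self, n_kernels: int = 8, kernel_len: int = 32, n_classes: int = 2):
        super().__init__()
        self.conv = nn.Conv1d(1, n_kernels, kernel_size=kernel_len, padding="same")
        self.head = nn.Linear(n_kernels, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time). Rectify and average each kernel's response over
        # time, so every feature reads as "how strongly kernel k fired".
        feats = torch.relu(self.conv(x)).mean(dim=-1)
        return self.head(feats)

model = LearnedKernelBank()
ppg = torch.randn(4, 1, 1000)   # hypothetical batch of 4 PPG segments
logits = model(ppg)             # (4, 2): e.g. clean vs. artefact
```

The entire model is one convolutional layer and one linear layer, which is what keeps the parameter count and inference cost small enough for low-power wearable devices.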

Citations: 0
Realizing full-body control of humanoid robots
IF 23.8, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-09-11. DOI: 10.1038/s42256-024-00891-x
Guangliang Li, Randy Gomez
Using deep reinforcement learning, flexible skills and behaviours emerge in humanoid robots, as demonstrated in two recent reports.
Citations: 0
Accelerating histopathology workflows with generative AI-based virtually multiplexed tumour profiling
IF 23.8, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-09-09. DOI: 10.1038/s42256-024-00889-5
Pushpak Pati, Sofia Karkampouna, Francesco Bonollo, Eva Compérat, Martina Radić, Martin Spahn, Adriano Martinelli, Martin Wartenberg, Marianna Kruithof-de Julio, Marianna Rapsomaniki

Understanding the spatial heterogeneity of tumours and its links to disease initiation and progression is a cornerstone of cancer biology. Presently, histopathology workflows heavily rely on hematoxylin and eosin and serial immunohistochemistry staining, a cumbersome, tissue-exhaustive process that results in non-aligned tissue images. We propose the VirtualMultiplexer, a generative artificial intelligence toolkit that effectively synthesizes multiplexed immunohistochemistry images for several antibody markers (namely AR, NKX3.1, CD44, CD146, p53 and ERG) from only an input hematoxylin and eosin image. The VirtualMultiplexer captures biologically relevant staining patterns across tissue scales without requiring consecutive tissue sections, image registration or extensive expert annotations. Thorough qualitative and quantitative assessment indicates that the VirtualMultiplexer achieves rapid, robust and precise generation of virtually multiplexed imaging datasets of high staining quality that are indistinguishable from the real ones. The VirtualMultiplexer is successfully transferred across tissue scales and patient cohorts with no need for model fine-tuning. Crucially, the virtually multiplexed images enabled training a graph transformer that simultaneously learns from the joint spatial distribution of several proteins to predict clinically relevant endpoints. We observe that this multiplexed learning scheme was able to greatly improve clinical prediction, as corroborated across several downstream tasks, independent patient cohorts and cancer types. Our results showcase the clinical relevance of artificial intelligence-assisted multiplexed tumour imaging, accelerating histopathology workflows and cancer biology.
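The core input/output relationship described above is image-to-image: one H&E image in, one synthetic stain image out per antibody marker. The toy sketch below shows only those shapes with a minimal convolutional encoder-decoder; the published VirtualMultiplexer is a GAN-style toolkit with additional losses and multi-scale consistency, none of which is reproduced here. The marker list is taken from the abstract; the architecture and image sizes are placeholder assumptions.

```python
import torch
from torch import nn

MARKERS = ["AR", "NKX3.1", "CD44", "CD146", "p53", "ERG"]  # markers named in the abstract

class ToyStainTranslator(nn.Module):
    """Minimal H&E -> single-marker image translator (illustration only)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, he_image: torch.Tensor) -> torch.Tensor:
        return self.net(he_image)

# One toy translator per marker, all conditioned on the same H&E input,
# yielding a virtually "multiplexed" panel aligned to a single tissue image.
he = torch.rand(1, 3, 256, 256)
virtual_panel = {m: ToyStainTranslator()(he) for m in MARKERS}
```

Because every synthetic marker image is generated from the same H&E input, the panel is pixel-aligned by construction, which is what removes the need for serial sections and image registration.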

Citations: 0
Efficient and scalable reinforcement learning for large-scale network control
IF 23.8, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-09-03. DOI: 10.1038/s42256-024-00879-7
Chengdong Ma, Aming Li, Yali Du, Hao Dong, Yaodong Yang

The primary challenge in the development of large-scale artificial intelligence (AI) systems lies in achieving scalable decision-making—extending the AI models while maintaining sufficient performance. Existing research indicates that distributed AI can improve scalability by decomposing complex tasks and distributing them across collaborative nodes. However, previous technologies suffered from compromised real-world applicability and scalability due to the massive requirement of communication and sampled data. Here we develop a model-based decentralized policy optimization framework, which can be efficiently deployed in multi-agent systems. By leveraging local observation through the agent-level topological decoupling of global dynamics, we prove that this decentralized mechanism achieves accurate estimations of global information. Importantly, we further introduce model learning to reinforce the optimal policy for monotonic improvement with a limited amount of sampled data. Empirical results on diverse scenarios show the superior scalability of our approach, particularly in real-world systems with hundreds of agents, thereby paving the way for scaling up AI systems.
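The decentralization idea above — each agent decides from local observations defined by the network topology rather than from global state — can be illustrated with the minimal sketch below. It shows only the decentralized-policy structure on a hypothetical ring network; the model-learning and policy-optimization components of the authors' framework are omitted, and all dimensions are assumptions.

```python
import torch
from torch import nn

class LocalPolicy(nn.Module):
    """One agent's policy, conditioned only on its local neighbourhood."""

    def __init__(self, obs_dim: int, n_neighbours: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim * (n_neighbours + 1), 64), nn.Tanh(),
            nn.Linear(64, n_actions),
        )

    def forward(self, local_obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(local_obs))

# Hypothetical ring network of 5 agents, each observing itself and 2 neighbours.
obs_dim, n_agents, n_actions = 4, 5, 3
policies = [LocalPolicy(obs_dim, n_neighbours=2, n_actions=n_actions) for _ in range(n_agents)]
obs = torch.randn(n_agents, obs_dim)

actions = []
for i, pi in enumerate(policies):
    # Local view: the agent's own observation plus its two ring neighbours.
    neighbourhood = obs[[(i - 1) % n_agents, i, (i + 1) % n_agents]].flatten()
    actions.append(pi(neighbourhood.unsqueeze(0)).sample())  # decentralized decision
```

Communication cost stays constant per agent as the network grows, which is the property the abstract credits for scaling to systems with hundreds of agents.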

Citations: 0
A large-scale audit of dataset licensing and attribution in AI
IF 18.8, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-08-30. DOI: 10.1038/s42256-024-00878-8
Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi (Alexis) Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker
The race to train language models on vast, diverse and inconsistently documented datasets raises pressing legal and ethical concerns. To improve data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace more than 1,800 text datasets. We develop tools and standards to trace the lineage of these datasets, including their source, creators, licences and subsequent use. Our landscape analysis highlights sharp divides in the composition and focus of data licenced for commercial use. Important categories including low-resource languages, creative tasks and new synthetic data all tend to be restrictively licenced. We observe frequent miscategorization of licences on popular dataset hosting sites, with licence omission rates of more than 70% and error rates of more than 50%. This highlights a crisis in misattribution and informed use of popular datasets driving many recent breakthroughs. Our analysis of data sources also explains the application of copyright law and fair use to finetuning data. As a contribution to continuing improvements in dataset transparency and responsible use, we release our audit, with an interactive user interface, the Data Provenance Explorer, to enable practitioners to trace and filter on data provenance for the most popular finetuning data collections: www.dataprovenance.org . The Data Provenance Initiative audits over 1,800 text artificial intelligence (AI) datasets, analysing trends, permissions of use and global representation. It exposes frequent errors on several major data hosting sites and offers tools for transparent and informed use of AI training data.
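To show the kind of provenance filtering this audit enables, here is a small generic sketch of screening dataset metadata records by documented licence; the record schema, field names and example entries are hypothetical and are not the Data Provenance Explorer's actual data model or API.

```python
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    """Hypothetical provenance record; not the Data Provenance Explorer schema."""
    name: str
    source_url: str
    license_id: str      # SPDX-style identifier, or "" if undocumented
    commercial_use: bool

CATALOGUE = [
    DatasetRecord("toy-qa", "https://example.org/toy-qa", "CC-BY-4.0", True),
    DatasetRecord("toy-dialogue", "https://example.org/toy-dialogue", "CC-BY-NC-4.0", False),
    DatasetRecord("toy-code", "https://example.org/toy-code", "", False),  # licence omitted
]

def usable_commercially(records):
    """Keep only records whose documented licence permits commercial use."""
    return [r for r in records if r.license_id and r.commercial_use]

def omission_rate(records):
    """Fraction of records with no documented licence (cf. the >70% figure above)."""
    return sum(1 for r in records if not r.license_id) / len(records)

print([r.name for r in usable_commercially(CATALOGUE)])            # ['toy-qa']
print(f"licence omission rate: {omission_rate(CATALOGUE):.0%}")    # 33%
```

The audit's contribution is doing this at the scale of more than 1,800 real datasets, with licence terms traced back to their original sources rather than taken from hosting-site metadata.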
Citations: 0
What is in your LLM-based framework?
IF 18.8, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-08-30. DOI: 10.1038/s42256-024-00896-6
To maintain high standards in clarity and reproducibility, authors need to clearly mention and describe the use of GPT-4 and other large language models in their work.
Citations: 0
A step forward in tracing and documenting dataset provenance
IF 18.8, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-08-30. DOI: 10.1038/s42256-024-00884-w
Nicholas Vincent
Training data are crucial for advancements in artificial intelligence, but many questions remain regarding the provenance of training datasets, license enforcement and creator consent. Mahari et al. provide a set of tools for tracing, documenting and sharing AI training data and highlight the importance for developers to engage with metadata of datasets.
Citations: 0
Learning integral operators via neural integral equations
IF 23.8, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-08-29. DOI: 10.1038/s42256-024-00886-8
Emanuele Zappala, Antonio Henrique de Oliveira Fonseca, Josue Ortega Caro, Andrew Henry Moberly, Michael James Higley, Jessica Cardin, David van Dijk

Nonlinear operators with long-distance spatiotemporal dependencies are fundamental in modelling complex systems across sciences; yet, learning these non-local operators remains challenging in machine learning. Integral equations, which model such non-local systems, have wide-ranging applications in physics, chemistry, biology and engineering. We introduce the neural integral equation, a method for learning unknown integral operators from data using an integral equation solver. To improve scalability and model capacity, we also present the attentional neural integral equation, which replaces the integral with self-attention. Both models are grounded in the theory of second-kind integral equations, where the indeterminate appears both inside and outside the integral operator. We provide a theoretical analysis showing how self-attention can approximate integral operators under mild regularity assumptions, further deepening previously reported connections between transformers and integration, as well as deriving corresponding approximation results for integral operators. Through numerical benchmarks on synthetic and real-world data, including Lotka–Volterra, Navier–Stokes and Burgers’ equations, as well as brain dynamics and integral equations, we showcase the models’ capabilities and their ability to derive interpretable dynamics embeddings. Our experiments demonstrate that attentional neural integral equations outperform existing methods, especially for longer time intervals and higher-dimensional problems. Our work addresses a critical gap in machine learning for non-local operators and offers a powerful tool for studying unknown complex systems with long-range dependencies.
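A second-kind integral equation, as referenced above, has the form y(t) = f(t) + ∫ K(t, s) y(s) ds, with the unknown y appearing both inside and outside the integral operator. The sketch below illustrates only the equation class: a small learnable kernel K evaluated on a uniform quadrature grid over [0, 1], with the equation solved by Picard (fixed-point) iteration. It is not the authors' neural or attentional integral-equation solver; the domain, quadrature rule, MLP kernel and free term f are assumptions for illustration.

```python
import torch
from torch import nn

class NeuralKernel(nn.Module):
    """Learnable kernel K(t, s) for a second-kind integral equation."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, t: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        grid = torch.stack(torch.meshgrid(t, s, indexing="ij"), dim=-1)  # (T, S, 2)
        return self.net(grid).squeeze(-1)                                # (T, S)

def solve_ie(kernel: NeuralKernel, f: torch.Tensor, t: torch.Tensor, n_iter: int = 20):
    """Picard iteration for y(t) = f(t) + integral over [0, 1] of K(t, s) y(s) ds.

    The integral is replaced by a uniform quadrature sum; the iteration
    converges when the scaled kernel has operator norm below one, which an
    untrained kernel of this size satisfies in practice for this toy setup.
    """
    w = 1.0 / len(t)        # uniform quadrature weights on [0, 1]
    K = kernel(t, t)        # (T, T) kernel matrix on the grid
    y = f.clone()
    for _ in range(n_iter): # y_{k+1} = f + w * K @ y_k
        y = f + w * K @ y
    return y

t = torch.linspace(0.0, 1.0, 64)
f = torch.sin(2 * torch.pi * t)        # known free term
y = solve_ie(NeuralKernel(), f, t)     # solution under the (untrained) kernel
```

Training would wrap `solve_ie` in a loss against observed trajectories so that the kernel, and hence the non-local operator, is learned from data; the attentional variant in the paper replaces the explicit integral with self-attention for scalability.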

Citations: 0
Learning motif-based graphs for drug–drug interaction prediction via local–global self-attention
IF 23.8, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-08-27. DOI: 10.1038/s42256-024-00888-6
Yi Zhong, Gaozheng Li, Ji Yang, Houbing Zheng, Yongqiang Yu, Jiheng Zhang, Heng Luo, Biao Wang, Zuquan Weng

Unexpected drug–drug interactions (DDIs) are important issues for both pharmaceutical research and clinical applications due to the high risk of causing severe adverse drug reactions or drug withdrawals. Many deep learning models have achieved high performance in DDI prediction, but model interpretability to reveal the underlying causes of DDIs has not been extensively explored. Here we propose MeTDDI—a deep learning framework with local–global self-attention and co-attention to learn motif-based graphs for DDI prediction. MeTDDI achieved competitive performance compared with state-of-the-art models. Regarding interpretability, we conducted extensive assessments on 73 drugs with 13,786 DDIs and MeTDDI can precisely explain the structural mechanisms for 5,602 DDIs involving 58 drugs. Besides, MeTDDI shows potential to explain complex DDI mechanisms and mitigate DDI risks. To summarize, MeTDDI provides a new perspective on exploring DDI mechanisms, which will benefit both drug discovery and polypharmacy for safer therapies for patients.
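As a generic illustration of the local–global attention pattern named in the title — self-attention within each drug's set of motifs, plus co-attention across the drug pair — the sketch below uses `torch.nn.MultiheadAttention` for both roles and pools the result into an interaction score. It is not the authors' MeTDDI implementation; the motif embeddings are random placeholders standing in for real substructure features, and all dimensions are assumptions.

```python
import torch
from torch import nn

class ToyPairAttention(nn.Module):
    """Self-attention over each drug's motifs plus cross-drug co-attention."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.co_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, motifs_a: torch.Tensor, motifs_b: torch.Tensor):
        # Local view: each drug attends over its own motif set (shared weights).
        a, _ = self.self_attn(motifs_a, motifs_a, motifs_a)
        b, _ = self.self_attn(motifs_b, motifs_b, motifs_b)
        # Pair view: drug A's motifs attend to drug B's motifs.
        ab, attn_weights = self.co_attn(a, b, b)
        score = self.head(ab.mean(dim=1))           # pooled interaction logit
        return torch.sigmoid(score), attn_weights   # weights point to responsible motif pairs

model = ToyPairAttention()
drug_a = torch.randn(1, 10, 64)   # 10 placeholder motif embeddings for drug A
drug_b = torch.randn(1, 7, 64)    # 7 placeholder motif embeddings for drug B
prob, weights = model(drug_a, drug_b)
```

Inspecting the co-attention weights is the kind of mechanism that makes attributing a predicted interaction to specific substructures possible, which is the interpretability angle the abstract emphasizes.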

Citations: 0