Machine Learning: Science and Technology: latest articles, page 3

An extended de Bruijn graph for feature engineering over biological sequential data

Machine Learning: Science and Technology

Pub Date : 2024-07-05 DOI: 10.1088/2632-2153/ad5fde

Mert Onur Çakıroğlu, H. Kurban, Parichit Sharma, Oguzhan Kulekci, Elham Khorasani Buxton, Maryam Raeeszadeh-Sarmazdeh, Mehmet Dalkilic

In this study, we introduce a novel de Bruijn graph (dBG) based framework for feature engineering in biological sequential data such as proteins. This framework simplifies feature extraction by dynamically generating high-quality, interpretable features for traditional AI (TAI) algorithms. Our framework accounts for amino acid substitutions by efficiently adjusting the edge weights in the dBG using a secondary trie structure. We extract motifs from the dBG by traversing the heavy edges, and then incorporate alignment algorithms like BLAST and Smith-Waterman to generate features for TAI algorithms. Empirical validation on TIMP (tissue inhibitors of matrix metalloproteinase) data demonstrates significant accuracy improvements over a robust baseline, state-of-the-art (SOTA) PLM models, and those from the popular GLAM2 tool. Furthermore, our framework successfully identified glycine- and arginine-rich (GAR) motifs with high coverage, highlighting its potential in general pattern discovery. The software code is accessible at: https://github.com/parichit/TIMP_Classification
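As a rough illustration of the kind of graph the abstract refers to, the sketch below builds a weighted de Bruijn graph over protein sequences: nodes are (k-1)-mers, and each edge weight counts how often the corresponding k-mer occurs. It then greedily follows the heaviest outgoing edge to read off a candidate motif. The function names, the toy sequences, the `min_weight` threshold, and the greedy traversal are all assumptions made for illustration; the substitution-aware trie and the alignment step described by the authors are not reproduced here.

```python
from collections import defaultdict

def build_dbg(sequences, k):
    """Return {prefix: {suffix: count}} over all k-mers in the sequences."""
    graph = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Edge from the (k-1)-mer prefix to the (k-1)-mer suffix.
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph

def heavy_path(graph, start, min_weight=2, max_len=20):
    """Greedily follow the heaviest unvisited edge with weight >= min_weight."""
    motif, node, seen = start, start, {start}
    while len(motif) < max_len:
        candidates = {n: w for n, w in graph.get(node, {}).items()
                      if n not in seen and w >= min_weight}
        if not candidates:
            break
        node = max(candidates, key=candidates.get)
        seen.add(node)
        motif += node[-1]  # each step extends the motif by one residue
    return motif

# Hypothetical toy sequences that share the fragment "GRGGR".
seqs = ["MAGRGGRK", "TTGRGGRA", "GRGGRLL"]
g = build_dbg(seqs, k=4)
print(g["GRG"]["RGG"])       # 3
print(heavy_path(g, "GRG"))  # GRGGR
```

The threshold stops the walk once only weight-1 edges remain, so the motif ends where the three toy sequences diverge.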

Citations: 0

An Adaptive Particle Swarm Optimization with Information Interaction Mechanism

Machine Learning: Science and Technology

Pub Date : 2024-06-07 DOI: 10.1088/2632-2153/ad55a5

Rui Liu, Lisheng Wei, Pinggai Zhang

The Particle Swarm Optimization (PSO) algorithm is easy to implement owing to its simple framework and has been successfully applied to many optimization problems. However, the standard PSO easily falls into local optima and has weak search ability. To enhance the optimization ability of the algorithm, this paper proposes an adaptive particle swarm optimization with information interaction mechanism (APSOIIM). First, a chaotic sequence strategy is used to produce uniformly distributed particles and enhance their convergence speed at the initialization stage of the algorithm. Then, an information interaction mechanism is introduced to enhance the diversity of the population as the search progresses; it interacts with the best information of neighboring particles to maintain the balance between exploration and exploitation. In addition, convergence is proven to verify the robustness and efficiency of the proposed APSOIIM algorithm. Finally, the proposed APSOIIM is applied to the CEC2014 and CEC2017 benchmark functions as well as well-known engineering optimization problems. The experimental results show that the proposed APSOIIM has significant advantages over the compared algorithms.
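For orientation, here is a minimal sketch of the two ingredients the abstract starts from: a chaotic-sequence initialization and the standard PSO velocity and position update. It is not APSOIIM; the information interaction mechanism and the adaptive parameters are not reproduced. The logistic map is assumed as the chaotic sequence (the abstract does not say which map the authors use), and the values of `w`, `c1`, `c2`, and `x0` are conventional choices picked for illustration.

```python
import random

def logistic_map_points(n, dim, lo, hi, x0=0.37, r=4.0):
    """Generate n points in [lo, hi]^dim from a logistic-map sequence."""
    x, points = x0, []
    for _ in range(n):
        p = []
        for _ in range(dim):
            x = r * x * (1.0 - x)          # logistic map, chaotic for r = 4
            p.append(lo + (hi - lo) * x)   # rescale from [0, 1] to [lo, hi]
        points.append(p)
    return points

def pso(f, dim, lo, hi, n=20, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard global-best PSO minimizing f over the box [lo, hi]^dim."""
    rng = random.Random(seed)
    pos = logistic_map_points(n, dim, lo, hi)
    vel = [[0.0] * dim for _ in range(n)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n):
            for d in range(dim):
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def sphere(x):
    return sum(v * v for v in x)

best, best_val = pso(sphere, dim=2, lo=-5.0, hi=5.0)
print(best, best_val)
```

On the sphere function this baseline converges quickly; the local-optimum problem the abstract mentions shows up on multimodal benchmarks, which is where a diversity mechanism would matter.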

Citations: 0

Explainable machine learning assisted molecular-level insights for enhanced specific stiffness exploiting the large compositional space of AlCoCrFeNi high entropy alloys

Machine Learning: Science and Technology

Pub Date : 2024-06-07 DOI: 10.1088/2632-2153/ad55a4

Kritesh Kumar Gupta, Subrata Barman, S. Dey, T. Mukhopadhyay

Design of high entropy alloys (HEA) presents a significant challenge due to the large compositional space and composition-specific variation in their functional behavior. The traditional alloy design would include trial-and-error prototyping and high-throughput experimentation, which again is challenging due to large-scale fabrication and experimentation. To address these challenges, this article presents a computational strategy for HEA design based on the seamless integration of quasi-random sampling, molecular dynamics (MD) simulations and machine learning (ML). A limited number of algorithmically chosen molecular-level simulations are performed to create a Gaussian process-based computational mapping between the varying concentrations of constituent elements of the HEA and effective properties like Young’s modulus and density. The computationally efficient ML models are subsequently exploited for large-scale predictions and multi-objective functionality attainment with non-aligned goals. The study reveals that there exists a strong negative correlation between Al concentration and the desired effective properties of AlCoCrFeNi HEA, whereas the Ni concentration exhibits a strong positive correlation. The deformation mechanism further shows that excessive increase of Al concentration leads to a higher percentage of FCC to BCC phase transformation which is found to be relatively lower in the HEA with reduced Al concentration. Such physical insights during the deformation process would be crucial in the alloy design process along with the data-driven predictions. As an integral part of this investigation, the developed ML models are interpreted based on Shapley Additive exPlanations, which are essential to explain and understand the model’s mechanism along with meaningful deployment. 
The data-driven strategy presented here will lead to devising an efficient explainable machine learning-based bottom-up approach to alloy design for multi-objective non-aligned functionality attainment.

Citations: 0

On the prediction of the turbulent flow behind cylinder arrays via Echo State Networks

Machine Learning: Science and Technology

Pub Date : 2024-06-04 DOI: 10.1088/2632-2153/ad5414

M. Sharifi Ghazijahani, C. Cierpka

This study aims at the prediction of the turbulent flow behind cylinder arrays by the application of Echo State Networks (ESN). Three different arrangements of arrays of seven cylinders are chosen for the current study. These represent different flow regimes: single bluff body flow, transient flow, and co-shedding flow. This allows the investigation of turbulent flows that fundamentally originate from wake flows yet exhibit highly diverse dynamics. The data is reduced by Proper Orthogonal Decomposition (POD) which is optimal in terms of kinetic energy. The Time Coefficients of the POD Modes (TCPM) are predicted by the ESN. The network architecture is optimized with respect to its three main hyperparameters, Input Scaling (INS), Spectral Radius (SR), and Leaking Rate (LR), in order to produce the best predictions in terms of Weighted Prediction Score (WPS), a metric leveling statistic and deterministic prediction. In general, the ESN is capable of imitating the complex dynamics of turbulent flows even for longer periods of several vortex shedding cycles. Furthermore, the mutual interdependencies of the TCPM are well preserved. However, optimal hyperparameters depend strongly on the flow characteristics. Generally, as flow dynamics become faster and more intermittent, larger LR and INS values result in better predictions, whereas less clear trends for SR are observable.
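To make the three hyperparameters concrete, the sketch below runs the state update of a leaky-integrator echo state network on a scalar input. It is an illustration under stated assumptions, not the authors' setup: a ring (cyclic-shift) reservoir is swapped in for the usual random reservoir because its spectral radius equals the ring weight exactly, the input is a toy sine wave rather than POD time coefficients, and the readout training is omitted. All names and default values are hypothetical.

```python
import math
import random

def esn_states(inputs, n_res=30, input_scaling=0.5, spectral_radius=0.9,
               leaking_rate=0.5, x0=None, seed=0):
    """Run a leaky-integrator ESN with a ring reservoir over a scalar series."""
    rng = random.Random(seed)
    # Input weights drawn uniformly from [-input_scaling, input_scaling].
    w_in = [rng.uniform(-input_scaling, input_scaling) for _ in range(n_res)]
    x = list(x0) if x0 is not None else [0.0] * n_res
    states = []
    for u in inputs:
        # Ring reservoir: unit i receives spectral_radius * x[i-1], so the
        # recurrent matrix is a scaled cyclic shift whose spectral radius
        # is exactly `spectral_radius`.
        pre = [w_in[i] * u + spectral_radius * x[i - 1] for i in range(n_res)]
        # Leaky integration: blend the old state with the new activation.
        x = [(1.0 - leaking_rate) * x[i] + leaking_rate * math.tanh(pre[i])
             for i in range(n_res)]
        states.append(x)
    return states

signal = [math.sin(0.2 * t) for t in range(500)]
a = esn_states(signal, x0=[0.0] * 30)
b = esn_states(signal, x0=[1.0] * 30)
gap = max(abs(p - q) for p, q in zip(a[-1], b[-1]))
print(gap)
```

With a spectral radius below one, the two runs forget their different initial states and end up with nearly identical reservoir states, which is the echo state property a trained readout relies on. A larger leaking rate makes the state follow the input faster, in line with the abstract's observation that faster flow dynamics favor larger LR values.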

Citations: 0

Interpreting Variational Quantum Models with Active Paths in Parameterized Quantum Circuits

Machine Learning: Science and Technology

Pub Date : 2024-06-04 DOI: 10.1088/2632-2153/ad5412

Kyungmin Lee, Hyungjun Jeon, Dongkyu Lee, Bongsang Kim, Jeongho Bang, Taehyun Kim

Variational quantum machine learning (VQML) models based on parameterized quantum circuits (PQC) have been expected to offer a potential quantum advantage for machine learning applications. However, comparison between VQML models and their classical counterparts is hard due to the lack of interpretability of VQML models. In this study, we introduce a graphical approach to analyze the PQC and the corresponding operation of VQML models to deal with this problem. In particular, we utilize the Stokes representation of quantum states to treat VQML models as network models based on the corresponding representations of basic gates. From this approach, we suggest the notion of active paths in the networks and relate the expressivity of VQML models with it. We investigate the growth of active paths in VQML models and observe that the expressivity of VQML models can be significantly limited for certain cases. Then we construct classical models inspired by our graphical interpretation of VQML models and show that they can emulate or outperform the outputs of VQML models for these cases. Our result provides a new way to interpret the operation of VQML models and facilitates the interconnection between quantum and classical machine learning areas.
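As background for the Stokes representation mentioned above, the sketch below computes the Stokes vector (1, ⟨X⟩, ⟨Y⟩, ⟨Z⟩) of a single-qubit pure state and shows that a parameterized RY rotation applied to |0⟩ appears as an ordinary rotation of that real vector. This is textbook single-qubit material only; the function names are hypothetical, and the authors' network construction and their definition of active paths are not reproduced.

```python
import math

def stokes(a, b):
    """Stokes vector (1, <X>, <Y>, <Z>) of the pure state a|0> + b|1>."""
    ab = a.conjugate() * b
    norm = abs(a) ** 2 + abs(b) ** 2       # equals 1 for a normalized state
    return (norm, 2 * ab.real, 2 * ab.imag, abs(a) ** 2 - abs(b) ** 2)

def ry_on_zero(theta):
    """Amplitudes of RY(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>."""
    return complex(math.cos(theta / 2)), complex(math.sin(theta / 2))

theta = math.pi / 3
s = stokes(*ry_on_zero(theta))
print(s)  # approximately (1, sin(theta), 0, cos(theta))
```

In this picture the gate parameter enters only through sin(theta) and cos(theta) of real-valued components, which is what makes a network-style reading of a parameterized circuit possible.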

Citations: 0

Completion of Partial Chemical Equations

Machine Learning: Science and Technology

Pub Date : 2024-06-04 DOI: 10.1088/2632-2153/ad5413

F. Zipoli, Zeineb Ayadi, P. Schwaller, Teodoro Laino, A. Vaucher

Inferring missing molecules in chemical equations is an important task in chemistry and drug discovery. In fact, the completion of chemical equations with necessary reagents is important for improving existing datasets by detecting missing compounds, making them compatible with deep learning models that require complete information about reactants, products, and reagents in a chemical equation for increased performance. Here, we present a deep learning model to predict missing molecules using a multi-task approach, which can ultimately be viewed as a generalization of the forward reaction prediction and retrosynthesis models, since both can be expressed in terms of incomplete chemical equations. We illustrate that a single trained model, based on the transformer architecture and acting on reaction SMILES strings, can address the prediction of products (forward), precursors (retro) or any other molecule in arbitrary positions such as solvents, catalysts or reagents (completion). Our aim is to assess whether a unified model trained simultaneously on different tasks can effectively leverage diverse knowledge from various prediction tasks within the chemical domain, compared to models trained individually on each application. The multi-task models demonstrate top-1 performance of 72.4%, 16.1%, and 30.5% for the forward, retro, and completion tasks, respectively. For the same model we computed a round-trip accuracy of 83.4%. The completion task exhibits improvements due to the multi-task approach.
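The idea of a partial chemical equation can be sketched with plain string handling on a reaction SMILES of the form `reactants>reagents>products`. The helper below masks one molecule and returns the partial equation together with the molecule a model would have to predict. The mask token, the function names, and the example reaction are assumptions made for illustration, not the authors' data format; splitting on `.` is also a simplification, since a dot can join the fragments of a single species such as a salt.

```python
MASK = "?"  # hypothetical placeholder token, not taken from the paper

def split_reaction(rxn):
    """Split 'reactants>reagents>products' into three lists of SMILES."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    return [p.split(".") if p else [] for p in parts]

def mask_molecule(rxn, role, index):
    """Mask one molecule; role is 0 (reactant), 1 (reagent) or 2 (product).

    Returns (partial_equation, target_molecule).
    """
    groups = split_reaction(rxn)
    target = groups[role][index]
    groups[role][index] = MASK
    return ">".join(".".join(g) for g in groups), target

# Acid-catalyzed esterification: acetic acid + ethanol -> ethyl acetate + water.
rxn = "CC(=O)O.OCC>[H+]>CC(=O)OCC.O"
print(mask_molecule(rxn, 1, 0))  # ('CC(=O)O.OCC>?>CC(=O)OCC.O', '[H+]')
print(mask_molecule(rxn, 2, 0))  # ('CC(=O)O.OCC>[H+]>?.O', 'CC(=O)OCC')
```

Masking a product gives a forward-prediction example, masking a reactant gives a retrosynthesis-style example, and masking a reagent gives a completion example, so one masking routine can in principle feed all three tasks.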

Citations: 0

Machine-Learning Strategies for the Accurate and Efficient Analysis of X-ray Spectroscopy

Machine Learning: Science and Technology

Pub Date : 2024-05-24 DOI: 10.1088/2632-2153/ad5074

Thomas Penfold, Luke Watson, Clelia Middleton, Tudur David, Sneha Verma, Thomas Pope, Julia Kaczmarek, Conor Douglas Rankine

Computational spectroscopy has emerged as a critical tool for researchers looking to achieve both qualitative and quantitative interpretations of experimental spectra. Over the past decade, increased interactions between experiment and theory have created a positive feedback loop that has stimulated developments in both domains. In particular, the increased accuracy of calculations has led to them becoming an indispensable tool for the analysis of spectroscopies across the electromagnetic spectrum. This progress is especially well demonstrated for short-wavelength techniques, e.g. core-hole (X-ray) spectroscopies, whose prevalence has increased following the advent of modern X-ray facilities including third-generation synchrotrons and X-ray free-electron lasers (XFELs). While calculations based on well-established wavefunction or density-functional methods continue to dominate the greater part of spectral analyses in the literature, emerging developments in machine-learning algorithms are beginning to open up new opportunities to complement these traditional techniques with fast, accurate, and affordable 'black-box' approaches. This Topical Review recounts recent progress in data-driven/machine-learning approaches for computational X-ray spectroscopy. We discuss the achievements and limitations of the presently-available approaches and review the potential that these techniques have to expand the scope and reach of computational and experimental X-ray spectroscopic studies.

Citations: 0

Global System Errors to Simultaneously Improve the Identification of Subsystems with Mixed Data Gaussian Process Regression

Machine Learning: Science and Technology

Pub Date : 2024-05-20 DOI: 10.1088/2632-2153/ad4e05

Cameron J LaMack, Eric M. Schearer

This paper explores the use of Gaussian Process Regression (GPR) for system identification in control engineering. It introduces two novel approaches that utilize the data from a measured global system error. The paper demonstrates these approaches by identifying a simulated system with three subsystems, a one degree of freedom mass with two antagonist muscles. The first approach uses this whole-system error data alone, achieving accuracy on the same order of magnitude as subsystem-specific data (9.28 ± 0.87 N vs. 6.96 ± 0.32 N of total model errors). This is significant, as it shows that the same data set can be used to identify unique subsystems, as opposed to requiring a set of data descriptive of only a single subsystem. The second approach demonstrated in this paper mixes traditional subsystem-specific data with the whole system error data, achieving up to 98.71% model improvement.

Citations: 0

Journey over Destination: Dynamic Sensor Placement Enhances Generalization

Machine Learning: Science and Technology

Pub Date : 2024-05-20 DOI: 10.1088/2632-2153/ad4e06

Agnese Marcato, E. Guiltinan, Hari S. Viswanathan, Dan O’Malley, Nicholas Lubbers, Javier E. Santos

Reconstructing complex, high-dimensional global fields from limited data points is a challenge across various scientific and industrial domains. This is particularly important for recovering spatio-temporal fields using sensor data from, for example, laboratory-based scientific experiments, weather forecasting, or drone surveys. Given the prohibitive costs of specialized sensors and the inaccessibility of certain regions of the domain, achieving full field coverage is typically not feasible. Therefore, the development of machine learning algorithms trained to reconstruct fields given a limited dataset is of critical importance. In this study, we introduce a general approach that employs moving sensors to enhance data exploitation during the training of an attention-based neural network, thereby improving field reconstruction. The training of sensor locations is accomplished using an end-to-end workflow, ensuring differentiability in the interpolation of field values associated with the sensors, and is simple to implement using differentiable programming. Additionally, we have incorporated a correction mechanism to prevent sensors from entering invalid regions within the domain. We evaluated our method using two distinct datasets; the results show that our approach enhances learning, as evidenced by improved test scores.
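
To make the idea of differentiable interpolation at a movable sensor concrete, here is a minimal sketch using bilinear interpolation on a regular grid with hand-written derivatives. A differentiable-programming framework would normally compute these gradients automatically. The function names, the grid layout, and the bounding-box clamp standing in for the correction mechanism are all assumptions, not the authors' implementation.

```python
def bilinear(field, x, y):
    """Interpolate a grid (field[j][i], row j, column i) at continuous (x, y).

    Returns the value together with its partial derivatives with respect to
    the sensor coordinates, so the location itself can be trained.
    """
    h, w = len(field), len(field[0])
    i0 = min(max(int(x), 0), w - 2)   # left column of the enclosing cell
    j0 = min(max(int(y), 0), h - 2)   # top row of the enclosing cell
    tx, ty = x - i0, y - j0           # fractional position inside the cell
    f00, f10 = field[j0][i0], field[j0][i0 + 1]
    f01, f11 = field[j0 + 1][i0], field[j0 + 1][i0 + 1]
    value = (f00 * (1 - tx) * (1 - ty) + f10 * tx * (1 - ty)
             + f01 * (1 - tx) * ty + f11 * tx * ty)
    dvdx = (f10 - f00) * (1 - ty) + (f11 - f01) * ty
    dvdy = (f01 - f00) * (1 - tx) + (f11 - f10) * tx
    return value, dvdx, dvdy

def clamp_to_domain(x, y, w, h):
    """Project a sensor back into the grid's bounding box (hypothetical correction step)."""
    return min(max(x, 0.0), w - 1.0), min(max(y, 0.0), h - 1.0)

# A linear test field v = 2*i + 3*j, for which bilinear interpolation is exact.
field = [[2.0 * i + 3.0 * j for i in range(4)] for j in range(3)]
value, dvdx, dvdy = bilinear(field, 1.25, 0.5)    # value 4.0, gradient (2.0, 3.0)
corrected = clamp_to_domain(-0.5, 5.0, w=4, h=3)  # (0.0, 2.0)
```

The gradient `(dvdx, dvdy)` is what lets a training loop update sensor coordinates alongside network weights. The clamp is the simplest possible correction, and masks for irregular invalid regions would need a different projection.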

Citations: 0

Enhancing Particle String Detection in Electrorheological Plasmas Using Asymmetrical Kernel Convolutional Networks

Machine Learning: Science and Technology

Pub Date : 2024-05-17 DOI: 10.1088/2632-2153/ad4d3e

Max Klein, Niklas Dormagen, Christopher Dietz, Markus Thoma, Mike Schwarz

Under different plasma conditions and electric fields in a complex plasma, the plasma particles organize themselves in a string-like or chain-like manner. A phase transition from a string-like to an isotropic particle distribution is observed at different electrical conditions. The streaming of charged ions around plasma particles with the surrounding electric field gives the plasma its electrorheological properties. The visibility of individual particles in a complex plasma opens up the opportunity to examine properties and phase transitions of such electrorheological fluids in detail. Because of the limited one-dimensional symmetry, determining the configuration of a particle and recognizing strings in particle distributions is not always straightforward. Several approaches have already been used to analyse particle clouds while either considering each particle locally or considering the particle cloud as a whole without providing information about single particle configurations. This paper presents a new machine learning approach that takes advantage of particle distributions over the entire particle cloud and detects all string-like particles at once, using a convolutional neural network in the form of an encoder-decoder network with asymmetric kernel convolutions. This not only enhances the result quality but also accelerates the evaluation process, possibly enabling real-time analyses of electrorheological phase transitions, while achieving an accuracy of over 95% on manually labelled data.
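
The following sketch illustrates what an asymmetric (non-square) convolution kernel does on a toy particle image. It shows only the single convolution operation, not the encoder-decoder network described above. It assumes strings are aligned with the horizontal image axis, and the 1×3 kernel and the toy image are hypothetical choices.

```python
def conv2d_valid(image, kernel):
    """Valid-mode 2D cross-correlation of a list-of-rows image with a kh x kw kernel."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(kh) for b in range(kw))
             for c in range(w - kw + 1)]
            for r in range(h - kh + 1)]

# Toy image: a horizontal chain of three particles (row 1) and one isolated particle (row 3).
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]

# Asymmetric 1x3 kernel: wide along the assumed string direction, narrow across it.
kernel = [[1, 1, 1]]

response = conv2d_valid(image, kernel)
# response[1] == [2, 3, 2]  (peak where the kernel covers the whole chain)
# response[3] == [1, 0, 0]  (the isolated particle gives a weak response)
```

An elongated kernel aggregates evidence along one axis only, so a chain produces a markedly higher response than an isolated particle with the same local appearance.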

Citations: 0