Privacy-preserving patient clustering for personalized federated learning.
Ahmed Elhussein, Gamze Gürsoy
Federated Learning (FL) is a machine learning framework that enables multiple organizations to train a model without sharing their data with a central server. However, it suffers significant performance degradation when the data are not independent and identically distributed (non-IID). This is a problem in medical settings, where variation in patient populations contributes substantially to distribution differences across hospitals. Personalized FL addresses this issue by accounting for site-specific distribution differences. Clustered FL, a Personalized FL variant, tackles the problem by clustering patients into groups across hospitals and training a separate model for each group. However, privacy remains a challenge, because the clustering process requires the exchange of patient-level information. Previous work sidestepped this by forming clusters from aggregated data, which leads to inaccurate groups and performance degradation. In this study, we propose Privacy-preserving Community-Based Federated machine Learning (PCBFL), a novel Clustered FL framework that can cluster patients using patient-level data while protecting privacy. PCBFL uses Secure Multiparty Computation, a cryptographic technique, to securely calculate patient-level similarity scores across hospitals. We evaluate PCBFL by training a federated mortality prediction model using 20 sites from the eICU dataset and compare its performance gain against traditional and existing Clustered FL frameworks. Our results show that PCBFL successfully forms clinically meaningful cohorts of low-, medium-, and high-risk patients, and that it outperforms traditional and existing Clustered FL frameworks with an average AUC improvement of 4.3% and an average AUPRC improvement of 7.8%.
Proceedings of Machine Learning Research 219:150-166, 2023. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11376435/pdf/
SRDA: Mobile Sensing based Fluid Overload Detection for End Stage Kidney Disease Patients using Sensor Relation Dual Autoencoder.
Mingyue Tang, Jiechao Gao, Guimin Dong, Carl Yang, Bradford Campbell, Brendan Bowman, Jamie Marie Zoellner, Emaad Abdel-Rahman, Mehdi Boukhechba
Chronic kidney disease (CKD) is a life-threatening and prevalent disease. CKD patients, especially end-stage kidney disease (ESKD) patients on hemodialysis, suffer from kidney failure and are unable to remove excess fluid, causing fluid overload and multiple morbidities, including death. Current solutions for fluid overload monitoring, such as ultrasonography and biomarker assessment, are cumbersome, discontinuous, and can only be performed in the clinic. In this paper, we propose SRDA, a latent graph learning powered fluid overload detection system based on a Sensor Relation Dual Autoencoder, which detects excessive fluid consumption in ESKD patients from bio-behavioral data passively collected by smartwatch sensors. Experiments on real-world mobile sensing data indicate that SRDA outperforms state-of-the-art baselines in both F1 score and recall, and demonstrate the potential of ubiquitous sensing for ESKD fluid intake management.
Proceedings of Machine Learning Research 209:133-146, 2023. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10873463/pdf/
Temporal Supervised Contrastive Learning for Modeling Patient Risk Progression.
Shahriar Noroozizadeh, Jeremy C Weiss, George H Chen
We consider the problem of predicting how the likelihood of an outcome of interest for a patient changes over time as we observe more of the patient's data. To solve this problem, we propose a supervised contrastive learning framework that learns an embedding representation for each time step of a patient time series. Our framework learns the embedding space to have the following properties: (1) nearby points in the embedding space have similar predicted class probabilities, (2) adjacent time steps of the same time series map to nearby points in the embedding space, and (3) time steps with very different raw feature vectors map to far-apart regions of the embedding space. To achieve property (3), we employ a nearest neighbor pairing mechanism in the raw feature space. This mechanism also serves as an alternative to "data augmentation", a key ingredient of contrastive learning, which, to our knowledge, lacks a standard procedure that is adequately realistic for clinical tabular data. We demonstrate that our approach outperforms state-of-the-art baselines in predicting mortality of septic patients (MIMIC-III dataset) and tracking progression of cognitive impairment (ADNI dataset). Our method also consistently recovers the correct synthetic dataset embedding structure across experiments, a feat not achieved by baselines. Our ablation experiments show the pivotal role of our nearest neighbor pairing.
Proceedings of Machine Learning Research 225:403-427, 2023. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10976929/pdf/
Multiple Imputation with Neural Network Gaussian Process for High-dimensional Incomplete Data.
Zongyu Dai, Zhiqi Bu, Qi Long
Missing data are ubiquitous in real-world applications and, if not adequately handled, may lead to loss of information and biased findings in downstream analysis. In particular, high-dimensional incomplete data with a moderate sample size, such as multi-omics data, present daunting challenges. Imputation is arguably the most popular method for handling missing data, though existing imputation methods have a number of limitations. Single imputation methods, such as matrix completion, do not adequately account for imputation uncertainty and hence yield improper statistical inference. In contrast, multiple imputation (MI) methods allow for proper inference, but existing methods do not perform well in high-dimensional settings. Our work aims to address these significant methodological gaps by leveraging recent advances in neural network Gaussian processes (NNGPs) from a Bayesian viewpoint. We propose two NNGP-based MI methods, collectively called MI-NNGP, that draw multiple imputations for missing values from a joint (posterior predictive) distribution. The MI-NNGP methods significantly outperform existing state-of-the-art methods on synthetic and real datasets, in terms of imputation error, statistical inference, robustness to missing rates, and computation cost, under three missing data mechanisms: MCAR, MAR, and MNAR. Code is available in the GitHub repository https://github.com/bestadcarry/MI-NNGP.
Proceedings of Machine Learning Research 189:265-279, 2022. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10348708/pdf/nihms-1861886.pdf
Meta-analysis of individualized treatment rules via sign-coherency.
Jay Jojo Cheng, J. Huling, Guanhua Chen
Medical treatments tailored to a patient's baseline characteristics hold the potential to improve patient outcomes while reducing negative side effects. Learning individualized treatment rules (ITRs) often requires aggregating multiple datasets (sites); however, current ITR methodology does not take between-site heterogeneity into account, which can hurt model generalizability when deploying back to each site. To address this problem, we develop a method for individual-level meta-analysis of ITRs that jointly learns site-specific ITRs while borrowing information about feature sign-coherency via a scientifically motivated directionality principle. We also develop an adaptive procedure for model tuning, using information criteria tailored to the ITR learning problem. We study the proposed methods through numerical experiments to understand their performance under different levels of between-site heterogeneity, and we apply the methodology to estimate ITRs in a large multi-center database of electronic health records. This work extends several popular methodologies for estimating ITRs (A-learning, weighted learning) to the multiple-sites setting.
Proceedings of Machine Learning Research 193:171-198, 2022. DOI: 10.48550/arXiv.2211.15476
Predicting Attrition Patterns from Pediatric Weight Management Programs.
Hamed Fayyaz, Thao-Ly T Phan, H Timothy Bunnell, Rahmatollah Beheshti
Obesity is a major public health concern. Multidisciplinary pediatric weight management programs are considered standard treatment for children with obesity who cannot be successfully managed in the primary care setting. Despite their great potential, high dropout rates (referred to as attrition) are a major hurdle in delivering successful interventions. Predicting attrition patterns can help providers reduce the alarmingly high rates of attrition (up to 80%) by enabling earlier and more personalized interventions. Previous work has mainly focused on finding static predictors of attrition in smaller datasets and has achieved limited success in effective prediction. In this study, we collected a five-year comprehensive dataset of 4,550 children from diverse backgrounds receiving treatment at four pediatric weight management programs in the US. We then developed a machine learning pipeline to predict (a) the likelihood of attrition and (b) the change in body-mass index (BMI) percentile of children at different time points after joining the weight management program. Our pipeline is customized for this problem, using advanced machine learning techniques to handle longitudinal data, limited sample sizes, and interrelated prediction tasks. The proposed method showed strong prediction performance as measured by AUROC scores (average AUROC of 0.77 for predicting attrition and 0.78 for predicting weight outcomes).
Proceedings of Machine Learning Research 193:326-342, 2022. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9854275/pdf/nihms-1865420.pdf
Enzyme Activity Prediction of Sequence Variants on Novel Substrates using Improved Substrate Encodings and Convolutional Pooling.
Zhiqing Xu, Jinghao Wu, Yun S Song, Radhakrishnan Mahadevan
Protein engineering is currently being revolutionized by deep learning applications, especially through natural language processing (NLP) techniques. It has been shown that state-of-the-art self-supervised language models trained on entire protein databases capture hidden contextual and structural information in amino acid sequences and are capable of improving sequence-to-function predictions. Yet, recent studies have reported that current compound-protein modeling approaches perform poorly on learning interactions between enzymes and substrates of interest within one protein family. We attribute this to low-grade substrate encoding methods and over-compressed sequence representations received by downstream predictive models. In this study, we propose a new substrate encoding based on Extended Connectivity Fingerprints (ECFPs) and a convolutional pooling of the sequence embeddings. Testing on an activity profiling dataset of the haloalkanoate dehalogenase superfamily, which measures the activities of 218 phosphatases against 168 substrates, we show substantial improvements in the predictive performance of compound-protein interaction modeling. We also test the workflow on three other datasets, from the halogenase, kinase, and aminotransferase families, and show that our pipeline achieves good performance on these as well. We further demonstrate the utility of this downstream model architecture by showing that it achieves good performance with six different protein embeddings: ESM-1b (Rives et al., 2021), TAPE (Rao et al., 2019), ProtBert, ProtAlbert, ProtT5, and ProtXLNet (Elnaggar et al., 2021). This study provides a new workflow for activity prediction on novel substrates that can be used to engineer new enzymes for sustainability applications.
Proceedings of Machine Learning Research 165:78-87, 2022. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9759087/pdf/nihms-1842132.pdf
Selecting deep neural networks that yield consistent attribution-based interpretations for genomics.
Antonio Majdandzic, Chandana Rajesh, Amber Tang, Shushan Toneyan, Ethan Labelson, Rohit Tripathy, Peter K Koo
Deep neural networks (DNNs) have advanced our ability to take DNA primary sequence as input and predict a myriad of molecular activities measured via high-throughput functional genomic assays. Post hoc attribution analysis has been employed to provide insights into the importance of features learned by DNNs, often revealing patterns such as sequence motifs. However, attribution maps typically harbor spurious importance scores to an extent that varies from model to model, even for DNNs whose predictions generalize well. Thus, the standard approach to model selection, which relies on performance on a held-out validation set, does not guarantee that a high-performing DNN will provide reliable explanations. Here we introduce two approaches that quantify the consistency of important features across a population of attribution maps; consistency reflects a qualitative property of human-interpretable attribution maps. We employ these consistency metrics as part of a multivariate model selection framework to identify models that yield both high generalization performance and interpretable attribution analysis. We demonstrate the efficacy of this approach across various DNNs, quantitatively with synthetic data and qualitatively with chromatin accessibility data.
Proceedings of Machine Learning Research 200:131-149, 2022. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10194041/pdf/nihms-1895253.pdf
A Path Towards Clinical Adaptation of Accelerated MRI.
Michael S Yao, Michael S Hansen
Accelerated MRI reconstructs images of clinical anatomies from sparsely sampled signal data to reduce patient scan times. While recent works have leveraged deep learning to accomplish this task, such approaches have often been explored only in simulated environments with no signal corruption or resource limitations. In this work, we explore augmentations to neural network MRI reconstructors that enhance their clinical relevance. Namely, we propose a ConvNet model for detecting sources of image artifacts that achieves a classifier F2 score of 79.1%. We also demonstrate that training reconstructors on MR signal data with variable acceleration factors can improve their average performance during a clinical patient scan by up to 2%. We offer a loss function to overcome catastrophic forgetting when models learn to reconstruct MR images of multiple anatomies and orientations. Finally, we propose a method for using simulated phantom data to pre-train reconstructors in situations with limited clinically acquired datasets and compute capabilities. Our results provide a potential path forward for clinical adaptation of accelerated MRI.
Proceedings of Machine Learning Research 193:489-511, 2022. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10061571/pdf/nihms-1846161.pdf