Temporally Multi-Scale Sparse Self-Attention for Physical Activity Data Imputation.
Hui Wei, Maxwell A Xu, Colin Samplawski, James M Rehg, Santosh Kumar, Benjamin M Marlin
Wearable sensors enable health researchers to continuously collect data pertaining to the physiological state of individuals in real-world settings. However, such data can be subject to extensive missingness due to a complex combination of factors. In this work, we study the problem of imputing missing step count data, one of the most ubiquitous forms of wearable sensor data. We construct a novel, large-scale data set consisting of a training set with over 3 million hourly step count observations and a test set with over 2.5 million hourly step count observations. We propose a domain-knowledge-informed sparse self-attention model for this task that captures the temporal multi-scale nature of step count data. We assess the performance of the model relative to baselines and conduct ablation studies to verify our specific model designs.
{"title":"Temporally Multi-Scale Sparse Self-Attention for Physical Activity Data Imputation.","authors":"Hui Wei, Maxwell A Xu, Colin Samplawski, James M Rehg, Santosh Kumar, Benjamin M Marlin","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Wearable sensors enable health researchers to continuously collect data pertaining to the physiological state of individuals in real-world settings. However, such data can be subject to extensive missingness due to a complex combination of factors. In this work, we study the problem of imputation of missing step count data, one of the most ubiquitous forms of wearable sensor data. We construct a novel and large scale data set consisting of a training set with over 3 million hourly step count observations and a test set with over 2.5 million hourly step count observations. We propose a domain knowledge-informed sparse self-attention model for this task that captures the temporal multi-scale nature of step-count data. We assess the performance of the model relative to baselines and conduct ablation studies to verify our specific model designs.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"248 ","pages":"137-154"},"PeriodicalIF":0.0,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11421853/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142334005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online Calibrated and Conformal Prediction Improves Bayesian Optimization.
Shachi Deshpande, Charles Marx, Volodymyr Kuleshov
Accurate uncertainty estimates are important in sequential model-based decision-making tasks such as Bayesian optimization. However, these estimates can be imperfect if the data violates assumptions made by the model (e.g., Gaussianity). This paper studies which uncertainties are needed in model-based decision-making and in Bayesian optimization, and argues that uncertainties can benefit from calibration: an 80% predictive interval should contain the true outcome 80% of the time. Maintaining calibration, however, can be challenging when the data is non-stationary and depends on our actions. We propose using simple algorithms based on online learning to provably maintain calibration on non-i.i.d. data, and we show how to integrate these algorithms in Bayesian optimization with minimal overhead. Empirically, we find that calibrated Bayesian optimization converges to better optima in fewer steps, and we demonstrate improved performance on standard benchmark functions and hyperparameter optimization tasks.
{"title":"Online Calibrated and Conformal Prediction Improves Bayesian Optimization.","authors":"Shachi Deshpande, Charles Marx, Volodymyr Kuleshov","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Accurate uncertainty estimates are important in sequential model-based decision-making tasks such as Bayesian optimization. However, these estimates can be imperfect if the data violates assumptions made by the model (e.g., Gaussianity). This paper studies which uncertainties are needed in model-based decision-making and in Bayesian optimization, and argues that uncertainties can benefit from calibration-i.e., an 80% predictive interval should contain the true outcome 80% of the time. Maintaining calibration, however, can be challenging when the data is non-stationary and depends on our actions. We propose using simple algorithms based on online learning to provably maintain calibration on non-i.i.d. data, and we show how to integrate these algorithms in Bayesian optimization with minimal overhead. Empirically, we find that calibrated Bayesian optimization converges to better optima in fewer steps, and we demonstrate improved performance on standard benchmark functions and hyperparameter optimization tasks.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"238 ","pages":"1450-1458"},"PeriodicalIF":0.0,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11482741/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142482674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive Discretization for Event PredicTion (ADEPT).
Jimmy Hickey, Ricardo Henao, Daniel Wojdyla, Michael Pencina, Matthew Engelhard
Recently developed survival analysis methods improve upon existing approaches by predicting the probability of event occurrence in each of a number of pre-specified (discrete) time intervals. By avoiding strong parametric assumptions on the event density, this approach tends to improve prediction performance, particularly when data are plentiful. However, in clinical settings with limited available data, it is often preferable to judiciously partition the event time space into a limited number of intervals well suited to the prediction task at hand. In this work, we develop Adaptive Discretization for Event PredicTion (ADEPT) to learn from data a set of cut points defining such a partition. We show that in two simulated datasets, we are able to recover intervals that match the underlying generative model. We then demonstrate improved prediction performance on three real-world observational datasets, including a large, newly harmonized stroke risk prediction dataset. Finally, we argue that our approach facilitates clinical decision-making by suggesting time intervals that are most appropriate for each task, in the sense that they facilitate more accurate risk prediction.
{"title":"Adaptive Discretization for Event PredicTion (ADEPT).","authors":"Jimmy Hickey, Ricardo Henao, Daniel Wojdyla, Michael Pencina, Matthew Engelhard","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Recently developed survival analysis methods improve upon existing approaches by predicting the probability of event occurrence in each of a number pre-specified (discrete) time intervals. By avoiding placing strong parametric assumptions on the event density, this approach tends to improve prediction performance, particularly when data are plentiful. However, in clinical settings with limited available data, it is often preferable to judiciously partition the event time space into a limited number of intervals well suited to the prediction task at hand. In this work, we develop Adaptive Discretization for Event PredicTion (ADEPT) to learn from data a set of cut points defining such a partition. We show that in two simulated datasets, we are able to recover intervals that match the underlying generative model. We then demonstrate improved prediction performance on three real-world observational datasets, including a large, newly harmonized stroke risk prediction dataset. Finally, we argue that our approach facilitates clinical decision-making by suggesting time intervals that are most appropriate for each task, in the sense that they facilitate more accurate risk prediction.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"238 ","pages":"1351-1359"},"PeriodicalIF":0.0,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11078624/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140900566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SADI: Similarity-Aware Diffusion Model-Based Imputation for Incomplete Temporal EHR Data.
Zongyu Dai, Emily Getzen, Qi Long
Missing values are prevalent in temporal electronic health records (EHR) data and are known to complicate data analysis and lead to biased results. The current state-of-the-art (SOTA) models for imputing missing values in EHR primarily leverage correlations across time points and across features, which perform well when data have strong correlation across time points, such as in ICU data where high-frequency time series data are collected. However, this is often insufficient for temporal EHR data from non-ICU settings (e.g., outpatient visits for primary care or specialty care), where data are collected at substantially sparser time points, resulting in much weaker correlation across time points. To address this methodological gap, we propose the Similarity-Aware Diffusion Model-Based Imputation (SADI), a novel imputation method that leverages the diffusion model and utilizes information across dependent variables. We apply SADI to impute incomplete temporal EHR data and propose a similarity-aware denoising function, which includes a self-attention mechanism to model the correlations between time points, features, and similar patients. To the best of our knowledge, this is the first time that the information of similar patients is directly used to construct imputation for incomplete temporal EHR data. Our extensive experiments on two datasets, the Critical Path For Alzheimer's Disease (CPAD) data and the PhysioNet Challenge 2012 data, show that SADI outperforms the current SOTA under various missing data mechanisms, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
{"title":"SADI: Similarity-Aware Diffusion Model-Based Imputation for Incomplete Temporal EHR Data.","authors":"Zongyu Dai, Emily Getzen, Qi Long","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Missing values are prevalent in temporal electronic health records (EHR) data and are known to complicate data analysis and lead to biased results. The current state-of-the-art (SOTA) models for imputing missing values in EHR primarily leverage correlations across time points and across features, which perform well when data have strong correlation across time points, such as in ICU data where high-frequency time series data are collected. However, this is often insufficient for temporal EHR data from non-ICU settings (e.g., outpatient visits for primary care or specialty care), where data are collected at substantially sparser time points, resulting in much weaker correlation across time points. To address this methodological gap, we propose the Similarity-Aware Diffusion Model-Based Imputation (SADI), a novel imputation method that leverages the diffusion model and utilizes information across dependent variables. We apply SADI to impute incomplete temporal EHR data and propose a similarity-aware denoising function, which includes a self-attention mechanism to model the correlations between time points, features, and similar patients. To the best of our knowledge, this is the first time that the information of similar patients is directly used to construct imputation for incomplete temporal EHR data. Our extensive experiments on two datasets, the Critical Path For Alzheimer's Disease (CPAD) data and the PhysioNet Challenge 2012 data, show that SADI outperforms the current SOTA under various missing data mechanisms, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"238 ","pages":"4195-4203"},"PeriodicalIF":0.0,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11391213/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142302980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online Bilevel Optimization: Regret Analysis of Online Alternating Gradient Methods.
Davoud Ataee Tarzanagh, Parvin Nazari, Bojian Hou, Li Shen, Laura Balzano
This paper introduces an online bilevel optimization setting in which a sequence of time-varying bilevel problems is revealed one after the other. We extend the known regret bounds for single-level online algorithms to the bilevel setting. Specifically, we provide new notions of bilevel regret, develop an online alternating time-averaged gradient method that is capable of leveraging smoothness, and give regret bounds in terms of the path-length of the inner and outer minimizer sequences.
{"title":"Online Bilevel Optimization: Regret Analysis of Online Alternating Gradient Methods.","authors":"Davoud Ataee Tarzanagh, Parvin Nazari, Bojian Hou, Li Shen, Laura Balzano","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>This paper introduces an <i>online bilevel optimization</i> setting in which a sequence of time-varying bilevel problems is revealed one after the other. We extend the known regret bounds for single-level online algorithms to the bilevel setting. Specifically, we provide new notions of <i>bilevel regret</i>, develop an online alternating time-averaged gradient method that is capable of leveraging smoothness, and give regret bounds in terms of the path-length of the inner and outer minimizer sequences.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"238 ","pages":"2854-2862"},"PeriodicalIF":0.0,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11452163/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142382692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Generalization Ability of Unsupervised Pretraining.
Yuyang Deng, Junyuan Hong, Jiayu Zhou, Mehrdad Mahdavi
Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization. However, a rigorous understanding of how the representation function learned on an unlabeled dataset affects the generalization of the fine-tuned model is lacking. Existing theoretical research does not adequately account for the heterogeneity of the distribution and tasks in the pre-training and fine-tuning stages. To bridge this gap, this paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase, ultimately affecting the generalization capabilities of the fine-tuned model on downstream tasks. We apply our theoretical framework to analyze the generalization bounds of two distinct scenarios: Context Encoder pre-training with deep neural networks and Masked Autoencoder pre-training with deep transformers, followed by fine-tuning on a binary classification task. Finally, inspired by our findings, we propose a novel regularization method during pre-training to further enhance the generalization of the fine-tuned model. Overall, our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm and can shed light on the design of more effective pre-training algorithms.
{"title":"On the Generalization Ability of Unsupervised Pretraining.","authors":"Yuyang Deng, Junyuan Hong, Jiayu Zhou, Mehrdad Mahdavi","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization. However, a rigorous understanding of how the representation function learned on an unlabeled dataset affects the generalization of the fine-tuned model is lacking. Existing theoretical research does not adequately account for the heterogeneity of the distribution and tasks in pre-training and fine-tuning stage. To bridge this gap, this paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase, ultimately affecting the generalization capabilities of the fine-tuned model on downstream tasks. We apply our theoretical framework to analyze generalization bound of two distinct scenarios: Context Encoder pre-training with deep neural networks and Masked Autoencoder pre-training with deep transformers, followed by fine-tuning on a binary classification task. Finally, inspired by our findings, we propose a novel regularization method during pre-training to further enhances the generalization of fine-tuned model. Overall, our results contribute to a better understanding of unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"238 ","pages":"4519-4527"},"PeriodicalIF":0.0,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11484219/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142482673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online learning in bandits with predicted context.
Yongyi Guo, Ziping Xu, Susan Murphy
We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.
{"title":"Online learning in bandits with predicted context.","authors":"Yongyi Guo, Ziping Xu, Susan Murphy","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"238 ","pages":"2215-2223"},"PeriodicalIF":0.0,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11501084/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142514330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal Sparse Survival Trees.
Rui Zhang, Rui Xin, Margo Seltzer, Cynthia Rudin
Interpretability is crucial for doctors, hospitals, pharmaceutical companies and biotechnology corporations to analyze and make decisions for high-stakes problems that involve human health. Tree-based methods have been widely adopted for survival analysis due to their appealing interpretability and their ability to capture complex relationships. However, most existing methods to produce survival trees rely on heuristic (or greedy) algorithms, which risk producing sub-optimal models. We present a dynamic-programming-with-bounds approach that finds provably-optimal sparse survival tree models, frequently in only a few seconds.
{"title":"Optimal Sparse Survival Trees.","authors":"Rui Zhang, Rui Xin, Margo Seltzer, Cynthia Rudin","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Interpretability is crucial for doctors, hospitals, pharmaceutical companies and biotechnology corporations to analyze and make decisions for high stakes problems that involve human health. Tree-based methods have been widely adopted for <i>survival analysis</i> due to their appealing interpretablility and their ability to capture complex relationships. However, most existing methods to produce survival trees rely on heuristic (or greedy) algorithms, which risk producing sub-optimal models. We present a dynamic-programming-with-bounds approach that finds provably-optimal sparse survival tree models, frequently in only a few seconds.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"238 ","pages":"352-360"},"PeriodicalIF":0.0,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11417463/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142309319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fusing Individualized Treatment Rules Using Secondary Outcomes.
Daiqi Gao, Yuanjia Wang, Donglin Zeng
An individualized treatment rule (ITR) is a decision rule that recommends treatments for patients based on their individual feature variables. In many practical settings, the ideal ITR for the primary outcome is also expected to cause minimal harm to other secondary outcomes. Therefore, our objective is to learn an ITR that not only maximizes the value function for the primary outcome, but also approximates the optimal rule for the secondary outcomes as closely as possible. To achieve this goal, we introduce a fusion penalty to encourage the ITRs based on different outcomes to yield similar recommendations. Two algorithms are proposed to estimate the ITR using surrogate loss functions. We prove that the agreement rate between the estimated ITR of the primary outcome and the optimal ITRs of the secondary outcomes converges to the true agreement rate faster than if the secondary outcomes are not taken into consideration. Furthermore, we derive the non-asymptotic properties of the value function and misclassification rate for the proposed method. Finally, simulation studies and a real data example are used to demonstrate the finite-sample performance of the proposed method.
{"title":"Fusing Individualized Treatment Rules Using Secondary Outcomes.","authors":"Daiqi Gao, Yuanjia Wang, Donglin Zeng","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>An individualized treatment rule (ITR) is a decision rule that recommends treatments for patients based on their individual feature variables. In many practices, the ideal ITR for the primary outcome is also expected to cause minimal harm to other secondary outcomes. Therefore, our objective is to learn an ITR that not only maximizes the value function for the primary outcome, but also approximates the optimal rule for the secondary outcomes as closely as possible. To achieve this goal, we introduce a fusion penalty to encourage the ITRs based on different outcomes to yield similar recommendations. Two algorithms are proposed to estimate the ITR using surrogate loss functions. We prove that the agreement rate between the estimated ITR of the primary outcome and the optimal ITRs of the secondary outcomes converges to the true agreement rate faster than if the secondary outcomes are not taken into consideration. Furthermore, we derive the non-asymptotic properties of the value function and misclassification rate for the proposed method. Finally, simulation studies and a real data example are used to demonstrate the finite-sample performance of the proposed method.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"238 ","pages":"712-720"},"PeriodicalIF":0.0,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11450767/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142382691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simple and Scalable Algorithms for Cluster-Aware Precision Medicine.
Amanda M Buch, Conor Liston, Logan Grosenick
AI-enabled precision medicine promises a transformational improvement in healthcare outcomes. However, training on biomedical data presents significant challenges as they are often high dimensional, clustered, and of limited sample size. To overcome these challenges, we propose a simple and scalable approach for cluster-aware embedding that combines latent factor methods with a convex clustering penalty in a modular way. Our novel approach overcomes the complexity and limitations of current joint embedding and clustering methods and enables hierarchically clustered principal component analysis (PCA), locally linear embedding (LLE), and canonical correlation analysis (CCA). Through numerical experiments and real-world examples, we demonstrate that our approach outperforms fourteen clustering methods on highly underdetermined problems (e.g., with limited sample size) as well as on large sample datasets. Importantly, our approach does not require the user to choose the desired number of clusters, yields improved model selection if they do, and yields interpretable hierarchically clustered embedding dendrograms. Thus, our approach improves significantly on existing methods for identifying patient subgroups in multiomics and neuroimaging data and enables scalable and interpretable biomarkers for precision medicine.
{"title":"Simple and Scalable Algorithms for Cluster-Aware Precision Medicine.","authors":"Amanda M Buch, Conor Liston, Logan Grosenick","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>AI-enabled precision medicine promises a transformational improvement in healthcare outcomes. However, training on biomedical data presents significant challenges as they are often high dimensional, clustered, and of limited sample size. To overcome these challenges, we propose a simple and scalable approach for cluster-aware embedding that combines latent factor methods with a convex clustering penalty in a modular way. Our novel approach overcomes the complexity and limitations of current joint embedding and clustering methods and enables hierarchically clustered principal component analysis (PCA), locally linear embedding (LLE), and canonical correlation analysis (CCA). Through numerical experiments and real-world examples, we demonstrate that our approach outperforms fourteen clustering methods on highly underdetermined problems (e.g., with limited sample size) as well as on large sample datasets. Importantly, our approach does not require the user to choose the desired number of clusters, yields improved model selection if they do, and yields interpretable hierarchically clustered embedding dendrograms. Thus, our approach improves significantly on existing methods for identifying patient subgroups in multiomics and neuroimaging data and enables scalable and interpretable biomarkers for precision medicine.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"238 ","pages":"136-144"},"PeriodicalIF":0.0,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11251711/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141629518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}