Proceedings of machine learning research: latest articles

An Interoperable Machine Learning Pipeline for Pediatric Obesity Risk Estimation.

Proceedings of machine learning research

Pub Date : 2024-12-01

Hamed Fayyaz, Mehak Gupta, Alejandra Perez Ramirez, Claudine Jurkovitz, H Timothy Bunnell, Thao-Ly T Phan, Rahmatollah Beheshti

Reliable prediction of pediatric obesity can offer a valuable resource to providers, helping them engage in timely preventive interventions before the disease is established. Many efforts have been made to develop ML-based predictive models of obesity, and some studies have reported high predictive performance. However, no clinical decision support tool based on existing ML models is in common use. This study presents a novel end-to-end pipeline specifically designed for pediatric obesity prediction, which supports the entire process of data extraction, inference, and communication via an API or a user interface. While focusing only on routinely recorded data in pediatric electronic health records (EHRs), our pipeline uses a diverse expert-curated list of medical concepts to predict the 1- to 3-year risk of developing obesity. Furthermore, by using the Fast Healthcare Interoperability Resources (FHIR) standard in our design procedure, we specifically aim to facilitate low-effort integration of our pipeline with different EHR systems. In our experiments, we report the effectiveness of the predictive model as well as its alignment with the feedback from various stakeholders, including ML scientists, providers, health IT personnel, health administration representatives, and patient group representatives.
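
The abstract does not describe the pipeline's actual FHIR queries, so the following is only a minimal sketch of what a FHIR-based data-extraction step can look like. It builds a standard FHIR REST search URL for a patient's body mass index observations. The server base URL, the patient identifier, and the function name are hypothetical; LOINC 39156-5 is the standard code for BMI, and `patient`, `code`, and `_sort` are standard search parameters of the FHIR Observation resource.

```python
from urllib.parse import urlencode

# Hypothetical FHIR server base URL; a real deployment would supply its own.
FHIR_BASE = "https://fhir.example.org/r4"

# LOINC 39156-5 is the standard code for body mass index (BMI).
BMI_LOINC = "http://loinc.org|39156-5"


def bmi_observation_url(patient_id: str, base: str = FHIR_BASE) -> str:
    """Build a FHIR search URL for one patient's BMI observations, oldest first."""
    params = {"patient": patient_id, "code": BMI_LOINC, "_sort": "date"}
    # urlencode percent-encodes the system|code token so it is safe in a query string.
    return f"{base}/Observation?{urlencode(params)}"


# Illustrative patient id only; not taken from the paper.
url = bmi_observation_url("example-patient-1")
print(url)
```

Because the request uses only resource types and search parameters defined by the FHIR standard, the same call should work against any conformant EHR endpoint, which is the kind of low-effort integration the abstract is aiming for.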

Volume 259, pages 308-324. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11884402/pdf/

Citations: 0

Multimodal Sleep Apnea Detection with Missing or Noisy Modalities.

Proceedings of machine learning research

Pub Date : 2024-08-01

Hamed Fayyaz, Niharika S D'Souza, Rahmatollah Beheshti

Polysomnography (PSG) is a type of sleep study that records multimodal physiological signals and is widely used for purposes such as sleep staging and respiratory event detection. Conventional machine learning methods assume that each sleep study is associated with a fixed set of observed modalities and that all modalities are available for each sample. However, noisy and missing modalities are a common issue in real-world clinical settings. In this study, we propose a comprehensive pipeline aiming to compensate for the missing or noisy modalities when performing sleep apnea detection. Unlike other existing studies, our proposed model works with any combination of available modalities. Our experiments show that the proposed model outperforms other state-of-the-art approaches in sleep apnea detection using various subsets of available data and different levels of noise, and maintains its high performance (AUROC>0.9) even in the presence of high levels of noise or missingness. This is especially relevant in settings where the level of noise and missingness is high (such as pediatric or outside-of-clinic scenarios). Our code is publicly available at https://github.com/healthylaife/apnea-missing-modality.
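
The abstract does not say how the model combines whichever modalities are present, so the sketch below is not the authors' architecture. It only illustrates the general idea of accepting any combination of available modalities: per-modality embeddings are averaged over the modalities that exist, and a missing one is marked with `None`. The modality names, the embedding values, and the function name are all made up for illustration.

```python
from typing import Dict, List, Optional

Embedding = List[float]


def fuse_available(embeddings: Dict[str, Optional[Embedding]]) -> Embedding:
    """Average the embeddings of the modalities that are present.

    A value of None marks a modality that is missing (or was discarded as too noisy).
    """
    present = [e for e in embeddings.values() if e is not None]
    if not present:
        raise ValueError("at least one modality is required")
    dim = len(present[0])
    if any(len(e) != dim for e in present):
        raise ValueError("all embeddings must share one dimension")
    # Element-wise mean over the available modalities only.
    return [sum(e[i] for e in present) / len(present) for i in range(dim)]


# Hypothetical modality names; the EEG channel is missing in this sample.
sample = {"ecg": [0.5, 1.0], "spo2": [1.5, 3.0], "eeg": None}
fused = fuse_available(sample)
print(fused)
```

Averaging keeps the fused vector at a fixed size regardless of how many modalities arrive, so a downstream classifier never needs a separate input layout for each subset.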

Volume 252. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11893010/pdf/

Citations: 0

Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks.

Proceedings of machine learning research

Pub Date : 2024-07-01

Liam Collins, Hamed Hassani, Mahdi Soltanolkotabi, Aryan Mokhtari, Sanjay Shakkottai

An increasingly popular machine learning paradigm is to pretrain a neural network (NN) on many tasks offline, then adapt it to downstream tasks, often by re-training only the last linear layer of the network. This approach yields strong downstream performance in a variety of contexts, demonstrating that multitask pretraining leads to effective feature learning. Although several recent theoretical studies have shown that shallow NNs learn meaningful features when either (i) they are trained on a single task or (ii) they are linear, very little is known about the closer-to-practice case of nonlinear NNs trained on multiple tasks. In this work, we present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks. Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks. Using this observation, we show that when the tasks are binary classification tasks with labels depending on the projection of the data onto an $r$-dimensional subspace within the $d$-dimensional input space (with $d \gg r$), a simple gradient-based multitask learning algorithm on a two-layer ReLU NN recovers this projection, allowing for generalization to downstream tasks with sample and neuron complexity independent of $d$. In contrast, we show that with high probability over the draw of a single task, training on this single task is not guaranteed to learn all $r$ ground-truth features.

Volume 235, pages 9292-9345. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11486479/pdf/

Citations: 0

Multi-Region Markovian Gaussian Process: An Efficient Method to Discover Directional Communications Across Multiple Brain Regions.

Proceedings of machine learning research

Pub Date : 2024-07-01

Weihan Li, Chengrui Li, Yule Wang, Anqi Wu

Studying the complex interactions between different brain regions is crucial in neuroscience. Various statistical methods have explored the latent communication across multiple brain regions. Two main categories are the Gaussian Process (GP) and Linear Dynamical System (LDS), each with unique strengths. The GP-based approach effectively discovers latent variables with frequency bands and communication directions. Conversely, the LDS-based approach is computationally efficient but lacks powerful expressiveness in latent representation. In this study, we merge both methodologies by creating an LDS mirroring a multi-output GP, termed Multi-Region Markovian Gaussian Process (MRM-GP). Our work establishes a connection between an LDS and a multi-output GP that explicitly models frequencies and phase delays within the latent space of neural recordings. Consequently, the model achieves a linear inference cost over time points and provides an interpretable low-dimensional representation, revealing communication directions across brain regions and separating oscillatory communications into different frequency bands.

Volume 235, pages 28112-28131. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11526605/pdf/

Citations: 0

Multi-Source Conformal Inference Under Distribution Shift.

Proceedings of machine learning research

Pub Date : 2024-07-01

Yi Liu, Alexander W Levis, Sharon-Lise Normand, Larry Han

Recent years have seen increasing use of complex machine learning models across multiple sources of data to inform more generalizable decision-making. However, distribution shifts across data sources and privacy concerns related to sharing individual-level data, coupled with a lack of uncertainty quantification from machine learning predictions, make it challenging to achieve valid inferences in multi-source environments. In this paper, we consider the problem of obtaining distribution-free prediction intervals for a target population, leveraging multiple potentially biased data sources. We derive the efficient influence functions for the quantiles of unobserved outcomes in the target and source populations, and show that one can incorporate machine learning prediction algorithms in the estimation of nuisance functions while still achieving parametric rates of convergence to nominal coverage probabilities. Moreover, when conditional outcome invariance is violated, we propose a data-adaptive strategy to upweight informative data sources for efficiency gain and downweight non-informative data sources for bias reduction. We highlight the robustness and efficiency of our proposals for a variety of conformal scores and data-generating mechanisms via extensive synthetic experiments. Hospital length of stay prediction intervals for pediatric patients undergoing a high-risk cardiac surgical procedure between 2016 and 2022 in the U.S. illustrate the utility of our methodology.
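
For readers unfamiliar with distribution-free prediction intervals, the sketch below shows ordinary single-source split conformal prediction with absolute-residual scores. It is not the multi-source, influence-function-based estimator proposed in the paper, and it assumes no distribution shift. The calibration residuals are made-up numbers and the function names are illustrative.

```python
import math
from typing import List, Tuple


def conformal_quantile(scores: List[float], alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def prediction_interval(pred: float, scores: List[float], alpha: float) -> Tuple[float, float]:
    """Symmetric interval pred +/- q, where q is the conformal quantile."""
    q = conformal_quantile(scores, alpha)
    return pred - q, pred + q


# Toy calibration scores |y - prediction| on held-out data (made-up numbers).
cal_scores = [2.5, 0.5, 3.5, 1.0, 3.0, 1.5, 2.0]
interval = prediction_interval(10.0, cal_scores, alpha=0.25)
print(interval)
```

Under exchangeability of calibration and test points, an interval built this way covers the true outcome with probability at least 1 - alpha. The paper addresses the harder setting in which the sources differ from the target population.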

Volume 235, pages 31344-31382. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11345809/pdf/

Citations: 0

Adapt and Diffuse: Sample-Adaptive Reconstruction Via Latent Diffusion Models.

Proceedings of machine learning research

Pub Date : 2024-07-01

Zalan Fabian, Berk Tinaz, Mahdi Soltanolkotabi

Inverse problems arise in a multitude of applications, where the goal is to recover a clean signal from noisy and possibly (non)linear observations. The difficulty of a reconstruction problem depends on multiple factors, such as the structure of the ground truth signal, the severity of the degradation and the complex interactions between the above. This results in natural sample-by-sample variation in the difficulty of a reconstruction task, which is often overlooked by contemporary techniques. Our key observation is that most existing inverse problem solvers lack the ability to adapt their compute power to the difficulty of the reconstruction task, resulting in subpar performance and wasteful resource allocation. We propose a novel method that we call severity encoding, to estimate the degradation severity of noisy, degraded signals in the latent space of an autoencoder. We show that the estimated severity has strong correlation with the true corruption level and can give useful hints at the difficulty of reconstruction problems on a sample-by-sample basis. Furthermore, we propose a reconstruction method based on latent diffusion models that leverages the predicted degradation severities to fine-tune the reverse diffusion sampling trajectory and thus achieve sample-adaptive inference times. Our framework acts as a wrapper that can be combined with any latent diffusion-based baseline solver, imbuing it with sample-adaptivity and acceleration. We perform numerical experiments on both linear and nonlinear inverse problems and demonstrate that our technique greatly improves the performance of the baseline solver and achieves up to 10× acceleration in mean sampling speed.

Volume 235, pages 12723-12753. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11421836/pdf/

Citations: 0

Contrastive Learning for Clinical Outcome Prediction with Partial Data Sources.

Proceedings of machine learning research

Pub Date : 2024-07-01

Meng Xia, Jonathan Wilson, Benjamin Goldstein, Ricardo Henao

The use of machine learning models to predict clinical outcomes from (longitudinal) electronic health record (EHR) data is becoming increasingly popular due to advances in deep architectures, representation learning, and the growing availability of large EHR datasets. Existing models generally assume access to the same data sources during both training and inference stages. However, this assumption is often challenged by the fact that real-world clinical datasets originate from various data sources (with distinct sets of covariates), which, though they can be available for training (in a research or retrospective setting), are more realistically only partially available (a subset of such sets) for inference when deployed. So motivated, we introduce Contrastive Learning for clinical Outcome Prediction with Partial data Sources (CLOPPS), which trains encoders to capture information across different data sources and then leverages them to build classifiers restricting access to a single data source. This approach can be used with existing cross-sectional or longitudinal outcome classification models. We present experiments on two real-world datasets demonstrating that CLOPPS consistently outperforms strong baselines in several practical scenarios.
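
The abstract does not give the CLOPPS objective, so the sketch below uses InfoNCE, a common contrastive loss, purely to illustrate how encoders for two data sources can be pulled toward agreement on the same patient. The embeddings are toy values, the pairing of row i in one source with row i in the other is an assumption, and the function names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchors: List[Vector], positives: List[Vector], temperature: float = 0.1) -> float:
    """Mean InfoNCE loss: positives[i] is the positive for anchors[i],
    and every other row of `positives` acts as a negative."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings of two patients from two data sources (made-up values).
source_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[1.0, 0.0], [0.0, 1.0]]     # each patient matches across sources
misaligned_b = [[0.0, 1.0], [1.0, 0.0]]  # patients swapped across sources

aligned_loss = info_nce(source_a, aligned_b)
misaligned_loss = info_nce(source_a, misaligned_b)
print(aligned_loss, misaligned_loss)
```

The loss is near zero when the two sources agree on each patient and large when they do not, which is the signal that lets an encoder for one source absorb information from the others.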

Volume 235, pages 54156-54177. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11326519/pdf/

Citations: 0

DiracDiffusion: Denoising and Incremental Reconstruction with Assured Data-Consistency.

Proceedings of machine learning research

Pub Date : 2024-07-01

Zalan Fabian, Berk Tinaz, Mahdi Soltanolkotabi

Diffusion models have established a new state of the art in a multitude of computer vision tasks, including image restoration. Diffusion-based inverse problem solvers generate reconstructions of exceptional visual quality from heavily corrupted measurements. However, in what is widely known as the perception-distortion trade-off, the price of perceptually appealing reconstructions is often paid in worse distortion metrics, such as PSNR. Distortion metrics measure faithfulness to the observation, a crucial requirement in inverse problems. In this work, we propose a novel framework for inverse problem solving: we assume that the observation comes from a stochastic degradation process that gradually degrades and adds noise to the original clean image. We learn to reverse the degradation process in order to recover the clean image. Our technique maintains consistency with the original measurement throughout the reverse process, and allows for great flexibility in trading off perceptual quality for improved distortion metrics and sampling speedup via early stopping. We demonstrate the efficiency of our method on different high-resolution datasets and inverse problems, achieving great improvements over other state-of-the-art diffusion-based methods with respect to both perceptual and distortion metrics.

Volume 235, pages 12754-12783. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11483186/pdf/


Self-supervised pretraining in the wild imparts image acquisition robustness to medical image transformers: an application to lung cancer segmentation.

Proceedings of machine learning research

Pub Date : 2024-07-01

Jue Jiang, Harini Veeraraghavan

Self-supervised learning (SSL) is an approach to pretrain models with unlabeled datasets and extract useful feature representations such that these models can be easily fine-tuned for various downstream tasks. Self-pretraining applies SSL on curated task-specific datasets without using task-specific labels. The increasing availability of public data repositories has now made it possible to use large, diverse, task-unrelated datasets to pretrain models in the "wild" using SSL. However, the benefit of such wild-pretraining over self-pretraining has not been studied in the context of medical image analysis. Hence, we analyzed transformers (Swin and ViT) and a convolutional neural network, created using wild- and self-pretraining and trained to segment lung tumors from 3D computed tomography (CT) scans, in terms of: (a) accuracy, (b) fine-tuning epoch efficiency, and (c) robustness to image acquisition differences (contrast versus non-contrast, slice thickness, and image reconstruction kernels). We also studied feature reuse using centered kernel alignment (CKA) with the Swin networks. Our analysis with two independent test datasets (public, N = 139; internal, N = 196) showed that wild-pretrained Swin models significantly outperformed self-pretrained Swin models across the various image acquisitions. Fine-tuning epoch efficiency was higher for both wild-pretrained Swin and ViT models than for their self-pretrained counterparts. Feature reuse close to the final encoder layers was lower than in the early layers for wild-pretrained models, irrespective of the pretext tasks used in SSL. Models and code will be made available through GitHub upon manuscript acceptance.
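
The feature-reuse analysis relies on centered kernel alignment. As a rough illustration only, the sketch below computes the linear form of CKA between two activation matrices (rows are examples, columns are features) in plain Python. The function names and the toy matrices are hypothetical, and the abstract does not say which CKA variant or which layers the authors compared.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    if len(x) != len(y):
        raise ValueError("both matrices need one row per example")
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    if denom == 0:
        raise ValueError("CKA is undefined for constant features")
    return cross_frobenius_sq(xc, yc) / denom


if __name__ == "__main__":
    layer_before = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
    layer_after = [[1.5, 2.0], [2.5, 0.5], [0.0, 3.0], [2.0, 2.5]]
    print(f"linear CKA: {linear_cka(layer_before, layer_after):.3f}")
```

Values near 1 indicate that a layer's representation changed little between two models (high reuse), while lower values indicate that the representation was substantially rewritten.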


Volume 250, pages 708-721. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11741178/pdf/


Theoretical Analysis of Learned Database Operations under Distribution Shift through Distribution Learnability.

Proceedings of machine learning research

Pub Date : 2024-07-01

Sepanta Zeighami, Cyrus Shahabi

The use of machine learning to perform database operations, such as indexing, cardinality estimation, and sorting, has been shown to provide substantial performance benefits. However, when datasets change and the data distribution shifts, empirical results also show performance degradation for learned models, possibly to the point of performing worse than non-learned alternatives. This, together with a lack of theoretical understanding of learned methods, undermines their practical applicability, since there are no guarantees on how well the models will perform after deployment. In this paper, we present the first known theoretical characterization of the performance of learned models in dynamic datasets for the aforementioned operations. Our results show novel theoretical characteristics achievable by learned models and provide bounds on model performance that characterize their advantages over non-learned methods, showing why and when learned models can outperform the alternatives. Our analysis develops the distribution learnability framework and novel theoretical tools, which build the foundation for the future analysis of learned database operations.
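
To make "learned indexing" concrete, here is a toy sketch, not the construction analyzed in the paper: a least-squares line predicts a key's position in a sorted array, and the worst prediction error observed at build time bounds a local binary search. The class name `LinearLearnedIndex` and all of its members are hypothetical and chosen for illustration.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Toy learned index: a fitted line maps a key to its approximate rank."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var_key = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos) for i, k in enumerate(self.keys))
        # Ordinary least squares for position ~ slope * key + intercept.
        self.slope = cov / var_key if var_key > 0 else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # The worst error over the indexed keys bounds the search window.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of `key`, or -1 if it is not indexed."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        pos = bisect_left(self.keys, key, lo, hi)
        return pos if pos < n and self.keys[pos] == key else -1


if __name__ == "__main__":
    uniform = LinearLearnedIndex(range(0, 100, 10))
    skewed = LinearLearnedIndex([2, 3, 5, 8, 13, 21, 34, 55, 89])
    print("window radius, uniform keys:", uniform.max_err)
    print("window radius, skewed keys:", skewed.max_err)
    print("position of 21:", skewed.lookup(21))
```

The search window grows as the keys deviate from what the model fits well. In this sketch, a distribution shift after deployment would require refitting the model or widening the window, which hints at why performance guarantees under shift are of interest.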


Volume 235, pages 58283-58305. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11534081/pdf/
