Selecting deep neural networks that yield consistent attribution-based interpretations for genomics.
Antonio Majdandzic, Chandana Rajesh, Amber Tang, Shushan Toneyan, Ethan Labelson, Rohit Tripathy, Peter K Koo
Deep neural networks (DNNs) have advanced our ability to take DNA primary sequence as input and predict a myriad of molecular activities measured via high-throughput functional genomic assays. Post hoc attribution analysis has been employed to provide insights into the importance of features learned by DNNs, often revealing patterns such as sequence motifs. However, attribution maps typically harbor spurious importance scores to an extent that varies from model to model, even for DNNs whose predictions generalize well. Thus, the standard approach to model selection, which relies on performance on a held-out validation set, does not guarantee that a high-performing DNN will provide reliable explanations. Here we introduce two approaches that quantify the consistency of important features across a population of attribution maps; consistency reflects a qualitative property of human-interpretable attribution maps. We employ the consistency metrics as part of a multivariate model selection framework to identify models that yield high generalization performance and interpretable attribution analysis. We demonstrate the efficacy of this approach across various DNNs, quantitatively with synthetic data and qualitatively with chromatin accessibility data.
{"title":"Selecting deep neural networks that yield consistent attribution-based interpretations for genomics.","authors":"Antonio Majdandzic, Chandana Rajesh, Amber Tang, Shushan Toneyan, Ethan Labelson, Rohit Tripathy, Peter K Koo","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Deep neural networks (DNNs) have advanced our ability to take DNA primary sequence as input and predict a myriad of molecular activities measured via high-throughput functional genomic assays. Post hoc attribution analysis has been employed to provide insights into the importance of features learned by DNNs, often revealing patterns such as sequence motifs. However, attribution maps typically harbor spurious importance scores to an extent that varies from model to model, even for DNNs whose predictions generalize well. Thus, the standard approach for model selection, which relies on performance of a held-out validation set, does not guarantee that a high-performing DNN will provide reliable explanations. Here we introduce two approaches that quantify the consistency of important features across a population of attribution maps; consistency reflects a qualitative property of human interpretable attribution maps. We employ the consistency metrics as part of a multivariate model selection framework to identify models that yield high generalization performance and interpretable attribution analysis. We demonstrate the efficacy of this approach across various DNNs quantitatively with synthetic data and qualitatively with chromatin accessibility data.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"200 ","pages":"131-149"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10194041/pdf/nihms-1895253.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9544629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Path Towards Clinical Adaptation of Accelerated MRI.
Michael S Yao, Michael S Hansen
Accelerated MRI reconstructs images of clinical anatomies from sparsely sampled signal data to reduce patient scan times. While recent works have leveraged deep learning to accomplish this task, such approaches have often only been explored in simulated environments free of signal corruption and resource limitations. In this work, we explore augmentations to neural network MRI image reconstructors to enhance their clinical relevance. Namely, we propose a ConvNet model for detecting sources of image artifacts that achieves a classifier F2 score of 79.1%. We also demonstrate that training reconstructors on MR signal data with variable acceleration factors can improve their average performance during a clinical patient scan by up to 2%. We offer a loss function to overcome catastrophic forgetting when models learn to reconstruct MR images of multiple anatomies and orientations. Finally, we propose a method for using simulated phantom data to pre-train reconstructors in situations with limited clinically acquired datasets and compute capabilities. Our results provide a potential path forward for clinical adaptation of accelerated MRI.
{"title":"A Path Towards Clinical Adaptation of Accelerated MRI.","authors":"Michael S Yao, Michael S Hansen","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Accelerated MRI reconstructs images of clinical anatomies from sparsely sampled signal data to reduce patient scan times. While recent works have leveraged deep learning to accomplish this task, such approaches have often only been explored in simulated environments where there is no signal corruption or resource limitations. In this work, we explore augmentations to neural network MRI image reconstructors to enhance their clinical relevancy. Namely, we propose a ConvNet model for detecting sources of image artifacts that achieves a classifier <i>F</i> <sub><i>2</i></sub> score of 79.1%. We also demonstrate that training reconstructors on MR signal data with variable acceleration factors can improve their average performance during a clinical patient scan by up to 2%. We offer a loss function to overcome catastrophic forgetting when models learn to reconstruct MR images of multiple anatomies and orientations. Finally, we propose a method for using simulated phantom data to pre-train reconstructors in situations with limited clinically acquired datasets and compute capabilities. Our results provide a potential path forward for clinical adaptation of accelerated MRI.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"193 ","pages":"489-511"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10061571/pdf/nihms-1846161.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9336136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Counterfactual and Factual Reasoning over Hypergraphs for Interpretable Clinical Predictions on EHR.
Ran Xu, Yue Yu, Chao Zhang, Mohammed K Ali, Joyce C Ho, Carl Yang
Electronic health record (EHR) modeling is crucial for digital medicine. However, existing models ignore higher-order interactions among medical codes and the causal relations between those codes and downstream clinical predictions. To address these limitations, we propose CACHE, a novel framework that provides effective and insightful clinical predictions by combining hypergraph representation learning with counterfactual and factual reasoning. Experiments on two real EHR datasets show the superior performance of CACHE. Case studies with a domain expert illustrate CACHE's ability to generate clinically meaningful interpretations for its correct predictions.
{"title":"Counterfactual and Factual Reasoning over Hypergraphs for Interpretable Clinical Predictions on EHR.","authors":"Ran Xu, Yue Yu, Chao Zhang, Mohammed K Ali, Joyce C Ho, Carl Yang","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Electronic Health Record modeling is crucial for digital medicine. However, existing models ignore higher-order interactions among medical codes and their causal relations towards downstream clinical predictions. To address such limitations, we propose a novel framework CACHE, to provide <i>effective</i> and <i>insightful</i> clinical predictions based on hypergraph representation learning and counterfactual and factual reasoning techniques. Experiments on two real EHR datasets show the superior performance of CACHE. Case studies with a domain expert illustrate a preferred capability of CACHE in generating clinically meaningful interpretations towards the correct predictions.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"193 ","pages":"259-278"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10227831/pdf/nihms-1901945.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9553346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated intracranial vessel labeling with learning boosted by vessel connectivity, radii and spatial context.
Jannik Sobisch, Žiga Bizjak, Aichi Chien, Žiga Špiclin
Cerebrovascular diseases are among the world's top causes of death, and their screening and diagnosis rely on angiographic imaging. We focused on automated anatomical labeling of cerebral arteries, which enables their cross-sectional quantification and inter-subject comparison and thereby the identification of geometric risk factors correlated with cerebrovascular disease. We used 152 cerebral TOF-MRA angiograms from three publicly available datasets and manually created a reference labeling using Slicer3D. We extracted centerlines from nnU-Net-based segmentations using VesselVio and labeled them according to the reference labeling. Vessel centerline coordinates, in combination with additional vessel connectivity, radius and spatial context features, were used to train seven distinct PointNet++ models. The model trained solely on the vessel centerline coordinates achieved an accuracy (ACC) of 0.93 and an across-labels average true positive rate (TPR) of 0.88. Including the vessel radius significantly improved the ACC to 0.95 and the average TPR to 0.91. Finally, focusing the spatial context on the circle of Willis area resulted in the best ACC of 0.96 and the best average TPR of 0.93. Hence, using vessel radius and spatial context greatly improved vessel labeling, with the attained performance opening an avenue for clinical applications of intracranial vessel labeling.
{"title":"Automated intracranial vessel labeling with learning boosted by vessel connectivity, radii and spatial context.","authors":"Jannik Sobisch, Žiga Bizjak, Aichi Chien, Žiga Špiclin","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Cerebrovascular diseases are among the world's top causes of death and their screening and diagnosis rely on angiographic imaging. We focused on automated anatomical labeling of cerebral arteries that enables their cross-sectional quantification and inter-subject comparisons and thereby identification of geometric risk factors correlated to the cerebrovascular diseases. We used 152 cerebral TOF-MRA angiograms from three publicly available datasets and manually created reference labeling using Slicer3D. We extracted centerlines from nnU-net based segmentations using VesselVio and labeled them according to the reference labeling. Vessel centerline coordinates, in combination with additional vessel connectivity, radius and spatial context features were used for training seven distinct PointNet++ models. Model trained solely on the vessel centerline coordinates resulted in ACC of 0.93 and across-labels average TPR was 0.88. Including vessel radius significantly improved ACC to 0.95, and average TPR to 0.91. Finally, focusing spatial context to the Circle of Willis are resulted in best ACC of 0.96 and best average TPR of 0.93. Hence, using vessel radius and spatial context greatly improved vessel labeling, with the attained perfomance opening the avenue for clinical applications of intracranial vessel labeling.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"194 ","pages":"34-44"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10112880/pdf/nihms-1889674.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9389427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Extensive Data Processing Pipeline for MIMIC-IV.
Mehak Gupta, Brennan Gallamoza, Nicolas Cutrona, Pranjal Dhakal, Raphael Poulain, Rahmatollah Beheshti
An increasing amount of research is being devoted to applying machine learning methods to electronic health record (EHR) data for various clinical purposes. This growing area of research has exposed the challenges of EHR accessibility. MIMIC is a popular, public, and free EHR dataset, provided in raw format, that has been used in numerous studies. However, the absence of standardized preprocessing steps can be a significant barrier to the wider adoption of this rare resource. It can also reduce the reproducibility of the developed tools and limit the ability to compare results across similar studies. In this work, we provide a highly customizable pipeline to extract, clean, and preprocess the data available in the fourth version of the MIMIC dataset (MIMIC-IV). The pipeline also presents an end-to-end, wizard-like package supporting predictive model creation and evaluation. The pipeline covers a range of clinical prediction tasks that can be broadly classified into four categories: readmission, length of stay, mortality, and phenotype prediction. The tool is publicly available at https://github.com/healthylaife/MIMIC-IV-Data-Pipeline.
{"title":"An Extensive Data Processing Pipeline for MIMIC-IV.","authors":"Mehak Gupta, Brennan Gallamoza, Nicolas Cutrona, Pranjal Dhakal, Raphael Poulain, Rahmatollah Beheshti","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>An increasing amount of research is being devoted to applying machine learning methods to electronic health record (EHR) data for various clinical purposes. This growing area of research has exposed the challenges of the accessibility of EHRs. MIMIC is a popular, public, and free EHR dataset in a raw format that has been used in numerous studies. The absence of standardized preprocessing steps can be, however, a significant barrier to the wider adoption of this rare resource. Additionally, this absence can reduce the reproducibility of the developed tools and limit the ability to compare the results among similar studies. In this work, we provide a greatly customizable pipeline to extract, clean, and preprocess the data available in the fourth version of the MIMIC dataset (MIMIC-IV). The pipeline also presents an end-to-end wizard-like package supporting predictive model creations and evaluations. The pipeline covers a range of clinical prediction tasks which can be broadly classified into four categories - readmission, length of stay, mortality, and phenotype prediction. The tool is publicly available at https://github.com/healthylaife/MIMIC-IV-Data-Pipeline.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"193 ","pages":"311-325"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9854277/pdf/nihms-1865425.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10604378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Contrastive Representation Learning for Gaze Estimation
Swati Jindal, R. Manduchi
Self-supervised learning (SSL) has become prevalent for learning representations in computer vision. Notably, SSL exploits contrastive learning to encourage visual representations to be invariant under various image transformations. The task of gaze estimation, on the other hand, demands not just invariance to varying appearance but also equivariance to geometric transformations. In this work, we propose a simple contrastive representation learning framework for gaze estimation, named Gaze Contrastive Learning (GazeCLR). GazeCLR exploits multi-view data to promote equivariance and relies on selected data augmentation techniques that do not alter gaze directions for invariance learning. Our experiments demonstrate the effectiveness of GazeCLR for several settings of the gaze estimation task. In particular, our results show that GazeCLR improves the performance of cross-domain gaze estimation, yielding up to a 17.2% relative improvement. Moreover, the GazeCLR framework is competitive with state-of-the-art representation learning methods under few-shot evaluation. The code and pre-trained models are available at https://github.com/jswati31/gazeclr.
{"title":"Contrastive Representation Learning for Gaze Estimation","authors":"Swati Jindal, R. Manduchi","doi":"10.48550/arXiv.2210.13404","DOIUrl":"https://doi.org/10.48550/arXiv.2210.13404","url":null,"abstract":"Self-supervised learning (SSL) has become prevalent for learning representations in computer vision. Notably, SSL exploits contrastive learning to encourage visual representations to be invariant under various image transformations. The task of gaze estimation, on the other hand, demands not just invariance to various appearances but also equivariance to the geometric transformations. In this work, we propose a simple contrastive representation learning framework for gaze estimation, named Gaze Contrastive Learning (GazeCLR). GazeCLR exploits multi-view data to promote equivariance and relies on selected data augmentation techniques that do not alter gaze directions for invariance learning. Our experiments demonstrate the effectiveness of GazeCLR for several settings of the gaze estimation task. Particularly, our results show that GazeCLR improves the performance of cross-domain gaze estimation and yields as high as 17.2% relative improvement. Moreover, the GazeCLR framework is competitive with state-of-the-art representation learning methods for few-shot evaluation. The code and pre-trained models are available at https://github.com/jswati31/gazeclr.","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"210 1","pages":"37-49"},"PeriodicalIF":0.0,"publicationDate":"2022-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42458334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DIET: Conditional independence testing with marginal dependence measures of residual information
Mukund Sudarshan, A. Puli, Wesley Tansey, R. Ranganath
Conditional randomization tests (CRTs) assess whether a variable x is predictive of another variable y, having observed covariates z. CRTs require fitting a large number of predictive models, which is often computationally intractable. Existing solutions to reduce the cost of CRTs typically split the dataset into a train and test portion, or rely on heuristics for interactions, both of which lead to a loss in power. We propose the decoupled independence test (DIET), an algorithm that avoids both of these issues by leveraging marginal independence statistics to test conditional independence relationships. DIET tests the marginal independence of two random variables: F_{x|z}(x|z) and F_{y|z}(y|z), where F_{·|z}(·|z) is the conditional cumulative distribution function (CDF) of the distribution p(·|z). These variables are termed "information residuals." We give sufficient conditions for DIET to achieve finite sample type-1 error control and power greater than the type-1 error rate. We then prove that when using the mutual information between the information residuals as a test statistic, DIET yields the most powerful conditionally valid test. Finally, we show DIET achieves higher power than other tractable CRTs on several synthetic and real benchmarks.
{"title":"DIET: Conditional independence testing with marginal dependence measures of residual information","authors":"Mukund Sudarshan, A. Puli, Wesley Tansey, R. Ranganath","doi":"10.48550/arXiv.2208.08579","DOIUrl":"https://doi.org/10.48550/arXiv.2208.08579","url":null,"abstract":"Conditional randomization tests (CRTs) assess whether a variable x is predictive of another variable y, having observed covariates z. CRTs require fitting a large number of predictive models, which is often computationally intractable. Existing solutions to reduce the cost of CRTs typically split the dataset into a train and test portion, or rely on heuristics for interactions, both of which lead to a loss in power. We propose the decoupled independence test (DIET), an algorithm that avoids both of these issues by leveraging marginal independence statistics to test conditional independence relationships. DIET tests the marginal independence of two random variables: Fx∣z(x∣z) and Fy∣z(y∣z) where F⋅∣z(⋅∣z) is a conditional cumulative distribution function (CDF) for the distribution p(⋅∣z). These variables are termed \"information residuals.\" We give sufficient conditions for DIET to achieve finite sample type-1 error control and power greater than the type-1 error rate. We then prove that when using the mutual information between the information residuals as a test statistic, DIET yields the most powerful conditionally valid test. Finally, we show DIET achieves higher power than other tractable CRTs on several synthetic and real benchmarks.","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"206 1","pages":"10343-10367"},"PeriodicalIF":0.0,"publicationDate":"2022-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43051329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Imputation Strategies Under Clinical Presence: Impact on Algorithmic Fairness
V. Jeanselme, Maria De-Arteaga, Zhe Zhang, J. Barrett, Brian D. M. Tom
Biases have marked medical history, leading to unequal care affecting marginalised groups. The patterns of missingness in observational data often reflect these group discrepancies, but the algorithmic fairness implications of group-specific missingness are not well understood. Despite its potential impact, imputation is too often an overlooked preprocessing step. When explicitly considered, attention is placed on overall performance, ignoring how this preprocessing can reinforce group-specific inequities. Our work questions this choice by studying how imputation affects downstream algorithmic fairness. First, we provide a structured view of the relationship between clinical presence mechanisms and group-specific missingness patterns. Then, through simulations and real-world experiments, we demonstrate that the imputation choice influences marginalised group performance and that no imputation strategy consistently reduces disparities. Importantly, our results show that current practices may endanger health equity, as imputation strategies that perform similarly at the population level can affect marginalised groups differently. Finally, we propose recommendations for mitigating inequities that may stem from this neglected step of the machine learning pipeline.
{"title":"Imputation Strategies Under Clinical Presence: Impact on Algorithmic Fairness","authors":"V. Jeanselme, Maria De-Arteaga, Zhe Zhang, J. Barrett, Brian D. M. Tom","doi":"10.48550/arXiv.2208.06648","DOIUrl":"https://doi.org/10.48550/arXiv.2208.06648","url":null,"abstract":"Biases have marked medical history, leading to unequal care affecting marginalised groups. The patterns of missingness in observational data often reflect these group discrepancies, but the algorithmic fairness implications of group-specific missingness are not well understood. Despite its potential impact, imputation is too often an overlooked preprocessing step. When explicitly considered, attention is placed on overall performance, ignoring how this preprocessing can reinforce groupspecific inequities. Our work questions this choice by studying how imputation affects downstream algorithmic fairness. First, we provide a structured view of the relationship between clinical presence mechanisms and groupspecific missingness patterns. Then, through simulations and real-world experiments, we demonstrate that the imputation choice influences marginalised group performance and that no imputation strategy consistently reduces disparities. Importantly, our results show that current practices may endanger health equity as similarly performing imputation strategies at the population level can affect marginalised groups differently. Finally, we propose recommendations for mitigating inequities that may stem from a neglected step of the machine learning pipeline.","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"193 1","pages":"12 - 34"},"PeriodicalIF":0.0,"publicationDate":"2022-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43981392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bias Aware Probabilistic Boolean Matrix Factorization.
Changlin Wan, Pengtao Dang, Tong Zhao, Yong Zang, Chi Zhang, Sha Cao
Boolean matrix factorization (BMF) is a combinatorial problem arising from a wide range of applications, including recommender systems, collaborative filtering, and dimensionality reduction. The noise model of existing BMF methods is usually assumed to be homoscedastic; however, in real-world data the deviations of observed values from their true values are almost surely diverse due to stochastic noise, so that not every data point is equally suitable for fitting a model. In this case, it is not ideal to treat all data points as equally distributed. Motivated by these observations, we introduce a probabilistic BMF model, called bias-aware BMF (BABF), that separately models object-wise and feature-wise bias distributions. To the best of our knowledge, BABF is the first approach to Boolean decomposition that accounts for feature-wise and object-wise bias in binary data. We conducted experiments on datasets with different levels of background noise, bias, and signal-pattern size to test the effectiveness of our method in various scenarios. We demonstrate that our model outperforms state-of-the-art factorization methods in both accuracy and efficiency in recovering the original datasets, and that the inferred bias level is significantly correlated with the true underlying bias in both simulated and real-world datasets.
{"title":"Bias Aware Probabilistic Boolean Matrix Factorization.","authors":"Changlin Wan, Pengtao Dang, Tong Zhao, Yong Zang, Chi Zhang, Sha Cao","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Boolean matrix factorization (BMF) is a combinatorial problem arising from a wide range of applications including recommendation system, collaborative filtering, and dimensionality reduction. Currently, the noise model of existing BMF methods is often assumed to be homoscedastic; however, in real world data scenarios, the deviations of observed data from their true values are almost surely diverse due to stochastic noises, making each data point not equally suitable for fitting a model. In this case, it is not ideal to treat all data points as equally distributed. Motivated by such observations, we introduce a probabilistic BMF model that recognizes the object- and feature-wise bias distribution respectively, called bias aware BMF (BABF). To the best of our knowledge, BABF is the first approach for Boolean decomposition with consideration of the feature-wise and object-wise bias in binary data. We conducted experiments on datasets with different levels of background noise, bias level, and sizes of the signal patterns, to test the effectiveness of our method in various scenarios. We demonstrated that our model outperforms the state-of-the-art factorization methods in both accuracy and efficiency in recovering the original datasets, and the inferred bias level is highly significantly correlated with true existing bias in both simulated and real world datasets.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"180 ","pages":"2035-2044"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10421704/pdf/nihms-1891928.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10060510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}