Junhao Liu, Siwei Xu, Dylan Riffle, Ziheng Duan, Martin Renqiang Min, Jing Zhang
Transcriptional regulation through cis-regulatory elements (CREs) is crucial for numerous biological functions, and its disruption can lead to various diseases. CREs often exhibit redundancy, compensating for one another in response to external disturbances, which highlights the need for methods that identify sets of CREs that collaboratively regulate gene expression. To address this, we introduce GRIDS, an in silico method that approaches the task as a global feature explanation challenge and dissects combinatorial CRE effects in two phases. First, GRIDS constructs a differentiable surrogate function to mirror the complex gene regulatory process, facilitating cross-translations between single-cell modalities. It then employs learnable perturbations within a state transition framework to offer global explanations, efficiently navigating the combinatorial feature landscape. Through comprehensive benchmarks, GRIDS demonstrates superior explanatory capabilities compared with other leading methods. Moreover, its global explanations reveal intricate regulatory redundancy across cell types and states, underscoring its potential to advance our understanding of cellular regulation in biological research.
{"title":"Understanding Transcriptional Regulatory Redundancy by Learnable Global Subset Perturbations.","authors":"Junhao Liu, Siwei Xu, Dylan Riffle, Ziheng Duan, Martin Renqiang Min, Jing Zhang","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Transcriptional regulation through cis-regulatory elements (CREs) is crucial for numerous biological functions, with its disruption potentially leading to various diseases. It is well-known that these CREs often exhibit redundancy, allowing them to compensate for each other in response to external disturbances, highlighting the need for methods to identify CRE sets that collaboratively regulate gene expression effectively. To address this, we introduce GRIDS, an in silico computational method that approaches the task as a global feature explanation challenge to dissect combinatorial CRE effects in two phases. First, GRIDS constructs a differentiable surrogate function to mirror the complex gene regulatory process, facilitating cross-translations in single-cell modalities. It then employs learnable perturbations within a state transition framework to offer global explanations, efficiently navigating the combinatorial feature landscape. Through comprehensive benchmarks, GRIDS demonstrates superior explanatory capabilities compared to other leading methods. Moreover, GRIDS's global explanations reveal intricate regulatory redundancy across cell types and states, underscoring its potential to advance our understanding of cellular regulation in biological research.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"260 ","pages":"383-398"},"PeriodicalIF":0.0,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12694376/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145745962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hamed Fayyaz, Mehak Gupta, Alejandra Perez Ramirez, Claudine Jurkovitz, H Timothy Bunnell, Thao-Ly T Phan, Rahmatollah Beheshti
Reliable prediction of pediatric obesity can offer a valuable resource to providers, helping them engage in timely preventive interventions before the disease is established. Many efforts have been made to develop ML-based predictive models of obesity, and some studies have reported high predictive performances. However, no widely used clinical decision support tool based on these models currently exists. This study presents a novel end-to-end pipeline specifically designed for pediatric obesity prediction, which supports the entire process of data extraction, inference, and communication via an API or a user interface. Focusing only on routinely recorded data in pediatric electronic health records (EHRs), our pipeline uses a diverse expert-curated list of medical concepts to predict the 1-3 year risk of developing obesity. Furthermore, by using the Fast Healthcare Interoperability Resources (FHIR) standard in our design procedure, we specifically aim to facilitate low-effort integration of our pipeline with different EHR systems. In our experiments, we report the effectiveness of the predictive model as well as its alignment with the feedback from various stakeholders, including ML scientists, providers, health IT personnel, health administration representatives, and patient group representatives.
{"title":"An Interoperable Machine Learning Pipeline for Pediatric Obesity Risk Estimation.","authors":"Hamed Fayyaz, Mehak Gupta, Alejandra Perez Ramirez, Claudine Jurkovitz, H Timothy Bunnell, Thao-Ly T Phan, Rahmatollah Beheshti","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Reliable prediction of pediatric obesity can offer a valuable resource to providers, helping them engage in timely preventive interventions before the disease is established. Many efforts have been made to develop ML-based predictive models of obesity, and some studies have reported high predictive performances. However, no commonly used clinical decision support tool based on existing ML models currently exists. This study presents a novel end-to-end pipeline specifically designed for pediatric obesity prediction, which supports the entire process of data extraction, inference, and communication via an API or a user interface. While focusing only on routinely recorded data in pediatric electronic health records (EHRs), our pipeline uses a diverse expert-curated list of medical concepts to predict the 1-3 years risk of developing obesity. Furthermore, by using the Fast Healthcare Interoperability Resources (FHIR) standard in our design procedure, we specifically target facilitating low-effort integration of our pipeline with different EHR systems. In our experiments, we report the effectiveness of the predictive model as well as its alignment with the feedback from various stakeholders, including ML scientists, providers, health IT personnel, health administration representatives, and patient group representatives.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"259 ","pages":"308-324"},"PeriodicalIF":0.0,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11884402/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143574461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yu Song, Haitao Mao, Jiachen Xiao, Jingzhe Liu, Zhikai Chen, Wei Jin, Carl Yang, Jiliang Tang, Hui Liu
Pretraining plays a pivotal role in acquiring generalized knowledge from large-scale data, achieving remarkable successes as evidenced by large models in CV and NLP. However, progress in the graph domain remains limited due to fundamental challenges represented by feature heterogeneity and structural heterogeneity. Recent efforts have been made to address feature heterogeneity via Large Language Models (LLMs) on text-attributed graphs (TAGs) by generating fixed-length text representations as node features. These high-quality features reduce the previously critical role of graph structure, resulting in a modest performance gap between Graph Neural Networks (GNNs) and structure-agnostic Multi-Layer Perceptrons (MLPs). Motivated by this, we introduce a feature-centric pretraining perspective by treating graph structure as a prior and leveraging the rich, unified feature space to learn refined interaction patterns that generalize across graphs. Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks and employs masked feature reconstruction to capture pairwise proximity in the LLM-unified feature space using a standard Transformer. By utilizing unified text representations rather than varying structures, GSPT alleviates structural heterogeneity and achieves significantly better transferability among graphs within the same domain. Our approach can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets. The source code is publicly available at https://github.com/SongYYYY/GSPT.
{"title":"A Pure Transformer Pretraining Framework on Text-attributed Graphs.","authors":"Yu Song, Haitao Mao, Jiachen Xiao, Jingzhe Liu, Zhikai Chen, Wei Jin, Carl Yang, Jiliang Tang, Hui Liu","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Pretraining plays a pivotal role in acquiring generalized knowledge from large-scale data, achieving remarkable successes as evidenced by large models in CV and NLP. However, progress in the graph domain remains limited due to fundamental challenges represented by feature heterogeneity and structural heterogeneity. Recent efforts have been made to address feature heterogeneity via Large Language Models (LLMs) on text-attributed graphs (TAGs) by generating fixed-length text representations as node features. These high-quality features reduce the previously critical role of graph structure, resulting in a modest performance gap between Graph Neural Networks (GNNs) and structure-agnostic Multi-Layer Perceptrons (MLPs). Motivated by this, we introduce a feature-centric pretraining perspective by treating graph structure as a prior and leveraging the rich, unified feature space to learn refined interaction patterns that generalizes across graphs. Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walk and employs masked feature reconstruction to capture pairwise proximity in the LLM-unified feature space using a standard Transformer. By utilizing unified text representations rather than varying structures, GSPT alleviates structural heterogeneity and achieves significantly better transferability among graphs within the same domain. Our approach can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets. The source code is publicly available at https://github.com/SongYYYY/GSPT.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"269 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12416796/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145031307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oladimeji Macaulay, Michael Servilla, David Arredondo, Kushal Virupakshappa, Yue Hu, Luis Tafoya, Yanfu Zhang, Avinash Sahu
Genetic, molecular, and environmental factors influence diseases through complex interactions with genes, phenotypes, and drugs. Current methods often fail to integrate diverse multi-relational biological data meaningfully, limiting the discovery of novel risk genes and drugs. To address this, we present MedGraphNet, a multi-relational Graph Neural Network (GNN) model designed to infer relationships among drugs, genes, diseases, and phenotypes. MedGraphNet initializes nodes using informative embeddings from existing text knowledge, allowing for robust integration of various data types and improved generalizability. Our results demonstrate that MedGraphNet matches and often outperforms traditional single-relation approaches, particularly in scenarios with isolated or sparsely connected nodes. The model shows generalizability to external datasets, achieving high accuracy in identifying disease-gene associations and drug-phenotype relationships. Notably, MedGraphNet accurately inferred drug side effects without direct training on such data. Using Alzheimer's disease as a case study, MedGraphNet successfully identified relevant phenotypes, genes, and drugs, corroborated by existing literature. These findings demonstrate the potential of integrating multi-relational data with text knowledge to enhance biomedical predictions and drug repurposing for diseases. MedGraphNet code is available at https://github.com/vinash85/MedGraphNet.
{"title":"<i>MedGraphNet</i>: Leveraging Multi-Relational Graph Neural Networks and Text Knowledge for Biomedical Predictions.","authors":"Oladimeji Macaulay, Michael Servilla, David Arredondo, Kushal Virupakshappa, Yue Hu, Luis Tafoya, Yanfu Zhang, Avinash Sahu","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Genetic, molecular, and environmental factors influence diseases through complex interactions with genes, phenotypes, and drugs. Current methods often fail to integrate diverse multi-relational biological data meaningfully, limiting the discovery of novel risk genes and drugs. To address this, we present <i>MedGraphNet</i>, a multi-relational Graph Neural Network (GNN) model designed to infer relationships among drugs, genes, diseases, and phenotypes. <i>MedGraphNet</i> initializes nodes using informative embeddings from existing text knowledge, allowing for robust integration of various data types and improved generalizability. Our results demonstrate that <i>MedGraphNet</i> matches and often outperforms traditional single-relation approaches, particularly in scenarios with isolated or sparsely connected nodes. The model shows generalizability to external datasets, achieving high accuracy in identifying disease-gene associations and drug-phenotype relationships. Notably, <i>MedGraphNet</i> accurately inferred drug side effects without direct training on such data. Using Alzheimer's disease as a case study, <i>MedGraphNet</i> successfully identified relevant phenotypes, genes, and drugs, corroborated by existing literature. These findings demonstrate the potential of integrating multi-relational data with text knowledge to enhance biomedical predictions and drug repurposing for diseases. <i>MedGraphNet</i> code is available at https://github.com/vinash85/MedGraphNet.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"261 ","pages":"162-182"},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12424194/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145066688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hamed Fayyaz, Niharika S D'Souza, Rahmatollah Beheshti
Polysomnography (PSG) is a type of sleep study that records multimodal physiological signals and is widely used for purposes such as sleep staging and respiratory event detection. Conventional machine learning methods assume that each sleep study is associated with a fixed set of observed modalities and that all modalities are available for each sample. However, noisy and missing modalities are a common issue in real-world clinical settings. In this study, we propose a comprehensive pipeline aiming to compensate for the missing or noisy modalities when performing sleep apnea detection. Unlike other existing studies, our proposed model works with any combination of available modalities. Our experiments show that the proposed model outperforms other state-of-the-art approaches in sleep apnea detection using various subsets of available data and different levels of noise, and maintains its high performance (AUROC>0.9) even in the presence of high levels of noise or missingness. This is especially relevant in settings where the level of noise and missingness is high (such as pediatric or outside-of-clinic scenarios). Our code is publicly available at https://github.com/healthylaife/apnea-missing-modality.
{"title":"Multimodal Sleep Apnea Detection with Missing or Noisy Modalities.","authors":"Hamed Fayyaz, Niharika S D'Souza, Rahmatollah Beheshti","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Polysomnography (PSG) is a type of sleep study that records multimodal physiological signals and is widely used for purposes such as sleep staging and respiratory event detection. Conventional machine learning methods assume that each sleep study is associated with a fixed set of observed modalities and that all modalities are available for each sample. However, noisy and missing modalities are a common issue in real-world clinical settings. In this study, we propose a comprehensive pipeline aiming to compensate for the missing or noisy modalities when performing sleep apnea detection. Unlike other existing studies, our proposed model works with any combination of available modalities. Our experiments show that the proposed model outperforms other state-of-the-art approaches in sleep apnea detection using various subsets of available data and different levels of noise, and maintains its high performance (AUROC>0.9) even in the presence of high levels of noise or missingness. This is especially relevant in settings where the level of noise and missingness is high (such as pediatric or outside-of-clinic scenarios). Our code is publicly available at https://github.com/healthylaife/apnea-missing-modality.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"252 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11893010/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143598009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hong Xiong, Feng Wu, Leon Deng, Megan Su, Zach Shahn, Li-Wei H Lehman
In the context of medical decision making, counterfactual prediction enables clinicians to predict treatment outcomes of interest under alternative courses of therapeutic actions given observed patient history. In this work, we present G-Transformer for counterfactual outcome prediction under dynamic and time-varying treatment strategies. Our approach leverages a Transformer architecture to capture complex, long-range dependencies in time-varying covariates while enabling g-computation, a causal inference method for estimating the effects of dynamic treatment regimes. Specifically, we use a Transformer-based encoder architecture to estimate the conditional distribution of relevant covariates given covariate and treatment history at each time point, and then produce Monte Carlo estimates of counterfactual outcomes by simulating forward patient trajectories under treatment strategies of interest. We evaluate G-Transformer extensively using two simulated longitudinal datasets from mechanistic models, and a real-world sepsis ICU dataset from MIMIC-IV. G-Transformer outperforms both classical and state-of-the-art counterfactual prediction models in these settings. To the best of our knowledge, this is the first Transformer-based architecture that supports g-computation for counterfactual outcome prediction under dynamic and time-varying treatment strategies.
{"title":"G-Transformer: Counterfactual Outcome Prediction under Dynamic and Time-varying Treatment Regimes.","authors":"Hong Xiong, Feng Wu, Leon Deng, Megan Su, Zach Shahn, Li-Wei H Lehman","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>In the context of medical decision making, counterfactual prediction enables clinicians to predict treatment outcomes of interest under alternative courses of therapeutic actions given observed patient history. In this work, we present G-Transformer for counterfactual outcome prediction under dynamic and time-varying treatment strategies. Our approach leverages a Transformer architecture to capture complex, long-range dependencies in time-varying covariates while enabling g-computation, a causal inference method for estimating the effects of dynamic treatment regimes. Specifically, we use a Transformer-based encoder architecture to estimate the conditional distribution of relevant covariates given covariate and treatment history at each time point, then produces Monte Carlo estimates of counterfactual outcomes by simulating forward patient trajectories under treatment strategies of interest. We evaluate G-Transformer extensively using two simulated longitudinal datasets from mechanistic models, and a real-world sepsis ICU dataset from MIMIC-IV. G-Transformer outperforms both classical and state-of-the-art counterfactual prediction models in these settings. To the best of our knowledge, this is the first Transformer-based architecture that supports g-computation for counterfactual outcome prediction under dynamic and time-varying treatment strategies.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"252 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12113242/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144164074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hye Sun Yun, David Pogrebitskiy, Iain J Marshall, Byron C Wallace
Meta-analyses statistically aggregate the findings of different randomized controlled trials (RCTs) to assess treatment effectiveness. Because this yields robust estimates of treatment effectiveness, results from meta-analyses are considered the strongest form of evidence. However, rigorous evidence syntheses are time-consuming and labor-intensive, requiring manual extraction of data from individual trials to be synthesized. Ideally, language technologies would permit fully automatic meta-analysis, on demand. This requires accurately extracting numerical results from individual trials, which has been beyond the capabilities of natural language processing (NLP) models to date. In this work, we evaluate whether modern large language models (LLMs) can reliably perform this task. We annotate (and release) a modest but granular evaluation dataset of clinical trial reports with numerical findings attached to interventions, comparators, and outcomes. Using this dataset, we evaluate the performance of seven LLMs applied zero-shot for the task of conditionally extracting numerical findings from trial reports. We find that massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality). However, LLMs, including ones trained on biomedical texts, perform poorly when the outcome measures are complex and tallying the results requires inference. This work charts a path toward fully automatic meta-analysis of RCTs via LLMs, while also highlighting the limitations of existing models for this aim.
{"title":"Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models.","authors":"Hye Sun Yun, David Pogrebitskiy, Iain J Marshall, Byron C Wallace","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Meta-analyses statistically aggregate the findings of different randomized controlled trials (RCTs) to assess treatment effectiveness. Because this yields robust estimates of treatment effectiveness, results from meta-analyses are considered the strongest form of evidence. However, rigorous evidence syntheses are time-consuming and labor-intensive, requiring manual extraction of data from individual trials to be synthesized. Ideally, language technologies would permit fully automatic meta-analysis, on demand. This requires accurately extracting numerical results from individual trials, which has been beyond the capabilities of natural language processing (NLP) models to date. In this work, we evaluate whether modern large language models (LLMs) can reliably perform this task. We annotate (and release) a modest but granular evaluation dataset of clinical trial reports with numerical findings attached to interventions, comparators, and outcomes. Using this dataset, we evaluate the performance of seven LLMs applied zero-shot for the task of conditionally extracting numerical findings from trial reports. We find that massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality). However, LLMs-including ones trained on biomedical texts-perform poorly when the outcome measures are complex and tallying the results requires inference. This work charts a path toward fully automatic meta-analysis of RCTs via LLMs, while also highlighting the limitations of existing models for this aim.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"252 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448672/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145115185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Liam Collins, Hamed Hassani, Mahdi Soltanolkotabi, Aryan Mokhtari, Sanjay Shakkottai
An increasingly popular machine learning paradigm is to pretrain a neural network (NN) on many tasks offline, then adapt it to downstream tasks, often by re-training only the last linear layer of the network. This approach yields strong downstream performance in a variety of contexts, demonstrating that multitask pretraining leads to effective feature learning. Although several recent theoretical studies have shown that shallow NNs learn meaningful features when either (i) they are trained on a single task or (ii) they are linear, very little is known about the closer-to-practice case of nonlinear NNs trained on multiple tasks. In this work, we present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks. Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks. Using this observation, we show that when the tasks are binary classification tasks with labels depending on the projection of the data onto an r-dimensional subspace within the d ≫ r-dimensional input space, a simple gradient-based multitask learning algorithm on a two-layer ReLU NN recovers this projection, allowing for generalization to downstream tasks with sample and neuron complexity independent of d. In contrast, we show that with high probability over the draw of a single task, training on this single task cannot guarantee to learn all r ground-truth features.
{"title":"Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks.","authors":"Liam Collins, Hamed Hassani, Mahdi Soltanolkotabi, Aryan Mokhtari, Sanjay Shakkottai","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>An increasingly popular machine learning paradigm is to pretrain a neural network (NN) on many tasks offline, then adapt it to downstream tasks, often by re-training only the last linear layer of the network. This approach yields strong downstream performance in a variety of contexts, demonstrating that multitask pretraining leads to effective feature learning. Although several recent theoretical studies have shown that shallow NNs learn meaningful features when either (i) they are trained on a <i>single</i> task or (ii) they are <i>linear</i>, very little is known about the closer-to-practice case of <i>nonlinear</i> NNs trained on <i>multiple</i> tasks. In this work, we present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks. Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks. Using this observation, we show that when the tasks are binary classification tasks with labels depending on the projection of the data onto an <math><mi>r</mi></math> -dimensional subspace within the <math><mi>d</mi> <mo>≫</mo> <mi>r</mi></math> -dimensional input space, a simple gradient-based multitask learning algorithm on a two-layer ReLU NN recovers this projection, allowing for generalization to downstream tasks with sample and neuron complexity independent of <math><mi>d</mi></math> . In contrast, we show that with high probability over the draw of a single task, training on this single task cannot guarantee to learn all <math><mi>r</mi></math> ground-truth features.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"235 ","pages":"9292-9345"},"PeriodicalIF":0.0,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11486479/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142482676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov
Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, extending it to a BiMamba component that supports bi-directionality and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC-equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of 10x larger models that do not leverage bi-directionality or equivariance. Code to reproduce our experiments is available here.
{"title":"Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling.","authors":"Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of <math><mrow><mn>10</mn> <mi>x</mi></mrow> </math> larger models that do not leverage bi-directionality or equivariance. Code to reproduce our experiments is available here.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"235 ","pages":"43632-43648"},"PeriodicalIF":0.0,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12189541/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144499715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weihan Li, Chengrui Li, Yule Wang, Anqi Wu
Studying the complex interactions between different brain regions is crucial in neuroscience. Various statistical methods have explored the latent communication across multiple brain regions. Two main categories are the Gaussian Process (GP) and Linear Dynamical System (LDS), each with unique strengths. The GP-based approach effectively discovers latent variables with frequency bands and communication directions. Conversely, the LDS-based approach is computationally efficient but lacks powerful expressiveness in latent representation. In this study, we merge both methodologies by creating an LDS mirroring a multi-output GP, termed Multi-Region Markovian Gaussian Process (MRM-GP). Our work establishes a connection between an LDS and a multi-output GP that explicitly models frequencies and phase delays within the latent space of neural recordings. Consequently, the model achieves a linear inference cost over time points and provides an interpretable low-dimensional representation, revealing communication directions across brain regions and separating oscillatory communications into different frequency bands.
{"title":"Multi-Region Markovian Gaussian Process: An Efficient Method to Discover Directional Communications Across Multiple Brain Regions.","authors":"Weihan Li, Chengrui Li, Yule Wang, Anqi Wu","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Studying the complex interactions between different brain regions is crucial in neuroscience. Various statistical methods have explored the latent communication across multiple brain regions. Two main categories are the Gaussian Process (GP) and Linear Dynamical System (LDS), each with unique strengths. The GP-based approach effectively discovers latent variables with frequency bands and communication directions. Conversely, the LDS-based approach is computationally efficient but lacks powerful expressiveness in latent representation. In this study, we merge both methodologies by creating an LDS mirroring a multi-output GP, termed Multi-Region Markovian Gaussian Process (MRM-GP). Our work establishes a connection between an LDS and a multi-output GP that explicitly models frequencies and phase delays within the latent space of neural recordings. Consequently, the model achieves a linear inference cost over time points and provides an interpretable low-dimensional representation, revealing communication directions across brain regions and separating oscillatory communications into different frequency bands.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"235 ","pages":"28112-28131"},"PeriodicalIF":0.0,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11526605/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142559682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}