Pub Date: 2022-03-01. Epub Date: 2022-03-28. DOI: 10.1214/21-aoas1517
Andrew J Holbrook, Xiang Ji, Marc A Suchard
Self-exciting spatiotemporal Hawkes processes have found increasing use in the study of large-scale public health threats, ranging from gun violence and earthquakes to wildfires and viral contagion. Whereas many such applications feature locational uncertainty, that is, the exact spatial positions of individual events are unknown, most Hawkes model analyses to date have ignored spatial coarsening present in the data. Three particular 21st century public health crises-urban gun violence, rural wildfires and global viral spread-present qualitatively and quantitatively varying uncertainty regimes that exhibit: (a) different collective magnitudes of spatial coarsening, (b) uniform and mixed magnitude coarsening, (c) differently shaped uncertainty regions and-less orthodox-(d) locational data distributed within the "wrong" effective space. We explicitly model such uncertainties in a Bayesian manner and jointly infer unknown locations together with all parameters of a reasonably flexible Hawkes model, obtaining results that are practically and statistically distinct from those obtained while ignoring spatial coarsening. This work also features two different secondary contributions: first, to facilitate Bayesian inference of locations and background rate parameters, we make a subtle yet crucial change to an established kernel-based rate model, and second, to facilitate the same Bayesian inference at scale, we develop a massively parallel implementation of the model's log-likelihood gradient with respect to locations and thus avoid its quadratic computational cost in the context of Hamiltonian Monte Carlo. Our examples involve thousands of observations and allow us to demonstrate practicality at moderate scales.
BAYESIAN MITIGATION OF SPATIAL COARSENING FOR A HAWKES MODEL APPLIED TO GUNFIRE, WILDFIRE AND VIRAL CONTAGION. Annals of Applied Statistics 16(1): 573-595. DOI: 10.1214/21-aoas1517. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9536472/pdf/nihms-1797628.pdf
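To make the self-excitation idea concrete, here is a minimal sketch of a purely temporal Hawkes log-likelihood with exponential triggering. This is an illustration, not the authors' spatiotemporal model; the function name and parameterization are ours, and the naive inner loop exhibits exactly the quadratic cost that the paper's massively parallel gradient implementation targets.

```python
import numpy as np

def hawkes_loglik(times, mu, alpha, beta, T):
    """Log-likelihood of a temporal Hawkes process observed on [0, T] with
    intensity lambda(t) = mu + alpha * sum_{t_i < t} beta * exp(-beta*(t - t_i)).
    `times` must be a sorted 1-D array of event times. The double loop over
    event pairs is O(n^2), which motivates parallel implementations at scale."""
    times = np.asarray(times, dtype=float)
    ll = 0.0
    for j, t in enumerate(times):
        excitation = alpha * beta * np.exp(-beta * (t - times[:j])).sum()
        ll += np.log(mu + excitation)
    # subtract the compensator: the integral of lambda(t) over [0, T]
    ll -= mu * T + alpha * np.sum(1.0 - np.exp(-beta * (T - times)))
    return ll
```

With `alpha = 0` the process reduces to a homogeneous Poisson process with rate `mu`, which gives a quick sanity check on the formula.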
Pub Date: 2022-03-01. Epub Date: 2022-03-28. DOI: 10.1214/21-AOAS1495
Eric F Lock, Jun Young Park, Katherine A Hoadley
Several modern applications require the integration of multiple large data matrices that have shared rows and/or columns. For example, cancer studies that integrate multiple omics platforms across multiple types of cancer, pan-omics pan-cancer analysis, have extended our knowledge of molecular heterogeneity beyond what was observed in single tumor and single platform studies. However, these studies have been limited by available statistical methodology. We propose a flexible approach to the simultaneous factorization and decomposition of variation across such bidimensionally linked matrices, BIDIFAC+. BIDIFAC+ decomposes variation into a series of low-rank components that may be shared across any number of row sets (e.g., omics platforms) or column sets (e.g., cancer types). This builds on a growing literature for the factorization and decomposition of linked matrices which has primarily focused on multiple matrices that are linked in one dimension (rows or columns) only. Our objective function extends nuclear norm penalization, is motivated by random matrix theory, gives a unique decomposition under relatively mild conditions, and can be shown to give the mode of a Bayesian posterior distribution. We apply BIDIFAC+ to pan-omics pan-cancer data from TCGA, identifying shared and specific modes of variability across four different omics platforms and 29 different cancer types.
BIDIMENSIONAL LINKED MATRIX FACTORIZATION FOR PAN-OMICS PAN-CANCER ANALYSIS. Annals of Applied Statistics 16(1): 193-215. DOI: 10.1214/21-AOAS1495. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9060567/pdf/
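Nuclear-norm-penalized objectives of the kind extended here are typically optimized with singular-value soft-thresholding. The sketch below shows that generic building block only, not the authors' structured bidimensional algorithm; the function name is ours.

```python
import numpy as np

def svd_soft_threshold(X, lam):
    """Proximal operator of lam * (nuclear norm): soft-threshold the
    singular values of X. Repeatedly applying this within a block
    update is the standard workhorse for low-rank matrix recovery;
    BIDIFAC+ embeds such shrinkage in a bidimensionally linked
    decomposition (see the paper for the full objective)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt
```

Thresholding at a level above the largest singular value annihilates the matrix, while `lam = 0` returns it unchanged.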
Yuliang Li, Yang Ni, Leah H Rubin, Amanda B Spence, Yanxun Xu
Access and adherence to antiretroviral therapy (ART) have transformed the face of HIV infection from a fatal to a chronic disease. However, ART is also known for its side effects. Studies have reported that ART is associated with depressive symptomatology. Large-scale HIV clinical databases with individuals' longitudinal depression records, ART medications, and clinical characteristics offer researchers unprecedented opportunities to study the effects of ART drugs on depression over time. We develop BAGEL, a Bayesian graphical model to investigate longitudinal effects of ART drugs on a range of depressive symptoms while adjusting for participants' demographic, behavioral, and clinical characteristics, and taking into account the heterogeneous population through a Bayesian nonparametric prior. We evaluate BAGEL through simulation studies. Application to a dataset from the Women's Interagency HIV Study yields interpretable and clinically useful results. BAGEL not only can improve our understanding of ART drug effects on disparate depression symptoms, but also has clinical utility in guiding informed and effective treatment selection to facilitate precision medicine in HIV.
BAGEL: A BAYESIAN GRAPHICAL MODEL FOR INFERRING DRUG EFFECT LONGITUDINALLY ON DEPRESSION IN PEOPLE WITH HIV. Annals of Applied Statistics 16(1): 21-39. Published 2022-03-01. DOI: 10.1214/21-AOAS1492. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9236217/pdf/nihms-1778597.pdf
Pub Date: 2021-12-01. Epub Date: 2021-12-21. DOI: 10.1214/20-aoas1433
Iain Carmichael, Benjamin C Calhoun, Katherine A Hoadley, Melissa A Troester, Joseph Geradts, Heather D Couture, Linnea Olsson, Charles M Perou, Marc Niethammer, Jan Hannig, J S Marron
The two main approaches in the study of breast cancer are histopathology (analyzing visual characteristics of tumors) and genomics. While both histopathology and genomics are fundamental to cancer research, the connections between these fields have been relatively superficial. We bridge this gap by investigating the Carolina Breast Cancer Study through the development of an integrative, exploratory analysis framework. Our analysis gives insights - some known, some novel - that are engaging to both pathologists and geneticists. Our analysis framework is based on Angle-based Joint and Individual Variation Explained (AJIVE) for statistical data integration and exploits Convolutional Neural Networks (CNNs) as a powerful, automatic method for image feature extraction. CNNs raise interpretability issues that we address by developing novel methods to explore visual modes of variation captured by statistical algorithms (e.g. PCA or AJIVE) applied to CNN features.
JOINT AND INDIVIDUAL ANALYSIS OF BREAST CANCER HISTOLOGIC IMAGES AND GENOMIC COVARIATES. Annals of Applied Statistics 15(4): 1697-1722. DOI: 10.1214/20-aoas1433. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9007558/pdf/nihms-1780328.pdf
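The "modes of variation" step can be sketched generically: run PCA (via SVD) on an images-by-features matrix and inspect images with extreme scores along each mode. The CNN feature extraction itself is omitted here, and the random matrix below is only a stand-in for real CNN activations.

```python
import numpy as np

def pca_modes(features, k=2):
    """PCA via SVD on an (n_images x d) feature matrix, e.g. CNN activations.
    Returns per-image scores and the top-k directions (modes) in feature
    space. Sorting images by their score along a mode is one simple way to
    visualize what visual variation that mode captures."""
    Xc = features - features.mean(axis=0)       # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]
```

The same scores-and-modes pattern applies when the directions come from a joint method such as AJIVE rather than plain PCA.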
Brady T West, Roderick J Little, Rebecca R Andridge, Philip S Boonstra, Erin B Ware, Anita Pandit, Fernanda Alvarado-Leiton
Selection bias is a serious potential problem for inference about relationships of scientific interest based on samples without well-defined probability sampling mechanisms. Motivated by the potential for selection bias in: (a) estimated relationships of polygenic scores (PGSs) with phenotypes in genetic studies of volunteers and (b) estimated differences in subgroup means in surveys of smartphone users, we derive novel measures of selection bias for estimates of the coefficients in linear and probit regression models fitted to nonprobability samples, when aggregate-level auxiliary data are available for the selected sample and the target population. The measures arise from normal pattern-mixture models that allow analysts to examine the sensitivity of their inferences to assumptions about nonignorable selection in these samples. We examine the effectiveness of the proposed measures in a simulation study and then use them to quantify the selection bias in: (a) estimated PGS-phenotype relationships in a large study of volunteers recruited via Facebook and (b) estimated subgroup differences in mean past-year employment duration in a nonprobability sample of low-educated smartphone users. We evaluate the performance of the measures in these applications using benchmark estimates from large probability samples.
ASSESSING SELECTION BIAS IN REGRESSION COEFFICIENTS ESTIMATED FROM NONPROBABILITY SAMPLES WITH APPLICATIONS TO GENETICS AND DEMOGRAPHIC SURVEYS. Annals of Applied Statistics 15(3): 1556-1581. Published 2021-09-01. DOI: 10.1214/21-aoas1453. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8887878/pdf/nihms-1773953.pdf
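The paper derives bias measures for regression coefficients; the simpler and older mean-adjustment version of the normal pattern-mixture idea is easier to show compactly. The sketch below is our reading of that earlier proxy pattern-mixture literature (not the paper's coefficient measures), indexing the adjustment by a sensitivity parameter phi for nonignorable selection.

```python
import numpy as np

def ppm_adjusted_mean(y, x, x_bar_pop, phi):
    """Pattern-mixture style adjusted mean of y from a selected sample,
    given a proxy x whose population mean x_bar_pop is known.
    phi = 0: selection depends only on x (ignorable given the proxy);
    phi = 1: selection depends only on y. Intermediate phi interpolates,
    tracing out a sensitivity analysis rather than a single estimate."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    rho = np.corrcoef(x, y)[0, 1]
    g = (phi + (1 - phi) * rho) / (phi * rho + (1 - phi))
    return y.mean() + g * (y.std(ddof=1) / x.std(ddof=1)) * (x_bar_pop - x.mean())
```

At phi = 0 the adjustment collapses to the classical regression estimator of the mean, which is a useful internal check.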
Xiuyu Ma, Keegan Korthauer, Christina Kendziorski, Michael A Newton
On the problem of scoring genes for evidence of changes in the distribution of single-cell expression, we introduce an empirical Bayesian mixture approach and evaluate its operating characteristics in a range of numerical experiments. The proposed approach leverages cell-subtype structure revealed in cluster analysis in order to boost gene-level information on expression changes. Cell clustering informs gene-level analysis through a specially-constructed prior distribution over pairs of multinomial probability vectors; this prior meshes with available model-based tools that score patterns of differential expression over multiple subtypes. We derive an explicit formula for the posterior probability that a gene has the same distribution in two cellular conditions, allowing for a gene-specific mixture over subtypes in each condition. The compositional structure of the model confers an advantage: not only does it admit a host of gene-specific mixture components, it also constrains the mixing proportions at the whole-cell level. This structure leads to a novel form of information sharing through which the cell-clustering results support gene-level scoring of differential distribution. The result, according to our numerical experiments, is improved sensitivity compared to several standard approaches for detecting distributional expression changes.
A COMPOSITIONAL MODEL TO ASSESS EXPRESSION CHANGES FROM SINGLE-CELL RNA-SEQ DATA. Annals of Applied Statistics 15(2): 880-901. Published 2021-06-01. DOI: 10.1214/20-aoas1423. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10275512/pdf/nihms-1901161.pdf
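A posterior probability of "same distribution in two conditions" can be illustrated in miniature with a Dirichlet-multinomial two-group comparison: marginal likelihoods under "shared multinomial" versus "separate multinomials" combine into a posterior probability. This toy is far simpler than the paper's subtype-mixture model, and the function names are ours.

```python
import numpy as np
from scipy.special import gammaln

def log_dirmult(counts, alpha=1.0):
    """Log marginal likelihood of multinomial counts under a symmetric
    Dirichlet(alpha) prior on the category probabilities (multinomial
    coefficient omitted; it cancels in the comparison below)."""
    counts = np.asarray(counts, float)
    a = np.full_like(counts, alpha)
    return (gammaln(a.sum()) - gammaln(a.sum() + counts.sum())
            + np.sum(gammaln(a + counts) - gammaln(a)))

def prob_same_distribution(c1, c2, alpha=1.0, prior_same=0.5):
    """Posterior probability that two count vectors came from the same
    multinomial, computed stably from the two log marginal likelihoods."""
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    log_same = np.log(prior_same) + log_dirmult(c1 + c2, alpha)
    log_diff = np.log(1 - prior_same) + log_dirmult(c1, alpha) + log_dirmult(c2, alpha)
    m = max(log_same, log_diff)
    return np.exp(log_same - m) / (np.exp(log_same - m) + np.exp(log_diff - m))
```

Concordant count vectors favor the shared model, while strongly discordant ones push the posterior probability toward zero.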
Pub Date: 2021-03-01. Epub Date: 2021-03-18. DOI: 10.1214/20-aoas1407
David K Lim, Naim U Rashid, Joseph G Ibrahim
Clustering is a form of unsupervised learning that aims to uncover latent groups within data based on similarity across a set of features. A common application of this in biomedical research is in delineating novel cancer subtypes from patient gene expression data, given a set of informative genes. However, it is typically unknown a priori which genes may be informative in discriminating between clusters, and what the optimal number of clusters is. Few methods exist for performing unsupervised clustering of RNA-seq samples, and none currently adjust for between-sample global normalization factors, select cluster-discriminatory genes, or account for potential confounding variables during clustering. To address these issues, we propose Feature Selection and Clustering of RNA-seq (FSCseq): a model-based clustering algorithm that utilizes a finite mixture of regression (FMR) model and the quadratic penalty method with a Smoothly-Clipped Absolute Deviation (SCAD) penalty. The maximization is done by a penalized Classification EM algorithm, allowing us to include normalization factors and confounders in our modeling framework. Given the fitted model, our framework allows for subtype prediction in new patients via posterior probabilities of cluster membership, even in the presence of batch effects. Based on simulations and real data analysis, we show the advantages of our method relative to competing approaches.
MODEL-BASED FEATURE SELECTION AND CLUSTERING OF RNA-SEQ DATA FOR UNSUPERVISED SUBTYPE DISCOVERY. Annals of Applied Statistics 15(1): 481-508. DOI: 10.1214/20-aoas1407. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8386505/pdf/nihms-1716637.pdf
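The SCAD penalty in FSCseq's objective has a standard closed form (Fan and Li's construction): linear like the lasso near zero, quadratic in a middle zone, and constant beyond a*lam so that large coefficients are not over-shrunk. A scalar sketch, with the conventional default a = 3.7:

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for a scalar coefficient theta with tuning parameter
    lam > 0 and shape a > 2. The three pieces meet continuously at
    |theta| = lam and |theta| = a*lam, and the penalty is flat beyond
    a*lam, avoiding the constant shrinkage of large effects that an
    L1 penalty would impose."""
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return (a + 1) * lam**2 / 2
```

Checking the values at the two knots (lam**2 and (a + 1)*lam**2/2) confirms the pieces join continuously.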
Brian J Reich, Yawen Guan, Denis Fourches, Joshua L Warren, Stefanie E Sarnat, Howard H Chang
Humans are concurrently exposed to chemically, structurally and toxicologically diverse chemicals. A critical challenge for environmental epidemiology is to quantify the risk of adverse health outcomes resulting from exposures to such chemical mixtures and to identify which mixture constituents may be driving etiologic associations. A variety of statistical methods have been proposed to address these critical research questions. However, they generally rely solely on measured exposure and health data available within a specific study. Advancements in understanding the role of mixtures in human health may be better achieved through the utilization of external data and knowledge from multiple disciplines with innovative statistical tools. In this paper we develop new methods for health analyses that incorporate auxiliary information about the chemicals in a mixture, such as physicochemical, structural and/or toxicological data. We expect that the constituents identified using auxiliary information will be more biologically meaningful than those identified by methods that solely utilize observed correlations among measured exposures. We develop flexible Bayesian models by specifying prior distributions for the exposures and their effects that include auxiliary information and examine this idea over a spectrum of analyses from regression to factor analysis. The methods are applied to study the effects of volatile organic compounds on emergency room visits in Atlanta. We find that including cheminformatic information about the exposure variables improves prediction and provides a more interpretable model for emergency room visits for respiratory diseases.
INTEGRATIVE STATISTICAL METHODS FOR EXPOSURE MIXTURES AND HEALTH. Annals of Applied Statistics 14(4): 1945-1963. Published 2020-12-01. DOI: 10.1214/20-AOAS1364. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8914338/pdf/nihms-1780774.pdf
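One simple way to let auxiliary chemical descriptors inform a health regression, in the spirit of (though far simpler than) the hierarchical priors described above: give the exposure coefficients a Gaussian prior whose covariance is a kernel on the descriptors, so structurally similar chemicals are encouraged to have correlated effects. A hypothetical sketch; the function name, RBF kernel choice, and fixed variance parameters are all our assumptions.

```python
import numpy as np

def posterior_mean_with_aux_prior(X, y, D, sigma2=1.0, tau2=1.0, length=1.0):
    """Posterior mean of beta under y ~ N(X beta, sigma2 * I) and
    beta ~ N(0, K), where K is an RBF kernel on the rows of D
    (one descriptor vector per exposure column of X). Similar
    descriptors -> correlated prior -> smoothed, more interpretable
    coefficient estimates."""
    sq = ((D[:, None, :] - D[None, :, :]) ** 2).sum(-1)   # pairwise sq. distances
    K = tau2 * np.exp(-sq / (2 * length**2))
    p = X.shape[1]
    # standard conjugate-Gaussian posterior mean (jitter for stability)
    A = X.T @ X / sigma2 + np.linalg.inv(K + 1e-8 * np.eye(p))
    return np.linalg.solve(A, X.T @ y / sigma2)
```

A tight prior (small tau2) shrinks the estimates toward zero relative to ordinary least squares, which is the expected regularization behavior.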
Pub Date: 2020-09-01. Epub Date: 2020-09-18. DOI: 10.1214/20-aoas1357
Zhe Sun, Wanli Xu, Xiaomei Cong, Gen Li, Kun Chen
The neonatal intensive care unit (NICU) experience is known to be one of the most crucial factors driving a preterm infant's neurodevelopmental and health outcomes. It is hypothesized that the stressful early-life experience of very preterm neonates imprints the gut microbiome through regulation of the so-called brain-gut axis, and consequently, certain microbiome markers are predictive of later infant neurodevelopment. To investigate, a preterm infant study was conducted; infant fecal samples were collected during the infants' first month of postnatal age, resulting in functional compositional microbiome data, and neurobehavioral outcomes were measured when infants reached 36-38 weeks of post-menstrual age. To identify potential microbiome markers and estimate how the trajectories of gut microbiome compositions during the early postnatal stage impact later neurobehavioral outcomes of the preterm infants, we innovate a sparse log-contrast regression with functional compositional predictors. The functional simplex structure is strictly preserved, and the functional compositional predictors are allowed to have sparse, smoothly varying, and accumulating effects on the outcome through time. Through a pragmatic basis expansion step, the problem boils down to a linearly constrained sparse group regression, for which we develop an efficient algorithm and obtain theoretical performance guarantees. Our approach yields insightful results in the preterm infant study. The identified microbiome markers and the estimated time dynamics of their impact on the neurobehavioral outcome shed light on the linkage between stress accumulation in the early postnatal stage and the neurodevelopmental process of infants.
{"title":"LOG-CONTRAST REGRESSION WITH FUNCTIONAL COMPOSITIONAL PREDICTORS: LINKING PRETERM INFANT'S GUT MICROBIOME TRAJECTORIES TO NEUROBEHAVIORAL OUTCOME.","authors":"Zhe Sun, Wanli Xu, Xiaomei Cong, Gen Li, Kun Chen","doi":"10.1214/20-aoas1357","DOIUrl":"https://doi.org/10.1214/20-aoas1357","url":null,"abstract":"<p><p>The neonatal intensive care unit (NICU) experience is known to be one of the most crucial factors that drive preterm infant's neurodevelopmental and health outcome. It is hypothesized that stressful early life experience of very preterm neonate is imprinting gut microbiome by the regulation of the so-called brain-gut axis, and consequently, certain microbiome markers are predictive of later infant neurodevelopment. To investigate, a preterm infant study was conducted; infant fecal samples were collected during the infants' first month of postnatal age, resulting in functional compositional microbiome data, and neurobehavioral outcomes were measured when infants reached 36-38 weeks of post-menstrual age. To identify potential microbiome markers and estimate how the trajectories of gut microbiome compositions during early postnatal stage impact later neurobehavioral outcomes of the preterm infants, we innovate a sparse log-contrast regression with functional compositional predictors. The functional simplex structure is strictly preserved, and the functional compositional predictors are allowed to have sparse, smoothly varying, and accumulating effects on the outcome through time. Through a pragmatic basis expansion step, the problem boils down to a linearly constrained sparse group regression, for which we develop an efficient algorithm and obtain theoretical performance guarantees. Our approach yields insightful results in the preterm infant study. 
The identified microbiome markers and the estimated time dynamics of their impact on the neurobehavioral outcome shed light on the linkage between stress accumulation in the early postnatal stage and the neurodevelopmental process of infants.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"14 3","pages":"1535-1556"},"PeriodicalIF":1.8,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8218926/pdf/nihms-1601428.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39100587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
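The core identifiability device in log-contrast regression is the zero-sum constraint on the coefficients of the log-transformed compositions. As a minimal sketch of that idea (not the authors' functional estimator, just the scalar log-contrast lasso it generalizes, on simulated data): the constraint can be enforced exactly by regressing on log-ratios against a reference component.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.dirichlet(np.ones(p), size=n)              # compositional rows (each sums to 1)
Z = np.log(X)
beta_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0])   # zero-sum, sparse coefficients
y = Z @ beta_true + 0.05 * rng.standard_normal(n)

# Log-ratios against the last component: the zero-sum (log-contrast)
# constraint then holds automatically for the recovered full vector.
Zr = Z[:, :-1] - Z[:, [-1]]

# Proximal gradient (ISTA) for l1-penalized least squares
lam = 0.01
b = np.zeros(p - 1)
step = n / np.linalg.norm(Zr, 2) ** 2              # 1/Lipschitz of the gradient
for _ in range(2000):
    grad = Zr.T @ (Zr @ b - y) / n
    b = b - step * grad
    b = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)  # soft-threshold

beta_hat = np.append(b, -b.sum())                  # full zero-sum coefficient vector
```

The paper's functional version additionally expands each compositional trajectory in a basis and applies a group penalty; the reparameterization shown here is the standard way the simplex constraint is preserved.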
Pub Date : 2020-09-01Epub Date: 2020-09-18DOI: 10.1214/20-aoas1361
Philip S Boonstra, John C Krauss
A patient's medical problem list describes his or her current health status and aids in the coordination and transfer of care between providers. Because a problem list is generated once and then subsequently modified or updated, what is not usually observable is the provider effect: to what extent does a patient's problem list in the electronic medical record actually reflect a consensus communication of that patient's current health status? To that end, we report on and analyze a unique interview-based design in which multiple medical providers independently generate problem lists for each of three patient case abstracts of varying clinical difficulty. Given the uniqueness of both our data and the scientific objectives of our analysis, we apply and extend so-called multistage models for ordered lists and equip the models with variable-selection penalties to induce sparsity. Each problem has a corresponding non-negative parameter estimate, interpreted as a relative log-odds ratio, with larger values suggesting greater importance and zero values suggesting unimportant problems. We use these fitted penalized models to quantify and report the extent of consensus. We conduct a simulation study to evaluate the performance of our methodology and then analyze the motivating problem-list data. For the three case abstracts, the proportions of problems with model-estimated non-zero log-odds ratios were 10/28, 16/47, and 13/30. Physicians exhibited consensus on the highest-ranked problems in the first and last case abstracts, but agreement quickly deteriorated; in contrast, physicians broadly disagreed on the relevant problems for the middle, and most difficult, case abstract.
{"title":"Inferring a consensus problem list using penalized multistage models for ordered data.","authors":"Philip S Boonstra, John C Krauss","doi":"10.1214/20-aoas1361","DOIUrl":"10.1214/20-aoas1361","url":null,"abstract":"<p><p>A patient's medical problem list describes his or her current health status and aids in the coordination and transfer of care between providers. Because a problem list is generated once and then subsequently modified or updated, what is not usually observable is the provider-effect. That is, to what extent does a patient's problem in the electronic medical record actually reflect a consensus communication of that patient's current health status? To that end, we report on and analyze a unique interview-based design in which multiple medical providers independently generate problem lists for each of three patient case abstracts of varying clinical difficulty. Due to the uniqueness of both our data and the scientific objectives of our analysis, we apply and extend so-called multistage models for ordered lists and equip the models with variable selection penalties to induce sparsity. Each problem has a corresponding non-negative parameter estimate, interpreted as a relative log-odds ratio, with larger values suggesting greater importance and zero values suggesting unimportant problems. We use these fitted penalized models to quantify and report the extent of consensus. We conduct a simulation study to evaluate the performance of our methodology and then analyze the motivating problem list data. For the three case abstracts, the proportions of problems with model-estimated non-zero log-odds ratios were 10/28, 16/47, and 13/30. 
Physicians exhibited consensus on the highest-ranked problems in the first and last case abstracts, but agreement quickly deteriorated; in contrast, physicians broadly disagreed on the relevant problems for the middle, and most difficult, case abstract.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"14 3","pages":"1557-1580"},"PeriodicalIF":1.8,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8345315/pdf/nihms-1696242.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39291448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
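Multistage models for ordered lists build a list's likelihood stagewise: at each stage, the next item is chosen from those remaining with probability proportional to an exponentiated importance parameter. A rough, hypothetical illustration of that stagewise (Plackett-Luce-type) likelihood, with made-up scores and without the paper's penalization or extensions:

```python
import numpy as np

def ordered_list_loglik(theta, ranking):
    """Log-likelihood of one ordered list under a multistage model:
    at each stage, the next item is drawn from the remaining items with
    probability proportional to exp(theta_item)."""
    remaining = list(range(len(theta)))
    ll = 0.0
    for item in ranking:
        ll += theta[item] - np.log(np.exp(theta[remaining]).sum())
        remaining.remove(item)
    return ll

# Hypothetical "importance" parameters for three problems; larger theta
# means the problem tends to be listed earlier.
theta = np.array([1.5, 0.7, 0.0])
ll = ordered_list_loglik(theta, [0, 1, 2])   # list in order of importance
```

A list that ranks high-theta problems first receives a higher log-likelihood than its reversal, which is the mechanism by which the fitted parameters quantify consensus on problem importance.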