PREDICTION OF HEREDITARY CANCERS USING NEURAL NETWORKS.
Zoe Guan, Giovanni Parmigiani, Danielle Braun, Lorenzo Trippa
Annals of Applied Statistics 16(1): 495-520. Pub Date: 2022-03-01. Epub Date: 2022-03-28. DOI: 10.1214/21-aoas1510
Family history is a major risk factor for many types of cancer. Mendelian risk prediction models translate family histories into cancer risk predictions, based on knowledge of cancer susceptibility genes. These models are widely used in clinical practice to help identify high-risk individuals. Mendelian models leverage the entire family history, but they rely on many assumptions about cancer susceptibility genes that are either unrealistic or challenging to validate due to low mutation prevalence. Training more flexible models, such as neural networks, on large databases of pedigrees can potentially lead to accuracy gains. In this paper we develop a framework for applying neural networks to family history data and investigate their ability to learn inherited susceptibility to cancer. While there is an extensive literature on neural networks and their state-of-the-art performance in many tasks, there is little work applying them to family history data. We propose adaptations of fully connected neural networks and convolutional neural networks to pedigrees. In data simulated under Mendelian inheritance, we demonstrate that our proposed neural network models achieve nearly optimal prediction performance. Moreover, when the observed family history includes misreported cancer diagnoses, neural networks outperform the Mendelian BRCAPRO model embedding the correct inheritance laws. Using a large dataset of over 200,000 family histories, the Risk Service cohort, we train prediction models for future risk of breast cancer. We validate the models using data from the Cancer Genetics Network.
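Mendelian models of this kind rest on Bayes' rule: an assumed mutation prevalence (the prior) is combined with cancer penetrances (the likelihood) to give a posterior carrier probability. A minimal single-person sketch of that idea; all numbers are hypothetical illustrations, not BRCAPRO's actual parameters:

```python
# Bayes' rule for carrier status, the core of Mendelian risk models.
# All parameter values below are hypothetical, not BRCAPRO's.

def carrier_posterior(prior, penetrance_carrier, penetrance_noncarrier, affected):
    """Posterior P(carrier | phenotype) for a single individual."""
    like_c = penetrance_carrier if affected else 1.0 - penetrance_carrier
    like_n = penetrance_noncarrier if affected else 1.0 - penetrance_noncarrier
    num = prior * like_c
    return num / (num + (1.0 - prior) * like_n)

# An affected individual: rare mutation (0.5% prevalence), high penetrance
# in carriers (60%) versus noncarriers (12%).
p = carrier_posterior(0.005, 0.60, 0.12, affected=True)
print(round(p, 4))  # 0.0245
```

A full Mendelian model extends this computation over the whole pedigree by summing over the unobserved genotypes of every relative.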
PARTITIONING AROUND MEDOIDS CLUSTERING AND RANDOM FOREST CLASSIFICATION FOR GIS-INFORMED IMPUTATION OF FLUORIDE CONCENTRATION DATA.
Yu Gu, John S Preisser, Donglin Zeng, Poojan Shrestha, Molina Shah, Miguel A Simancas-Pallares, Jeannie Ginnis, Kimon Divaris
Annals of Applied Statistics 16(1): 551-572. Pub Date: 2022-03-01. DOI: 10.1214/21-aoas1516
Community water fluoridation is an important component of oral health promotion, as fluoride exposure is a well-documented dental caries-preventive agent. Direct measurements of domestic water fluoride content provide valuable information regarding individuals' fluoride exposure and thus caries risk; however, they are logistically challenging to carry out at a large scale in oral health research. This article describes the development and evaluation of a novel method for the imputation of missing domestic water fluoride concentration data informed by spatial autocorrelation. The context is a state-wide epidemiologic study of pediatric oral health in North Carolina, where domestic water fluoride concentration information was missing for approximately 75% of study participants with clinical data on dental caries. A new machine-learning-based imputation method that combines partitioning around medoids clustering and random forest classification (PAMRF) is developed and implemented. Imputed values are filtered according to allowable error rates or target sample size, depending on the requirements of each application. In leave-one-out cross-validation and simulation studies, PAMRF outperforms four existing imputation approaches: two conventional spatial interpolation methods (inverse-distance weighting, IDW, and universal kriging, UK) and two supervised learning methods (k-nearest neighbors, KNN, and classification and regression trees, CART). The inclusion of multiply imputed values in the estimation of the association between fluoride concentration and dental caries prevalence resulted in essentially no change in PAMRF estimates but substantial gains in precision due to the larger effective sample size. PAMRF is a powerful new method for the imputation of missing fluoride values where geographical information exists.
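For context, the simplest of the comparison methods, inverse-distance weighting, can be sketched in a few lines of NumPy; the power parameter p = 2 and the example coordinates and fluoride values are illustrative, not taken from the study:

```python
# Inverse-distance weighting (IDW), one of the baseline spatial
# interpolators the paper compares against. Values are hypothetical.
import numpy as np

def idw(coords, values, query, p=2.0, eps=1e-12):
    """Interpolate a value at `query` from observed (coords, values)."""
    d = np.linalg.norm(coords - query, axis=1)
    if np.any(d < eps):                # query coincides with an observation
        return values[np.argmin(d)]
    w = 1.0 / d**p                     # closer sites get larger weights
    return np.sum(w * values) / np.sum(w)

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
fluoride = np.array([0.7, 0.3, 0.5])   # hypothetical ppm measurements
print(idw(coords, fluoride, np.array([0.0, 0.0])))  # exact hit -> 0.7
```

Unlike IDW, the proposed PAMRF treats imputation as classification over clustered fluoride levels rather than smooth interpolation, which is why the two can behave quite differently on coarsely clustered water systems.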
ORDINAL PROBIT FUNCTIONAL OUTCOME REGRESSION WITH APPLICATION TO COMPUTER-USE BEHAVIOR IN RHESUS MONKEYS.
Mark J Meyer, Jeffrey S Morris, Regina Paxton Gazes, Brent A Coull
Annals of Applied Statistics 16(1): 537-550. Pub Date: 2022-03-01. Epub Date: 2022-03-28. DOI: 10.1214/21-aoas1513
Research in functional regression has made great strides in expanding to non-Gaussian functional outcomes, but exploration of ordinal functional outcomes remains limited. Motivated by a study of computer-use behavior in rhesus macaques (Macaca mulatta), we introduce the Ordinal Probit Functional Outcome Regression model (OPFOR). OPFOR models can be fit using one of several basis functions, including penalized B-splines, wavelets, and O'Sullivan splines, the last of which typically performs best. Simulation using a variety of underlying covariance patterns shows that the model performs reasonably well in estimation under multiple basis functions, with near-nominal coverage for joint credible intervals. Finally, in application, we use Bayesian model selection criteria adapted to functional outcome regression to best characterize the relation between several demographic factors of interest and the monkeys' computer use over the course of a year. In comparison with a standard ordinal longitudinal analysis, OPFOR outperforms a cumulative-link mixed-effects model in simulation and provides additional and more nuanced information on the nature of the monkeys' computer-use behavior.
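The ordinal-probit link underlying a model like OPFOR cuts a latent normal variable at ordered thresholds, giving category probabilities P(Y = k) = Φ(c_k − η) − Φ(c_{k−1} − η). A sketch with hypothetical thresholds and linear predictor, using only the standard library:

```python
# Ordinal-probit category probabilities from a latent normal variable.
# The cutpoints and linear predictor eta below are hypothetical.
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordinal_probit_probs(eta, cuts):
    """P(Y = k) for each ordinal category, given predictor eta."""
    c = [-math.inf] + list(cuts) + [math.inf]
    cdf = [phi(ck - eta) for ck in c]
    return [b - a for a, b in zip(cdf, cdf[1:])]    # telescoping differences

probs = ordinal_probit_probs(0.4, [-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs], round(sum(probs), 6))
```

In the functional-outcome setting, η becomes a smooth function of time built from the basis expansions named above, but the link from latent scale to categories is exactly this.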
BAYESIAN MITIGATION OF SPATIAL COARSENING FOR A HAWKES MODEL APPLIED TO GUNFIRE, WILDFIRE AND VIRAL CONTAGION.
Andrew J Holbrook, Xiang Ji, Marc A Suchard
Annals of Applied Statistics 16(1): 573-595. Pub Date: 2022-03-01. Epub Date: 2022-03-28. DOI: 10.1214/21-aoas1517
Self-exciting spatiotemporal Hawkes processes have found increasing use in the study of large-scale public health threats, ranging from gun violence and earthquakes to wildfires and viral contagion. Whereas many such applications feature locational uncertainty, that is, the exact spatial positions of individual events are unknown, most Hawkes model analyses to date have ignored spatial coarsening present in the data. Three particular 21st-century public health crises (urban gun violence, rural wildfires, and global viral spread) present qualitatively and quantitatively varying uncertainty regimes that exhibit: (a) different collective magnitudes of spatial coarsening, (b) uniform and mixed magnitude coarsening, (c) differently shaped uncertainty regions and, less orthodox, (d) locational data distributed within the "wrong" effective space. We explicitly model such uncertainties in a Bayesian manner and jointly infer unknown locations together with all parameters of a reasonably flexible Hawkes model, obtaining results that are practically and statistically distinct from those obtained while ignoring spatial coarsening. This work also features two secondary contributions: first, to facilitate Bayesian inference of locations and background rate parameters, we make a subtle yet crucial change to an established kernel-based rate model; second, to facilitate the same Bayesian inference at scale, we develop a massively parallel implementation of the model's log-likelihood gradient with respect to locations, avoiding its quadratic computational cost in the context of Hamiltonian Monte Carlo. Our examples involve thousands of observations and allow us to demonstrate practicality at moderate scales.
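The self-exciting structure at the heart of a Hawkes model is easiest to see in the conditional intensity of a purely temporal process, λ(t) = μ + Σ_{t_i < t} αβ e^{−β(t − t_i)}: each past event transiently raises the rate of future events. A sketch with illustrative parameter values only:

```python
# Conditional intensity of a temporal Hawkes process with an exponential
# triggering kernel. Parameter values and event times are illustrative.
import math

def hawkes_intensity(t, events, mu, alpha, beta):
    """lambda(t) = mu + sum over past events of alpha*beta*exp(-beta*(t-ti))."""
    excitation = sum(alpha * beta * math.exp(-beta * (t - ti))
                     for ti in events if ti < t)
    return mu + excitation

events = [1.0, 2.5, 2.9]          # hypothetical past event times
base = hawkes_intensity(0.5, events, mu=0.2, alpha=0.5, beta=1.3)
after = hawkes_intensity(3.0, events, mu=0.2, alpha=0.5, beta=1.3)
print(base, after)                # intensity is elevated just after a burst
```

The spatiotemporal version adds a spatial triggering kernel around each event location, which is precisely where the coarsened-location problem the paper addresses enters: the sum above needs the true positions t_i (and x_i), not their coarsened surrogates.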
BIDIMENSIONAL LINKED MATRIX FACTORIZATION FOR PAN-OMICS PAN-CANCER ANALYSIS.
Eric F Lock, Jun Young Park, Katherine A Hoadley
Annals of Applied Statistics 16(1): 193-215. Pub Date: 2022-03-01. Epub Date: 2022-03-28. DOI: 10.1214/21-AOAS1495
Several modern applications require the integration of multiple large data matrices that have shared rows and/or columns. For example, cancer studies that integrate multiple omics platforms across multiple types of cancer, pan-omics pan-cancer analysis, have extended our knowledge of molecular heterogeneity beyond what was observed in single tumor and single platform studies. However, these studies have been limited by available statistical methodology. We propose a flexible approach to the simultaneous factorization and decomposition of variation across such bidimensionally linked matrices, BIDIFAC+. BIDIFAC+ decomposes variation into a series of low-rank components that may be shared across any number of row sets (e.g., omics platforms) or column sets (e.g., cancer types). This builds on a growing literature for the factorization and decomposition of linked matrices, which has primarily focused on multiple matrices that are linked in one dimension (rows or columns) only. Our objective function extends nuclear norm penalization, is motivated by random matrix theory, gives a unique decomposition under relatively mild conditions, and can be shown to give the mode of a Bayesian posterior distribution. We apply BIDIFAC+ to pan-omics pan-cancer data from TCGA, identifying shared and specific modes of variability across four different omics platforms and 29 different cancer types.
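Nuclear-norm penalties of the kind this objective extends have a closed-form proximal operator, singular-value soft-thresholding, which is the basic building block of such low-rank decompositions. A single-matrix sketch with an arbitrary penalty λ = 1 (the linked, multi-block structure of BIDIFAC+ is not reproduced here):

```python
# Singular-value soft-thresholding: the proximal operator of the nuclear
# norm, sketched for one matrix with an illustrative penalty lam.
import numpy as np

def svt(X, lam):
    """Soft-threshold the singular values of X by lam."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)      # shrink, never below zero
    return U @ np.diag(s_shrunk) @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
L = svt(X, lam=1.0)
# Shrinkage reduces every nonzero singular value and never increases rank.
print(np.linalg.matrix_rank(L) <= np.linalg.matrix_rank(X))  # True
```

Thresholding small singular values to exactly zero is what produces the low-rank components; the BIDIFAC+ objective applies this kind of penalty jointly across the row-set and column-set blocks.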
BAGEL: A BAYESIAN GRAPHICAL MODEL FOR INFERRING DRUG EFFECT LONGITUDINALLY ON DEPRESSION IN PEOPLE WITH HIV.
Yuliang Li, Yang Ni, Leah H Rubin, Amanda B Spence, Yanxun Xu
Annals of Applied Statistics 16(1): 21-39. Pub Date: 2022-03-01. DOI: 10.1214/21-AOAS1492
Access and adherence to antiretroviral therapy (ART) have transformed HIV infection from a fatal into a chronic disease. However, ART is also known for its side effects: studies have reported that ART is associated with depressive symptomatology. Large-scale HIV clinical databases with individuals' longitudinal depression records, ART medications, and clinical characteristics offer researchers unprecedented opportunities to study the effects of ART drugs on depression over time. We develop BAGEL, a Bayesian graphical model that investigates the longitudinal effects of ART drugs on a range of depressive symptoms while adjusting for participants' demographic, behavioral, and clinical characteristics, and accounts for population heterogeneity through a Bayesian nonparametric prior. We evaluate BAGEL through simulation studies. Application to a dataset from the Women's Interagency HIV Study yields interpretable and clinically useful results. BAGEL not only improves our understanding of ART drug effects on disparate depression symptoms but also has clinical utility in guiding informed and effective treatment selection to facilitate precision medicine in HIV.
JOINT AND INDIVIDUAL ANALYSIS OF BREAST CANCER HISTOLOGIC IMAGES AND GENOMIC COVARIATES.
Iain Carmichael, Benjamin C Calhoun, Katherine A Hoadley, Melissa A Troester, Joseph Geradts, Heather D Couture, Linnea Olsson, Charles M Perou, Marc Niethammer, Jan Hannig, J S Marron
Annals of Applied Statistics 15(4): 1697-1722. Pub Date: 2021-12-01. Epub Date: 2021-12-21. DOI: 10.1214/20-aoas1433
The two main approaches in the study of breast cancer are histopathology (analyzing visual characteristics of tumors) and genomics. While both histopathology and genomics are fundamental to cancer research, the connections between these fields have been relatively superficial. We bridge this gap by investigating the Carolina Breast Cancer Study through the development of an integrative, exploratory analysis framework. Our analysis gives insights, some known and some novel, that are engaging to both pathologists and geneticists. Our analysis framework is based on Angle-based Joint and Individual Variation Explained (AJIVE) for statistical data integration and exploits Convolutional Neural Networks (CNNs) as a powerful, automatic method for image feature extraction. CNNs raise interpretability issues that we address by developing novel methods to explore visual modes of variation captured by statistical algorithms (e.g., PCA or AJIVE) applied to CNN features.
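A "mode of variation" for an algorithm like PCA is typically traversed as mean + t·σ_k·v_k along a component direction v_k; points along the traversal can then be mapped back to images for visual inspection. A sketch on random stand-in features (not actual CNN features, which are an assumption this example does not reproduce):

```python
# Traversing the first PCA mode of variation of a feature matrix.
# The features here are random stand-ins, not real CNN activations.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))  # fake features
mu = X.mean(axis=0)
Xc = X - mu                                   # center before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
sigma1 = s[0] / np.sqrt(len(X))               # std dev along first component
mode = [mu + t * sigma1 * Vt[0] for t in (-2, 0, 2)]   # traverse the mode
print(np.allclose(mode[1], mu))               # True: t = 0 recovers the mean
```

In the paper's setting the traversal points would be decoded back into representative image patches, which is what makes the CNN features interpretable to pathologists.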
ASSESSING SELECTION BIAS IN REGRESSION COEFFICIENTS ESTIMATED FROM NONPROBABILITY SAMPLES WITH APPLICATIONS TO GENETICS AND DEMOGRAPHIC SURVEYS.
Brady T West, Roderick J Little, Rebecca R Andridge, Philip S Boonstra, Erin B Ware, Anita Pandit, Fernanda Alvarado-Leiton
Annals of Applied Statistics 15(3): 1556-1581. Pub Date: 2021-09-01. DOI: 10.1214/21-aoas1453
Selection bias is a serious potential problem for inference about relationships of scientific interest based on samples without well-defined probability sampling mechanisms. Motivated by the potential for selection bias in (a) estimated relationships of polygenic scores (PGSs) with phenotypes in genetic studies of volunteers and (b) estimated differences in subgroup means in surveys of smartphone users, we derive novel measures of selection bias for estimates of the coefficients in linear and probit regression models fitted to nonprobability samples, when aggregate-level auxiliary data are available for the selected sample and the target population. The measures arise from normal pattern-mixture models that allow analysts to examine the sensitivity of their inferences to assumptions about nonignorable selection in these samples. We examine the effectiveness of the proposed measures in a simulation study and then use them to quantify the selection bias in (a) estimated PGS-phenotype relationships in a large study of volunteers recruited via Facebook and (b) estimated subgroup differences in mean past-year employment duration in a nonprobability sample of low-educated smartphone users. We evaluate the performance of the measures in these applications using benchmark estimates from large probability samples.
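The core phenomenon, selection that depends on the outcome, can be shown with a toy simulation; the selection mechanism below is hypothetical and unrelated to the paper's pattern-mixture measures:

```python
# Toy illustration of nonignorable selection: when inclusion depends on
# the outcome, the selected-sample mean is biased for the population mean.
# The selection probabilities below are invented for illustration.
import random

random.seed(7)
pop = [random.gauss(0.0, 1.0) for _ in range(200_000)]
# Units with larger Y are more likely to volunteer into the sample.
sample = [y for y in pop if random.random() < 0.2 + 0.15 * (y > 0)]
pop_mean = sum(pop) / len(pop)
samp_mean = sum(sample) / len(sample)
print(round(pop_mean, 3), round(samp_mean, 3))  # sample mean pulled upward
```

The paper's contribution is to quantify how far such a gap could plausibly go for regression coefficients, using only aggregate auxiliary data, rather than to assume the selection mechanism is known as this toy does.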
Xiuyu Ma, Keegan Korthauer, Christina Kendziorski, Michael A Newton
On the problem of scoring genes for evidence of changes in the distribution of single-cell expression, we introduce an empirical Bayesian mixture approach and evaluate its operating characteristics in a range of numerical experiments. The proposed approach leverages cell-subtype structure revealed in cluster analysis in order to boost gene-level information on expression changes. Cell clustering informs gene-level analysis through a specially-constructed prior distribution over pairs of multinomial probability vectors; this prior meshes with available model-based tools that score patterns of differential expression over multiple subtypes. We derive an explicit formula for the posterior probability that a gene has the same distribution in two cellular conditions, allowing for a gene-specific mixture over subtypes in each condition. Advantage is gained by the compositional structure of the model not only in which a host of gene-specific mixture components are allowed but also in which the mixing proportions are constrained at the whole cell level. This structure leads to a novel form of information sharing through which the cell-clustering results support gene-level scoring of differential distribution. The result, according to our numerical experiments, is improved sensitivity compared to several standard approaches for detecting distributional expression changes.
{"title":"A COMPOSITIONAL MODEL TO ASSESS EXPRESSION CHANGES FROM SINGLE-CELL RNA-SEQ DATA.","authors":"Xiuyu Ma, Keegan Korthauer, Christina Kendziorski, Michael A Newton","doi":"10.1214/20-aoas1423","DOIUrl":"https://doi.org/10.1214/20-aoas1423","url":null,"abstract":"<p><p>On the problem of scoring genes for evidence of changes in the distribution of single-cell expression, we introduce an empirical Bayesian mixture approach and evaluate its operating characteristics in a range of numerical experiments. The proposed approach leverages cell-subtype structure revealed in cluster analysis in order to boost gene-level information on expression changes. Cell clustering informs gene-level analysis through a specially-constructed prior distribution over pairs of multinomial probability vectors; this prior meshes with available model-based tools that score patterns of differential expression over multiple subtypes. We derive an explicit formula for the posterior probability that a gene has the same distribution in two cellular conditions, allowing for a gene-specific mixture over subtypes in each condition. Advantage is gained by the compositional structure of the model not only in which a host of gene-specific mixture components are allowed but also in which the mixing proportions are constrained at the whole cell level. This structure leads to a novel form of information sharing through which the cell-clustering results support gene-level scoring of differential distribution. 
The result, according to our numerical experiments, is improved sensitivity compared to several standard approaches for detecting distributional expression changes.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"15 2","pages":"880-901"},"PeriodicalIF":1.8,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10275512/pdf/nihms-1901161.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9762402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
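The posterior probability that a gene has the same distribution in two conditions can be illustrated with a minimal sketch. The version below is a simplified two-hypothesis reduction, not the paper's full compositional model: it compares a shared multinomial over cell subtypes against condition-specific multinomials, using conjugate Dirichlet priors so the marginal likelihoods are closed-form. Function names, the Dirichlet prior, and the `prior_same` weight are all illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_logml(counts, alpha):
    """Log marginal likelihood of multinomial counts under a
    Dirichlet(alpha) prior on the probability vector (multinomial
    coefficient omitted; it cancels in the Bayes factor)."""
    c = np.asarray(counts, dtype=float)
    a = np.asarray(alpha, dtype=float)
    return (gammaln(a.sum()) - gammaln(a.sum() + c.sum())
            + np.sum(gammaln(a + c) - gammaln(a)))

def posterior_prob_same(counts1, counts2, alpha, prior_same=0.5):
    """Posterior probability that two conditions share a single
    multinomial distribution over cell subtypes.  A hypothetical
    two-hypothesis reduction of the compositional model."""
    c1, c2 = np.asarray(counts1), np.asarray(counts2)
    # H0: one shared probability vector generated both count vectors.
    log_h0 = dirichlet_multinomial_logml(c1 + c2, alpha)
    # H1: each condition has its own probability vector.
    log_h1 = (dirichlet_multinomial_logml(c1, alpha)
              + dirichlet_multinomial_logml(c2, alpha))
    log_odds = (np.log(prior_same) + log_h0
                - np.log(1.0 - prior_same) - log_h1)
    return 1.0 / (1.0 + np.exp(-log_odds))
```

With counts concentrated in the same subtype in both conditions, the shared-distribution hypothesis dominates; with counts in disjoint subtypes, the posterior probability of equality collapses toward zero.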
Pub Date : 2021-03-01Epub Date: 2021-03-18DOI: 10.1214/20-aoas1407
David K Lim, Naim U Rashid, Joseph G Ibrahim
Clustering is a form of unsupervised learning that aims to uncover latent groups within data based on similarity across a set of features. A common application of this in biomedical research is in delineating novel cancer subtypes from patient gene expression data, given a set of informative genes. However, it is typically unknown a priori which genes may be informative in discriminating between clusters, or what the optimal number of clusters is. Few methods exist for performing unsupervised clustering of RNA-seq samples, and none currently adjust for between-sample global normalization factors, select cluster-discriminatory genes, or account for potential confounding variables during clustering. To address these issues, we propose the Feature Selection and Clustering of RNA-seq (FSCseq): a model-based clustering algorithm that utilizes a finite mixture of regression (FMR) model and the quadratic penalty method with a Smoothly-Clipped Absolute Deviation (SCAD) penalty. The maximization is done by a penalized Classification EM algorithm, allowing us to include normalization factors and confounders in our modeling framework. Given the fitted model, our framework allows for subtype prediction in new patients via posterior probabilities of cluster membership, even in the presence of batch effects. Based on simulations and real data analysis, we show the advantages of our method relative to competing approaches.
{"title":"MODEL-BASED FEATURE SELECTION AND CLUSTERING OF RNA-SEQ DATA FOR UNSUPERVISED SUBTYPE DISCOVERY.","authors":"David K Lim, Naim U Rashid, Joseph G Ibrahim","doi":"10.1214/20-aoas1407","DOIUrl":"10.1214/20-aoas1407","url":null,"abstract":"<p><p>Clustering is a form of unsupervised learning that aims to uncover latent groups within data based on similarity across a set of features. A common application of this in biomedical research is in delineating novel cancer subtypes from patient gene expression data, given a set of informative genes. However, it is typically unknown <i>a priori</i> what genes may be informative in discriminating between clusters, and what the optimal number of clusters are. Few methods exist for performing unsupervised clustering of RNA-seq samples, and none currently adjust for between-sample global normalization factors, select cluster-discriminatory genes, or account for potential confounding variables during clustering. To address these issues, we propose the Feature Selection and Clustering of RNA-seq (FSCseq): a model-based clustering algorithm that utilizes a finite mixture of regression (FMR) model and the quadratic penalty method with a Smoothly-Clipped Absolute Deviation (SCAD) penalty. The maximization is done by a penalized Classification EM algorithm, allowing us to include normalization factors and confounders in our modeling framework. Given the fitted model, our framework allows for subtype prediction in new patients via posterior probabilities of cluster membership, even in the presence of batch effects. 
Based on simulations and real data analysis, we show the advantages of our method relative to competing approaches.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"15 1","pages":"481-508"},"PeriodicalIF":1.8,"publicationDate":"2021-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8386505/pdf/nihms-1716637.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9546884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
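The interplay of hard cluster assignments (the C-step) and SCAD-penalized parameter updates can be sketched in a toy form. The code below is an illustrative simplification, not the published FSCseq algorithm: it uses Euclidean distances and Gaussian-style cluster means rather than the FMR likelihood, and it omits normalization factors and confounders. Feature selection arises because each feature's cluster-mean deviations from the grand mean are shrunk by the SCAD thresholding operator, so non-discriminatory features collapse to a common value.

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding operator (Fan & Li, 2001) applied elementwise."""
    z = np.asarray(z, dtype=float)
    # Soft-threshold region: |z| <= 2*lam.
    out = np.where(np.abs(z) <= 2 * lam,
                   np.sign(z) * np.maximum(np.abs(z) - lam, 0.0), z)
    # Linear-interpolation region: 2*lam < |z| <= a*lam.
    mid = (np.abs(z) > 2 * lam) & (np.abs(z) <= a * lam)
    out = np.where(mid, ((a - 1) * z - np.sign(z) * a * lam) / (a - 2), out)
    # Beyond a*lam the estimate is left unpenalized.
    return out

def classification_em(X, K=2, lam=0.5, n_iter=50):
    """Toy penalized classification EM: hard assignments plus
    SCAD-shrunken cluster means.  Features whose shrunken deviations
    are all zero carry no clustering information."""
    n, p = X.shape
    grand = X.mean(axis=0)
    # Deterministic init: spread initial centers over samples ordered by
    # total signal (an illustrative choice, not FSCseq's initialization).
    order = np.argsort(X.sum(axis=1))
    mu = X[order[np.linspace(0, n - 1, K).astype(int)]]
    for _ in range(n_iter):
        # C-step: hard-assign each sample to its nearest cluster mean.
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # M-step: cluster means, with deviations from the grand mean
        # shrunk by the SCAD thresholding operator (feature selection).
        mu = np.vstack([X[labels == k].mean(axis=0)
                        if np.any(labels == k) else grand
                        for k in range(K)])
        mu = grand + scad_threshold(mu - grand, lam)
    informative = np.any(mu != grand, axis=0)
    return labels, mu, informative
```

On data where only a few features separate the groups, the returned `informative` mask flags exactly those features, mirroring how the SCAD penalty in FSCseq selects cluster-discriminatory genes while the C-step keeps the clustering itself cheap.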