Clustering to find subgroups with common features is often a necessary first step in the statistical modelling and analysis of large and complex datasets. Although follow-up analyses often use complex statistical models suited to the specific application, most popular clustering approaches are either nonparametric or based on Gaussian mixture models and their variants, often for reasons of computational efficiency. Characteristics that are common in modern scientific datasets, such as outliers or non-ellipsoidal cluster shapes, often cause these methods to detect the cluster components inaccurately. In this article, we present two efficient and robust Bayesian clustering approaches that seek to overcome these limitations: a model-based ‘tight’ clustering approach for clustering points in the presence of outliers, and a hierarchical Laplace mixture-based approach for clustering heavy-tailed and otherwise non-normal cluster components. We illustrate their power and accuracy in detecting meaningful clusters in datasets from genomics, imaging and the environmental sciences.
Huizi Zhang, Ben Swallow, Mayetri Gupta. Bayesian hierarchical mixture models for detecting non-normal clusters applied to noisy genomic and environmental datasets. Australian & New Zealand Journal of Statistics, 64(2), 313-337. Published 2022-08-01. DOI: 10.1111/anzs.12370. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/anzs.12370
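The remark in the abstract above, that Gaussian-based clustering struggles with outliers and heavy tails, can be illustrated with a minimal sketch. It is not the authors' hierarchical model. It only compares the log-density that a single Normal component and a single Laplace component, matched to the same standard deviation, assign to points increasingly far from the centre. The function names and evaluation points are hypothetical choices made for illustration.

```python
import math


def normal_logpdf(x, mean=0.0, sd=1.0):
    """Log-density of a Normal(mean, sd^2) distribution at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def laplace_logpdf(x, loc=0.0, sd=1.0):
    """Log-density of a Laplace distribution with the given standard deviation.

    A Laplace distribution with scale b has variance 2 * b**2,
    so matching the standard deviation gives b = sd / sqrt(2).
    """
    b = sd / math.sqrt(2.0)
    return -abs(x - loc) / b - math.log(2.0 * b)


# Illustrative points only: near the centre, moderately far, and an outlier.
for x in (0.5, 3.0, 6.0):
    print(x, round(normal_logpdf(x), 3), round(laplace_logpdf(x), 3))
```

The Normal log-density falls off quadratically in the distance from the centre, while the Laplace log-density falls off only linearly. A distant point is therefore penalised far less under a Laplace component, which is one intuition for why heavy-tailed components can be less distorted by outliers.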
J.-B. Durand, F. Forbes, C.D. Phan, L. Truong, H.D. Nguyen, F. Dama
We develop a Bayesian non-parametric (BNP) model coupled with Markov random fields (MRFs) for risk mapping, to infer spatial regions that are homogeneous in terms of risk. In contrast to most existing methods, the proposed approach does not require an arbitrary commitment to a specified number of risk classes and determines their risk levels automatically. We consider settings in which the relevant information consists of counts and propose a so-called BNP hidden MRF (BNP-HMRF) model that is able to handle such data. Model inference is carried out using a variational Bayes expectation–maximisation algorithm, and the approach is illustrated on traffic crash data from the state of Victoria, Australia. The results obtained agree well with the traffic safety literature. More generally, the model presented here for risk mapping offers an effective, convenient and fast way to partition spatially localised count data.
J.-B. Durand, F. Forbes, C.D. Phan, L. Truong, H.D. Nguyen, F. Dama. Bayesian non-parametric spatial prior for traffic crash risk mapping: A case study of Victoria, Australia. Australian & New Zealand Journal of Statistics, 64(2), 171-204. Published 2022-07-06. DOI: 10.1111/anzs.12369. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/anzs.12369
Vivi N. Arief, Ian H. DeLacy, Thomas Payne, Kaye E. Basford
Historical data from plant breeding programs provide valuable resources for studying the response of genotypes to changing environments (i.e. genotype-by-environment interaction). Such data have been used to evaluate the pattern of genotype performance across regions or locations, but their use to evaluate the long-term pattern of genotype performance across environments (i.e. locations-by-years) has been hampered by the lack of common genotypes across years. This lack of common genotypes is due to the structure of the breeding program, especially for annual crops, where only a proportion of selected genotypes are tested in subsequent years. The result is a sparse table of genotype performance across years (i.e. a genotype-by-year table). A genomic prediction method that fits both a relationship matrix among genotypes and a relationship matrix among environments (i.e. years) could overcome this limitation and produce a dense genotype-by-year table, thereby enabling some evaluation of long-term genotype performance. In this paper, we applied the genomic prediction model to the yield data from CIMMYT's Elite Spring Wheat Yield Trials (ESWYT) to visualise the pattern of genotype performance over 25 years.
Vivi N. Arief, Ian H. DeLacy, Thomas Payne, Kaye E. Basford. Visualising the pattern of long-term genotype performance by leveraging a genomic prediction model. Australian & New Zealand Journal of Statistics, 64(2), 297-312. Published 2022-05-03. DOI: 10.1111/anzs.12362. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/anzs.12362
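As a rough illustration of how two relationship matrices can turn a sparse genotype-by-year table into a dense one, as the abstract above describes, the sketch below fills untested cells with the posterior mean of a simple Gaussian model. It is not the authors' model and rests on several assumptions. The covariance between two cells is taken to be the product of a genotype relationship and a year relationship, `noise_var` is an arbitrary residual variance, and the matrices `K_g` and `K_y`, the yields, and the helpers `solve` and `fill_table` are all hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented copy, inputs untouched
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fill_table(table, K_g, K_y, noise_var=0.1):
    """Predict every cell of a genotype-by-year table; None marks an untested cell."""
    G, Y = len(table), len(table[0])
    cells = [(g, t) for g in range(G) for t in range(Y)]
    obs = [(g, t) for g, t in cells if table[g][t] is not None]
    mean = sum(table[g][t] for g, t in obs) / len(obs)

    def cov(a, b):
        # Assumed product structure: genotype relationship times year relationship.
        return K_g[a[0]][b[0]] * K_y[a[1]][b[1]]

    A = [[cov(a, b) + (noise_var if a == b else 0.0) for b in obs] for a in obs]
    alpha = solve(A, [table[g][t] - mean for g, t in obs])
    dense = [[0.0] * Y for _ in range(G)]
    for g, t in cells:
        dense[g][t] = mean + sum(cov((g, t), b) * w for b, w in zip(obs, alpha))
    return dense


# Toy inputs (made up): genotypes 0 and 1 are closely related, genotype 2 is not.
K_g = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
K_y = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]]
table = [[5.0, 5.2, None], [None, 5.1, 5.4], [3.0, None, 3.3]]

dense = fill_table(table, K_g, K_y)
for row in dense:
    print([round(v, 2) for v in row])
```

Each genotype in the toy table is missing one year, so no year has all three genotypes tested. The predictor borrows information across related genotypes and across correlated years to produce a value for every cell, which is the sense in which a dense table becomes available for visualisation.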