Rollout designs for lump-sum data.
Pub Date: 2024-12-13 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2440031
Qunzhi Xu, Hongzhen Tian, Ananda Sarkar, Yajun Mei
Journal of Applied Statistics 52(9): 1777-1790. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12217103/pdf/

This work studies rollout design problems, with a focus on suitable choices of the rollout rate under the standard Type I and Type II error probability control framework. The main challenge of rollout design is that data are often observed in a lump-sum manner from a spatio-temporal point of view: (1) temporally, only the sum of data in a given sliding time window can be observed; (2) spatially, the data at each time step come from two subgroups, control and treatment, but one can only observe the totals rather than the individual values from each subgroup. We develop rollout tests for lump-sum data under both fixed-sample-size and sequential settings, subject to constraints on the Type I and Type II error probabilities. Numerical studies are conducted to validate our theoretical results.
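The lump-sum observation scheme described in this abstract can be made concrete with a toy sketch: at each time step only the pooled control-plus-treatment value exists, and only sliding-window sums of those pooled values are observable. Function and parameter names are illustrative, not from the paper.

```python
import random

def lump_sum_observations(control, treatment, window):
    """Pool control and treatment values at each time step, then return
    sliding-window sums -- the only quantities observable under the
    lump-sum scheme (toy illustration, not the paper's code)."""
    pooled = [c + t for c, t in zip(control, treatment)]
    return [sum(pooled[i:i + window]) for i in range(len(pooled) - window + 1)]

random.seed(0)
control = [random.gauss(0.0, 1.0) for _ in range(10)]
treatment = [random.gauss(0.5, 1.0) for _ in range(10)]
obs = lump_sum_observations(control, treatment, window=3)
print(len(obs))  # 8 windows from 10 time steps
```

Any test based on such data must distinguish the treatment effect using only these aggregated totals, which is what drives the rollout-rate trade-off studied in the paper.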
Delaying bud-break on pecan trees: a Bayesian longitudinal multinomial regression approach.
Pub Date: 2024-12-12 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2436007
Dayna P Saldaña Zepeda, Richard Heerema, Ciro Velasco Cruz, William Giese, Joshua Sherman
Journal of Applied Statistics 52(8): 1649-1669. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12147487/pdf/

A multivariate Bayesian probit model is adapted to analyze a longitudinal multiclass ordinal response, with a linear plateau as the longitudinal model. Measurements of pecan bud growth were collected at irregular time intervals, about a week apart, from late March to mid-April, using a six-level ordinal scale. The data come from two randomized complete block designs with four blocks each. The experiments were set up and initiated in 2018 in a pecan orchard at two different locations to evaluate the effect of two sets of four treatments on delaying the growth of recently broken pecan buds, with the aim of minimizing bud loss due to low temperatures. A simulation study was successfully carried out to validate the model implementation. Treatment 3 of Experiment 1 was associated with the greatest reduction in bud growth rate. In Experiment 2, Treatments 2 and 3 had some effect on delaying bud growth. Although treatment effects were not statistically different in either experiment, this paper presents a practical and efficient modeling technique for longitudinal multinomial ordinal data, a common data type in applied agricultural research studies.
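A linear-plateau longitudinal model, as used above, grows linearly up to a change point (knot) and then stays flat. A minimal sketch follows; the parameterization and names are illustrative assumptions, since the abstract does not give the exact form.

```python
def linear_plateau(t, intercept, slope, knot):
    """Linear-plateau mean function: grows linearly until the knot,
    then stays flat. A common parameterization; the paper's exact
    form may differ."""
    return intercept + slope * min(t, knot)

# Hypothetical bud-growth trend: rises for the first 14 days, then plateaus.
curve = [linear_plateau(t, intercept=1.0, slope=0.2, knot=14) for t in range(0, 28, 7)]
print(curve)
```

Fitting such a curve to an ordinal response is what the probit layer handles: the latent continuous trajectory is thresholded into the six observed growth categories.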
Interval-valued scalar-on-function linear quantile regression based on the bivariate center and radius method.
Pub Date: 2024-12-11 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2440035
Kaiyuan Liu, Min Xu, Jiang Du, Tianfa Xie
Journal of Applied Statistics 52(9): 1791-1824. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12217117/pdf/

Interval-valued functional data, a new type of data in symbolic data analysis, capture the characteristics of many kinds of big data and have drawn the attention of many researchers. Mean regression is one of the main methods for analyzing interval-valued functional data; however, it is sensitive to outliers and may lead to unreliable results. As an important complement to mean regression, this paper proposes an interval-valued scalar-on-function linear quantile regression model. Specifically, we construct two linear quantile regression models for the interval-valued response and interval-valued functional regressors based on the bivariate center and radius method. The proposed model is more robust and efficient than mean regression methods when the data contain outliers or the errors do not follow a normal distribution. Numerical simulations and a real-data analysis of a climate dataset demonstrate the effectiveness and superiority of the proposed method over existing methods.
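The center-and-radius transform that the method above builds on is simple: an interval [lower, upper] is re-expressed as its midpoint and half-width, and the two regressions are run on those bivariate components. A minimal sketch with illustrative data:

```python
def center_radius(lower, upper):
    """Represent an interval [lower, upper] by its center and radius,
    the bivariate transform underlying the center-and-radius method."""
    center = (lower + upper) / 2.0
    radius = (upper - lower) / 2.0
    return center, radius

# A daily temperature range observed as an interval (toy data).
c, r = center_radius(12.0, 20.0)
print(c, r)  # 16.0 4.0
```

The transform is invertible (lower = center - radius, upper = center + radius), so models fitted on the (center, radius) pair can be mapped back to interval predictions.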
Scalable Bayesian inference for Bradley-Terry models with ties: an application to honour-based abuse.
Pub Date: 2024-12-11 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2436608
Rowland G Seymour, Fabian Hernandez
Journal of Applied Statistics 52(9): 1695-1712. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12217112/pdf/

Honour-based abuse covers a wide range of family abuse, including female genital mutilation and forced marriage. Safeguarding professionals need to identify where abuses are happening in their local communities so they can best support those at risk of these crimes and take preventative action. However, there is little local data about these kinds of crime. To tackle this problem, we ran comparative judgement surveys to map abuses at the local level, in which participants were shown pairs of wards and asked which had the higher rate of honour-based abuse. In previous comparative judgement studies, participants reported fatigue when comparing areas with similar levels of abuse. Allowing for tied comparisons reduces fatigue but increases the computational complexity of fitting the model. We designed an efficient Markov chain Monte Carlo algorithm to fit a model with ties, allowing for a wide range of prior distributions on the model parameters. Working with South Yorkshire Police and Oxford Against Cutting, we mapped the risk of honour-based abuse at the community level in two counties in the UK.
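The abstract does not spell out the tie model. One standard way to extend Bradley-Terry to allow ties is Davidson's (1970) parameterization, sketched below as an assumption about the kind of likelihood involved; the paper's exact formulation may differ.

```python
import math

def davidson_probs(pi_i, pi_j, nu):
    """Davidson (1970) extension of Bradley-Terry allowing ties
    (illustrative assumption, not necessarily the paper's model).
    pi_i, pi_j > 0 are item worths; nu >= 0 controls tie propensity."""
    denom = pi_i + pi_j + nu * math.sqrt(pi_i * pi_j)
    p_i_wins = pi_i / denom
    p_j_wins = pi_j / denom
    p_tie = nu * math.sqrt(pi_i * pi_j) / denom
    return p_i_wins, p_j_wins, p_tie

# Ward i judged twice as "risky" as ward j; moderate tie propensity.
p_win, p_lose, p_tie = davidson_probs(2.0, 1.0, nu=0.5)
assert abs(p_win + p_lose + p_tie - 1.0) < 1e-12
```

Each tied judgement contributes the p_tie term to the likelihood, which is where the extra computational cost relative to the plain Bradley-Terry model comes from.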
Estimation for time-varying coefficient smoothed quantile regression.
Pub Date: 2024-12-10 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2440056
Lixia Hu, Jinhong You, Qian Huang, Shu Liu
Journal of Applied Statistics 52(9): 1825-1846. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12217113/pdf/

Time-varying coefficient regression is commonly used to model nonstationary stochastic processes. In this paper, we consider a time-varying coefficient convolution-type smoothed quantile regression (conquer). The covariates and errors are assumed to belong to a general class of locally stationary processes. We propose a local linear conquer estimator for the varying-coefficient function and obtain a global Bahadur-Kiefer representation, which yields asymptotic normality. Furthermore, statistical inference based on simultaneous confidence bands is also studied. We investigate the finite-sample performance of the conquer estimator and confirm the validity of our asymptotic theory through extensive simulation studies. We also consider financial volatility data as a real-world application.
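The conquer idea is to replace the nondifferentiable quantile check loss with its convolution against a kernel, making the objective smooth. A minimal sketch with a Gaussian kernel, for which the convolution has a closed form; the paper's kernel and bandwidth choices may differ.

```python
import math

def conquer_loss(u, tau, h):
    """Convolution-smoothed quantile check loss with a Gaussian kernel:
    (rho_tau * K_h)(u) = u * (tau - Phi(-u/h)) + h * phi(u/h),
    where Phi and phi are the standard normal CDF and density.
    Illustrative sketch; recovers the check loss as h -> 0."""
    Phi = 0.5 * (1.0 + math.erf(-u / (h * math.sqrt(2.0))))
    phi = math.exp(-(u / h) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)
    return u * (tau - Phi) + h * phi

# For tiny h the smoothed loss matches the raw check loss rho_tau(u).
u, tau = 1.3, 0.25
check = u * (tau - (1.0 if u < 0 else 0.0))
assert abs(conquer_loss(u, tau, h=1e-4) - check) < 1e-6
```

Because the smoothed loss is differentiable everywhere (it is strictly positive at u = 0 rather than kinked), gradient-based fitting and Bahadur-type expansions become tractable, which is what the local linear estimator exploits.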
Integrative analysis of high-dimensional quantile regression with contrasted penalization.
Pub Date: 2024-12-10 | DOI: 10.1080/02664763.2024.2438799
Panpan Ren, Xu Liu, Xiao Zhang, Peng Zhan, Tingting Qiu
Journal of Applied Statistics 52(9): 1760-1776. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12217111/pdf/

In the era of big data, the simultaneous analysis of multiple high-dimensional, heavy-tailed datasets has become essential. Integrative analysis offers a powerful approach to combining and synthesizing information from these datasets, often outperforming traditional meta-analysis and single-dataset analysis. In this paper, we introduce a novel high-dimensional integrative quantile regression that accommodates the complexities inherent in multi-dataset analysis. A contrast penalty that smooths regression coefficients is introduced to account for across-dataset structures and improve variable selection. To ease the computational burden associated with high-dimensional quantile regression, a new algorithm is developed that efficiently computes solution paths and selects significant variables. Monte Carlo simulations demonstrate its competitive performance. Additionally, the proposed method is applied to data from the China Health and Retirement Longitudinal Study, illustrating its practical utility in identifying influential factors affecting support income for the elderly. The findings indicate that adult children's individual characteristics and emotional comfort are primary factors in support income, and that the extent of their impact varies across regions.
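One common form of a contrast penalty that "smooths regression coefficients" across datasets is a sum of squared pairwise differences between the coefficient of each variable in different datasets. The sketch below is an illustrative assumption about that structure, not the paper's exact penalty.

```python
def contrast_penalty(betas, lam):
    """Toy contrast penalty encouraging coefficient similarity across
    datasets: lam * sum over variables j and dataset pairs (m, m2) of
    (betas[m][j] - betas[m2][j])**2. The paper's penalty may differ."""
    M, p = len(betas), len(betas[0])
    total = 0.0
    for j in range(p):
        for m in range(M):
            for m2 in range(m + 1, M):
                total += (betas[m][j] - betas[m2][j]) ** 2
    return lam * total

# Three datasets, two variables: coefficients that agree incur no cost.
betas = [[1.0, 0.0], [1.2, 0.0], [0.8, 0.1]]
pen = contrast_penalty(betas, lam=0.5)
print(pen)
```

Adding this term to the pooled quantile-loss objective shrinks the per-dataset coefficients toward each other, which is how across-dataset structure improves variable selection relative to fitting each dataset separately.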
To impute or not? Testing multivariate normality on incomplete dataset: revisiting the BHEP test.
Pub Date: 2024-12-09 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2438798
Danijel G Aleksić, Bojana Milošević
Journal of Applied Statistics 52(9): 1742-1759. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12217108/pdf/

In this paper, we focus on testing multivariate normality using the BHEP test with data that are missing completely at random. Our objective is twofold: first, to gain insight into the asymptotic behavior of the BHEP test statistic under two widely used approaches to handling missing data, namely complete-case analysis and imputation; and second, to compare the power of the test under these approaches. Since the complete-case approach removes all elements of the sample with at least one missing component, it may lead to a loss of information. On the other hand, we note that when the test is performed on imputed data as if they were complete, the Type I error becomes severely distorted. To address these issues, we propose an appropriate bootstrap algorithm for approximating p-values. Extensive simulation studies demonstrate that both the mean- and median-imputation approaches exhibit greater power than complete-case analysis, and they open some questions for further research. The proposed methodology is illustrated with real-data examples.
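For orientation, the BHEP statistic is a weighted L2 distance between the empirical characteristic function of standardized data and the standard normal one, and it admits a closed form. A univariate sketch of that closed form is below (the multivariate version replaces squared differences with Mahalanobis distances); this is an illustration of the statistic, not the paper's code.

```python
import math, random

def bhep_statistic(x, beta=1.0):
    """Univariate BHEP statistic: standardize the sample, then evaluate
    the closed-form weighted L2 distance between its empirical
    characteristic function and the standard normal's. Sketch only."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    y = [(v - mean) / sd for v in x]
    b2 = beta ** 2
    term1 = sum(math.exp(-b2 * (yj - yk) ** 2 / 2.0) for yj in y for yk in y) / n
    term2 = 2.0 / math.sqrt(1.0 + b2) * sum(math.exp(-b2 * yj ** 2 / (2.0 * (1.0 + b2))) for yj in y)
    term3 = n / math.sqrt(1.0 + 2.0 * b2)
    return term1 - term2 + term3

random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(50)]
t = bhep_statistic(sample)
assert t >= 0.0  # n times an integral of a squared modulus, so nonnegative
```

Under imputation, the imputed values distort this empirical characteristic function, which is the mechanism behind the Type I error inflation the paper reports and why a bootstrap calibration of p-values is needed.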
A robust and efficient change point detection method for high-dimensional linear models.
Pub Date: 2024-12-03 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2436008
Zhong-Cheng Han, Kong-Sheng Zhang, Yan-Yong Zhao
Journal of Applied Statistics 52(9): 1671-1694. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12217119/pdf/

In the context of linear models, a key problem of interest is estimating the regression coefficients. In certain instances, however, the vector of unknown coefficients in a linear regression model differs from one segment to another. In this paper, we propose a new method for high-dimensional covariates to examine a linear model in which the regression coefficients of two subpopulations may differ. To achieve robustness and efficiency, we introduce modal linear regression to estimate the unknown coefficients. Furthermore, the proposed method can select variables and detect change points. Under mild assumptions, we establish the limiting behavior of the proposed method. Additionally, an estimation algorithm based on a kick-one-off strategy and the SCAD penalty is developed for practical implementation. For illustration, simulation studies and a real dataset are used to assess the performance of the proposed method.
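Modal linear regression, which gives the method its robustness, maximizes the average of a kernel applied to the residuals rather than minimizing squared error, so gross outliers contribute almost nothing. A minimal sketch with a Gaussian kernel and illustrative data:

```python
import math

def modal_objective(y, X, beta, h):
    """Modal regression objective: average Gaussian kernel density of
    the residuals; maximizing it drives the fit toward the conditional
    mode and downweights outliers (illustrative sketch)."""
    n = len(y)
    total = 0.0
    for yi, xi in zip(y, X):
        r = yi - sum(b * v for b, v in zip(beta, xi))
        total += math.exp(-(r / h) ** 2 / 2.0) / (h * math.sqrt(2.0 * math.pi))
    return total / n

y = [1.0, 2.1, 2.9, 50.0]                       # last point is a gross outlier
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
good = modal_objective(y, X, beta=[0.0, 1.0], h=0.5)   # fits the bulk of the data
bad = modal_objective(y, X, beta=[0.0, 12.0], h=0.5)   # chases the outlier
assert good > bad
```

In the paper's setting this objective is combined with SCAD penalization for variable selection, and the segment structure is searched over candidate change points.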
A clustering approach to integrative analyses of multiomic cancer data.
Pub Date: 2024-11-29 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2431742
Dongyan Yan, Subharup Guha
Journal of Applied Statistics 52(8): 1539-1560. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12147493/pdf/

Rapid technological advances have allowed for molecular profiling across multiple omics domains for clinical decision-making in many diseases, especially cancer. However, as tumor development and progression are biological processes involving composite genomic aberrations, key challenges are to effectively assimilate information from these domains to identify genomic signatures and druggable biological entities, develop accurate risk prediction profiles for future patients, and identify novel patient subgroups for tailored therapy and monitoring. We propose integrative frameworks for high-dimensional multiple-domain cancer data. These Bayesian mixture model-based approaches coherently incorporate dependence within and between domains to accurately detect tumor subtypes, thus providing a catalog of genomic aberrations associated with cancer taxonomy. The flexible and scalable Bayesian nonparametric strategy performs simultaneous bidirectional clustering of the tumor samples and genomic probes to achieve dimension reduction. We describe an efficient variable selection procedure that can identify relevant genomic aberrations and potentially reveal underlying drivers of disease. Although the work is motivated by lung cancer datasets, the proposed methods are broadly applicable in a variety of contexts involving high-dimensional data. The success of the methodology is demonstrated using artificial data and lung cancer omics profiles publicly available from The Cancer Genome Atlas.
A robust Bayesian latent position approach for community detection in networks with continuous attributes.
Pub Date: 2024-11-29 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2431736
Zhumengmeng Jin, Juan Sosa, Shangchen Song, Brenda Betancourt
Journal of Applied Statistics 52(8): 1513-1538. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12147515/pdf/

The increasing prevalence of multiplex networks has spurred a critical need to take into account potential dependencies across different layers, especially when the goal is community detection, a fundamental learning task in network analysis. We propose a fully Bayesian mixture model for community detection in both single-layer and multi-layer networks. A key feature of our model is the joint modeling of the nodal attributes that often come with network data as a spatial process over the latent space. In addition, our model for multi-layer networks allows layers to have different strengths of dependency on the shared latent position structure and assumes that the probability of a relation between two actors (in a layer) depends on the distance between their latent positions (multiplied by a layer-specific factor) and the difference between their nodal attributes. Under our prior specifications, the actors' positions in the latent space arise from a finite mixture of Gaussian distributions, each corresponding to a cluster. Simulated examples show that our model outperforms existing benchmark models and exhibits significantly greater robustness when handling datasets with missing values. The model is also applied to a real-world three-layer network of employees in a law firm.
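The core link described above — an edge probability that decays with the latent distance (scaled per layer) and with the nodal-attribute difference — can be sketched with a logistic link. The link function and parameter names are illustrative assumptions, not the paper's exact specification.

```python
import math

def edge_prob(z_i, z_j, x_i, x_j, alpha, layer_factor):
    """Latent position link (illustrative): edge probability decreases
    with the latent distance, scaled by a layer-specific factor, and
    with the nodal-attribute difference."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_i, z_j)))
    eta = alpha - layer_factor * dist - abs(x_i - x_j)
    return 1.0 / (1.0 + math.exp(-eta))

# Two actors with identical attributes: nearby latent positions make an
# edge far more likely than distant ones, in any given layer.
near = edge_prob([0.0, 0.0], [0.1, 0.0], 1.0, 1.0, alpha=1.0, layer_factor=2.0)
far = edge_prob([0.0, 0.0], [3.0, 0.0], 1.0, 1.0, alpha=1.0, layer_factor=2.0)
assert near > far
```

Placing a finite Gaussian mixture prior on the latent positions z then turns community detection into inference on the mixture component each actor belongs to.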