
Biometrics: Latest Articles

A group distributional ICA method for decomposing multi-subject diffusion tensor imaging.
IF 1.7 | CAS Tier 4 (Mathematics) | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf117
Guangming Yang, Ben Wu, Jian Kang, Ying Guo

Diffusion tensor imaging (DTI) is a frequently used imaging modality for investigating white matter fiber connections of the human brain, and it provides an important tool for characterizing human brain structural organization. Common goals in DTI analysis include dimension reduction, denoising, and extraction of underlying structural networks. Blind source separation methods are often used to achieve these goals for other imaging modalities; however, there has been very limited work on multi-subject DTI data. Due to the special characteristics of the 3D diffusion tensor measured in DTI, existing methods such as standard independent component analysis (ICA) cannot be applied directly. We propose a Group Distributional ICA (G-DICA) method to fill this gap. G-DICA is a fundamentally new blind source separation method that represents the parameters in the distribution function of the observed imaging data as a mixture of independent source signals. Decomposing multi-subject DTI using G-DICA uncovers structural networks corresponding to several major white matter fiber bundles in the brain. Through simulation studies and real data applications, the proposed G-DICA method demonstrates superior performance and improved reproducibility compared to the existing method.
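As a rough illustration of the group blind source separation idea only (not the authors' G-DICA algorithm, which operates on the parameters of the diffusion tensor's distribution function), the sketch below runs a plain group ICA on simulated subject-by-voxel maps of a scalar DTI-derived parameter using scikit-learn's FastICA. All dimensions and data are synthetic placeholders.

```python
# Conceptual sketch only: plain group ICA on subject-by-voxel parameter maps.
# The real G-DICA decomposes distribution-function parameters of the 3D diffusion
# tensor; here a scalar map (e.g., fractional anisotropy) stands in for that input.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n_subjects, n_voxels, n_sources = 20, 5000, 4

S_true = rng.laplace(size=(n_sources, n_voxels))      # latent spatial source maps
A_true = rng.normal(size=(n_subjects, n_sources))     # subject-level mixing weights
X = A_true @ S_true + 0.1 * rng.normal(size=(n_subjects, n_voxels))

ica = FastICA(n_components=n_sources, whiten="unit-variance", random_state=0)
subject_loadings = ica.fit_transform(X)               # (n_subjects, n_sources)
spatial_networks = ica.components_                    # (n_sources, n_voxels)
print(subject_loadings.shape, spatial_networks.shape)
```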

Citations: 0
A flexible framework for N-mixture occupancy models: applications to breeding bird surveys.
IF 1.4 | CAS Tier 4 (Mathematics) | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf087
Huu-Dinh Huynh, J Andrew Royle, Wen-Han Hwang

Estimating species abundance under imperfect detection is a key challenge in biodiversity conservation. The N-mixture model, widely recognized for its ability to distinguish between abundance and individual detection probability without marking individuals, is constrained by its stringent closure assumption, which leads to biased estimates when violated in real-world settings. To address this limitation, we propose an extended framework based on a development of the mixed Gamma-Poisson model, incorporating a community parameter that represents the proportion of individuals consistently present throughout the survey period. This flexible framework generalizes both the zero-inflated type occupancy model and the standard N-mixture model as special cases, corresponding to community parameter values of 0 and 1, respectively. The model's effectiveness is validated through simulations and applications to real-world datasets, specifically with 5 species from the North American Breeding Bird Survey and 46 species from the Swiss Breeding Bird Survey, demonstrating its improved accuracy and adaptability in settings where strict closure may not hold.
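For readers unfamiliar with the base model, the snippet below evaluates the marginal likelihood of a single site under the standard N-mixture model (Poisson abundance, binomial detection over repeated visits), which the paper's framework recovers as the special case where the community parameter equals 1. The counts, parameter values, and truncation point are illustrative only.

```python
# Minimal sketch of the standard N-mixture site likelihood, marginalizing the
# latent abundance N over a truncated range. Not the paper's extended model.
import numpy as np
from scipy.stats import poisson, binom

def site_loglik(counts, lam, p, N_max=200):
    """Log-likelihood of one site's repeated-visit counts under N-mixture."""
    N = np.arange(counts.max(), N_max + 1)                    # feasible abundances
    log_prior = poisson.logpmf(N, lam)                        # abundance model
    log_det = binom.logpmf(counts[:, None], N[None, :], p)    # detection per visit
    return np.logaddexp.reduce(log_prior + log_det.sum(axis=0))

counts = np.array([3, 5, 4])            # detections on 3 visits to one site
print(site_loglik(counts, lam=6.0, p=0.6))
```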

Citations: 0
Correction to "Propensity weighting plus adjustment in proportional hazards model is not doubly robust," by Erin E. Gabriel, Michael C. Sachs, Ingeborg Waernbaum, Els Goetghebeur, Paul F. Blanche, Stijn Vansteelandt, Arvid Sjölander, and Thomas Scheike; Volume 80, Issue 3, September 2024, https://doi.org/10.1093/biomtc/ujae069. 对Erin E. Gabriel、Michael C. Sachs、Ingeborg Waernbaum、Els Goetghebeur、Paul F. Blanche、Stijn Vansteelandt、Arvid Sjölander和Thomas Scheike的“比例风险模型的倾向加权加调整并非双重稳健”的修正;80卷,第3期,2024年9月,https://doi.org/10.1093/biomtc/ujae069。
IF 1.4 | CAS Tier 4 (Mathematics) | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf091
Erin E Gabriel, Michael C Sachs, Ingeborg Waernbaum, Els Goetghebeur, Paul F Blanche, Stijn Vansteelandt, Arvid Sjölander, Thomas Scheike
{"title":"Correction to \"Propensity weighting plus adjustment in proportional hazards model is not doubly robust,\" by Erin E. Gabriel, Michael C. Sachs, Ingeborg Waernbaum, Els Goetghebeur, Paul F. Blanche, Stijn Vansteelandt, Arvid Sjölander, and Thomas Scheike; Volume 80, Issue 3, September 2024, https://doi.org/10.1093/biomtc/ujae069.","authors":"Erin E Gabriel, Michael C Sachs, Ingeborg Waernbaum, Els Goetghebeur, Paul F Blanche, Stijn Vansteelandt, Arvid Sjölander, Thomas Scheike","doi":"10.1093/biomtc/ujaf091","DOIUrl":"https://doi.org/10.1093/biomtc/ujaf091","url":null,"abstract":"","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 3","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144706181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Using model-assisted calibration methods to improve efficiency of regression analyses using two-phase samples or pooled samples under complex survey designs.
IF 1.7 | CAS Tier 4 (Mathematics) | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf092
Lingxiao Wang

Two-phase sampling designs are frequently applied in epidemiological studies and large-scale health surveys. In such designs, certain variables are collected exclusively within a second-phase random subsample of the initial first-phase sample, often due to factors such as high costs, response burden, or constraints on data collection or assessment. Consequently, second-phase sample estimators can be inefficient due to the diminished sample size. Model-assisted calibration methods have been used to improve the efficiency of second-phase estimators in regression analysis. However, little of the existing literature provides valid finite-population inference for calibration estimators that use appropriate calibration auxiliary variables while simultaneously accounting for the complex sample designs of the first- and second-phase samples. Moreover, no existing work considers the "pooled design," in which some covariates are measured only in certain repeated survey cycles. This paper proposes calibrating the sample weights for the second-phase sample to the weighted first-phase sample based on score functions of the regression model that uses predictions of the second-phase variable for the first-phase sample. We establish the consistency of estimation using calibrated weights and provide variance estimation for the regression coefficients under the two-phase design or the pooled design nested within complex survey designs. Empirical evidence highlights the efficiency and robustness of the proposed calibration compared to existing calibration and imputation methods. Data examples from the National Health and Nutrition Examination Survey are provided.
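The following sketch shows generic linear (GREG-type) calibration, in which second-phase design weights are adjusted so that weighted totals of auxiliary variables reproduce first-phase totals. The paper's proposal instead calibrates to score functions built from predictions of the second-phase variable, which is not reproduced here; the data and totals below are made up.

```python
# Hedged sketch of linear calibration: adjust base weights d so that the weighted
# totals of auxiliary variables X match known (first-phase) totals T.
import numpy as np

def linear_calibration(d, X, T):
    """Return w = d * (1 + X @ lam) satisfying the calibration equations X' w = T."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    A = (X * d[:, None]).T @ X                 # sum_i d_i x_i x_i'
    lam = np.linalg.solve(A, T - X.T @ d)      # solve the calibration equations
    return d * (1.0 + X @ lam)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])   # intercept + one auxiliary
d = np.full(50, 4.0)                                      # second-phase base weights
T = np.array([200.0, 10.0])                               # first-phase weighted totals
w = linear_calibration(d, X, T)
print(np.allclose(X.T @ w, T))                            # constraints hold exactly
```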

Citations: 0
Mastering rare event analysis: subsample-size determination in Cox and logistic regressions.
IF 1.7 | CAS Tier 4 (Mathematics) | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf110
Tal Agassi, Nir Keret, Malka Gorfine

In the realm of contemporary data analysis, the use of massive datasets has taken on heightened significance, albeit often entailing considerable demands on computational time and memory. While a multitude of existing works offer optimal subsampling methods for conducting analyses on subsamples with minimized efficiency loss, they notably lack tools for judiciously selecting the subsample size. To bridge this gap, our work introduces tools designed for choosing the subsample size. We focus on three settings: the Cox regression model for survival data with rare events, and logistic regression for both balanced and imbalanced datasets. Additionally, we present a new optimal subsampling procedure tailored to logistic regression with imbalanced data. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and meticulous analyses of two sizable datasets: survival analysis of UK Biobank colorectal cancer data with about 350 million rows and logistic regression of linked birth and infant death data with about 28 million observations.
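As background for the imbalanced-data setting, here is a minimal sketch of classical case-control subsampling for a rare-event logistic regression: keep every event, sample a small fraction of non-events, and recover the full-data coefficients with inverse-probability weights. This is not the paper's optimal-subsampling procedure or its subsample-size rules; the simulated data and the statsmodels-based fit are illustrative assumptions.

```python
# Sketch: keep all events, subsample controls, reweight (classical case-control
# subsampling with inverse-probability weights; not the paper's optimal design).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200_000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(-6.0 + 1.0 * x)))          # rare events (~0.3%)
y = rng.binomial(1, p)

frac = 0.02                                           # control sampling fraction
keep = (y == 1) | (rng.uniform(size=n) < frac)        # all cases + ~2% of controls
w = np.where(y[keep] == 1, 1.0, 1.0 / frac)           # inverse-probability weights

X = sm.add_constant(x[keep])
fit = sm.GLM(y[keep], X, family=sm.families.Binomial(), freq_weights=w).fit()
print(fit.params)                                     # roughly recovers (-6, 1)
```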

Citations: 0
Cumulative incidence function estimation using population-based biobank data.
IF 1.7 | CAS Tier 4 (Mathematics) | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf049
Malka Gorfine, David M Zucker, Shoval Shoham

Many countries have established population-based biobanks, which are being used increasingly in epidemiological and clinical research. These biobanks offer opportunities for large-scale studies addressing questions beyond the scope of traditional clinical trials or cohort studies. However, using biobank data poses new challenges. Typically, biobank data are collected from a study cohort recruited over a defined calendar period, with subjects entering the study at various ages falling between $c_L$ and $c_U$. This work focuses on biobank data with individuals reporting disease-onset age upon recruitment, termed prevalent data, along with individuals initially recruited as healthy, and their disease onset observed during the follow-up period. We propose a novel cumulative incidence function (CIF) estimator that efficiently incorporates prevalent cases, in contrast to existing methods, providing two advantages: (1) increased efficiency and (2) CIF estimation for ages before the lower limit, $c_L$.
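For reference, the snippet below computes the standard nonparametric (Aalen-Johansen type) cumulative incidence estimator from incident, right-censored competing-risks data. It represents the incident-only baseline; the paper's estimator, which additionally incorporates prevalent cases reported at recruitment, is not implemented here, and the toy event times are invented.

```python
# Aalen-Johansen style CIF for one cause from incident data only (baseline method).
import numpy as np

def cumulative_incidence(time, event, cause=1):
    """CIF_k(t) = sum over event times t_i <= t of S(t_i-) * d_{k,i} / n_i; event=0 = censored."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event)
    surv, cif, grid, values = 1.0, 0.0, [], []
    for t in np.unique(time[event > 0]):           # distinct event times, ascending
        at_risk = np.sum(time >= t)
        d_any = np.sum((time == t) & (event > 0))
        d_cause = np.sum((time == t) & (event == cause))
        cif += surv * d_cause / at_risk            # add S(t-) * cause-specific hazard
        surv *= 1.0 - d_any / at_risk              # update overall survival S(t)
        grid.append(t)
        values.append(cif)
    return np.array(grid), np.array(values)

t = np.array([2., 3., 3., 5., 6., 7., 9.])
e = np.array([1, 0, 2, 1, 0, 1, 2])               # 1 = cause of interest, 2 = competing, 0 = censored
print(cumulative_incidence(t, e, cause=1))
```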

Citations: 0
Statistical significance of clustering for count data.
IF 1.7 | CAS Tier 4 (Mathematics) | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf120
Yifan Dai, Di Wu, Yufeng Liu

Clustering is widely used in biomedical research for meaningful subgroup identification. However, most existing clustering algorithms do not account for the statistical uncertainty of the resulting clusters and consequently may generate spurious clusters due to natural sampling variation. To address this problem, the Statistical Significance of Clustering (SigClust) method was developed to evaluate the significance of clusters in high-dimensional data. While SigClust has been successful in assessing clustering significance for continuous data, it is not specifically designed for discrete data, such as count data in genomics. Moreover, SigClust and its variations can suffer from reduced statistical power when applied to non-Gaussian high-dimensional data. To overcome these limitations, we propose SigClust-DEV, a method designed to evaluate the significance of clusters in count data. Through extensive simulations, we compare SigClust-DEV against other existing SigClust approaches across various count distributions and demonstrate its superior performance. Furthermore, we apply our proposed SigClust-DEV to Hydra single-cell RNA sequencing (scRNA) data and electronic health records (EHRs) of cancer patients to identify meaningful latent cell types and patient subgroups, respectively.
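The general SigClust-style recipe, adapted naively to counts, is sketched below: compute the 2-means cluster index on the observed matrix, simulate many datasets from a single-cluster Poisson null fitted by the column means, and report the Monte Carlo p-value. The authors' SigClust-DEV uses a more careful null model and test procedure; this only illustrates the resampling logic, with made-up data.

```python
# Hedged sketch of a SigClust-style Monte Carlo test with a single-cluster Poisson
# null for count data (illustration only; not the authors' SigClust-DEV).
import numpy as np
from sklearn.cluster import KMeans

def cluster_index(X):
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    within = km.inertia_                               # within-cluster sum of squares
    total = np.sum((X - X.mean(axis=0)) ** 2)
    return within / total                              # small value = strong 2-cluster structure

def sigclust_count_pvalue(X, n_sim=200, seed=0):
    rng = np.random.default_rng(seed)
    ci_obs = cluster_index(X)
    lam = X.mean(axis=0)                               # single-cluster Poisson null, per feature
    ci_null = np.array([cluster_index(rng.poisson(lam, size=X.shape))
                        for _ in range(n_sim)])
    return np.mean(ci_null <= ci_obs)                  # smaller index is more extreme

rng = np.random.default_rng(1)
X = np.vstack([rng.poisson(3, size=(40, 30)), rng.poisson(8, size=(40, 30))])
print(sigclust_count_pvalue(X))
```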

Citations: 0
Improved prediction and flagging of extreme random effects for non-Gaussian outcomes using weighted methods.
IF 1.7 | CAS Tier 4 (Mathematics) | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf094
John Neuhaus, Charles McCulloch, Ross Boylan

Investigators often focus on predicting extreme random effects from mixed effects models fitted to longitudinal or clustered data, and on identifying or "flagging" outliers such as poorly performing hospitals or rapidly deteriorating patients. Our recent work with Gaussian outcomes showed that weighted prediction methods can substantially reduce mean square error of prediction for extremes and substantially increase correct flagging rates compared to previous methods, while controlling the incorrect flagging rates. This paper extends the weighted prediction methods to non-Gaussian outcomes such as binary and count data. Closed-form expressions for predicted random effects and probabilities of correct and incorrect flagging are not available for the usual non-Gaussian outcomes, and the computational challenges are substantial. Therefore, our results include the development of theory to support algorithms that tune predictors that we call "self-calibrated" (which control the incorrect flagging rate using very simple flagging rules) and innovative numerical methods to calculate weighted predictors as well as to evaluate their performance. Comprehensive numerical evaluations show that the novel weighted predictors for non-Gaussian outcomes have substantially lower mean square error of prediction at the extremes and considerably higher correct flagging rates than previously proposed methods, while controlling the incorrect flagging rates. We illustrate our new methods using data on emergency room readmissions for children with asthma.
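To make the prediction target concrete, the sketch below computes unweighted posterior-mean predictions of cluster-specific random intercepts in a Poisson random-intercept model by numerical integration on a grid, then flags the top 5% as "extreme." This is the standard predictor with a naive flagging rule, not the authors' weighted or self-calibrated predictors, and the model parameters are treated as known for simplicity.

```python
# Baseline sketch: grid-based posterior means of random intercepts and naive flagging.
import numpy as np

def predict_random_intercepts(y_by_cluster, beta0=1.0, sigma_b=0.7):
    """Posterior mean of b_i in y_ij ~ Poisson(exp(beta0 + b_i)), b_i ~ N(0, sigma_b^2)."""
    b = np.linspace(-5 * sigma_b, 5 * sigma_b, 401)        # integration grid
    log_prior = -0.5 * (b / sigma_b) ** 2
    preds = []
    for y in y_by_cluster:                                 # counts for one cluster's visits
        mu = np.exp(beta0 + b)
        log_lik = y.sum() * (beta0 + b) - len(y) * mu      # Poisson log-lik, up to a constant
        log_post = log_prior + log_lik
        w = np.exp(log_post - log_post.max())
        preds.append(np.sum(b * w) / w.sum())              # posterior mean of b_i
    return np.array(preds)

rng = np.random.default_rng(3)
b_true = rng.normal(0.0, 0.7, size=100)
y_by_cluster = [rng.poisson(np.exp(1.0 + bi), size=4) for bi in b_true]
b_hat = predict_random_intercepts(y_by_cluster)
flagged = b_hat > np.quantile(b_hat, 0.95)                 # naive flagging of extreme clusters
print(int(flagged.sum()))
```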

Citations: 0
A monotone single index model for spatially referenced multistate current status data.
IF 1.7 | CAS Tier 4 (Mathematics) | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf105
Snigdha Das, Minwoo Chae, Debdeep Pati, Dipankar Bandyopadhyay

Assessment of multistate disease progression is commonplace in biomedical research, such as in periodontal disease (PD). However, the presence of multistate current status endpoints, where only a single snapshot of each subject's progression through disease states is available at a random inspection time after a known starting state, complicates the inferential framework. In addition, these endpoints can be clustered, and spatially associated, where a group of proximally located teeth (within subjects) may experience similar PD status, compared to those distally located. Motivated by a clinical study recording PD progression, we propose a Bayesian semiparametric accelerated failure time model with an inverse-Wishart proposal for accommodating (spatial) random effects, and flexible errors that follow a Dirichlet process mixture of Gaussians. For clinical interpretability, the systematic component of the event times is modeled using a monotone single index model, with the (unknown) link function estimated via a novel integrated basis expansion and basis coefficients endowed with constrained Gaussian process priors. In addition to establishing parameter identifiability, we present scalable computing via a combination of elliptical slice sampling, fast circulant embedding techniques, and smoothing of hard constraints, leading to straightforward estimation of parameters, and state occupation and transition probabilities. Using synthetic data, we study the finite sample properties of our Bayesian estimates and their performance under model misspecification. We also illustrate our method via application to the real clinical PD dataset.
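As a much-simplified picture of the "monotone function of a single index" component, the snippet below fixes the index direction and fits the monotone link by isotonic regression. The paper instead estimates the link through an integrated basis expansion with constrained Gaussian process priors inside a Bayesian AFT model for current status data; the direction theta, the true link, and the noise level here are assumptions for illustration.

```python
# Toy monotone single index fit: E[y | x] = g(x @ theta) with g estimated by
# isotonic regression for a fixed index direction (not the paper's Bayesian method).
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(4)
n, p = 300, 3
theta = np.array([0.6, 0.8, 0.0])                 # assumed known, unit-norm index direction
X = rng.normal(size=(n, p))
u = X @ theta                                     # the single index
y = np.tanh(u) + 0.1 * rng.normal(size=n)         # true monotone link plus noise

g_hat = IsotonicRegression(out_of_bounds="clip").fit(u, y)
grid = np.linspace(u.min(), u.max(), 5)
print(np.round(g_hat.predict(grid), 2))           # nondecreasing fitted link values
```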

Citations: 0
Simple simulation based reconstruction of incidence rates from death data.
IF 1.4 | CAS Tier 4 (Mathematics) | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf088
Simon N Wood

Daily deaths from an infectious disease provide a means for retrospectively inferring daily incidence, given knowledge of the infection-to-death interval distribution. Existing methods for doing so rely either on fitting simplified non-linear epidemic models to the deaths data or on spline-based deconvolution approaches. The former runs the risk of introducing unintended artefacts via the model formulation, while the latter may be viewed as technically obscure, impeding uptake by practitioners. This note proposes a simple simulation-based approach to inferring fatal incidence from deaths that requires minimal assumptions, is easy to understand, and allows testing of alternative hypothesized incidence trajectories. The aim is that in any future situation similar to the COVID pandemic, the method can be easily, rapidly, transparently, and uncontroversially deployed as an input to management.
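The forward-simulation step described above can be sketched as a convolution check: a hypothesized daily fatal-incidence trajectory is convolved with an infection-to-death interval distribution to give expected daily deaths, which are then scored against observed deaths with a Poisson log-likelihood. The exponential trajectory, the gamma delay distribution, and the synthetic death counts below are illustrative assumptions, not the paper's inputs.

```python
# Sketch: score a hypothesized fatal-incidence trajectory against observed deaths.
import numpy as np
from scipy.stats import gamma, poisson

days = np.arange(60)
inc_hypothesis = 50 * np.exp(0.08 * days)              # hypothesized daily fatal infections

# Infection-to-death interval: discretized gamma, mean about 24 days, truncated at 60
delay = np.diff(gamma.cdf(np.arange(0, 61), a=6.0, scale=4.0))
expected_deaths = np.convolve(inc_hypothesis, delay)[:len(days)]

rng = np.random.default_rng(5)
observed_deaths = rng.poisson(expected_deaths)          # stand-in for real death counts

loglik = poisson.logpmf(observed_deaths, np.maximum(expected_deaths, 1e-9)).sum()
print(round(loglik, 1))                                 # higher = better-supported trajectory
```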

Citations: 0