Conditional dependence relationships for random vectors are extensively studied and broadly applied. But it is not clear how to construct a dependence graph for unstructured data such as concept words or phrases in a text corpus, where the variables (concepts) are not jointly observed under an i.i.d. assumption. Using a global embedding method such as GloVe, we obtain ‘structured’ representation vectors for the concepts. We then assume that all the concept vectors jointly follow a matrix normal distribution with sparse precision matrices. Given the observed word-word co-occurrence matrix and the GloVe construction procedure, this assumption can be tested empirically, and the asymptotic distribution of the test statistic is derived. Another advantage of the matrix-normal assumption is that the linear additivity exploited in word analogy tasks becomes natural and straightforward. Unlike knowledge-graph methods, the conditional dependence graph describes the dependence structure between concepts given all other concepts: concepts (nodes) linked by an edge cannot be separated by the remaining concepts, so an edge represents an essential semantic relationship. There is no need to enumerate all related pairs as head and tail elements of triplets, as in the knowledge-graph regime, and the only relation type in this graph is conditional dependence between concepts. A penalized matrix normal graphical model (MNGM) is then employed to learn the conditional dependence graph over both the concepts and the embedding ‘dimensions’. Since the concept words form a graph of very high dimension, we employ the MDMC optimization method to speed up the glasso algorithm; the algorithm is also adaptive to the incremental accumulation of new concepts in the corpus. In addition, we propose a sentence-granularity bootstrap to obtain ‘independent’ repeated samples that strengthen the penalized MNGM. We name the proposed method Matrix-GloVe.
In simulation studies, we verify that the graph learned by Matrix-GloVe is better suited to Graph Convolutional Networks (GCN) than a correlation graph, i.e. a graph determined by the k-NN method. We apply the proposed method in two real-data scenarios. The first is concept-graph learning for concepts in a textbook corpus, where two tasks are studied: comparing the vectors produced by GloVe with those of other word2vec-style methods, i.e. CBOW and Skip-Gram, as inputs to the penalized MNGM; and link prediction among the concepts. Matrix-GloVe performs better on both tasks. In the second scenario, Matrix-GloVe is applied to a downstream method, namely GCN. For node classification on the BBC and BBCSport datasets, both GCN with Matrix-GloVe and GCN with Matrix-GloVe plus DeepWalk outperform GCN with k-NN.
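The core estimation loop behind a matrix normal graphical model can be sketched with the classical flip-flop update, alternating between the row (concept) and column (embedding-dimension) covariances. The sketch below is unpenalized maximum likelihood on simulated data; the paper's penalized MNGM adds an l1 (glasso) penalty at each step, uses MDMC to scale it, and obtains 'independent' repeats via the sentence-granularity bootstrap. All sizes and data here are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: n 'independent' p x q matrix samples (the paper obtains
# such repeats via a sentence-granularity bootstrap; we just simulate).
n, p, q = 200, 6, 4
X = rng.standard_normal((n, p, q))  # rows ~ concepts, cols ~ embedding dims

# Flip-flop maximum likelihood for X_i ~ MN(0, U, V): alternate the
# closed-form updates of the row covariance U and column covariance V.
# The penalized MNGM replaces each step with a glasso fit to obtain
# *sparse* precision matrices U^-1 and V^-1.
U = np.eye(p)
V = np.eye(q)
for _ in range(20):
    Vinv = np.linalg.inv(V)
    U = sum(Xi @ Vinv @ Xi.T for Xi in X) / (n * q)
    Uinv = np.linalg.inv(U)
    V = sum(Xi.T @ Uinv @ Xi for Xi in X) / (n * p)
    V /= V[0, 0]          # fix the scale indeterminacy of the pair (U, V)

# Zero entries of the row precision would correspond to concepts that are
# conditionally independent given all the others (separated in the graph).
Omega_row = np.linalg.inv(U)
print(Omega_row.shape)
```

The scale fix is needed because U and V are only identified up to a multiplicative constant in the Kronecker product V ⊗ U.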
Jizheng Lai, Jianxin Yin. Learning conditional dependence graph for concepts via matrix normal graphical model. Statistics and Its Interface, published 2024-02-01. DOI: 10.4310/23-sii784.
The field of matrix data learning has witnessed significant advancements in recent years, encompassing diverse datasets such as medical images, social networks, and personalized recommendation systems. These advancements have found widespread application in various domains, including medicine, biology, public health, engineering, finance, economics, sports analytics, and environmental sciences. While extensive research has been conducted on estimation, inference, prediction, and computation for matrix data, the ranking problem has not received adequate attention. Statistical depth, a measure providing a center-outward rank for different data types, has been introduced in the past few decades. However, its exploration has been limited due to the complexity of second- and higher-order statistics. In this paper, we propose an approach to rank matrix data by employing a model-based depth framework. Our methodology involves estimating the eigen-decomposition of a 4th-order covariance tensor. To enable this process using conventional matrix operations, we specify the tensor product operator between matrices and 4th-order tensors. Furthermore, we introduce a Kronecker product form on the covariance to enhance the robustness and efficiency of the estimation process, effectively reducing the number of parameters in the model. Based on this new framework, we develop an efficient algorithm to estimate the model-based statistical depth. To validate the effectiveness of our proposed method, we conduct simulations and apply it to two real-world applications: field goal attempts of NBA players and global temperature anomalies.
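The parameter-reduction step can be illustrated with the simplest model-based depth, the Mahalanobis depth: under a Kronecker covariance cov(vec X) = V ⊗ U, the quadratic form reduces to a trace, so the full 4th-order covariance tensor never needs to be formed. This is only a special case used for illustration, not the paper's eigen-decomposition-based depth; all matrices below are assumed toy values.

```python
import numpy as np

# Mahalanobis-type depth for p x q matrices under a Kronecker covariance
# cov(vec X) = V (x) U.  The identity
#   vec(R)' (V (x) U)^{-1} vec(R) = trace(U^{-1} R V^{-1} R')
# lets us work with small p x p and q x q matrices only.
def matrix_depth(X, M, U, V):
    R = X - M
    d2 = np.trace(np.linalg.inv(U) @ R @ np.linalg.inv(V) @ R.T)
    return 1.0 / (1.0 + d2)  # deepest point (the center) has depth 1

p, q = 4, 3
M = np.zeros((p, q))
U, V = np.eye(p), np.eye(q)

center = matrix_depth(M, M, U, V)        # the center itself
far = matrix_depth(M + 10.0, M, U, V)    # a point far from the center
print(center, far)
```

Points farther from the center in the metric induced by U and V receive strictly smaller depth, giving the center-outward rank mentioned in the abstract.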
Yue Mu, Guanyu Hu, Wei Wu. Model-based statistical depth for matrix data. Statistics and Its Interface, published 2024-02-01. DOI: 10.4310/23-sii829.
Nan-Jung Hsu, Hsin-Cheng Huang, Ruey S. Tsay, Tzu-Chieh Kao
We develop a matrix-variate autoregressive (MAR) model to analyze spatio-temporal data organized on a regular grid in space. The model is an extension of the bilinear MAR spatial model of Hsu, Huang and Tsay $href{ https://doi.org/10.1080/10618600.2021.1938587 }{[10]}$ by increasing its flexibility and applicability in empirical applications. Specifically, we propose to model each autoregressive (AR) coefficient matrix of the MAR model by $R$ bilinear terms, thereby establishing a rank‑R model. The extension can be interpreted as decomposing the AR dynamics of the data into $R$ bilinear MAR components. We further incorporate a banded neighborhood structure for AR coefficient matrices and utilize a flexible nonstationary low-rank covariance model for the spatial innovation process, leading to a parsimonious model without sacrificing its flexibility. We estimate all parameters of the model by the maximum likelihood method and develop a computationally efficient alternating direction method of multipliers algorithm, involving only closed-form expressions in all steps. Applications to a wind-speed dataset and an employment dataset, as well as two simulation experiments, demonstrate the effectiveness of the proposed method in estimation, model selection, and prediction.
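The rank-$R$ dynamics can be made concrete by simulating from the model: each step applies the $R$ bilinear MAR components to the previous matrix and adds an innovation. The coefficient matrices below are small random matrices scaled toward stationarity (the paper uses banded coefficients and a nonstationary low-rank spatial innovation covariance; both are simplified here, so everything is illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate from the rank-R matrix autoregression
#   X_t = sum_{r=1..R} A_r X_{t-1} B_r' + E_t .
p, q, R, T = 5, 4, 2, 100
A = [0.3 * rng.standard_normal((p, p)) / np.sqrt(p) for _ in range(R)]
B = [0.3 * rng.standard_normal((q, q)) / np.sqrt(q) for _ in range(R)]

X = np.zeros((T, p, q))
X[0] = rng.standard_normal((p, q))
for t in range(1, T):
    # sum of R bilinear MAR components applied to the previous field
    X[t] = sum(Ar @ X[t - 1] @ Br.T for Ar, Br in zip(A, B))
    # iid Gaussian innovation; the paper instead uses a flexible
    # nonstationary low-rank spatial covariance for this term
    X[t] += rng.standard_normal((p, q))

print(X.shape)
```

With $R = 1$ this reduces to the bilinear MAR model of Hsu, Huang and Tsay that the paper extends.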
Nan-Jung Hsu, Hsin-Cheng Huang, Ruey S. Tsay, Tzu-Chieh Kao. Rank-R matrix autoregressive models for modeling spatio-temporal data. Statistics and Its Interface, published 2024-02-01. DOI: 10.4310/23-sii812.
Pub Date: 2024-01-01. Epub Date: 2024-07-19. DOI: 10.4310/23-sii815
Yimei Li, Jade Xiaoqing Wang, Grace Chen Zhou, Heather M Conklin, Arzu Onar-Thomas, Amar Gajjar, Wilburn E Reddick, Cai Li
Aggressive cancer treatments that affect the central nervous system are associated with an increased risk of cognitive deficits. As treatment for pediatric brain tumors has become more effective, there has been a heightened focus on improving cognitive outcomes, which can significantly affect the quality of life for pediatric cancer survivors. This paper is motivated by and applied to a clinical trial for medulloblastoma, the most common malignant brain tumor in children. The trial collects comprehensive data including treatment-related clinical information, neuroimaging, and longitudinal neurocognitive outcomes to enhance our understanding of the responses to treatment and the enduring impacts of radiation therapy on the survivors of medulloblastoma. To this end, we have developed a new mediation model tailored for longitudinal outcomes with high-dimensional imaging mediators. Specifically, we adopt a joint binary Ising-Gaussian Markov random field prior distribution to account for spatial dependency and smoothness of ultra-high-dimensional neuroimaging mediators for enhancing detection power of informative voxels. By exploiting the proposed approach, we identify causal pathways and the corresponding white matter microstructures mediating the negative impact of irradiation on neurodevelopment. The results provide guidance on sparing the brain regions and improving long-term neurodevelopment for pediatric cancer survivors. Simulation studies also confirm the validity of the proposed method.
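The spatial-smoothness idea behind the binary Ising part of the prior can be sketched in a few lines: selection indicators on neighbouring voxels are encouraged to agree, so informative voxels are detected in spatially coherent patches. The sweep below is one Gibbs update for indicators on a 2-D grid given per-voxel evidence; the parameters `a`, `b` and the `loglik_ratio` term are assumptions for illustration, and the Gaussian MRF component for the mediator effects is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(5)

# One Gibbs sweep over binary voxel-selection indicators gamma with an
# Ising prior: log-odds(gamma_v = 1) = a + b * (#active neighbours)
# plus a per-voxel data-evidence term.
a, b = -1.0, 0.8                          # sparsity / smoothness (assumed)
H, W = 10, 10
gamma = rng.integers(0, 2, (H, W))
loglik_ratio = rng.normal(0, 1, (H, W))   # stand-in for data evidence

for i in range(H):
    for j in range(W):
        nb = sum(gamma[x, y] for x, y in
                 [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
                 if 0 <= x < H and 0 <= y < W)
        logit = a + b * nb + loglik_ratio[i, j]
        p_on = 1.0 / (1.0 + np.exp(-logit))
        gamma[i, j] = rng.random() < p_on

print(gamma.sum())
```

Larger `b` yields smoother selected regions, which is how the prior enhances detection power for spatially clustered informative voxels.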
Yimei Li, Jade Xiaoqing Wang, Grace Chen Zhou, Heather M Conklin, Arzu Onar-Thomas, Amar Gajjar, Wilburn E Reddick, Cai Li. Imaging mediation analysis for longitudinal outcomes: a case study of childhood brain tumor survivorship. Statistics and Its Interface 17(3): 533-548. DOI: 10.4310/23-sii815. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12467661/pdf/
Heterogeneous survival data are commonly present in chronic disease studies. Delineating meaningful disease subtypes directly linked to a survival outcome can generate useful scientific implications. In this work, we develop a latent class proportional hazards (PH) regression framework to address such an interest. We propose mixture proportional hazards modeling, which flexibly accommodates class-specific covariate effects while allowing for the baseline hazard function to vary across latent classes. Adapting the strategy of nonparametric maximum likelihood estimation, we derive an Expectation-Maximization (E‑M) algorithm to estimate the proposed model. We establish the theoretical properties of the resulting estimators. Extensive simulation studies are conducted, demonstrating satisfactory finite-sample performance of the proposed method as well as the predictive benefit from accounting for the heterogeneity across latent classes. We further illustrate the practical utility of the proposed method through an application to a mild cognitive impairment (MCI) cohort in the Uniform Data Set.
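The E/M alternation for a latent class survival model has a simple shape that can be shown on the most basic proportional-hazards special case: a two-class mixture of exponential survival times with no censoring and no covariates. This is far simpler than the paper's model (class-specific covariate effects, nonparametric baseline hazards via NPMLE), so the class count, hazards, and sample sizes below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Survival times from two latent classes with different hazards.
t = np.concatenate([rng.exponential(1.0, 300),    # class 1: hazard ~ 1.0
                    rng.exponential(5.0, 300)])   # class 2: hazard ~ 0.2

pi = np.array([0.5, 0.5])    # class weights
lam = np.array([2.0, 0.1])   # class-specific exponential hazards
for _ in range(200):
    # E-step: posterior class membership of each subject
    dens = pi * lam * np.exp(-np.outer(t, lam))   # (n, 2) class densities
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form updates of weights and hazards
    pi = gamma.mean(axis=0)
    lam = gamma.sum(axis=0) / (gamma * t[:, None]).sum(axis=0)

print(np.sort(lam))   # roughly recovers the true hazards 0.2 and 1.0
```

Allowing the hazards (and, in the paper, whole baseline hazard functions and covariate effects) to differ by class is what captures the heterogeneity across disease subtypes.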
Teng Fei, John J. Hanfelt, Limin Peng. Latent class proportional hazards regression with heterogeneous survival data. Statistics and Its Interface, published 2023-11-27. DOI: 10.4310/23-sii785.
In practice, either the Bayesian or the frequentist method is typically used; although there are some combined uses of the two, a formal unified methodology has not yet appeared. Here we first give a brief review of the two methods and some combinations of them, then propose a procedure that uses both the frequentist likelihood and the Bayesian posterior loss in parameter estimation and hypothesis testing, as an attempt to unify the two. Basic properties of the proposed method are studied, and simulation studies are carried out to evaluate its performance.
Jinfeng Xu, Ao Yuan. Frequentist Bayesian compound inference. Statistics and Its Interface, published 2023-11-27. DOI: 10.4310/23-sii797.
Lincheng Zhao was admitted to the Department of Applied Mathematics of the University of Science and Technology of China (USTC) in 1960, three years before me, and then took a year off due to illness and transferred to the entering class of 1961. Neither of us was good at socializing, so although we were classmates for three years, we did not know each other. It was in 1978, when we were both admitted to the Department of Mathematics for graduate studies, that we became acquainted. Since then we have been friends, helping each other in all aspects of research and life, and serving as mentors to one another. On the occasion of Professor Zhao’s 80th birthday, I would like to recall a few of the events of our acquaintance and friendship, to express my gratitude to Academic Elder Brother Zhao.
Zhidong Bai. Guiding light: An essay for Professor Lincheng Zhao on the occasion of his 80th birthday. Statistics and Its Interface, published 2023-11-27. DOI: 10.4310/22-sii772.
Longitudinal data, which involve measuring a group of subjects repeatedly over time, frequently arise in many clinical and biomedical applications. To identify the complex patterns of change in the outcome and their association with covariates over time, a sufficiently flexible model is always required. Nonparametric regression, known for being data-adaptive and less restrictive than parametric approaches, becomes a promising tool for handling longitudinal data. This paper reviews various nonparametric regression methods for longitudinal data, including specific traditional nonparametric methods for the univariate case and several representative methods for the multivariate case, among which tree-based techniques are dominant. We summarize their motivations and provide a brief practical performance comparison of these methods in simulations, as well as discuss potential future research directions.
Changxin Yang, Zhongyi Zhu. A review of nonparametric regression methods for longitudinal data. Statistics and Its Interface, published 2023-11-27. DOI: 10.4310/23-sii801.
Copy number variations (CNVs) are a form of structural variation of a DNA sequence, including amplification and deletion of particular DNA segments on chromosomes. Due to the huge amount of data in every DNA sequence, there is a great need for a computationally fast algorithm that accurately identifies CNVs. In this paper, we formulate the detection of CNVs as a constrained least squares problem and show that circular binary segmentation is a greedy approach to solving this problem. To solve this problem with high accuracy and efficiency, we first derive a necessary optimality condition for its solution based on the alternating minimization technique and then develop a computationally efficient algorithm named AMIAS. The performance of our method was tested on both simulated data and two real-world applications using genomic data from diagnosed primary glioblastoma and the HapMap project. Our proposed method has competitive performance in identifying CNVs with high-throughput genotypic data.
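The constrained least squares formulation can be made concrete: fit a piecewise-constant mean to the sequence subject to a bound on the number of mean shifts. The small dynamic program below solves that l0-constrained problem exactly on a toy signal; circular binary segmentation solves it greedily, and the paper's AMIAS algorithm (not reproduced here) solves it efficiently at chromosome scale. This DP is O(K n^2), so it is for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)

# Best piecewise-constant least squares fit of y with at most K segments,
# by dynamic programming over segment boundaries.
def segment(y, K):
    n = len(y)
    c = np.cumsum(np.r_[0.0, y])            # prefix sums for O(1) costs
    c2 = np.cumsum(np.r_[0.0, y ** 2])
    def sse(i, j):                          # SSE of fitting y[i:j] by its mean
        s, s2, m = c[j] - c[i], c2[j] - c2[i], j - i
        return s2 - s * s / m
    D = np.full((K + 1, n + 1), np.inf)     # D[k, j]: best cost of y[:j], k segs
    D[0, 0] = 0.0
    back = np.zeros((K + 1, n + 1), dtype=int)
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            costs = [D[k - 1, i] + sse(i, j) for i in range(k - 1, j)]
            i_best = int(np.argmin(costs)) + k - 1
            D[k, j], back[k, j] = costs[i_best - (k - 1)], i_best
    bps, j = [], n                          # recover segment start points
    for k in range(K, 0, -1):
        j = back[k, j]
        bps.append(j)
    return sorted(bps)[1:]                  # drop the leading 0

# Toy 'copy number' signal: a deletion, then an amplification.
y = np.r_[np.zeros(40), -1.5 + np.zeros(20), 1.0 + np.zeros(40)]
y += 0.2 * rng.standard_normal(len(y))
print(segment(y, 3))   # breakpoints near 40 and 60
```

The constraint (at most K segments) plays the role of the sparsity constraint in the paper's least squares formulation.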
Xiaopu Wang, Xueqin Wang, Aijun Zhang, Canhong Wen. Copy number variation detection based on constraint least squares. Statistics and Its Interface, published 2023-11-27. DOI: 10.4310/23-sii814.
The ICH E9(R1) guidance recommended a framework to align the planning, design, conduct, analysis, and interpretation of any clinical trial with its objective and estimand. How to handle intercurrent events (ICEs) is one of the five attributes of an estimand, and sample size calculation is a key step in trial planning and design. Therefore, sample size calculation should be aligned with the estimand and, in particular, with how the ICEs are handled. ICH E9(R1) summarized five strategies for handling ICEs, and five approaches have been proposed in the literature for sample size calculation when planning trials with quantitative and binary outcomes. In this paper, we discuss how to apply the five strategies to deal with ICEs in clinical trials with time-to-event outcomes and propose five approaches for sample size calculation that are aligned with the five strategies, respectively.
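For time-to-event outcomes, the usual starting point for such calculations is the standard Schoenfeld formula for the required number of events in a two-arm log-rank comparison. The sketch below implements only that textbook formula; the abstract's point is that the chosen ICE-handling strategy changes the inputs (the target hazard ratio and what counts as an event), not that this formula is one of the paper's five approaches.

```python
from math import ceil, log
from statistics import NormalDist

# Schoenfeld's formula: events needed to detect a hazard ratio `hr` with a
# two-sided level-`alpha` log-rank test at the given power, with a fraction
# `alloc` of subjects randomized to the treatment arm.
def schoenfeld_events(hr, alpha=0.05, power=0.80, alloc=0.5):
    z = NormalDist().inv_cdf
    za, zb = z(1 - alpha / 2), z(power)
    return ceil((za + zb) ** 2 / (alloc * (1 - alloc) * log(hr) ** 2))

print(schoenfeld_events(0.7))   # events needed to detect HR 0.7
```

A strategy that attenuates the target hazard ratio toward 1 (for example, a treatment-policy analysis that keeps post-ICE data) drives the required number of events up sharply, since the count scales as 1 / (log hr)^2.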
Yixin Fang, Man Jin, Chengqing Wu. Aligning sample size calculations with estimands in clinical trials with time-to-event outcomes. Statistics and Its Interface, published 2023-11-27. DOI: 10.4310/23-sii804.