We propose a semiparametric method for estimating average treatment effects in observational studies under the assumption of unconfoundedness. Both the propensity score model and the outcome model are assumed to be general single-index models, whose link functions are estimated by the kernel method and whose unknown index parameters are estimated via the linearized maximum rank correlation method. The proposed estimator is computationally tractable, accommodates high-dimensional covariates, and does not require approximating the link functions. We show that the proposed estimator is consistent and asymptotically normal. In general, it outperforms existing methods when the model is incorrectly specified. We also provide an empirical analysis of the average treatment effect, and the average treatment effect on the treated, of 401(k) eligibility on net financial assets.
Jun Wang and Yujiao Guo, "Semiparametric estimation of average treatment effects in observational studies," Statistical Analysis and Data Mining, doi:10.1002/sam.11688 (published 2024-05-18).
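The paper's estimator rests on kernel-estimated single-index models; as background on the inverse-probability-weighting idea it builds on, here is a minimal sketch that substitutes a plain logistic propensity model for the authors' semiparametric one (all names and the setup are illustrative, not the paper's method):

```python
import numpy as np

def ipw_ate(y, t, x):
    """Inverse-probability-weighted ATE estimate.

    Fits a simple logistic propensity model by Newton-Raphson (a stand-in
    for the paper's single-index kernel estimator), then applies the
    Horvitz-Thompson-style IPW formula for the average treatment effect.
    """
    X = np.column_stack([np.ones(len(y)), x])
    beta = np.zeros(X.shape[1])
    for _ in range(50):  # Newton-Raphson iterations for logistic regression
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (t - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    e = 1.0 / (1.0 + np.exp(-X @ beta))  # estimated propensity scores
    return np.mean(t * y / e - (1 - t) * y / (1 - e))
```

On simulated data with a known effect, the estimate recovers the true ATE up to sampling noise; the paper's estimator targets the same quantity without specifying the link function.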
The setting of priors is an important issue in Bayesian analysis. In particular, when external information is incorporated, a prior carrying too much information can dominate the posterior inference. To guard against this effect, the effective sample size (ESS) can be used. Various ESSs have been proposed recently; however, each limits the class of applicable prior distributions. For example, one ESS can only be used with a prior that can be approximated by a normal distribution, and another cannot be applied when the parameters are multidimensional. We propose an ESS that applies to a wider range of prior distributions when the sampling model belongs to an exponential family (including the normal model and logistic regression models). This ESS has predictive consistency and can be used with multidimensional parameters. Using normally distributed data with Student's t priors, we confirm that it behaves as well as an existing predictively consistent ESS for one-parameter exponential families. As examples with multivariate parameters, ESSs for linear and logistic regression models are also discussed.
Ryota Tamanoi, "Prior effective sample size for exponential family distributions with multiple parameters," Statistical Analysis and Data Mining, doi:10.1002/sam.11685 (published 2024-05-09).
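For intuition about what an ESS measures, two textbook conjugate cases (which the paper generalizes to multiparameter exponential families) can be computed in closed form; this sketch shows those standard values, not the paper's new ESS:

```python
def beta_prior_ess(a, b):
    """ESS of a Beta(a, b) prior for a binomial likelihood.

    The prior acts like a + b pseudo-observations (a successes,
    b failures) added to the data.
    """
    return a + b

def normal_prior_ess(sigma2, tau2):
    """ESS of a N(mu0, tau2) prior for a normal likelihood with known
    variance sigma2: the prior carries the same Fisher information as
    sigma2 / tau2 observations.
    """
    return sigma2 / tau2
```

A Beta(2, 3) prior thus contributes as much as 5 observations; a diffuse normal prior (large tau2) contributes almost none, which is why weak priors do not dominate the posterior.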
Vanessa López‐Marrero, Patrick R. Johnstone, Gilchan Park, Xihaier Luo
One among several advantages of measure transport methods is that they allow for a unified framework for processing and analyzing data distributed according to a wide class of probability measures. Within this context, we present results from computational studies assessing the potential of measure transport techniques, specifically triangular transport maps, as part of a workflow intended to support research in the biological sciences. Scenarios characterized by limited amounts of sample data, which are common in domains such as radiation biology, are of particular interest. We find that when estimating a density function from limited sample data, adaptive transport maps are advantageous. In particular, statistics gathered from computing a series of adaptive transport maps, each trained on a randomly chosen subset of the available data samples, uncover information hidden in the data. As a result, in the radiation biology application considered here, this approach provides a tool for generating hypotheses about gene relationships and their dynamics under radiation exposure.
Vanessa López‐Marrero, Patrick R. Johnstone, Gilchan Park, and Xihaier Luo, "Density estimation via measure transport: Outlook for applications in the biological sciences," Statistical Analysis and Data Mining, doi:10.1002/sam.11687 (published 2024-05-04).
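The change-of-variables idea behind transport-map density estimation, and the subset-resampling strategy described above, can be illustrated in one dimension with the simplest possible (affine) triangular map; this is a toy stand-in for the adaptive maps the paper actually uses, and all function names are illustrative:

```python
import numpy as np

def affine_map_density(samples, x_grid):
    """Density estimate via the simplest 1D triangular transport map.

    Fit an affine map T(x) = (x - mu) / sigma pushing the data toward
    the standard normal reference, then recover the density by the
    change of variables p(x) = phi(T(x)) * |T'(x)|.
    """
    mu, sigma = samples.mean(), samples.std()
    z = (x_grid - mu) / sigma                      # T(x)
    phi = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)  # reference density
    return phi / sigma                              # phi(T(x)) * |T'(x)|

def subset_ensemble_density(samples, x_grid, n_maps=50, frac=0.5, seed=0):
    """Average density estimates from maps trained on random subsets,
    mirroring the resampling strategy used for small-sample settings."""
    rng = np.random.default_rng(seed)
    m = max(2, int(frac * len(samples)))
    ests = [affine_map_density(rng.choice(samples, m, replace=False), x_grid)
            for _ in range(n_maps)]
    return np.mean(ests, axis=0)
```

The spread across the ensemble of subset-trained maps is what provides the "statistics gathered from computing series of adaptive transport maps" that the abstract refers to.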
Medical image data have emerged as an indispensable component of modern medicine. Unlike many general image problems that focus on outcome prediction or image recognition, medical image analysis pays more attention to model interpretation. For instance, given a list of medical images and corresponding labels of patients' health status, it is often more important to identify the image regions that differentiate the outcome status than simply to predict labels of new images. Moreover, medical image data often exhibit strong individual heterogeneity: the image regions associated with an outcome can differ across patients. As a consequence, the traditional one-model-fits-all approach not only ignores patient heterogeneity but can also lead to misleading or even wrong conclusions. In this article, we introduce a novel statistical framework to detect individualized regions that are associated with a binary outcome, that is, whether a patient has a certain disease or not. Moreover, we propose a total variation-based penalization for individualized image region detection under a local label-free scenario. Because local labels are often difficult to obtain for medical image data, our approach may have a wide range of applications in medical research. The effectiveness of the proposed approach is validated on two real histopathology databases: Colon Cancer and Camelyon16.
Sanyou Wu, Fuying Wang, and Long Feng, "Individualized image region detection with total variation," Statistical Analysis and Data Mining, doi:10.1002/sam.11684 (published 2024-05-01).
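The total variation penalty at the heart of the framework has a standard discrete form; the sketch below shows the common anisotropic version for a 2D coefficient image (the paper's exact penalty and optimization may differ):

```python
import numpy as np

def total_variation(beta):
    """Anisotropic total variation of a 2D coefficient image: the sum of
    absolute differences between horizontally and vertically adjacent
    pixels. Penalizing this quantity encourages piecewise-constant
    coefficient maps, i.e., spatially contiguous detected regions."""
    dh = np.abs(np.diff(beta, axis=1)).sum()  # horizontal neighbor differences
    dv = np.abs(np.diff(beta, axis=0)).sum()  # vertical neighbor differences
    return dh + dv
```

A coefficient image with one clean region boundary has small TV, while a noisy, scattered map has large TV, which is why the penalty steers detection toward coherent regions.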
Association rules are used to extract information from transactional databases with a collection of items, also called "tokens" or "words." The aim of association rule analysis is to indicate which items go with which other items in a set of transactions called "documents." This approach is used in the analysis of text records, of blogs in social media, and of shopping baskets. We present here an approach to analyzing documents using latent class analysis (LCA) clustering of document term matrices. A document term matrix (DTM) consists of rows referring to documents and columns corresponding to items. With binary weights, "1" indicates the presence of a term in a document and "0" its absence. The clustering of similar documents provides stratified data sets that enhance the interpretability of measures of interest such as lift, odds ratios, and relative linkage disequilibrium. The article demonstrates the approach with two case studies. The first consists of comments recorded in a survey of pet owners. The second, much larger example is based on online reviews of Crocs sandals. Association rules describe combinations of terms in the pet survey and the Crocs reviews. We first introduce the case studies to motivate the methods proposed here. In Section 3, we compute, for these case studies, the association rule measures of interest defined in Section 2. In Section 4, we provide a new approach with enhanced interpretation of measures such as lift by comparing them across clusters derived from an LCA of the DTM. A key result is the use of clustered data in analyzing observational data, which enhances the generalizability and interpretability of findings from text analytics. The article concludes with a discussion in Section 5.
Ron S. Kenett and Chris Gotwalt, "The analysis of association rules: Latent class analysis," Statistical Analysis and Data Mining, doi:10.1002/sam.11686 (published 2024-05-01).
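The lift measure discussed above, and its comparison across clusters, can be computed directly from a binary DTM; this sketch uses the standard definition of lift (the cluster labels here are an illustrative stand-in for LCA assignments):

```python
import numpy as np

def lift(dtm, i, j):
    """Lift of the rule item_i -> item_j from a binary document-term
    matrix: P(i and j) / (P(i) * P(j)). Lift > 1 means the two terms
    co-occur more often than expected under independence."""
    both = np.mean(dtm[:, i] * dtm[:, j])
    return both / (dtm[:, i].mean() * dtm[:, j].mean())

def lift_by_cluster(dtm, labels, i, j):
    """Recompute lift within each document cluster, mirroring the idea
    of interpreting rule measures across LCA-derived strata."""
    return {c: lift(dtm[labels == c], i, j) for c in np.unique(labels)}
```

A rule can show modest lift overall yet very different lift within clusters, which is exactly the kind of stratified interpretation the article advocates.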
Tian Yu-Zhu, Wu Chun-Ho, Tai Ling-Nan, Mian Zhi-Bao, Tian Mao-Zai
Ordinal data frequently occur in fields such as knowledge level assessment, credit rating, clinical disease diagnosis, and psychological evaluation. Classic models, including cumulative logistic or probit regression, are often used for such ordinal data. But these approaches model the conditional mean of the response variable given a set of predictors, which often yields non-robust estimates. As an attractive alternative, the composite quantile regression (CQR) approach is usually employed to obtain more robust and relatively efficient results. In this paper, we propose a Bayesian CQR modeling approach for the ordinal latent regression model. To overcome the identifiability problem of the considered model and obtain more robust estimates, we advocate using the Bayesian relative CQR approach to estimate the regression parameters. Additionally, in regression modeling, it is highly desirable to obtain a parsimonious model that retains only important covariates. We incorporate the Bayesian
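As background on the building block of CQR (not the paper's Bayesian ordinal formulation), the composite objective sums quantile check losses at several levels, sharing one slope vector across levels while allowing level-specific intercepts; a minimal sketch:

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def cqr_loss(y, x, beta, b, taus):
    """Composite quantile regression objective: check losses at several
    quantile levels tau_k, with a common slope beta and a separate
    intercept b_k for each level. Minimizing this (rather than squared
    error) is what gives CQR its robustness to non-normal errors."""
    return sum(check_loss(y - bk - x @ beta, tau).sum()
               for tau, bk in zip(taus, b))
```

In the Bayesian treatment sketched by the paper, this loss is turned into a working likelihood (typically via asymmetric Laplace distributions) so that posterior sampling can proceed.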