We propose a semiparametric method for estimating average treatment effects in observational studies under the assumption of unconfoundedness. Both the propensity score model and the outcome model are assumed to be general single-index models, whose link functions are estimated by the kernel method and whose unknown index parameters are estimated via the linearized maximum rank correlation method. The proposed estimator is computationally tractable, accommodates high-dimensional covariates, and does not require approximating the link functions. We show that the proposed estimator is consistent and asymptotically normal. In general, it outperforms existing methods when the model is incorrectly specified. We also provide an empirical analysis of the average treatment effect, and the average treatment effect on the treated, of 401(k) eligibility on net financial assets.
Jun Wang and Yujiao Guo, "Semiparametric estimation of average treatment effects in observational studies," Statistical Analysis and Data Mining, doi:10.1002/sam.11688 (published 2024-05-18).
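The paper's estimator rests on kernel-estimated single-index models; as background on the inverse-probability-weighting idea it builds on, here is a minimal sketch that substitutes a plain logistic propensity model for the authors' semiparametric one (all names and the setup are illustrative, not the paper's method):

```python
import numpy as np

def ipw_ate(y, t, x):
    """Inverse-probability-weighted ATE estimate.

    Fits a simple logistic propensity model by Newton-Raphson (a stand-in
    for the paper's single-index kernel estimator), then applies the
    Horvitz-Thompson-style IPW formula for the average treatment effect.
    """
    X = np.column_stack([np.ones(len(y)), x])
    beta = np.zeros(X.shape[1])
    for _ in range(50):  # Newton-Raphson iterations for logistic regression
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (t - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    e = 1.0 / (1.0 + np.exp(-X @ beta))  # estimated propensity scores
    return np.mean(t * y / e - (1 - t) * y / (1 - e))
```

On simulated data with a known effect, the estimate recovers the true ATE up to sampling noise; the paper's estimator targets the same quantity without specifying the link function.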
The setting of priors is an important issue in Bayesian analysis. In particular, when external information is incorporated, a prior carrying too much information can dominate the posterior inference. To guard against this effect, the effective sample size (ESS) can be used. Various ESSs have been proposed recently; however, each limits the class of applicable prior distributions. For example, one ESS can only be used with a prior that can be approximated by a normal distribution, and another cannot be applied when the parameters are multidimensional. We propose an ESS that applies to a wider range of prior distributions when the sampling model belongs to an exponential family (including the normal model and logistic regression models). This ESS has predictive consistency and can be used with multidimensional parameters. Using normally distributed data with Student's t priors, we confirm that it behaves as well as an existing predictively consistent ESS for one-parameter exponential families. As examples with multivariate parameters, ESSs for linear and logistic regression models are also discussed.
Ryota Tamanoi, "Prior effective sample size for exponential family distributions with multiple parameters," Statistical Analysis and Data Mining, doi:10.1002/sam.11685 (published 2024-05-09).
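For intuition about what an ESS measures, two textbook conjugate cases (which the paper generalizes to multiparameter exponential families) can be computed in closed form; this sketch shows those standard values, not the paper's new ESS:

```python
def beta_prior_ess(a, b):
    """ESS of a Beta(a, b) prior for a binomial likelihood.

    The prior acts like a + b pseudo-observations (a successes,
    b failures) added to the data.
    """
    return a + b

def normal_prior_ess(sigma2, tau2):
    """ESS of a N(mu0, tau2) prior for a normal likelihood with known
    variance sigma2: the prior carries the same Fisher information as
    sigma2 / tau2 observations.
    """
    return sigma2 / tau2
```

A Beta(2, 3) prior thus contributes as much as 5 observations; a diffuse normal prior (large tau2) contributes almost none, which is why weak priors do not dominate the posterior.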
Vanessa López‐Marrero, Patrick R. Johnstone, Gilchan Park, Xihaier Luo
One among several advantages of measure transport methods is that they allow for a unified framework for processing and analyzing data distributed according to a wide class of probability measures. Within this context, we present results from computational studies assessing the potential of measure transport techniques, specifically triangular transport maps, as part of a workflow intended to support research in the biological sciences. Scenarios characterized by limited amounts of sample data, which are common in domains such as radiation biology, are of particular interest. We find that when estimating a density function from limited sample data, adaptive transport maps are advantageous. In particular, statistics gathered from computing a series of adaptive transport maps, each trained on a randomly chosen subset of the available data samples, uncover information hidden in the data. As a result, in the radiation biology application considered here, this approach provides a tool for generating hypotheses about gene relationships and their dynamics under radiation exposure.
Vanessa López‐Marrero, Patrick R. Johnstone, Gilchan Park, and Xihaier Luo, "Density estimation via measure transport: Outlook for applications in the biological sciences," Statistical Analysis and Data Mining, doi:10.1002/sam.11687 (published 2024-05-04).
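The change-of-variables idea behind transport-map density estimation, and the subset-resampling strategy described above, can be illustrated in one dimension with the simplest possible (affine) triangular map; this is a toy stand-in for the adaptive maps the paper actually uses, and all function names are illustrative:

```python
import numpy as np

def affine_map_density(samples, x_grid):
    """Density estimate via the simplest 1D triangular transport map.

    Fit an affine map T(x) = (x - mu) / sigma pushing the data toward
    the standard normal reference, then recover the density by the
    change of variables p(x) = phi(T(x)) * |T'(x)|.
    """
    mu, sigma = samples.mean(), samples.std()
    z = (x_grid - mu) / sigma                      # T(x)
    phi = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)  # reference density
    return phi / sigma                              # phi(T(x)) * |T'(x)|

def subset_ensemble_density(samples, x_grid, n_maps=50, frac=0.5, seed=0):
    """Average density estimates from maps trained on random subsets,
    mirroring the resampling strategy used for small-sample settings."""
    rng = np.random.default_rng(seed)
    m = max(2, int(frac * len(samples)))
    ests = [affine_map_density(rng.choice(samples, m, replace=False), x_grid)
            for _ in range(n_maps)]
    return np.mean(ests, axis=0)
```

The spread across the ensemble of subset-trained maps is what provides the "statistics gathered from computing series of adaptive transport maps" that the abstract refers to.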
Medical image data have emerged as an indispensable component of modern medicine. Unlike many general image problems that focus on outcome prediction or image recognition, medical image analysis pays more attention to model interpretation. For instance, given a list of medical images and corresponding labels of patients' health status, it is often more important to identify the image regions that differentiate the outcome status than simply to predict labels of new images. Moreover, medical image data often exhibit strong individual heterogeneity: the image regions associated with an outcome can differ across patients. As a consequence, the traditional one-model-fits-all approach not only ignores patient heterogeneity but can also lead to misleading or even wrong conclusions. In this article, we introduce a novel statistical framework to detect individualized regions that are associated with a binary outcome, that is, whether a patient has a certain disease or not. Moreover, we propose a total variation-based penalization for individualized image region detection under a local label-free scenario. Because local labels are often difficult to obtain for medical image data, our approach may have a wide range of applications in medical research. The effectiveness of the proposed approach is validated on two real histopathology databases: Colon Cancer and Camelyon16.
Sanyou Wu, Fuying Wang, and Long Feng, "Individualized image region detection with total variation," Statistical Analysis and Data Mining, doi:10.1002/sam.11684 (published 2024-05-01).
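The total variation penalty at the heart of the framework has a standard discrete form; the sketch below shows the common anisotropic version for a 2D coefficient image (the paper's exact penalty and optimization may differ):

```python
import numpy as np

def total_variation(beta):
    """Anisotropic total variation of a 2D coefficient image: the sum of
    absolute differences between horizontally and vertically adjacent
    pixels. Penalizing this quantity encourages piecewise-constant
    coefficient maps, i.e., spatially contiguous detected regions."""
    dh = np.abs(np.diff(beta, axis=1)).sum()  # horizontal neighbor differences
    dv = np.abs(np.diff(beta, axis=0)).sum()  # vertical neighbor differences
    return dh + dv
```

A coefficient image with one clean region boundary has small TV, while a noisy, scattered map has large TV, which is why the penalty steers detection toward coherent regions.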
Association rules are used to extract information from transactional databases with a collection of items, also called "tokens" or "words." The aim of association rule analysis is to indicate which items go with which other items in a set of transactions called "documents." This approach is used in the analysis of text records, of blogs in social media, and of shopping baskets. We present here an approach to analyzing documents using latent class analysis (LCA) clustering of document term matrices. A document term matrix (DTM) consists of rows referring to documents and columns corresponding to items. With binary weights, "1" indicates the presence of a term in a document and "0" its absence. The clustering of similar documents provides stratified data sets that enhance the interpretability of measures of interest such as lift, odds ratios, and relative linkage disequilibrium. The article demonstrates the approach with two case studies. The first consists of comments recorded in a survey of pet owners. The second, much larger example is based on online reviews of Crocs sandals. Association rules describe combinations of terms in the pet survey and the Crocs reviews. We first introduce the case studies to motivate the methods proposed here. In Section 3, we compute, for these case studies, the association rule measures of interest defined in Section 2. In Section 4, we provide a new approach with enhanced interpretation of measures such as lift by comparing them across clusters derived from an LCA of the DTM. A key result is the use of clustered data in analyzing observational data, which enhances the generalizability and interpretability of findings from text analytics. The article concludes with a discussion in Section 5.
Ron S. Kenett and Chris Gotwalt, "The analysis of association rules: Latent class analysis," Statistical Analysis and Data Mining, doi:10.1002/sam.11686 (published 2024-05-01).
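The lift measure discussed above, and its comparison across clusters, can be computed directly from a binary DTM; this sketch uses the standard definition of lift (the cluster labels here are an illustrative stand-in for LCA assignments):

```python
import numpy as np

def lift(dtm, i, j):
    """Lift of the rule item_i -> item_j from a binary document-term
    matrix: P(i and j) / (P(i) * P(j)). Lift > 1 means the two terms
    co-occur more often than expected under independence."""
    both = np.mean(dtm[:, i] * dtm[:, j])
    return both / (dtm[:, i].mean() * dtm[:, j].mean())

def lift_by_cluster(dtm, labels, i, j):
    """Recompute lift within each document cluster, mirroring the idea
    of interpreting rule measures across LCA-derived strata."""
    return {c: lift(dtm[labels == c], i, j) for c in np.unique(labels)}
```

A rule can show modest lift overall yet very different lift within clusters, which is exactly the kind of stratified interpretation the article advocates.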
Tian Yu-Zhu, Wu Chun-Ho, Tai Ling-Nan, Mian Zhi-Bao, Tian Mao-Zai
Ordinal data frequently occur in fields such as knowledge level assessment, credit rating, clinical disease diagnosis, and psychological evaluation. Classic models, including cumulative logistic or probit regression, are often used for such ordinal data. But these approaches model the conditional mean of the response variable given a set of predictors, which often yields non-robust estimates. As an attractive alternative, the composite quantile regression (CQR) approach is usually employed to obtain more robust and relatively efficient results. In this paper, we propose a Bayesian CQR modeling approach for the ordinal latent regression model. To overcome the identifiability problem of the considered model and obtain more robust estimates, we advocate using the Bayesian relative CQR approach to estimate the regression parameters. Additionally, in regression modeling, it is highly desirable to obtain a parsimonious model that retains only important covariates. We incorporate the Bayesian
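As background on the building block of CQR (not the paper's Bayesian ordinal formulation), the composite objective sums quantile check losses at several levels, sharing one slope vector across levels while allowing level-specific intercepts; a minimal sketch:

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def cqr_loss(y, x, beta, b, taus):
    """Composite quantile regression objective: check losses at several
    quantile levels tau_k, with a common slope beta and a separate
    intercept b_k for each level. Minimizing this (rather than squared
    error) is what gives CQR its robustness to non-normal errors."""
    return sum(check_loss(y - bk - x @ beta, tau).sum()
               for tau, bk in zip(taus, b))
```

In the Bayesian treatment sketched by the paper, this loss is turned into a working likelihood (typically via asymmetric Laplace distributions) so that posterior sampling can proceed.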