
Latest articles in Statistical Analysis and Data Mining: The ASA Data Science Journal

Optimal ratio for data splitting
Pub Date : 2022-02-07 DOI: 10.1002/sam.11583
V. R. Joseph
It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article, we show that the optimal training/testing splitting ratio is $\sqrt{p}:1$, where $p$ is the number of parameters in a linear regression model that explains the data well.
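As a quick illustration of how the $\sqrt{p}:1$ rule translates into concrete split sizes, here is a minimal sketch (the function name and rounding convention are illustrative, not from the paper):

```python
import math

def split_sizes(n, p):
    """Training/testing set sizes under the sqrt(p):1 splitting rule.

    n: total number of observations.
    p: number of parameters in a linear regression model that
       explains the data well.
    """
    r = math.sqrt(p)                    # training : testing = sqrt(p) : 1
    n_train = round(n * r / (r + 1))    # training fraction is r / (r + 1)
    return n_train, n - n_train

# With p = 4 parameters the rule gives a 2:1 split,
# i.e., two thirds of the data for training.
```

Note that the rule allocates relatively more data to training as the model grows: for p = 9 it prescribes a 3:1 split.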
Citations: 129
Coupled support tensor machine classification for multimodal neuroimaging data
Pub Date : 2022-01-19 DOI: 10.1002/sam.11587
L. Peide, Seyyid Emre Sofuoglu, T. Maiti, Selin Aviyente
Multimodal data arise in various applications where information about the same phenomenon is acquired from multiple sensors and across different imaging modalities. Learning from multimodal data is of great interest in machine learning and statistics research as this offers the possibility of capturing complementary information among modalities. Multimodal modeling helps to explain the interdependence between heterogeneous data sources, discovers new insights that may not be available from a single modality, and improves decision‐making. Recently, coupled matrix–tensor factorization has been introduced for multimodal data fusion to jointly estimate latent factors and identify complex interdependence among the latent factors. However, most of the prior work on coupled matrix–tensor factors focuses on unsupervised learning and there is little work on supervised learning using the jointly estimated latent factors. This paper considers the multimodal tensor data classification problem. A coupled support tensor machine (C‐STM) built upon the latent factors jointly estimated from the advanced coupled matrix–tensor factorization is proposed. C‐STM combines individual and shared latent factors with multiple kernels and estimates a maximal‐margin classifier for coupled matrix–tensor data. The classification risk of C‐STM is shown to converge to the optimal Bayes risk, making it a statistically consistent rule. C‐STM is validated through simulation studies as well as a simultaneous analysis on electroencephalography with functional magnetic resonance imaging data. The empirical evidence shows that C‐STM can utilize information from multiple sources and provide a better classification performance than traditional single‐mode classifiers.
Citations: 4
Development and validation of models for two‐week mortality of inpatients with COVID‐19 infection: A large prospective cohort study
Pub Date : 2022-01-11 DOI: 10.1002/sam.11572
M. Fathi, N. M. Moghaddam, L. Kheyrati
Recognizing COVID‐19 patients at a greater risk of mortality assists medical staff to identify who benefits from more intensive care. We developed and validated prediction models for two‐week mortality of inpatients with COVID‐19 infection based on clinical predictors. A prospective cohort study was started in February 2020 and is still continuing. In total, 57,705 inpatients with both a positive reverse transcription‐polymerase chain reaction test and positive chest CT findings for COVID‐19 were included. The outcome was mortality within 2 weeks of admission. Three prognostic models were developed for young, adult, and senior patients. Data from the capital province (Tehran) of Iran were used for validation, and data from all other provinces were used for development of the models. The model Young was well‐fitted to the data (p < 0.001, Nagelkerke R2 = 0.697, C‐statistic = 0.88), and the models Adult (p < 0.001, Nagelkerke R2 = 0.340, C‐statistic = 0.70) and Senior (p < 0.001, Nagelkerke R2 = 0.208, C‐statistic = 0.68) were also significant. Intubation, oxygen saturation < 93%, impaired consciousness, acute respiratory distress syndrome, and cancer treatment were major risk factors. Elderly people were at greater risk of mortality. Young patients with a history of hypertension, vomiting, and fever, and adults with diabetes mellitus and cardiovascular disease, had higher mortality risk. Young people with myalgia, and adult patients with nausea, anorexia, and headache, showed less risk of mortality than others.
Citations: 1
Weighted AutoEncoding recommender system
Pub Date : 2022-01-07 DOI: 10.1002/sam.11571
Shuying Zhu, Weining Shen, Annie Qu
Recommender systems are information filtering tools that seek to match customers with products or services of interest. Most of the prevalent collaborative filtering recommender systems, such as matrix factorization and AutoRec, suffer from the “cold‐start” problem, where they fail to provide meaningful recommendations for new users or new items due to information missing from the training data. To address this problem, we propose a weighted AutoEncoding model to leverage information from other users or items that share similar characteristics. The proposed method provides an effective strategy for borrowing strength from user or item‐specific clustering structure as well as pairwise similarity in the training data, while achieving high computational efficiency and dimension reduction, and preserving nonlinear relationships between user preferences and item features. Simulation studies and applications to three real datasets show advantages in prediction accuracy of the proposed model compared to current state‐of‐the‐art approaches.
Citations: 1
Portability analysis of data mining models for fog events forecasting
Pub Date : 2021-12-30 DOI: 10.1002/sam.11568
G. Zazzaro
This article describes an analytical method for comparing geographical sites and transferring fog forecasting models, trained by data mining techniques on a fixed site, across Italian airports. This portability method uses a specific intersite similarity measure based on the Euclidean distance between the performance vectors associated with each airport site. Performance vectors are useful for characterizing geographical sites. The components of a performance vector are the performance metrics of an ensemble descriptive model. In the tests carried out, the comparison method provided very promising results, and the forecast model, when applied and evaluated on a new compatible site, shows only a small decrease in performance. The portability schema provides a meta‐learning methodology for applying predictive models to new sites where a new model cannot be trained from scratch owing to the class imbalance problem or the lack of data for a specific learning task. The methodology offers a measure for clustering geographical sites and extending weather knowledge from one site to another.
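The intersite similarity measure can be illustrated with a toy sketch. The site names, metric layout, and numbers below are hypothetical; the real performance vectors come from evaluating the same ensemble descriptive model at each airport:

```python
import math

# Hypothetical performance vectors for three airport sites: each component
# is one performance metric of the same ensemble descriptive model
# evaluated at that site (e.g., accuracy, hit rate, false-alarm rate).
site_perf = {
    "site_A": [0.91, 0.80, 0.75],
    "site_B": [0.89, 0.78, 0.74],
    "site_C": [0.60, 0.45, 0.50],
}

def intersite_distance(a, b):
    """Euclidean distance between two sites' performance vectors.

    A smaller distance marks more similar sites, so a fog model trained
    on one site is expected to transfer better to the other.
    """
    va, vb = site_perf[a], site_perf[b]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
```

Under these made-up numbers, a model trained at site_A would be ported to site_B (small distance) rather than to site_C.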
Citations: 0
Handwriting identification using random forests and score‐based likelihood ratios
Pub Date : 2021-12-03 DOI: 10.1002/sam.11566
M. Q. Johnson, Danica M. Ommen
Handwriting analysis is conducted by forensic document examiners who are able to visually recognize characteristics of writing to evaluate the evidence of writership. Recently, there have been incentives to investigate how to quantify the similarity between two written documents to support the conclusions drawn by experts. We use an automatic algorithm within the “handwriter” package in R, to decompose a handwritten sample into small graphical units of writing. These graphs are sorted into 40 exemplar groups or clusters. We hypothesize that the frequency with which a person contributes graphs to each cluster is characteristic of their handwriting. Given two questioned handwritten documents, we can then use the vectors of cluster frequencies to quantify the similarity between the two documents. We extract features from the difference between the vectors and combine them using a random forest. The output from the random forest is used as the similarity score to compare documents. We estimate the distributions of the similarity scores computed from multiple pairs of documents known to have been written by the same and by different persons, and use these estimated densities to obtain score‐based likelihood ratios (SLRs) that rely on different assumptions. We find that the SLRs are able to indicate whether the similarity observed between two documents is more or less likely depending on writership.
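A stylized sketch of the cluster-frequency representation described above. The toy cluster assignments are invented; in the real pipeline they come from the handwriter decomposition in R, and the difference features are scored by a random forest:

```python
from collections import Counter

N_CLUSTERS = 40  # exemplar clusters from the handwriter decomposition

def cluster_frequencies(labels):
    """Normalized counts of a document's graphs over the 40 clusters."""
    counts = Counter(labels)
    n = len(labels)
    return [counts.get(k, 0) / n for k in range(N_CLUSTERS)]

# Toy documents: each is the list of cluster assignments of its
# extracted graphical units of writing.
doc1 = cluster_frequencies([0, 0, 3, 7, 7, 7, 12])
doc2 = cluster_frequencies([0, 3, 3, 7, 7, 12, 12])

# Feature vector derived from the difference between the two documents'
# cluster-frequency vectors (here, elementwise absolute difference).
features = [abs(f1 - f2) for f1, f2 in zip(doc1, doc2)]
```

Each document is thus reduced to a point in a 40-dimensional frequency simplex, and comparisons between documents reduce to comparisons between these vectors.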
Citations: 4
Efficient importance sampling imputation algorithms for quantile and composite quantile regression
Pub Date : 2021-11-29 DOI: 10.1002/sam.11565
Haoyang Cheng
Missing data in regression models is one of the most widely studied problems. In this paper, we propose a class of efficient importance sampling imputation algorithms (EIS) for quantile and composite quantile regression with missing covariates. They are an EIS in quantile regression (EISQ) and its three extensions in composite quantile regression (EISCQ). Our EISQ uses an interior point (IP) approach, while EISCQ algorithms use IP and two other well‐known approaches: majorize‐minimization (MM) and coordinate descent (CD). The aims of our proposed EIS algorithms are to decrease estimated variances and relieve the computational burden at the same time, improving the performance of coefficient estimators in both estimation and computational efficiency. To compare our EIS algorithms with existing competitors, including complete‐case analysis and multiple imputation, the paper carries out a series of simulation studies with different sample sizes and different missing rates under different missing‐mechanism models. Finally, we apply all the algorithms to part of the examination data in the National Health and Nutrition Examination Survey.
Citations: 0
Neural‐network transformation models for counting processes
Pub Date : 2021-11-26 DOI: 10.1002/sam.11564
Rongzi Liu, Chenxi Li, Qing Lu
While many survival models have been invented, the Cox model and the proportional odds model are among the most popular ones. Both models are special cases of the linear transformation model. The linear transformation model typically assumes a linear function on covariates, which may not reflect the complex relationship between covariates and survival outcomes. A nonlinear functional form can also be specified in the linear transformation model. Nonetheless, the underlying functional form is unknown, and mis‐specifying it leads to biased estimates and reduced prediction accuracy of the model. To address this issue, we develop a neural‐network transformation model. Similar to neural networks, the neural‐network transformation model uses its hierarchical structure to learn complex features from simpler ones and is capable of approximating the underlying functional form of covariates. It also inherits advantages from the linear transformation model, making it applicable to both time‐to‐event analyses and recurrent event analyses. Simulations demonstrate that the neural‐network transformation model outperforms the linear transformation model in terms of estimation and prediction accuracy when the covariate effects are nonlinear. The advantage of the new model over the linear transformation model is also illustrated via two real applications.
Citations: 0
Bag of little bootstraps for massive and distributed longitudinal data
Pub Date : 2021-11-22 DOI: 10.1002/sam.11563
Xinkai Zhou, Jin J. Zhou, Hua Zhou
Linear mixed models are widely used for analyzing longitudinal datasets, and the inference for variance component parameters relies on the bootstrap method. However, health systems and technology companies routinely generate massive longitudinal datasets that make the traditional bootstrap method infeasible. To solve this problem, we extend the highly scalable bag of little bootstraps method for independent data to longitudinal data and develop a highly efficient Julia package MixedModelsBLB.jl. Simulation experiments and real data analysis demonstrate the favorable statistical performance and computational advantages of our method compared to the traditional bootstrap method. For the statistical inference of variance components, it achieves 200 times speedup on the scale of 1 million subjects (20 million total observations), and is the only currently available tool that can handle more than 10 million subjects (200 million total observations) using desktop computers.
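The scalability of the bag of little bootstraps comes from a multinomial-weight trick: each subset keeps only b = n^gamma distinct points, and full-size resamples are emulated by random weights rather than by copying n observations. Below is an illustrative stdlib-only sketch for the standard error of a mean; it is not the MixedModelsBLB.jl implementation, which targets variance components in linear mixed models:

```python
import random
import statistics

random.seed(0)

def blb_se_of_mean(x, gamma=0.7, n_subsets=20, n_boot=50):
    """Bag of little bootstraps estimate of the standard error of the mean.

    Each subset holds b = n**gamma distinct points; drawing n indices
    over those b points yields multinomial weights that emulate a
    full-size bootstrap sample without materializing n observations.
    """
    n = len(x)
    b = int(n ** gamma)
    estimates = []
    for _ in range(n_subsets):
        sub = random.sample(x, b)          # subset without replacement
        boot_means = []
        for _ in range(n_boot):
            counts = [0] * b               # multinomial weights summing to n
            for _ in range(n):
                counts[random.randrange(b)] += 1
            boot_means.append(sum(c * v for c, v in zip(counts, sub)) / n)
        estimates.append(statistics.pstdev(boot_means))
    return sum(estimates) / n_subsets      # average the per-subset SEs
```

Because each subset's bootstrap loop is independent, the outer loop parallelizes naturally across workers holding different data shards, which is what makes the method attractive for massive and distributed datasets.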
{"title":"Bag of little bootstraps for massive and distributed longitudinal data","authors":"Xinkai Zhou, Jin J. Zhou, Hua Zhou","doi":"10.1002/sam.11563","DOIUrl":"https://doi.org/10.1002/sam.11563","url":null,"abstract":"Linear mixed models are widely used for analyzing longitudinal datasets, and the inference for variance component parameters relies on the bootstrap method. However, health systems and technology companies routinely generate massive longitudinal datasets that make the traditional bootstrap method infeasible. To solve this problem, we extend the highly scalable bag of little bootstraps method for independent data to longitudinal data and develop a highly efficient Julia package MixedModelsBLB.jl. Simulation experiments and real data analysis demonstrate the favorable statistical performance and computational advantages of our method compared to the traditional bootstrap method. For the statistical inference of variance components, it achieves 200 times speedup on the scale of 1 million subjects (20 million total observations), and is the only currently available tool that can handle more than 10 million subjects (200 million total observations) using desktop computers.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121133935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
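The bag of little bootstraps (BLB) procedure that the abstract extends to longitudinal data can be sketched for the simpler independent-data case: partition the data into small subsets of size b = n^γ, and on each subset draw multinomial weights that sum to the full sample size n, so full-size resamples are represented compactly by counts. The sketch below is illustrative Python, not the Julia package MixedModelsBLB.jl; the function name `blb_se` and the defaults for `s`, `gamma`, and `r` are assumptions made for the example.

```python
import numpy as np

def blb_se(data, estimator, s=5, gamma=0.7, r=50, rng=None):
    """Bag of little bootstraps: estimate the standard error of `estimator`
    over 1-D `data` without ever materialising a full-size resample."""
    rng = np.random.default_rng(rng)
    n = len(data)
    b = int(n ** gamma)                      # little-bootstrap subset size
    subset_ses = []
    for _ in range(s):
        subset = rng.choice(data, size=b, replace=False)
        stats = []
        for _ in range(r):
            # Multinomial weights summing to n: a size-n resample
            # represented compactly by counts over the b subset points.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            stats.append(estimator(subset, counts))
        subset_ses.append(np.std(stats, ddof=1))
    return float(np.mean(subset_ses))        # average SE across subsets

def weighted_mean(x, w):
    # Example estimator that accepts the multinomial resample weights.
    return np.average(x, weights=w)
```

For the sample mean of n i.i.d. standard normals, the estimated standard error should land near 1/sqrt(n), which is what makes the method attractive at the million-subject scale mentioned in the abstract: each resample costs O(b), not O(n).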
Intuitively adaptable outlier detector
Pub Date : 2021-11-12 DOI: 10.1002/sam.11562
Krystyna Kiersztyn
Nowadays we deal with large amounts of data in which anomalies arise naturally for many reasons, both hardware- and human-related. It is therefore necessary to develop efficient tools that adapt easily to diverse data. The paper presents an innovative use of classical statistical tools to detect outliers in multidimensional data sets. The proposed approach applies well-known statistical methods in an innovative way and achieves a high level of efficiency through multi-level aggregation. The effectiveness of the proposed method is demonstrated by a series of numerical experiments.
Citations: 2
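The abstract does not give the paper's specific multi-level aggregation scheme, so the sketch below is only a generic illustration of the idea it describes: apply a classical per-dimension statistic (here the median/MAD modified z-score) and then aggregate the per-dimension decisions into one flag per observation. The function name, threshold, and aggregation rule are assumptions made for the example, not the paper's method.

```python
import numpy as np

def robust_outlier_flags(X, threshold=3.5):
    """Flag rows of X that are extreme in at least one dimension,
    using per-dimension modified z-scores (median/MAD)."""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    mad = np.where(mad == 0, 1e-9, mad)      # guard against zero spread
    z = 0.6745 * (X - med) / mad             # modified z-score
    per_dim_flags = np.abs(z) > threshold
    return per_dim_flags.any(axis=1)         # aggregate across dimensions
```

Because the median and MAD are themselves robust, a handful of gross outliers cannot mask themselves by inflating the scale estimate, which is the usual motivation for preferring them over the mean and standard deviation in this setting.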