
arXiv - STAT - Methodology: Latest Publications

Improve Sensitivity Analysis Synthesizing Randomized Clinical Trials With Limited Overlap
Pub Date: 2024-09-11 | DOI: arxiv-2409.07391
Kuan Jiang, Wenjie Hu, Shu Yang, Xinxing Lai, Xiaohua Zhou
To estimate the average treatment effect in real-world populations, observational studies are typically designed around real-world cohorts. However, even when study samples from these designs represent the population, unmeasured confounders can introduce bias. Sensitivity analysis is often used to estimate bounds for the average treatment effect without relying on the strict mathematical assumptions of other existing methods. This article introduces a new approach that improves sensitivity analysis in observational studies by incorporating randomized clinical trial data, even with limited overlap due to inclusion/exclusion criteria. Theoretical proof and simulations show that this method provides a tighter bound width than existing approaches. We also apply this method to both a trial dataset and a real-world drug effectiveness comparison dataset for practical analysis.
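The abstract does not spell out the bounding construction, so purely as orientation, the snippet below computes classical Manski-style worst-case bounds for the average treatment effect of a binary treatment on an outcome bounded in [0, 1]. It is a generic illustration of what "bounds for the average treatment effect" means in a partial-identification setting, not the estimator of Jiang et al.; all function and variable names are ours.

```python
# Generic Manski-style worst-case bounds for the ATE of a binary treatment on
# an outcome known to lie in [y_min, y_max]. Illustration only -- this is not
# the synthesis method proposed in the paper.
import numpy as np

def manski_ate_bounds(y, t, y_min=0.0, y_max=1.0):
    """Bounds on E[Y(1)] - E[Y(0)] without any no-unmeasured-confounding assumption."""
    y, t = np.asarray(y, dtype=float), np.asarray(t, dtype=int)
    p = t.mean()                                                  # P(T = 1)
    m1, m0 = y[t == 1].mean(), y[t == 0].mean()
    ey1 = (m1 * p + y_min * (1 - p), m1 * p + y_max * (1 - p))    # bounds on E[Y(1)]
    ey0 = (m0 * (1 - p) + y_min * p, m0 * (1 - p) + y_max * p)    # bounds on E[Y(0)]
    return ey1[0] - ey0[1], ey1[1] - ey0[0]

rng = np.random.default_rng(0)
t = rng.integers(0, 2, size=500)
y = np.clip(0.3 + 0.2 * t + rng.normal(0, 0.2, size=500), 0.0, 1.0)
lo, hi = manski_ate_bounds(y, t)
print(f"worst-case ATE bounds: [{lo:.3f}, {hi:.3f}]")   # interval of width y_max - y_min
```

Tightening intervals of this kind -- in the paper, by borrowing information from a randomized trial even when overlap is limited -- is exactly the improvement the authors quantify as bound width.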
Citations: 0
Extended-support beta regression for $[0, 1]$ responses
Pub Date: 2024-09-11 | DOI: arxiv-2409.07233
Ioannis Kosmidis, Achim Zeileis
We introduce the XBX regression model, a continuous mixture of extended-support beta regressions for modeling bounded responses with or without boundary observations. The core building block of the new model is the extended-support beta distribution, which is a censored version of a four-parameter beta distribution with the same exceedance on the left and right of $(0, 1)$. Hence, XBX regression is a direct extension of beta regression. We prove that both beta regression with dispersion effects and heteroscedastic normal regression with censoring at both $0$ and $1$ -- known as the heteroscedastic two-limit tobit model in the econometrics literature -- are special cases of the extended-support beta regression model, depending on whether a single extra parameter is zero or infinity, respectively. To overcome identifiability issues that may arise in estimating the extra parameter due to the similarity of the beta and normal distribution for certain parameter settings, we assume that the additional parameter has an exponential distribution with an unknown mean. The associated marginal likelihood can be conveniently and accurately approximated using a Gauss-Laguerre quadrature rule, resulting in efficient estimation and inference procedures. The new model is used to analyze investment decisions in a behavioral economics experiment, where the occurrence and extent of loss aversion is of interest. In contrast to standard approaches, XBX regression can simultaneously capture the probability of rational behavior as well as the mean amount of loss aversion. Moreover, the effectiveness of the new model is illustrated through extensive numerical comparisons with alternative models.
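A minimal sketch of the construction described above, assuming only the verbal definition of the extended-support beta distribution (a beta distribution rescaled to an interval exceeding $(0, 1)$ by the same amount $u$ on each side, then censored at 0 and 1); the parameter names mu, phi, and u are ours, and this is not the authors' implementation:

```python
# Minimal sketch of the extended-support beta construction, following only the
# verbal description in the abstract: a beta draw rescaled to (-u, 1 + u) and
# censored at 0 and 1, so exceedances become boundary observations.
# Parameter names (mu, phi, u) are assumptions, not the paper's notation or API.
import numpy as np

def r_extended_support_beta(n, mu=0.6, phi=8.0, u=0.1, rng=None):
    rng = rng or np.random.default_rng()
    a, b = mu * phi, (1.0 - mu) * phi        # mean/precision -> shape parameters
    z = rng.beta(a, b, size=n)               # draws on (0, 1)
    x = -u + (1.0 + 2.0 * u) * z             # rescale to (-u, 1 + u)
    return np.clip(x, 0.0, 1.0)              # censor: point masses at 0 and 1

x = r_extended_support_beta(10_000, u=0.1, rng=np.random.default_rng(1))
print("P(X = 0) ~", (x == 0.0).mean(), "  P(X = 1) ~", (x == 1.0).mean())
```

As u shrinks to zero the boundary masses vanish and the draw is an ordinary beta variate, consistent with the claim that XBX regression directly extends beta regression.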
Citations: 0
Multi-source Stable Variable Importance Measure via Adversarial Machine Learning
Pub Date: 2024-09-11 | DOI: arxiv-2409.07380
Zitao Wang, Nian Si, Zijian Guo, Molei Liu
As part of enhancing the interpretability of machine learning, it is of renewed interest to quantify and infer the predictive importance of certain exposure covariates. Modern scientific studies often collect data from multiple sources with distributional heterogeneity. Thus, measuring and inferring stable associations across multiple environments is crucial in reliable and generalizable decision-making. In this paper, we propose MIMAL, a novel statistical framework for Multi-source stable Importance Measure via Adversarial Learning. MIMAL measures the importance of some exposure variables by maximizing the worst-case predictive reward over the source mixture. Our framework allows various machine learning methods for confounding adjustment and exposure effect characterization. For inferential analysis, the asymptotic normality of our introduced statistic is established under a general machine learning framework that requires no stronger learning accuracy conditions than those for single-source variable importance. Numerical studies with various types of data generation setups and machine learning implementation are conducted to justify the finite-sample performance of MIMAL. We also illustrate our method through a real-world study of Beijing air pollution in multiple locations.
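As a deliberately crude illustration of a worst-case, multi-source importance measure (not MIMAL, its reward function, or its adversarial optimization and inference), the toy below takes the minimum across sources of the gain in in-sample R^2 from including the exposure column, using ordinary least squares as the learner; every name here is hypothetical.

```python
# Toy worst-case variable-importance measure across heterogeneous sources:
# importance of exposure column j = min over sources of the gain in in-sample
# R^2 from including column j (OLS as the learner). A crude proxy for the idea
# only -- not the MIMAL estimator.
import numpy as np
from sklearn.linear_model import LinearRegression

def r2(X, y):
    return LinearRegression().fit(X, y).score(X, y)

def worst_case_gain(sources, j):
    """sources: list of (X, y) pairs; j: index of the exposure column."""
    gains = [r2(X, y) - r2(np.delete(X, j, axis=1), y) for X, y in sources]
    return min(gains)                      # worst case over the sources

rng = np.random.default_rng(2)
sources = []
for beta_j in (0.8, 0.5, 0.1):             # heterogeneous effect of the exposure
    X = rng.normal(size=(300, 3))
    y = beta_j * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300)
    sources.append((X, y))
print("worst-case importance of column 0:", worst_case_gain(sources, j=0))
```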
Citations: 0
Integrating Multiple Data Sources with Interactions in Multi-Omics Using Cooperative Learning
Pub Date: 2024-09-11 | DOI: arxiv-2409.07125
Matteo D'Alessandro, Theophilus Quachie Asenso, Manuela Zucknick
Modeling with multi-omics data presents multiple challenges such as the high-dimensionality of the problem ($p \gg n$), the presence of interactions between features, and the need for integration between multiple data sources. We establish an interaction model that allows for the inclusion of multiple sources of data from the integration of two existing methods, pliable lasso and cooperative learning. The integrated model is tested both on simulation studies and on real multi-omics datasets for predicting labor onset and cancer treatment response. The results show that the model is effective in modeling multi-source data in various scenarios where interactions are present, both in terms of prediction performance and selection of relevant variables.
Citations: 0
Sequential stratified inference for the mean
Pub Date: 2024-09-10 | DOI: arxiv-2409.06680
Jacob V. Spertus, Mayuri Sridhar, Philip B. Stark
We develop conservative tests for the mean of a bounded population using data from a stratified sample. The sample may be drawn sequentially, with or without replacement. The tests are "anytime valid," allowing optional stopping and continuation in each stratum. We call this combination of properties sequential, finite-sample, nonparametric validity. The methods express a hypothesis about the population mean as a union of intersection hypotheses describing within-stratum means. They test each intersection hypothesis using independent test supermartingales (TSMs) combined across strata by multiplication. The $P$-value of the global null hypothesis is then the maximum $P$-value of any intersection hypothesis in the union. This approach has three primary moving parts: (i) the rule for deciding which stratum to draw from next to test each intersection null, given the sample so far; (ii) the form of the TSM for each null in each stratum; and (iii) the method of combining evidence across strata. These choices interact. We examine the performance of a variety of rules with differing computational complexity. Approximately optimal methods have a prohibitive computational cost, while naive rules may be inconsistent -- they will never reject for some alternative populations, no matter how large the sample. We present a method that is statistically comparable to optimal methods in examples where optimal methods are computable, but computationally tractable for arbitrarily many strata. In numerical examples its expected sample size is substantially smaller than that of previous methods.
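The basic ingredient -- a test supermartingale for a within-stratum mean of [0, 1]-valued data, with independent strata combined by multiplication and an anytime-valid P-value obtained from Ville's inequality -- can be sketched as below. This covers a single intersection hypothesis with a fixed bet and a fixed drawing order; the paper's stratum-selection rules and union-intersection construction are not reproduced here.

```python
# Sketch: a betting test supermartingale (TSM) for H0: stratum mean <= mu0 with
# [0, 1]-valued data, strata combined by multiplying their independent TSMs;
# 1 / max_t M_t is an anytime-valid P-value by Ville's inequality. Fixed bets
# and a fixed drawing order are used purely for illustration.
import numpy as np

def stratum_tsm(x, mu0, lam):
    """Running product of 1 + lam*(x_i - mu0); nonnegative when 0 <= lam <= 1/mu0."""
    return np.cumprod(1.0 + lam * (np.asarray(x, dtype=float) - mu0))

rng = np.random.default_rng(3)
strata = [rng.uniform(0.4, 1.0, size=200),     # stratum 1, true mean 0.7
          rng.uniform(0.3, 0.9, size=200)]     # stratum 2, true mean 0.6
nulls = [0.5, 0.4]                              # one intersection hypothesis
lam = 0.5                                       # a safe fixed bet

# One observation per stratum per step; the product of independent TSMs is
# again a test supermartingale for the intersection null.
combined = stratum_tsm(strata[0], nulls[0], lam) * stratum_tsm(strata[1], nulls[1], lam)
print("anytime-valid P-value for the intersection null:", min(1.0, 1.0 / combined.max()))
```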
Citations: 0
Ensemble Doubly Robust Bayesian Inference via Regression Synthesis
Pub Date: 2024-09-10 | DOI: arxiv-2409.06288
Kaoru Babasaki, Shonosuke Sugasawa, Kosaku Takanashi, Kenichiro McAlinn
The doubly robust estimator, which models both the propensity score and outcomes, is a popular approach to estimate the average treatment effect in the potential outcome setting. The primary appeal of this estimator is its theoretical property, wherein the estimator achieves consistency as long as either the propensity score or outcomes is correctly specified. In most applications, however, both are misspecified, leading to considerable bias that cannot be checked. In this paper, we propose a Bayesian ensemble approach that synthesizes multiple models for both the propensity score and outcomes, which we call doubly robust Bayesian regression synthesis. Our approach applies Bayesian updating to the ensemble model weights that adapt at the unit level, incorporating data heterogeneity, to significantly mitigate misspecification bias. Theoretically, we show that our proposed approach is consistent regarding the estimation of both the propensity score and outcomes, ensuring that the doubly robust estimator is consistent, even if no single model is correctly specified. An efficient algorithm for posterior computation facilitates the characterization of uncertainty regarding the treatment effect. Our proposed approach is compared against standard and state-of-the-art methods through two comprehensive simulation studies, where we find that our approach is superior in all cases. An empirical study on the impact of maternal smoking on birth weight highlights the practical applicability of our proposed method.
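For reference, the classical frequentist doubly robust (AIPW) estimator that the paper takes as its starting point has a standard closed form; a minimal sketch with a logistic-regression propensity model and linear outcome models is given below. This is the textbook estimator only, not the Bayesian regression-synthesis ensemble proposed in the paper.

```python
# Classical AIPW (doubly robust) estimate of the ATE: consistent if either the
# propensity model or the outcome models are correctly specified. This is the
# textbook estimator referred to above, not the proposed Bayesian method.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_ate(X, t, y):
    e = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    m1 = LinearRegression().fit(X[t == 1], y[t == 1]).predict(X)
    m0 = LinearRegression().fit(X[t == 0], y[t == 0]).predict(X)
    psi = m1 - m0 + t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e)
    return psi.mean()

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 3))
p = 1.0 / (1.0 + np.exp(-(0.5 * X[:, 0] - 0.25 * X[:, 1])))   # true propensity
t = rng.binomial(1, p)
y = 1.0 * t + X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=2000)
print("AIPW estimate of the ATE (true value 1.0):", aipw_ate(X, t, y))
```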
Citations: 0
Nonparametric Inference for Balance in Signed Networks
Pub Date: 2024-09-10 | DOI: arxiv-2409.06172
Xuyang Chen, Yinjie Wang, Weijing Tang
In many real-world networks, relationships often go beyond simple dyadic presence or absence; they can be positive, like friendship, alliance, and mutualism, or negative, characterized by enmity, disputes, and competition. To understand the formation mechanism of such signed networks, the social balance theory sheds light on the dynamics of positive and negative connections. In particular, it characterizes the proverbs, "a friend of my friend is my friend" and "an enemy of my enemy is my friend". In this work, we propose a nonparametric inference approach for assessing empirical evidence for the balance theory in real-world signed networks. We first characterize the generating process of signed networks with node exchangeability and propose a nonparametric sparse signed graphon model. Under this model, we construct confidence intervals for the population parameters associated with balance theory and establish their theoretical validity. Our inference procedure is as computationally efficient as a simple normal approximation but offers higher-order accuracy. By applying our method, we find strong real-world evidence for balance theory in signed networks across various domains, extending its applicability beyond social psychology.
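The descriptive quantity at the heart of balance theory is easy to compute: a closed triangle is balanced exactly when the product of its three edge signs is positive, which encodes both proverbs quoted above. The sketch below only counts balanced triangles in a toy signed network; it is not the paper's graphon-based inference procedure or its confidence intervals.

```python
# A triangle in a signed network is balanced iff the product of its three edge
# signs is +1 ("a friend of my friend is my friend", "an enemy of my enemy is
# my friend"). This computes the descriptive balanced-triangle fraction on a
# toy network; it is not the paper's nonparametric inference.
import itertools

def balanced_fraction(signs):
    """signs: dict mapping frozenset({u, v}) -> +1 or -1 for observed edges."""
    nodes = sorted(set().union(*signs))
    balanced = total = 0
    for a, b, c in itertools.combinations(nodes, 3):
        edges = [frozenset(p) for p in ((a, b), (b, c), (a, c))]
        if all(e in signs for e in edges):                 # closed triangles only
            total += 1
            balanced += signs[edges[0]] * signs[edges[1]] * signs[edges[2]] > 0
    return balanced / total if total else float("nan")

# Two hostile factions, {1, 2, 3} versus {4}: every closed triangle is balanced.
signs = {frozenset(p): s for p, s in
         [((1, 2), +1), ((2, 3), +1), ((1, 3), +1),
          ((1, 4), -1), ((2, 4), -1), ((3, 4), -1)]}
print("fraction of balanced triangles:", balanced_fraction(signs))   # 1.0
```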
Citations: 0
Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach
Pub Date: 2024-09-10 | DOI: arxiv-2409.06180
Yunhui Qi, Xinyi Wang, Li-Xuan Qin
Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment.
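The performance-versus-sample-size idea can be sketched with a standard inverse power-law learning curve fitted to accuracies estimated at a few pilot sample sizes and then inverted for a target accuracy. The pilot accuracies below are fabricated for illustration, and the snippet is not the authors' pipeline (their data-augmentation step and GitHub implementation are omitted).

```python
# Sketch: fit an inverse power-law learning curve acc(n) = a - b * n**(-c) to
# accuracies measured at a few pilot sample sizes, then invert it to project
# the n needed for a target accuracy. Pilot accuracies are made up; in practice
# they would come from repeated cross-validated model fits.
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, b, c):
    return a - b * np.power(n, -c)

n_pilot = np.array([25.0, 50.0, 100.0, 200.0, 400.0])
acc_pilot = np.array([0.62, 0.70, 0.76, 0.80, 0.83])     # hypothetical estimates

(a, b, c), _ = curve_fit(learning_curve, n_pilot, acc_pilot,
                         p0=[0.9, 1.0, 0.5], bounds=(0.0, [1.0, 10.0, 2.0]))
target = 0.85
if target < a:
    n_needed = (b / (a - target)) ** (1.0 / c)            # invert the fitted curve
    print(f"projected sample size for accuracy {target}: ~{n_needed:.0f}")
else:
    print(f"target {target} exceeds the fitted asymptote a = {a:.3f}")
```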
Citations: 0
A new paradigm for global sensitivity analysis
Pub Date: 2024-09-10 | DOI: arxiv-2409.06271
Gildas Mazo (MaIAGE)

Current theory of global sensitivity analysis, based on a nonlinear functional ANOVA decomposition of the random output, is limited in scope -- for instance, the analysis is limited to the output's variance and the inputs have to be mutually independent -- and leads to sensitivity indices the interpretation of which is not fully clear, especially interaction effects. Alternatively, sensitivity indices built for arbitrary user-defined importance measures have been proposed but a theory to define interactions in a systematic fashion and/or establish a decomposition of the total importance measure is still missing. It is shown that these important problems are solved all at once by adopting a new paradigm. By partitioning the inputs into those causing the change in the output and those which do not, arbitrary user-defined variability measures are identified with the outcomes of a factorial experiment at two levels, leading to all factorial effects without assuming any functional decomposition. To link various well-known sensitivity indices of the literature (Sobol indices and Shapley effects), weighted factorial effects are studied and utilized.

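For context, the classical variance-based indices that this paper seeks to move beyond admit a simple pick-freeze Monte Carlo estimator; the sketch below estimates first-order Sobol indices for a toy additive model whose true answers are known (0.2 and 0.8). It illustrates the ANOVA-based indices being critiqued, not the proposed factorial-effect paradigm.

```python
# Pick-freeze Monte Carlo estimator of first-order Sobol indices for the toy
# model Y = X1 + 2*X2 with independent standard-normal inputs (true values
# S1 = 0.2, S2 = 0.8). Classical variance-based indices only, shown for
# comparison with the abstract's discussion.
import numpy as np

def model(X):
    return X[:, 0] + 2.0 * X[:, 1]

def sobol_first_order(f, d, n=200_000, rng=None):
    rng = rng or np.random.default_rng()
    A, B = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    yA = f(A)
    var = yA.var()
    estimates = []
    for i in range(d):
        ABi = B.copy()
        ABi[:, i] = A[:, i]            # keep ("freeze") input i, resample the rest
        yABi = f(ABi)
        estimates.append((np.mean(yA * yABi) - yA.mean() * yABi.mean()) / var)
    return estimates

print(sobol_first_order(model, d=2, rng=np.random.default_rng(5)))
```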
Citations: 0
This is not normal! (Re-) Evaluating the lower $n$ guildelines for regression analysis
Pub Date: 2024-09-10 | DOI: arxiv-2409.06413
David Randahl
The commonly cited rule of thumb for regression analysis, which suggests that a sample size of $n \geq 30$ is sufficient to ensure valid inferences, is frequently referenced but rarely scrutinized. This research note evaluates the lower bound for the number of observations required for regression analysis by exploring how different distributional characteristics, such as skewness and kurtosis, influence the convergence of t-values to the t-distribution in linear regression models. Through an extensive simulation study involving over 22 billion regression models, this paper examines a range of symmetric, platykurtic, and skewed distributions, testing sample sizes from 4 to 10,000. The results reveal that it is sufficient that either the dependent or independent variable follow a symmetric distribution for the t-values to converge to the t-distribution at much smaller sample sizes than $n=30$. This is contrary to previous guidance which suggests that the error term needs to be normally distributed for this convergence to happen at low $n$. On the other hand, if both dependent and independent variables are highly skewed the required sample size is substantially higher. In cases of extreme skewness, even sample sizes of 10,000 do not ensure convergence. These findings suggest that the $n \geq 30$ rule is too permissive in certain cases but overly conservative in others, depending on the underlying distributional characteristics. This study offers revised guidelines for determining the minimum sample size necessary for valid regression analysis.
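A small-scale version of the simulation described above is easy to run: generate simple linear regressions under a zero-slope null with skewed covariates and errors, collect the slope t-values, and compare the empirical rejection rate with the nominal level of the reference t-distribution. The sketch below mirrors that check at modest scale; it is not the paper's 22-billion-model study or its full design grid.

```python
# Small-scale version of the convergence check: simulate simple linear
# regressions under a zero-slope null with skewed covariates and errors,
# collect the slope t-values, and compare the empirical size with the nominal
# level implied by the t-distribution with n - 2 degrees of freedom.
import numpy as np
from scipy import stats

def slope_t_values(n, n_sims, x_sampler, err_sampler, rng):
    tvals = np.empty(n_sims)
    for s in range(n_sims):
        x = x_sampler(rng, n)
        y = err_sampler(rng, n)                      # true slope is exactly zero
        fit = stats.linregress(x, y)
        tvals[s] = fit.slope / fit.stderr
    return tvals

rng = np.random.default_rng(6)
n, n_sims, alpha = 10, 20_000, 0.05
tvals = slope_t_values(
    n, n_sims,
    x_sampler=lambda r, m: r.lognormal(0.0, 1.0, m),    # heavily skewed covariate
    err_sampler=lambda r, m: r.lognormal(0.0, 1.0, m),  # heavily skewed errors
    rng=rng,
)
reject = np.mean(np.abs(tvals) > stats.t.ppf(1.0 - alpha / 2.0, df=n - 2))
print(f"empirical size at nominal {alpha} with n = {n}: {reject:.3f}")
```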
Citations: 0