
Advances in Data Analysis and Classification: Latest Articles

Loss-guided stability selection
IF 1.6 Zone 4 (Computer Science) Q2 STATISTICS & PROBABILITY Pub Date: 2023-12-15 DOI: 10.1007/s11634-023-00573-3
Tino Werner

In modern data analysis, sparse model selection becomes inevitable once the number of predictor variables is very high. It is well-known that model selection procedures like the Lasso or Boosting tend to overfit on real data. The celebrated Stability Selection overcomes these weaknesses by aggregating models, based on subsamples of the training data, followed by choosing a stable predictor set which is usually much sparser than the predictor sets from the raw models. The standard Stability Selection is based on a global criterion, namely the per-family error rate, while additionally requiring expert knowledge to suitably configure the hyperparameters. Model selection depends on the loss function, i.e., predictor sets selected w.r.t. some particular loss function differ from those selected w.r.t. some other loss function. Therefore, we propose a Stability Selection variant which respects the chosen loss function via an additional validation step based on out-of-sample validation data, optionally enhanced with an exhaustive search strategy. Our Stability Selection variants are widely applicable and user-friendly. Moreover, our Stability Selection variants can avoid the issue of severe underfitting, which affects the original Stability Selection for noisy high-dimensional data, so our priority is not to avoid false positives at all costs but to result in a sparse stable model with which one can make predictions. Experiments where we consider both regression and binary classification with Boosting as model selection algorithm reveal a significant precision improvement compared to raw Boosting models while not suffering from any of the mentioned issues of the original Stability Selection.
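
The idea of combining subsample-based selection frequencies with a loss-driven validation step can be sketched as follows. This is a minimal illustration and not the paper's algorithm: the base selector (a correlation ranking standing in for the Lasso or Boosting), the least-squares refit, the threshold grid, and all function names are assumptions made for the example.

```python
import random


def top_k_by_correlation(X, y, k):
    """Stand-in base selector (assumption): the k columns most correlated with y."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        cov = sum((c - c_mean) * (t - y_mean) for c, t in zip(col, y))
        var = sum((c - c_mean) ** 2 for c in col) or 1e-12
        scores.append(abs(cov) / var ** 0.5)
    return sorted(range(p), key=lambda j: -scores[j])[:k]


def selection_frequencies(X, y, k, n_subsamples, rng):
    """Fraction of half-size subsamples in which each column is selected."""
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)
        chosen = top_k_by_correlation([X[i] for i in idx], [y[i] for i in idx], k)
        for j in chosen:
            counts[j] += 1
    return [c / n_subsamples for c in counts]


def fit_least_squares(X, y, cols):
    """Solve the normal equations on the chosen columns (tiny ridge for stability)."""
    m = len(cols)
    A = [[sum(r[a] * r[b] for r in X) + (1e-8 if a == b else 0.0) for b in cols]
         for a in cols]
    b = [sum(r[a] * t for r, t in zip(X, y)) for a in cols]
    for i in range(m):  # Gaussian elimination with partial pivoting
        piv = max(range(i, m), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        b[i], b[piv] = b[piv], b[i]
        for r in range(i + 1, m):
            f = A[r][i] / A[i][i]
            for c in range(i, m):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    coef = [0.0] * m
    for i in reversed(range(m)):  # back substitution
        tail = sum(A[i][c] * coef[c] for c in range(i + 1, m))
        coef[i] = (b[i] - tail) / A[i][i]
    return coef


def loss_guided_stable_set(X_tr, y_tr, X_val, y_val, loss, k=5, n_subsamples=50, seed=0):
    """Pick the frequency threshold whose stable set minimizes the validation loss."""
    rng = random.Random(seed)
    freq = selection_frequencies(X_tr, y_tr, k, n_subsamples, rng)
    best = None
    for thr in sorted(set(freq), reverse=True):
        if thr <= 0:
            continue
        cols = [j for j, f in enumerate(freq) if f >= thr]
        coef = fit_least_squares(X_tr, y_tr, cols)
        preds = [sum(c * row[j] for c, j in zip(coef, cols)) for row in X_val]
        val_loss = sum(loss(t, p) for t, p in zip(y_val, preds)) / len(y_val)
        if best is None or val_loss < best[0]:
            best = (val_loss, cols)
    return best[1], freq
```

Passing a different `loss` callable (squared error, absolute error, and so on) changes which threshold wins, which is the sense in which the selected predictor set depends on the loss function.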

Citations: 0
A fresh look at mean-shift based modal clustering
IF 1.4 Zone 4 (Computer Science) Q2 STATISTICS & PROBABILITY Pub Date: 2023-12-14 DOI: 10.1007/s11634-023-00575-1
Jose Ameijeiras-Alonso, Jochen Einbeck

Modal clustering is an unsupervised learning technique where cluster centers are identified as the local maxima of nonparametric probability density estimates. A natural algorithmic engine for the computation of these maxima is the mean shift procedure, which is essentially an iteratively computed chain of local means. We revisit this technique, focusing on its link to kernel density gradient estimation, and in doing so propose a novel approach to bandwidth selection based on the concept of a critical bandwidth. Furthermore, in the one-dimensional case, an inverse version of the mean shift is developed to provide a novel approach for the estimation of antimodes, which is then used to identify cluster boundaries. A simulation study is provided which assesses, in the univariate case, the classification accuracy of the mean-shift based clustering approach. Three (univariate and multivariate) examples from the fields of philately, engineering, and imaging illustrate how modal clusterings identified through mean shift based methods relate directly and naturally to physical properties of the data-generating system. Solutions are proposed for dealing with large data sets in a computationally efficient way.
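
The "chain of local means" can be made concrete with a one-dimensional sketch. It assumes a Gaussian kernel and a fixed, user-supplied bandwidth; the critical-bandwidth selection and the inverse mean shift for antimodes described in the abstract are not reproduced here, and the function names are illustrative.

```python
import math


def mean_shift_mode(x0, data, bandwidth, tol=1e-8, max_iter=500):
    """Follow the chain of kernel-weighted local means from x0 to a density mode."""
    x = x0
    for _ in range(max_iter):
        weights = [math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data]
        new_x = sum(w * d for w, d in zip(weights, data)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x


def modal_clusters(data, bandwidth, merge_tol=1e-3):
    """Assign each point to the mode its mean shift chain converges to."""
    modes, labels = [], []
    for d in data:
        m = mean_shift_mode(d, data, bandwidth)
        for k, existing in enumerate(modes):
            if abs(m - existing) < merge_tol:
                labels.append(k)
                break
        else:  # no existing mode is close enough: register a new one
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels
```

Each chain starts at a data point, so the kernel weights never sum to zero; points whose chains end at the same mode share a cluster label.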

Volume 18, Issue 4, pp. 1067-1095
Citations: 0
A probabilistic method for reconstructing the Foreign Direct Investments network in search of ultimate host economies
IF 1.6 Zone 4 (Computer Science) Q2 STATISTICS & PROBABILITY Pub Date: 2023-12-08 DOI: 10.1007/s11634-023-00571-5
Nadia Accoto, Valerio Astuti, Costanza Catalano

The Ultimate Host Economies (UHEs) of a given country are defined as the ultimate destinations of Foreign Direct Investment (FDI) originating in that country. Bilateral FDI statistics struggle to identify them due to the non-negligible presence of conduit jurisdictions, which provide attractive intermediate destinations for pass-through investments due to favorable tax regimes. At the same time, determining UHEs is crucial for understanding the actual paths followed by FDI among increasingly interdependent economies. In this paper, we first reconstruct the global FDI network through mirroring and clustering techniques, starting from data collected by the International Monetary Fund. Then we provide a method for computing an (approximate) distribution of the UHEs of a country by using a probabilistic approach to this network, based on Markov chains. More specifically, we analyze the Italian case.
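
One way to picture the Markov-chain view is as a walk on the FDI network in which investment arriving in an economy either settles there or passes through to the next destination. The sketch below is an illustration under that assumption only: the per-economy pass-through probabilities, the data layout, and the function name are hypothetical and are not taken from the paper.

```python
def ultimate_host_distribution(origin, outflows, pass_through, max_steps=1000, eps=1e-12):
    """Approximate where investment from `origin` finally settles.

    outflows[i]     maps destination j to the share of i's outward FDI (shares sum to 1).
    pass_through[j] is the assumed probability that FDI arriving in j moves on.
    """
    in_transit = dict(outflows[origin])  # mass currently arriving at each economy
    settled = {}
    for _ in range(max_steps):
        if sum(in_transit.values()) < eps:
            break
        next_transit = {}
        for economy, mass in in_transit.items():
            onward = outflows.get(economy, {})
            # An economy without recorded outflows is treated as absorbing.
            p_move = pass_through.get(economy, 0.0) if onward else 0.0
            settled[economy] = settled.get(economy, 0.0) + mass * (1.0 - p_move)
            if p_move > 0:
                for dest, share in onward.items():
                    next_transit[dest] = next_transit.get(dest, 0.0) + mass * p_move * share
        in_transit = next_transit
    return settled
```

With made-up shares such as `{"IT": {"LU": 0.6, "DE": 0.4}, "LU": {"US": 0.5, "DE": 0.5}}` and a pass-through probability of 0.8 for `"LU"`, most of the mass initially recorded against the conduit is reassigned to the economies behind it.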

Citations: 0
Variational inference for semiparametric Bayesian novelty detection in large datasets
IF 1.4 Zone 4 (Computer Science) Q2 STATISTICS & PROBABILITY Pub Date: 2023-12-04 DOI: 10.1007/s11634-023-00569-z
Luca Benedetti, Eric Boniardi, Leonardo Chiani, Jacopo Ghirri, Marta Mastropietro, Andrea Cappozzo, Francesco Denti

After being trained on a fully-labeled training set, where the observations are grouped into a certain number of known classes, novelty detection methods aim to classify the instances of an unlabeled test set while allowing for the presence of previously unseen classes. These models are valuable in many areas, ranging from social network and food adulteration analyses to biology, where an evolving population may be present. In this paper, we focus on a two-stage Bayesian semiparametric novelty detector, also known as Brand, recently introduced in the literature. Leveraging on a model-based mixture representation, Brand allows clustering the test observations into known training terms or a single novelty term. Furthermore, the novelty term is modeled with a Dirichlet Process mixture model to flexibly capture any departure from the known patterns. Brand was originally estimated using MCMC schemes, which are prohibitively costly when applied to high-dimensional data. To scale up Brand applicability to large datasets, we propose to resort to a variational Bayes approach, providing an efficient algorithm for posterior approximation. We demonstrate a significant gain in efficiency and excellent classification performance with thorough simulation studies. Finally, to showcase its applicability, we perform a novelty detection analysis using the openly-available Statlog dataset, a large collection of satellite imaging spectra, to search for novel soil types.

Volume 18, Issue 3, pp. 681-703. Open access PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00569-z.pdf
Citations: 0
Claims fraud detection with uncertain labels
IF 1.4 Zone 4 (Computer Science) Q2 STATISTICS & PROBABILITY Pub Date: 2023-11-30 DOI: 10.1007/s11634-023-00568-0
Félix Vandervorst, Wouter Verbeke, Tim Verdonck

Insurance fraud is a non-self-revealing type of fraud. The true historical labels (fraud or legitimate) are only as precise as the investigators’ efforts and successes in uncovering them. Popular supervised and unsupervised learning approaches fail to capture the ambiguous nature of uncertain labels. Imprecisely observed labels can be represented in the Dempster–Shafer theory of belief functions, a generalization of supervised and unsupervised learning suited to represent uncertainty. In this paper, we show that partial information from the historical investigations can add valuable, learnable information to the fraud detection system and improve its performance. We also show that belief function theory provides a flexible mathematical framework for concept drift detection and cost-sensitive learning, two common challenges in fraud detection. Finally, we present an application to real-world motor insurance claim fraud.
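
To illustrate how belief functions encode an uncertain label, the sketch below implements Dempster's rule of combination for two mass functions over the frame {fraud, legit}. It shows the general rule only, not the learning method of the paper; the mass values in the usage note are invented for the example.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (dicts: frozenset of labels -> mass) with Dempster's rule."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Renormalize so the remaining masses sum to one.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


FRAUD = frozenset({"fraud"})
LEGIT = frozenset({"legit"})
EITHER = FRAUD | LEGIT  # mass on the whole frame expresses ignorance
```

For instance, `dempster_combine({FRAUD: 0.6, EITHER: 0.4}, {LEGIT: 0.3, EITHER: 0.7})` keeps part of the mass on `EITHER`, so a claim that was never fully investigated is neither forced into a hard label nor discarded.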

Volume 18, Issue 1, pp. 219-243
Citations: 0
Robust and sparse logistic regression
IF 1.4 Zone 4 (Computer Science) Q2 STATISTICS & PROBABILITY Pub Date: 2023-11-27 DOI: 10.1007/s11634-023-00572-4
Dries Cornilly, Lise Tubex, Stefan Van Aelst, Tim Verdonck

Logistic regression is one of the most popular statistical techniques for solving (binary) classification problems in various applications (e.g. credit scoring, cancer detection, ad click predictions and churn classification). Typically, the maximum likelihood estimator is used, which is very sensitive to outlying observations. In this paper, we propose a robust and sparse logistic regression estimator where robustness is achieved by means of the γ-divergence. An elastic net penalty ensures sparsity in the regression coefficients such that the model is more stable and interpretable. We show that the influence function is bounded and demonstrate its robustness properties in simulations. The good performance of the proposed estimator is also illustrated in an empirical application that deals with classifying the type of fuel used by cars.

Volume 18, Issue 3, pp. 663-679
Citations: 0
Semiparametric mixture of linear regressions with nonparametric Gaussian scale mixture errors
IF 1.4 Zone 4 (Computer Science) Q2 STATISTICS & PROBABILITY Pub Date: 2023-11-23 DOI: 10.1007/s11634-023-00570-6
Sangkon Oh, Byungtae Seo

In finite mixtures of regression models, a normality assumption is typically adopted for the errors of each regression component. Though this common assumption is theoretically and computationally convenient, it often produces inefficient and undesirable estimates which undermine the applicability of the model, particularly in the presence of outliers. To reduce these defects, we propose to use nonparametric Gaussian scale mixture distributions for the component error distributions. By this means, we can lessen the risk of misspecification and obtain robust estimators. In this paper, we study the identifiability of the proposed model and develop a feasible estimating algorithm. Numerical studies, including simulations and a real data analysis, are also presented to demonstrate the performance of the proposed method.

Volume 18, Issue 1, pp. 5-31
Citations: 0
Functional clustering of fictional narratives using Vonnegut curves
IF 1.4 Zone 4 (Computer Science) Q2 STATISTICS & PROBABILITY Pub Date: 2023-11-04 DOI: 10.1007/s11634-023-00567-1
Shan Zhong, David B. Hitchcock

Motivated by a public suggestion by the famous novelist Kurt Vonnegut, we clustered functional data that represented sentiment curves for famous fictional stories. We analyzed text data from novels written between 1612 and 1925, and transformed them into curves measuring sentiment as a function of the percentage of elapsed contents of the novel. We employed sentence-level sentiment evaluation and nonparametric curve smoothing. Our clustering methods involved finding the optimal number of clusters, aligning curves using different chronological warping functions to account for phase and amplitude variation, and implementing functional K-means algorithms under the square root velocity framework. Our results revealed insights about patterns in fictional narratives that Vonnegut and others have suggested but not analyzed in a functional way.
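
The square root velocity framework mentioned above represents a curve f by q(t) = sign(f'(t)) * sqrt(|f'(t)|). A minimal discretized version, assuming a sentiment curve sampled on an equally spaced grid and using forward differences (both assumptions of this sketch, as is the function name), looks like this:

```python
import math


def srvf(values, dt):
    """Square root velocity representation of an equally spaced sampled curve."""
    out = []
    for a, b in zip(values, values[1:]):
        slope = (b - a) / dt  # forward-difference estimate of f'(t)
        out.append(math.copysign(math.sqrt(abs(slope)), slope))
    return out
```

Distances between curves are then ordinary L2 distances between their `srvf` outputs, which is what makes warping-based alignment and K-means averaging tractable in this framework.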

Volume 18, Issue 4, pp. 1045-1066
Citations: 0
A between-cluster approach for clustering skew-symmetric data
IF 1.4 Zone 4 (Computer Science) Q2 STATISTICS & PROBABILITY Pub Date: 2023-10-28 DOI: 10.1007/s11634-023-00566-2
Donatella Vicari, Cinzia Di Nuzzo

In order to investigate exchanges between objects, a clustering model for skew-symmetric data is proposed, which relies on the between-cluster effects of the skew-symmetries that represent the imbalances of the observed exchanges between pairs of objects. The aim is to detect clusters of objects that share the same behaviour of exchange so that origin and destination clusters are identified. The proposed model is based on the decomposition of the skew-symmetric matrix pertaining to the imbalances between clusters into a sum of a number of off-diagonal block matrices. Each matrix can be approximated by a skew-symmetric matrix by using a truncated Singular Value Decomposition (SVD) which exploits the properties of the skew-symmetric matrices. The model is fitted in a least-squares framework and an efficient Alternating Least Squares algorithm is provided. Finally, in order to show the potentiality of the model and the features of the resulting clusters, an extensive simulation study and an illustrative application to real data are presented.
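
The imbalances referred to above come from the standard split of a square exchange matrix into a symmetric part and a skew-symmetric part. The sketch below shows that split and a simple aggregation of the skew-symmetric part between clusters; the aggregation is an illustrative assumption, not the block decomposition or the Alternating Least Squares fit of the paper.

```python
def split_symmetric_skew(flows):
    """Return (S, K) with flows = S + K, S symmetric and K skew-symmetric."""
    n = len(flows)
    sym = [[(flows[i][j] + flows[j][i]) / 2 for j in range(n)] for i in range(n)]
    skew = [[(flows[i][j] - flows[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, skew


def between_cluster_imbalance(skew, labels, n_clusters):
    """Sum the skew-symmetric entries over pairs of objects in different clusters."""
    totals = [[0.0] * n_clusters for _ in range(n_clusters)]
    for i, gi in enumerate(labels):
        for j, gj in enumerate(labels):
            if gi != gj:
                totals[gi][gj] += skew[i][j]
    return totals
```

A positive entry `totals[g][h]` means cluster `g` sends more to cluster `h` than it receives, so `g` acts as an origin cluster and `h` as a destination cluster; the aggregated matrix is itself skew-symmetric.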

为了研究对象之间的交换,提出了一种斜对称数据的聚类模型,该模型依赖于表示观察到的对象对之间交换的不平衡的斜对称的簇间效应。其目的是检测具有相同交换行为的对象集群,以便识别起源和目的地集群。该模型是基于将与簇间不平衡有关的偏对称矩阵分解为若干非对角线块矩阵的和。利用斜对称矩阵的性质,利用截断奇异值分解(SVD),每个矩阵都可以近似为一个斜对称矩阵。将模型拟合到最小二乘框架中,并给出了一种有效的交替最小二乘算法。最后,为了展示模型的潜力和所得聚类的特征,进行了广泛的仿真研究和对实际数据的说明性应用。
{"title":"A between-cluster approach for clustering skew-symmetric data","authors":"Donatella Vicari,&nbsp;Cinzia Di Nuzzo","doi":"10.1007/s11634-023-00566-2","DOIUrl":"10.1007/s11634-023-00566-2","url":null,"abstract":"<div><p>In order to investigate exchanges between objects, a clustering model for skew-symmetric data is proposed, which relies on the between-cluster effects of the skew-symmetries that represent the imbalances of the observed exchanges between pairs of objects. The aim is to detect clusters of objects that share the same behaviour of exchange so that origin and destination clusters are identified. The proposed model is based on the decomposition of the skew-symmetric matrix pertaining to the imbalances <i>between</i> clusters into a sum of a number of off-diagonal block matrices. Each matrix can be approximated by a skew-symmetric matrix by using a truncated Singular Value Decomposition (SVD) which exploits the properties of the skew-symmetric matrices. The model is fitted in a least-squares framework and an efficient Alternating Least Squares algorithm is provided. Finally, in order to show the potentiality of the model and the features of the resulting clusters, an extensive simulation study and an illustrative application to real data are presented.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 1","pages":"163 - 192"},"PeriodicalIF":1.4,"publicationDate":"2023-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-023-00566-2.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138505006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Applications of dual regularized Laplacian matrix for community detection
IF 1.4 Zone 4 Computer Science Q2 STATISTICS & PROBABILITY Pub Date: 2023-10-26 DOI: 10.1007/s11634-023-00565-3
Huan Qing, Jingli Wang

Spectral clustering is widely used for community detection in networks, and even a small change to the graph Laplacian matrix can improve its performance dramatically. In this paper, we propose a dual regularized graph Laplacian matrix and apply it to the classical spectral clustering approach under the degree-corrected stochastic block model. When the number of communities K is known, we use more than K leading eigenvectors and weight them by their corresponding eigenvalues in the spectral clustering procedure to improve performance. We call the resulting method dual regularized spectral clustering (DRSC). Theoretical analysis shows that, under mild conditions, DRSC yields stable, consistent community detection. We also develop a strategy that combines DRSC with Newman’s modularity to estimate the number of communities K. We compare DRSC with several spectral methods and study the behavior of our strategy for estimating K on a substantial number of simulated networks and on real-world networks. Numerical results show that DRSC performs satisfactorily and that our strategy estimates K accurately and consistently, even when a network contains only one community.
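The pipeline described in this abstract can be sketched roughly as follows. The abstract does not state the formula for the dual regularized Laplacian, so this sketch assumes one plausible reading: apply the usual degree regularization D_tau^{-1/2} A D_tau^{-1/2} twice. The function names, the use of the average degree for both regularizers, the single extra eigenvector, and the small k-means routine are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def regularized_laplacian(M, tau):
    """Return D_tau^{-1/2} M D_tau^{-1/2}, where D_tau = diag(row sums of M) + tau * I."""
    inv_sqrt = 1.0 / np.sqrt(M.sum(axis=1) + tau)
    return M * inv_sqrt[:, None] * inv_sqrt[None, :]

def kmeans(X, K, n_iter=50):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [X[0]]
    for _ in range(1, K):
        dist = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def drsc_sketch(A, K, extra=1):
    # Assumption: regularize twice, each time with the average row sum as regularizer.
    L1 = regularized_laplacian(A, A.sum(axis=1).mean())
    L2 = regularized_laplacian(L1, L1.sum(axis=1).mean())
    vals, vecs = np.linalg.eigh(L2)
    # Use more than K leading eigenvectors (largest |eigenvalue|), weighted by eigenvalue.
    top = np.argsort(-np.abs(vals))[:K + extra]
    X = vecs[:, top] * vals[top]
    # Row-normalize, a common step under degree-corrected block models.
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    return kmeans(X, K)

# Toy network: two 5-node cliques joined by a single edge between nodes 4 and 5.
A = np.zeros((10, 10))
A[:5, :5] = 1.0
A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[4, 5] = A[5, 4] = 1.0

labels = drsc_sketch(A, K=2)
print(labels)
```

On this toy network the two cliques play the role of the communities. Weighting by eigenvalue keeps the extra eigenvector from dominating the embedding, since its eigenvalue is small in magnitude relative to the first K.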

Advances in Data Analysis and Classification, 18(4), 1001-1043.
Citations: 0