Flexible mixture regression with the generalized hyperbolic distribution
Pub Date: 2023-01-04 | DOI: 10.1007/s11634-022-00532-4
Nam-Hwui Kim, Ryan P. Browne
When modeling the functional relationship between a response variable and covariates via linear regression, multiple relationships may be present depending on the underlying component structure. Deploying a flexible mixture distribution can help capture a wide variety of such structures, thereby successfully modeling the response–covariate relationship while accounting for the component structure. In that spirit, a mixture regression model based on the finite mixture of generalized hyperbolic distributions is introduced, and its parameter estimation method is presented. The flexibility of the generalized hyperbolic distribution can identify better-fitting components, which can lead to a more meaningful functional relationship between the response variable and the covariates. In addition, we introduce an iterative component-combining procedure to aid the interpretability of the model. The results from simulated and real data analyses indicate that our method offers a distinctive edge over some of the existing methods, and that it can generate useful insights on the data set at hand for further investigation.
{"title":"Flexible mixture regression with the generalized hyperbolic distribution","authors":"Nam-Hwui Kim, Ryan P. Browne","doi":"10.1007/s11634-022-00532-4","DOIUrl":"10.1007/s11634-022-00532-4","url":null,"abstract":"<div><p>When modeling the functional relationship between a response variable and covariates via linear regression, multiple relationships may be present depending on the underlying component structure. Deploying a flexible mixture distribution can help with capturing a wide variety of such structures, thereby successfully modeling the response–covariate relationship while addressing the components. In that spirit, a mixture regression model based on the finite mixture of generalized hyperbolic distributions is introduced, and its parameter estimation method is presented. The flexibility of the generalized hyperbolic distribution can identify better-fitting components, which can lead to a more meaningful functional relationship between the response variable and the covariates. In addition, we introduce an iterative component combining procedure to aid the interpretability of the model. The results from simulated and real data analyses indicate that our method offers a distinctive edge over some of the existing methods, and that it can generate useful insights on the data set at hand for further investigation.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 1","pages":"33 - 60"},"PeriodicalIF":1.4,"publicationDate":"2023-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82422675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sparse correspondence analysis for large contingency tables
Pub Date: 2023-01-02 | DOI: 10.1007/s11634-022-00531-5
Ruiping Liu, Ndeye Niang, Gilbert Saporta, Huiwen Wang
We propose sparse variants of correspondence analysis (CA) for large contingency tables such as the document-term matrices used in text mining. By seeking to obtain many zero coefficients, sparse CA remedies the difficulty of interpreting CA results when the table is large. Since CA is a doubly weighted PCA (for rows and columns) or a weighted generalized SVD, we adapt known sparse versions of these methods, with specific developments to obtain orthogonal solutions and to tune the sparseness parameters. We distinguish two cases depending on whether sparseness is required for both rows and columns, or for only one set.
{"title":"Sparse correspondence analysis for large contingency tables","authors":"Ruiping Liu, Ndeye Niang, Gilbert Saporta, Huiwen Wang","doi":"10.1007/s11634-022-00531-5","DOIUrl":"10.1007/s11634-022-00531-5","url":null,"abstract":"<div><p>We propose sparse variants of correspondence analysis (CA) for large contingency tables like documents-terms matrices used in text mining. By seeking to obtain many zero coefficients, sparse CA remedies to the difficulty of interpreting CA results when the size of the table is large. Since CA is a double weighted PCA (for rows and columns) or a weighted generalized SVD, we adapt known sparse versions of these methods with specific developments to obtain orthogonal solutions and to tune the sparseness parameters. We distinguish two cases depending on whether sparseness is asked for both rows and columns, or only for one set.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 4","pages":"1037 - 1056"},"PeriodicalIF":1.6,"publicationDate":"2023-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50003542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proximal methods for sparse optimal scoring and discriminant analysis
Pub Date: 2022-12-21 | DOI: 10.1007/s11634-022-00530-6
Summer Atkins, Gudmundur Einarsson, Line Clemmensen, Brendan Ames
Linear discriminant analysis (LDA) is a classical method for dimensionality reduction, where discriminant vectors are sought to project data to a lower dimensional space for optimal separability of classes. Several recent papers have outlined strategies, based on exploiting sparsity of the discriminant vectors, for performing LDA in the high-dimensional setting where the number of features exceeds the number of observations in the data. However, many of these proposed methods lack scalable algorithms for solving the underlying optimization problems. We consider an optimization scheme for solving the sparse optimal scoring formulation of LDA based on block coordinate descent. Each iteration of this algorithm requires an update of a scoring vector, which admits an analytic formula, and an update of the corresponding discriminant vector, which requires the solution of a convex subproblem; we propose several variants of this algorithm in which the proximal gradient method or the alternating direction method of multipliers is used to solve this subproblem. We show that the per-iteration cost of these methods scales linearly in the dimension of the data provided restricted regularization terms are employed, and cubically in the dimension of the data in the worst case. Furthermore, we establish that when this block coordinate descent framework generates convergent subsequences of iterates, these subsequences converge to stationary points of the sparse optimal scoring problem. We demonstrate the effectiveness of our new methods with empirical results for classification of Gaussian data and data sets drawn from benchmarking repositories, including time-series and multispectral X-ray data, and provide Matlab and R implementations of our optimization schemes.
{"title":"Proximal methods for sparse optimal scoring and discriminant analysis","authors":"Summer Atkins, Gudmundur Einarsson, Line Clemmensen, Brendan Ames","doi":"10.1007/s11634-022-00530-6","DOIUrl":"10.1007/s11634-022-00530-6","url":null,"abstract":"<div><p>Linear discriminant analysis (LDA) is a classical method for dimensionality reduction, where discriminant vectors are sought to project data to a lower dimensional space for optimal separability of classes. Several recent papers have outlined strategies, based on exploiting sparsity of the discriminant vectors, for performing LDA in the high-dimensional setting where the number of features exceeds the number of observations in the data. However, many of these proposed methods lack scalable methods for solution of the underlying optimization problems. We consider an optimization scheme for solving the sparse optimal scoring formulation of LDA based on block coordinate descent. Each iteration of this algorithm requires an update of a scoring vector, which admits an analytic formula, and an update of the corresponding discriminant vector, which requires solution of a convex subproblem; we will propose several variants of this algorithm where the proximal gradient method or the alternating direction method of multipliers is used to solve this subproblem. We show that the per-iteration cost of these methods scales linearly in the dimension of the data provided restricted regularization terms are employed, and cubically in the dimension of the data in the worst case. Furthermore, we establish that when this block coordinate descent framework generates convergent subsequences of iterates, then these subsequences converge to the stationary points of the sparse optimal scoring problem. We demonstrate the effectiveness of our new methods with empirical results for classification of Gaussian data and data sets drawn from benchmarking repositories, including time-series and multispectral X-ray data, and provide <span>Matlab</span> and <span>R</span> implementations of our optimization schemes.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 4","pages":"983 - 1036"},"PeriodicalIF":1.6,"publicationDate":"2022-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50502301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LASSO regularization within the LocalGLMnet architecture
Pub Date: 2022-12-13 | DOI: 10.1007/s11634-022-00529-z
Ronald Richman, Mario V. Wüthrich
Deep learning models have been very successful in the application of machine learning methods, often outperforming classical statistical models such as linear regression models or generalized linear models. On the other hand, deep learning models are often criticized for being neither explainable nor amenable to variable selection. There are two ways of dealing with this problem: either we use post-hoc model interpretability methods, or we design specific deep learning architectures that allow for easier interpretation and explanation. This paper builds on our previous work on the LocalGLMnet, an interpretable deep learning architecture. In the present paper, we show how group LASSO regularization (and other regularization schemes) can be implemented within the LocalGLMnet architecture so that we obtain feature sparsity for variable selection. We benchmark our approach against the recently developed LassoNet of Lemhadri et al. (LassoNet: a neural network with feature sparsity. J Mach Learn Res 22:1–29, 2021).
{"title":"LASSO regularization within the LocalGLMnet architecture","authors":"Ronald Richman, Mario V. Wüthrich","doi":"10.1007/s11634-022-00529-z","DOIUrl":"10.1007/s11634-022-00529-z","url":null,"abstract":"<div><p>Deep learning models have been very successful in the application of machine learning methods, often out-performing classical statistical models such as linear regression models or generalized linear models. On the other hand, deep learning models are often criticized for not being explainable nor allowing for variable selection. There are two different ways of dealing with this problem, either we use post-hoc model interpretability methods or we design specific deep learning architectures that allow for an easier interpretation and explanation. This paper builds on our previous work on the LocalGLMnet architecture that gives an interpretable deep learning architecture. In the present paper, we show how group LASSO regularization (and other regularization schemes) can be implemented within the LocalGLMnet architecture so that we receive feature sparsity for variable selection. We benchmark our approach with the recently developed LassoNet of Lemhadri et al. ( LassoNet: a neural network with feature sparsity. J Mach Learn Res 22:1–29, 2021).</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 4","pages":"951 - 981"},"PeriodicalIF":1.6,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50047295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A power-controlled reliability assessment for multi-class probabilistic classifiers
Pub Date: 2022-11-17 | DOI: 10.1007/s11634-022-00528-0
Hyukjun Gweon
In multi-class classification, the output of a probabilistic classifier is a probability distribution over the classes. In this work, we focus on a statistical assessment of the reliability of probabilistic classifiers for multi-class problems. Our approach generates a Pearson χ² statistic based on the k nearest neighbors in the prediction space. Further, we develop a Bayesian approach for estimating the expected power of the reliability test, which can be used to choose an appropriate sample size k. We propose a sampling algorithm and demonstrate that this algorithm obtains a valid prior distribution. The effectiveness of the proposed reliability test and expected power is evaluated through a simulation study. We also provide illustrative examples of the proposed methods with practical applications.
{"title":"A power-controlled reliability assessment for multi-class probabilistic classifiers","authors":"Hyukjun Gweon","doi":"10.1007/s11634-022-00528-0","DOIUrl":"10.1007/s11634-022-00528-0","url":null,"abstract":"<div><p>In multi-class classification, the output of a probabilistic classifier is a probability distribution of the classes. In this work, we focus on a statistical assessment of the reliability of probabilistic classifiers for multi-class problems. Our approach generates a Pearson <span>(chi ^2)</span> statistic based on the <i>k</i>-nearest-neighbors in the prediction space. Further, we develop a Bayesian approach for estimating the expected power of the reliability test that can be used for an appropriate sample size <i>k</i>. We propose a sampling algorithm and demonstrate that this algorithm obtains a valid prior distribution. The effectiveness of the proposed reliability test and expected power is evaluated through a simulation study. We also provide illustrative examples of the proposed methods with practical applications.\u0000</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 4","pages":"927 - 949"},"PeriodicalIF":1.6,"publicationDate":"2022-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50071056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A dual subspace parsimonious mixture of matrix normal distributions
Pub Date: 2022-11-16 | DOI: 10.1007/s11634-022-00526-2
Alex Sharp, Glen Chalatov, Ryan P. Browne
We present a parsimonious dual-subspace clustering approach for a mixture of matrix-normal distributions. By assuming certain principal components of the row and column covariance matrices are equally important, we express the model in fewer parameters without sacrificing discriminatory information. We derive update rules for an ECM algorithm and set forth necessary conditions to ensure identifiability. We use simulation to demonstrate parameter recovery, and we illustrate the parsimony and competitive performance of the model through two data analyses.
{"title":"A dual subspace parsimonious mixture of matrix normal distributions","authors":"Alex Sharp, Glen Chalatov, Ryan P. Browne","doi":"10.1007/s11634-022-00526-2","DOIUrl":"10.1007/s11634-022-00526-2","url":null,"abstract":"<div><p>We present a parsimonious dual-subspace clustering approach for a mixture of matrix-normal distributions. By assuming certain principal components of the row and column covariance matrices are equally important, we express the model in fewer parameters without sacrificing discriminatory information. We derive update rules for an ECM algorithm and set forth necessary conditions to ensure identifiability. We use simulation to demonstrate parameter recovery, and we illustrate the parsimony and competitive performance of the model through two data analyses.\u0000</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 3","pages":"801 - 822"},"PeriodicalIF":1.6,"publicationDate":"2022-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50032840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Monitoring photochemical pollutants based on symbolic interval-valued data analysis
Pub Date: 2022-11-12 | DOI: 10.1007/s11634-022-00527-1
Liang-Ching Lin, Meihui Guo, Sangyeol Lee
This study considers monitoring photochemical pollutants for anomaly detection based on symbolic interval-valued data analysis. For this task, we construct control charts based on the principal component scores of symbolic interval-valued data. Herein, the symbolic interval-valued data are assumed to follow a normal distribution, and an approximate expectation formula of order statistics from the normal distribution is used in the univariate case to estimate the mean and variance via the method of moments. In addition, we consider the bivariate case, wherein we use the maximum likelihood estimator calculated from the likelihood function derived under a bivariate copula. We also establish the procedures for the statistical control chart based on univariate and bivariate interval-valued variables, and the procedures are potentially extendable to higher-dimensional cases. Monte Carlo simulations and real data analysis using photochemical pollutants confirm the validity of the proposed method. In particular, the results show its superiority over the conventional method, which uses averages, in identifying the date on which the abnormal maximum occurred.
{"title":"Monitoring photochemical pollutants based on symbolic interval-valued data analysis","authors":"Liang-Ching Lin, Meihui Guo, Sangyeol Lee","doi":"10.1007/s11634-022-00527-1","DOIUrl":"10.1007/s11634-022-00527-1","url":null,"abstract":"<div><p>This study considers monitoring photochemical pollutants for anomaly detection based on symbolic interval-valued data analysis. For this task, we construct control charts based on the principal component scores of symbolic interval-valued data. Herein, the symbolic interval-valued data are assumed to follow a normal distribution, and an approximate expectation formula of order statistics from the normal distribution is used in the univariate case to estimate the mean and variance via the method of moments. In addition, we consider the bivariate case wherein we use the maximum likelihood estimator calculated from the likelihood function derived under a bivariate copula. We also establish the procedures for the statistical control chart based on the univariate and bivariate interval-valued variables, and the procedures are potentially extendable to higher dimensional cases. Monte Carlo simulations and real data analysis using photochemical pollutants confirm the validity of the proposed method. The results particularly show the superiority over the conventional method that uses the averages to identify the date on which the abnormal maximum occurred.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 4","pages":"897 - 926"},"PeriodicalIF":1.6,"publicationDate":"2022-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50045936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Editorial for ADAC issue 4 of volume 16 (2022)
Pub Date: 2022-10-31 | DOI: 10.1007/s11634-022-00525-3
Maurizio Vichi, Andrea Ceroli, Hans A. Kestler, Akinori Okada, Claus Weihs
{"title":"Editorial for ADAC issue 4 of volume 16 (2022)","authors":"Maurizio Vichi, Andrea Ceroli, Hans A. Kestler, Akinori Okada, Claus Weihs","doi":"10.1007/s11634-022-00525-3","DOIUrl":"10.1007/s11634-022-00525-3","url":null,"abstract":"","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"16 4","pages":"817 - 821"},"PeriodicalIF":1.6,"publicationDate":"2022-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50529237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Attraction-repulsion clustering: a way of promoting diversity linked to demographic parity in fair clustering
Pub Date: 2022-10-20 | DOI: 10.1007/s11634-022-00516-4
Eustasio del Barrio, Hristo Inouzhe, Jean-Michel Loubes
We consider the problem of diversity-enhancing clustering, i.e., developing clustering methods which produce clusters that favour diversity with respect to a set of protected attributes such as race, sex, age, etc. In the context of fair clustering, diversity plays a major role when fairness is understood as demographic parity. To promote diversity, we introduce perturbations to the distance in the unprotected attributes that account for protected attributes in a way that resembles the attraction-repulsion of charged particles in physics. These perturbations are defined through dissimilarities with a tractable interpretation. Cluster analysis based on attraction-repulsion dissimilarities penalizes homogeneity of the clusters with respect to the protected attributes and leads to an improvement in diversity. An advantage of our approach, which falls into a pre-processing set-up, is its compatibility with a wide variety of clustering methods and with non-Euclidean data. We illustrate the use of our procedures with both synthetic and real data and discuss the relation between diversity, fairness, and cluster structure.
{"title":"Attraction-repulsion clustering: a way of promoting diversity linked to demographic parity in fair clustering","authors":"Eustasio del Barrio, Hristo Inouzhe, Jean-Michel Loubes","doi":"10.1007/s11634-022-00516-4","DOIUrl":"10.1007/s11634-022-00516-4","url":null,"abstract":"<div><p>We consider the problem of <i>diversity enhancing clustering</i>, i.e, developing clustering methods which produce clusters that favour diversity with respect to a set of protected attributes such as race, sex, age, etc. In the context of <i>fair clustering</i>, diversity plays a major role when fairness is understood as demographic parity. To promote diversity, we introduce perturbations to the distance in the unprotected attributes that account for protected attributes in a way that resembles attraction-repulsion of charged particles in Physics. These perturbations are defined through dissimilarities with a tractable interpretation. Cluster analysis based on attraction-repulsion dissimilarities penalizes homogeneity of the clusters with respect to the protected attributes and leads to an improvement in diversity. An advantage of our approach, which falls into a pre-processing set-up, is its compatibility with a wide variety of clustering methods and whit non-Euclidean data. We illustrate the use of our procedures with both synthetic and real data and provide discussion about the relation between diversity, fairness, and cluster structure.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 4","pages":"859 - 896"},"PeriodicalIF":1.6,"publicationDate":"2022-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-022-00516-4.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50040006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A structured covariance ensemble for sufficient dimension reduction
Pub Date: 2022-10-19 | DOI: 10.1007/s11634-022-00524-4
Qin Wang, Yuan Xue
Sufficient dimension reduction (SDR) is a useful tool for high-dimensional data analysis. SDR aims at reducing the data dimensionality without loss of regression information between the response and its high-dimensional predictors. Many existing SDR methods are designed for data with continuous responses. Motivated by recent work on aggregate dimension reduction (Wang, Stat Sin 30:1027–1048, 2020), we propose a unified SDR framework for both continuous and binary responses through a structured covariance ensemble. The connection with existing approaches is discussed in detail, and an efficient algorithm is proposed. Numerical examples and a real data application demonstrate its satisfactory performance.
{"title":"A structured covariance ensemble for sufficient dimension reduction","authors":"Qin Wang, Yuan Xue","doi":"10.1007/s11634-022-00524-4","DOIUrl":"10.1007/s11634-022-00524-4","url":null,"abstract":"<div><p>Sufficient dimension reduction (SDR) is a useful tool for high-dimensional data analysis. SDR aims at reducing the data dimensionality without loss of regression information between the response and its high-dimensional predictors. Many existing SDR methods are designed for the data with continuous responses. Motivated by a recent work on aggregate dimension reduction (Wang in Stat Si 30:1027–1048, 2020), we propose a unified SDR framework for both continuous and binary responses through a structured covariance ensemble. The connection with existing approaches is discussed in details and an efficient algorithm is proposed. Numerical examples and a real data application demonstrate its satisfactory performance.\u0000</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 3","pages":"777 - 800"},"PeriodicalIF":1.6,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50497854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}