
Big Data Research: Latest Articles

A decentralized metaheuristic approach to feature selection inspired by social interactions within a societal framework, for handling datasets of diverse sizes
IF 4.2 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-28 Epub Date: 2025-08-22 DOI: 10.1016/j.bdr.2025.100556
Sobia Tariq Javed, Kashif Zafar, Irfan Younas
The rapid advancement of technology has led to the generation of big data. This vast and diverse data can uncover valuable patterns and yield promising results when effectively mined, processed, and analyzed. However, it also introduces the “curse of dimensionality,” which can degrade the performance of machine learning models. Feature Selection (FS) is a data preprocessing technique aimed at identifying the optimal feature set to enhance model efficiency and reduce processing time. Numerous metaheuristic wrapper-based FS techniques have been explored in the literature. However, a significant drawback of many of these algorithms is their dependence on centralized learning, where the global best solution drives the search direction. This centralized approach is risky, as any error by the global best can hinder the exploration and exploitation of other promising areas, making it harder to find the true global optimum. In this paper, a binary variant of the novel decentralized metaheuristic Kids Learning Optimization Algorithm (KLO), called the Binary Kids Learning Optimization Algorithm (BKLO), is proposed for optimal feature selection for classification in wrapper mode. The continuous solutions of KLO are converted to binary space using a transfer function. Two transfer functions are compared: the hyperbolic tangent (V-shaped) and the sigmoid (S-shaped). BKLO is compared with seven state-of-the-art algorithms. The performance of the algorithms is evaluated using several assessment indicators over fifteen benchmark datasets with a wide range of dimensions (small, medium, and large) from the University of California Irvine (UCI) repository and Arizona State University. Experiments and Friedman's Mean Rank (FMR) statistical tests show that BKLO outperforms the competing algorithms, reducing the number of features while increasing classification accuracy.
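To make the binarization step in the abstract above concrete, here is a minimal Python sketch of S-shaped and V-shaped transfer functions. The abstract does not give the update rules, so the two rules below (sample the bit for the S-shaped case, flip the bit for the V-shaped case) are assumptions borrowed from common practice in binary metaheuristics, and the function names are hypothetical rather than taken from the paper.

```python
import math
import random


def s_shaped(x: float) -> float:
    """Sigmoid transfer function; written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def v_shaped(x: float) -> float:
    """V-shaped transfer function |tanh(x)|; symmetric around zero."""
    return abs(math.tanh(x))


def binarize_s(position, rng):
    """Assumed S-shaped rule: set bit j to 1 with probability sigmoid(x_j)."""
    return [1 if rng.random() < s_shaped(x) else 0 for x in position]


def binarize_v(position, current_bits, rng):
    """Assumed V-shaped rule: flip bit j with probability |tanh(x_j)|."""
    return [
        1 - bit if rng.random() < v_shaped(x) else bit
        for x, bit in zip(position, current_bits)
    ]


# Each bit marks whether the corresponding feature is selected.
rng = random.Random(42)
continuous_solution = [2.5, -0.3, 0.0, -4.0]  # made-up continuous position
mask_s = binarize_s(continuous_solution, rng)
mask_v = binarize_v(continuous_solution, [1, 1, 0, 0], rng)
```

In a wrapper setting, the resulting bit mask would select the feature columns passed to the classifier whose accuracy scores the candidate solution.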
Big Data Research, vol. 41, Article 100556
Citations: 0
E-word of mouth in sales volume forecasting: Toyota Camry case study
IF 3.5 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-28 Epub Date: 2025-05-15 DOI: 10.1016/j.bdr.2025.100542
Domenica Fioredistella Iezzi, Roberto Monte
In recent years, electronic word of mouth has become a significant factor in purchasing decisions, with consumers' sentiments playing a crucial role in shaping the sales of products and services.
This paper introduces a novel approach to sales forecasting that addresses consumers' sentiments toward goods or services by combining the sales volume time series with a quantitative proxy of the unobservable true sentiment. Numerous studies have explored various methods to capture sentiment and accurately predict sales. We have integrated an estimated sentiment signal, variously built via lexicon-based, machine-learning, and deep-learning approaches, into a multivariate autoregressive state space (MARSS) model. We have tested our model on a dataset of 163,000 tweets about the Toyota Camry, covering the period from June 2009 to December 2022 and sales volumes in the US market over the same timeframe.
Big Data Research, vol. 41, Article 100542
Citations: 0
The narrative on tourism sustainability in Italian news: A text mining approach
IF 3.5 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-28 Epub Date: 2025-05-16 DOI: 10.1016/j.bdr.2025.100541
Carla Galluccio, Paola Beccherle, Alessandra Petrucci
Tourism sustainability is a complex and multidimensional construct for which there is no shared definition in the literature. Consequently, there is no standard method for measuring it, and the adoption of sustainable practices often falls short of the intended goals. Therefore, contributing to the definition of the concept of sustainable tourism is essential, both for policymakers and academics. In this vein, news media data can represent a key element through which to understand the debate about tourism sustainability. This research aims to exploit the potential of news texts to explore how sustainable tourism is conceived within specific cultural contexts. Focusing on the case study of Italy, we analysed how the concept of tourism sustainability is represented in Italian newspapers, extracting the topics discussed in relation to this theme. From a methodological point of view, we employed a network-based approach for topic extraction. Our study contributes to the literature on tourism sustainability by proposing an innovative method for extracting information from unstructured data sources, such as textual data, providing policymakers with insights about the narrative around this topic.
Big Data Research, vol. 41, Article 100541
Citations: 0
Development of an integrated data system for regional tourism analysis in Italy: A microdata perspective
IF 3.5 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-28 Epub Date: 2025-06-07 DOI: 10.1016/j.bdr.2025.100550
Samuele Cesarini, Fabrizio Antolini, Ivan Terraglia
This paper presents the development of an integrated data system tailored for the Italian regions, combining microdata from the Bank of Italy's and ISTAT's surveys. These datasets offer an in-depth analysis of both domestic and international aspects of tourism, framed within the theoretical context of the tourism determinants. By merging this integrated dataset with additional data from other statistical sources, this study offers a queryable relational database enabling granular regional analysis. Currently, tourism statistics in Italy are fragmented and do not provide a unified picture of tourism in its many aspects. The relational model's interoperability addresses Italy's fragmented tourism data landscape, and its data definition language represents an important step towards the creation of a unified tourism archive. Micro-data allows for different statistical analyses than those usually carried out with aggregated data, increasing knowledge of the dynamics of the sector.
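As a rough illustration of what a queryable relational database for regional analysis can look like, the sketch below builds a tiny in-memory SQLite database from Python. The abstract does not publish the schema, so every table name, column name, and data row here is hypothetical and made up purely for demonstration; none of it comes from the Bank of Italy or ISTAT surveys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition: one table of regions, one table of trip-level microdata.
conn.executescript(
    """
    CREATE TABLE region (
        region_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE trip (
        trip_id   INTEGER PRIMARY KEY,
        region_id INTEGER NOT NULL REFERENCES region(region_id),
        source    TEXT NOT NULL,     -- hypothetical label for the originating survey
        nights    INTEGER NOT NULL,
        spend_eur REAL NOT NULL
    );
    """
)

# Made-up rows, for illustration only.
conn.executemany("INSERT INTO region VALUES (?, ?)", [(1, "Toscana"), (2, "Abruzzo")])
conn.executemany(
    "INSERT INTO trip (region_id, source, nights, spend_eur) VALUES (?, ?, ?, ?)",
    [
        (1, "domestic", 3, 420.0),
        (1, "international", 5, 900.0),
        (2, "domestic", 2, 180.0),
    ],
)

# Regional aggregates computed directly from the microdata.
rows = conn.execute(
    """
    SELECT r.name, COUNT(*) AS trips, SUM(t.nights) AS nights, AVG(t.spend_eur) AS avg_spend
    FROM trip AS t
    JOIN region AS r USING (region_id)
    GROUP BY r.name
    ORDER BY r.name
    """
).fetchall()
```

Keeping trip-level records and aggregating at query time is what lets the same store answer questions at different granularities, which is the advantage of microdata that the abstract points to.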
Big Data Research, vol. 41, Article 100550
Citations: 0
Bankruptcy risk prediction: A new approach based on compositional analysis of financial statements
IF 3.5 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-28 Epub Date: 2025-05-23 DOI: 10.1016/j.bdr.2025.100537
Alessandro Magrini
The development of models for bankruptcy risk prediction has gained much attention in recent years due to the great availability of financial statement data. Most existing predictive models rely on financial ratios, which are performance-based measures expressing the relative magnitude of two accounting items. Despite the popularity of financial ratios, their use is notoriously accompanied by serious practical drawbacks, like the occurrence of outliers and redundancy, making data preprocessing necessary to avoid computational problems and obtain good predictive accuracy. Isometric log ratios can potentially overcome these problems because they are designed to represent compositional data efficiently and have a logarithmic form that limits the occurrence of outliers. However, although they are not novel in the analysis of financial statements, no study has ever employed them to predict bankruptcy. In this article, we show the effectiveness of isometric log ratios to detect bankruptcy events in a sample of 138,720 Italian firms (127,420 active and 11,300 bankrupted) belonging to different industries and of different sizes and ages. For this purpose, we use logistic regression with adaptive LASSO regularization and random forests to construct several predictive models featuring either financial ratios or isometric log ratios, and combining different horizons and lag structures. The results show that a set of 8 isometric log ratios provides, without preprocessing, almost the same predictive accuracy as a selection of 16 financial ratios that requires dropping 3.6% of the data. Also, the adaptive LASSO regularization reveals that redundancy for isometric log ratios is always below 20%, and in some cases near 0%, while it ranges from 12.5% to 46.9% for financial ratios. The predictive accuracy of models based on logistic regression is in line with, or even higher than, that reported by recent studies, and random forests achieve a gain in the area under the Receiver Operating Characteristic (ROC) curve ranging between two and three percentage points.
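For readers unfamiliar with isometric log ratios, the following Python sketch computes one standard set of ILR coordinates (pivot coordinates) for a composition of strictly positive parts. The abstract does not say which ILR basis or which accounting items the article uses, so the choice of pivot coordinates, the function name `ilr_pivot`, and the example figures are all assumptions made for illustration.

```python
import math


def ilr_pivot(parts):
    """Return the D-1 pivot (ILR) coordinates of a composition with D positive parts.

    Coordinate i compares part i with the geometric mean of the parts after it:
        z_i = sqrt(k / (k + 1)) * ln(x_i / gmean(x_{i+1}, ..., x_D)),  k = D - i
    """
    if any(p <= 0 for p in parts):
        raise ValueError("all parts must be strictly positive")
    logs = [math.log(p) for p in parts]
    coords = []
    for i in range(len(logs) - 1):
        rest = logs[i + 1:]
        k = len(rest)                     # number of parts after position i
        mean_log_rest = sum(rest) / k     # log of their geometric mean
        coords.append(math.sqrt(k / (k + 1)) * (logs[i] - mean_log_rest))
    return coords


# Made-up balance-sheet items (same currency unit), purely illustrative.
items = [120.0, 45.0, 30.0, 5.0]
z = ilr_pivot(items)
z_rescaled = ilr_pivot([10.0 * v for v in items])  # same composition, different scale
```

Because every coordinate is a log contrast, rescaling all items by the same factor leaves `z` unchanged, and the logarithm compresses extreme values, which is consistent with the abstract's remark that the logarithmic form limits outliers.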
Big Data Research, vol. 41, Article 100537
Citations: 0
A multiple-group hidden Markov model for multi-source data. Cross-country differences in employment mobility in the presence of measurement error
IF 3.5 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-28 Epub Date: 2025-05-19 DOI: 10.1016/j.bdr.2025.100527
Roberta Varriale, Mauricio Garnier-Villarreal, Dimitris Pavlopoulos, Danila Filipponi
In this paper, we develop a multigroup hidden Markov model to tackle the issue of measurement error in multi-source data from different countries. We focus, in particular, on the measurement of employment mobility in the Netherlands and Italy using linked data from the Labour Force Survey and administrative sources. The measurement-error correction we apply reconciles differences between data sources and shows that cross-country differences in employment mobility are smaller than originally thought. Error-corrected estimates indicate that mobility from temporary to permanent employment has become, over time, larger in Italy than in the Netherlands, while mobility from non-employment to temporary employment has steadily been higher in the Netherlands than in Italy.
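The idea of a hidden Markov model with several error-prone sources can be sketched with the forward algorithm: a latent "true" employment state evolves over time, and each source reports it through its own misclassification matrix. This is a generic illustration, not the authors' specification. The assumption that sources are conditionally independent given the true state, the function name, and all probabilities below are hypothetical; a multiple-group version would simply use a separate parameter set per country.

```python
def forward_likelihood(initial, transition, emissions, observations):
    """Likelihood of an observed sequence under a hidden Markov model.

    initial[s]         : P(true state s at the first time point)
    transition[s][r]   : P(true state r at t+1 | true state s at t)
    emissions[k][s][o] : P(source k reports category o | true state s)
    observations[t][k] : category reported by source k at time t
    """
    n_states = len(initial)

    def obs_prob(state, obs_t):
        # Assumed: sources are conditionally independent given the true state.
        p = 1.0
        for k, reported in enumerate(obs_t):
            p *= emissions[k][state][reported]
        return p

    alpha = [initial[s] * obs_prob(s, observations[0]) for s in range(n_states)]
    for obs_t in observations[1:]:
        alpha = [
            sum(alpha[s] * transition[s][r] for s in range(n_states)) * obs_prob(r, obs_t)
            for r in range(n_states)
        ]
    return sum(alpha)


# Made-up parameters: state 0 = temporary contract, state 1 = permanent contract.
initial = [0.6, 0.4]
transition = [[0.9, 0.1], [0.2, 0.8]]
survey = [[0.95, 0.05], [0.10, 0.90]]    # hypothetical survey misclassification
register = [[0.99, 0.01], [0.03, 0.97]]  # hypothetical register misclassification

# Two time points; the sources disagree at the second one.
likelihood = forward_likelihood(
    initial, transition, [survey, register], [(0, 0), (1, 0)]
)
```

Estimating the transition matrix jointly with the misclassification matrices is what separates genuine mobility from apparent transitions that are only reporting errors.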
Big Data Research, vol. 41, Article 100527
Citations: 0
BETM: A new pre-trained BERT-guided embedding-based topic model
IF 3.5 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-28 Epub Date: 2025-06-06 DOI: 10.1016/j.bdr.2025.100551
Yang Liu, Xiaotang Zhou, Zhenwei Zhang, Xiran Yang
The application of topic models and pre-trained BERT is becoming increasingly widespread in Natural Language Processing (NLP), but there is no standard method for combining them. In this paper, we propose a new pre-trained BERT-guided Embedding-based Topic Model (BETM). Through constraints on the topic-word and document-topic distributions, BETM can learn semantic, syntactic, and topic information from BERT embeddings. In addition, we design two solutions to address the problem of insufficient contextual information caused by short input and the issue of semantic truncation caused by long input in BETM. We find that the word embeddings of BETM are more suitable for topic modeling than pre-trained GloVe word embeddings, and that BETM can flexibly select different variants of pre-trained BERT for specific datasets to obtain better topic quality. We also find that BETM handles large, heavy-tailed vocabularies well, even when they contain stop words. BETM achieved state-of-the-art (SOTA) results on several benchmark datasets: Yelp Review Polarity (106,586 samples), Wiki Text 103 (71,533 samples), Open-Web-Text (35,713 samples), 20Newsgroups (10,899 samples), and AG-news (127,588 samples).
Big Data Research, vol. 41, Article 100551
Citations: 0
Research on modeling of the imbalanced fraudulent transaction detection problem based on embedding-aware conditional GAN
IF 4.2 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-28 Epub Date: 2025-08-13 DOI: 10.1016/j.bdr.2025.100557
Luping Zhi, Wanmin Wang
Detecting fraudulent transactions in structured financial data presents significant challenges due to multimodal, non-Gaussian continuous variables, mixed-type features, and severe class imbalance. To address these issues, we propose an Embedding-Aware Conditional Generative Adversarial Network (EAC-GAN), which incorporates trainable label embeddings into both the generator and discriminator to enable semantically controlled synthesis of minority-class samples. In addition to adversarial training, EAC-GAN introduces an auxiliary classification objective, forming a joint optimization strategy that improves the fidelity and class consistency of generated data, especially for underrepresented classes. Experiments conducted on a real-world credit card dataset demonstrate that EAC-GAN achieves stable convergence even with limited labeled data. When combined with LightGBM classifiers, the synthetic samples generated by EAC-GAN significantly enhance fraud detection performance, yielding a precision of 96.8%, an AUC of 96.38%, an AUPRC of 83.89%, and an MCC of 88.94%. Furthermore, dimensionality reduction using Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) reveals that the generated samples closely align with the real data distribution and exhibit clear class separability in the latent space. These results underscore the effectiveness of EAC-GAN in synthesizing high-quality minority-class samples and improving downstream fraud detection, outperforming traditional oversampling techniques and baseline generative models.
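Two of the metrics quoted in the abstract above, precision and the Matthews correlation coefficient (MCC), can be computed directly from a binary confusion matrix. The sketch below uses the standard textbook definitions; the confusion-matrix counts are invented for illustration and are not the paper's results.

```python
import math


def precision(tp: int, fp: int) -> float:
    """Share of flagged transactions that are truly fraudulent."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up counts for a heavily imbalanced test set (fraud is the positive class).
tp, fp, fn, tn = 80, 5, 20, 9895
accuracy = (tp + tn) / (tp + fp + fn + tn)
fraud_precision = precision(tp, fp)
fraud_mcc = mcc(tp, fp, fn, tn)
```

With counts like these, accuracy is dominated by the majority class, whereas MCC uses all four cells of the confusion matrix, which is one reason it is commonly reported for imbalanced problems such as fraud detection.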
{"title":"Research on modeling of the imbalanced fraudulent transaction detection problem based on embedding-aware conditional GAN","authors":"Luping Zhi ,&nbsp;Wanmin Wang","doi":"10.1016/j.bdr.2025.100557","DOIUrl":"10.1016/j.bdr.2025.100557","url":null,"abstract":"<div><div>Detecting fraudulent transactions in structured financial data presents significant challenges due to multimodal, non-Gaussian continuous variables, mixed-type features, and severe class imbalance. To address these issues, we propose an Embedding-Aware Conditional Generative Adversarial Network (EAC-GAN), which incorporates trainable label embeddings into both the generator and discriminator to enable semantically controlled synthesis of minority-class samples. In addition to adversarial training, EAC-GAN introduces an auxiliary classification objective, forming a joint optimization strategy that improves the fidelity and class consistency of generated data, especially for underrepresented classes. Experiments conducted on a real-world credit card dataset demonstrate that EAC-GAN achieves stable convergence even with limited labeled data. When combined with LightGBM classifiers, the synthetic samples generated by EAC-GAN significantly enhance fraud detection performance, yielding a precision of 96.8%, an AUC of 96.38%, an AUPRC of 83.89%, and an MCC of 88.94%. Furthermore, dimensionality reduction using Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) reveals that the generated samples closely align with the real data distribution and exhibit clear class separability in the latent space. 
These results underscore the effectiveness of EAC-GAN in synthesizing high-quality minority-class samples and improving downstream fraud detection, outperforming traditional oversampling techniques and baseline generative models.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"41 ","pages":"Article 100557"},"PeriodicalIF":4.2,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144863265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Explainable malware detection through integrated graph reduction and learning techniques
IF 4.2 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-28 Epub Date: 2025-08-19 DOI: 10.1016/j.bdr.2025.100555
Hesamodin Mohammadian, Griffin Higgins, Samuel Ansong, Roozbeh Razavi-Far, Ali A. Ghorbani
Recently, Control Flow Graphs and Function Call Graphs have gained attention in malware detection tasks due to their ability to represent the complex structural and functional behavior of programs. To better utilize these representations in malware detection and improve detection performance, they have been paired with Graph Neural Networks (GNNs). However, the sheer size and complexity of these graph representations pose a significant challenge for researchers. At the same time, the simple binary classification provided by GNN models is insufficient for malware analysts. To address these challenges, this paper integrates novel graph reduction techniques and GNN explainability into a malware detection framework to enhance both efficiency and interpretability. Through our extensive evaluation, we demonstrate that the proposed graph reduction technique significantly reduces the size and complexity of the input graphs while maintaining detection performance. Furthermore, the important subgraphs extracted using GNNExplainer provide better insights into the model's decisions and help security experts with their further analysis.
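The abstract does not specify which reduction rules the authors use. As a minimal sketch of what a graph reduction step on a control flow graph can look like, the example below merges linear chains: a node with exactly one successor absorbs that successor when the successor has no other predecessor. This rule, the function name `merge_linear_chains`, and the toy graph are assumptions made for illustration, not the paper's technique.

```python
def merge_linear_chains(succ):
    """Collapse straight-line chains in a directed graph.

    `succ` maps each node to a list of its successors. A node u absorbs its
    only successor v when v has no predecessor other than u. This is an
    illustrative reduction rule, not the one proposed in the paper.
    """
    # Work on a copy and make sure every node has an entry.
    graph = {u: list(vs) for u, vs in succ.items()}
    for vs in succ.values():
        for v in vs:
            graph.setdefault(v, [])

    changed = True
    while changed:
        changed = False
        # Recompute predecessor lists for the current graph.
        preds = {u: [] for u in graph}
        for u, vs in graph.items():
            for v in vs:
                preds[v].append(u)
        for u, vs in list(graph.items()):
            if len(vs) == 1:
                v = vs[0]
                if v != u and len(preds[v]) == 1:
                    # u inherits v's outgoing edges; v disappears.
                    graph[u] = graph.pop(v)
                    changed = True
                    break  # predecessor lists are stale, so recompute
    return graph


# Toy control flow graph (hypothetical): a -> b -> c, then a branch that rejoins at f.
cfg = {"a": ["b"], "b": ["c"], "c": ["d", "e"], "d": ["f"], "e": ["f"], "f": []}
reduced = merge_linear_chains(cfg)
print(len(cfg), "nodes before,", len(reduced), "nodes after")
print(reduced)
```

The branch and join structure survives the reduction while the straight-line prefix collapses into a single node, which is the kind of size reduction that keeps the information a downstream GNN would need.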
Big Data Research, vol. 41, Article 100555.
Citations: 0
NGLinker: Link prediction for node featureless networks
IF 4.2 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-28 Epub Date: 2025-08-18 DOI: 10.1016/j.bdr.2025.100558
Yong Li, Jingpeng Wu, Zhongying Zhang
Link prediction is a paradigmatic problem in network science with numerous real-world applications; it aims to infer missing or future links from the currently observed partial nodes and links. However, conventional link prediction models are based on network structure, have relatively low prediction accuracy, and lack universality and scalability. The performance of link prediction based on machine learning and handcrafted features is strongly influenced by subjective judgment. Although graph embedding learning (GEL) models can avoid these shortcomings, they still pose some challenges. Because GEL models are generally based on random walks and graph neural networks (GNNs), their prediction accuracy is relatively low, making them unsuitable for revealing hidden information in node featureless networks. To address these challenges, we present NGLinker, a new link prediction model based on Node2vec and GraphSage, which can reconcile performance and accuracy in a node featureless network. Rather than learning node features with label information, NGLinker depends only on the local network structure. Quantitatively, we observe superior prediction accuracy of NGLinker and lab test imputations compared to state-of-the-art models, which strongly supports that using NGLinker for prediction on three public networks and one private network is feasible and effective. NGLinker not only achieves prediction accuracy in terms of precision and area under the receiver operating characteristic curve (AUC) but also offers strong universality and scalability. The NGLinker model extends the application of GNNs to node featureless networks.
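To make the evaluation setting concrete, the sketch below scores candidate node pairs using only network structure and measures AUC as the probability that a held-out true link outscores a non-link. It uses the classic common-neighbors heuristic as a stand-in scorer; it is not NGLinker, and the toy graph, the held-out pairs, and all function names are assumptions for illustration.

```python
from itertools import product


def build_adjacency(edges):
    """Undirected adjacency sets built from an edge list; no node features needed."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbors(adj, u, v):
    """Structural similarity score: number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])


def auc(pos_scores, neg_scores):
    """Probability that a positive pair scores above a negative pair (ties count 0.5)."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical toy network: the observed (training) edges.
observed = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
adj = build_adjacency(observed)

# Hypothetical held-out true links and sampled non-links.
held_out_links = [(0, 3), (3, 5)]
non_links = [(0, 5), (1, 5), (2, 4)]

pos = [common_neighbors(adj, u, v) for u, v in held_out_links]
neg = [common_neighbors(adj, u, v) for u, v in non_links]
print("positive scores:", pos)
print("negative scores:", neg)
print("AUC: %.4f" % auc(pos, neg))
```

An embedding-based model would replace `common_neighbors` with a score derived from learned node vectors, while the held-out protocol and the AUC computation stay the same.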
Big Data Research, vol. 41, Article 100558.
Citations: 0